Mapping Search Queries to GeoLibrary Metadata Fields

Geo4Lib Camp 2021



Phil White & Erik Radio

Feb. 10, 2021

outpw.github.io/slides/geometa.html

Phil White

Earth, Environment & Geospatial Librarian

Philip.White@Colorado.EDU


Erik Radio

Metadata Librarian

Erik.Radio@Colorado.EDU


Outline


  1. Background
  2. Review
  3. Methods
  4. Results
  5. Discussion/Analysis

Background

Colorado GeoLibrary


  • Proposed way back in 2017
  • Launched in 2019 (geo.colorado.edu)
  • Provides access & discovery of Colorado GIS data

How are people using the GeoLibrary?


What sort of search terms are they using?


Do people search for subjects? Places? Other?


Do those terms match subjects & placenames?

Research Questions:


  1. Which query terms mapped most frequently to which metadata fields?

  2. Do certain types of query tend to map to particular fields?

Answering these questions could improve search performance and UX.

Review

GeoBlacklight Repositories and Metadata Evaluation

  • GBL metadata lacks unified approach across institutions (Batista et al., 2017).

  • GBL UX study found disambiguation problems in both subjects & places (Blake et al., 2017).

  • Blake et al. also found users rely most on the description field.

  • Few query log studies exist; some have found organizational, data type, or publication queries to be most common (Schindler et al., 2019).

Query Log Analysis and Metadata

  • 47% match between LC subjects and queries (Carlyle, 1989)

  • 80% of all queries are short (Park and Lee, 2014)

Methods

Data

Query Types

| Type | Description | Example |
|---|---|---|
| Datatype | type of GIS data | Basemap, contours |
| Format | file type | Geotiff, shapefile |
| Locational | general type of place | Campsites, buildings |
| Place name | specific place | Continental Divide, Colorado Springs |
| Organization | corporate entity | Colorado DOT, Census Bureau |
| Person | human being | John Doe |
| Publication | issuance of a particular resource | Census tracts, Bureau of Land Management roads |
| Topical | subject of interest | Agriculture, aliens |
| Unknown | ? | 2000, trib |

Methods

  • A Python script counted the number of times each query matched a metadata field

  • A second Python script tallied each time a query of a given type matched a field

  • Code repository is on GitHub
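The two counting steps above can be sketched roughly as follows. This is a minimal illustration, not the scripts from the repository: the sample records, the sample queries, the case-insensitive substring matching rule, and the function names `matching_fields`, `count_field_matches`, and `tally_by_type` are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical sample records; field names follow the dc.* notation used in these slides.
RECORDS = [
    {
        "dc.title": "Colorado Wildfire Perimeters",
        "dc.description": "Wildfire boundaries for Colorado.",
        "dc.subject": "Environment",
    },
    {
        "dc.title": "Boulder County Roads",
        "dc.description": "Road centerlines in Boulder County.",
        "dc.subject": "Transportation",
    },
]

# Hypothetical queries, each paired with a hand-assigned query type.
QUERIES = [
    ("wildfire", "topical"),
    ("boulder", "place name"),
    ("roads", "locational"),
]


def matching_fields(query, record):
    """Return the fields of a record whose text contains the query (assumed rule: case-insensitive substring)."""
    q = query.lower()
    return [field for field, value in record.items() if q in value.lower()]


def count_field_matches(queries, records):
    """Step 1: count how often any query matches each metadata field."""
    counts = Counter()
    for query, _qtype in queries:
        for record in records:
            counts.update(matching_fields(query, record))
    return counts


def tally_by_type(queries, records):
    """Step 2: count matches per metadata field, grouped by query type."""
    tallies = defaultdict(Counter)
    for query, qtype in queries:
        for record in records:
            tallies[qtype].update(matching_fields(query, record))
    return tallies


field_counts = count_field_matches(QUERIES, RECORDS)
type_tallies = tally_by_type(QUERIES, RECORDS)
```

Substring matching is the simplest possible rule; a real search index tokenizes and stems text, so counts from a sketch like this would only approximate what the search engine actually matches.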

Results

Most popular queries (descending order):


  1. topical

  2. place name

  3. locational

Results

  • Most matches were found in dc.description
  • Topical queries had many matches in dc.title, far fewer in dc.subject
  • Place name queries matched frequently in dc.publisher and dc.creator
  • Locational queries matched frequently in dc.title

Discussion

Analysis

  • Subjects not as useful as anticipated
  • Synonyms have potential for increasing matches
  • Best fields for matching remain dc.description and dc.title

Challenges

  • Datasets have unique descriptive requirements
  • Large gap exists between metadata professionals and data creators

Implications

  • ~20 institutions use GeoBlacklight, and most share metadata via the OpenGeoMetadata repository

  • Many "false positive" results occur because "Colorado" appears in so many metadata fields; state institutions should not index "provenance".

  • Metadata creators should include rich description and title for improved discovery

  • This analysis is easily replicable for other GeoBlacklight repositories.

  • We would like to see if others have similar results. Would results be different based on different metadata practices?
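The "false positive" point above can be illustrated with a toy example. This is only a sketch under stated assumptions: the two records, the `provenance` field name, and the `search` helper are hypothetical, and matching is a plain case-insensitive substring test rather than a real search index.

```python
# Hypothetical records from a state institution: every record carries "Colorado" as provenance.
RECORDS = [
    {
        "dc.title": "Denver Bike Lanes",
        "dc.description": "Bicycle lanes in Denver.",
        "provenance": "Colorado",
    },
    {
        "dc.title": "Colorado River Basin",
        "dc.description": "Watershed boundary of the Colorado River.",
        "provenance": "Colorado",
    },
]


def search(query, records, fields):
    """Return titles of records where the query appears in any of the searched fields."""
    q = query.lower()
    return [
        record["dc.title"]
        for record in records
        if any(q in record.get(field, "").lower() for field in fields)
    ]


# Searching every field returns all records, because provenance always matches.
with_provenance = search("colorado", RECORDS, ["dc.title", "dc.description", "provenance"])

# Leaving provenance out of the searched fields keeps only the record that is about Colorado.
without_provenance = search("colorado", RECORDS, ["dc.title", "dc.description"])
```

With provenance searched, the Denver record is returned for "colorado" even though nothing in its title or description mentions the state; dropping the field removes that hit.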

Fin!


Thanks everyone!

Question time.