Of Historical Mapathons, Seeds, and Graphs

“To make a digital historical gazetteer, it would help to have a digital historical gazetteer.” — anonymous

The World-Historical Gazetteer project (WHG) is soliciting contributions of place data (attestations of places in historical sources) from any region and period, in any quantity, in order to link them in a “union index” and thereby link the research that discovered them. Almost all of the historical sources are texts or tabular datasets, and although text sources often include descriptions of relative location (e.g. in a province or near a river), rarely does either kind of source include geographical coordinates. Considering that one of the main reasons we record place names in such sources is to map them or further analyze and compare them, this presents a problem.

A Problem

If we look up the names in the global modern place authority sources like GeoNames, Getty Thesaurus of Geographic Names (TGN), DBpedia, or Wikidata (a process commonly called “reconciliation”) we obtain generally poor results. Many historical names are no longer in use, many refer to multiple places, and many potential matches get lost in the shuffle due to varying transliteration schemes, alternate spellings, and OCR transcription errors.
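To make the spelling and transliteration problem concrete, here is a minimal sketch of name comparison in Python. It is not how WHG or any of the authorities named above actually match names; the function names and the 0.8 threshold are assumptions chosen for illustration, and only the standard library is used.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Strip diacritics and case so spelling variants compare more fairly."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold().strip()


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def candidates(name, authority_names, threshold=0.8):
    """Return (authority_name, score) pairs at or above the threshold, best first."""
    scored = [(other, similarity(name, other)) for other in authority_names]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Even a tolerant comparison like this only helps when the historical name still resembles a modern one. It does nothing for names that have fallen out of use, and it makes the "many places share one name" problem worse rather than better.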

A typical scenario we encounter is that a sizable majority of places referenced in a historic text or corpus remain un-located after reconciliation against modern authorities and are therefore not mappable. Granted, mapping is not always the point, but it is often an essential goal. Even if it isn’t, some graphical or otherwise computable representation of the spatio-temporality in a historical source usually is.

We are coming to realize there are some steps we as a community can take to improve this situation: “historical mapathons,” “prioritizing seed datasets,” and “geographical graphs.”

Historical Mapathons

To quote Wikipedia, “a mapathon is a coordinated mapping event.” Until now they have almost always involved adding features to OpenStreetMap in an area for which they are relatively sparse, often in response to a disaster. Mapathons might occur in a single room, where some guidance (and/or pizza) is provided to participants, or “virtually” – where anyone across the globe makes contributions using web-based software like the iD map editor.

A historical mapathon is a coordinated mapping event where the activity is “feature extraction” for one or more historical maps. That is, tracing features as point, line, or area geometries, along with the associated place name and potentially other attributes the cartography offers. This activity has been performed routinely by creators of historical GISes and others, but we’re aware of only one web-based group “crowd-sourced” virtual mapathon – GB1900, the self-described “name transcription” project of University of Portsmouth in 2017-18 (results). Over a period of many months, volunteers transcribed over 2.5 million text strings represented on Ordnance Survey six inch to the mile County Series maps published between 1888 and 1914. The result is a dataset that should prove immensely valuable to historians of Britain in that period. We can assume that if the GB1900 data is indexed by WHG (our intention), future efforts to map texts of the period should be improved greatly. This is one example of the “seed” principle discussed below.

Seed Datasets

Thanks to the Pleiades project, “a community-built gazetteer and graph of ancient places,” researchers wishing to study the geographies of texts of and about the ancient Mediterranean can locate a very high proportion of place references found in their sources. The Pelagios Commons’ Peripleo project has used Pleiades data as a “seed” in a growing index of place attestations. Records of places within the index are continually augmented with further attestations of places contributed by others. Over time the index has been expanding, spatially and temporally. WHG is following on from Peripleo, extending its aims in a few ways and offering unlimited spatial and temporal coverage. Therefore, seed datasets for particular regions and periods are highly desirable for us.

A seed dataset can take a few forms and might come from a few directions: 1) a repository of attestations laboriously curated by a group of scholars over time (e.g. Pleiades); 2) a historical geographic information system (HGIS), also laboriously developed by one or more scholars and derived from cited primary and secondary sources (e.g. China HGIS, HGIS de las Indias); 3) a historical mapathon.

An Example

Two datasets at the top of our long queue of pending contributions are Werner Stangl’s HGIS de las Indias, and “the Alcedo gazetteer” [1]. Each is fairly comprehensive for 18th century Latin America (~15k and ~18k records respectively). Alcedo happens to be one of over 200 sources for HGIS de las Indias, and there is considerable overlap in coverage. The LatAm Gazetteer project, recently initiated under a Pelagios Commons micro-grant, developed digital text versions of Alcedo and its English translation from scanned images, and from those, a dataset of headwords and place types. At that stage, effective reconciliation with modern gazetteers was impossible: for example, Getty TGN has eight distinct Acapulco listings in modern-day México. The original entry text reads, “situada en la Costa de la Mar del S,” which could narrow the possibilities considerably, but there is no ready way to send that phrase as actionable context when searching TGN.

LatAm research developer Nidia Hernández was then able to match 60% of the Alcedo entries to an Indias HGIS entry, and because that dataset includes the containing districts, provinces, and countries (with geometry), we can record those topological relations and ultimately improve results of the reconciliation to Getty TGN. Still, we have thousands of records that are locatable only by painstaking reading of the original entry, which might still place them only in relation to entities that no longer exist or whose names have changed. A mess! And commonplace.
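One way containing regions can narrow a set of same-named candidates is sketched below. This is an illustration only: the record layout, the `coords` key, and the use of a simple bounding box (rather than the actual province geometry) are all assumptions, not a description of the LatAm Gazetteer workflow.

```python
def in_bbox(point, bbox):
    """True if a (lon, lat) point falls inside a (west, south, east, north) box."""
    lon, lat = point
    west, south, east, north = bbox
    return west <= lon <= east and south <= lat <= north


def filter_by_container(candidates, container_bbox):
    """Keep only the candidates located inside the containing region's box.

    `candidates` is a list of dicts with a hypothetical "coords" key
    holding a (lon, lat) tuple.
    """
    return [c for c in candidates if in_bbox(c["coords"], container_bbox)]
```

A real implementation would test against the district or province polygon itself, but even a coarse box can discard most of several same-named listings.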

What we can take away from this is a) it helps to have a large authoritative “seed” for any given region and period – like HGIS de las Indias in this case; and b) in any case, it would help immeasurably to have data from historical maps of the period in place before working with texts — data developed in historical mapathons, that is. Maps provide approximate geometry and a hierarchy of “within” relationships, making reconciliation to modern gazetteers easier, and a “nice to have” option rather than essential.

Realizing a Historical Mapathon

Extracting (tracing) place data from old maps can be tedious and time consuming, so it’s best if a) highly motivated groups do it; b) it’s limited to a few key maps per group; and c) tools are available to make it as easy as possible. The steps involved are:

  1. Choose a few maps having the desired coverage and a viable license (e.g. using Old Maps Online or the David Rumsey Map Collection).
  2. Decide on an encoding strategy: what to digitize and whether to geo-rectify each map to a modern map. This will vary according to the group’s purposes and the individual maps’ cartography. If the distortion is not too extreme, digitizing features will produce estimated geometries that may be of value.
  3. If indicated, geo-rectify maps using desktop GIS (QGIS, ArcMap) or a web-based tool like MapWarper or Georeferencer (built in to the Rumsey site or standalone). This important step is best done by someone with experience at it or willingness to master it.
  4. Have group members use an online tool to view the map(s) (overlaid on a “real-world map” if geo-rectified) and create point, line, and polygon features according to the encoding strategy of Step 2. Saved data can be downloaded and mapped at any stage, and when complete, uploaded to WHG as a contribution.
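The output of Step 4 could be as simple as a GeoJSON FeatureCollection. The sketch below shows one possible shape for traced features; the helper names and the `name` property are assumptions for illustration, and no particular tracing tool is implied.

```python
import json

ALLOWED_TYPES = {"Point", "LineString", "Polygon"}


def make_feature(name, geometry_type, coordinates, **attributes):
    """Build one GeoJSON Feature for a traced map feature."""
    if geometry_type not in ALLOWED_TYPES:
        raise ValueError(f"unsupported geometry type: {geometry_type}")
    return {
        "type": "Feature",
        "geometry": {"type": geometry_type, "coordinates": coordinates},
        # Any extra attributes the cartography offers go alongside the name.
        "properties": {"name": name, **attributes},
    }


def make_collection(features):
    """Wrap traced features in a GeoJSON FeatureCollection."""
    return {"type": "FeatureCollection", "features": list(features)}


def to_json(collection):
    """Serialize for download or upload, keeping non-ASCII place names readable."""
    return json.dumps(collection, ensure_ascii=False, indent=2)
```

Coordinates traced from a map that has not been geo-rectified would be pixel positions rather than longitude and latitude, so a group would need to record which of the two a given file contains.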

Once this is done, any subsequent groups attempting to map texts of the region and period in tools like Recogito should see radically better results. Over time: more seeds like this, an easier contribution workflow to WHG, better coverage generally, leading to a true world historical gazetteer resource.

Next steps

The roadblocks to staging a historical mapathon right now are a) the lack of a single tool designed specifically for Step 4*; and b) the lack of a straightforward tutorial for Steps 1-4. The WHG team is committed to working on both of these, and to having them in place by mid-July 2019. We’ll post progress periodically. In the meantime, think of which maps you’d really like to mine in this way, and who you might get to join your mapathon team.

*It should be noted that in some scenarios for non-georeferenced maps, the Recogito tool can be used as is. In that case, all that’s missing is a workflow for converting “within” relationship tags to the Linked Places format used for contributions to WHG. The result will be a graph dataset that could be useful for some purposes.
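As a rough idea of what a graph built from “within” tags might look like, here is a small sketch. The input shape (pairs of place and container) is an assumption; it does not reflect Recogito’s actual export or the Linked Places format.

```python
from collections import defaultdict


def build_within_graph(pairs):
    """Map each place to the set of places it is tagged as being within."""
    parents = defaultdict(set)
    for place, container in pairs:
        parents[place].add(container)
    return parents


def ancestors(graph, place):
    """Collect every container reachable from a place, tolerating cycles."""
    seen = set()
    stack = [place]
    while stack:
        current = stack.pop()
        for parent in graph.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

With such a graph, a place that has no geometry of its own can still be located approximately through any ancestor that does.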

————-

[1] The 1787 “Diccionario geográfico-histórico de las Indias Occidentales o América” (Alcedo) and its English translation (Thompson, 1812).

Contributing in Linked Places format

Fig. 1 – HGIS de las Indias spatial footprint

We are pleased to announce version 1 of the Linked Places (LP) format, to be used for contributions to World-Historical Gazetteer (WHG) and to the Pelagios “registry index” that underlies its Recogito annotation tool and Peripleo search interface. We were joined in this effort by Rainer Simon of Pelagios and Graham Klyne and Arno Bosse of Oxford’s Cultures of Knowledge, so a big shout of thanks to all three.

A basic specification of the format has been published to GitHub, along with some relevant files:

The spatial footprint of the HGIS de las Indias contribution appears in Figure 1; more about that data and ongoing efforts to make it the “seed” for a virtual LatAm Gazetteer within WHG will follow soon.

An Interconnection Format, Not “The One Model”

Historical research projects producing gazetteer data have distinctive data models reflecting their source data and project-specific requirements. We are completely agnostic as to contributors’ internal models and formats. The Linked Places format provides a uniform way to build links between different gazetteers.

Temporal Scoping

WHG and Pelagios are aggregating contributions which have great variety in scope and granularity. With LP format, we are striking a balance between accommodating detail and offering simplicity. Two conceptual models underlie it: one of Place and one of attestations. We say (hopefully not controversially) that places have, over time,

  • one or more names in various languages
  • one or more functional types
  • zero or more known spatial footprints (geometries)
  • one or more relations to other places

Gazetteer data developers gather attestations of these properties, often temporally scoped. That is, names, types, geometries, and relations were/are true for some timespan—whether we have that detail to hand or not. Thing is, some projects do and some don’t, and we must accommodate both cases.

The simple subject, predicate, object form of RDF is not adequate for these kinds of relations and so we have reified them as attestations: NameAttestation, TypeAttestation, Setting, and RelAttestation. Each can be temporally scoped with a “when” element and include citation information, although this is not required. We also permit (and encourage) a global “when” element for the record—effectively the union of temporal attributes for names, types, geometries, and relations.
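To illustrate the idea of a record-level “when” derived from scoped attestations, here is a small sketch. The dictionary keys (`when`, `start`, `end`) and the use of integer years are assumptions for illustration and should not be read as the published specification. The function also returns only the overall extent (earliest start to latest end), which is a simplification of a true union when the timespans have gaps between them.

```python
def global_when(attestations):
    """Return (earliest start, latest end) over attestations that have a 'when'.

    Attestations without temporal scoping are skipped, since the format
    must accommodate projects that lack that detail. Returns None when
    no attestation is temporally scoped.
    """
    spans = [a["when"] for a in attestations if a.get("when")]
    if not spans:
        return None
    return (min(s["start"] for s in spans), max(s["end"] for s in spans))
```

The same function could be applied across the names, types, geometries, and relations of one record by concatenating the four lists before the call.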

Links

We ask that all contributions include, if at all possible, at least one closeMatch or exactMatch relation to a published place record having a de-referenceable URI. These links are “coin of the realm” so to speak. If an incoming record shares such a link with an existing record in our registry index, we add it to the existing record. If it shares no such link with any existing records and doesn’t otherwise match name, type, and geometry, a new registry index record will be created using it as a seed.
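The link-sharing rule described above can be sketched as follows. This is a simplified illustration, not the WHG or Pelagios implementation: the index structure and key names (`links`, `type`, `identifier`, `members`, `uris`) are assumptions, and the fallback comparison of name, type, and geometry is deliberately left out.

```python
MATCH_TYPES = {"closeMatch", "exactMatch"}


def match_uris(record):
    """Collect the URIs a record asserts a closeMatch or exactMatch to."""
    return {
        link["identifier"]
        for link in record.get("links", [])
        if link["type"] in MATCH_TYPES
    }


def ingest(index, record):
    """Add a record to an index entry sharing a link, or seed a new entry.

    `index` is a list of entries; each entry holds its member records and
    the set of match URIs seen so far. Name/type/geometry matching, which
    the text also mentions, is not implemented in this sketch.
    """
    uris = match_uris(record)
    for entry in index:
        if uris & entry["uris"]:
            entry["members"].append(record)
            entry["uris"] |= uris
            return entry
    entry = {"members": [record], "uris": set(uris)}
    index.append(entry)
    return entry
```

Because each merged record contributes its own links to the entry, later records can join through a URI that the original seed never cited.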

Three other kinds of LinkAttestations to existing web resources are recognized: subjectOf, primaryTopicOf, and seeAlso.

Record-Level Properties

Several “record-level” properties round out this format, as detailed in the GitHub specification: title, ccode (modern country), description(s), and depiction(s).

Moving Forward

Please get in touch if you are interested in contributing to WHG. The Linked Places format is not set in stone; this v1 is subject to revision based on our experiences ingesting contributions. The ingest of HGIS de las Indias data as a gazetteer offered to users of Recogito has succeeded, following significant manual effort to make the transformation from that project’s complex model to LP format. Temporally scoped namings and parthood relations in that dataset made it over, but are not reflected in Recogito. Simpler example datasets will be added soon. The WHG web interface and API are being designed to expose all contributed place attributes, but they won’t be available for several months.

Linked Places Annotations

Next up in our format-creation effort is an update to the standard format for contributions of annotations. Just as Peripleo displays coins and inscriptions associated with places in its registry index, the WHG graphical search will display such annotation records, but in our case for historical routes, datasets, and bibliographic records. The format previously used by Pelagios/Peripleo will be updated soon; details to follow.

Contributing to World-Historical Gazetteer: a Preview

The World-Historical Gazetteer project (WHG) will soon begin aggregating and indexing historical gazetteer datasets, and exposing them as Linked Open Data via graphical and programmatic web interfaces, just as Pelagios Commons’ Peripleo project has done for a few years. And like Peripleo, WHG will also index contributions of annotation records that associate historical “items” with place identifiers. Typical items for Peripleo have included coins, coin hoards, and inscriptions of the Classical Mediterranean. The item records WHG will focus on include journey events, regions, and datasets. In fact, annotated items could be anything for which location is relevant, e.g. people and various types of events.

We are almost ready to begin accepting contributions; this post previews the pipeline and formats involved.

Contributions to WHG can include, in some combination: 1) gazetteer data, i.e. place records drawn from historical sources; 2) annotation records that associate a published record about an item with a place identifier; 3) collections of item metadata records referenced in annotations; and 4) a file describing the contributed dataset(s) in “Vocabulary of Interlinked Datasets” (VoID) format.

Over the past several weeks we have collaboratively developed a new Linked Places format (LPF here for short) with Rainer Simon of Pelagios, to be used for contributions of historical place data to both WHG and Pelagios’ Peripleo. The Linked Places format is designed around the JSON-LD syntax of RDF (it is also valid GeoJSON, with temporal extensions, as explained in the GitHub README). The new format makes use of several existing vocabularies and also introduces some terms specific to our shared purposes.

Several expert colleagues contributed valuable input, including Graham Klyne, Richard Light, Lex Berman, Arno Bosse, and Rob Sanderson [1]. We are in the process of updating the template Peripleo has used for annotation contributions (formerly Open Annotation in RDF Turtle, now its next-generation successor, W3C Web Annotation, in JSON-LD). Both are discussed in a little more detail below.

Contributing historical place records

There will be two separate workflows for contributions: from larger projects and from smaller ones. The distinction is whether a project has the capability and resources to meet two criteria which are accepted norms for publishing Linked Data: 1) publishing data in some syntax of RDF (in our case the new LPF); and 2) providing a unique URI and associated “landing page” for each resource described.

Case 1: Larger projects

If your project has (or will have) a web presence that provides public pages describing your individual places and/or “items” (routes, regions, etc.), then we ask that you perform a transformation and export of your data in the standard formats mentioned above – Linked Places, a future annotation format (see Contributing Annotations below), and VoID. Upon validation, we will ingest those records, link them with those already in the system, and expose them in a nice GUI and API. Details of WHG interfaces are forthcoming.

Case 2: Smaller projects

If your project does not entail creating a web site providing per-record landing pages, then we can accept your data contribution as CSV, mint unique URIs, and provide very basic landing pages for places and other items. The records will also be made available as JSON-LD (bona fide RDF) via our API. We will provide a Python program for converting CSV to LPF, but note that the CSV will have to conform to a template that aligns with LPF (available soon). Conversion from your native format to our CSV template will probably be more manageable than to LPF. In other words, once you submit CSV data we can parse, a semi-automated conversion and ingest procedure will result in its publication as Linked Open Data.
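Until the real converter and CSV template are published, the general shape of such a conversion might look like the sketch below. Everything specific here is an assumption made for illustration: the column names (`title`, `lon`, `lat`), the output keys, and the function names are not the WHG template or the promised Python program.

```python
import csv
import io


def row_to_feature(row):
    """Turn one CSV row into a GeoJSON-style Feature with a names list."""
    feature = {
        "type": "Feature",
        "properties": {"title": row["title"]},
        "names": [{"toponym": row["title"]}],
        "geometry": None,  # many historical records have no known coordinates
    }
    if row.get("lon") and row.get("lat"):
        feature["geometry"] = {
            "type": "Point",
            "coordinates": [float(row["lon"]), float(row["lat"])],
        }
    return feature


def csv_to_collection(text):
    """Parse CSV text with a header row into a FeatureCollection."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        "type": "FeatureCollection",
        "features": [row_to_feature(row) for row in reader],
    }
```

Rows without coordinates are kept with a null geometry rather than dropped, since un-located places are exactly the records that benefit most from being linked to others.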

Contributing annotations

WHG will index metadata describing historical “items” annotated with gazetteer record identifiers. These annotation records assert, in effect: “this item is/was associated with this place, in this way;” and optionally, “at this time.”

The result of such annotations can be seen in the current Peripleo interface, where upon navigating to a given place, you can view metadata (including images) for coins and inscriptions associated with it in e.g. a foundAt or hasLocation relation. Annotations exposed in the WHG web interface will include historical journeys for which the given place was a waypoint, and regions, works, and datasets including or referring to the place.

Annotation contributions will comprise two sets of data: 1) collections of brief Item metadata records; and 2) collections of annotation records in W3C Web Annotation format. The contribution template currently in use by Pelagios’ Peripleo is being updated to better account for typing of items and relations. Details of that new Linked Places annotation format (LPAF?) will be published soon. Collaborators in that modeling effort are most welcome!
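Since the new annotation format is not yet published, the following is only a guess at how the assertion “this item is associated with this place, in this way, at this time” might be carried by a W3C Web Annotation. The `@context`, `type`, `motivation`, `body`, and `target` keys come from the Web Annotation model; the `relation` and `when` keys, the function name, and the example.org URIs in the usage note are hypothetical.

```python
import json


def make_annotation(item_uri, place_uri, relation=None, when=None):
    """Build a Web Annotation linking an item (target) to a place (body)."""
    body = {"id": place_uri}
    if relation is not None:
        body["relation"] = relation  # hypothetical key: "in this way"
    if when is not None:
        body["when"] = when  # hypothetical key: "at this time"
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "motivation": "linking",
        "body": body,
        "target": item_uri,
    }


def to_json(annotation):
    """Serialize one annotation as JSON-LD text."""
    return json.dumps(annotation, indent=2)
```

A journey with a waypoint might then be expressed as `make_annotation("https://example.org/items/1", "https://example.org/places/1", relation="waypoint")`, with one annotation per place visited.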

[1] Twitter handles, in order: @gklyne, @RichardOfSussex, @mlex, @kintopp, and @azaroth42