What’s New in the WHG Index

47 million places, 67 million toponyms, and a phonetic search engine that works across scripts.

By Stephen Gadd, WHG Technical Director


The World Historical Gazetteer helps researchers, educators, and students discover how places connect across time, language, and culture. This post describes the most substantial infrastructure change since the platform launched: a full rebuild of the reconciliation index, a new phonetic search capability, and an automated clustering system that links place records across independent gazetteers.

Infrastructure: University of Pittsburgh CRC

The new system runs on dedicated infrastructure provided by the Center for Research Computing (CRC) at the University of Pittsburgh, replacing the previous single-server deployment. The CRC environment provides the compute and storage capacity needed to index and serve tens of millions of records — including the GPU resources used to train and run phonetic embedding models.

47 Million Places, 67 Million Toponyms

The previous WHG reconciliation index drew on two sources (GeoNames and Wikidata) and contained approximately 13.6 million records. The new index incorporates authority data from six major global gazetteers:

Source              | Places      | Description
OpenStreetMap       | ~15 million | Crowdsourced global mapping data
GeoNames            | ~12 million | The world’s largest open geographical database
Wikidata            | ~8 million  | Community-curated structured knowledge base
Getty TGN           | ~3 million  | The Thesaurus of Geographic Names, with substantial historical depth
Pleiades            | ~37,000     | Gazetteer of the ancient Mediterranean world
Library of Congress | Extensive   | Geographic authority records

The total distinct place count is approximately 47 million. More importantly, the index now contains approximately 67 million toponyms — the individual name forms by which those places are or have been known, across languages, scripts, and historical periods. Each toponym is linked to its source places and carries a phonetic embedding (see below), making it possible to search not just by exact string but by sound.

Symphonym: Phonetic Search Across Scripts and Centuries

A persistent difficulty in historical gazetteer work is that the same place may appear under many different names: transliterated into different scripts, adapted to different phonologies, abbreviated, or simply spelled according to conventions that are centuries out of date. Standard text search can match “Florence” but will miss “Firenze”; it can find “Constantinople” but not “Konstantiniyye” or “قسطنطنية”.

Symphonym is a phonetic search system developed for WHG that addresses this problem. Every toponym in the index is converted into a fixed-dimensional phonetic embedding — a vector representation of how the name sounds, derived from Grapheme-to-Phoneme (G2P) conversion and articulatory phonetic feature extraction. Names that sound similar end up close together in embedding space, regardless of script or orthography. A search for “Konstantiniyye” will retrieve “Constantinople” and “قسطنطنية”; “Firenze” will match “Florenz”; “Stamboul” will surface alongside “Istanbul” and “İstanbul”.

This is particularly valuable for work with archaic and historical spellings. Researchers working with early modern catalogues, medieval charters, colonial-era maps, or any primary source material will encounter place names in spellings that no longer appear in modern gazetteers. Symphonym’s phonetic matching can bridge this gap: variant historical spellings like “Lipsick” or “Venedig” can be matched to their standard forms (“Leipzig”, “Venezia”) on the basis of phonetic proximity. This enables the enrichment, linking, and geolocation of catalogue descriptions and historical documents that would otherwise require extensive manual identification.

Note that phonetic search finds names that sound alike — it does not resolve etymologically unrelated names for the same place (e.g. “Eboracum” and “York”, or “Thessaloniki” and “Solun”). Those connections are established through other signals in the clustering pipeline, such as authority cross-references and spatial co-occurrence.
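As a rough illustration of the nearest-neighbour idea behind phonetic search, the sketch below ranks names by cosine similarity between embedding vectors. The three-dimensional vectors are invented for illustration and are not Symphonym output; real embeddings come from the G2P and articulatory feature pipeline described above. The function names are hypothetical, and the 0.79 cutoff is borrowed from the clustering calibration reported in this post rather than from the search service itself.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy, hand-made vectors (assumption): similar-sounding names are placed close together.
TOY_EMBEDDINGS = {
    "Constantinople": [0.90, 0.10, 0.30],
    "Konstantiniyye": [0.88, 0.15, 0.28],
    "قسطنطنية": [0.85, 0.12, 0.35],
    "Leipzig": [0.10, 0.90, 0.20],
    "Lipsick": [0.14, 0.86, 0.25],
}


def phonetic_neighbours(query, embeddings, threshold=0.79):
    """Return (name, similarity) pairs at or above the threshold, best first."""
    q = embeddings[query]
    scored = [
        (name, cosine_similarity(q, vec))
        for name, vec in embeddings.items()
        if name != query
    ]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


neighbours = phonetic_neighbours("Konstantiniyye", TOY_EMBEDDINGS)
neighbour_names = [name for name, _ in neighbours]
```

With these toy vectors the query retrieves the two Constantinople forms and leaves the Leipzig forms out. At production scale a brute-force scan like this would presumably be replaced by an approximate nearest-neighbour index.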

Automated Clustering Across Gazetteers

When the same physical location is described independently by GeoNames, Wikidata, TGN, and Pleiades — each with their own identifiers and naming conventions — determining which records refer to the same place is a non-trivial problem. The new system includes an automated clustering pipeline that combines multiple signals:

  • Explicit authority cross-references (e.g. sameAs links between Pleiades and GeoNames)
  • Exact toponym co-attestation — places in different gazetteers sharing the same name string, filtered by spatial proximity and country-code overlap
  • Phonetic similarity between toponyms (via Symphonym embeddings), with thresholds calibrated automatically from the authority hard links
  • Spatial proximity of coordinates
  • Feature type alignment across classification systems
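A minimal sketch of how signals like these might be combined for a single pair of records follows. It is not the WHG pipeline: the record fields, the phase labels, and the order of checks are assumptions, feature types are assumed to be already mapped to a shared class, and the same distance cutoff is reused for the exact-name check because the post does not give a separate one. The default cutoffs are the calibrated values reported for the most recent run.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def match_phase(a, b, phonetic_sim, sim_cutoff=0.79, km_cutoff=5.0):
    """Return a (hypothetical) label for the first signal that links a and b, else None."""
    # 1. Explicit authority cross-references win outright.
    if b["id"] in a.get("same_as", ()) or a["id"] in b.get("same_as", ()):
        return "authority_cross_reference"
    # 2. Feature types must align (assumed pre-mapped to a shared class).
    if a["feature_class"] != b["feature_class"]:
        return None
    # 3. Coordinates must be close.
    if haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]) > km_cutoff:
        return None
    # 4. Shared name string plus matching country code.
    if set(a["names"]) & set(b["names"]) and a["country"] == b["country"]:
        return "exact_name"
    # 5. Otherwise fall back to phonetic similarity.
    if phonetic_sim >= sim_cutoff:
        return "phonetic"
    return None


# Illustrative records with made-up identifiers and approximate coordinates.
rec_a = {"id": "gn:1", "names": ["Leipzig"], "country": "DE",
         "feature_class": "settlement", "lat": 51.340, "lon": 12.370}
rec_b = {"id": "wd:1", "names": ["Leipzig", "Lipsk"], "country": "DE",
         "feature_class": "settlement", "lat": 51.341, "lon": 12.372}
phase = match_phase(rec_a, rec_b, phonetic_sim=1.0)
```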

The pipeline runs in four phases, from high-confidence explicit links through to phonetic similarity matching. Thresholds for the phonetic phase are not set manually but are learned from the authority hard links themselves: the system samples known-same and known-different place pairs, computes their phonetic and spatial signals, and fits a logistic regression to determine optimal similarity and distance cutoffs. In the most recent run, this calibration yielded a cosine similarity threshold of 0.79 and a spatial distance threshold of 5 km — substantially tighter than the initial manual defaults.
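To make the calibration idea concrete, here is a deliberately simplified sketch: a one-feature logistic regression fitted by plain gradient descent, with the cutoff read off where the fitted probability crosses 0.5. The training pairs are invented, only the similarity feature is used (the real calibration also uses spatial distance), and the post does not say which library or fitting method the pipeline uses, so none of this should be read as the actual implementation or as a way to reproduce the 0.79 figure.

```python
import math


def fit_logistic_1d(xs, ys, lr=1.0, steps=5000):
    """Fit p(same | x) = sigmoid(w * (x - mean) + b) by batch gradient descent."""
    mean = sum(xs) / len(xs)
    centred = [x - mean for x in xs]  # centring keeps the optimisation well behaved
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(centred, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b, mean


def decision_cutoff(w, b, mean):
    """Similarity value at which the fitted probability equals 0.5."""
    return mean - b / w


# Invented sample: cosine similarities for known-same (1) and known-different (0) pairs.
similarities = [0.95, 0.90, 0.88, 0.85, 0.82, 0.80, 0.70, 0.60, 0.55, 0.50, 0.40, 0.30]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

w, b, mean = fit_logistic_1d(similarities, labels)
cutoff = decision_cutoff(w, b, mean)
```

The appeal of this approach is that the cutoff follows from the authority links themselves: if the known-same pairs in a new data release sound more or less alike than before, the threshold moves with them.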

The result is a set of approximately 7 million clusters grouping 19 million of the 47 million place records. Each cluster represents a single real-world location as attested across multiple gazetteers. For users reconciling their own datasets, this means a search can return a single grouped result for a location rather than a confusing set of separate entries from different sources.
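One common way to turn pairwise links into clusters is a union-find (disjoint-set) structure, sketched below. The post does not say that WHG forms clusters this way; the transitive grouping, the function name, and the record identifiers are all assumptions made for illustration. Records with no links are left out of the result, mirroring the fact that only some of the indexed records belong to a cluster.

```python
def cluster_links(record_ids, links):
    """Group records connected by pairwise links; return only multi-member clusters."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), set()).add(rid)
    return [members for members in clusters.values() if len(members) > 1]


# Made-up identifiers: two multi-source places and one record with no links.
records = ["gn:1", "wd:1", "tgn:1", "gn:2", "pl:2", "osm:3"]
links = [("gn:1", "wd:1"), ("wd:1", "tgn:1"), ("gn:2", "pl:2")]
clusters = cluster_links(records, links)
```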

Importantly, the clustering algorithm is designed to be adaptive. Users can assert that particular place records do not belong in a given cluster, and these assertions feed back into the system, improving clustering quality over time.

Clustering also unlocks richer contextual information. When a place record from one authority (e.g. GeoNames or TGN) is clustered with a Wikidata record, the system can follow Wikidata’s links to retrieve supplementary data from Wikipedia — descriptions, images, and other reference material — and present it alongside the place. This means that a search result can surface Wikipedia content even when the original matching authority has no such links itself.

Applications

  • Search retrieves results across scripts, languages, and historical spelling variants
  • Reconciliation matches uploaded data against a substantially larger and more diverse authority base than before
  • Data linking connects user places to the broader Linked Open Data ecosystem via clustered authority identifiers
  • Catalogue enrichment — institutions holding historical documents with place references can use phonetic matching to identify, link, and geolocate those references against modern authority records

Data Architecture

The underlying data model separates places from names (toponyms) and tracks which source attests which name at which point in time. This structure — built on Normalised Place Records, Toponyms, and Attestations — reflects the scholarly reality that place identity is complex and historically contingent.
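The separation of places, names, and attestations could be modelled along the following lines. This is only a sketch: the class and field names are hypothetical and do not reproduce WHG’s schema, and the example place and its dates are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Toponym:
    name: str
    language: Optional[str] = None


@dataclass(frozen=True)
class Attestation:
    """A source asserting that a place bore a name during some period."""
    toponym: Toponym
    source: str
    start_year: Optional[int] = None  # None means open-ended
    end_year: Optional[int] = None


@dataclass
class PlaceRecord:
    place_id: str
    attestations: List[Attestation] = field(default_factory=list)

    def names_in_year(self, year: int) -> List[str]:
        """Names attested for this place in the given year, sorted alphabetically."""
        names = set()
        for att in self.attestations:
            after_start = att.start_year is None or att.start_year <= year
            before_end = att.end_year is None or year <= att.end_year
            if after_start and before_end:
                names.add(att.toponym.name)
        return sorted(names)


# Invented example: one place, two names attested by different sources.
place = PlaceRecord("example:1", [
    Attestation(Toponym("Oldname"), source="charter-A", start_year=1200, end_year=1499),
    Attestation(Toponym("Newname"), source="gazetteer-B", start_year=1500),
])
```

Keeping the attestation as its own object is what lets the same place carry different names at different times, each traceable to the source that asserted it.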

The current indexing and clustering system, which runs batch computations over Elasticsearch, is the first major step towards a graph-based architecture in which pairwise links between place records are stored as edges and cluster membership is resolved by graph traversal at query time. Under this model, batch clustering becomes unnecessary: clusters can be computed on-the-fly for any query, and users can adjust confidence thresholds interactively (e.g. “show me only high-confidence links” vs “include tentative matches”).
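A sketch of what query-time cluster resolution could look like: links are stored as weighted edges, and a breadth-first traversal collects everything reachable from a starting record through edges that meet the user’s confidence threshold. The edge format, confidence values, and function name are assumptions; the post does not describe the planned graph store or its query language.

```python
from collections import deque


def cluster_at_query_time(edges, start, min_confidence):
    """Return the set of records reachable from start via sufficiently confident links."""
    adjacency = {}
    for a, b, confidence in edges:
        if confidence >= min_confidence:
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)

    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


# Made-up links: (record, record, confidence).
edges = [("a", "b", 0.95), ("b", "c", 0.60), ("c", "d", 0.90)]
high_confidence_only = cluster_at_query_time(edges, "a", min_confidence=0.9)
including_tentative = cluster_at_query_time(edges, "a", min_confidence=0.5)
```

Raising the threshold drops the tentative link between "b" and "c", so the same starting record yields a smaller cluster without any batch recomputation.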

The graph architecture also enables a fundamental shift in how contributions work. Rather than uploading datasets of place records and reconciling them after the fact, the predominant form of contribution becomes the attestation: a name, date, source reference, or classification attached to an existing place in the index. Contributors find the place and attach their evidence to it; new place identities are minted only when no existing record matches. This attestation-centric model better reflects scholarly practice — researchers typically have evidence about known places, not inventories of new ones — and the dense authority backbone of 47 million indexed places makes it practical. (See the design discussion for further detail.)

The new indexing and clustering system will be rolled out progressively. Updates will be posted at whgazetteer.org and on the documentation site.

New Published Datasets!

We’re excited to share the following newly published datasets from WHG:

The Belgian Historical Gazetteer – Provinces Antwerp and East-Flanders dataset brings together historical place names from the reduced cadastre (gereduceerd kadaster) of Belgium (1847–1855), focusing on the provinces of Antwerp and East Flanders. Contributed by Léa Hermenault, it forms part of the wider Belgian Historical Gazetteer Project (CLARIAH-VL+ and the University of Antwerp).

Cliopatria – A geospatial database of world-wide political entities from 3400BCE to 2024CE is a comprehensive open-source geospatial dataset of worldwide states, political groups, events, and rulers. It currently comprises over 1,600 political entities sampled at varying timesteps and spatial scales. The dataset is edited by Ed Chalstrey, James Bennett, and Erin Mutch and was converted into Linked Places Format by Stephen Gadd.

La sfera (The Globe), written by the Florentine merchant Goro Dati, is a textbook designed to introduce the next generation of Florentine merchants to natural phenomena, navigation, and the topography of the Mediterranean. Dati’s Globe (La sfera) is a dataset of places from Book IV, which contains an itinerary and maps of major Mediterranean and Black Sea ports.

WHG is always excited to welcome new contributions. If you’re interested in working with us, we’d love to hear from you!

WHG Creates Video for “What’s in a Name? Exploring Place Names as Forms of Social and Geographic Storytelling”

The World Historical Gazetteer was invited to contribute an instructional video as part of a set of online modules and classroom resources focused on exploring place names as meaningful forms of social and geographic storytelling. The instructional video was created for the Tennessee Geographic Alliance as an extension of “What’s in a Name? Exploring Place Names as Forms of Social and Geographic Storytelling,” an interactive workshop that invited educators to explore the profound social, historical, and political meanings behind the names attached to places.

The video includes an introduction to the World Historical Gazetteer website and its features, presented by Ruth Mostern; highlights of datasets in the WHG index that reflect the contested nature of place names, presented by Palak Vashist; and a tutorial on creating your own collection of places in the WHG, presented by Ali Straub.

Other videos in the series explore the power of place name repatriation, tools for analyzing place names in social texts, and the relationship between names, identity, and memory.

Explore the videos here!

ISHI at Linked Pasts Symposium 11

On December 9, 2025, members of the ISHI and World Historical Gazetteer (WHG) teams led a session at Linked Pasts Symposium 11: “Linking Knowledge Through Place: ISHI, WHG, and the Future of Gazetteer Collaboration.” The Linked Pasts Symposium is a goal-oriented forum focused on building, planning, and learning about the application of linked open data (LOD) to historical texts, events, and datasets.

During the event, Ruth Mostern, ISHI Director and WHG Project Director, introduced ISHI and its new role in the Pelagios Place Activity (formerly the Gazetteer Alignment Activity). Stephen Gadd, WHG Technical Director, presented recent updates to the World Historical Gazetteer, including ORCID-based authentication, new Entity and Reconciliation Service APIs, and improved reconciliation with Wikidata. Ali Straub and Palak Vashist then discussed new and improved contributor documentation, including a more user-friendly LP-TSV template and a Submission Readiness Checklist that outlines the steps and criteria required to successfully publish a dataset on the WHG.

After the presentations, an open discussion addressed challenges in modeling historical places from uncertain data, WHG’s plans for a graph data model in its next version, its curation strategy and current content gaps, and opportunities for partnerships and capacity building. Participants also expressed strong interest in developing a centralized portal or clearinghouse for syllabi, courses, and training materials related to spatial history and gazetteers.

View the session agenda and discussion summary here. 

WHG Transitions to ORCiD Authentication

The World Historical Gazetteer now requires authentication via ORCiD (Open Researcher and Contributor ID) for all registered users. ORCiD ensures accurate attribution, persistent researcher identity, interoperability, and scholarly credit within the linked open data ecosystem, while keeping the platform accessible to users without an institutional affiliation. It also streamlines account access by removing the need to manage passwords or depend on third-party login services: users with existing ORCiD accounts can sign in seamlessly and securely. We’ve adopted ORCiD so that users can benefit from a more secure, flexible, and research-friendly system.

The use of ORCiD will also enable secure and controlled access to the WHG’s APIs, including two new complementary APIs for use with WHG data. The Entity API will allow users to retrieve full metadata, names, types, geometries, temporal bounds, authority info, and linked resources from among datasets, collections, and over 2.2 million places. The Reconciliation Service API will allow users to match historical geographical entities with the WHG data in both automated workflows and manual tools such as OpenRefine.
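For readers unfamiliar with how OpenRefine-style reconciliation clients talk to a service, the sketch below builds a batch of queries in the general shape used by the Reconciliation Service API protocol: a JSON object of keyed queries sent as a form-encoded "queries" parameter. Nothing is sent over the network. The endpoint URL, the authentication details, the protocol version, and the query properties that WHG’s service accepts are not given in this post, so consult the WHG API documentation before relying on any of this; the function name is hypothetical.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_reconciliation_body(names, limit=5):
    """Form-encode a batch of name queries keyed q0, q1, ... (assumed protocol shape)."""
    queries = {
        f"q{i}": {"query": name, "limit": limit}
        for i, name in enumerate(names)
    }
    return urlencode({"queries": json.dumps(queries, ensure_ascii=False)})


body = build_reconciliation_body(["Lipsick", "Konstantiniyye"])

# Decode again to show what a service would receive.
decoded = json.loads(parse_qs(body)["queries"][0])
```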

Our transition to ORCiD authentication was made possible by a collaboration with the University of Pittsburgh Library System (ULS).

What This Means for Users 

  • Existing WHG users can link their WHG account with an ORCiD by using the Legacy WHG Login and Link ORCiD button on whgazetteer.org/accounts/login/.
  • If you are a new user who already has an ORCiD, you can simply log in using your existing ORCiD credentials.
  • If you are a new user who does not have an ORCiD, you can create one during sign-in on whgazetteer.org/accounts/login/.
  • Registration is NOT required to search WHG’s indexes or to view datasets or collections. Registration is required only for dataset contribution, the creation of collections, or use of our APIs. With your consent, it would also allow us to let you know about important updates and new features.

A Message about WHG Technical Director Karl Grossner’s Retirement

After more than seven years of dedicated work on the World Historical Gazetteer (WHG), Technical Director and Lead Developer Dr. Karl Grossner has announced his retirement from the project team. Karl has been instrumental in all aspects of envisioning, guiding, and building the WHG into groundbreaking digital infrastructure that includes a spatially and temporally referenced index of world historical place names and a linked data ecosystem. Karl has led the development of the platform through three versions, the most recent of which indexes over 3.4 million place names. 

Karl’s contributions have gone far beyond technical expertise. He has taken a leading role in setting the vision for the project, building a collaborative and robust community of scholars who work with linked open geodata, and soliciting and developing the content that we index. His dedication, expertise, and commitment have been fundamental to the project’s success and evolution. We are grateful that Karl remains committed to the success of the WHG and that he will continue contributing actively to it in his retirement. You can keep up with Karl’s continued work on his X account: @kgeographer. We’ve posted a statement from Karl on the WHG blog here: http://blog.whgazetteer.org/2024/07/26/a/

Karl’s accomplishments ensure that the WHG has a bright future. We will continue improving the platform, growing the community, and expanding the index of named places. We are pleased to announce that Dr. Stephen Gadd, a scholar of early modern economic history, has transitioned into the role of lead developer. Stephen has worked closely with Karl over the past year and shares a passion for the WHG and the linked open data community. In the coming months, Stephen and the project team will continue enhancing and extending the platform’s features including the API and the reconciliation process, accessioning historical place datasets, and building our community.

We hope that you will stay in touch during this transition and that you will join us in expressing gratitude and esteem to Karl and sharing good wishes for his future. Please use the contact form under About on whgazetteer.org to contact the project team.

Ruth Mostern, Principal Investigator and Project Director

Stephen Gadd, Lead Developer

Alexandra Straub, Project Manager

A message from Dr. Karl Grossner on his retirement

Dear WHG community,

With more than a little regret and some relief, I have withdrawn from my roles as Technical Director and Lead Developer on the World Historical Gazetteer (WHG) project team after 7 ½ years, and will depart the team entirely at the end of 2024. As of now, the technical lead on WHG is the estimable Stephen Gadd (@docuracy). The rest of the team is unchanged. The project is in good hands, led as always by Dr. Ruth Mostern, and the prospects for continued development as an increasingly important resource in DH software infrastructure are great.

It was my good fortune in 2016 to be asked by Ruth and Dr. Patrick Manning to help write an NEH grant to initiate the WHG project, and to serve as its Technical Director should the grant be awarded. And it was my further good fortune to work with great teammates in bringing the platform along through versions 1 and 2 to version 3, released in June 2024. The initial “proof of concept” led to a second significant NEH grant and support from abroad and from several internal groups at Pitt.

When I left Stanford Libraries in 2016 it was intended to be a ‘semi-retirement’! Well, that was obviously deferred, but now is the time to turn away from software development and focus more on my (always applied) research agenda, “computing place.” 

I will be an active data contributor going forward, and may even make an occasional small pull request, but only as one of a hopefully growing number of contributing developers; it is an open-source project! I’m not vanishing, just stepping aside.

With all best wishes to the team, and to all the people I’ve met along this journey…

Karl

Adding GeoNames to Wikidata for reconciliation

In the upcoming Version 3 beta of World Historical Gazetteer (early June 2024), we have added about 10 million GeoNames place records to the 3.6 million Wikidata records in the index we have been using for reconciliation. This means that for geocoding purposes (one of the main reasons for using the WHG reconciliation service) there will be a higher likelihood of finding prospective matches for your records.

It also adds some complexity to the review of hits (see the screen below) and we are looking for feedback on how this will work. During the beta phase we can refine or even discard this feature – up to our users!

So…how it will work:

  • When you create a new reconciliation task you have the option to exclude GeoNames records; if you do, they will be skipped in the search for matches
  • If you don’t exclude them, hits from GeoNames will be returned along with those from Wikidata, but…
    • If there are both Wikidata and GeoNames hits, the GeoNames ones are hidden initially but displayed at the click of a toggle button
    • If there are no Wikidata hits but there are GeoNames hits, those are displayed right away.
  • As usual, you can select zero or more of these hits as close matches, press Save and move on to the next.
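The display rules above can be summarised in a few lines of code. This is a sketch of the logic as described in this post, not the WHG implementation; the hit format, parameter names, and function name are hypothetical.

```python
def hits_to_display(hits, exclude_geonames=False, show_geonames=False):
    """Split hits into (visible, hidden) lists according to the rules described above."""
    wikidata = [h for h in hits if h["source"] == "wikidata"]
    if exclude_geonames:
        geonames = []  # excluded GeoNames records are skipped entirely
    else:
        geonames = [h for h in hits if h["source"] == "geonames"]

    if not wikidata:
        # No Wikidata hits: any GeoNames hits display right away.
        return geonames, []
    if show_geonames:
        # The user clicked the toggle button.
        return wikidata + geonames, []
    # Default: GeoNames hits stay hidden behind the toggle.
    return wikidata, geonames


example_hits = [
    {"source": "wikidata", "id": "wd:1"},
    {"source": "geonames", "id": "gn:1"},
]
visible, hidden = hits_to_display(example_hits)
```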

Below you can see the review screen before and after choosing to display GeoNames hits.

Wikidata hit shown, GeoNames hidden until requested
GeoNames hits displayed on request

Version 3 due in June!

We have been busy, with both software and content development. Version 3 of the World Historical Gazetteer has been in development since February 2023, and a beta version will be available mid-2024. What follows is a brief outline of what we have been working on, much of which came as suggestions from our user community. Details will follow in the coming weeks and months, on this blog and on Twitter. We do expect to establish a Mastodon account soon as well.

Version 3 (alpha) home page

New “Gazetteer Builder” feature

  • Link multiple datasets in a single collection, e.g. for a group or individual to assemble a “Historical Gazetteer of {x}”
  • Merge multiple datasets into a new dataset

Home page

  • A map(!), with search and advanced search
  • ‘Carousels’ of published datasets and collections, with extents previewed on the map
  • Improved explanation of what the WHG offers
  • News and announcements

Maps

  • All 14 maps on the site significantly upgraded
  • Most maps now have temporal controls: a timespan ‘slider’ and/or a sequence ‘player’
  • Faster display of large datasets and collections, thanks to WHG’s own new “tileboss” server

Search

  • Search now runs across all published records; the confusing “search the index or database” choice is gone!
  • Options for “starts with”, “contains”, “similar to” (aka fuzzy), as well as “exact”
  • Spatial filter on search results
  • More information returned in search result items
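The four name-matching options can be illustrated with a small stand-in. WHG’s search runs on Elasticsearch, and how its “similar to” option is actually scored is not described here, so this sketch swaps in Python’s standard-library difflib for the fuzzy case; the mode names, the 0.8 cutoff, and the function name are assumptions.

```python
from difflib import SequenceMatcher


def name_matches(query, candidate, mode="exact", fuzzy_cutoff=0.8):
    """Case-insensitive name comparison under one of four illustrative modes."""
    q, c = query.casefold(), candidate.casefold()
    if mode == "exact":
        return q == c
    if mode == "starts_with":
        return c.startswith(q)
    if mode == "contains":
        return q in c
    if mode == "similar_to":
        # difflib's similarity ratio is a stand-in for a search engine's fuzzy matching.
        return SequenceMatcher(None, q, c).ratio() >= fuzzy_cutoff
    raise ValueError(f"unknown mode: {mode}")


fuzzy_hit = name_matches("Istanbul", "Istambul", mode="similar_to")
```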

Place Portal pages

  • Complete makeover of its design
  • Physical geographic context: ecoregions, watersheds, rivers, boundaries
  • Nearby places
  • Preview of annotated collections that include the place

Publication and editorial workflow

  • We are now especially highlighting three types of publications: Datasets, annotated Place Collections, and Dataset Collections
  • Expanded Managing Editor role
  • Improved tracking of contributors and data, from ‘interested’ to full accessioning
  • DOIs for data publications, enhanced metadata, significantly enhanced presentation pages
  • Improved download options

Annotated place collections for teaching

  • Support for class and workshop group scenarios
  • Optional image per annotation
  • Order places sequentially with or without dates
  • Enhanced display and temporal control options
  • Optional gallery per class
  • Site-wide student gallery

“My Data” dashboard and profile

  • Single page, simpler

Study Areas

  • Discontiguous areas, e.g. Iberian peninsula and S. America as a single area

API and data dumps

  • More endpoints, better documented
  • Regular dumps of published data in multiple formats

Codebase

  • Improved file upload validation and error reporting
  • The codebase is now “dockerized,” making it much easier to contribute to the platform’s development
  • Upgraded versions of all major components: Django, PostgreSQL, Elasticsearch, etc.
  • All map-related functions refactored for efficiency

Community Feedback Meetings: September 2023

As the project team continues developing the WHG platform and expanding indexed and published content about historical places, we want to ensure that we are maximizing opportunities to involve our community in decision making and feedback. To that end, the WHG project team will be holding a series of community feedback meetings. We have scheduled four 60-minute Zoom meetings in September of 2023 during which we hope to solicit feedback from those interested in and knowledgeable about the WHG project and also humanistic linked data more broadly. We seek your help in identifying and prioritizing next steps in making the WHG platform more useful and more usable for more people.

The four meetings will take place on Thursday, September 7 at 9:00AM and 4:00PM Eastern and Friday, September 8 at 9:00AM and 4:00PM Eastern. You can register for a meeting using the Zoom links below. 

September 7, 2023 9:00-10:00 am EDT

September 7, 2023 4:00-5:00 pm EDT

September 8, 2023 9:00-10:00 am EDT

September 8, 2023 4:00-5:00 pm EDT

The agenda will be the same for all sessions: a welcome and introduction, a brief overview of project developments over the past year, a walkthrough of new developments for the upcoming Version 3, and time for questions and discussion. We can be most productive if participants visit WHG beforehand (https://whgazetteer.org), register as users if they haven’t already, and browse the site guide, tutorials, and features themselves. Prior to the meeting, you will receive a Google Survey form that you can use to record comments and thoughts during and after the meeting.