What’s New in the WHG Index

47 million places, 67 million toponyms, and a phonetic search engine that works across scripts.

By Stephen Gadd, WHG Technical Director


The World Historical Gazetteer helps researchers, educators, and students discover how places connect across time, language, and culture. This post describes the most substantial infrastructure change since the platform launched: a full rebuild of the reconciliation index, a new phonetic search capability, and an automated clustering system that links place records across independent gazetteers.

Infrastructure: University of Pittsburgh CRC

The new system runs on dedicated infrastructure provided by the Center for Research Computing (CRC) at the University of Pittsburgh, replacing the previous single-server deployment. The CRC environment provides the compute and storage capacity needed to index and serve tens of millions of records — including the GPU resources used to train and run phonetic embedding models.

47 Million Places, 67 Million Toponyms

The previous WHG reconciliation index drew on two sources (GeoNames and Wikidata) and contained approximately 13.6 million records. The new index incorporates authority data from six major global gazetteers:

| Source | Places | Description |
| --- | --- | --- |
| OpenStreetMap | ~15 million | Crowdsourced global mapping data |
| GeoNames | ~12 million | The world’s largest open geographical database |
| Wikidata | ~8 million | Community-curated structured knowledge base |
| Getty TGN | ~3 million | The Thesaurus of Geographic Names, with substantial historical depth |
| Pleiades | ~37,000 | Gazetteer of the ancient Mediterranean world |
| Library of Congress | Extensive | Geographic authority records |

The total distinct place count is approximately 47 million. More importantly, the index now contains approximately 67 million toponyms — the individual name forms by which those places are or have been known, across languages, scripts, and historical periods. Each toponym is linked to its source places and carries a phonetic embedding (see below), making it possible to search not just by exact string but by sound.

Symphonym: Phonetic Search Across Scripts and Centuries

A persistent difficulty in historical gazetteer work is that the same place may appear under many different names: transliterated into different scripts, adapted to different phonologies, abbreviated, or simply spelled according to conventions that are centuries out of date. Standard text search can match “Florence” but will miss “Firenze”; it can find “Constantinople” but not “Konstantiniyye” or “قسطنطنية”.

Symphonym is a phonetic search system developed for WHG that addresses this problem. Every toponym in the index is converted into a fixed-dimensional phonetic embedding — a vector representation of how the name sounds, derived from Grapheme-to-Phoneme (G2P) conversion and articulatory phonetic feature extraction. Names that sound similar end up close together in embedding space, regardless of script or orthography. A search for “Konstantiniyye” will retrieve “Constantinople” and “قسطنطنية”; “Firenze” will match “Florenz”; “Stamboul” will surface alongside “Istanbul” and “İstanbul”.
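To make the idea of "close together in embedding space" concrete, here is a minimal sketch of nearest-neighbour lookup by cosine similarity. The three-dimensional vectors are invented purely for illustration; they are not output from Symphonym, whose model, dimensionality, and API are not described in this post. The names `TOY_EMBEDDINGS` and `nearest_names` are likewise hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (assumption: real embeddings are learned
# and have many more dimensions). Similar-sounding names are given
# similar vectors by hand.
TOY_EMBEDDINGS = {
    "Constantinople": [0.90, 0.10, 0.20],
    "Konstantiniyye": [0.88, 0.15, 0.18],
    "قسطنطنية": [0.85, 0.12, 0.25],
    "Firenze": [0.10, 0.90, 0.30],
    "Florenz": [0.12, 0.88, 0.28],
    "Leipzig": [0.20, 0.20, 0.95],
}

def nearest_names(query, embeddings, k=2):
    """Return the k names whose vectors are most similar to the query's."""
    query_vec = embeddings[query]
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in embeddings.items()
        if name != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

for name, score in nearest_names("Konstantiniyye", TOY_EMBEDDINGS):
    print(f"{name}: {score:.3f}")
```

A production index would use an approximate nearest-neighbour structure rather than scoring every name, but the ranking principle is the same.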

This is particularly valuable for work with archaic and historical spellings. Researchers working with early modern catalogues, medieval charters, colonial-era maps, or any primary source material will encounter place names in spellings that no longer appear in modern gazetteers. Symphonym’s phonetic matching can bridge this gap: historical spellings and foreign-language forms such as “Lipsick” or “Venedig” can be matched to their standard forms (“Leipzig”, “Venezia”) on the basis of phonetic proximity. This enables the enrichment, linking, and geolocation of catalogue descriptions and historical documents that would otherwise require extensive manual identification.

Note that phonetic search finds names that sound alike. It does not connect names for the same place that sound quite different (e.g. “Eboracum” and “York”, or “Thessaloniki” and “Solun”), whatever their historical relationship. Those connections are established through other signals in the clustering pipeline, such as authority cross-references and spatial co-occurrence.

Automated Clustering Across Gazetteers

When the same physical location is described independently by GeoNames, Wikidata, TGN, and Pleiades — each with their own identifiers and naming conventions — determining which records refer to the same place is a non-trivial problem. The new system includes an automated clustering pipeline that combines multiple signals:

  • Explicit authority cross-references (e.g. sameAs links between Pleiades and GeoNames)
  • Exact toponym co-attestation — places in different gazetteers sharing the same name string, filtered by spatial proximity and country-code overlap
  • Phonetic similarity between toponyms (via Symphonym embeddings), with thresholds calibrated automatically from the authority hard links
  • Spatial proximity of coordinates
  • Feature type alignment across classification systems

The pipeline runs in four phases, from high-confidence explicit links through to phonetic similarity matching. Thresholds for the phonetic phase are not set manually but are learned from the authority hard links themselves: the system samples known-same and known-different place pairs, computes their phonetic and spatial signals, and fits a logistic regression to determine optimal similarity and distance cutoffs. In the most recent run, this calibration yielded a cosine similarity threshold of 0.79 and a spatial distance threshold of 5 km — substantially tighter than the initial manual defaults.
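As a rough sketch of how the calibrated cutoffs might be applied to one candidate pair, the code below combines the 0.79 similarity threshold and the 5 km distance threshold quoted above. It covers only the final decision step, not the logistic-regression calibration. Requiring both conditions together is an assumption, and the function names are hypothetical rather than taken from the WHG codebase.

```python
import math

# Thresholds reported in the post for the most recent calibration run.
SIMILARITY_THRESHOLD = 0.79
DISTANCE_THRESHOLD_KM = 5.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

def is_candidate_match(similarity, coord_a, coord_b):
    """Assumption: a pair must pass BOTH the phonetic and spatial cutoffs."""
    distance_km = haversine_km(*coord_a, *coord_b)
    return (similarity >= SIMILARITY_THRESHOLD
            and distance_km <= DISTANCE_THRESHOLD_KM)

# Two points a few hundred metres apart, with a similar-sounding name.
print(is_candidate_match(0.85, (43.7696, 11.2558), (43.7700, 11.2500)))
```

Pairing a phonetic cutoff with a spatial one is what stops similar-sounding names in distant locations from being linked on sound alone.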

The result is a set of approximately 7 million clusters grouping 19 million of the 47 million place records. Each cluster represents a single real-world location as attested across multiple gazetteers. For users reconciling their own datasets, this means a search can return a single grouped result for a location rather than a confusing set of separate entries from different sources.
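One common way to turn accepted pairwise links into clusters is to take connected components with a union-find structure. The sketch below assumes that approach; the post does not say which grouping algorithm the pipeline actually uses. The record identifiers are made up.

```python
def cluster_records(record_ids, links):
    """Group records into clusters: connected components over pairwise links."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), set()).add(rid)
    # Records with no links stay unclustered, so drop singleton groups.
    return [members for members in clusters.values() if len(members) > 1]

# Hypothetical identifiers, for illustration only.
RECORDS = ["gn:A", "wd:A", "tgn:A", "gn:B", "wd:B", "osm:C"]
LINKS = [("gn:A", "wd:A"), ("wd:A", "tgn:A"), ("gn:B", "wd:B")]

for cluster in cluster_records(RECORDS, LINKS):
    print(sorted(cluster))
```

Here `osm:C` has no links and so belongs to no cluster, in the same way that a large share of the 47 million records remain ungrouped.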

Importantly, the clustering algorithm is designed to be adaptive. Users can assert that particular place records do not belong in a given cluster, and these assertions feed back into the system, improving clustering quality over time.

Clustering also unlocks richer contextual information. When a place record from one authority (e.g. GeoNames or TGN) is clustered with a Wikidata record, the system can follow Wikidata’s links to retrieve supplementary data from Wikipedia — descriptions, images, and other reference material — and present it alongside the place. This means that a search result can surface Wikipedia content even when the original matching authority has no such links itself.

Applications

  • Search retrieves results across scripts, languages, and historical spelling variants
  • Reconciliation matches uploaded data against a substantially larger and more diverse authority base than before
  • Data linking connects user places to the broader Linked Open Data ecosystem via clustered authority identifiers
  • Catalogue enrichment — institutions holding historical documents with place references can use phonetic matching to identify, link, and geolocate those references against modern authority records

Data Architecture

The underlying data model separates places from names (toponyms) and tracks which source attests which name at which point in time. This structure — built on Normalised Place Records, Toponyms, and Attestations — reflects the scholarly reality that place identity is complex and historically contingent.
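A minimal sketch of this three-part separation, using Python dataclasses, might look as follows. All class and field names are assumptions chosen for illustration; the actual WHG schema is not given in this post, and the sample place, names, and years are invented.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Place:
    place_id: str   # hypothetical identifier
    source: str     # gazetteer the record came from

@dataclass(frozen=True)
class Toponym:
    toponym_id: str
    name: str
    language: Optional[str] = None

@dataclass(frozen=True)
class Attestation:
    """Records that a source attests a name for a place over some period."""
    place_id: str
    toponym_id: str
    source: str
    start_year: Optional[int] = None  # None means open-ended
    end_year: Optional[int] = None

def names_attested_in(year, place_id, attestations, toponyms):
    """Names attested for a place in a given year, in attestation order."""
    by_id = {t.toponym_id: t for t in toponyms}
    names = []
    for att in attestations:
        if att.place_id != place_id:
            continue
        starts_ok = att.start_year is None or att.start_year <= year
        ends_ok = att.end_year is None or year <= att.end_year
        if starts_ok and ends_ok:
            names.append(by_id[att.toponym_id].name)
    return names

# Invented example data, not drawn from any real gazetteer.
PLACE = Place("example:1", "demo-gazetteer")
TOPONYMS = [Toponym("t1", "Oldtown"), Toponym("t2", "Newtown")]
ATTESTATIONS = [
    Attestation("example:1", "t1", "demo-source", 1200, 1499),
    Attestation("example:1", "t2", "demo-source", 1500, None),
]

print(names_attested_in(1300, PLACE.place_id, ATTESTATIONS, TOPONYMS))
```

Keeping the attestation as its own record is what lets one place carry several names, each with its own source and date range.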

The current indexing and clustering system runs batch computations over Elasticsearch. It is the first major step towards a graph-based architecture in which pairwise links between place records are stored as edges and cluster membership is resolved by graph traversal at query time. Under this model, batch clustering becomes unnecessary: clusters can be computed on the fly for any query, and users can adjust confidence thresholds interactively (e.g. “show me only high-confidence links” vs “include tentative matches”).
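Query-time cluster resolution of this kind can be sketched as a breadth-first traversal that ignores edges below a chosen confidence. This illustrates the general idea only, not the planned implementation; the edge format, confidence values, and identifiers are all hypothetical.

```python
from collections import defaultdict, deque

def resolve_cluster(start, edges, min_confidence):
    """Records reachable from `start` via edges at or above min_confidence."""
    adjacency = defaultdict(list)
    for a, b, confidence in edges:
        if confidence >= min_confidence:
            adjacency[a].append(b)
            adjacency[b].append(a)

    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

# Hypothetical links: (record, record, confidence).
EDGES = [
    ("gn:A", "wd:A", 0.99),
    ("wd:A", "tgn:A", 0.95),
    ("tgn:A", "pl:A", 0.60),
]

print(sorted(resolve_cluster("gn:A", EDGES, min_confidence=0.9)))  # high-confidence links only
print(sorted(resolve_cluster("gn:A", EDGES, min_confidence=0.5)))  # include tentative matches
```

Raising the threshold drops the tentative `pl:A` link from the result without recomputing anything in batch.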

The graph architecture also enables a fundamental shift in how contributions work. Rather than uploading datasets of place records and reconciling them after the fact, the predominant form of contribution becomes the attestation: a name, date, source reference, or classification attached to an existing place in the index. Contributors find the place and attach their evidence to it; new place identities are minted only when no existing record matches. This attestation-centric model better reflects scholarly practice — researchers typically have evidence about known places, not inventories of new ones — and the dense authority backbone of 47 million indexed places makes it practical. (See the design discussion for further detail.)

The new indexing and clustering system will be rolled out progressively. Updates will be posted at whgazetteer.org and on the documentation site.