Stephen Gadd – World Historical Gazetteer

A whole world, alphabetically: an 1856 gazetteer, read by 2026 tools

We’ve just put something a little unusual online: an interactive map and reader of A Gazetteer of the World, a seven-volume reference book from 1856 that set out to describe, in alphabetical order, every place its compilers could find. Tens of thousands of towns, rivers, mountains, ruins, provinces and ports, each with a paragraph on where it is, what kind of place it is, who lives there and how many of them. We’ve turned those printed pages into data you can search, map, and read, all explorable in your browser with no server doing the work behind the scenes.

Before anything else: please treat it as a research demonstrator, not an authority. It was made automatically, end to end, and most of the map locations are currently wrong; the statistical tables, too, are only roughly rendered. More on why that’s interesting, rather than just embarrassing, below.

The book, and its rather elusive editor

A gazetteer is a geographical dictionary, and in the mid-nineteenth century they were serious undertakings. Ours was published in Edinburgh by A. Fullarton & Co. between 1850 and 1856, and it’s firmly in the public domain, which is what makes a project like this possible at all.

It’s also, charmingly, anonymous: the title page credits only “a Member of the Royal Geographical Society.” That member is now generally identified as George Godfrey Cunningham (c. 1802–1860), a Scottish writer, compiler and translator who was himself a partner in Fullarton, and who seems to have spent his career assembling other examples of this sort of vast reference work (he also produced an eight-volume Lives of Eminent and Illustrious Englishmen) and rendering German Romantic tales into English on the side. Beyond his memberships and a scatter of addresses across Scotland and England, remarkably little about him survives; the Gazetteer is reckoned his principal achievement, yet he put his name to none of it.

A gazetteer is never a neutral list, either. Cunningham’s world is the world as seen from mid-Victorian Britain, with all the imperial framing, uneven coverage and confident judgements that implies. That is worth keeping in mind, and it is also why the World Historical Gazetteer records sourced attestations rather than facts: an entry says “this source, at this date, called this place this, and placed it here”, not “this is the truth”. Cunningham’s 1856 view becomes one attestation among many.

How it was made (the short version)

Nobody typed any of this in. We started from public-domain page scans on HathiTrust and ran them through modern, layout-aware OCR (Surya) to turn the printed columns back into text. Then a large language model (the open Llama 3.3, with gpt-oss double-checking and Qwen3 repairing the cases it flagged) read each entry and pulled out structured facts: name, country, coordinates, population, and a feature type.

Those types are not free text. Each is drawn from the Getty Art & Architecture Thesaurus (AAT), a published controlled vocabulary in which every term has a stable web address, so a “river” or a “ruined city” carries an identifier that other datasets can point at. That is the idea behind Linked Open Data: shared identifiers instead of isolated labels, so the data can join up with the wider web rather than sitting in a silo. The statistical tables and the engraved plates were read by a vision model, Qwen2.5-VL.

Every place was then matched against the World Historical Gazetteer through its Reconciliation API, so the 1856 entry gets a modern location, and sometimes a boundary outline. All of the AI runs on our own machines at the University of Pittsburgh’s Center for Research Computing and Data: no per-token bills, nothing sent to a third party. The result is around 116,000 places, most of them linked (as-yet wrongly 😳) to a modern location, plus the tables and plates, served as a static website. If you want the gory detail, it is all on GitHub.

What this actually is, and isn’t

Here is the important part, and the reason we are writing it up rather than quietly shipping a demo. This is a scoping exercise, not a blueprint. It probes one possible strand of future WHG work, ingesting authoritative historical print gazetteers as reference data, and it is emphatically not a preview of “the WHG to come”. We built it to learn, on a deliberately awkward, large, genuinely historical source, where our current tools cope and where they don’t. And it threw back some genuinely useful failures.

Where it bumped into our reconciliation gaps

Matching a nineteenth-century place name to a modern gazetteer entry is hard, and we already knew automatic matching would never be perfect: it is an active area of work at WHG, and it improves as we fold more reference data into our indices. This experiment put a few specific gaps into sharp relief.

Same name, wrong place. This was not a surprise so much as a confirmation. Where we cannot identify a suitable containment polygon (a parent region to match a place inside), or do not yet hold one in our indices, name similarity alone is a weak signal, and a confident-looking match is often a same-named place somewhere else entirely, occasionally on the far side of the planet.
We were ignoring the coordinates the book hands us. Many entries print their own latitude and longitude. Once we checked the matches against those, well over half of the coordinate-bearing places had been matched to locations hundreds (sometimes thousands) of kilometres from where the book puts them. So now, where coordinates exist, we trust them: we look for the best name match within a radius of the printed point, and otherwise leave the place located but explicitly unmatched rather than force a bad link.
Stated region versus real coordinates. Cross-checking each entry’s printed coordinates against the region it claims to sit in flagged a lot of disagreement. Some of that is the ordinary drift between 1856 administrative geography and modern boundaries, but only some; the rest is genuine error worth surfacing.

None of these are solved here. They are surfaced here, which for a scoping exercise is exactly the point: each one translates fairly directly into a concrete improvement for reconciliation, such as stronger spatial priors, trusting coordinates when a source provides them, and treating “we know where it is but not what it is” as a proper, visible result.
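The coordinate-first rule described above can be sketched in a few lines of Python. This is an illustration only, not the project's code: the Candidate record, the 50 km radius and the 0.8 name-score cutoff are all assumptions made for the example, name similarity is taken as already computed, and the two sets of coordinates are approximate.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


@dataclass
class Candidate:
    """Hypothetical modern gazetteer record offered as a possible match."""
    name: str
    lat: float
    lon: float
    name_score: float  # name similarity in [0, 1], computed elsewhere


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def best_match_near(printed_lat, printed_lon, candidates,
                    radius_km=50.0, min_name_score=0.8):
    """Best-named candidate inside the radius, or None (located but unmatched).

    radius_km and min_name_score are illustrative values, not WHG settings.
    """
    nearby = [
        c for c in candidates
        if c.name_score >= min_name_score
        and haversine_km(printed_lat, printed_lon, c.lat, c.lon) <= radius_km
    ]
    return max(nearby, key=lambda c: c.name_score, default=None)


# Two same-named places; only one lies near the printed point.
candidates = [
    Candidate("Perth (Western Australia)", -31.95, 115.86, 1.0),
    Candidate("Perth (Scotland)", 56.40, -3.43, 1.0),
]
match = best_match_near(56.39, -3.43, candidates)
print(match.name if match else "located but unmatched")
```

Returning None rather than the nearest miss is the point of the design: an entry with good coordinates and no acceptable name match stays on the map without a forced link.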

Smaller worlds, deeper local knowledge

A single global gazetteer is an extraordinary feat of compilation, but its coverage is inevitably broad and uneven. Some of the most valuable historical place data comes instead from compilers with deep local knowledge. A favourite example, already in the WHG, is John Adams’s Index Villaris of 1680: an alphabetical table of some 24,000 cities, market towns, parishes, villages and private seats in England and Wales, each with its distance from London and a latitude and longitude that Adams worked out by triangulation. (Adams, an English barrister and surveyor, c. 1643–1690, never finished the wider survey it belonged to; there is more on him here.) Its precision and regional focus are exactly the qualities a worldwide gazetteer like Cunningham’s cannot match.

This is where we would welcome help. We are keen to find more sources of that kind: specialist, regionally focused, authoritative print gazetteers that are out of copyright and available as PDFs, especially ones that would fill gaps in WHG’s current coverage. If you know your own corner of the world’s reference shelf, that local expertise is precisely what we are short of.

To get us started, my colleague Palak Vashist has put together a candidate bibliography of exactly this kind of source: public-domain print gazetteers (mostly nineteenth- and early twentieth-century, mostly available as scans on the Internet Archive, HathiTrust, the Library of Congress and the Digital Library of India), chosen for the gaps they could help fill. The list leans deliberately into South Asia (the Bombay, Bengal, Madras, Punjab, United Provinces, Central Provinces, Bihar & Orissa and Assam district series, the Imperial Gazetteer of India, Ceylon, Burma and the North-West Frontier), with a global comparator set spanning the Middle East, Africa, the Americas, Oceania and Eastern Europe. Each entry is graded against a selection rubric and tagged with a suggested next step, so the same pipeline used here can be pointed at any of them with relatively little new work. The full list, with sources and notes, is here; suggestions for additions or corrections are very welcome (please email Palak at PAV82@pitt.edu).

Have a look

The Gazetteer of the World Explorer is here. Search a place, wander the map, open a volume and read Cunningham’s prose with its plates set back in place. It is an early experiment and it shows, so do take it in that spirit: a first attempt to let a 170-year-old book speak to a modern index, with a great deal still to fix. We will have more to say as the reconciliation work it prompted takes shape.

Thanks to Humphrey Southall, whose nudge got this project started!

What’s New in the WHG Index

47 million places, 67 million toponyms, and a phonetic search engine that works across scripts.

By Stephen Gadd, WHG Technical Director

The World Historical Gazetteer helps researchers, educators, and students discover how places connect across time, language, and culture. This post describes the most substantial infrastructure change since the platform launched: a full rebuild of the reconciliation index, a new phonetic search capability, and an automated clustering system that links place records across independent gazetteers.

Infrastructure: University of Pittsburgh CRC

The new system runs on dedicated infrastructure provided by the Center for Research Computing (CRC) at the University of Pittsburgh, replacing the previous single-server deployment. The CRC environment provides the compute and storage capacity needed to index and serve tens of millions of records — including the GPU resources used to train and run phonetic embedding models.

47 Million Places, 67 Million Toponyms

The previous WHG reconciliation index drew on two sources (GeoNames and Wikidata) and contained approximately 13.6 million records. The new index incorporates authority data from six major global gazetteers:

Source	Places	Description
OpenStreetMap	~15 million	Crowdsourced global mapping data
GeoNames	~12 million	The world’s largest open geographical database
Wikidata	~8 million	Community-curated structured knowledge base
Getty TGN	~3 million	The Thesaurus of Geographic Names, with substantial historical depth
Pleiades	~37,000	Gazetteer of the ancient Mediterranean world
Library of Congress	Extensive	Geographic authority records

The total distinct place count is approximately 47 million. More importantly, the index now contains approximately 67 million toponyms — the individual name forms by which those places are or have been known, across languages, scripts, and historical periods. Each toponym is linked to its source places and carries a phonetic embedding (see below), making it possible to search not just by exact string but by sound.

Symphonym: Phonetic Search Across Scripts and Centuries

A persistent difficulty in historical gazetteer work is that the same place may appear under many different names: transliterated into different scripts, adapted to different phonologies, abbreviated, or simply spelled according to conventions that are centuries out of date. Standard text search can match “Florence” but will miss “Firenze”; it can find “Constantinople” but not “Konstantiniyye” or “قسطنطنية”.

Symphonym is a phonetic search system developed for WHG that addresses this problem. Every toponym in the index is converted into a fixed-dimensional phonetic embedding — a vector representation of how the name sounds, derived from Grapheme-to-Phoneme (G2P) conversion and articulatory phonetic feature extraction. Names that sound similar end up close together in embedding space, regardless of script or orthography. A search for “Konstantiniyye” will retrieve “Constantinople” and “قسطنطنية”; “Firenze” will match “Florenz”; “Stamboul” will surface alongside “Istanbul” and “İstanbul”.

This is particularly valuable for work with archaic and historical spellings. Researchers working with early modern catalogues, medieval charters, colonial-era maps, or any primary source material will encounter place names in spellings that no longer appear in modern gazetteers. Symphonym’s phonetic matching can bridge this gap: historical spellings and foreign-language forms such as “Lipsick” or “Venedig” can be matched to their standard forms (“Leipzig”, “Venezia”) on the basis of phonetic proximity. This enables the enrichment, linking, and geolocation of catalogue descriptions and historical documents that would otherwise require extensive manual identification.

Note that phonetic search finds names that sound alike: it does not connect names for the same place that no longer resemble each other in sound (e.g. “Eboracum” and “York”, or “Thessaloniki” and “Solun”). Those connections are established through other signals in the clustering pipeline, such as authority cross-references and spatial co-occurrence.

Automated Clustering Across Gazetteers

When the same physical location is described independently by GeoNames, Wikidata, TGN, and Pleiades — each with their own identifiers and naming conventions — determining which records refer to the same place is a non-trivial problem. The new system includes an automated clustering pipeline that combines multiple signals:

Explicit authority cross-references (e.g. sameAs links between Pleiades and GeoNames)
Exact toponym co-attestation — places in different gazetteers sharing the same name string, filtered by spatial proximity and country-code overlap
Phonetic similarity between toponyms (via Symphonym embeddings), with thresholds calibrated automatically from the authority hard links
Spatial proximity of coordinates
Feature type alignment across classification systems

The pipeline runs in four phases, from high-confidence explicit links through to phonetic similarity matching. Thresholds for the phonetic phase are not set manually but are learned from the authority hard links themselves: the system samples known-same and known-different place pairs, computes their phonetic and spatial signals, and fits a logistic regression to determine optimal similarity and distance cutoffs. In the most recent run, this calibration yielded a cosine similarity threshold of 0.79 and a spatial distance threshold of 5 km — substantially tighter than the initial manual defaults.
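To make the two calibrated cutoffs concrete, here is a minimal Python sketch of how they might be applied to a pair of records. Only the two threshold values come from the text above. The function names are hypothetical, the short vectors stand in for real phonetic embeddings, and requiring both conditions at once is an assumption about how the signals are combined.

```python
from math import sqrt

SIMILARITY_THRESHOLD = 0.79   # cosine similarity cutoff reported above
DISTANCE_THRESHOLD_KM = 5.0   # spatial cutoff reported above


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # no direction to compare
    return dot / (norm_u * norm_v)


def is_phonetic_link(embedding_a, embedding_b, distance_km):
    """True when a pair passes both cutoffs (assumed to be combined with AND)."""
    similar = cosine_similarity(embedding_a, embedding_b) >= SIMILARITY_THRESHOLD
    close = distance_km <= DISTANCE_THRESHOLD_KM
    return similar and close


# Toy two-dimensional vectors standing in for real embeddings.
print(is_phonetic_link([1.0, 0.0], [1.0, 0.0], distance_km=2.0))
print(is_phonetic_link([1.0, 1.0], [1.0, 0.0], distance_km=2.0))
```

The second pair fails because its cosine similarity is about 0.71, below the 0.79 cutoff, even though the two records are only 2 km apart.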

The result is a set of approximately 7 million clusters grouping 19 million of the 47 million place records. Each cluster represents a single real-world location as attested across multiple gazetteers. For users reconciling their own datasets, this means a search can return a single grouped result for a location rather than a confusing set of separate entries from different sources.

Importantly, the clustering algorithm is designed to be adaptive. Users can assert that particular place records do not belong in a given cluster, and these assertions feed back into the system, improving clustering quality over time.

Clustering also unlocks richer contextual information. When a place record from one authority (e.g. GeoNames or TGN) is clustered with a Wikidata record, the system can follow Wikidata’s links to retrieve supplementary data from Wikipedia — descriptions, images, and other reference material — and present it alongside the place. This means that a search result can surface Wikipedia content even when the original matching authority has no such links itself.

Applications

Search retrieves results across scripts, languages, and historical spelling variants
Reconciliation matches uploaded data against a substantially larger and more diverse authority base than before
Data linking connects user places to the broader Linked Open Data ecosystem via clustered authority identifiers
Catalogue enrichment — institutions holding historical documents with place references can use phonetic matching to identify, link, and geolocate those references against modern authority records

Data Architecture

The underlying data model separates places from names (toponyms) and tracks which source attests which name at which point in time. This structure — built on Normalised Place Records, Toponyms, and Attestations — reflects the scholarly reality that place identity is complex and historically contingent.
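As a rough illustration of that separation, the three record types might look something like the Python sketch below. The class and field names are assumptions chosen for readability, not the WHG schema, and the sample data is invented placeholder material rather than real index content.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Place:
    """A place identity, deliberately carrying no name of its own."""
    place_id: str


@dataclass(frozen=True)
class Toponym:
    """One name form, independent of any particular place."""
    toponym_id: str
    name: str
    language: Optional[str] = None


@dataclass(frozen=True)
class Attestation:
    """A source's claim that a place was known by a toponym at some date."""
    place_id: str
    toponym_id: str
    source: str
    year: Optional[int] = None


def names_for_place(place_id, toponyms, attestations):
    """(year, name, source) rows for one place, oldest first, undated last."""
    by_id = {t.toponym_id: t for t in toponyms}
    rows = [
        (a.year, by_id[a.toponym_id].name, a.source)
        for a in attestations
        if a.place_id == place_id
    ]
    return sorted(rows, key=lambda r: (r[0] is None, r[0] or 0))


# Placeholder data only.
place = Place("place:1")
toponyms = [Toponym("t1", "Exampleton"), Toponym("t2", "Exampletoun")]
attestations = [
    Attestation("place:1", "t1", "hypothetical source B", 1856),
    Attestation("place:1", "t2", "hypothetical source A", 1680),
]
for row in names_for_place(place.place_id, toponyms, attestations):
    print(row)
```

Because names live in their own records, one toponym can be attested for several places and one place can carry many names, each traceable to the source that recorded it.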

The current indexing and clustering system, which runs batch computations over Elasticsearch, is the first major step towards a graph-based architecture in which pairwise links between place records are stored as edges and cluster membership is resolved by graph traversal at query time. Under this model, batch clustering becomes unnecessary: clusters can be computed on the fly for any query, and users can adjust confidence thresholds interactively (e.g. “show me only high-confidence links” vs “include tentative matches”).
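A minimal Python sketch of query-time cluster resolution, assuming links are stored as (record, record, confidence) triples. The record identifiers and confidence values are placeholders, and a plain breadth-first search stands in for whatever traversal the eventual graph store would provide.

```python
from collections import defaultdict, deque


def build_graph(links):
    """Undirected adjacency lists from (record_a, record_b, confidence) edges."""
    graph = defaultdict(list)
    for a, b, confidence in links:
        graph[a].append((b, confidence))
        graph[b].append((a, confidence))
    return graph


def cluster_for(record_id, graph, min_confidence=0.0):
    """Records reachable from record_id over edges at or above min_confidence."""
    seen = {record_id}
    queue = deque([record_id])
    while queue:
        current = queue.popleft()
        for neighbour, confidence in graph.get(current, ()):
            if confidence >= min_confidence and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


# Placeholder identifiers and confidences, for illustration only.
links = [
    ("geonames:A", "wikidata:B", 0.95),
    ("wikidata:B", "tgn:C", 0.60),
    ("pleiades:D", "tgn:E", 0.90),
]
graph = build_graph(links)
print(sorted(cluster_for("geonames:A", graph, min_confidence=0.9)))
print(sorted(cluster_for("geonames:A", graph, min_confidence=0.5)))
```

Raising min_confidence shrinks the cluster to its well-attested core, and lowering it pulls in tentative matches, without any stored cluster needing to be recomputed.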

The graph architecture also enables a fundamental shift in how contributions work. Rather than uploading datasets of place records and reconciling them after the fact, the predominant form of contribution becomes the attestation: a name, date, source reference, or classification attached to an existing place in the index. Contributors find the place and attach their evidence to it; new place identities are minted only when no existing record matches. This attestation-centric model better reflects scholarly practice — researchers typically have evidence about known places, not inventories of new ones — and the dense authority backbone of 47 million indexed places makes it practical. (See the design discussion for further detail.)
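The find-then-attach flow could be sketched roughly as follows in Python. Everything here is hypothetical: the index is a plain dictionary, the matcher is a naive case-insensitive name lookup standing in for real reconciliation, and the identifier format for newly minted places is invented for the example.

```python
from uuid import uuid4


def exact_name_matcher(name, index):
    """Naive stand-in for reconciliation: case-insensitive exact name lookup."""
    wanted = name.casefold()
    for place_id, attestations in index.items():
        if any(a["name"].casefold() == wanted for a in attestations):
            return place_id
    return None


def attach_or_mint(index, name, evidence, matcher=exact_name_matcher):
    """Attach evidence to a matching place, minting a new place only if none matches.

    index maps place_id -> list of attestation dicts.
    Returns (place_id, minted).
    """
    place_id = matcher(name, index)
    minted = place_id is None
    if minted:
        place_id = f"new:{uuid4().hex}"  # hypothetical identifier scheme
        index[place_id] = []
    index[place_id].append({"name": name, **evidence})
    return place_id, minted


# Placeholder index with one existing place.
index = {"place:1": [{"name": "Exampleton", "source": "hypothetical source A"}]}
print(attach_or_mint(index, "exampleton", {"source": "hypothetical source B", "year": 1856}))
```

The common case adds one more attestation to a place the index already knows, and a new identity is created only as the fallback.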

The new indexing and clustering system will be rolled out progressively. Updates will be posted at whgazetteer.org and on the documentation site.