NLP on 800-Year-Old Chinese Stories

The corpus

I didn't expect to spend a semester reading ghost stories from the Song Dynasty, but here we are.

The project started as a digital-humanities research position with one job: build a searchable database for 2,700+ literary works from the 10th through 13th centuries (poems, essays, fiction, administrative records). Most of it has never been cataloged in a way that lets you search across the collection at all. The goal was to fix that, and to make the corpus reachable for researchers who don't read classical Chinese.

Straightforward enough — until you point a modern NLP tool at the text.

Why classical Chinese breaks everything

Classical Chinese is not modern Mandarin with older vocabulary. It's structurally different in ways that make standard pipelines useless. No punctuation in the original. No spaces between words. Grammar that dropped out of spoken Chinese centuries ago. A single character can be a noun, a verb, or a particle, and telling which one often takes real historical knowledge of the context.

I ran a few off-the-shelf Chinese NER models on sample passages. The results were bad enough to be funny. A model trained on People's Daily has no idea what to do with a sentence where one character means "to govern" in one clause and is somebody's surname in the next — and where the place name is a region that hasn't existed since the Southern Song.

Building the pipeline

So we went hybrid. Rule-based heuristics take the predictable cases: reign-era dates follow fixed formats, official titles come from a known set, certain character sequences reliably mark a geographic reference. Fast and precise where they apply.
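To make the rule layer concrete, here is a minimal sketch of what such heuristics could look like. It is not the project's actual rule set: the era names, titles, and the two-characters-plus-suffix place pattern are small illustrative assumptions, and a real version would load much larger lists from gazetteers.

```python
import re

# Illustrative lists only (assumptions for this sketch, not the project's data).
ERA_NAMES = ["紹興", "淳熙", "慶元"]        # a few Southern Song reign eras
OFFICIAL_TITLES = ["知州", "通判", "縣令"]  # a few official titles
NUMERALS = "元一二三四五六七八九十"          # 元 covers "元年", the first year

# Reign-era date: era name + one to three numerals + 年 (year).
ERA_DATE = re.compile("(?:%s)[%s]{1,3}年" % ("|".join(ERA_NAMES), NUMERALS))
# Official title: any entry from the known set.
TITLE = re.compile("|".join(OFFICIAL_TITLES))
# Crude place rule: two characters followed by an administrative suffix.
PLACE = re.compile("[\u4e00-\u9fff]{2}[州縣府]")

def rule_tag(text):
    """Return sorted (start, end, label, surface) spans found by the rules."""
    spans = []
    for label, pattern in (("DATE", ERA_DATE), ("TITLE", TITLE), ("PLACE", PLACE)):
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), label, m.group()))
    return sorted(spans)

if __name__ == "__main__":
    for span in rule_tag("紹興三年臨安府通判"):
        print(span)
```

On the sample string this tags 紹興三年 as a date, 臨安府 as a place, and 通判 as a title. The place rule is deliberately naive and will misfire on many real sentences, which is exactly the kind of case that gets handed to a model instead.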

For the hard cases — ambiguous names, literary allusions, places referenced obliquely — we fine-tuned models on a hand-annotated subset. The training data was expensive (your annotators have to actually read classical Chinese), but the result generalized across genres and reigns better than I'd expected.
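One way the two layers could be combined is sketched below. The precedence policy (rule spans always win, model spans only fill the gaps) is my assumption for illustration, and the function name and the (start, end, label) span format are hypothetical.

```python
def merge_spans(rule_spans, model_spans):
    """Keep every rule span; add model spans that overlap none of them.

    Spans are (start, end, label) tuples with an exclusive end offset.
    """
    merged = list(rule_spans)
    for start, end, label in model_spans:
        overlaps = any(
            start < r_end and r_start < end
            for r_start, r_end, _ in rule_spans
        )
        if not overlaps:
            merged.append((start, end, label))
    return sorted(merged)

if __name__ == "__main__":
    rules = [(0, 4, "DATE")]
    model = [(2, 5, "PERSON"), (6, 8, "PERSON")]
    print(merge_spans(rules, model))
```

Here the model's first guess collides with a rule-matched date and is dropped, while its second guess is kept. Letting the high-precision layer win conflicts is a simple policy; other designs would weigh the two by confidence instead.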

The piece that paid off most was automated cross-referencing against existing historical databases. Every extracted entity gets a confidence score based on how many independent sources confirm it: a person who also appears in the official dynastic histories scores high; a reference that could be three different people gets flagged for review. That's where the ~70% cut in manual cataloging time came from: not from removing human judgment, but from aiming it at the cases that actually need it.
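A rough sketch of that scoring idea follows. The formula (share of sources that agree on the best-supported record), the 0.5 review threshold, and the source names are all assumptions made for illustration, not the scoring the project actually used.

```python
from collections import Counter

def assess(matches, threshold=0.5):
    """Score one extracted entity against several reference sources.

    `matches` maps a source name to the record ids that source returned
    for the entity. Returns (best_id, confidence, needs_review).
    """
    votes = Counter()
    for ids in matches.values():
        # Count each record at most once per source.
        for record_id in dict.fromkeys(ids):
            votes[record_id] += 1
    if not votes:
        return None, 0.0, True  # no source knows this entity
    best_id, best_votes = votes.most_common(1)[0]
    confidence = best_votes / len(matches)
    ambiguous = len(votes) > 1  # could be more than one person or place
    return best_id, confidence, ambiguous or confidence < threshold

if __name__ == "__main__":
    confirmed = {"dynastic_history": ["p1"], "gazetteer": ["p1"], "epitaphs": []}
    unclear = {"dynastic_history": ["p1", "p2", "p3"], "gazetteer": [], "epitaphs": []}
    print(assess(confirmed))
    print(assess(unclear))
```

The first entity is confirmed by two of three sources and passes through automatically; the second could be any of three records, so it is routed to a human regardless of its score.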

The ghost stories

The most interesting finding wasn't technical. It was about the corpus itself.

Ghost stories and supernatural tales — the stuff literary scholars long filed under minor entertainment — hold some of the richest geographic and social data in the whole collection. The authors were meticulous about grounding their hauntings in real places with real social dynamics. A tale about a haunted bridge will hand you the exact location, the names of nearby officials, the local economy, and the social class of everyone involved before it ever gets to the ghost.

That makes them quietly invaluable to historians. The "serious" literature — poetry, philosophical essays — tends toward the abstract and allusive. The ghost stories are concrete and specific, precisely because their settings had to feel real for the supernatural to land.

I went in thinking I'd build an NLP pipeline. I came out with a much deeper appreciation for how the tools you build shape the questions you can ask — and for how the most interesting answers sometimes hide in the material no one thought to look at carefully.