A development log — pipeline, tools, experiments, and process
This document records, in some detail, how the Hirsch Argument Atlas was built: the data model, the extraction pipeline, the calibration loop, the experiments that moved the needle, the experiments that didn’t, the cross-book consolidation machinery, and the final stack. The intended reader is someone already deep in the argument-mapping space — comfortable with Toulmin structure, with the Discourse Graph tradition, with the practical difficulties of getting an LLM to emit a dependency edge — who wants enough detail to either replicate the approach or argue with it.
The final corpus contains 10,232 claims across 101 chapters from 10 books (1977–2024), with 656 cross-book argument clusters classified by how the argument evolved over time, a 119-page static reading layer, and a calibration pipeline tuned against human ground truth to a weighted recall score of 0.948. The interesting thing is not the numbers but the route taken to get there.
The project began as a deliberate inversion. A larger knowledge-graph project I had been working on — call it the big ingestion system — had been optimised for scale: tens of thousands of claims across thousands of sources, a hierarchical tree with include/exclude descriptions at every node, typed argument links. It worked in the sense that the counts kept going up. But the output felt flat. Every piece of research I kept reading on personal knowledge systems said essentially the same thing: selectivity beats exhaustiveness, claims alone aren’t enough (you need argument structure), and usefulness comes from frontiers, not from archives. I kept agreeing and continuing to ingest.
The Hirsch project tests the opposite hypothesis. Instead of shallow extraction across thousands of texts: deep extraction from a tightly bounded corpus. One thinker. Ten books. Every argument, every warrant, every dependency, every counter-argument that can be found or sourced. Nobody, as far as I can tell, has done this for a major public intellectual — certainly not with cross-book evolution tracking. Hirsch is a good candidate because the argument is coherent, the corpus is finite, the topic has genuine public interest, the books are citation-rich, and — as it turns out — he has been making more or less the same core argument with impressive stubbornness since 1977, which makes him ideal for studying how a single thinker’s argument evolves over decades.
Before any code, a manual extraction was done on the thirteen-page Prologue of Why Knowledge Matters (2016), followed by three independent critiques — an education scholar, an argument-mapper, and a UX designer. The argument-mapper’s verdict was blunt and set the direction for the rest of the project:
“Captures 60% of the content. 0% of the structure.”
The manual extraction had recorded Hirsch’s claims reasonably well but had recorded almost nothing about how they connected — which claim depended on which, where the warrants were, where the inference gaps lay. The content was a list; the argument was somewhere else. The fix had to be architectural: separate content extraction from structure extraction, and give the data model explicit slots for structural objects, not just propositions.
Ten node types ended up in the schema, most of them load-bearing:
| Type | Role |
|---|---|
| claim | Atomic assertible proposition. Extracted aggressively. |
| evidence | Specific study, dataset, example, observation. Carries a citation_status field tracking whether it is cited, uncited-but-verifiable, or common knowledge. |
| warrant | The principle connecting evidence to claim — the Toulmin W. Usually implicit in the source text and must be generated. |
| objection | A challenge to a claim, a warrant, or an evidence-claim link. |
| response | A dialectical reply to an objection. |
| framework | A composite: a coherent package of related claims (e.g. “The Three Tyrannical Ideas” contains naturalism + individualism + skill-centrism). Can be attacked as a unit or per-component. |
| concept | A contested or defined term with the author’s definition and alternatives (e.g. “developmental appropriateness”). |
| case | A specific country, school, or implementation with narrative and extracted claims (e.g. “France 1975–1985”). |
| thinker | An intellectual position the author engages with — not just a name in a bibliography. |
| reference | A parsed, enriched citation — endnote text → structured fields → Crossref metadata. |
Eight edge types: supports, opposes,
undermines (attacks the warrant, not the claim itself),
depends-on, objects-to,
responds-to, refines,
instantiates.
Extract aggressively. Defer importance to post-processing. Every assertion the author makes is a claim. Importance is computed from graph structure — in-degree, evidence count, cross-chapter recurrence, main-conclusion participation — and never judged at extraction time. This is not obvious and it matters a great deal. During early calibration, the pipeline was being too selective because the prompt was asking for “important” claims, which caused it to drop factual premises (“reading scores declined 1960–1980”) and concessions (“testing did improve mechanics”) that turned out to be load-bearing for arguments several chapters away. You cannot recognise a load-bearing premise by looking at it alone — you can only recognise it by computing what eventually depends on it, which is a graph property and has to come later.
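As a sketch of what "importance is a graph property" means in practice (field names and the additive scoring are illustrative, not the repo's actual code):

```python
from collections import defaultdict

def importance_scores(claims, edges, evidence_counts, chapter_index):
    """Toy importance scorer: in-degree on support/dependency edges,
    plus evidence count, plus cross-chapter recurrence. Computed after
    extraction, never at extraction time."""
    in_degree = defaultdict(int)
    for src, dst, kind in edges:              # e.g. ("c12", "c3", "supports")
        if kind in ("supports", "depends-on"):
            in_degree[dst] += 1               # dst is the claim being leaned on
    scores = {}
    for c in claims:
        recurrence = max(0, len(chapter_index.get(c, set())) - 1)
        scores[c] = in_degree[c] + evidence_counts.get(c, 0) + recurrence
    return scores

# A "boring" factual premise scores high once the graph is known:
scores = importance_scores(
    claims=["c1", "c2", "c3"],
    edges=[("c2", "c1", "supports"), ("c3", "c1", "depends-on")],
    evidence_counts={"c1": 1},
    chapter_index={"c1": {1, 3}},
)
```

The point of the sketch is the shape of the computation: nothing here can be evaluated while looking at a single claim in isolation.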
Cases hold narratives alongside claims. The France case, the NCLB case, the Finland case — these contain long descriptive passages that are not literally claims but are essential context. The instinct to shred them into atomic propositions loses the narrative that makes the argument memorable. Cases in the final schema carry both a preserved human-readable story and the claims extracted from the interpretive parts of that story. This decision is visible in the reading layer, where cases are narrative entry points rather than claim lists.
The calibration infrastructure is the part of the project I’d most urge other argument-extraction projects to build before anything else, and it’s the part most often skipped.
The basic problem: how do you know whether extraction is any good? The naive answer — read a chapter, look at the output, mutter “seems about right” — gives you the illusion of feedback without the substance of it, and makes principled decisions about prompts, models, or pipeline structure impossible. You need a numerical score that moves when you change the pipeline, and you need a ground-truth corpus to score against. Three pieces ended up in the loop: an HTML review interface, a ground-truth file, and an automated eval harness.
A self-contained calibration page, deliberately low-tech. The design:
left panel shows the original chapter text; right panel shows the
extracted claims, evidence, warrants, and dependencies; each item has
rating sliders for completeness and accuracy; navigation is
keyboard-driven; all state persists to localStorage so a
review session survives browser crashes and can be resumed the next day.
A text-selection popup attaches notes to source passages (“this should
be a claim and isn’t”). At the end, the session exports as JSON to the
clipboard.
The Hirsch-specific instances live in the repo: calibrate.html
(first version), then calibrate_ch1.html,
calibrate_ch1_v2.html,
and calibrate_ch7.html
for subsequent chapters. Opening one of these gives a direct view of
what a real review session looked like.
The generator has since become a general-purpose Claude Code skill
(calibration-eval) that builds self-contained eval pages
for four modes: rating, A/B comparison, threshold tuning, and extraction
recall. Keyboard-driven, localStorage-persisted, clipboard
export, gold-question self-consistency checks. The shared harness lives
at ~/.claude/skills/shared/. The format has generalised
well enough that I now use variations of it for unrelated projects.
During manual review of the Prologue and Chapter 1, I ended up
flagging the claims the pipeline should have caught and didn’t.
Twenty-four items in total. They live in ground_truth.json,
each with an ID (GT-P1, GT-P2, …), the target
text, a type, and a note about what the pipeline got wrong.
Twenty-four is a tiny number and I was nervous about it. In practice it was more than enough. The marginal value of ground-truth item #25 turned out to be small compared to the marginal value of running the existing 24 through another experiment. My working hypothesis now is that something like 15–30 carefully curated ground-truth items per pipeline is enough to drive serious experimentation, provided the items are well-chosen and cover the actual failure modes you care about. The items matter; the count mostly doesn’t.
eval_extraction.py
runs the extraction pipeline, takes the output, and asks an LLM judge to
match each ground-truth item against the extracted claims with
partial-credit scoring: exact match = 1.0, partial = 0.5, miss = 0.0.
The judge prompt is deliberately permissive about wording —
“substantially equivalent to GT-P3?” — because the question is whether
the claim got extracted, not whether the wording is identical.
The result is a single weighted recall score per run.
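The scoring arithmetic itself is small. A minimal sketch, assuming the exact/partial/miss verdicts arrive from a separate LLM-judge call:

```python
def weighted_recall(judgments, weights=None):
    """Partial-credit recall over ground-truth items.
    judgments: {gt_id: "exact" | "partial" | "miss"} from the LLM judge.
    weights:   optional per-item weights; uniform when omitted."""
    credit = {"exact": 1.0, "partial": 0.5, "miss": 0.0}
    if weights is None:
        weights = {gt_id: 1.0 for gt_id in judgments}
    total = sum(weights.values())
    return sum(credit[v] * weights[g] for g, v in judgments.items()) / total

run = {"GT-P1": "exact", "GT-P2": "partial", "GT-P3": "miss", "GT-P4": "exact"}
score = weighted_recall(run)   # (1 + 0.5 + 0 + 1) / 4 = 0.625
```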
The critical practical fact is that LLM-judge scoring has variance. Running the same code on the same input produced scores in the 0.92–0.98 range across repeat runs, and the variance floor on this harness is about 0.06 weighted-score points. Any experiment whose effect size is smaller than 0.06 is indistinguishable from noise, and every experiment had to be run two or three times before it could be trusted. The corresponding discipline: if your change does not move the score by roughly 0.08, you are probably imagining the improvement. This killed several plausible-looking experiments.
The calibration-eval skill integrates with limbic.amygdala.calibrate
— the calibration-metrics module inside the shared embedding/search
library — for Cohen’s kappa, validation of LLM judges against human
ground truth, and inter-rater reliability. The goal over time is that
anything built from the calibration skill can be fed into a common
limbic.amygdala.calibrate report without glue code. This
project is one of the validation cases for that interface.
Once the loop was in place, experiments became tractable. Eight experiments ran across four parallel agents, each modifying exactly one variable, each scored against the same ground truth, each repeated two or three times because of the variance floor.
| Experiment | Weighted score | Verdict |
|---|---|---|
| Baseline (single-pass, 15 rules) | 0.854 | — |
| + 4 explicit extraction rules | 0.813 | Reverted — confused the model |
| Raised claim floor 15 → 25 | 0.875 | Kept |
| + Concrete mistake examples in prompt | 0.875 | Kept |
| + Completeness sweep (pass 2) | 0.896 | Kept |
| Sharpened sweep prompt | 0.875 | Reverted |
| Sliding window (6K chars, 1K overlap) | 0.979 | Kept |
| Few-shot examples from ground truth | 0.938 | Moderate effect |
| Targeted detection questions | 0.854 | No improvement |
| Compound-claim splitting (pass 3) | 0.938 | Reverted |
The sliding-window result is worth dwelling on, because it was surprising. The change was minimal: instead of feeding an entire 28K-character chapter to the LLM in one call, split the chapter into 6,000-character windows with 1,000 characters of overlap, prefer paragraph boundaries, extract from each window, and then deduplicate the combined results using a 60% word-overlap threshold. That single structural change moved the weighted score from 0.854 to 0.979 in isolation, and produced ~80–150 claims per chapter where whole-chapter extraction had been giving ~20–30.
The hypothesis is simple attention degradation: at 28K+ characters, LLMs drop claims from the middle of long inputs regardless of prompting. Windowing eliminates this. It is not a clever trick; it is structural, general, and it works for any extraction task operating on book-length inputs. If one thing from this project generalises to other argument-extraction pipelines, it’s this.
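A minimal sketch of the two pieces: the window splitter and the word-overlap dedup. The parameters are the pipeline's; the Jaccard-style overlap measure is an assumption (the repo may compute word overlap differently):

```python
def sliding_windows(text, size=6000, overlap=1000):
    """Split text into overlapping windows, snapping each cut to the
    last paragraph break inside the window when one exists."""
    windows, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        if end < len(text):
            brk = text.rfind("\n\n", start, end)
            if brk > start + size // 2:   # only snap if the break isn't too early
                end = brk
        windows.append(text[start:end])
        if end == len(text):
            break
        start = max(end - overlap, start + 1)   # overlap, but always progress
    return windows

def dedup_claims(claims, threshold=0.6):
    """Drop claims whose word overlap with an already-kept claim exceeds
    the threshold (Jaccard on word sets — an assumption)."""
    kept = []
    for c in claims:
        words = set(c.lower().split())
        dup = any(
            len(words & set(k.lower().split()))
            / max(1, len(words | set(k.lower().split()))) > threshold
            for k in kept
        )
        if not dup:
            kept.append(c)
    return kept
```

The dedup step matters because overlapping windows guarantee that claims near window boundaries are extracted twice.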
Across the experiment table, the deltas cluster in a revealing way. Prompt-level experiments — more rules, sharper language, targeted detection questions, few-shot examples, compound splitting — produced deltas in the noise range (−0.04 to +0.04). The structural experiments — windowing the input, windowing Phase 2 the same way, batching structure extraction with relevant text sections — produced deltas of +0.07 to +0.10. If the pipeline’s architecture is wrong, no prompt will rescue it; if the architecture is right, the prompt is nearly a rounding error. This is not a general claim about all extraction tasks, but it held remarkably consistently across every experiment run on this one.
Phase 2 (structure extraction) had been silently slicing its input
with [:12000] on chapters that could run to 58,000
characters. The slice was a leftover from when Phase 2 was only being
tested on the 13K-character Prologue, where it was a no-op. When Chapter
7 (58K characters, the “France” chapter, where Hirsch makes his most
structurally complex argument) finally came through the pipeline, Phase
2 returned almost no dependencies and almost no warrants. For about a
day the hypothesis was that structural extraction was fundamentally
harder than content extraction on long chapters. The real cause was the
truncation. Fixing it and windowing Phase 2 in the same way as Phase 1
produced a 7× increase in structural output: 8 → 55
dependencies, 4 → 22 warrants on Chapter 7.
The general lesson: any fixed numeric slice in a pipeline is a latent bug until proven otherwise. End-to-end assertions over input sizes would have caught this immediately; I had not written any.
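The missing guard is a few lines. A sketch of the kind of end-to-end assertion that would have caught the truncation immediately:

```python
def assert_full_coverage(text, windows, overlap=1000):
    """Every character of the input must land inside some window.
    A silent [:12000] slice upstream fails this on the first long chapter."""
    covered = 0
    for w in windows:
        idx = text.find(w, max(0, covered - overlap))
        assert idx != -1, "window not found in source text"
        covered = max(covered, idx + len(w))
    assert covered == len(text), f"saw only {covered}/{len(text)} chars"
```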
Chapter 7 was also the first unseen chapter — all calibration had been done on the Prologue and Chapter 1, and Chapter 7 was a test of whether the pipeline generalised beyond the calibration set. The extraction produced 145 claims, 43 evidence items, 30 concepts, 13 cases, and 18 thinkers. Human review of the full output found exactly two missed claims, both depth-of-interpretation issues rather than recall failures. This was treated as validation that the pipeline was ready to run at corpus scale.
I tested Gemini 2.5 Flash against Sonnet on Phase 2 (structure). Sonnet produced noticeably richer structural reasoning — better warrants, sharper vulnerabilities, more incisive objection framing — but would not reliably follow the JSON schema. Flash followed the schema cleanly. For a pipeline that needs to run 101 chapters without manual intervention on JSON repair, schema compliance is a more important production property than reasoning depth. Flash plus a structural second pass beats Sonnet plus manual cleanup, at this scale. This may not hold in all settings, but it held consistently here.
PDF / ePub / OCR intake
↓
chapter segmentation
↓
sliding window (6K / 1K overlap, paragraph boundaries)
↓
Phase 1a — content extraction (per window, Gemini Flash)
↓
Phase 1b — dedup (60% word-overlap threshold)
↓
Phase 1c — completeness sweep (second pass over the full text)
↓
Phase 2 — structure extraction (batches of ~20 claims
with relevant text section, Gemini Flash)
↓
Phase 3 — self-critique pass (Gemini Flash)
↓
per-chapter JSON
↓
Human calibration (HTML review interface)
↓
eval_extraction.py against ground_truth.json
↓
Backfill passes (endnote matching, warrant generation,
evidence enrichment)
Key parameters, for reference: 6,000-character windows with 1,000 characters of overlap (paragraph boundaries preferred); a 60% word-overlap dedup threshold; structure extraction in batches of ~20 claims with their relevant text section; Gemini Flash for every LLM phase.
The pipeline writes structured logs at every phase so individual steps can be re-run without redoing upstream work. Each phase’s output is a standalone JSON file that the next phase consumes, which made experimentation fast and made it possible to replay any single phase with modifications.
Ten books meant three different intake paths. Digital PDFs with clean
metadata went through a simple PDF parser with chapter detection by TOC
matching. Books that only existed as ePub went through
ebooklib + BeautifulSoup. A few scanned PDFs had no
extractable TOC and needed one typed by hand. And The Ratchet
Effect (2024), which only existed for me as a Kindle book, was
photographed page by page — 103 screenshots — and OCR’d through Gemini
Flash’s vision capability, which turned out to be remarkably reliable
for this use. All three pathways feed a common pre-parsed JSON intake
format so that downstream phases never need to know how the book
arrived.
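The common intake record can be pictured as a small typed structure. This is a hypothetical sketch; the field names are illustrative, not the repo's actual schema:

```python
from typing import TypedDict

class ChapterIntake(TypedDict):
    """Illustrative shape of the pre-parsed JSON record that all three
    pathways (PDF, ePub, OCR) produce. Field names are assumptions."""
    book_id: str
    chapter_number: int
    title: str
    text: str             # full chapter text, whatever the source
    intake_method: str    # "pdf" | "epub" | "ocr"

chapter = ChapterIntake(
    book_id="why-knowledge-matters",
    chapter_number=7,
    title="France",
    text="...",
    intake_method="pdf",
)
```

Downstream phases consume only this shape, which is what lets them stay ignorant of how the book arrived.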
After per-chapter extraction, three targeted passes run over the full book: endnote matching, warrant generation, and evidence enrichment.
These passes are conservative by design: they only fill gaps against structural invariants (“every main conclusion should have a warrant chain to its evidence”); they do not invent new structure.
Once all ten books were extracted in isolation, the consolidation pipeline (consolidate.py) builds the cross-corpus layer in five phases: entity dedup, claim clustering, evolution classification, canonical argument merging, and importance scoring.
Thinkers, concepts, and cases are deduplicated using embedding similarity plus an LLM verification pass. Results: 146 multi-book thinkers, 206 multi-book concepts, and 57 multi-book cases.
The multi-book counts are the useful signal — they identify the recurring elements of the author’s intellectual world.
This is where the scale became nontrivial. Naive pairwise cosine similarity over 10,232 claims is ~52 million comparisons, which is slow on a laptop. An inverted-index optimisation — shingle the claim text, only compare claims that share a shingle — reduced the candidate-pair count from 52M to 4.3M, and the clustering step now runs in about 5.3 seconds.
I mention this explicitly because I nearly reached for a vector
database, which would have been enormous overkill for a 10K-item corpus.
At this scale, brute-force numpy with a shingling optimisation beats any
infrastructure you might be tempted to stand up.
limbic.amygdala’s VectorIndex uses the same
principle — numpy brute force is actually faster than ANN
indices below roughly 100K vectors, and much easier to reason about. The
crossover point where ANN starts to win is higher than most people
expect.
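The shingle trick is about a dozen lines. A sketch, with a 3-word shingle size that is an assumption (the repo does not state its shingle length):

```python
from collections import defaultdict
from itertools import combinations

def shingles(text, k=3):
    """k-word shingles of a claim's text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def candidate_pairs(claims, k=3):
    """Inverted index over shingles: only pairs of claims sharing at
    least one shingle survive as candidates for the expensive cosine
    comparison, which is how ~52M pairs shrink to a few million."""
    index = defaultdict(list)
    for i, claim in enumerate(claims):
        for s in shingles(claim, k):
            index[s].append(i)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(ids, 2))
    return pairs
```

Claims with no lexical overlap at all never meet, which is exactly the behaviour you want before a semantic-similarity pass.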
For each cross-book cluster, Gemini Flash classifies how the argument evolved across the books it appears in, assigning one of six categories. Out of 656 cross-book argument clusters:
| Category | Count |
|---|---|
| Repeated | 250 |
| Refined | 238 |
| Evolved | 74 |
| Broadened | 56 |
| Narrowed | 21 |
| New evidence | 17 |
I had expected this to be the noisiest step in the pipeline — LLMs classifying intellectual evolution seemed like exactly the kind of task where hallucination and smoothing would dominate. It turned out to be one of the most useful. The classifications are consistent across re-runs, the distribution is interpretable, and the output makes cross-book patterns visible that a human reader would not notice from reading the books sequentially.
There is a serious question buried in all this: is the pipeline measuring the author, or the LLM’s judgment of the author? The entity dedup and the evolution classification are both LLM-driven, and the LLM is doing some amount of smoothing. The honest claim is that the pipeline makes visible patterns that are consistent across re-runs and that match what a careful reader would notice given enough time — but the pipeline is not a neutral instrument. For any claim about what “Hirsch does over time”, the appropriate discount has to be applied.
With those caveats in place, the cross-book view surfaces things I would not have seen from reading the books one at a time:
A composite score, with no LLM involvement. Components: book recurrence (35%), dependency centrality (20%), evidence density (15%), counter-arguments (10%), plus an additive main-conclusion bonus (30%). The top 15 claims by this score are surfaced on the landing page. Because it’s purely graph-structural, it’s reproducible and explainable in a way that an LLM-judged importance score would not be.
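A sketch of the scoring arithmetic. The weights are the ones stated above; the per-component normalisations (the caps) are assumptions, since the text does not give them:

```python
def composite_importance(claim):
    """Purely graph-structural importance: no LLM involvement, so the
    score is reproducible and explainable. Caps are illustrative."""
    score = (
        0.35 * min(claim["book_count"] / 10, 1.0)       # book recurrence
        + 0.20 * min(claim["in_degree"] / 20, 1.0)      # dependency centrality
        + 0.15 * min(claim["evidence_count"] / 5, 1.0)  # evidence density
        + 0.10 * min(claim["counter_count"] / 3, 1.0)   # counter-arguments
    )
    if claim["is_main_conclusion"]:
        score += 0.30                                   # additive bonus
    return score
```

Because the bonus is additive rather than a weight, a maximally connected main conclusion can score above 1.0, which is harmless for ranking.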
The failures are often more useful than the successes, so the partial list, for the record:
The [:12000] Phase 2 truncation. A latent bug, cost 7× in structural density on long chapters, described in §4.3.

The public reading layer (build_corpus_site.py,
live at hirsch-atlas.pages.dev) is a
119-page static site generated from a single
corpus_consolidated.json file. Page types:
A few design choices are strict:
Embedding, clustering, and LLM-judge infrastructure comes from limbic, a small data-curation library extracted from patterns recurring across several projects. The sub-packages:

- limbic.amygdala — embedding, vector and hybrid search, novelty detection, clustering, semantic whitening, calibration metrics. Provides the VectorIndex used for claim similarity in this project, the calibrate module for kappa and judge validation, and the hybrid vector+FTS5 search primitives.
- limbic.hippocampus — proposals, cascade merges, deduplication pipelines, YAML-backed persistence. Not used directly by Hirsch but shares conventions.
- limbic.cerebellum — LLM-judge orchestration, budget tracking, batch audit pipelines. Used for the evolution classification pass in this project.

The sliding-window splitter and the word-overlap dedup functions currently live inside the Hirsch repo, but they are general-purpose enough that they will be lifted into limbic.cerebellum as sliding_window_extract() and word_overlap_dedup() for reuse.
The full per-book process, end to end:
1. Per-chapter extraction: phase1_content.json, phase2_structure.json, phase3_critique.json, merged into chapter.json.
2. Consolidation: consolidate.py across all books extracted so far. Entity dedup, claim clustering, evolution classification, canonical argument merging, importance scoring.
3. Site build: build_corpus_site.py regenerates the 119-page static site from corpus_consolidated.json.

Every step is replayable in isolation. The pipeline is designed so that re-running a single phase never requires redoing upstream work, which is critical for experimentation velocity.
The list of things from this project that I think will generalise to other argument-extraction pipelines:
Build the eval harness before the pipeline. Twenty to thirty carefully-curated ground-truth items from two or three representative source-sections is enough to drive every subsequent pipeline decision. Without a numerical score, prompt tuning is theatre.
Window your inputs. Window them aggressively. Sliding-window extraction (6K chars, 1K overlap, paragraph boundaries) is the single largest lever for recall on book-length text. The hypothesis — attention degradation on long inputs — is well-supported by the literature, and the intervention is structural rather than prompt-based, which is why it works so reliably. Window both the content phase and the structure phase.
Separate content from structure. Single-pass “extract the argument” prompts always produce the 60%-content / 0%-structure failure mode. Two phases, with the content output fed as input to the structure phase, is the minimum viable architecture.
Schema compliance over reasoning depth at production scale. Cheap schema-compliant models beat expensive reasoning models that need manual JSON repair, once you are running hundreds of chapters. This is a production observation, not a capability claim.
Extract aggressively, compute importance later. Importance is a graph property and cannot be judged at extraction time. Dropping a claim because it “looks obvious” drops the premises that turn out to be load-bearing two chapters away.
Preserve narratives next to claims. Cases are not reducible to atomic propositions without loss. The narrative is what readers remember, and it is what contestability has to hook into.
Build cross-corpus canonical arguments. The cross-book view is what makes deep-extraction-from-a-bounded-corpus worth doing, and it only becomes possible once per-source extraction is reliable. At scale below ~100K items, numpy brute force with a shingling optimisation beats any vector database you might be tempted to install.
Treat counter-argument sourcing as a first-class pipeline phase, not an afterthought. LLM-synthesised counter-arguments are fluent, plausible, and subtly wrong in a way that damages the project’s epistemic standing. Named critics, attributed quotes, real provenance.
Structural changes beat prompt tuning. Not in every project, but consistently in this one. When the pipeline is underperforming, the first question should be “is the architecture right?”, not “can I rewrite the prompt?”
Latent numeric slices are bugs. Any hard-coded text length in the pipeline is a latent bug waiting to be discovered on a larger input. End-to-end assertions over input sizes catch this kind of problem immediately.
Lift the sliding-window splitter and word-overlap dedup into limbic.cerebellum. General-purpose, validated on a real corpus, overdue.

| | Count |
|---|---|
| Books extracted | 10 |
| Year range | 1977–2024 |
| Chapters | 101 |
| Claims | 10,232 |
| Cross-book argument clusters | 656 |
| Multi-book thinkers | 146 |
| Multi-book concepts | 206 |
| Multi-book cases | 57 |
| Static pages in reading layer | 119 |
| Calibration ground-truth items | 24 |
| Final eval score (weighted recall) | 0.948 (avg), 0.979 (best) |
| Eval harness variance floor | ~0.06 |
Live site: hirsch-atlas.pages.dev · Code: houshuang/hirsch-atlas · Library: houshuang/limbic