The Hirsch Argument Atlas

A development log — pipeline, tools, experiments, and process

Stian Håklev · April 2026

This document records, in some detail, how the Hirsch Argument Atlas was built: the data model, the extraction pipeline, the calibration loop, the experiments that moved the needle, the experiments that didn’t, the cross-book consolidation machinery, and the final stack. The intended reader is someone already deep in the argument-mapping space — comfortable with Toulmin structure, with the Discourse Graph tradition, with the practical difficulties of getting an LLM to emit a dependency edge — who wants enough detail to either replicate the approach or argue with it.

The final corpus contains 10,232 claims across 101 chapters from 10 books (1977–2024), with 656 cross-book argument clusters classified by how the argument evolved over time, a 119-page static reading layer, and a calibration pipeline tuned against human ground truth to a weighted recall score of 0.948. The interesting thing is not the numbers but the route taken to get there.


1. Why this project exists

The project began as a deliberate inversion. A larger knowledge-graph project I had been working on — call it the big ingestion system — had been optimised for scale: tens of thousands of claims across thousands of sources, a hierarchical tree with include/exclude descriptions at every node, typed argument links. It worked in the sense that the counts kept going up. But the output felt flat. Every piece of research I kept reading on personal knowledge systems said essentially the same thing: selectivity beats exhaustiveness, claims alone aren’t enough (you need argument structure), and usefulness comes from frontiers, not from archives. I kept agreeing and continuing to ingest.

The Hirsch project tests the opposite hypothesis. Instead of shallow extraction across thousands of texts: deep extraction from a tightly bounded corpus. One thinker. Ten books. Every argument, every warrant, every dependency, every counter-argument that can be found or sourced. Nobody, as far as I can tell, has done this for a major public intellectual — certainly not with cross-book evolution tracking. Hirsch is a good candidate because the argument is coherent, the corpus is finite, the topic has genuine public interest, the books are citation-rich, and — as it turns out — he has been making more or less the same core argument with impressive stubbornness since 1977, which makes him ideal for studying how a single thinker’s argument evolves over decades.


2. Data model

Before any code, a manual extraction was done on the thirteen-page Prologue of Why Knowledge Matters (2016), followed by three independent critiques — an education scholar, an argument-mapper, and a UX designer. The argument-mapper’s verdict was blunt and set the direction for the rest of the project:

“Captures 60% of the content. 0% of the structure.”

The manual extraction had recorded Hirsch’s claims reasonably well but had recorded almost nothing about how they connected — which claim depended on which, where the warrants were, where the inference gaps lay. The content was a list; the argument was somewhere else. The fix had to be architectural: separate content extraction from structure extraction, and give the data model explicit slots for structural objects, not just propositions.

Node types

Ten node types ended up in the schema, most of them load-bearing:

| Type | Role |
| --- | --- |
| claim | Atomic assertible proposition. Extracted aggressively. |
| evidence | Specific study, dataset, example, observation. Carries a citation_status field tracking whether it is cited, uncited-but-verifiable, or common knowledge. |
| warrant | The principle connecting evidence to claim — the Toulmin W. Usually implicit in the source text and must be generated. |
| objection | A challenge to a claim, a warrant, or an evidence-claim link. |
| response | A dialectical reply to an objection. |
| framework | A composite: a coherent package of related claims (e.g. “The Three Tyrannical Ideas” contains naturalism + individualism + skill-centrism). Can be attacked as a unit or per-component. |
| concept | A contested or defined term with the author’s definition and alternatives (e.g. “developmental appropriateness”). |
| case | A specific country, school, or implementation with narrative and extracted claims (e.g. “France 1975–1985”). |
| thinker | An intellectual position the author engages with — not just a name in a bibliography. |
| reference | A parsed, enriched citation — endnote text → structured fields → Crossref metadata. |

Edge types

Eight edge types: supports, opposes, undermines (attacks the warrant, not the claim itself), depends-on, objects-to, responds-to, refines, instantiates.

Two design decisions that punch above their weight

Extract aggressively. Defer importance to post-processing. Every assertion the author makes is a claim. Importance is computed from graph structure — in-degree, evidence count, cross-chapter recurrence, main-conclusion participation — and never judged at extraction time. This is not obvious and it matters a great deal. During early calibration, the pipeline was being too selective because the prompt was asking for “important” claims, which caused it to drop factual premises (“reading scores declined 1960–1980”) and concessions (“testing did improve mechanics”) that turned out to be load-bearing for arguments several chapters away. You cannot recognise a load-bearing premise by looking at it alone — you can only recognise it by computing what eventually depends on it, which is a graph property and has to come later.

Cases hold narratives alongside claims. The France case, the NCLB case, the Finland case — these contain long descriptive passages that are not literally claims but are essential context. The instinct to shred them into atomic propositions loses the narrative that makes the argument memorable. Cases in the final schema carry both a preserved human-readable story and the claims extracted from the interpretive parts of that story. This decision is visible in the reading layer, where cases are narrative entry points rather than claim lists.


3. The calibration loop

The calibration infrastructure is the part of the project I’d most urge other argument-extraction projects to build before anything else, and it’s the part most often skipped.

The basic problem: how do you know whether extraction is any good? The naive answer — read a chapter, look at the output, mutter “seems about right” — gives you the illusion of feedback without the substance of it, and makes principled decisions about prompts, models, or pipeline structure impossible. You need a numerical score that moves when you change the pipeline, and you need a ground-truth corpus to score against. Three pieces ended up in the loop: an HTML review interface, a ground-truth file, and an automated eval harness.

3.1 The HTML review interface

A self-contained calibration page, deliberately low-tech. The design: left panel shows the original chapter text; right panel shows the extracted claims, evidence, warrants, and dependencies; each item has rating sliders for completeness and accuracy; navigation is keyboard-driven; all state persists to localStorage so a review session survives browser crashes and can be resumed the next day. A text-selection popup attaches notes to source passages (“this should be a claim and isn’t”). At the end, the session exports as JSON to the clipboard.

The Hirsch-specific instances live in the repo: calibrate.html (first version), then calibrate_ch1.html, calibrate_ch1_v2.html, and calibrate_ch7.html for subsequent chapters. Opening one of these gives a direct view of what a real review session looked like.

The generator has since become a general-purpose Claude Code skill (calibration-eval) that builds self-contained eval pages for four modes: rating, A/B comparison, threshold tuning, and extraction recall. Keyboard-driven, localStorage-persisted, clipboard export, gold-question self-consistency checks. The shared harness lives at ~/.claude/skills/shared/. The format has generalised well enough that I now use variations of it for unrelated projects.

3.2 Ground truth

During manual review of the Prologue and Chapter 1, I ended up flagging the claims the pipeline should have caught and didn’t. Twenty-four items in total. They live in ground_truth.json, each with an ID (GT-P1, GT-P2, …), the target text, a type, and a note about what the pipeline got wrong.

Twenty-four is a tiny number and I was nervous about it. In practice it was more than enough. The marginal value of ground-truth item #25 turned out to be small compared to the marginal value of running the existing 24 through another experiment. My working hypothesis now is that something like 15–30 carefully curated ground-truth items per pipeline is enough to drive serious experimentation, provided the items are well-chosen and cover the actual failure modes you care about. The items matter; the count mostly doesn’t.

3.3 The eval harness

eval_extraction.py runs the extraction pipeline, takes the output, and asks an LLM judge to match each ground-truth item against the extracted claims with partial-credit scoring: exact match = 1.0, partial = 0.5, miss = 0.0. The judge prompt is deliberately permissive about wording — “substantially equivalent to GT-P3?” — because the question is whether the claim got extracted, not whether the wording is identical. The result is a single weighted recall score per run.

The critical practical fact is that LLM-judge scoring has variance. Running the same code on the same input produced scores in the 0.92–0.98 range across repeat runs, and the variance floor on this harness is about 0.06 weighted-score points. Any experiment whose effect size is smaller than 0.06 is indistinguishable from noise, and every experiment had to be run two or three times before it could be trusted. The corresponding discipline: if your change does not move the score by roughly 0.08, you are probably imagining the improvement. This killed several plausible-looking experiments.
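The scoring step itself is trivial once the judge has spoken. A minimal sketch, with the LLM judge stubbed out as precomputed verdicts (the ground-truth IDs and verdicts below are hypothetical, not from the real ground_truth.json):

```python
# Partial-credit weights from the harness: exact = 1.0, partial = 0.5, miss = 0.0.
CREDIT = {"exact": 1.0, "partial": 0.5, "miss": 0.0}

def weighted_recall(verdicts):
    """Collapse per-item judge verdicts into the single score per run.

    `verdicts` maps ground-truth IDs (GT-P1, ...) to the judge's verdict
    on whether a substantially equivalent claim was extracted. This is an
    illustrative sketch of the scoring in eval_extraction.py; the judge
    call itself is omitted.
    """
    if not verdicts:
        return 0.0
    return sum(CREDIT[v] for v in verdicts.values()) / len(verdicts)

# One hypothetical run over four ground-truth items:
run = {"GT-P1": "exact", "GT-P2": "exact", "GT-P3": "partial", "GT-P4": "miss"}
score = weighted_recall(run)   # (1 + 1 + 0.5 + 0) / 4 = 0.625
```

The variance lives entirely in the verdicts, not in this arithmetic, which is why repeat runs are non-negotiable.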

3.4 A shared harness at limbic-level

The calibration-eval skill integrates with limbic.amygdala.calibrate — the calibration-metrics module inside the shared embedding/search library — for Cohen’s kappa, validation of LLM judges against human ground truth, and inter-rater reliability. The goal over time is that anything built from the calibration skill can be fed into a common limbic.amygdala.calibrate report without glue code. This project is one of the validation cases for that interface.


4. Experiments: what actually moved the needle

Once the loop was in place, experiments became tractable. Nine experiments ran across four parallel agents, each modifying exactly one variable, each scored against the same ground truth, each repeated two or three times because of the variance floor.

| Experiment | Weighted score | Verdict |
| --- | --- | --- |
| Baseline (single-pass, 15 rules) | 0.854 | |
| + 4 explicit extraction rules | 0.813 | Reverted — confused the model |
| Raised claim floor 15 → 25 | 0.875 | Kept |
| + Concrete mistake examples in prompt | 0.875 | Kept |
| + Completeness sweep (pass 2) | 0.896 | Kept |
| Sharpened sweep prompt | 0.875 | Reverted |
| Sliding window (6K chars, 1K overlap) | 0.979 | Kept |
| Few-shot examples from ground truth | 0.938 | Moderate effect |
| Targeted detection questions | 0.854 | No improvement |
| Compound-claim splitting (pass 3) | 0.938 | Reverted |

4.1 Sliding window is the single largest lever

The sliding-window result is worth dwelling on, because it was surprising. The change was minimal: instead of feeding an entire 28K-character chapter to the LLM in one call, split the chapter into 6,000-character windows with 1,000 characters of overlap, prefer paragraph boundaries, extract from each window, and then deduplicate the combined results using a 60% word-overlap threshold. That single structural change moved the weighted score from 0.854 to 0.979 in isolation, and produced ~80–150 claims per chapter where whole-chapter extraction had been giving ~20–30.

The hypothesis is simple attention degradation: at 28K+ characters, LLMs drop claims from the middle of long inputs regardless of prompting. Windowing eliminates this. It is not a clever trick; it is structural, general, and it works for any extraction task operating on book-length inputs. If one thing from this project generalises to other argument-extraction pipelines, it’s this.
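The windowing and dedup steps can be sketched in a few lines. The parameters (6K window, 1K overlap, paragraph-boundary snapping, 60% word overlap) follow the text; the boundary heuristic and the exact dedup comparison are this sketch’s assumptions, not the project’s real code.

```python
def sliding_windows(text, size=6000, overlap=1000):
    """Split text into overlapping windows, preferring paragraph breaks."""
    windows, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        if end < len(text):
            # snap the cut to the last paragraph break inside the window
            cut = text.rfind("\n\n", start, end)
            if cut > start:
                end = cut
        windows.append(text[start:end])
        if end == len(text):
            break
        start = max(end - overlap, start + 1)
    return windows

def word_overlap(a, b):
    """Fraction of shared words, relative to the shorter claim."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, min(len(wa), len(wb)))

def dedup(claims, threshold=0.6):
    """Keep the first of any pair of claims above the overlap threshold."""
    kept = []
    for c in claims:
        if all(word_overlap(c, k) < threshold for k in kept):
            kept.append(c)
    return kept

claims = [
    "Reading scores declined between 1960 and 1980",
    "reading scores declined between 1960 and 1980 nationwide",
    "Testing did improve mechanics",
]
unique = dedup(claims)   # the first two exceed the threshold; one survives
```

Extraction then runs per window, and `dedup` runs over the concatenated per-window outputs.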

4.2 Structural changes beat prompt tuning

Across the experiment table, the deltas cluster in a revealing way. Prompt-level experiments — more rules, sharper language, targeted detection questions, few-shot examples, compound splitting — produced deltas in the noise range (-0.04 to +0.04). The structural experiments — windowing the input, windowing Phase 2 the same way, batching structure extraction with relevant text sections — produced deltas of +0.07 to +0.10. If the pipeline’s architecture is wrong, no prompt will rescue it; if the architecture is right, the prompt is nearly a rounding error. This is not a general claim about all extraction tasks, but it held remarkably consistently across every experiment run on this one.

4.3 A bug worth documenting

Phase 2 (structure extraction) had been silently slicing its input with [:12000] on chapters that could run to 58,000 characters. The slice was a leftover from when Phase 2 was only being tested on the 13K-character Prologue, where it was a no-op. When Chapter 7 (58K characters, the “France” chapter, where Hirsch makes his most structurally complex argument) finally came through the pipeline, Phase 2 returned almost no dependencies and almost no warrants. For about a day the hypothesis was that structural extraction was fundamentally harder than content extraction on long chapters. The real cause was the truncation. Fixing it and windowing Phase 2 in the same way as Phase 1 produced a 7× increase in structural output: 8 → 55 dependencies, 4 → 22 warrants on Chapter 7.

The general lesson: any fixed numeric slice in a pipeline is a latent bug until proven otherwise. End-to-end assertions over input sizes would have caught this immediately; I had not written any.
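The missing check is cheap to write. A sketch of the kind of end-to-end coverage assertion that would have caught the [:12000] slice on first contact with a long chapter (illustrative; not the project’s code):

```python
def assert_no_truncation(text, windows):
    """Fail loudly if the windows the model actually saw do not cover the
    whole input -- i.e. guard against any silent fixed-size slice upstream.
    """
    pos, end_reached = 0, 0
    for w in windows:
        idx = text.find(w, pos)
        assert idx != -1, "window is not a substring of the input"
        end_reached = max(end_reached, idx + len(w))
        pos = idx + 1          # windows overlap, so only advance past the start
    assert end_reached == len(text), (
        f"pipeline saw only {end_reached} of {len(text)} characters"
    )
```

Run once per chapter, just before the extraction calls, it turns the Chapter 7 failure mode from a day of wrong hypotheses into an immediate stack trace.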

4.4 Chapter 7 as the generalisation test

Chapter 7 was also the first unseen chapter — all calibration had been done on the Prologue and Chapter 1, and Chapter 7 was a test of whether the pipeline generalised beyond the calibration set. The extraction produced 145 claims, 43 evidence items, 30 concepts, 13 cases, and 18 thinkers. Human review of the full output found exactly two missed claims, both depth-of-interpretation issues rather than recall failures. This was treated as validation that the pipeline was ready to run at corpus scale.

4.5 Model selection

I tested Gemini 2.5 Flash against Sonnet on Phase 2 (structure). Sonnet produced noticeably richer structural reasoning — better warrants, sharper vulnerabilities, more incisive objection framing — but would not reliably follow the JSON schema. Flash followed the schema cleanly. For a pipeline that needs to run 101 chapters without manual intervention on JSON repair, schema compliance is a more important production property than reasoning depth. Flash plus a structural second pass beats Sonnet plus manual cleanup, at this scale. This may not hold in all settings, but it held consistently here.


5. The final extraction pipeline

  PDF / ePub / OCR intake
          ↓
  chapter segmentation
          ↓
  sliding window (6K / 1K overlap, paragraph boundaries)
          ↓
  Phase 1a — content extraction (per window, Gemini Flash)
          ↓
  Phase 1b — dedup (60% word-overlap threshold)
          ↓
  Phase 1c — completeness sweep (second pass over the full text)
          ↓
  Phase 2 — structure extraction (batches of ~20 claims
            with relevant text section, Gemini Flash)
          ↓
  Phase 3 — self-critique pass (Gemini Flash)
          ↓
  per-chapter JSON
          ↓
  Human calibration (HTML review interface)
          ↓
  eval_extraction.py against ground_truth.json
          ↓
  Backfill passes (endnote matching, warrant generation,
                   evidence enrichment)

Key parameters, for reference: 6,000-character windows with 1,000 characters of overlap at paragraph boundaries; dedup at 60% word overlap; Phase 2 batches of ~20 claims with the relevant text section; Gemini 2.5 Flash for all LLM phases.

The pipeline writes structured logs at every phase so individual steps can be re-run without redoing upstream work. Each phase’s output is a standalone JSON file that the next phase consumes, which made experimentation fast and made it possible to replay any single phase with modifications.

5.1 Intake pathways

Ten books meant three different intake paths. Digital PDFs with clean metadata went through a simple PDF parser with chapter detection by TOC matching. Books that only existed as ePub went through ebooklib + BeautifulSoup. A few scanned PDFs had no extractable TOC and needed one typed by hand. And The Ratchet Effect (2024), which only existed for me as a Kindle book, was photographed page by page — 103 screenshots — and OCR’d through Gemini Flash’s vision capability, which turned out to be remarkably reliable for this use. All three pathways feed a common pre-parsed JSON intake format so that downstream phases never need to know how the book arrived.

5.2 Backfill passes

After per-chapter extraction, three targeted passes run over the full book:

  1. Endnote matching. Link evidence items to the parsed, enriched references from the book’s endnotes.
  2. Warrant generation. Generate the implicit warrants connecting evidence to claims, since warrants are rarely stated outright in the source text.
  3. Evidence enrichment. Attach structured citation metadata (via Crossref) to evidence items.

These passes are conservative by design: they only fill gaps against structural invariants (“every main conclusion should have a warrant chain to its evidence”); they do not invent new structure.


6. Cross-book consolidation

Once all ten books were extracted in isolation, the consolidation pipeline (consolidate.py) builds the cross-corpus layer. Five phases.

6.1 Entity deduplication

Thinkers, concepts, and cases are deduplicated using embedding similarity followed by an LLM verification pass. After merging, 146 thinkers, 206 concepts, and 57 cases appear in more than one book.

The multi-book counts are the useful signal — they identify the recurring elements of the author’s intellectual world.

6.2 Claim clustering across books

This is where the scale became nontrivial. Naive pairwise cosine similarity over 10,232 claims is ~52 million comparisons, which is slow on a laptop. An inverted-index optimisation — shingle the claim text, only compare claims that share a shingle — reduced the candidate-pair count from 52M to 4.3M, and the clustering step now runs in about 5.3 seconds.

I mention this explicitly because I nearly reached for a vector database, which would have been enormous overkill for a 10K-item corpus. At this scale, brute-force numpy with a shingling optimisation beats any infrastructure you might be tempted to stand up. limbic.amygdala’s VectorIndex uses the same principle — numpy brute force is actually faster than ANN indices below roughly 100K vectors, and much easier to reason about. The crossover point where ANN starts to win is higher than most people expect.
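The inverted-index trick can be sketched directly: shingle each claim into short word n-grams, index claims by shingle, and only emit pairs that co-occur under some shingle. The shingle length and example claims are this sketch’s assumptions.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(claims, shingle_len=3):
    """Candidate pairs via an inverted index of word shingles.

    Only claims sharing at least one shingle reach the expensive cosine
    comparison downstream -- the step that cut ~52M pairwise comparisons
    to ~4.3M in the real pipeline (illustrative sketch, not exact code).
    """
    index = defaultdict(set)
    for i, claim in enumerate(claims):
        words = claim.lower().split()
        for j in range(max(1, len(words) - shingle_len + 1)):
            index[" ".join(words[j:j + shingle_len])].add(i)

    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

claims = [
    "reading scores declined between 1960 and 1980",
    "verbal scores declined between 1960 and 1980",
    "France centralised its curriculum in 1975",
]
pairs = candidate_pairs(claims)   # only the first two claims share a shingle
```

Cosine similarity then runs on the surviving pairs only, which is why plain numpy stays fast at this corpus size.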

6.3 Evolution classification

For each cross-book cluster, Gemini Flash classifies how the argument evolved across the books it appears in, using one of six categories. The distribution over the 656 cross-book argument clusters:

| Category | Count |
| --- | --- |
| Repeated | 250 |
| Refined | 238 |
| Evolved | 74 |
| Broadened | 56 |
| Narrowed | 21 |
| New evidence | 17 |

I had expected this to be the noisiest step in the pipeline — LLMs classifying intellectual evolution seemed like exactly the kind of task where hallucination and smoothing would dominate. It turned out to be one of the most useful. The classifications are consistent across re-runs, the distribution is interpretable, and the output makes cross-book patterns visible that a human reader would not notice from reading the books sequentially.

6.4 A methodological caveat

There is a serious question buried in all this: is the pipeline measuring the author, or the LLM’s judgment of the author? The entity dedup and the evolution classification are both LLM-driven, and the LLM is doing some amount of smoothing. The honest claim is that the pipeline makes visible patterns that are consistent across re-runs and that match what a careful reader would notice given enough time — but the pipeline is not a neutral instrument. For any claim about what “Hirsch does over time”, the appropriate discount has to be applied.

6.5 What the cross-book view reveals

With those caveats in place, the cross-book view surfaces things I would not have seen from reading the books one at a time:

6.6 Cross-book importance scoring

A composite score, with no LLM involvement. Components: book recurrence (35%), dependency centrality (20%), evidence density (15%), counter-arguments (10%), main-conclusion bonus (30%). The top 15 claims by this score are surfaced on the landing page. Because it’s purely graph-structural, it’s reproducible and explainable in a way that an LLM-judged importance score would not be.
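A sketch of the composite, using the weights quoted above. Note that they sum to 1.10 because the main-conclusion term is a bonus rather than a share; how each component is normalised into [0, 1] is this sketch’s assumption, not the project’s exact code, and the example values are invented.

```python
# Weights as quoted in the text; main_conclusion is a bonus, so the
# weights deliberately sum to more than 1.0.
WEIGHTS = {
    "book_recurrence": 0.35,
    "dependency_centrality": 0.20,
    "evidence_density": 0.15,
    "counter_arguments": 0.10,
    "main_conclusion": 0.30,
}

def importance(components):
    """Weighted composite over component scores assumed normalised to [0, 1]."""
    return sum(WEIGHTS[k] * components.get(k, 0.0) for k in WEIGHTS)

claim = {
    "book_recurrence": 0.8,      # e.g. appears in 8 of 10 books
    "dependency_centrality": 0.5,
    "evidence_density": 0.4,
    "counter_arguments": 0.2,
    "main_conclusion": 1.0,      # is a main conclusion somewhere
}
score = importance(claim)   # 0.28 + 0.10 + 0.06 + 0.02 + 0.30 = 0.76
```

Because every input is a graph statistic, the same claim always gets the same score, which is what makes the top-15 list explainable.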


7. Things that did not work

The failures are often more useful than the successes, so the partial list, for the record:


8. The reading layer

The public reading layer (build_corpus_site.py, live at hirsch-atlas.pages.dev) is a 119-page static site generated from a single corpus_consolidated.json file. Page types:

A few design choices are strict:


9. The shared infrastructure (limbic)

Embedding, clustering, and LLM-judge infrastructure comes from limbic, a small data-curation library extracted from patterns recurring across several projects. The sub-packages:

The sliding-window splitter and the word-overlap dedup functions currently live inside the Hirsch repo, but they are general-purpose enough that they will be lifted into limbic.cerebellum as sliding_window_extract() and word_overlap_dedup() for reuse.


10. Process summary

The full per-book process, end to end:

  1. Acquire. PDF, ePub, or photographed Kindle screenshots. Feed through the appropriate intake path.
  2. Skeleton. One LLM call on the Prologue/Introduction + TOC produces the book’s argument skeleton. Human eyeballs it briefly.
  3. Extract per-chapter. Sliding-window pipeline. Per chapter, outputs phase1_content.json, phase2_structure.json, phase3_critique.json, merged into chapter.json.
  4. Human calibration. First three chapters of each book get full human review using the HTML interface. If quality is consistently ≥4/5 on the review sliders, chapters 4+ get spot-check review (50% of claims, all objections, all warrants).
  5. Backfill passes. Endnote matching, warrant generation, evidence enrichment.
  6. Consolidate. Run consolidate.py across all books extracted so far. Entity dedup, claim clustering, evolution classification, canonical argument merging, importance scoring.
  7. Build site. build_corpus_site.py regenerates the 119-page static site from corpus_consolidated.json.

Every step is replayable in isolation. The pipeline is designed so that re-running a single phase never requires redoing upstream work, which is critical for experimentation velocity.


11. Generalisable findings for argument extraction

The list of things from this project that I think will generalise to other argument-extraction pipelines:

Build the eval harness before the pipeline. Twenty to thirty carefully curated ground-truth items from two or three representative source sections are enough to drive every subsequent pipeline decision. Without a numerical score, prompt tuning is theatre.

Window your inputs. Window them aggressively. Sliding-window extraction (6K chars, 1K overlap, paragraph boundaries) is the single largest lever for recall on book-length text. The hypothesis — attention degradation on long inputs — is well-supported by the literature, and the intervention is structural rather than prompt-based, which is why it works so reliably. Window both the content phase and the structure phase.

Separate content from structure. Single-pass “extract the argument” prompts always produce the 60%-content / 0%-structure failure mode. Two phases, with the content output fed as input to the structure phase, is the minimum viable architecture.

Schema compliance over reasoning depth at production scale. Cheap schema-compliant models beat expensive reasoning models that need manual JSON repair, once you are running hundreds of chapters. This is a production observation, not a capability claim.

Extract aggressively, compute importance later. Importance is a graph property and cannot be judged at extraction time. Dropping a claim because it “looks obvious” drops the premises that turn out to be load-bearing two chapters away.

Preserve narratives next to claims. Cases are not reducible to atomic propositions without loss. The narrative is what readers remember, and it is what contestability has to hook into.

Build cross-corpus canonical arguments. The cross-book view is what makes deep-extraction-from-a-bounded-corpus worth doing, and it only becomes possible once per-source extraction is reliable. At scale below ~100K items, numpy brute force with a shingling optimisation beats any vector database you might be tempted to install.

Treat counter-argument sourcing as a first-class pipeline phase, not an afterthought. LLM-synthesised counter-arguments are fluent, plausible, and subtly wrong in a way that damages the project’s epistemic standing. Named critics, attributed quotes, real provenance.

Structural changes beat prompt tuning. Not in every project, but consistently in this one. When the pipeline is underperforming, the first question should be “is the architecture right?”, not “can I rewrite the prompt?”

Latent numeric slices are bugs. Any hard-coded text length in the pipeline is a latent bug waiting to be discovered on a larger input. End-to-end assertions over input sizes catch this kind of problem immediately.


12. Outstanding work


13. Summary of the corpus

| Metric | Value |
| --- | --- |
| Books extracted | 10 |
| Year range | 1977–2024 |
| Chapters | 101 |
| Claims | 10,232 |
| Cross-book argument clusters | 656 |
| Multi-book thinkers | 146 |
| Multi-book concepts | 206 |
| Multi-book cases | 57 |
| Static pages in reading layer | 119 |
| Calibration ground-truth items | 24 |
| Final eval score (weighted recall) | 0.948 (avg), 0.979 (best) |
| Eval harness variance floor | ~0.06 |

Live site: hirsch-atlas.pages.dev · Code: houshuang/hirsch-atlas · Library: houshuang/limbic