A development log — pipeline, tools, experiments, and process
This document records, in some detail, how the Hirsch Argument Atlas was built: the data model, the extraction pipeline, the calibration loop, the experiments that moved the needle, the experiments that didn’t, the cross-book consolidation machinery, and the final stack. The intended reader is someone already deep in the argument-mapping space — comfortable with Toulmin structure, with the Discourse Graph tradition, with the practical difficulties of getting an LLM to emit a dependency edge — who wants enough detail to either replicate the approach or argue with it.
The final corpus contains 10,232 claims across 101 chapters from 10 books (1977–2024), with 656 cross-book argument clusters classified by how the argument evolved over time, a 119-page static reading layer, and a calibration pipeline tuned against human ground truth to a weighted recall score of 0.948. The interesting thing is not the numbers but the route taken to get there.
The project began as a deliberate inversion. A larger knowledge-graph project I had been working on — call it the big ingestion system — had been optimised for scale: tens of thousands of claims across thousands of sources, a hierarchical tree with include/exclude descriptions at every node, typed argument links. It worked in the sense that the counts kept going up. But the output felt flat. Every piece of research I kept reading on personal knowledge systems said essentially the same thing: selectivity beats exhaustiveness, claims alone aren’t enough (you need argument structure), and usefulness comes from frontiers, not from archives. I kept agreeing and continuing to ingest.
The Hirsch project tests the opposite hypothesis. Instead of shallow extraction across thousands of texts: deep extraction from a tightly bounded corpus. One thinker. Ten books. Every argument, every warrant, every dependency, every counter-argument that can be found or sourced. Nobody, as far as I can tell, has done this for a major public intellectual — certainly not with cross-book evolution tracking. Hirsch is a good candidate because the argument is coherent, the corpus is finite, the topic has genuine public interest, the books are citation-rich, and — as it turns out — he has been making more or less the same core argument with impressive stubbornness since 1977, which makes him ideal for studying how a single thinker’s argument evolves over decades.
Before any code, a manual extraction was done on the thirteen-page Prologue of Why Knowledge Matters (2016), followed by three independent critiques — an education scholar, an argument-mapper, and a UX designer. The argument-mapper’s verdict was blunt and set the direction for the rest of the project:
“Captures 60% of the content. 0% of the structure.”
The manual extraction had recorded Hirsch’s claims reasonably well but had recorded almost nothing about how they connected — which claim depended on which, where the warrants were, where the inference gaps lay. The content was a list; the argument was somewhere else. The fix had to be architectural: separate content extraction from structure extraction, and give the data model explicit slots for structural objects, not just propositions.
Ten node types ended up in the schema, most of them load-bearing:
| Type | Role |
|---|---|
| claim | Atomic assertible proposition. Extracted aggressively. |
| evidence | Specific study, dataset, example, observation. Carries a citation_status field tracking whether it is cited, uncited-but-verifiable, or common knowledge. |
| warrant | The principle connecting evidence to claim — the Toulmin W. Usually implicit in the source text and must be generated. |
| objection | A challenge to a claim, a warrant, or an evidence-claim link. |
| response | A dialectical reply to an objection. |
| framework | A composite: a coherent package of related claims (e.g. “The Three Tyrannical Ideas” contains naturalism + individualism + skill-centrism). Can be attacked as a unit or per-component. |
| concept | A contested or defined term with the author’s definition and alternatives (e.g. “developmental appropriateness”). |
| case | A specific country, school, or implementation with narrative and extracted claims (e.g. “France 1975–1985”). |
| thinker | An intellectual position the author engages with — not just a name in a bibliography. |
| reference | A parsed, enriched citation — endnote text → structured fields → Crossref metadata. |
Eight edge types: supports, opposes,
undermines (attacks the warrant, not the claim itself),
depends-on, objects-to,
responds-to, refines,
instantiates.
Extract aggressively. Defer importance to post-processing. Every assertion the author makes is a claim. Importance is computed from graph structure — in-degree, evidence count, cross-chapter recurrence, main-conclusion participation — and never judged at extraction time. This is not obvious and it matters a great deal. During early calibration, the pipeline was being too selective because the prompt was asking for “important” claims, which caused it to drop factual premises (“reading scores declined 1960–1980”) and concessions (“testing did improve mechanics”) that turned out to be load-bearing for arguments several chapters away. You cannot recognise a load-bearing premise by looking at it alone — you can only recognise it by computing what eventually depends on it, which is a graph property and has to come later.
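As a sketch of what "importance is a graph property" means in practice (field names and the additive scoring are illustrative, not the repo's actual code):

```python
from collections import defaultdict

def importance_scores(claims, edges, evidence_counts, chapter_index):
    """Toy importance scorer: in-degree on support/dependency edges,
    plus evidence count, plus cross-chapter recurrence. Computed after
    extraction, never at extraction time."""
    in_degree = defaultdict(int)
    for src, dst, kind in edges:              # e.g. ("c12", "c3", "supports")
        if kind in ("supports", "depends-on"):
            in_degree[dst] += 1               # dst is the claim being leaned on
    scores = {}
    for c in claims:
        recurrence = max(0, len(chapter_index.get(c, set())) - 1)
        scores[c] = in_degree[c] + evidence_counts.get(c, 0) + recurrence
    return scores

# A "boring" factual premise scores high once the graph is known:
scores = importance_scores(
    claims=["c1", "c2", "c3"],
    edges=[("c2", "c1", "supports"), ("c3", "c1", "depends-on")],
    evidence_counts={"c1": 1},
    chapter_index={"c1": {1, 3}},
)
```

The point of the sketch is the shape of the computation: nothing here can be evaluated while looking at a single claim in isolation.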
Cases hold narratives alongside claims. The France case, the NCLB case, the Finland case — these contain long descriptive passages that are not literally claims but are essential context. The instinct to shred them into atomic propositions loses the narrative that makes the argument memorable. Cases in the final schema carry both a preserved human-readable story and the claims extracted from the interpretive parts of that story. This decision is visible in the reading layer, where cases are narrative entry points rather than claim lists.
The calibration infrastructure is the part of the project I’d most urge other argument-extraction projects to build before anything else, and it’s the part most often skipped.
The basic problem: how do you know whether extraction is any good? The naive answer — read a chapter, look at the output, mutter “seems about right” — gives you the illusion of feedback without the substance of it, and makes principled decisions about prompts, models, or pipeline structure impossible. You need a numerical score that moves when you change the pipeline, and you need a ground-truth corpus to score against. Three pieces ended up in the loop: an HTML review interface, a ground-truth file, and an automated eval harness.
A self-contained calibration page, deliberately low-tech. The design:
left panel shows the original chapter text; right panel shows the
extracted claims, evidence, warrants, and dependencies; each item has
rating sliders for completeness and accuracy; navigation is
keyboard-driven; all state persists to localStorage so a
review session survives browser crashes and can be resumed the next day.
A text-selection popup attaches notes to source passages (“this should
be a claim and isn’t”). At the end, the session exports as JSON to the
clipboard.
The Hirsch-specific instances live in the repo: calibrate.html
(first version), then calibrate_ch1.html,
calibrate_ch1_v2.html,
and calibrate_ch7.html
for subsequent chapters. Opening one of these gives a direct view of
what a real review session looked like.
The generator has since become a general-purpose Claude Code skill
(calibration-eval) that builds self-contained eval pages
for four modes: rating, A/B comparison, threshold tuning, and extraction
recall. Keyboard-driven, localStorage-persisted, clipboard
export, gold-question self-consistency checks. The shared harness lives
at ~/.claude/skills/shared/. The format has generalised
well enough that I now use variations of it for unrelated projects.
During manual review of the Prologue and Chapter 1, I ended up
flagging the claims the pipeline should have caught and didn’t.
Twenty-four items in total. They live in ground_truth.json,
each with an ID (GT-P1, GT-P2, …), the target
text, a type, and a note about what the pipeline got wrong.
Twenty-four is a tiny number and I was nervous about it. In practice it was more than enough. The marginal value of ground-truth item #25 turned out to be small compared to the marginal value of running the existing 24 through another experiment. My working hypothesis now is that something like 15–30 carefully curated ground-truth items per pipeline is enough to drive serious experimentation, provided the items are well-chosen and cover the actual failure modes you care about. The items matter; the count mostly doesn’t.
eval_extraction.py
runs the extraction pipeline, takes the output, and asks an LLM judge to
match each ground-truth item against the extracted claims with
partial-credit scoring: exact match = 1.0, partial = 0.5, miss = 0.0.
The judge prompt is deliberately permissive about wording —
“substantially equivalent to GT-P3?” — because the question is whether
the claim got extracted, not whether the wording is identical.
The result is a single weighted recall score per run.
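The scoring arithmetic itself is small. A minimal sketch, assuming the exact/partial/miss verdicts arrive from a separate LLM-judge call:

```python
def weighted_recall(judgments, weights=None):
    """Partial-credit recall over ground-truth items.
    judgments: {gt_id: "exact" | "partial" | "miss"} from the LLM judge.
    weights:   optional per-item weights; uniform when omitted."""
    credit = {"exact": 1.0, "partial": 0.5, "miss": 0.0}
    if weights is None:
        weights = {gt_id: 1.0 for gt_id in judgments}
    total = sum(weights.values())
    return sum(credit[v] * weights[g] for g, v in judgments.items()) / total

run = {"GT-P1": "exact", "GT-P2": "partial", "GT-P3": "miss", "GT-P4": "exact"}
score = weighted_recall(run)   # (1 + 0.5 + 0 + 1) / 4 = 0.625
```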
The critical practical fact is that LLM-judge scoring has variance. Running the same code on the same input produced scores in the 0.92–0.98 range across repeat runs, and the variance floor on this harness is about 0.06 weighted-score points. Any experiment whose effect size is smaller than 0.06 is indistinguishable from noise, and every experiment had to be run two or three times before it could be trusted. The corresponding discipline: if your change does not move the score by roughly 0.08, you are probably imagining the improvement. This killed several plausible-looking experiments.
The calibration-eval skill integrates with limbic.amygdala.calibrate
— the calibration-metrics module inside the shared embedding/search
library — for Cohen’s kappa, validation of LLM judges against human
ground truth, and inter-rater reliability. The goal over time is that
anything built from the calibration skill can be fed into a common
limbic.amygdala.calibrate report without glue code. This
project is one of the validation cases for that interface.
Once the loop was in place, experiments became tractable. Eight experiments ran across four parallel agents, each modifying exactly one variable, each scored against the same ground truth, each repeated two or three times because of the variance floor.
| Experiment | Weighted score | Verdict |
|---|---|---|
| Baseline (single-pass, 15 rules) | 0.854 | — |
| + 4 explicit extraction rules | 0.813 | Reverted — confused the model |
| Raised claim floor 15 → 25 | 0.875 | Kept |
| + Concrete mistake examples in prompt | 0.875 | Kept |
| + Completeness sweep (pass 2) | 0.896 | Kept |
| Sharpened sweep prompt | 0.875 | Reverted |
| Sliding window (6K chars, 1K overlap) | 0.979 | Kept |
| Few-shot examples from ground truth | 0.938 | Moderate effect |
| Targeted detection questions | 0.854 | No improvement |
| Compound-claim splitting (pass 3) | 0.938 | Reverted |
The sliding-window result is worth dwelling on, because it was surprising. The change was minimal: instead of feeding an entire 28K-character chapter to the LLM in one call, split the chapter into 6,000-character windows with 1,000 characters of overlap, prefer paragraph boundaries, extract from each window, and then deduplicate the combined results using a 60% word-overlap threshold. That single structural change moved the weighted score from 0.854 to 0.979 in isolation, and produced ~80–150 claims per chapter where whole-chapter extraction had been giving ~20–30.
The hypothesis is simple attention degradation: at 28K+ characters, LLMs drop claims from the middle of long inputs regardless of prompting. Windowing eliminates this. It is not a clever trick; it is structural, general, and it works for any extraction task operating on book-length inputs. If one thing from this project generalises to other argument-extraction pipelines, it’s this.
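A minimal sketch of the two pieces: the window splitter and the word-overlap dedup. The parameters are the pipeline's; the Jaccard-style overlap measure is an assumption (the repo may compute word overlap differently):

```python
def sliding_windows(text, size=6000, overlap=1000):
    """Split text into overlapping windows, snapping each cut to the
    last paragraph break inside the window when one exists."""
    windows, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        if end < len(text):
            brk = text.rfind("\n\n", start, end)
            if brk > start + size // 2:   # only snap if the break isn't too early
                end = brk
        windows.append(text[start:end])
        if end == len(text):
            break
        start = max(end - overlap, start + 1)   # overlap, but always progress
    return windows

def dedup_claims(claims, threshold=0.6):
    """Drop claims whose word overlap with an already-kept claim exceeds
    the threshold (Jaccard on word sets — an assumption)."""
    kept = []
    for c in claims:
        words = set(c.lower().split())
        dup = any(
            len(words & set(k.lower().split()))
            / max(1, len(words | set(k.lower().split()))) > threshold
            for k in kept
        )
        if not dup:
            kept.append(c)
    return kept
```

The dedup step matters because overlapping windows guarantee that claims near window boundaries are extracted twice.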
Across the experiment table, the deltas cluster in a revealing way. Prompt-level experiments — more rules, sharper language, targeted detection questions, few-shot examples, compound splitting — produced deltas in the noise range (−0.04 to +0.04). The structural experiments — windowing the input, windowing Phase 2 the same way, batching structure extraction with relevant text sections — produced deltas of +0.07 to +0.10. If the pipeline’s architecture is wrong, no prompt will rescue it; if the architecture is right, the prompt is nearly a rounding error. This is not a general claim about all extraction tasks, but it held remarkably consistently across every experiment run on this one.
Phase 2 (structure extraction) had been silently slicing its input
with [:12000] on chapters that could run to 58,000
characters. The slice was a leftover from when Phase 2 was only being
tested on the 13K-character Prologue, where it was a no-op. When Chapter
7 (58K characters, the “France” chapter, where Hirsch makes his most
structurally complex argument) finally came through the pipeline, Phase
2 returned almost no dependencies and almost no warrants. For about a
day the hypothesis was that structural extraction was fundamentally
harder than content extraction on long chapters. The real cause was the
truncation. Fixing it and windowing Phase 2 in the same way as Phase 1
produced a 7× increase in structural output: 8 → 55
dependencies, 4 → 22 warrants on Chapter 7.
The general lesson: any fixed numeric slice in a pipeline is a latent bug until proven otherwise. End-to-end assertions over input sizes would have caught this immediately; I had not written any.
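The missing guard is a few lines. A sketch of the kind of end-to-end assertion that would have caught the truncation immediately:

```python
def assert_full_coverage(text, windows, overlap=1000):
    """Every character of the input must land inside some window.
    A silent [:12000] slice upstream fails this on the first long chapter."""
    covered = 0
    for w in windows:
        idx = text.find(w, max(0, covered - overlap))
        assert idx != -1, "window not found in source text"
        covered = max(covered, idx + len(w))
    assert covered == len(text), f"saw only {covered}/{len(text)} chars"
```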
Chapter 7 was also the first unseen chapter — all calibration had been done on the Prologue and Chapter 1, and Chapter 7 was a test of whether the pipeline generalised beyond the calibration set. The extraction produced 145 claims, 43 evidence items, 30 concepts, 13 cases, and 18 thinkers. Human review of the full output found exactly two missed claims, both depth-of-interpretation issues rather than recall failures. This was treated as validation that the pipeline was ready to run at corpus scale.
I tested Gemini 2.5 Flash against Sonnet on Phase 2 (structure). Sonnet produced noticeably richer structural reasoning — better warrants, sharper vulnerabilities, more incisive objection framing — but would not reliably follow the JSON schema. Flash followed the schema cleanly. For a pipeline that needs to run 101 chapters without manual intervention on JSON repair, schema compliance is a more important production property than reasoning depth. Flash plus a structural second pass beats Sonnet plus manual cleanup, at this scale. This may not hold in all settings, but it held consistently here.
PDF / ePub / OCR intake
↓
chapter segmentation
↓
sliding window (6K / 1K overlap, paragraph boundaries)
↓
Phase 1a — content extraction (per window, Gemini Flash)
↓
Phase 1b — dedup (60% word-overlap threshold)
↓
Phase 1c — completeness sweep (second pass over the full text)
↓
Phase 2 — structure extraction (batches of ~20 claims
with relevant text section, Gemini Flash)
↓
Phase 3 — self-critique pass (Gemini Flash)
↓
per-chapter JSON
↓
Human calibration (HTML review interface)
↓
eval_extraction.py against ground_truth.json
↓
Backfill passes (endnote matching, warrant generation,
evidence enrichment)
Key parameters, for reference: 6,000-character windows with 1,000 characters of overlap (paragraph boundaries preferred); a 60% word-overlap dedup threshold; structure extraction in batches of ~20 claims with their relevant text section; Gemini Flash for every LLM phase.
The pipeline writes structured logs at every phase so individual steps can be re-run without redoing upstream work. Each phase’s output is a standalone JSON file that the next phase consumes, which made experimentation fast and made it possible to replay any single phase with modifications.
Ten books meant three different intake paths. Digital PDFs with clean
metadata went through a simple PDF parser with chapter detection by TOC
matching. Books that only existed as ePub went through
ebooklib + BeautifulSoup. A few scanned PDFs had no
extractable TOC and needed one typed by hand. And The Ratchet
Effect (2024), which only existed for me as a Kindle book, was
photographed page by page — 103 screenshots — and OCR’d through Gemini
Flash’s vision capability, which turned out to be remarkably reliable
for this use. All three pathways feed a common pre-parsed JSON intake
format so that downstream phases never need to know how the book
arrived.
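The common intake record can be pictured as a small typed structure. This is a hypothetical sketch; the field names are illustrative, not the repo's actual schema:

```python
from typing import TypedDict

class ChapterIntake(TypedDict):
    """Illustrative shape of the pre-parsed JSON record that all three
    pathways (PDF, ePub, OCR) produce. Field names are assumptions."""
    book_id: str
    chapter_number: int
    title: str
    text: str             # full chapter text, whatever the source
    intake_method: str    # "pdf" | "epub" | "ocr"

chapter = ChapterIntake(
    book_id="why-knowledge-matters",
    chapter_number=7,
    title="France",
    text="...",
    intake_method="pdf",
)
```

Downstream phases consume only this shape, which is what lets them stay ignorant of how the book arrived.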
After per-chapter extraction, three targeted passes run over the full book: endnote matching, warrant generation, and evidence enrichment.
These passes are conservative by design: they only fill gaps against structural invariants (“every main conclusion should have a warrant chain to its evidence”); they do not invent new structure.
Once all ten books were extracted in isolation, the consolidation pipeline (consolidate.py) builds the cross-corpus layer in five phases: entity dedup, claim clustering, evolution classification, canonical argument merging, and importance scoring.
Thinkers, concepts, and cases are deduplicated using embedding similarity plus an LLM verification pass. Results: 146 multi-book thinkers, 206 multi-book concepts, and 57 multi-book cases.
The multi-book counts are the useful signal — they identify the recurring elements of the author’s intellectual world.
This is where the scale became nontrivial. Naive pairwise cosine similarity over 10,232 claims is ~52 million comparisons, which is slow on a laptop. An inverted-index optimisation — shingle the claim text, only compare claims that share a shingle — reduced the candidate-pair count from 52M to 4.3M, and the clustering step now runs in about 5.3 seconds.
I mention this explicitly because I nearly reached for a vector
database, which would have been enormous overkill for a 10K-item corpus.
At this scale, brute-force numpy with a shingling optimisation beats any
infrastructure you might be tempted to stand up.
limbic.amygdala’s VectorIndex uses the same
principle — numpy brute force is actually faster than ANN
indices below roughly 100K vectors, and much easier to reason about. The
crossover point where ANN starts to win is higher than most people
expect.
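The shingle trick is about a dozen lines. A sketch, with a 3-word shingle size that is an assumption (the repo does not state its shingle length):

```python
from collections import defaultdict
from itertools import combinations

def shingles(text, k=3):
    """k-word shingles of a claim's text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def candidate_pairs(claims, k=3):
    """Inverted index over shingles: only pairs of claims sharing at
    least one shingle survive as candidates for the expensive cosine
    comparison, which is how ~52M pairs shrink to a few million."""
    index = defaultdict(list)
    for i, claim in enumerate(claims):
        for s in shingles(claim, k):
            index[s].append(i)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(ids, 2))
    return pairs
```

Claims with no lexical overlap at all never meet, which is exactly the behaviour you want before a semantic-similarity pass.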
For each cross-book cluster, Gemini Flash classifies how the argument evolved across the books it appears in, assigning one of six categories. Out of 656 cross-book argument clusters:
| Category | Count |
|---|---|
| Repeated | 250 |
| Refined | 238 |
| Evolved | 74 |
| Broadened | 56 |
| Narrowed | 21 |
| New evidence | 17 |
I had expected this to be the noisiest step in the pipeline — LLMs classifying intellectual evolution seemed like exactly the kind of task where hallucination and smoothing would dominate. It turned out to be one of the most useful. The classifications are consistent across re-runs, the distribution is interpretable, and the output makes cross-book patterns visible that a human reader would not notice from reading the books sequentially.
There is a serious question buried in all this: is the pipeline measuring the author, or the LLM’s judgment of the author? The entity dedup and the evolution classification are both LLM-driven, and the LLM is doing some amount of smoothing. The honest claim is that the pipeline makes visible patterns that are consistent across re-runs and that match what a careful reader would notice given enough time — but the pipeline is not a neutral instrument. For any claim about what “Hirsch does over time”, the appropriate discount has to be applied.
With those caveats in place, the cross-book view surfaces things I would not have seen from reading the books one at a time:
A composite score, with no LLM involvement. Components: book recurrence (35%), dependency centrality (20%), evidence density (15%), counter-arguments (10%), plus an additive main-conclusion bonus (30%). The top 15 claims by this score are surfaced on the landing page. Because it’s purely graph-structural, it’s reproducible and explainable in a way that an LLM-judged importance score would not be.
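A sketch of the scoring arithmetic. The weights are the ones stated above; the per-component normalisations (the caps) are assumptions, since the text does not give them:

```python
def composite_importance(claim):
    """Purely graph-structural importance: no LLM involvement, so the
    score is reproducible and explainable. Caps are illustrative."""
    score = (
        0.35 * min(claim["book_count"] / 10, 1.0)       # book recurrence
        + 0.20 * min(claim["in_degree"] / 20, 1.0)      # dependency centrality
        + 0.15 * min(claim["evidence_count"] / 5, 1.0)  # evidence density
        + 0.10 * min(claim["counter_count"] / 3, 1.0)   # counter-arguments
    )
    if claim["is_main_conclusion"]:
        score += 0.30                                   # additive bonus
    return score
```

Because the bonus is additive rather than a weight, a maximally connected main conclusion can score above 1.0, which is harmless for ranking.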
The failures are often more useful than the successes, so the partial list, for the record:
The [:12000] Phase 2 truncation. A latent bug, cost 7× in structural density on long chapters, described in §4.3.

The public reading layer (build_corpus_site.py,
live at hirsch-atlas.pages.dev) is a
119-page static site generated from a single
corpus_consolidated.json file. Page types:
A few design choices are strict:
Embedding, clustering, and LLM-judge infrastructure comes from limbic, a small data-curation library extracted from patterns recurring across several projects. The sub-packages:

- limbic.amygdala — embedding, vector and hybrid search, novelty detection, clustering, semantic whitening, calibration metrics. Provides the VectorIndex used for claim similarity in this project, the calibrate module for kappa and judge validation, and the hybrid vector+FTS5 search primitives.
- limbic.hippocampus — proposals, cascade merges, deduplication pipelines, YAML-backed persistence. Not used directly by Hirsch but shares conventions.
- limbic.cerebellum — LLM-judge orchestration, budget tracking, batch audit pipelines. Used for the evolution classification pass in this project.

The sliding-window splitter and the word-overlap dedup functions currently live inside the Hirsch repo, but they are general-purpose enough that they will be lifted into limbic.cerebellum as sliding_window_extract() and word_overlap_dedup() for reuse.
The full per-book process, end to end:
1. Per-chapter extraction: phase1_content.json, phase2_structure.json, phase3_critique.json, merged into chapter.json.
2. Consolidation: consolidate.py across all books extracted so far. Entity dedup, claim clustering, evolution classification, canonical argument merging, importance scoring.
3. Site build: build_corpus_site.py regenerates the 119-page static site from corpus_consolidated.json.

Every step is replayable in isolation. The pipeline is designed so that re-running a single phase never requires redoing upstream work, which is critical for experimentation velocity.
The list of things from this project that I think will generalise to other argument-extraction pipelines:
Build the eval harness before the pipeline. Twenty to thirty carefully-curated ground-truth items from two or three representative source-sections is enough to drive every subsequent pipeline decision. Without a numerical score, prompt tuning is theatre.
Window your inputs. Window them aggressively. Sliding-window extraction (6K chars, 1K overlap, paragraph boundaries) is the single largest lever for recall on book-length text. The hypothesis — attention degradation on long inputs — is well-supported by the literature, and the intervention is structural rather than prompt-based, which is why it works so reliably. Window both the content phase and the structure phase.
Separate content from structure. Single-pass “extract the argument” prompts always produce the 60%-content / 0%-structure failure mode. Two phases, with the content output fed as input to the structure phase, is the minimum viable architecture.
Schema compliance over reasoning depth at production scale. Cheap schema-compliant models beat expensive reasoning models that need manual JSON repair, once you are running hundreds of chapters. This is a production observation, not a capability claim.
Extract aggressively, compute importance later. Importance is a graph property and cannot be judged at extraction time. Dropping a claim because it “looks obvious” drops the premises that turn out to be load-bearing two chapters away.
Preserve narratives next to claims. Cases are not reducible to atomic propositions without loss. The narrative is what readers remember, and it is what contestability has to hook into.
Build cross-corpus canonical arguments. The cross-book view is what makes deep-extraction-from-a-bounded-corpus worth doing, and it only becomes possible once per-source extraction is reliable. At scale below ~100K items, numpy brute force with a shingling optimisation beats any vector database you might be tempted to install.
Treat counter-argument sourcing as a first-class pipeline phase, not an afterthought. LLM-synthesised counter-arguments are fluent, plausible, and subtly wrong in a way that damages the project’s epistemic standing. Named critics, attributed quotes, real provenance.
Structural changes beat prompt tuning. Not in every project, but consistently in this one. When the pipeline is underperforming, the first question should be “is the architecture right?”, not “can I rewrite the prompt?”
Latent numeric slices are bugs. Any hard-coded text length in the pipeline is a latent bug waiting to be discovered on a larger input. End-to-end assertions over input sizes catch this kind of problem immediately.
Lift the sliding-window splitter and word-overlap dedup into limbic.cerebellum. General-purpose, validated on a real corpus, overdue.

| | Count |
|---|---|
| Books extracted | 10 |
| Year range | 1977–2024 |
| Chapters | 101 |
| Claims | 10,232 |
| Cross-book argument clusters | 656 |
| Multi-book thinkers | 146 |
| Multi-book concepts | 206 |
| Multi-book cases | 57 |
| Static pages in reading layer | 119 |
| Calibration ground-truth items | 24 |
| Final eval score (weighted recall) | 0.948 (avg), 0.979 (best) |
| Eval harness variance floor | ~0.06 |
Live site: hirsch-atlas.pages.dev · Code: houshuang/hirsch-atlas · Library: houshuang/limbic