We Tested GPT-5.5, Claude, and Eternity on the Messiest Indian Ledger We Could Build

Frontier LLMs are very good at arithmetic. So we stopped asking whether an AI can sum a tidy column — it can — and asked a harder question: on the kind of messy ledger an Indian Chartered Accountant actually receives, who gets the number right, how fast, and at what cost?

We built the ugliest realistic Tally/Excel export we could, then ran the exact same 10 questions through GPT-5.5 (OpenAI, code interpreter), Claude Opus 4.8 (Anthropic, code execution), and Eternity — scored by the same independent oracle. This post shows everything: the data, the method, the full results, and the part most benchmarks hide — where we failed.

The data: a deliberately ugly Indian ledger

300 transactions across two files (a ledger and a party→state lookup), rendered the way real Indian accounting software actually exports — not clean CSV. Every trap below is something a CA sees every week:

Indian number grouping — ₹1,23,45,678.00 (lakh/crore commas, not thousands)
Currency-symbol soup — the same column mixes ₹, Rs., and INR
Crore/lakh suffixes — 2 Cr, 75 L (round figures abbreviated, as summaries really do)
Parenthesis negatives — (₹2,500.00) for credit notes
A Dr/Cr column — where “Cr” means credit, not crore (same letters, different meaning)
Subtotal & GRAND TOTAL rows interleaved in the data — double-count if you sum() blindly
Mangled categories — Consulting, consulting, CONSULTING all in one column
Dirty join keys — MAHARASHTRA, Maharashtra, maharashtra, and a sentinel -
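
To make the amount traps above concrete, here is a minimal Python sketch of a parser that handles the currency markers, Indian comma grouping, lakh/crore suffixes, and parenthesis negatives. It is an illustration under stated assumptions, not Eternity's ingestion code: the name parse_indian_amount and the exact set of accepted spellings are hypothetical.

```python
import re
from decimal import Decimal

LAKH = Decimal(100_000)       # 1 lakh  = 10^5
CRORE = Decimal(10_000_000)   # 1 crore = 10^7
_SUFFIX = {"l": LAKH, "lakh": LAKH, "cr": CRORE, "crore": CRORE}

# Assumed grammar for one amount cell (hypothetical, not a published spec).
_AMOUNT_RE = re.compile(
    r"""
    ^\s*
    (?P<open>\()?\s*                    # optional opening parenthesis (negative)
    (?P<minus>-)?\s*                    # optional leading minus
    (?:₹|Rs\.?|INR)?\s*                 # optional currency marker
    (?P<num>\d[\d,]*(?:\.\d+)?)         # digits with any comma grouping
    \s*(?P<suffix>crore|cr|lakh|l)?     # optional magnitude suffix
    \s*(?P<close>\))?\s*$
    """,
    re.IGNORECASE | re.VERBOSE,
)

def parse_indian_amount(text):
    """Return a Decimal, or None when the cell is not an amount."""
    m = _AMOUNT_RE.match(text)
    if m is None:
        return None
    # Reject unbalanced parentheses rather than guess the sign.
    if bool(m.group("open")) != bool(m.group("close")):
        return None
    value = Decimal(m.group("num").replace(",", ""))
    if m.group("suffix"):
        value *= _SUFFIX[m.group("suffix").lower()]
    if m.group("open") or m.group("minus"):
        value = -value
    return value

print(parse_indian_amount("₹1,23,45,678.00"))  # 12345678.00
print(parse_indian_amount("75 L"))             # 7500000
print(parse_indian_amount("(₹2,500.00)"))      # -2500.00
print(parse_indian_amount("Cr"))               # None: no digits, so a Dr/Cr flag
```

Because a suffix only counts when it follows digits, a bare Cr in the Dr/Cr column is never read as crore.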

A real slice of the file every engine saw

Note the same column carrying ₹, Rs., INR and crore/lakh suffixes, the case-mangled categories, and the SUBTOTAL / GRAND TOTAL rows mixed into the data:

txn_id	txn_date	party	category	amount	remarks
TXN-0001	30-11-2025	Apex Traders	Services	₹1,32,49,078.00	-
TXN-0002	07-Jul-2025	Apex Traders	Consulting	Rs. 31,59,480.00	pending - follow up
TXN-0003	03-08-2025	Bharat Industries	GOODS	75 L
TXN-0006	13-Oct-2025	Coastal Exports	Consulting	INR 15,64,055.00
SUBTOTAL				₹6,42,01,233.00	running total
GRAND TOTAL				₹2,53,97,08,926.00
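
Two of the traps visible in this slice, the rollup rows and the case-mangled categories, can be handled before any arithmetic happens. The sketch below assumes the rollup label always sits in the first column, as it does in the rows shown; the names ROLLUP_LABELS and transaction_rows are hypothetical.

```python
import csv
import io

# Assumption: rollup rows carry their label in the txn_id column.
ROLLUP_LABELS = {"SUBTOTAL", "GRAND TOTAL"}

def transaction_rows(tsv_text):
    """Yield only real transactions, with categories in one canonical case."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for row in reader:
        label = (row["txn_id"] or "").strip().upper()
        if label in ROLLUP_LABELS:
            continue  # a rollup line, not a transaction
        row["category"] = (row["category"] or "").strip().title()
        rows.append(row)
    return rows

# Rows copied from the slice above.
SAMPLE = (
    "txn_id\ttxn_date\tparty\tcategory\tamount\tremarks\n"
    "TXN-0001\t30-11-2025\tApex Traders\tServices\t₹1,32,49,078.00\t-\n"
    "TXN-0003\t03-08-2025\tBharat Industries\tGOODS\t75 L\n"
    "SUBTOTAL\t\t\t\t₹6,42,01,233.00\trunning total\n"
    "GRAND TOTAL\t\t\t\t₹2,53,97,08,926.00\n"
)

rows = transaction_rows(SAMPLE)
print([(r["txn_id"], r["category"]) for r in rows])
# [('TXN-0001', 'Services'), ('TXN-0003', 'Goods')]
```

Filtering before summing is what keeps a blind sum() from double-counting, and the same filter keeps rollup lines out of any row count.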

The method (so you can trust — and reproduce — it)

The ground truth is computed from a clean copy of the data using pandas, independently of any engine. The engines see only the messy export, and the oracle never parses it, so no engine's parsing quirks can tilt the scoring: it is fair to all three by construction.

Every engine got the same questions and the same scoring: a number within 0.5% of truth is exact, within 5% is close, otherwise wrong, and a confident wrong answer is flagged hallucinated. The competitors ran frontier models (not cheaper tiers) in their own code-interpreter sandboxes, with the same two files uploaded. The full dataset, runners, and scorer are open in our repo; the numbers below regenerate with one command.
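
The scoring rule is simple enough to sketch in a few lines. This is not the scorer from the repo: the function name score, the inclusive boundaries, and the confident flag (how confidence is detected is not described here) are all assumptions.

```python
def score(answer, truth, confident=True):
    """Classify a numeric answer by its relative error against ground truth."""
    if truth == 0:
        # Relative error is undefined at zero; only an exact zero passes.
        rel_err = 0.0 if answer == 0 else float("inf")
    else:
        rel_err = abs(answer - truth) / abs(truth)

    if rel_err <= 0.005:   # within 0.5% (boundary assumed inclusive)
        return "exact"
    if rel_err <= 0.05:    # within 5%
        return "close"
    # Assumption: the caller decides whether the answer was stated confidently.
    return "hallucinated" if confident else "wrong"

print(score(100.4, 100))                   # exact
print(score(103, 100))                     # close
print(score(120, 100))                     # hallucinated
print(score(120, 100, confident=False))    # wrong
```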

The results

Ten questions on the messy ledger. Accuracy (answers scored exact or close, out of ten), average response time, and total tokens consumed:

Engine	Accuracy	Avg response time	Tokens (10 queries)
Eternity	10 / 10	5.5 s	schema-only*
GPT-5.5 (code interpreter)	10 / 10	39.9 s	45,889
Claude Opus 4.8 (code execution)	8 / 10	24.5 s	246,658

*Eternity sends a compact schema plus a few sample values to a small model for planning, and only small aggregated results when it writes a narrative — never your full dataset. See the cost section.

Per question, naked, including the headline net total (on the scale of ₹254 crore, going by the ₹2,53,97,08,926.00 GRAND TOTAL row in the slice above) and the cross-file top-state join:

Question	Eternity	GPT-5.5	Claude Opus 4.8
Net total of all amounts	✅ exact	✅ exact	❌ hallucinated
Largest single transaction	✅ exact	✅ exact	✅ exact
Total — Consulting	✅ exact	✅ exact	✅ exact
Total — Goods	✅ exact	✅ exact	✅ exact
Total refunds (negatives)	✅ exact	✅ exact	✅ exact
Refund count	✅ exact	✅ exact	✅ exact
Transaction count (total rows present)	✅ exact	✅ exact	✅ exact
March 2025 count	🟡 close	✅ exact	✅ exact
Total — Maharashtra (join)	✅ exact	✅ exact	✅ exact
Top state + name (join + rank)	✅ exact	✅ exact	❌ hallucinated
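
The two join questions only come out right if the state key is normalized before grouping. The sketch below shows one way to do that; the layout of the party→state lookup is assumed, the data is toy data for illustration (not rows from the benchmark files), and the names normalize_state and totals_by_state are hypothetical.

```python
from collections import defaultdict

MISSING = {"", "-"}  # "-" is the sentinel for an unknown state

def normalize_state(raw):
    """Collapse case variants to one spelling; map sentinels to None."""
    key = (raw or "").strip()
    return None if key in MISSING else key.title()

def totals_by_state(transactions, party_to_state):
    """Sum amounts per normalized state, skipping parties with no state."""
    totals = defaultdict(int)
    for party, amount in transactions:
        state = normalize_state(party_to_state.get(party))
        if state is not None:
            totals[state] += amount
    return dict(totals)

# Toy data for illustration only.
party_to_state = {
    "Party A": "MAHARASHTRA",
    "Party B": "maharashtra",
    "Party C": "Gujarat",
    "Party D": "-",
}
transactions = [("Party A", 100), ("Party B", 50), ("Party C", 120), ("Party D", 500)]

totals = totals_by_state(transactions, party_to_state)
top_state = max(totals, key=totals.get)
print(totals)     # {'Maharashtra': 150, 'Gujarat': 120}
print(top_state)  # Maharashtra
```

Without the normalization step, the toy data would rank the sentinel - first and split Maharashtra into two smaller groups, which is the shape of error a join-and-rank question invites.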

Us, naked: we scored 0/10 first

Here's the part we're not hiding. The first time we ran Eternity on this file, it scored 0 out of 10. It could not read 75 L or Rs. 31,59,480 — our ingestion stripped ₹ and commas but didn't understand Indian magnitude suffixes or the Rs./INR prefixes, so the amount column stayed as text and every total collapsed.

That is exactly why we run benchmarks before we write marketing. We added lakh/crore/Rs./INR parsing (with a guard so it never mis-reads a genuine text column), re-ran, and went from 0/10 to 10/10 — the same day. The one remaining yellow mark (a close on the March count) is a date-edge we're still tightening. We'd rather show you the 0/10 and the fix than a suspiciously perfect score.
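
One plausible shape for such a guard is a column-level check: convert a column to numbers only when nearly all of its non-blank cells look like amounts. This is a sketch of the idea, not the guard Eternity ships; the pattern, the 0.9 threshold, and the name is_amount_column are assumptions.

```python
import re

# Loose "looks like an amount" test for a single cell (assumed pattern).
AMOUNT_LIKE = re.compile(
    r"^\(?\s*-?\s*(?:₹|Rs\.?|INR)?\s*\d[\d,]*(?:\.\d+)?\s*(?:crore|cr|lakh|l)?\s*\)?$",
    re.IGNORECASE,
)
BLANKS = {"", "-"}

def is_amount_column(values, min_share=0.9):
    """True when at least min_share of the non-blank cells look like amounts."""
    present = [v.strip() for v in values if v.strip() not in BLANKS]
    if not present:
        return False
    hits = sum(1 for v in present if AMOUNT_LIKE.match(v))
    return hits / len(present) >= min_share

amounts = ["₹1,32,49,078.00", "Rs. 31,59,480.00", "75 L", "INR 15,64,055.00"]
print(is_amount_column(amounts))                                   # True
print(is_amount_column(["Consulting", "consulting", "CONSULTING"]))  # False
print(is_amount_column(["Dr", "Cr", "Cr"]))                        # False
```

Deciding per column rather than per cell is the design choice that matters: a stray numeric-looking cell in a remarks column does not flip the whole column to numbers.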

Them, naked: a frontier model hallucinated the most important number

GPT-5.5 was flawless on accuracy (10/10) — genuinely impressive. But Claude Opus 4.8 confidently returned the wrong net total — the single most important figure in the whole ledger — and the wrong top state, with no warning that anything was off. That's not a typo; that's a number a CA would file.

And both frontier models are slow: an average of 24.5 s (Claude) to 39.9 s (GPT-5.5) per question versus Eternity's 5.5 s. For one-off curiosity that's fine. For a CA running fifty questions across a hundred client files, it isn't.

The cost: your data isn't what gets tokenized

This is the part that doesn't show up in an accuracy score. A code interpreter keeps the model in the loop on every single query: the model writes code against your file, reads back whatever that code prints, and reasons over it, and all of that is billed as tokens. That's why Claude burned 246,658 tokens answering ten questions about a 19 KB file, and GPT-5.5 used 45,889. At Opus 4.8's published rate, Claude's run cost about $1.47 to answer ten questions a CA would expect for free. On a 50 MB file, that approach gets expensive fast, or simply exceeds the model's context and fails.
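
The per-query figures follow directly from the totals quoted above. The short calculation below uses only those numbers; the "implied rate" is a blended figure that assumes the $1.47 covers all 246,658 tokens with no split between input and output pricing.

```python
QUERIES = 10
TOKENS = {"GPT-5.5": 45_889, "Claude Opus 4.8": 246_658}
CLAUDE_RUN_COST_USD = 1.47

# Average tokens spent per question by each engine.
tokens_per_query = {name: total / QUERIES for name, total in TOKENS.items()}

# How many times more tokens Claude used than GPT-5.5 on the same ten questions.
token_ratio = TOKENS["Claude Opus 4.8"] / TOKENS["GPT-5.5"]

# Cost per question, and the blended rate implied by the run total (assumption:
# no input/output price split).
cost_per_query_usd = CLAUDE_RUN_COST_USD / QUERIES
implied_usd_per_million = CLAUDE_RUN_COST_USD / TOKENS["Claude Opus 4.8"] * 1_000_000

print(tokens_per_query)                    # {'GPT-5.5': 4588.9, 'Claude Opus 4.8': 24665.8}
print(round(token_ratio, 1))               # 5.4
print(round(cost_per_query_usd, 3))        # 0.147
print(round(implied_usd_per_million, 2))   # 5.96
```

So Claude averaged roughly 24,700 tokens and about 15 cents per question, a little over five times GPT-5.5's token count.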

Eternity works the opposite way. It sends a small model the schema and a handful of sample values — never your full dataset — gets back a query plan, and runs the actual math as deterministic SQL inside a database. The aggregation never happens inside a model. (When it writes a narrative, it sends only the small aggregated result, not your rows.) So the model's input is bounded by your schema, not your row count: whether the file is 300 rows or 30 million, the model sees the same compact summary — and the heavy lifting is SQL, which is effectively free.
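
The flow described above can be sketched with Python's built-in sqlite3 module. This is an illustration of the pattern, not Eternity's code: the table layout, the schema_summary helper, and the hard-coded plan_sql standing in for the planner model's reply are all hypothetical. The amounts are the four transaction rows from the slice, already parsed to plain numbers.

```python
import sqlite3

def schema_summary(conn, table, samples=3):
    """Describe a table by column name, type, and a few sample values.

    Table and column names are interpolated into SQL, so they must come
    from trusted metadata, never from user input.
    """
    lines = []
    for _cid, name, col_type, *_rest in conn.execute(f"PRAGMA table_info({table})"):
        rows = conn.execute(
            f"SELECT DISTINCT {name} FROM {table} "
            f"WHERE {name} IS NOT NULL ORDER BY {name} LIMIT ?",
            (samples,),
        ).fetchall()
        lines.append(f"{name} {col_type}: {[r[0] for r in rows]}")
    return "\n".join(lines)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ledger (txn_id TEXT, category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO ledger VALUES (?, ?, ?)",
    [
        ("TXN-0001", "Services", 13249078.0),
        ("TXN-0002", "Consulting", 3159480.0),
        ("TXN-0003", "Goods", 7500000.0),
        ("TXN-0006", "Consulting", 1564055.0),
    ],
)

# Step 1: only this compact text would go to the planner model.
summary = schema_summary(conn, "ledger")

# Step 2: stand-in for the query plan the model would return (hypothetical).
plan_sql = "SELECT category, SUM(amount) FROM ledger GROUP BY category ORDER BY category"

# Step 3: the aggregation runs in the database, not in a model.
totals = conn.execute(plan_sql).fetchall()
print(summary)
print(totals)
```

The summary has one line per column, so its size tracks the number of columns rather than the number of rows, and the same SQL text can be shown to the user as the audit trail for the answer.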

The honest verdict

We are not claiming to be more accurate than GPT-5.5. On this test both engines scored 10/10 under the scoring rule, and GPT-5.5 did it with ten exact answers to our nine exact and one close. On clean data frontier LLMs are excellent. Anyone selling you “our AI is more accurate than ChatGPT” is hand-waving.

What the messiest ledger actually shows is different, and more durable: accuracy level with the best frontier run, roughly 4× to 7× faster (5.5 s versus 24.5 s and 39.9 s), every answer backed by visible SQL you can audit, at a fraction of the token cost. And it doesn't silently count the GRAND TOTAL row as a transaction or fabricate the net total. For a number you have to sign your name under, that's the difference that matters.

Reproduce it: the dataset, the three runners, and the scorer are in our repo — one command regenerates every number above.
Try it on your own mess: upload a real export and see the SQL behind every answer.