🧾

We Tested GPT-5.5, Claude, and Eternity on the Messiest Indian Ledger We Could Build

Eternity Team · 9 min read

Frontier LLMs are very good at arithmetic. So we stopped asking whether an AI can sum a tidy column (it can) and asked a harder question: on the kind of messy ledger an Indian Chartered Accountant actually receives, who gets the number right, how fast, and at what cost?

We built the ugliest realistic Tally/Excel export we could, then ran the exact same 10 questions through GPT-5.5 (OpenAI, code interpreter), Claude Opus 4.8 (Anthropic, code execution), and Eternity, all scored by the same independent oracle. This post shows everything: the data, the method, the full results, and the part most benchmarks hide, which is where we failed.

The data: a deliberately ugly Indian ledger

300 transactions across two files (a ledger and a party→state lookup), rendered the way real Indian accounting software actually exports — not clean CSV. Every trap below is something a CA sees every week:

  • Indian number grouping: ₹1,23,45,678.00 (lakh/crore commas, not thousands)
  • Currency-symbol soup: the same column mixes ₹, Rs., and INR
  • Crore/lakh suffixes: 2 Cr, 75 L (round figures abbreviated, as summaries really do)
  • Parenthesis negatives: (₹2,500.00) for credit notes
  • A Dr/Cr column, where “Cr” means credit, not crore (same letters, different meaning)
  • Subtotal & GRAND TOTAL rows interleaved in the data: double-count if you sum() blindly
  • Mangled categories: Consulting, consulting, CONSULTING all in one column
  • Dirty join keys: MAHARASHTRA, Maharashtra, maharashtra, and a sentinel -
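The structural traps in this list (summary rows, case-mangled labels, the `-` sentinel) can be handled before any arithmetic happens. The following is a minimal sketch, not the code any of the three engines ran: the helper names are hypothetical, the rows are plain dicts keyed by the column names from the sample slice, and it assumes the SUBTOTAL / GRAND TOTAL label can appear in any cell of a summary row.

```python
# Hypothetical helpers for illustration; not taken from any engine in this benchmark.
SUMMARY_LABELS = {"SUBTOTAL", "GRAND TOTAL"}


def is_summary_row(row):
    """True when any cell holds a SUBTOTAL / GRAND TOTAL label."""
    return any(str(v).strip().upper() in SUMMARY_LABELS for v in row.values())


def normalize_label(value):
    """Collapse case and whitespace variants; map the '-' sentinel to None."""
    text = " ".join(str(value or "").split())
    if text in {"", "-"}:
        return None
    return text.title()


def clean_rows(rows):
    """Drop summary rows and normalize the category column."""
    cleaned = []
    for row in rows:
        if is_summary_row(row):
            continue  # never sum or count a running total as a transaction
        fixed = dict(row)
        fixed["category"] = normalize_label(row.get("category"))
        cleaned.append(fixed)
    return cleaned


rows = [
    {"txn_id": "TXN-0002", "category": " Consulting ", "amount": "Rs. 31,59,480.00"},
    {"txn_id": "TXN-0003", "category": "GOODS", "amount": "75 L"},
    {"txn_id": "SUBTOTAL", "category": "", "amount": "₹6,42,01,233.00"},
]
cleaned = clean_rows(rows)
print([r["category"] for r in cleaned])
```

The same `normalize_label` call can be applied to both sides of the party→state join, so that MAHARASHTRA, Maharashtra, and maharashtra land on one key.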

A real slice of the file every engine saw

Note the same column carrying ₹, Rs., INR and crore/lakh suffixes, the case-mangled categories, and the SUBTOTAL / GRAND TOTAL rows mixed into the data:

| txn_id | txn_date | party | category | amount | remarks |
|---|---|---|---|---|---|
| TXN-0001 | 30-11-2025 | Apex Traders | Services | ₹1,32,49,078.00 | - |
| TXN-0002 | 07-Jul-2025 | Apex Traders | Consulting | Rs. 31,59,480.00 | pending - follow up |
| TXN-0003 | 03-08-2025 | Bharat Industries | GOODS | 75 L | |
| TXN-0006 | 13-Oct-2025 | Coastal Exports | Consulting | INR 15,64,055.00 | |
| SUBTOTAL | | | | ₹6,42,01,233.00 | running total |
| GRAND TOTAL | | | | ₹2,53,97,08,926.00 | |

The method (so you can trust it and reproduce it)

The ground truth is computed from a clean copy of the data using pandas, independently of any engine. The messy CSV is only what the engines see. So the oracle can't be fooled by anyone's parsing; it's fair to all three by construction.

Every engine got the same questions and the same scoring: a number within 0.5% of truth is exact, within 5% is close, otherwise wrong, and a confident wrong answer is flagged hallucinated. The competitors ran frontier models (not cheap ones) with the same files uploaded to their code-interpreter sandboxes. The full dataset, runners, and scorer are open in our repo; the numbers below regenerate with one command.
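The scoring rule can be sketched in a few lines. This is an illustration of the thresholds described above, not the scorer from the repo. The function name, the inclusive boundaries, and the `hedged` flag (standing in for however the real scorer decides an answer was not confident) are all assumptions.

```python
def score(answer, truth, hedged=False, exact_tol=0.005, close_tol=0.05):
    """Classify a numeric answer against the ground truth.

    exact: within 0.5% of truth; close: within 5%; otherwise wrong,
    or hallucinated when the wrong answer came with no caveat.
    """
    if truth == 0:
        # Relative error is undefined at zero; only an exact zero counts.
        rel_error = 0.0 if answer == 0 else float("inf")
    else:
        rel_error = abs(answer - truth) / abs(truth)

    if rel_error <= exact_tol:
        return "exact"
    if rel_error <= close_tol:
        return "close"
    return "wrong" if hedged else "hallucinated"


print(score(100.4, 100.0))               # inside the 0.5% band
print(score(103.0, 100.0))               # inside the 5% band
print(score(120.0, 100.0))               # confident and far off
print(score(120.0, 100.0, hedged=True))  # far off, but flagged by the engine
```

Taking the error relative to `abs(truth)` keeps the bands symmetric for negative totals such as refunds.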

The results

Ten questions on the messy ledger. Accuracy, average response time, and total tokens consumed:

| Engine | Accuracy | Avg response time | Tokens (10 queries) |
|---|---|---|---|
| Eternity | 10 / 10 | 5.5 s | schema-only* |
| GPT-5.5 (code interpreter) | 10 / 10 | 39.9 s | 45,889 |
| Claude Opus 4.8 (code execution) | 8 / 10 | 24.5 s | 246,658 |

*Eternity sends a compact schema plus a few sample values to a small model for planning, and only small aggregated results when it writes a narrative, never your full dataset. See the cost section.

Per question, naked, including the headline net total (₹2.54 crore-scale) and the cross-file top state join:

| Question | Eternity | GPT-5.5 | Claude Opus 4.8 |
|---|---|---|---|
| Net total of all amounts | ✅ exact | ✅ exact | ❌ hallucinated |
| Largest single transaction | ✅ exact | ✅ exact | ✅ exact |
| Total: Consulting | ✅ exact | ✅ exact | ✅ exact |
| Total: Goods | ✅ exact | ✅ exact | ✅ exact |
| Total refunds (negatives) | ✅ exact | ✅ exact | ✅ exact |
| Refund count | ✅ exact | ✅ exact | ✅ exact |
| Transaction count (total rows present) | ✅ exact | ✅ exact | ✅ exact |
| March 2025 count | 🟡 close | ✅ exact | ✅ exact |
| Total: Maharashtra (join) | ✅ exact | ✅ exact | ✅ exact |
| Top state + name (join + rank) | ✅ exact | ✅ exact | ❌ hallucinated |

Us, naked: we scored 0/10 first

Here's the part we're not hiding. The first time we ran Eternity on this file, it scored 0 out of 10. It could not read 75 L or Rs. 31,59,480: our ingestion stripped ₹ and commas but didn't understand Indian magnitude suffixes or the Rs./INR prefixes, so the amount column stayed as text and every total collapsed.

That is exactly why we run benchmarks before we write marketing. We added lakh/crore/Rs./INR parsing (with a guard so it never mis-reads a genuine text column), re-ran, and went from 0/10 to 10/10 the same day. The one remaining yellow mark (a close on the March count) is a date edge case we're still tightening. We'd rather show you the 0/10 and the fix than a suspiciously perfect score.
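To make the fix concrete, here is one way such a parser and guard could look. It is a sketch under stated assumptions, not Eternity's ingestion code: the function names, the accepted suffix spellings, and the 90% guard threshold are all invented for illustration.

```python
import re

# Assumed suffix spellings; the post only shows "Cr" and "L".
_MULTIPLIERS = {"cr": 10_000_000, "crore": 10_000_000,
                "l": 100_000, "lakh": 100_000, "lac": 100_000}

# Optional currency prefix, Indian-grouped digits, optional magnitude suffix.
_AMOUNT_RE = re.compile(
    r"^(?:₹|rs\.?|inr)?\s*(\d[\d,]*(?:\.\d+)?)\s*(cr|crore|l|lakh|lac)?$",
    re.IGNORECASE,
)


def parse_amount(text):
    """Return the amount as a float, or None when the cell is not an amount."""
    s = str(text).strip()
    negative = s.startswith("(") and s.endswith(")")  # (₹2,500.00) is a credit note
    if negative:
        s = s[1:-1].strip()
    if s.startswith("-"):
        negative, s = True, s[1:].strip()
    match = _AMOUNT_RE.match(s)
    if not match:
        return None
    value = float(match.group(1).replace(",", ""))
    value *= _MULTIPLIERS.get((match.group(2) or "").lower(), 1)
    return -value if negative else value


def looks_numeric(cells, threshold=0.9):
    """Guard: convert a column only when most non-empty cells parse as amounts."""
    filled = [c for c in cells if str(c).strip() not in {"", "-"}]
    if not filled:
        return False
    parsed = sum(parse_amount(c) is not None for c in filled)
    return parsed / len(filled) >= threshold


for cell in ["₹1,32,49,078.00", "Rs. 31,59,480.00", "75 L", "2 Cr", "(₹2,500.00)"]:
    print(cell, "->", parse_amount(cell))
```

Because the pattern requires a leading digit, a bare `Cr` or `Dr` in the Dr/Cr column does not parse, so the guard leaves that column as text. Applying the Dr/Cr sign to the amount is a separate step this sketch does not cover.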

Them, naked: a frontier model hallucinated the most important number

GPT-5.5 was flawless on accuracy (10/10), which is genuinely impressive. But Claude Opus 4.8 confidently returned the wrong net total, the single most important figure in the whole ledger, and the wrong top state, with no warning that anything was off. That's not a typo; that's a number a CA would file.

And both frontier models are slow: 24–40 seconds per question versus Eternity's ~5. For one-off curiosity that's fine. For a CA running fifty questions across a hundred client files, it isn't.

The cost: your data isn't what gets tokenized

This is the part that doesn't show up in an accuracy score. A code interpreter answers by loading your entire file into the model and reasoning over it on every single query. That's why Claude burned 246,658 tokens answering ten questions about a 19 KB file, and GPT-5.5 used 45,889. At Opus 4.8's published rate, Claude's run cost about $1.47 to answer ten questions a CA would expect for free. On a 50 MB file, that approach gets expensive fast, or simply exceeds the model's context and fails.

Eternity works the opposite way. It sends a small model the schema and a handful of sample values (never your full dataset), gets back a query plan, and runs the actual math as deterministic SQL inside a database. The aggregation never happens inside a model. (When it writes a narrative, it sends only the small aggregated result, not your rows.) So the model's input is bounded by your schema, not your row count: whether the file is 300 rows or 30 million, the model sees the same compact summary, and the heavy lifting is SQL, which is effectively free.
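The shape of that pipeline can be illustrated with the standard-library `sqlite3` module. This is a toy sketch, not Eternity's implementation: the table layout, the helper names, and the hard-coded `plan_sql` (standing in for whatever a planning model would return) are assumptions made for the example.

```python
import json
import sqlite3


def build_ledger(n_rows):
    """Toy in-memory ledger with synthetic rows (not the benchmark data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ledger (txn_id TEXT, category TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO ledger VALUES (?, ?, ?)",
        ((f"TXN-{i:04d}", "Consulting" if i % 2 else "Goods", 1000.0)
         for i in range(n_rows)),
    )
    return conn


def schema_summary(conn, table, samples=3):
    """Column names, declared types, and a few sample values per column."""
    summary = {}
    for col in conn.execute(f'PRAGMA table_info("{table}")').fetchall():
        name, col_type = col[1], col[2]
        rows = conn.execute(
            f'SELECT DISTINCT "{name}" FROM "{table}" '
            f'WHERE "{name}" IS NOT NULL LIMIT ?',
            (samples,),
        ).fetchall()
        summary[name] = {"type": col_type, "samples": [r[0] for r in rows]}
    return summary


conn = build_ledger(300)

# This JSON is all a planning model would see: schema plus samples, no full rows.
payload = json.dumps(schema_summary(conn, "ledger"))

# Stand-in for the query plan the model would send back.
plan_sql = ("SELECT category, SUM(amount) FROM ledger "
            "GROUP BY category ORDER BY category")
totals = conn.execute(plan_sql).fetchall()  # the arithmetic runs in the database

print(len(payload), "characters sent to the model")
print(totals)
```

Rebuilding the ledger with a much larger `n_rows` changes how long the SQL takes, but the payload stays a few hundred characters, because it depends on the number of columns and samples rather than the number of rows.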

The honest verdict

We are not claiming to be more accurate than GPT-5.5. On this test we tied it at 10/10, and on clean data frontier LLMs are excellent. Anyone selling you “our AI is more accurate than ChatGPT” is hand-waving.

What the messiest ledger actually shows is different, and more durable: equal top-tier accuracy, roughly 4× to 7× faster (5.5 s per question versus 24.5 s for Claude and 39.9 s for GPT-5.5), every answer backed by visible SQL you can audit, at a fraction of the token cost. And it doesn't silently count the GRAND TOTAL row as a transaction or fabricate the net total. For a number you have to sign your name under, that's the difference that matters.

  • Reproduce it: the dataset, the three runners, and the scorer are in our repo; one command regenerates every number above.
  • Try it on your own mess: upload a real export and see the SQL behind every answer.
