Frontier LLMs are very good at arithmetic. So we stopped asking whether an AI can sum a tidy column (it can) and asked a harder question: on the kind of messy ledger an Indian Chartered Accountant actually receives, who gets the number right, how fast, and at what cost?
We built the ugliest realistic Tally/Excel export we could, then ran the exact same 10 questions through GPT-5.5 (OpenAI, code interpreter), Claude Opus 4.8 (Anthropic, code execution), and Eternity, all scored by the same independent oracle. This post shows everything: the data, the method, the full results, and the part most benchmarks hide: where we failed.
The data: a deliberately ugly Indian ledger
300 transactions across two files (a ledger and a party-to-state lookup), rendered the way real Indian accounting software actually exports, not as clean CSV. Every trap below is something a CA sees every week:
- Indian number grouping: `₹1,23,45,678.00` (lakh/crore commas, not thousands)
- Currency-symbol soup: the same column mixes `₹`, `Rs.`, and `INR`
- Crore/lakh suffixes: `2 Cr`, `75 L` (round figures abbreviated, as summaries really do)
- Parenthesis negatives: `(₹2,500.00)` for credit notes
- A Dr/Cr column, where "Cr" means credit, not crore (same letters, different meaning)
- Subtotal & GRAND TOTAL rows interleaved in the data: you double-count if you `sum()` blindly
- Mangled categories: `Consulting`, `consulting`, `CONSULTING` all in one column
- Dirty join keys: `MAHARASHTRA`, `Maharashtra`, `maharashtra`, and a sentinel-
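Two of the traps above are structural rather than numeric: summary rows mixed into the data, and keys that differ only by case or whitespace. The following is a minimal sketch of how one might neutralize them before any aggregation. The row layout, the `SUMMARY_LABELS` set, and the assumption that summary labels appear in the `txn_id` column are illustrative guesses, not taken from the benchmark repo.

```python
# Hypothetical rows shaped like the export described above (not the real dataset).
rows = [
    {"txn_id": "TXN-0001", "category": "Consulting", "state": "MAHARASHTRA"},
    {"txn_id": "TXN-0002", "category": "consulting", "state": "Maharashtra"},
    {"txn_id": "SUBTOTAL", "category": "", "state": ""},
    {"txn_id": "TXN-0003", "category": "CONSULTING", "state": " maharashtra "},
    {"txn_id": "GRAND TOTAL", "category": "", "state": ""},
]

# Assumption: summary rows are identifiable by a label in the txn_id column.
SUMMARY_LABELS = {"SUBTOTAL", "GRAND TOTAL"}


def is_summary_row(row):
    """True for SUBTOTAL / GRAND TOTAL rows that must not be summed or counted."""
    return row["txn_id"].strip().upper() in SUMMARY_LABELS


def normalize_key(value):
    """Collapse whitespace and case so 'MAHARASHTRA' and ' maharashtra ' compare equal."""
    return " ".join(value.split()).casefold()


def clean(rows):
    """Drop summary rows and normalize the category and state keys."""
    cleaned = []
    for row in rows:
        if is_summary_row(row):
            continue
        cleaned.append(
            {
                **row,
                "category": normalize_key(row["category"]),
                "state": normalize_key(row["state"]),
            }
        )
    return cleaned


transactions = clean(rows)
print(len(transactions))                       # 3
print({r["category"] for r in transactions})   # {'consulting'}
```

Normalizing once, up front, means every later group-by or join sees a single spelling per category and state.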
A real slice of the file every engine saw
Note the same column carrying ₹, Rs., INR and crore/lakh suffixes, the case-mangled categories, and the SUBTOTAL / GRAND TOTAL rows mixed into the data:
| txn_id | txn_date | party | category | amount | remarks |
|---|---|---|---|---|---|
| TXN-0001 | 30-11-2025 | Apex Traders | Services | ₹1,32,49,078.00 | - |
| TXN-0002 | 07-Jul-2025 | Apex Traders | Consulting | Rs. 31,59,480.00 | pending - follow up |
| TXN-0003 | 03-08-2025 | Bharat Industries | GOODS | 75 L | |
| TXN-0006 | 13-Oct-2025 | Coastal Exports | Consulting | INR 15,64,055.00 | |
| SUBTOTAL | | | | ₹6,42,01,233.00 | running total |
| GRAND TOTAL | | | | ₹2,53,97,08,926.00 | |
The method (so you can trust it, and reproduce it)
The ground truth is computed from a clean copy of the data using pandas, independently of any engine. The messy CSV is only what the engines see. So the oracle can't be fooled by anyone's parsing: it's fair to all three by construction.
Every engine got the same questions and the same scoring: a number within 0.5% of truth is exact, within 5% is close, otherwise wrong, and a confident wrong answer is flagged hallucinated. The competitors were frontier models (not cheaper tiers), run in their own code-interpreter sandboxes with the same files uploaded. The full dataset, runners, and scorer are open in our repo; the numbers below regenerate with one command.
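The scoring rule can be sketched in a few lines. This is an illustration of the thresholds stated above, not the scorer from the repo: the function name `score`, the `confident` flag, the inclusive boundaries, and the handling of a zero ground truth are all assumptions.

```python
def score(answer, truth, confident=True):
    """Classify one numeric answer against the ground truth.

    Thresholds follow the text: within 0.5% is 'exact', within 5% is 'close'.
    Assumption: a miss is 'hallucinated' when the engine stated it confidently,
    and plain 'wrong' otherwise (including when no number was returned).
    """
    if answer is None:
        return "wrong"
    if truth == 0:
        # Relative error is undefined at zero; treat only an exact zero as correct.
        rel_err = 0.0 if answer == 0 else float("inf")
    else:
        rel_err = abs(answer - truth) / abs(truth)
    if rel_err <= 0.005:
        return "exact"
    if rel_err <= 0.05:
        return "close"
    return "hallucinated" if confident else "wrong"


print(score(100.4, 100))                   # exact
print(score(103, 100))                     # close
print(score(150, 100))                     # hallucinated
print(score(150, 100, confident=False))    # wrong
```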
The results
Ten questions on the messy ledger. Accuracy, average response time, and total tokens consumed:
| Engine | Accuracy | Avg response time | Tokens (10 queries) |
|---|---|---|---|
| Eternity | 10 / 10 | 5.5 s | schema-only* |
| GPT-5.5 (code interpreter) | 10 / 10 | 39.9 s | 45,889 |
| Claude Opus 4.8 (code execution) | 8 / 10 | 24.5 s | 246,658 |
*Eternity sends a compact schema plus a few sample values to a small model for planning, and only small aggregated results when it writes a narrative, never your full dataset. See the cost section.
Per question, naked, including the headline net total (₹2.54 crore-scale) and the cross-file top-state join:
| Question | Eternity | GPT-5.5 | Claude Opus 4.8 |
|---|---|---|---|
| Net total of all amounts | ✅ exact | ✅ exact | ❌ hallucinated |
| Largest single transaction | ✅ exact | ✅ exact | ✅ exact |
| Total: Consulting | ✅ exact | ✅ exact | ✅ exact |
| Total: Goods | ✅ exact | ✅ exact | ✅ exact |
| Total refunds (negatives) | ✅ exact | ✅ exact | ✅ exact |
| Refund count | ✅ exact | ✅ exact | ✅ exact |
| Transaction count (total rows present) | ✅ exact | ✅ exact | ✅ exact |
| March 2025 count | 🟡 close | ✅ exact | ✅ exact |
| Total: Maharashtra (join) | ✅ exact | ✅ exact | ✅ exact |
| Top state + name (join + rank) | ✅ exact | ✅ exact | ❌ hallucinated |
Us, naked: we scored 0/10 first
Here's the part we're not hiding. The first time we ran Eternity on this file, it scored 0 out of 10. It could not read `75 L` or `Rs. 31,59,480`: our ingestion stripped ₹ and commas but didn't understand Indian magnitude suffixes or the Rs./INR prefixes, so the amount column stayed as text and every total collapsed.
That is exactly why we run benchmarks before we write marketing. We added lakh/crore/Rs./INR parsing (with a guard so it never mis-reads a genuine text column), re-ran, and went from 0/10 to 10/10 the same day. The one remaining yellow mark (a close on the March count) is a date-edge we're still tightening. We'd rather show you the 0/10 and the fix than a suspiciously perfect score.
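To make the fix concrete, here is a minimal sketch of what such parsing plus a column-level guard could look like. It is not Eternity's actual ingestion code: `parse_amount`, `convert_column`, the regular expression, and the 90% threshold are all hypothetical choices made for illustration.

```python
import re
from decimal import Decimal

# Indian magnitude suffixes: 1 lakh = 1,00,000 and 1 crore = 1,00,00,000.
MULTIPLIERS = {
    "l": Decimal(100_000),
    "lakh": Decimal(100_000),
    "cr": Decimal(10_000_000),
    "crore": Decimal(10_000_000),
}

# Optional currency prefix, a digit-led number with commas, optional suffix.
AMOUNT_RE = re.compile(
    r"^(?:₹|rs\.?|inr)?\s*(\d[\d,]*(?:\.\d+)?)\s*(lakh|crore|cr|l)?$",
    re.IGNORECASE,
)


def parse_amount(text):
    """Return a Decimal for strings like '₹1,32,49,078.00', '75 L' or '(₹2,500.00)'.

    Returns None when the text is not an amount. A bare 'Cr' from a Dr/Cr
    column has no digits, so it is rejected rather than read as crore.
    """
    s = text.strip()
    negative = s.startswith("(") and s.endswith(")")
    if negative:
        s = s[1:-1].strip()
    m = AMOUNT_RE.match(s)
    if not m:
        return None
    value = Decimal(m.group(1).replace(",", ""))
    value *= MULTIPLIERS.get((m.group(2) or "").lower(), Decimal(1))
    return -value if negative else value


def convert_column(values, min_parsed=0.9):
    """Guard: convert a column only if nearly all non-blank cells parse as amounts.

    Assumption: 90% is an arbitrary illustrative threshold. Below it the
    column is returned untouched, so genuine text columns stay text.
    """
    non_blank = [v for v in values if v.strip()]
    hits = sum(parse_amount(v) is not None for v in non_blank)
    if not non_blank or hits / len(non_blank) < min_parsed:
        return list(values)
    return [parse_amount(v) if v.strip() else None for v in values]


print(convert_column(["₹1,32,49,078.00", "75 L", "(₹2,500.00)"]))
print(convert_column(["Consulting", "GOODS", "Cr"]))  # left as text
```

`Decimal` is used instead of `float` so that paise-level figures are not subject to binary rounding.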
Them, naked: a frontier model hallucinated the most important number
GPT-5.5 was flawless on accuracy (10/10), which is genuinely impressive. But Claude Opus 4.8 confidently returned the wrong net total (the single most important figure in the whole ledger) and the wrong top state, with no warning that anything was off. That's not a typo; that's a number a CA would file.
And both frontier models are slow: on average 24 to 40 seconds per question versus Eternity's 5.5. For one-off curiosity that's fine. For a CA running fifty questions across a hundred client files, it isn't.
The cost: your data isn't what gets tokenized
This is the part that doesn't show up in an accuracy score. A code interpreter answers by loading your entire file into the model and reasoning over it on every single query. That's why Claude burned 246,658 tokens answering ten questions about a 19 KB file, and GPT-5.5 used 45,889. At Opus 4.8's published rate, Claude's run cost about $1.47 to answer ten questions a CA would expect for free. On a 50 MB file, that approach gets expensive fast, or simply exceeds the model's context and fails.
Eternity works the opposite way. It sends a small model the schema and a handful of sample values (never your full dataset), gets back a query plan, and runs the actual math as deterministic SQL inside a database. The aggregation never happens inside a model. (When it writes a narrative, it sends only the small aggregated result, not your rows.) So the model's input is bounded by your schema, not your row count: whether the file is 300 rows or 30 million, the model sees the same compact summary, and the heavy lifting is SQL, which is effectively free.
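The schema-first flow can be sketched with Python's built-in `sqlite3` module. This is an illustration of the idea only, not Eternity's implementation: the `ledger` table, the `schema_summary` helper, and the hand-written `planned_sql` (standing in for whatever plan a model would return) are assumptions. The three rows are the already-cleaned amounts from the sample slice shown earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ledger (txn_id TEXT, category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO ledger VALUES (?, ?, ?)",
    [
        ("TXN-0001", "services", 13249078.0),
        ("TXN-0002", "consulting", 3159480.0),
        ("TXN-0006", "consulting", 1564055.0),
    ],
)


def schema_summary(conn, table, samples=2):
    """Build the compact description a planner model would see.

    Contains column names, declared types and a few sample values. Its size
    depends on the number of columns, not on the number of rows.
    """
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    columns = [(row[1], row[2]) for row in conn.execute(f"PRAGMA table_info({table})")]
    summary = {}
    for name, col_type in columns:
        cursor = conn.execute(f"SELECT DISTINCT {name} FROM {table} LIMIT ?", (samples,))
        summary[name] = {"type": col_type, "samples": [r[0] for r in cursor]}
    return summary


summary = schema_summary(conn, "ledger")
print(summary)

# Stand-in for the query plan a model would return after seeing only `summary`.
planned_sql = "SELECT SUM(amount) FROM ledger WHERE category = ?"
(total,) = conn.execute(planned_sql, ("consulting",)).fetchone()
print(total)  # 4723535.0
```

The sum is computed by the database engine, so the same query text works unchanged whether the table holds three rows or millions, and the SQL itself is the audit trail.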
The honest verdict
We are not claiming to be more accurate than GPT-5.5. On this test we tied it at 10/10, and on clean data frontier LLMs are excellent. Anyone selling you "our AI is more accurate than ChatGPT" is hand-waving.
What the messiest ledger actually shows is different, and more durable: equal top-tier accuracy, ~7× faster than GPT-5.5 (5.5 s versus 39.9 s per question on average), every answer backed by visible SQL you can audit, at a fraction of the token cost. And it doesn't silently count the GRAND TOTAL row as a transaction or fabricate the net total. For a number you have to sign your name under, that's the difference that matters.
- Reproduce it: the dataset, the three runners, and the scorer are in our repo; one command regenerates every number above.
- Try it on your own mess: upload a real export and see the SQL behind every answer.