We Built the Best Coin Identification and Grading Model in the World. Here Are the Numbers.

Vardera identifies US and World coins with 98.5% accuracy and grades them with 96% accuracy, in under two seconds.

Written by

Ben Parfitt, Chief AI Officer

Vardera builds computer-vision infrastructure for collectibles and condition-dependent goods. Today we are publicly sharing benchmark results for our coin identification and grading model, tested head-to-head against six of the most capable frontier AI models in the world.

Vardera identifies US and World coins with 98.5% accuracy and grades them with 96% accuracy, in under two seconds. On grading and identification combined, the best frontier model still makes roughly 10x more errors than Vardera.

This post explains why that gap exists, presents the full benchmark data, and describes what it means for the future of coin commerce.

Why general-purpose AI falls short on coins

Frontier multimodal models are remarkable at general reasoning. They can write code, summarize legal documents, and recognize that something in a photo is a Morgan Dollar. But consistent, in-depth coin identification and grading is a fundamentally different problem.

Distinguishing two 1878-CC Morgan Dollars, one an MS-67 and the other an MS-63, requires detecting subtle differences in luster, contact marks, and strike quality. The visual gap between those two grades represents tens of thousands of dollars in value. A general vision model can tell you something is a Morgan Dollar. It cannot reliably tell you which year and mint, in what condition, at what grade.

This is not a prompting problem or a model-size problem. General-purpose models optimize for breadth across millions of tasks. Coin grading requires purpose-built depth: domain-specific training data, specialized feature extraction, and models designed from the ground up for fine-grained visual discrimination. That’s a different kind of system, not a better prompt.

Our approach: custom models, not LLM wrappers

Vardera trains category-specific computer vision models for each domain we cover. For coins, that means models trained on large-scale datasets of expert-graded imagery, built to do two things: identify the coin — often down to the specific die variety — and grade its condition on the Sheldon scale, the same standard used by NGC and PCGS.

These custom machine learning models are the backbone of our product. This is where the performance advantage comes from: 98.5% identification accuracy, 96% grading accuracy, and sub-2-second response times. None of those numbers are achievable with a general-purpose model today. Meanwhile, our models are constantly learning from their mistakes and improving week-over-week.

The benchmark

We evaluated Vardera against six frontier AI models on a standardized test set of coin images, measuring three things: identification error rate, grading error rate, and average inference time per image. Every model received the same images and the same task.

The dataset contains 1,188 coins, spanning 69 different coin types, 191 years, seven mints, and all possible numeric grade values on the Sheldon scale.
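
For readers who want to see how the three metrics relate, here is a minimal sketch of one way to score a model on a labeled test set. It is an illustration, not Vardera's evaluation harness: the `Label` and `Prediction` types and the `score` function are hypothetical names, and the sketch assumes an answer counts as correct only on an exact match of coin identity and numeric grade, a scoring rule this post does not specify.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Label:
    coin_id: str   # ground-truth identity, e.g. "Morgan Dollar 1878-CC"
    grade: int     # ground-truth numeric Sheldon grade (1-70)


@dataclass(frozen=True)
class Prediction:
    coin_id: str    # predicted identity
    grade: int      # predicted numeric Sheldon grade
    seconds: float  # wall-clock inference time for this image


def score(preds: Sequence[Prediction], labels: Sequence[Label]) -> dict[str, float]:
    """Return ID error %, grade error %, and mean seconds per image.

    Assumption (not stated in the post): exact-match scoring for both
    identity and grade.
    """
    if not preds or len(preds) != len(labels):
        raise ValueError("need at least one prediction, and one per label")
    n = len(preds)
    id_errors = sum(p.coin_id != t.coin_id for p, t in zip(preds, labels))
    grade_errors = sum(p.grade != t.grade for p, t in zip(preds, labels))
    return {
        "id_error_pct": 100 * id_errors / n,
        "grade_error_pct": 100 * grade_errors / n,
        "avg_seconds": sum(p.seconds for p in preds) / n,
    }
```

A looser rule, such as accepting a grade within one point of the label, would change only the `grade_errors` line.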

Raw error rates and latency

Model	ID Error %	Grade Error %	Average Time (sec)
Vardera	1.5%	4%	2.00
Fable 5	12.40%	38.00%	7.30
Opus 4.7 (Thinking)	8.10%	42.51%	9.64
Gemini 3.1 Pro (Thinking)	18.00%	62.37%	86.39
GPT 5.5 (Thinking)	11.20%	49.49%	19.46
GPT 4.1	17.90%	57.07%	6.85
Kimi K2.6 (Thinking)	16.10%	68.27%	70.26
Qwen 3.5 397b a17b (Thinking)	30.20%	75.34%	83.79

Vardera’s error rates are 1.5% on identification and 4% on grading. The closest frontier model on identification (Opus 4.7 at 8.10%) still fails on roughly 1 in 12 coins. On grading, Fable 5 is the closest at 38.00%, meaning it gets the grade wrong on nearly 4 in 10 coins. On speed, even the fastest frontier model (GPT 4.1 at 6.85 seconds) is more than 3x slower than Vardera’s 2-second average.

The relative comparison makes the gap clearer. When Vardera is the baseline (1x), the frontier models look like this:

Frontier models relative to Vardera

Model	ID Error (x)	Grade Error (x)	Avg Time (x)
Vardera	1	1	1
Fable 5	8	10	4
Opus 4.7 (Thinking)	5	11	5
Gemini 3.1 Pro (Thinking)	12	16	43
GPT 5.5 (Thinking)	7	12	10
GPT 4.1	12	14	3
Kimi K2.6 (Thinking)	11	17	35
Qwen 3.5 397b a17b (Thinking)	20	19	42

All values relative to Vardera (1x = baseline). Lower is better.
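
The multiples in this table follow from the raw results: divide each model's value by Vardera's and round to the nearest whole number. Below is a short sketch of that arithmetic using the published figures. The names `RAW` and `relative_to_baseline` are illustrative, and round-half-up is an assumed rounding rule that reproduces the multiples shown.

```python
import math

# (ID error %, grade error %, average seconds), copied from the raw results table.
RAW = {
    "Vardera": (1.5, 4.0, 2.00),
    "Fable 5": (12.40, 38.00, 7.30),
    "Opus 4.7 (Thinking)": (8.10, 42.51, 9.64),
    "Gemini 3.1 Pro (Thinking)": (18.00, 62.37, 86.39),
    "GPT 5.5 (Thinking)": (11.20, 49.49, 19.46),
    "GPT 4.1": (17.90, 57.07, 6.85),
    "Kimi K2.6 (Thinking)": (16.10, 68.27, 70.26),
    "Qwen 3.5 397b a17b (Thinking)": (30.20, 75.34, 83.79),
}


def relative_to_baseline(raw, baseline="Vardera"):
    """Express every metric as a whole-number multiple of the baseline's value."""
    base = raw[baseline]
    return {
        # floor(x + 0.5) rounds half up (assumed rule), unlike Python's round().
        model: tuple(math.floor(v / b + 0.5) for v, b in zip(values, base))
        for model, values in raw.items()
    }


RELATIVE = relative_to_baseline(RAW)
for model, (id_x, grade_x, time_x) in RELATIVE.items():
    print(f"{model}: ID {id_x}x, grade {grade_x}x, time {time_x}x")
```

Rounding hides some detail: Fable 5's grading multiple is 9.5 before rounding (38.00 / 4), so it appears as 10 in the table.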

Every frontier model produces at least 5x more identification errors and roughly 10x or more grading errors than Vardera, ranging up to 20x and 19x respectively. Three models (Gemini 3.1 Pro, Kimi K2.6, and Qwen 3.5) take over a minute per coin, making them impractical for any real-time application.

Why the gap exists

The performance difference is not surprising if you understand how these systems are built. Frontier LLMs are trained on internet-scale text and image data to perform well across an enormous range of tasks. They know what a Morgan Dollar is, but have no dedicated feature extraction for surface condition, and no calibration against professional grading standards.

Vardera’s coin models, by contrast, do nothing but coins. From the architecture, to training data, to evaluation metrics, everything is optimized for one job: looking at a coin image and producing an accurate identification and grade. The result is not incremental improvement. It is a category difference in capability.

We expect frontier models to continue improving on general vision tasks. However, the structural gap on fine-grained visual assessment is not something that closes with the next model release. It requires purpose-built training data, domain-specific architectures, and calibration against the standards the industry already trusts. That is what Vardera has built.

We’ll continue to publish how our category intelligence performs against industry standards as we expand into new verticals. If you’re running a marketplace or platform where coins, trading cards, or other collectibles need to be accurately identified, graded, or valued at scale, we’d like to talk. Reach out at info@vardera.com or visit vardera.com.



