Benchmark Independence
Key properties: zero training overlap, greedy decoding, independently authored.
- No test questions appear in the training data. The 100 benchmark questions were authored independently of the 372 Socrates training examples, with no overlap in prompts, entities, or scenarios between the two sets.
- Training examples teach general behaviors (self-assess confidence, refuse fabrication, verify retrieval). The benchmark tests whether those behaviors generalize to questions the model has never seen.
- Deterministic evaluation. All results use greedy decoding (temperature=0, do_sample=False) — every run produces identical output. No cherry-picking, no best-of-N sampling.
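Both the overlap and determinism claims above are mechanically checkable. A minimal sketch with toy data (the prompts and logits below are illustrative, not the actual benchmark):

```python
def normalize(prompt: str) -> str:
    # Case-fold and collapse whitespace so trivial formatting
    # differences don't hide an overlap.
    return " ".join(prompt.lower().split())

def overlap(train_prompts, test_prompts):
    # Exact-match intersection after normalization; an empty set means
    # no test question appears verbatim in the training data.
    return ({normalize(p) for p in train_prompts}
            & {normalize(p) for p in test_prompts})

def greedy_decode(step_logits):
    # Greedy decoding: take the argmax token at every step.
    # No sampling is involved, so repeated runs are identical
    # by construction.
    return [max(range(len(logits)), key=logits.__getitem__)
            for logits in step_logits]

train = ["Who wrote Hamlet?", "What is the boiling point of water?"]
test = ["Who is the CEO of the Vexatron Corporation?"]  # hypothetical item
assert overlap(train, test) == set()

logits = [[0.1, 2.3, -0.5], [1.0, 0.9, 0.95]]
assert greedy_decode(logits) == greedy_decode(logits) == [1, 0]
```

The same normalization-then-intersect check scales to the full 372-example training set; any nonempty result would flag contamination.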
What This Is Not
- This is not "benchmaxing." The model was never trained on these questions or close variants of them. The training data teaches the model how to reason about uncertainty — not the answers to specific trick questions.
- The benchmark categories (fake entities, false premises, unknowable facts, etc.) are adversarial by design. Many of these questions fool frontier models that are 100x this model's size.
Frontier Model Context
For context, published independent evaluations of frontier models on adversarial hallucination benchmarks:
- GPT-4o, Claude Opus, Gemini Ultra (400B+ parameters, cloud-only): ~5–15% hallucination on adversarial tasks, depending on benchmark design. These models cost $5–75 per million tokens and require network connectivity.
- GPT-4o-mini, Claude Sonnet (mid-tier, cloud-only): ~15–25% — comparable to this 7B Socrates model at 16%.
- Base open-source 7B models (no anti-hallucination training): 60–80%. Our base model scored 70%.
The Socrates-trained 7B model cuts hallucination from 70% to 16%, a 77% relative reduction that puts it in mid-tier frontier territory, while running entirely on-device with zero API cost, zero network dependency, and zero data leaving the device.
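The headline figure follows directly from the two measured rates above; a quick sanity check (the 77% is the relative reduction from the base model's 70% hallucination rate):

```python
base_rate = 70.0       # untuned 7B hallucination rate (%), from the text
socrates_rate = 16.0   # Socrates-trained 7B rate (%), from the text

# Relative reduction: how much of the base model's hallucination
# was eliminated by training.
relative_reduction = (base_rate - socrates_rate) / base_rate * 100
assert round(relative_reduction) == 77
```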
Reproducibility
- Training data, benchmark questions, and evaluation scripts are available for independent verification upon request.
- The model can be tested on any new adversarial question — the capability is general, not memorized.
- We welcome independent adversarial testing. Bring your own questions.
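For independent adversarial testing, a harness mainly needs a way to decide whether an answer abstains rather than fabricates. A toy sketch; the marker phrases below are an assumed heuristic for illustration, not the project's actual scoring rubric:

```python
# Hypothetical abstention markers -- a crude keyword heuristic,
# not the published evaluation criteria.
ABSTENTION_MARKERS = (
    "i don't know",
    "i'm not certain",
    "no verified information",
    "cannot be known",
    "does not exist",
)

def abstains(answer: str) -> bool:
    # True if the answer signals uncertainty or refusal instead of
    # asserting a (possibly fabricated) fact.
    a = answer.lower()
    return any(marker in a for marker in ABSTENTION_MARKERS)

assert abstains("I don't know of any company by that name.")
assert not abstains("The CEO is Jane Smith, appointed in 2019.")
```

A real grader would pair a check like this with human or model review, since keyword matching misses paraphrased refusals and confident-sounding hedges.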