Benchmark Independence
Key properties: zero training overlap, greedy decoding, independently authored.
- No test questions appear in the training data. The 100 benchmark questions were authored independently of the 372 Socrates training examples, with no overlap in prompts, entities, or scenarios between the two sets.
- Training examples teach general behaviors (self-assess confidence, refuse fabrication, verify retrieval). The benchmark tests whether those behaviors generalize to questions the model has never seen.
- Deterministic evaluation. All results use greedy decoding (temperature=0, do_sample=False) — every run produces identical output. No cherry-picking, no best-of-N sampling.
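Both the overlap and determinism claims above are mechanically checkable. A minimal sketch with toy data (the prompts and logits below are illustrative, not the actual benchmark):

```python
def normalize(prompt: str) -> str:
    # Case-fold and collapse whitespace so trivial formatting
    # differences don't hide an overlap.
    return " ".join(prompt.lower().split())

def overlap(train_prompts, test_prompts):
    # Exact-match intersection after normalization; an empty set means
    # no test question appears verbatim in the training data.
    return ({normalize(p) for p in train_prompts}
            & {normalize(p) for p in test_prompts})

def greedy_decode(step_logits):
    # Greedy decoding: take the argmax token at every step.
    # No sampling is involved, so repeated runs are identical
    # by construction.
    return [max(range(len(logits)), key=logits.__getitem__)
            for logits in step_logits]

train = ["Who wrote Hamlet?", "What is the boiling point of water?"]
test = ["Who is the CEO of the Vexatron Corporation?"]  # hypothetical item
assert overlap(train, test) == set()

logits = [[0.1, 2.3, -0.5], [1.0, 0.9, 0.95]]
assert greedy_decode(logits) == greedy_decode(logits) == [1, 0]
```

The same normalization-then-intersect check scales to the full 372-example training set; any nonempty result would flag contamination.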
What This Is Not
- This is not "benchmaxing." The model was never trained on these questions or close variants of them. The training data teaches the model how to reason about uncertainty — not the answers to specific trick questions.
- The benchmark categories (fake entities, false premises, unknowable facts, etc.) are adversarial by design. Many of these questions fool frontier models that are 100x this model's size.
Frontier Model Context
For context, published independent evaluations of frontier models on adversarial hallucination benchmarks:
- GPT-4o, Claude Opus, Gemini Ultra (400B+ parameters, cloud-only): ~5–15% hallucination on adversarial tasks, depending on benchmark design. These models cost $5–75 per million tokens and require network connectivity.
- GPT-4o-mini, Claude Sonnet (mid-tier, cloud-only): ~15–25% — comparable to this 7B Socrates model at 16%.
- Base open-source 7B models (no anti-hallucination training): 60–80%. Our base model scored 70%.
The Socrates-trained 7B model cuts hallucination from 70% to 16%, a 77% relative reduction that puts it in mid-tier frontier territory, while running entirely on-device with zero API cost, zero network dependency, and zero data leaving the device.
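The headline figure follows directly from the two measured rates above; a quick sanity check (the 77% is the relative reduction from the base model's 70% hallucination rate):

```python
base_rate = 70.0       # untuned 7B hallucination rate (%), from the text
socrates_rate = 16.0   # Socrates-trained 7B rate (%), from the text

# Relative reduction: how much of the base model's hallucination
# was eliminated by training.
relative_reduction = (base_rate - socrates_rate) / base_rate * 100
assert round(relative_reduction) == 77
```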
Reproducibility
- Training data, benchmark questions, and evaluation scripts are available for independent verification upon request.
- The model can be tested on any new adversarial question — the capability is general, not memorized.
- We welcome independent adversarial testing. Bring your own questions.
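For independent adversarial testing, a harness mainly needs a way to decide whether an answer abstains rather than fabricates. A toy sketch; the marker phrases below are an assumed heuristic for illustration, not the project's actual scoring rubric:

```python
# Hypothetical abstention markers -- a crude keyword heuristic,
# not the published evaluation criteria.
ABSTENTION_MARKERS = (
    "i don't know",
    "i'm not certain",
    "no verified information",
    "cannot be known",
    "does not exist",
)

def abstains(answer: str) -> bool:
    # True if the answer signals uncertainty or refusal instead of
    # asserting a (possibly fabricated) fact.
    a = answer.lower()
    return any(marker in a for marker in ABSTENTION_MARKERS)

assert abstains("I don't know of any company by that name.")
assert not abstains("The CEO is Jane Smith, appointed in 2019.")
```

A real grader would pair a check like this with human or model review, since keyword matching misses paraphrased refusals and confident-sounding hedges.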