We took Qwen 2.5-7B — a free, open-source model anyone can download — and trained it to reason: to understand cause and effect, distinguish correlation from causation, and think through what would happen if you changed something. Run the benchmark yourself.
| Model | CLadder Score | Notes |
|---|---|---|
| TunedAI Labs — Qwen 2.5-7B | 96.96% | 7B params · open-source · fine-tuned |
| GPT-4o | ~72% | General purpose · closed source |
| Claude 3.5 Sonnet | ~68% | General purpose · closed source |
| Base Qwen 2.5-7B | ~62% | Same model · no fine-tuning |
CLadder benchmark — 10,000+ questions. Public. Independently runnable. Verify it yourself →
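If you want to run a spot-check yourself, the sketch below shows the shape such an eval takes: CLadder items are yes/no causal questions, so scoring is just greedy decoding against the gold label. The local file path, field names, and 200-question sample are illustrative assumptions; the released CLadder data and the verifier's notebook define the real format.

```python
# Minimal CLadder spot-check sketch (assumed data format, not the
# verifier's actual notebook).
import json
import random

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"  # swap in the fine-tuned checkpoint to compare

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

with open("cladder.json") as f:     # hypothetical local copy of the benchmark
    items = json.load(f)            # assumed fields: "question", "answer"

sample = random.sample(items, 200)  # small held-out spot check

correct = 0
for item in sample:
    messages = [{"role": "user", "content": item["question"] + "\nAnswer yes or no."}]
    inputs = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=8, do_sample=False)
    reply = tok.decode(out[0, inputs.shape[1]:], skip_special_tokens=True)
    correct += reply.strip().lower().startswith(item["answer"].strip().lower())

print(f"accuracy on {len(sample)} questions: {correct / len(sample):.2%}")
```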
We didn't evaluate our own model. A third-party applied AI engineer ran the benchmark, analyzed the LoRA adapter, and confirmed the results.
During the call, Mark Gentry claimed to have fine-tuned Qwen 2.5-7B to score far higher (96.96%) than frontier models on the CLadder benchmark. At one point we discussed the terms "overfitting" and "benchmaxxing," and to Mark's credit, I ran his tests and confirmed the result is neither overfitted nor benchmaxxed.
Kudos for surviving scrutiny.
Adapter architecture reviewed: rank 64, attention-only, ~40M trainable parameters. CLadder eval independently reproduced from the Colab notebook. Results are not dependent on our reporting.
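The ~40M figure is easy to check from Qwen 2.5-7B's published config (hidden size 3584, 28 layers, grouped-query attention with 4 KV heads of dimension 128). A LoRA adapter adds rank × (d_in + d_out) parameters per adapted weight matrix; here is the arithmetic as a sketch:

```python
# Back-of-the-envelope check of the "~40M trainable parameters" claim
# for a rank-64, attention-only LoRA on Qwen 2.5-7B.
# Dimensions below come from the public Qwen2.5-7B config.
hidden = 3584   # hidden size
kv_dim = 512    # 4 KV heads x 128 head dim (grouped-query attention)
layers = 28
r = 64          # LoRA rank

def lora_params(d_in, d_out, rank):
    # LoRA adds two low-rank factors: A (d_in x rank) and B (rank x d_out)
    return rank * (d_in + d_out)

per_layer = (
    lora_params(hidden, hidden, r)    # q_proj
    + lora_params(hidden, kv_dim, r)  # k_proj
    + lora_params(hidden, kv_dim, r)  # v_proj
    + lora_params(hidden, hidden, r)  # o_proj
)
total = per_layer * layers
print(f"{total:,} trainable parameters")  # 40,370,176 ≈ 40M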
We take an existing open-source model and train it to reason in your domain — not just to score better on tests, but to understand cause and effect, ask the right questions, and know when a conclusion actually follows from the evidence. You own the result forever.
We benchmark your current model against your actual task. No guessing — you see exactly where it's underperforming before we touch anything.
We curate high-quality examples of the task done correctly. This is where most fine-tuning fails — we specialize in training data that teaches reasoning, not surface-level pattern matching.
We train the model specifically to reason — not just to pattern-match answers. Then we measure reasoning performance on a held-out benchmark. You get before and after numbers, not just our word that it improved.
Every API call to a closed-source model is a recurring bill you don't control, on servers you don't own, with a provider who can change pricing or go down anytime. A fine-tuned open-source model eliminates all of that.
Per query, serving a fine-tuned 7B model on your own hardware costs a fraction of a GPT-4o API call. At scale, the difference is orders of magnitude.
No API call means no data leaving your environment. Critical for healthcare, legal, finance, and defense where data cannot touch a third-party server.
OpenAI deprecates models. Anthropic changes pricing. A fine-tuned model you own runs the same way forever, regardless of what any vendor decides.
A general model pattern-matches. A reasoning-trained model understands cause and effect in your domain — why a diagnosis follows from symptoms, why a clause creates liability, why a signal indicates a threat. The benchmark above shows what that difference looks like.
Cost estimate at 1M queries/month. Closed API based on GPT-4o pricing. Open-source assumes A100 cloud GPU amortized over 24 months.
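For concreteness, here is that comparison as a sketch. Every number is an illustrative placeholder (GPT-4o list prices as of late 2024, a cloud A100 at roughly $1.50/hour, assumed per-query token counts), not a quote:

```python
# Illustrative monthly cost comparison at 1M queries/month.
# All prices below are assumptions for the sketch; plug in your own.
queries = 1_000_000
in_tokens, out_tokens = 1_000, 500  # assumed per-query token counts

# GPT-4o list pricing circa late 2024: $2.50/M input, $10/M output tokens
api_cost = queries * (in_tokens * 2.50 + out_tokens * 10.00) / 1_000_000

# One cloud A100 at ~$1.50/hour, running 24/7 (~730 hours/month)
gpu_cost = 730 * 1.50

print(f"closed API:  ${api_cost:,.0f}/month")  # $7,500
print(f"self-hosted: ${gpu_cost:,.0f}/month")  # ~$1,095
```

The self-hosted figure is flat: once the 7B model fits on the GPU, added volume costs only throughput, while API spend scales linearly with every query.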
We work with teams whose models need to reason — not just retrieve or rephrase, but actually understand cause and effect in their domain. General-purpose models weren't built for that. We are.
Most Popular: We train an open-source model to reason in your domain — legal, medical, financial, security, or any field where the model needs to understand cause and effect, not just match patterns. You get the weights and run the model anywhere.
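"You get the weights" is literal. A minimal loading sketch, assuming the fine-tune ships as a standard PEFT adapter directory (the adapter path below is a placeholder, not a real artifact):

```python
# Load the delivered fine-tune anywhere: public base weights plus
# the LoRA adapter you own.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen2.5-7B-Instruct"
ADAPTER = "path/to/your-adapter"  # placeholder: the directory we hand off

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, ADAPTER)

# Optionally fold the adapter into the base weights, so inference
# needs plain transformers and no peft dependency.
model = model.merge_and_unload()
```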
Always Included: We design a benchmark for your task and measure performance before and after. You get a verifiable number — not a vague claim of "improvement."
The Hard Part: Most fine-tuning fails because of bad training data. We build the dataset that actually teaches reasoning — not surface-level pattern matching. This is the hard part.
Optional Add-On: We help you deploy the fine-tuned model in your environment — on-premise, private cloud, or air-gapped. We write the inference API and integrate with your stack.
Retainer Option: Fine-tuning is not a one-time event. As your use case evolves, we retrain with new data and measure the delta. Ongoing relationship, not a one-off project.
Custom Scope: If you have a novel reasoning problem — something no existing model handles well — we research and develop a fine-tuning approach from scratch. We publish when appropriate.
TunedAI Labs builds fine-tuned AI models for regulated, high-stakes work — healthcare, finance, legal, security — where the output needs to hold up to review, not just sound good.
I'm Mark Gentry, the founder. The company is the convergence point of four threads I've been pulling on for a long time.
Years before I went to grad school, I was sitting in on Hubert Dreyfus's AI courses at UC Berkeley — just for the fun of it. Dreyfus spent four decades making the most serious academic argument against purely computational AI: that reasoning, for minds, is not symbol manipulation. It's embedded in skillful practice, embodied context, and a background of shared meaning that can't be fully formalized. Around that time I also sat in on a Heidegger seminar — the philosophical foundation underneath everything Dreyfus was saying — and a Nietzsche seminar, because I was curious. Years later I went back for a master's in philosophy of consciousness.
I then spent twenty years recruiting in tech — YouTube, Google, BitTorrent, Tango, and others. Recruiting is a privileged seat. You see what companies actually build, not what they claim to build. You meet the people doing the work before the press releases. I had a front-row view of the systems Dreyfus had been critiquing as they grew into the dominant paradigm.
Among the companies I recruited for was Udacity, including their Machine Learning and Self-Driving Car Nanodegrees. I ended up taking the coursework myself, writing the projects by hand — partly out of curiosity, partly to see if I could.
TunedAI Labs is where those four threads meet. Modern LLMs are still mostly next-token predictors wrapped in instruction-following scaffolding. Dreyfus was right about something real — fluent output is not reasoning — and the gap shows up most painfully in regulated, high-stakes domains where the wrong answer has consequences. What I've built is an approach to engineering structure into fine-tuned LLMs so that their output is traceable, auditable, and closer to actual reasoning, not just plausibly worded.
A current result: a custom-tuned Qwen 2.5-7B scored 96.96% on CLadder, a public academic benchmark for causal reasoning. The base model scores around 62%. An independent applied AI engineer — 25 years in security and forensics, former White House — verified the result on a held-out sample. Not benchmaxxed, not overfitted.
None of this claims to have solved Dreyfus's critique. Nobody has. But taking the critique seriously and engineering against it is the work worth doing — and it's what TunedAI Labs is set up to do.
My father spent his career as an engineer at IBM in the mainframe era. His monitor program for the 1401 — the Gentry Monitor — was picked up by IBM and distributed as the FASTER Type II Program. He was awarded IBM's Outstanding Contributor recognition; the cufflinks are still in our family. His oral history is preserved at the Computer History Museum. A historian once told our family he was the unsung hero of CICS, the transaction system the world's banks still run on. I can't prove that part.
But I know what he spent his career doing: making machines reason reliably enough to be trusted with work that mattered. That's the same problem I'm working on, one layer up the stack.
Tell us what your model needs to do and where it falls short. We'll tell you whether reasoning fine-tuning can fix it, which model to start from, and what a realistic benchmark score looks like. No commitment — just an honest conversation about whether we can actually help.
Or email us directly: [email protected]