Reasoning-Focused LLM Fine-Tuning

Is your AI
running rough?

We don't just tune models — we teach them to reason. Cause and effect. What changes when you intervene. What would have happened if things were different. A 7B model trained to reason scores 96.96% on a public benchmark. GPT-4o scores 72%.

Your model, our garage.
TunedAI Labs
96.96%
CLadder benchmark score on our causal reasoning model

7B
Parameters · open-source, free to run anywhere

+25 pts
Above GPT-4o on the same publicly verifiable benchmark
Verified Results

Smaller model.
Better answers.

We took Qwen 2.5-7B — a free, open-source model anyone can download — and trained it to reason: to understand cause and effect, distinguish correlation from causation, and think through what would happen if you changed something. Run the benchmark yourself.

Model                           Score     Notes
TunedAI Labs — Qwen 2.5-7B      96.96%    7B params · open-source · fine-tuned
GPT-4o                          ~72%      General purpose · closed source
Claude 3.5 Sonnet               ~68%      General purpose · closed source
Base Qwen 2.5-7B                ~62%      Same model · no fine-tuning

CLadder benchmark — 10,000+ questions. Public. Independently runnable. Verify it yourself →
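The evaluation behind a score like this is deliberately simple: CLadder items are natural-language causal questions with yes/no ground-truth answers, and the score is the fraction answered correctly. A minimal sketch of that loop follows; the model call is a stub, since the real model and dataset identifiers are not something this sketch can assume.

```python
from typing import Callable

def evaluate(items: list[dict], predict: Callable[[str], str]) -> float:
    """Score a predictor on CLadder-style yes/no causal questions."""
    correct = sum(
        1 for it in items
        if predict(it["question"]).strip().lower() == it["answer"]
    )
    return correct / len(items)

# Toy items standing in for the 10,000+ real benchmark questions.
sample = [
    {"question": "Does the rooster's crow cause the sunrise?", "answer": "no"},
    {"question": "Does heavy smoking increase the risk of tar deposits?", "answer": "yes"},
]

def stub_model(question: str) -> str:
    # Placeholder for an inference call against the fine-tuned checkpoint.
    return "yes" if "increase" in question else "no"

print(f"accuracy: {evaluate(sample, stub_model):.2%}")
```

Swapping `stub_model` for a real inference call against any checkpoint reproduces the comparison in the table above.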

Independently Verified

Kudos for surviving scrutiny.

We didn't evaluate our own model. A third-party applied AI engineer ran the benchmark, analyzed the LoRA adapter, and confirmed the results.

"During the call Mark Gentry claimed to have fine-tuned Qwen 2.5-7B to achieve much higher scores (96.96%) than frontier models on the CLadder benchmark. There was a moment when we discussed the terms 'overfitting' and 'benchmaxxing' and, to Mark's credit, I ran his tests and confirmed that it is neither overfitting nor benchmaxxed. Kudos for surviving scrutiny."

Matthew Wong
Applied AI Engineer · LLM & Agent Platform Specialist
Former White House Situation Room · VP, JP Morgan Chase · Senior Solutions Architect, Splunk & Phantom
LinkedIn →
Technical Analysis

Adapter reviewed, eval reproduced.

Base Qwen 2.5-7B 63.0%
TunedAI LoRA (independent run) 91.5%
Improvement over base +28.5 pp

Adapter architecture reviewed: rank 64, attention-only, ~40M trainable parameters. CLadder eval independently reproduced from the Colab notebook. Results are not dependent on our reporting.
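The stated parameter count can be checked by arithmetic alone. The sketch below uses the published Qwen 2.5-7B architecture numbers (28 layers, hidden size 3584, grouped-query attention with a 512-dim KV projection); targeting the four attention projections is an assumption consistent with "attention-only", not a disclosure of the exact training recipe.

```python
# Back-of-envelope check that a rank-64, attention-only LoRA on
# Qwen 2.5-7B lands near the ~40M trainable parameters cited above.
HIDDEN = 3584   # Qwen 2.5-7B hidden size
KV_DIM = 512    # 4 KV heads x 128 head dim (grouped-query attention)
LAYERS = 28
RANK = 64

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A LoRA pair adds a (d_in x r) down-projection and an (r x d_out) up-projection."""
    return rank * (d_in + d_out)

per_layer = (
    lora_params(HIDDEN, HIDDEN, RANK)    # q_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # k_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # v_proj
    + lora_params(HIDDEN, HIDDEN, RANK)  # o_proj
)
total = per_layer * LAYERS
print(f"{total / 1e6:.1f}M trainable parameters")  # ≈ 40.4M
```

That ~40.4M figure matches the reviewer's report, which is exactly the kind of cross-check an independent audit should survive.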

Under the Hood

How a tune-up works.

We take an existing open-source model and train it to reason in your domain — not just to score better on tests, but to understand cause and effect, ask the right questions, and know when a conclusion actually follows from the evidence. You own the result forever.

Step 01

Diagnose the problem

We benchmark your current model against your actual task. No guessing — you see exactly where it's underperforming before we touch anything.

Step 02

Build the training set

We curate high-quality examples of the task done correctly. This is where most fine-tuning fails — we specialize in training data that teaches reasoning, not surface-level pattern matching.

Step 03

Train to reason, then verify

We train the model specifically to reason — not just to pattern-match answers. Then we measure reasoning performance on a held-out benchmark. You get before and after numbers, not just our word that it improved.
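The "held-out" part of Step 03 is what makes the before-and-after numbers meaningful: questions the model is scored on are carved off before training and never seen during it. A minimal sketch of that split, with placeholder items rather than real benchmark data:

```python
# Sketch of the held-out split behind the before/after measurement:
# the evaluation slice is separated before training and never trained on.
import random

def split_held_out(items: list, frac: float = 0.2, seed: int = 0):
    """Shuffle and carve off a held-out slice for evaluation only."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * frac)
    return shuffled[cut:], shuffled[:cut]   # (train, held_out)

train, held_out = split_held_out(list(range(100)))
print(len(train), len(held_out))  # 80 20
```

Both the base model and the fine-tuned model are then scored on the same held-out slice, and the delta is what gets reported.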

Why Open Source

Own your model.
Own your costs.

Every API call to a closed-source model is a recurring bill you don't control, on servers you don't own, from a provider who can change pricing or go down at any time. A fine-tuned open-source model eliminates all of that.

💰

Dramatically lower cost

A fine-tuned 7B model on your own hardware costs a fraction of a GPT-4o API call. At scale, the difference is orders of magnitude.

🔒

Your data stays private

No API call means no data leaving your environment. Critical for healthcare, legal, finance, and defense where data cannot touch a third-party server.

No vendor dependency

OpenAI deprecates models. Anthropic changes pricing. A fine-tuned model you own runs the same way forever, regardless of what any vendor decides.

🎯

Reasons through your domain

A general model pattern-matches. A reasoning-trained model understands cause and effect in your domain — why a diagnosis follows from symptoms, why a clause creates liability, why a signal indicates a threat. The benchmark above shows what that difference looks like.

Open-Source vs. Closed-Source API
                     Open-Source         Closed API
Cost per query       ~$0.00002           ~$0.015
Data privacy         100% local          Sent to vendor
Model ownership      You own it          Rented
Deprecation risk     None                High
Domain accuracy      Fine-tuned ✓        General
Latency              On your hardware    API round-trip

Cost estimate at 1M queries/month. Closed API based on GPT-4o pricing. Open-source assumes A100 cloud GPU amortized over 24 months.
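Multiplying the table's per-query figures out at the stated 1M queries/month makes the gap concrete. The per-query numbers are the estimates from the table above, not vendor quotes:

```python
# The comparison table's per-query estimates, scaled to monthly volume.
QUERIES_PER_MONTH = 1_000_000
OPEN_PER_QUERY = 0.00002    # self-hosted, A100 amortized over 24 months
CLOSED_PER_QUERY = 0.015    # GPT-4o-class API pricing estimate

open_monthly = QUERIES_PER_MONTH * OPEN_PER_QUERY
closed_monthly = QUERIES_PER_MONTH * CLOSED_PER_QUERY
print(f"open-source: ${open_monthly:,.0f}/mo   closed API: ${closed_monthly:,.0f}/mo")
print(f"ratio: {closed_monthly / open_monthly:,.0f}x")
```

Roughly $20/month versus $15,000/month at that volume, which is the "orders of magnitude" claim stated plainly.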

What We Do

Full-service garage.

We work with teams whose models need to reason — not just retrieve or rephrase, but actually understand cause and effect in their domain. General-purpose models weren't built for that. We are.

🔧

Reasoning Fine-Tuning

We train an open-source model to reason in your domain — legal, medical, financial, security, or any field where the model needs to understand cause and effect, not just match patterns. You get the weights and run the model anywhere.

Most Popular
📊

Benchmark & Evaluation

We design a benchmark for your task and measure performance before and after. You get a verifiable number — not a vague claim of "improvement."

Always Included
🗂️

Training Data Curation

Most fine-tuning fails because of bad training data. We build the dataset that actually teaches reasoning — not surface-level pattern matching. This is the hard part.

The Hard Part
🚀

Deployment Support

We help you deploy the fine-tuned model in your environment — on-premise, private cloud, or air-gapped. We write the inference API and integrate with your stack.

Optional Add-On
🔄

Iterative Improvement

Fine-tuning is not a one-time event. As your use case evolves, we retrain with new data and measure the delta. Ongoing relationship, not a one-off project.

Retainer Option
🔬

Research Collaboration

If you have a novel reasoning problem — something no existing model handles well — we research and develop a fine-tuning approach from scratch. We publish when appropriate.

Custom Scope
About

The convergence of four threads.

TunedAI Labs builds fine-tuned AI models for regulated, high-stakes work — healthcare, finance, legal, security — where the output needs to hold up to review, not just sound good.

I'm Mark Gentry, the founder. The company is the convergence point of four threads I've been pulling on for a long time.

Philosophy

Years before I went to grad school, I was sitting in on Hubert Dreyfus's AI courses at UC Berkeley — just for the fun of it. Dreyfus spent four decades making the most serious academic argument against purely computational AI: that reasoning, for minds, is not symbol manipulation. It's embedded in skillful practice, embodied context, and a background of shared meaning that can't be fully formalized. Around that time I also sat in on a Heidegger seminar — the philosophical foundation underneath everything Dreyfus was saying — and a Nietzsche seminar, because I was curious. Years later I went back for a master's in philosophy of consciousness.

Industry

I then spent twenty years recruiting in tech — YouTube, Google, BitTorrent, Tango, and others. Recruiting is a privileged seat. You see what companies actually build, not what they claim to build. You meet the people doing the work before the press releases. I had a front-row view of the systems Dreyfus had been critiquing as they grew into the dominant paradigm.

Hands-on

Among the companies I recruited for was Udacity, including their Machine Learning and Self-Driving Car Nanodegrees. I ended up taking the coursework myself, writing the projects by hand — partly out of curiosity, partly to see if I could.

The work

TunedAI Labs is where those four threads meet. Modern LLMs are still mostly next-token predictors wrapped in instruction-following scaffolding. Dreyfus was right about something real — fluent output is not reasoning — and the gap shows up most painfully in regulated, high-stakes domains where the wrong answer has consequences. What I've built is an approach to engineering structure into fine-tuned LLMs so that their output is traceable, auditable, and closer to actual reasoning, not just plausibly worded.

A current result: a custom-tuned Qwen 2.5-7B scored 96.96% on CLadder, a public academic benchmark for causal reasoning. The base model sits in the mid-60s. An independent applied AI engineer — 25 years in security and forensics, former White House — verified the result on a held-out sample. Not benchmaxxed, not overfitted.

None of this claims to have solved Dreyfus's critique. Nobody has. But taking the critique seriously and engineering against it is the work worth doing — and it's what TunedAI Labs is set up to do.

Inheritance

My father spent his career as an engineer at IBM in the mainframe era. His monitor program for the 1401 — the Gentry Monitor — was picked up by IBM and distributed as the FASTER Type II Program. He was awarded IBM's Outstanding Contributor recognition; the cufflinks are still in our family. His oral history is preserved at the Computer History Museum. A historian once told our family he was the unsung hero of CICS, the transaction system the world's banks still run on. I can't prove that part.

But I know what he spent his career doing: making machines reason reliably enough to be trusted with work that mattered. That's the same problem I'm working on, one layer up the stack.

Get Started

Tell us where your model stops reasoning.

We'll tell you whether reasoning fine-tuning can fix it, which model to start from, and what a realistic benchmark score looks like. No commitment — just an honest conversation about whether we can actually help.

Or email us directly: [email protected]