Reasoning-Focused LLM Fine-Tuning

Is your AI
running rough?

We don't just tune models — we teach them to reason. Cause and effect. What changes when you intervene. What would have happened if things were different. A 7B model trained to reason scores 96.96% on a public benchmark. GPT-4o scores 72%.

Your model, our garage.
TunedAI Labs
96.96%
CLadder benchmark score
on our causal reasoning model
7B
Parameters — open-source,
free to run anywhere
+25 pts
Percentage points above GPT-4o on the
same publicly verifiable benchmark
Verified Results

Smaller model.
Better answers.

We took Qwen 2.5-7B — a free, open-source model anyone can download — and trained it to reason: to understand cause and effect, distinguish correlation from causation, and think through what would happen if you changed something. Run the benchmark yourself.

Model                          Score     Notes
TunedAI Labs — Qwen 2.5-7B     96.96%    7B params · open-source · fine-tuned
GPT-4o                         ~72%      General purpose · closed source
Claude 3.5 Sonnet              ~68%      General purpose · closed source
Base Qwen 2.5-7B               ~62%      Same model · no fine-tuning

CLadder benchmark — 10,000+ questions. Public. Independently runnable. Verify it yourself →
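"Run the benchmark yourself" boils down to a small scoring loop: ask the model each question, normalize its reply to a yes/no label, and count matches against the computed ground truth. A minimal sketch, with a stub standing in for the real model call and toy items standing in for real benchmark questions (neither is TunedAI's actual harness):

```python
# Minimal sketch of a yes/no eval harness for CLadder-style questions.
# `ask_model` is a stub; in practice it would call the local model
# (e.g. through transformers or an inference server).

def normalize(reply: str) -> str:
    """Map a free-form model reply to a yes/no label."""
    return "yes" if reply.strip().lower().startswith("yes") else "no"

def score(ask_model, items):
    """Accuracy of `ask_model` over (question, gold_answer) pairs."""
    correct = sum(normalize(ask_model(q)) == gold for q, gold in items)
    return correct / len(items)

# Toy stand-ins for a real model and real benchmark items:
def ask_model(question):
    return "Yes." if "do(" in question else "No."

items = [
    ("Does P(jyka=1) change under do(yupt=1)?", "yes"),
    ("Is observed correlation alone enough to conclude causation?", "no"),
]
print(score(ask_model, items))  # 1.0 on this toy pair
```

Swap the stub for a call into any locally hosted model and point `items` at the benchmark file, and the loop reports the same kind of accuracy number quoted above.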

Under the Hood

How a tune-up works.

We take an existing open-source model and train it to reason in your domain — not just to score better on tests, but to understand cause and effect, ask the right questions, and know when a conclusion actually follows from the evidence. You own the result forever.

Step 01

Diagnose the problem

We benchmark your current model against your actual task. No guessing — you see exactly where it's underperforming before we touch anything.

Step 02

Build the training set

We curate high-quality examples of the task done correctly. This is where most fine-tuning fails — we specialize in training data that teaches reasoning, not surface-level pattern matching.

Step 03

Train to reason, then verify

We train the model specifically to reason — not just to pattern-match answers. Then we measure reasoning performance on a held-out benchmark. You get before and after numbers, not just our word that it improved.
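The "held-out benchmark" part of Step 03 is easy to sketch: reserve a slice of the curated examples before training ever starts, then report base vs. tuned accuracy on that untouched slice. A schematic in Python (the split fraction is illustrative; the example scores reuse the base-62% / tuned-96.96% figures from the table above):

```python
import random

def train_holdout_split(examples, holdout_frac=0.1, seed=0):
    """Shuffle and split; the holdout is never shown to the model in training."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = max(1, int(len(shuffled) * holdout_frac))
    return shuffled[cut:], shuffled[:cut]  # (train, holdout)

def delta_report(base_acc, tuned_acc):
    """The before/after numbers delivered at the end of an engagement."""
    return {
        "base": base_acc,
        "tuned": tuned_acc,
        "delta_pts": round((tuned_acc - base_acc) * 100, 2),
    }

train, holdout = train_holdout_split([f"example-{i}" for i in range(500)])
print(len(train), len(holdout))    # 450 50
print(delta_report(0.62, 0.9696))  # a ~35-point gain
```

The point of the fixed seed and the pre-training split is auditability: anyone can re-run the same split and reproduce the same before/after numbers.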

Why Open Source

Own your model.
Own your costs.

Every API call to a closed-source model is a recurring bill you don't control, on servers you don't own, with a provider who can change pricing or go down anytime. A fine-tuned open-source model eliminates all of that.

💰

Dramatically lower cost

A fine-tuned 7B model on your own hardware costs a fraction of a GPT-4o API call. At scale, the difference is orders of magnitude.

🔒

Your data stays private

No API call means no data leaving your environment. Critical for healthcare, legal, finance, and defense where data cannot touch a third-party server.

No vendor dependency

OpenAI deprecates models. Anthropic changes pricing. A fine-tuned model you own runs the same way forever, regardless of what any vendor decides.

🎯

Reasons through your domain

A general model pattern-matches. A reasoning-trained model understands cause and effect in your domain — why a diagnosis follows from symptoms, why a clause creates liability, why a signal indicates a threat. The benchmark above shows what that difference looks like.

Open-Source vs. Closed-Source API
                    Open-Source        Closed API
Cost per query      ~$0.00002          ~$0.015
Data privacy        100% local         Sent to vendor
Model ownership     You own it         Rented
Deprecation risk    None               High
Domain accuracy     Fine-tuned ✓       General
Latency             On your hardware   API round-trip

Cost estimate at 1M queries/month. Closed API based on GPT-4o pricing. Open-source assumes A100 cloud GPU amortized over 24 months.
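The footnote's arithmetic is easy to reproduce. Using the table's own per-query figures at 1M queries/month (both figures are this page's assumptions, not quotes from any vendor):

```python
# Reproduce the cost comparison from the table, at 1M queries/month.
QUERIES_PER_MONTH = 1_000_000
CLOSED_API_PER_QUERY = 0.015     # ~GPT-4o-class API cost per query (assumed)
SELF_HOSTED_PER_QUERY = 0.00002  # amortized A100 cost per query (assumed)

closed = QUERIES_PER_MONTH * CLOSED_API_PER_QUERY
hosted = QUERIES_PER_MONTH * SELF_HOSTED_PER_QUERY
print(f"closed API:  ${closed:,.0f}/month")   # $15,000/month
print(f"self-hosted: ${hosted:,.0f}/month")   # $20/month
print(f"ratio:       {closed / hosted:,.0f}x")  # 750x
```

At these assumed rates the gap is roughly 750x, which is where the "orders of magnitude" claim comes from; your actual ratio depends on hardware, utilization, and query length.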

What We Do

Full-service garage.

We work with teams whose models need to reason — not just retrieve or rephrase, but actually understand cause and effect in their domain. General-purpose models weren't built for that. We are.

🔧

Reasoning Fine-Tuning

We train an open-source model to reason in your domain — legal, medical, financial, security, or any field where the model needs to understand cause and effect, not just match patterns. You get the weights and run the model anywhere.

Most Popular
📊

Benchmark & Evaluation

We design a benchmark for your task and measure performance before and after. You get a verifiable number — not a vague claim of "improvement."

Always Included
🗂️

Training Data Curation

Most fine-tuning fails because of bad training data. We build the dataset that actually teaches reasoning — not surface-level pattern matching. This is the hard part.

The Hard Part
🚀

Deployment Support

We help you deploy the fine-tuned model in your environment — on-premise, private cloud, or air-gapped. We write the inference API and integrate with your stack.

Optional Add-On
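A self-hosted deployment can be as small as one process exposing a JSON endpoint. A standard-library sketch of that shape, with `run_model` as a stub for the actual model runtime (a production stack would more typically use a dedicated inference server):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def run_model(prompt: str) -> str:
    """Stub: replace with a call into the fine-tuned model's runtime."""
    return f"(model reply to: {prompt})"

class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body and run the (stubbed) model.
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        reply = run_model(body.get("prompt", ""))
        payload = json.dumps({"completion": reply}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep the console quiet

# To serve on port 8080:
#   HTTPServer(("0.0.0.0", 8080), InferenceHandler).serve_forever()
```

Because everything runs in your own process, the same code works unchanged on-premise, in a private cloud, or air-gapped.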
🔄

Iterative Improvement

Fine-tuning is not a one-time event. As your use case evolves, we retrain with new data and measure the delta. Ongoing relationship, not a one-off project.

Retainer Option
🔬

Research Collaboration

If you have a novel reasoning problem — something no existing model handles well — we research and develop a fine-tuning approach from scratch. We publish when appropriate.

Custom Scope
Pricing

Straightforward engagements.

Fixed-scope projects. You know what you're getting and what it costs before we start.

Starter

$3,000
one-time
  • ✓ 100–200 training examples
  • ✓ 7B model fine-tuned on your domain
  • ✓ Before/after benchmark
  • ✓ Adapter weights delivered via Hugging Face
  • — Deployment support
Get Started
Most Popular

Standard

$6,000
one-time
  • ✓ 500+ training examples
  • ✓ 7B–14B model fine-tuned
  • ✓ Full benchmark + eval card
  • ✓ Tool call adherence testing
  • ✓ Deployment guide + inference API
Get Started

Retainer

$3,500
per month
  • ✓ Monthly retraining as data grows
  • ✓ Ongoing eval + drift monitoring
  • ✓ New domain expansions included
  • ✓ Priority turnaround
  • ✓ Direct Slack access
Let's Talk

All engagements include a scoping call. No retainer required to start.

FAQ

Common Questions

Straight answers on how the benchmark works and how the model was trained.

Could the model have memorized the benchmark?

No. The live benchmark generates questions at runtime using fictional variable names — yupt, jyka, kwox, glimx — that the model has never seen. Correct answers are computed from the probability parameters given in the question, not retrieved from any corpus. The tuned model scores 93% overall on these live runs. The base model scores 64% on the same questions. You can't memorize questions that didn't exist until the moment you ran the notebook.
Was the model trained on CLadder itself?

No. The model was not trained on CLadder questions. Training data was synthetically generated with machine-verified answers derived from explicit probability parameters — not sourced from CLadder or any public causal dataset. There is also a keyword-scrubbed version of CLadder that removes the 168 questions where the answer can be guessed from phrases like "collider bias" with no causal reasoning required. The score holds on that version too.
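Both answers above rest on the same mechanism: questions whose ground truth is computed from stated probability parameters, not looked up anywhere. A toy illustration of that idea (the fictional-name pool and the simple two-parameter form are mine, not the actual generator):

```python
import random

FICTIONAL_NAMES = ["yupt", "jyka", "kwox", "glimx", "vren", "blorf"]

def make_question(seed=None):
    """Generate a fresh interventional question. The answer is computed
    from the stated parameters, so there is nothing to memorize: every
    run yields new names and new numbers."""
    rng = random.Random(seed)
    cause, effect = rng.sample(FICTIONAL_NAMES, 2)
    p1 = round(rng.uniform(0.05, 0.95), 2)  # P(effect=1 | do(cause=1))
    p0 = round(rng.uniform(0.05, 0.95), 2)  # P(effect=1 | do(cause=0))
    question = (
        f"P({effect}=1 | do({cause}=1)) = {p1}; "
        f"P({effect}=1 | do({cause}=0)) = {p0}. "
        f"Does setting {cause}=1 raise the chance of {effect}?"
    )
    return question, ("yes" if p1 > p0 else "no")

q, a = make_question(seed=7)
print(q)
print("ground truth:", a)
```

A fixed seed makes a run reproducible for auditing; omitting it gives a fresh question every time, which is why a memorized answer key can't help.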
How do I verify the results myself?

Click Live Demo in the footer. The Colab notebook runs both models against fresh questions generated in your session. No setup required — just a free Google account. Each run produces different questions with different fictional variable names. Watch the scores accumulate in real time.
What is CLadder?

CLadder is a public 10,112-question benchmark for causal reasoning. It was created by researchers at ETH Zürich and is publicly available on GitHub. GPT-4o scores ~72% on it. Our fine-tuned 7B model scores 96.96%.
Why does a small specialist beat GPT-4o?

General-purpose models like GPT-4o are optimized for breadth. Causal reasoning requires a specific kind of structured inference — working through what changes when you intervene, and what the outcome would have been under different conditions. Fine-tuning a smaller model specifically on that reasoning pattern produces a specialist that outperforms a generalist on this task, the same way a domain-trained radiologist outperforms a general practitioner on reading scans.
Get Started

Tell us where your model stops reasoning.

We'll tell you whether reasoning fine-tuning can fix it, which model to start from, and what a realistic benchmark score looks like. No commitment — just an honest conversation about whether we can actually help.

Or email us directly: hello@tunedailabs.com