We took Qwen 2.5-7B — a free, open-source model anyone can download — and trained it to reason: to understand cause and effect, distinguish correlation from causation, and think through what would happen if you changed something. Run the benchmark yourself.
| Model | CLadder Score | Notes |
|---|---|---|
| TunedAI Labs — Qwen 2.5-7B | 96.96% | 7B params · open-source · fine-tuned |
| GPT-4o | ~72% | General purpose · closed source |
| Claude 3.5 Sonnet | ~68% | General purpose · closed source |
| Base Qwen 2.5-7B | ~62% | Same model · no fine-tuning |
CLadder benchmark — 10,000+ questions. Public. Independently runnable. Verify it yourself →
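If you want to run a spot-check yourself, the sketch below shows the shape such an eval takes: CLadder items are yes/no causal questions, so scoring is just greedy decoding against the gold label. The local file path, field names, and 200-question sample are illustrative assumptions; the released CLadder data and the verifier's notebook define the real format.

```python
# Minimal CLadder spot-check sketch (assumed data format, not the
# verifier's actual notebook).
import json
import random

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"  # swap in the fine-tuned checkpoint to compare

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

with open("cladder.json") as f:     # hypothetical local copy of the benchmark
    items = json.load(f)            # assumed fields: "question", "answer"

sample = random.sample(items, 200)  # small held-out spot check

correct = 0
for item in sample:
    messages = [{"role": "user", "content": item["question"] + "\nAnswer yes or no."}]
    inputs = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=8, do_sample=False)
    reply = tok.decode(out[0, inputs.shape[1]:], skip_special_tokens=True)
    correct += reply.strip().lower().startswith(item["answer"].strip().lower())

print(f"accuracy on {len(sample)} questions: {correct / len(sample):.2%}")
```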
We didn't evaluate our own model. A third-party applied AI engineer ran the benchmark, analyzed the LoRA adapter, and confirmed the results.
During the call, Mark Gentry claimed to have fine-tuned Qwen 2.5-7B to score far higher (96.96%) than frontier models on the CLadder benchmark. At one point we discussed the terms "overfitting" and "benchmaxxing," and to Mark's credit, I ran his tests and confirmed the result is neither overfitted nor benchmaxxed.
Kudos for surviving scrutiny.
Adapter architecture reviewed: rank 64, attention-only, ~40M trainable parameters. CLadder eval independently reproduced from the Colab notebook. Results are not dependent on our reporting.
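The ~40M figure is easy to check from Qwen 2.5-7B's published config (hidden size 3584, 28 layers, grouped-query attention with 4 KV heads of dimension 128). A LoRA adapter adds rank × (d_in + d_out) parameters per adapted weight matrix; here is the arithmetic as a sketch:

```python
# Back-of-the-envelope check of the "~40M trainable parameters" claim
# for a rank-64, attention-only LoRA on Qwen 2.5-7B.
# Dimensions below come from the public Qwen2.5-7B config.
hidden = 3584   # hidden size
kv_dim = 512    # 4 KV heads x 128 head dim (grouped-query attention)
layers = 28
r = 64          # LoRA rank

def lora_params(d_in, d_out, rank):
    # LoRA adds two low-rank factors: A (d_in x rank) and B (rank x d_out)
    return rank * (d_in + d_out)

per_layer = (
    lora_params(hidden, hidden, r)    # q_proj
    + lora_params(hidden, kv_dim, r)  # k_proj
    + lora_params(hidden, kv_dim, r)  # v_proj
    + lora_params(hidden, hidden, r)  # o_proj
)
total = per_layer * layers
print(f"{total:,} trainable parameters")  # 40,370,176 ≈ 40M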
We take an existing open-source model and train it to reason in your domain — not just to score better on tests, but to understand cause and effect, ask the right questions, and know when a conclusion actually follows from the evidence. You own the result forever.
We benchmark your current model against your actual task. No guessing — you see exactly where it's underperforming before we touch anything.
We curate high-quality examples of the task done correctly. This is where most fine-tuning fails — we specialize in training data that teaches reasoning, not surface-level pattern matching.
We train the model specifically to reason — not just to pattern-match answers. Then we measure reasoning performance on a held-out benchmark. You get before and after numbers, not just our word that it improved.
Every API call to a closed-source model is a recurring bill you don't control, on servers you don't own, with a provider who can change pricing or go down anytime. A fine-tuned open-source model eliminates all of that.
Per query, serving a fine-tuned 7B model on your own hardware costs a fraction of a GPT-4o API call. At scale, the difference is orders of magnitude.
No API call means no data leaving your environment. Critical for healthcare, legal, finance, and defense where data cannot touch a third-party server.
OpenAI deprecates models. Anthropic changes pricing. A fine-tuned model you own runs the same way forever, regardless of what any vendor decides.
A general model pattern-matches. A reasoning-trained model understands cause and effect in your domain — why a diagnosis follows from symptoms, why a clause creates liability, why a signal indicates a threat. The benchmark above shows what that difference looks like.
Cost estimate at 1M queries/month. Closed API based on GPT-4o pricing. Open-source assumes A100 cloud GPU amortized over 24 months.
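For concreteness, here is that comparison as a sketch. Every number is an illustrative placeholder (GPT-4o list prices as of late 2024, a cloud A100 at roughly $1.50/hour, assumed per-query token counts), not a quote:

```python
# Illustrative monthly cost comparison at 1M queries/month.
# All prices below are assumptions for the sketch; plug in your own.
queries = 1_000_000
in_tokens, out_tokens = 1_000, 500  # assumed per-query token counts

# GPT-4o list pricing circa late 2024: $2.50/M input, $10/M output tokens
api_cost = queries * (in_tokens * 2.50 + out_tokens * 10.00) / 1_000_000

# One cloud A100 at ~$1.50/hour, running 24/7 (~730 hours/month)
gpu_cost = 730 * 1.50

print(f"closed API:  ${api_cost:,.0f}/month")  # $7,500
print(f"self-hosted: ${gpu_cost:,.0f}/month")  # ~$1,095
```

The self-hosted figure is flat: once the 7B model fits on the GPU, added volume costs only throughput, while API spend scales linearly with every query.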
We work with teams whose models need to reason — not just retrieve or rephrase, but actually understand cause and effect in their domain. General-purpose models weren't built for that. We are.
Most Popular: We train an open-source model to reason in your domain — legal, medical, financial, security, or any field where the model needs to understand cause and effect, not just match patterns. You get the weights and run the model anywhere.
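"You get the weights" is literal. A minimal loading sketch, assuming the fine-tune ships as a standard PEFT adapter directory (the adapter path below is a placeholder, not a real artifact):

```python
# Load the delivered fine-tune anywhere: public base weights plus
# the LoRA adapter you own.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen2.5-7B-Instruct"
ADAPTER = "path/to/your-adapter"  # placeholder: the directory we hand off

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, ADAPTER)

# Optionally fold the adapter into the base weights, so inference
# needs plain transformers and no peft dependency.
model = model.merge_and_unload()
```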
Always Included: We design a benchmark for your task and measure performance before and after. You get a verifiable number — not a vague claim of "improvement."
The Hard Part: Most fine-tuning fails because of bad training data. We build the dataset that actually teaches reasoning — not surface-level pattern matching. This is the hard part.
Optional Add-On: We help you deploy the fine-tuned model in your environment — on-premise, private cloud, or air-gapped. We write the inference API and integrate with your stack.
Retainer Option: Fine-tuning is not a one-time event. As your use case evolves, we retrain with new data and measure the delta. Ongoing relationship, not a one-off project.
Custom Scope: If you have a novel reasoning problem — something no existing model handles well — we research and develop a fine-tuning approach from scratch. We publish when appropriate.
TunedAI Labs builds fine-tuned AI models for regulated, high-stakes work — healthcare, finance, legal, security — where the output needs to hold up to review, not just sound good.
I'm Mark Gentry, the founder. The company is the convergence point of four threads I've been pulling on for a long time.
Years before I went to grad school, I was sitting in on Hubert Dreyfus's AI courses at UC Berkeley — just for the fun of it. Dreyfus spent four decades making the most serious academic argument against purely computational AI: that reasoning, for minds, is not symbol manipulation. It's embedded in skillful practice, embodied context, and a background of shared meaning that can't be fully formalized. Around that time I also sat in on a Heidegger seminar — the philosophical foundation underneath everything Dreyfus was saying — and a Nietzsche seminar, because I was curious. Years later I went back for a master's in philosophy of consciousness.
I then spent twenty years recruiting in tech — YouTube, Google, BitTorrent, Tango, and others. Recruiting is a privileged seat. You see what companies actually build, not what they claim to build. You meet the people doing the work before the press releases. I had a front-row view of the systems Dreyfus had been critiquing as they grew into the dominant paradigm.
Among the companies I recruited for was Udacity, including their Machine Learning and Self-Driving Car Nanodegrees. I ended up taking the coursework myself, writing the projects by hand — partly out of curiosity, partly to see if I could.
TunedAI Labs is where those four threads meet. Modern LLMs are still mostly next-token predictors wrapped in instruction-following scaffolding. Dreyfus was right about something real — fluent output is not reasoning — and the gap shows up most painfully in regulated, high-stakes domains where the wrong answer has consequences. What I've built is an approach to engineering structure into fine-tuned LLMs so that their output is traceable, auditable, and closer to actual reasoning, not just plausibly worded.
A current result: a custom-tuned Qwen 2.5-7B scored 96.96% on CLadder, a public academic benchmark for causal reasoning. The base model scores around 62%. An independent applied AI engineer — 25 years in security and forensics, former White House — verified the result on a held-out sample. Not benchmaxxed, not overfitted.
None of this claims to have solved Dreyfus's critique. Nobody has. But taking the critique seriously and engineering against it is the work worth doing — and it's what TunedAI Labs is set up to do.
My father spent his career as an engineer at IBM in the mainframe era. His monitor program for the 1401 — the Gentry Monitor — was picked up by IBM and distributed as the FASTER Type II Program. He was awarded IBM's Outstanding Contributor recognition; the cufflinks are still in our family. His oral history is preserved at the Computer History Museum. A historian once told our family he was the unsung hero of CICS, the transaction system the world's banks still run on. I can't prove that part.
But I know what he spent his career doing: making machines reason reliably enough to be trusted with work that mattered. That's the same problem I'm working on, one layer up the stack.
Tell us what your model needs to do and where it falls short. We'll tell you whether reasoning fine-tuning can fix it, which model to start from, and what a realistic benchmark score looks like. No commitment — just an honest conversation about whether we can actually help.
Or email us directly: [email protected]