The first time Priya tried to get an AI to answer questions about her company's leave policy, it told an employee she had thirty days of earned leave. She had twelve. It wasn't a small error — it was the kind that ends up in HR complaints. She spent three weeks rewriting her system prompt. The model improved, then got worse, then got better in different ways. It was never reliable. Eventually someone suggested she stop fighting the model and instead teach it. She fine-tuned it on her actual policy documents over a weekend. On Monday morning, it answered every question correctly. It has not been wrong since.
That gap — between a model that approximately does what you want and one that reliably does exactly what you want — is where fine-tuning lives. And in 2026, closing that gap has become something any developer can do, on free hardware, in an afternoon. The research infrastructure that once made this impossible has been replaced by a handful of open-source libraries, a Google account, and about two hours of your time.
This is the guide I wish existed when I started. Not the one that explains what a transformer is. The one that tells you what actually matters, what will actually go wrong, and how to build something that works in the real world.
The Problem With Prompting Your Way to Reliability
There is a version of every AI project where someone is extremely confident that the right prompt will fix everything. The prompt gets longer. It gains subsections. At some point it reads less like instructions and more like a legal document written by someone who has been burned before. And still — still — the model occasionally ignores it entirely and does something baffling.
This is not a failure of cleverness. It is a structural limitation. When you prompt a general-purpose model, you are borrowing behavior from a system trained to do everything. You are not changing what it knows or how it thinks. You are asking it, very politely, to pretend to be something more specific. Under enough pressure — unusual phrasing, edge cases, long conversations — that pretence breaks. The underlying model reasserts itself.
Fine-tuning does something fundamentally different. Instead of instructing the model at inference time, you change the model's weights during training. The behavior you want stops being a request. It becomes the model's actual nature. A fine-tuned model doesn't consult your system prompt and decide to comply. It responds the way it was trained to respond, because that is now what it is.
Fine-tuning is not a better prompt. It is a different kind of thing entirely — the difference between telling someone how to act and actually changing who they are.
The core distinction most guides don't make clearly enough
The other two approaches — RAG and full fine-tuning — solve different problems. RAG keeps the model's knowledge current: product catalogues updated weekly, policy documents revised monthly, databases too large to memorize. What it cannot do is change how the model speaks, what it refuses, or the instincts it reaches for when a question is ambiguous. Full fine-tuning gives you maximum control at maximum cost — hardware that runs thousands of dollars per training run, for results that are rarely meaningfully better than LoRA for real-world tasks.
LoRA is where you should start. Almost always.
Why a 7-Billion Parameter Model Now Fits in a Free GPU
A language model with 7 billion parameters contains 7 billion individual numbers. Full fine-tuning means updating all of them — which requires storing weights, gradients, and optimizer states simultaneously. For a 7B model at standard 16-bit precision, that is 80 to 160 gigabytes of GPU memory. That is not a Colab notebook. That is an A100 cluster.
LoRA — Low-Rank Adaptation, published in 2021 — makes this tractable through an elegant insight: you do not need to update all the weights to change how the model behaves. Instead, you freeze the original weights entirely and inject small trainable adapter matrices alongside specific layers. These adapters contain 0.1% to 1% of the original parameter count. Only the adapters change during training.
The result is astonishing in practice: training under 1% of a model's parameters delivers 85 to 95% of the quality of updating all of them. For domain-specific tasks — making a model talk about your product correctly, follow your format reliably, respond in your brand voice — the gap is invisible to end users.
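The arithmetic behind that claim is easy to check. A minimal sketch, using Mistral 7B's 4096 hidden dimension for a single attention projection and the rank of 16 used later in this guide (the layer choice is illustrative):

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """LoRA freezes the original d_in x d_out weight and trains two
    small matrices instead: A (d_in x r) and B (r x d_out)."""
    return d_in * r + r * d_out

full = 4096 * 4096                           # one frozen attention projection
adapter = lora_param_count(4096, 4096, r=16)  # its LoRA adapter
print(full, adapter, f"{adapter / full:.2%}")  # 16777216 131072 0.78%
```

Repeat that ratio across every adapted layer and you land in the sub-1% trainable-parameter range the paragraph above describes.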
QLoRA goes further by loading the base model in 4-bit precision instead of 16-bit. Four-bit quantization compresses each weight to a quarter of its 16-bit footprint: Mistral 7B's weights shrink from roughly 14 gigabytes at 16-bit to about 5 gigabytes once quantization overhead is counted. A free Colab T4 GPU has 15 gigabytes. That is the entire distance between "requires a research lab" and "open a browser tab and start training."
QLoRA trains slower than full-precision LoRA, and the 4-bit compression adds a small amount of noise to gradient updates. For datasets under 50,000 examples on a single domain task, this doesn't matter. The output quality is indistinguishable from standard LoRA for the use cases in this guide.
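A back-of-envelope sketch of why precision is the lever that matters. This counts raw weight storage only — activations, the CUDA context, and quantization block metadata add a couple of gigabytes on top — and the 7.2B figure approximates Mistral 7B's true parameter count:

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Raw weight storage for n_params weights at the given precision.
    Real VRAM usage adds activations and framework overhead on top."""
    return n_params * bits / 8 / 1e9

for bits in (32, 16, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(7.2e9, bits):.1f} GB")
# 32-bit: 28.8 GB  |  16-bit: 14.4 GB  |  4-bit: 3.6 GB
```

The 4-bit figure plus a couple of gigabytes of overhead is why "roughly 5 GB" fits comfortably inside a 15 GB T4.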
The Free Hardware You Already Have Access To
Google Colab free tier gives you a T4 GPU with 15 gigabytes of VRAM. Sessions disconnect after a few hours of inactivity and GPU availability isn't guaranteed during peak times. For getting the process working and training on small datasets, it's enough. One rule: set checkpoints. Losing a training run to an unceremonious timeout is an experience worth avoiding once.
Kaggle Notebooks are strictly better for serious work — two T4 GPUs simultaneously (30 GB combined) for 30 hours per week, free. Sessions are stable. The filesystem persists within a session. Once you have the pipeline working, Kaggle is where to do the real training.
When your datasets grow beyond 10,000 examples, RunPod and Lambda Labs rent A100s by the hour for a few dollars. But that is a problem for later. Start free. Get the pipeline working completely before spending a dollar on hardware.
Set save_steps=50 before training begins. Colab will disconnect eventually. A mid-run disconnection with no checkpoints saved is genuinely demoralizing. Ask me how I know.
The Fine-Tuning Process, Step by Step
We are fine-tuning Mistral 7B Instruct v0.2 using QLoRA. It is the right first model: small enough for free hardware, capable enough for production use, documented well enough that every error you will hit has been solved publicly. Open a Colab or Kaggle notebook, connect a GPU runtime, and run each block in order.
Install the libraries
One cell, three minutes. peft handles LoRA adapter logic. bitsandbytes does the 4-bit quantization. trl provides the SFTTrainer that simplifies the training loop considerably.
!pip install -q transformers datasets peft \
bitsandbytes accelerate trl \
huggingface_hub
Load the model in 4-bit
This is QLoRA happening. The BitsAndBytesConfig block compresses weights to 4-bit as they load into VRAM — Mistral 7B drops from roughly 14 GB at 16-bit to about 5 GB. The last line is easy to miss and causes confusing errors downstream. Don't skip it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
# 4-bit config — this is what makes QLoRA possible on free hardware
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
)
model_id = "mistralai/Mistral-7B-Instruct-v0.2"
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=bnb_config,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token # don't skip this
Attach LoRA adapters
Two numbers define the adapter's capacity. r is the rank — higher means more parameters, more capacity to learn complex behaviors, more memory usage. Start at 16. If the model isn't improving enough, try 32. If you hit memory errors, try 8. lora_alpha scales how strongly the adapters influence the output; setting it to twice the rank is the standard convention. Running print_trainable_parameters() will show something like 21M / 3.75B / 0.56% — that is why this fits in 15 gigabytes.
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
r=16, # adapter rank — tune if results are poor
lora_alpha=32, # scaling factor, conventionally 2× rank
target_modules=[
"q_proj", "k_proj", "v_proj",
"o_proj", "gate_proj",
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 21,000,000 || all params: 3,750,000,000 || trainable%: 0.56
Prepare your data — this is where fine-tuning actually wins or loses
Most guides spend a paragraph on data and several pages on model configuration. This is exactly backwards. The model configuration barely moves the needle. Your data is everything. A mediocre LoRA setup with excellent training data beats a perfectly tuned LoRA with inconsistent data, and it is not close.
You need instruction-response pairs in JSONL format — one JSON object per line. The consistency of your examples sets the ceiling on what the model can learn. Inconsistent tone, mixed formats, ambiguous answers — these do not average out during training. They produce a model that is inconsistently toned, inconsistently formatted, and inconsistently ambiguous.
// one example per line — be obsessively consistent with format
{"instruction": "What cities have same-day delivery?", "response": "Same-day delivery is available in Bengaluru, Chennai, Hyderabad, and Pune. All other locations are 2–3 business days."}
{"instruction": "Can I change my address after ordering?", "response": "Yes — within 30 minutes of placing your order. After that window, contact our support team directly."}
from datasets import load_dataset
dataset = load_dataset("json", data_files="your_data.jsonl")
def format_prompt(example):
return {"text": f"[INST] {example['instruction']} [/INST] {example['response']}"}
dataset = dataset.map(format_prompt)
train_dataset = dataset["train"]
500 high-quality pairs is the realistic floor. For production, 2,000–10,000 carefully cleaned examples is the right range. If you have 10,000 mediocre examples and 600 excellent ones — delete the 10,000. The excellent examples will win.
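Before training, it's worth mechanically checking the consistency this section insists on. A minimal validation sketch — the field names match the JSONL format above, and the checks (valid JSON, non-empty fields, duplicate instructions) are illustrative rather than exhaustive:

```python
import json

def validate_pairs(lines: list[str]) -> list[str]:
    """Return a list of problems found in JSONL training lines."""
    problems, seen = [], set()
    for i, line in enumerate(lines, 1):
        try:
            ex = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {i}: not valid JSON")
            continue
        for key in ("instruction", "response"):
            if not str(ex.get(key, "")).strip():
                problems.append(f"line {i}: missing or empty '{key}'")
        inst = ex.get("instruction", "")
        if inst in seen:
            problems.append(f"line {i}: duplicate instruction")
        seen.add(inst)
    return problems

sample = [
    '{"instruction": "Q1?", "response": "A1."}',
    '{"instruction": "Q1?", "response": "A2."}',  # duplicate question
    '{"instruction": "Q2?", "response": ""}',     # empty response
]
print(validate_pairs(sample))  # flags the duplicate and the empty response
```

Two conflicting answers to the same question are exactly the kind of inconsistency that does not average out during training, so catching duplicates before the run is cheap insurance.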
Train
gradient_accumulation_steps=8 with batch size 2 creates an effective batch of 16 — without needing 16 examples in VRAM simultaneously. On a 15 GB GPU, this is often necessary. save_steps=50 is not optional on Colab.
from trl import SFTTrainer
from transformers import TrainingArguments
training_args = TrainingArguments(
output_dir="./checkpoints",
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=8, # effective batch = 16
learning_rate=2e-4,
fp16=True,
logging_steps=10,
save_steps=50, # non-negotiable on Colab
warmup_ratio=0.03,
report_to="none",
)
trainer = SFTTrainer(
model=model,
train_dataset=train_dataset,
dataset_text_field="text",
max_seq_length=512,
tokenizer=tokenizer,
args=training_args,
)
trainer.train() # ~45–70 min on Colab T4 with 1,000 examples
Watch the loss. It should fall across the first 50 to 100 steps. If it is still flat or rising at step 150, something is wrong — most likely the learning rate (try 5e-5) or inconsistent data. A loss that oscillates wildly throughout usually means your training examples are fighting each other.
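You can check the trend programmatically rather than eyeballing the logs: the Trainer records every logged step in trainer.state.log_history as a list of dicts, each with a "loss" key. A sketch, with a synthetic history standing in for a real run:

```python
def loss_trend(log_history: list[dict], window: int = 5) -> float:
    """Compare the mean loss of the first and last `window` logged steps.
    A negative result means the loss is falling -- what you want."""
    losses = [e["loss"] for e in log_history if "loss" in e]
    head = sum(losses[:window]) / window
    tail = sum(losses[-window:]) / window
    return tail - head

# synthetic entries shaped like trainer.state.log_history
history = [{"step": s, "loss": 2.0 - 0.01 * s} for s in range(0, 200, 10)]
print(loss_trend(history))  # negative -> loss is falling
```

After (or during) a real run, call loss_trend(trainer.state.log_history); a result near zero or positive past step 150 is the signal to revisit the learning rate or the data.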
Save, merge, and verify
Save the adapters separately for small portable files — they are only tens of megabytes. For production inference, merge them into a copy of the base model loaded at 16-bit precision; merging directly into the 4-bit quantized weights is lossy. Merging is right for most deployment cases.
from peft import PeftModel
trainer.save_model("./lora-adapters")
# Merge adapters into a 16-bit copy of the base model for cleaner deployment.
# Merging into the 4-bit quantized weights degrades quality — reload the base
# at fp16 first (free the quantized model, or restart the session, if VRAM is tight).
base = AutoModelForCausalLM.from_pretrained(
model_id, torch_dtype=torch.float16, device_map="auto"
)
merged = PeftModel.from_pretrained(base, "./lora-adapters")
merged = merged.merge_and_unload()
# Test it
inputs = tokenizer(
"[INST] What cities have same-day delivery? [/INST]",
return_tensors="pt"
).to("cuda")
out = merged.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(out[0], skip_special_tokens=True))
If the response matches the style and content of your training data — you just fine-tuned a large language model. On free hardware. In an afternoon. The rest is refinement.
The Real Comparison
| Factor | LoRA / QLoRA | Full Fine-Tuning | RAG |
|---|---|---|---|
| GPU required | Free T4 — 15 GB | A100 80 GB+ | CPU sufficient |
| Training time | 30 min – 2 hrs | Hours to days | Minutes (indexing) |
| Knowledge updates | Requires retraining | Requires retraining | Update docs only |
| Tone & behavior | Excellent | Excellent | Limited |
| Factual accuracy | Good | Good | Excellent |
| Inference cost | Self-hosted — no per-call fee | Self-hosted — no per-call fee | Per-call API or retrieval cost |
| Best for | Behavior, voice, domain mastery | Research, maximum ceiling | Frequently-changing documents |
The Numbers From a Real Production Run
The HR policy bot at the start of this piece is a real project. The dataset was 1,400 instruction-response pairs built from six months of support tickets. The raw tickets were unusable — HR staff write replies in shorthand that assumes context the model would never have. Turning them into clean, self-contained pairs took a weekend. That weekend was the most important part of the entire project. The training run itself was almost uneventful.
The training loop rewards the work you did on the data. It does not compensate for the work you skipped.
The Errors That Will Find You
- CUDA out of memory — Drop per_device_train_batch_size to 1 and raise gradient_accumulation_steps to 16. The effective batch size is unchanged. The VRAM pressure is not.
- Loss flat or rising after step 150 — Try learning_rate=5e-5 instead of 2e-4. If that doesn't help, manually read 30 random training examples. Inconsistency in the data causes this more often than any hyperparameter problem.
- Model looping or producing garbage — The pad token line. Add tokenizer.pad_token = tokenizer.eos_token before training. It's in the code above. If it's not in yours, that's why.
- "Expected all tensors to be on the same device" — device_map="auto" on model load, .to("cuda") on inference inputs. Both. Not one.
- Colab disconnects mid-training — It will happen. save_steps=50 is the only mitigation. Resume with resume_from_checkpoint=True in your trainer.train() call.
What Most Guides Don't Tell You at the End
Decreasing training loss is not success. It is a prerequisite for success. The number that matters is how the model performs on data it has never seen — and beyond that, on the inputs real users will actually type, which are always stranger and more varied than anything in your eval set.
Before training: set aside 15 to 20 percent of your data. Don't touch it. After training: evaluate against only those held-out examples. If training accuracy is 94% and held-out accuracy is 71%, you have overfit. Lower the rank, reduce epochs, add more diverse examples. Then get someone who has not seen your training data to use the deployed model for 30 minutes without guidance. They will find failure cases your eval set never covered. They always do.
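The split itself is a few lines. A sketch, assuming the JSONL format from earlier — the fixed seed matters, because the same examples must stay held out across every run you compare:

```python
import json
import random

def split_heldout(lines: list[str], frac: float = 0.15, seed: int = 42):
    """Shuffle once with a fixed seed, carve off a held-out slice,
    and never let the held-out slice near the training run."""
    lines = list(lines)
    random.Random(seed).shuffle(lines)
    cut = max(1, int(len(lines) * frac))
    return lines[cut:], lines[:cut]  # (train, heldout)

# demo with synthetic pairs in the guide's JSONL shape
data = [json.dumps({"instruction": f"q{i}", "response": f"a{i}"})
        for i in range(100)]
train, heldout = split_heldout(data, frac=0.15)
print(len(train), len(heldout))  # 85 15
```

Write the two slices to separate .jsonl files and point load_dataset at the training file only; the held-out file exists solely for the post-training evaluation.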
The first time you run this, the goal should be to get the pipeline working end-to-end — not to build something production-ready. Get the pipeline working. Understand what happened at each step. Then build the real thing. The second run will be faster, better, and built on judgment rather than guesswork. That is where the real work begins.
Questions Worth Answering Directly
Do I need a machine learning background to do this?
No. You need Python basics and a high tolerance for reading error messages carefully. The libraries handle all the mathematics. What they cannot do is decide what data to use, how to clean it, or what "working well" means for your specific application. Those are judgment calls, and judgment comes from the domain — not from knowing what a gradient is.
How much training data do I need?
500 high-quality instruction-response pairs is a realistic floor. Below that, the training signal is too weak to produce consistent behavior. For production, 2,000–10,000 carefully cleaned examples is where reliable models live. Quality beats quantity without exception — 600 excellent examples will outperform 3,000 inconsistent ones every time.
Should I use LoRA or QLoRA?
On free hardware (Colab, Kaggle), QLoRA. The 4-bit quantization is what fits the model in 15 gigabytes. On a paid instance with 40+ GB VRAM, standard LoRA at 16-bit precision trains faster and has slightly cleaner loss dynamics. For most real-world domain tasks, the final quality difference between the two is too small to measure in practice.
Can I fine-tune GPT-4o or Claude this way?
No — they are closed-weight models. OpenAI offers fine-tuning via their API for GPT-4o, and Anthropic has enterprise fine-tuning for Claude, but both are expensive and you don't own the result. A fine-tuned Mistral 7B or Llama 3.1 8B is competitive with a fine-tuned closed model on most domain tasks — and you own it, self-host it, and pay nothing per inference call. For the vast majority of real applications, the open-source path is the right one.
How do I know whether the fine-tune actually worked?
When the held-out eval accuracy is high, the training-to-eval accuracy gap is small, and a human unfamiliar with the training data cannot find consistent failure modes in 30 minutes of use. All three. Not just the first one. Training loss going down is a necessary condition, not a sufficient one.
Which model should I start with?
Mistral 7B Instruct v0.2 or Llama 3.1 8B Instruct as of mid-2026. Both run on free hardware, both follow instructions well out of the box, and both have communities large enough that every problem you will hit has been solved publicly. Start with a 7B model. Complete a full fine-tune from dataset to deployment. Then decide if you need something larger. Most applications do not.

