LLM Fine-Tuning for Beginners — Tech.Journalism

The first time Priya tried to get an AI to answer questions about her company's leave policy, it told an employee she had thirty days of earned leave. She had twelve. It wasn't a small error — it was the kind that ends up in HR complaints. She spent three weeks rewriting her system prompt. The model improved, then got worse, then got better in different ways. It was never reliable. Eventually someone suggested she stop fighting the model and instead teach it. She fine-tuned it on her actual policy documents over a weekend. On Monday morning, it answered every test question correctly, and it has held up in production since.

That gap — between a model that approximately does what you want and one that reliably does exactly what you want — is where fine-tuning lives. And in 2026, closing that gap has become something any developer can do, on free hardware, in an afternoon. The research infrastructure that once made this impossible has been replaced by four libraries, a Google account, and about two hours of your time.

This is the guide I wish existed when I started. Not the one that explains what a transformer is. The one that tells you what actually matters, what will actually go wrong, and how to build something that works in the real world.

· · ·

The Problem With Prompting Your Way to Reliability

There is a version of every AI project where someone is extremely confident that the right prompt will fix everything. The prompt gets longer. It gains subsections. At some point it reads less like instructions and more like a legal document written by someone who has been burned before. And still — still — the model occasionally ignores it entirely and does something baffling.

This is not a failure of cleverness. It is a structural limitation. When you prompt a general-purpose model, you are borrowing behavior from a system trained to do everything. You are not changing what it knows or how it thinks. You are asking it, very politely, to pretend to be something more specific. Under enough pressure — unusual phrasing, edge cases, long conversations — that pretence breaks. The underlying model reasserts itself.

Fine-tuning does something fundamentally different. Instead of instructing the model at inference time, you change the model's weights during training. The behavior you want stops being a request. It becomes the model's actual nature. A fine-tuned model doesn't consult your system prompt and decide to comply. It responds the way it was trained to respond, because that is now what it is.

Fine-tuning is not a better prompt. It is a different kind of thing entirely — the difference between telling someone how to act and actually changing who they are.

The core distinction most guides don't make clearly enough

The other two approaches — RAG and full fine-tuning — solve different problems. RAG keeps the model's knowledge current: product catalogues updated weekly, policy documents revised monthly, databases too large to memorize. What it cannot do is change how the model speaks, what it refuses, or the instincts it reaches for when a question is ambiguous. Full fine-tuning gives you maximum control at maximum cost — hardware that runs thousands of dollars per training run, for results that are rarely meaningfully better than LoRA for real-world tasks.

LoRA is where you should start. Almost always.

· · ·

Why a 7-Billion Parameter Model Now Fits in a Free GPU

A language model with 7 billion parameters contains 7 billion individual numbers. Full fine-tuning means updating all of them — which requires storing weights, gradients, and optimizer states simultaneously. For a 7B model at standard 16-bit precision, that is 80 to 160 gigabytes of GPU memory. That is not a Colab notebook. That is an A100 cluster.
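The arithmetic behind that range is worth seeing once. A back-of-envelope sketch, assuming fp16 weights and gradients with a standard Adam optimizer keeping fp32 moment estimates (activations and fp32 master weights push the real number higher still):

```python
# Rough VRAM estimate for FULL fine-tuning a 7B model with Adam.
params = 7e9

weights   = params * 2  # fp16 weights: 2 bytes per parameter
gradients = params * 2  # fp16 gradients
adam_m    = params * 4  # Adam first moment, kept in fp32
adam_v    = params * 4  # Adam second moment, kept in fp32

total_gb = (weights + gradients + adam_m + adam_v) / 1e9
print(f"{total_gb:.0f} GB before activations")  # 84 GB
```

Add fp32 master weights (another 28 GB) and activation memory on top, and you land squarely in that 80-to-160 gigabyte range.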

LoRA — Low-Rank Adaptation, published in 2021 — makes this tractable through an elegant insight: you do not need to update all the weights to change how the model behaves. Instead, you freeze the original weights entirely and inject small trainable adapter matrices alongside specific layers. These adapters contain 0.1% to 1% of the original parameter count. Only the adapters change during training.

The result is astonishing in practice: training under 1% of a model's parameters delivers 85 to 95% of the quality of updating all of them. For domain-specific tasks — making a model talk about your product correctly, follow your format reliably, respond in your brand voice — the gap is invisible to end users.
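The parameter arithmetic makes that saving concrete. A sketch for a single square projection matrix, using d=4096 (Mistral 7B's hidden size) and rank 16; real models apply this across many layers, so the totals differ, but the ratio is the point:

```python
# LoRA freezes a d×d weight matrix and trains two thin matrices
# instead: B (d×r) and A (r×d). Their product has the same shape
# as the frozen weight, so it can be added on top at inference.
d, r = 4096, 16

full_params = d * d      # parameters in the frozen matrix
lora_params = 2 * d * r  # parameters in the A and B adapter pair

print(full_params)   # 16777216
print(lora_params)   # 131072
print(f"{lora_params / full_params:.2%}")  # 0.78%
```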

QLoRA goes further by loading the base model in 4-bit precision instead of 16-bit. Four-bit quantization compresses weights to a quarter of their 16-bit footprint. Mistral 7B's weights drop from roughly 14 gigabytes to under 5. A free Colab T4 GPU has 15 gigabytes. That is the entire distance between "requires a research lab" and "open a browser tab and start training."

Honest tradeoff

QLoRA trains slower than full-precision LoRA, and the 4-bit compression adds a small amount of noise to gradient updates. For datasets under 50,000 examples on a single domain task, this doesn't matter. The output quality is indistinguishable from standard LoRA for the use cases in this guide.

· · ·

The Free Hardware You Already Have Access To

Google Colab free tier gives you a T4 GPU with 15 gigabytes of VRAM. Sessions disconnect after a few hours of inactivity and GPU availability isn't guaranteed during peak times. For getting the process working and training on small datasets, it's enough. One rule: set checkpoints. Losing a training run to an unceremonious timeout is an experience worth avoiding once.

Kaggle Notebooks are strictly better for serious work — two T4 GPUs simultaneously (30 GB combined) for 30 hours per week, free. Sessions are stable. The filesystem persists within a session. Once you have the pipeline working, Kaggle is where to do the real training.

When your datasets grow beyond 10,000 examples, RunPod and Lambda Labs rent A100s by the hour for a few dollars. But that is a problem for later. Start free. Get the pipeline working completely before spending a dollar on hardware.

Colab survival rule

Set save_steps=50 before training begins. Colab will disconnect eventually. A mid-run disconnection with no checkpoints saved is genuinely demoralizing. Ask me how I know.

· · ·

The Fine-Tuning Process, Step by Step

We are fine-tuning Mistral 7B Instruct v0.2 using QLoRA. It is the right first model: small enough for free hardware, capable enough for production use, documented well enough that every error you will hit has been solved publicly. Open a Colab or Kaggle notebook, connect a GPU runtime, and run each block in order.

Step 1: Install the libraries

One cell, three minutes. peft handles LoRA adapter logic. bitsandbytes does the 4-bit quantization. trl provides the SFTTrainer that simplifies the training loop considerably.

bash
!pip install -q transformers datasets peft \
           bitsandbytes accelerate trl \
           huggingface_hub

Step 2: Load the model in 4-bit

This is QLoRA happening. The BitsAndBytesConfig block compresses weights to 4-bit as they load into VRAM — Mistral 7B's weights drop from roughly 14 GB to under 5 GB. The last line is easy to miss and causes confusing errors downstream. Don't skip it.

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit config — this is what makes QLoRA possible on free hardware
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model_id = "mistralai/Mistral-7B-Instruct-v0.2"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # don't skip this

Step 3: Attach LoRA adapters

Two numbers define the adapter's capacity. r is the rank — higher means more parameters, more capacity to learn complex behaviors, more memory usage. Start at 16. If the model isn't improving enough, try 32. If you hit memory errors, try 8. Running print_trainable_parameters() will show something like 21M / 3.75B / 0.56% — that is why this fits in 15 gigabytes.

python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,               # adapter rank — tune if results are poor
    lora_alpha=32,      # scaling factor, conventionally 2× rank
    target_modules=[
        "q_proj", "k_proj", "v_proj",
        "o_proj", "gate_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 21,000,000 || all params: 3,750,000,000 || trainable%: 0.56

Step 4: Prepare your data — this is where fine-tuning actually wins or loses

Most guides spend a paragraph on data and several pages on model configuration. This is exactly backwards. The model configuration barely moves the needle. Your data is everything. A mediocre LoRA setup with excellent training data beats a perfectly tuned LoRA with inconsistent data, and it is not close.

You need instruction-response pairs in JSONL format — one JSON object per line. The consistency of your examples sets the ceiling on what the model can learn. Inconsistent tone, mixed formats, ambiguous answers — these do not average out during training. They produce a model that is inconsistently toned, inconsistently formatted, and inconsistently ambiguous.

your_data.jsonl: one JSON object per line, nothing else (JSONL allows no comment lines), and be obsessively consistent with format
{"instruction": "What cities have same-day delivery?", "response": "Same-day delivery is available in Bengaluru, Chennai, Hyderabad, and Pune. All other locations are 2–3 business days."}
{"instruction": "Can I change my address after ordering?", "response": "Yes — within 30 minutes of placing your order. After that window, contact our support team directly."}
python
from datasets import load_dataset

dataset = load_dataset("json", data_files="your_data.jsonl")

def format_prompt(example):
    return {"text": f"[INST] {example['instruction']} [/INST] {example['response']}"}

dataset = dataset.map(format_prompt)
train_dataset = dataset["train"]
On dataset size

500 high-quality pairs is the realistic floor. For production, 2,000–10,000 carefully cleaned examples is the right range. If you have 10,000 mediocre examples and 600 excellent ones — delete the 10,000. The excellent examples will win.
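Given how much rides on consistency, it pays to machine-check the file rather than trust your eyes. A minimal sanity-check sketch, assuming the instruction/response schema used in this guide; the function and its checks are illustrative, not from any library:

```python
import json

REQUIRED_KEYS = {"instruction", "response"}

def validate_jsonl(lines):
    """Check instruction-response pairs; return a (line_no, problem) list."""
    problems = []
    for n, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are harmless
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((n, f"invalid JSON: {exc}"))
            continue
        if set(obj) != REQUIRED_KEYS:
            problems.append((n, f"unexpected keys: {sorted(obj)}"))
        elif not all(str(v).strip() for v in obj.values()):
            problems.append((n, "empty instruction or response"))
    return problems

sample = [
    '{"instruction": "Q?", "response": "A."}',
    '{"instruction": "Q?", "answer": "wrong key"}',
]
print(validate_jsonl(sample))  # flags line 2 for its wrong key
```

On the real file, pass the open file object: `validate_jsonl(open("your_data.jsonl", encoding="utf-8"))`. An empty list means the file is structurally clean; tone and content consistency still need human eyes.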

Step 5: Train

gradient_accumulation_steps=8 with batch size 2 creates an effective batch of 16 — without needing 16 examples in VRAM simultaneously. On a 15 GB GPU, this is often necessary. save_steps=50 is not optional on Colab.

python
from trl import SFTTrainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch = 16
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_steps=50,                   # non-negotiable on Colab
    warmup_ratio=0.03,
    report_to="none",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_args,
)

trainer.train()  # ~45–70 min on Colab T4 with 1,000 examples

Watch the loss. It should fall across the first 50 to 100 steps. If it is still flat or rising at step 150, something is wrong — most likely the learning rate (try 5e-5) or inconsistent data. A loss that oscillates wildly throughout usually means your training examples are fighting each other.
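Even with report_to="none", the trainer keeps every logged loss in trainer.state.log_history, so you can check the trend programmatically rather than eyeballing it. A rough sketch of the heuristics above; the window size and thresholds are arbitrary illustrative choices, not established diagnostics:

```python
def diagnose_loss(losses, window=5):
    """Crude health check on a sequence of logged training losses."""
    if len(losses) < 2 * window:
        return "too early to tell"
    early = sum(losses[:window]) / window
    late = sum(losses[-window:]) / window
    # mean step-to-step jump, as a rough measure of oscillation
    jumps = [abs(b - a) for a, b in zip(losses, losses[1:])]
    wobble = sum(jumps) / len(jumps)
    if late > 0.98 * early:
        return "flat or rising: check learning rate and data consistency"
    if wobble > 0.5 * early:
        return "oscillating: examples may be fighting each other"
    return "falling: looks healthy"

# With a live trainer, pull the numbers like this:
# losses = [e["loss"] for e in trainer.state.log_history if "loss" in e]
print(diagnose_loss([2.1, 1.8, 1.5, 1.2, 1.0, 0.9, 0.8, 0.75, 0.7, 0.68]))
# falling: looks healthy
```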

Step 6: Save, merge, and verify

Save the adapters separately for small portable files. Merge them into the base model for faster inference in production. Merging is right for most deployment cases.

python
trainer.save_model("./lora-adapters")  # small, portable adapter files

# Merge the trained adapters into the base model for cleaner deployment.
# `model` already carries them from training, so merge it directly.
# (Merging a 4-bit base dequantizes the weights in the process.)
merged = model.merge_and_unload()

# Test it
inputs = tokenizer(
    "[INST] What cities have same-day delivery? [/INST]",
    return_tensors="pt"
).to("cuda")

out = merged.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(out[0], skip_special_tokens=True))

If the response matches the style and content of your training data — you just fine-tuned a large language model. On free hardware. In an afternoon. The rest is refinement.

· · ·

The Real Comparison

| Factor | LoRA / QLoRA | Full Fine-Tuning | RAG |
|---|---|---|---|
| GPU required | Free T4 — 15 GB | A100 80 GB+ | CPU sufficient |
| Training time | 30 min – 2 hrs | Hours to days | Minutes (indexing) |
| Knowledge updates | Requires retraining | Requires retraining | Update docs only |
| Tone & behavior | Excellent | Excellent | Limited |
| Factual accuracy | Good | Good | Excellent |
| Inference cost | Self-hosted — free | Self-hosted — free | API cost per call |
| Best for | Behavior, voice, domain mastery | Research, maximum ceiling | Frequently-changing documents |
· · ·

The Numbers From a Real Production Run

The HR policy bot at the start of this piece is a real project. The dataset was 1,400 instruction-response pairs built from six months of support tickets. The raw tickets were unusable — HR staff write replies in shorthand that assumes context the model would never have. Turning them into clean, self-contained pairs took a weekend. That weekend was the most important part of the entire project. The training run itself was almost uneventful.

Production run — HR policy chatbot

Dataset: 1,400 instruction-response pairs cleaned from 6 months of HR tickets
Base model: Mistral 7B Instruct v0.2
Training time: 52 minutes on Kaggle dual T4
Held-out accuracy: 94%, vs 61% with the best prompting approach
Hallucinations in 200 test queries: zero
Hosting cost: ₹0 / month on Hugging Face Spaces

The training loop rewards the work you did on the data. It does not compensate for the work you skipped.

· · ·

The Errors That Will Find You

  • CUDA out of memory: drop per_device_train_batch_size to 1 and raise gradient_accumulation_steps to 16. The effective batch size is unchanged. The VRAM pressure is not.
  • Loss flat or rising after step 150: try learning_rate=5e-5 instead of 2e-4. If that doesn't help, manually read 30 random training examples. Inconsistency in the data causes this more often than any hyperparameter problem.
  • Model looping or producing garbage: the pad token line. Add tokenizer.pad_token = tokenizer.eos_token before training. It's in the code above. If it's not in yours, that's why.
  • "Expected all tensors to be on the same device": device_map="auto" on model load, .to("cuda") on inference inputs. Both. Not one.
  • Colab disconnects mid-training: it will happen. save_steps=50 is the only mitigation. Resume with trainer.train(resume_from_checkpoint=True).
· · ·

What Most Guides Don't Tell You at the End

Decreasing training loss is not success. It is a prerequisite for success. The number that matters is how the model performs on data it has never seen — and beyond that, on the inputs real users will actually type, which are always stranger and more varied than anything in your eval set.

Before training: set aside 15 to 20 percent of your data. Don't touch it. After training: evaluate against only those held-out examples. If training accuracy is 94% and held-out accuracy is 71%, you have overfit. Lower the rank, reduce epochs, add more diverse examples. Then get someone who has not seen your training data to use the deployed model for 30 minutes without guidance. They will find failure cases your eval set never covered. They always do.
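The split is mechanical; the discipline is in never touching it. A minimal sketch in plain Python, assuming the instruction/response pairs from step 4 (the 15 percent figure and the seed are just this section's conventions):

```python
import random

# Stand-in for your real pairs, loaded from your_data.jsonl
pairs = [{"instruction": f"q{i}", "response": f"a{i}"} for i in range(100)]

# Shuffle once with a fixed seed, then carve off the held-out set
# BEFORE training. Never train on it; never peek while iterating.
rng = random.Random(42)
rng.shuffle(pairs)

n_held = len(pairs) * 15 // 100
held_out, train_pairs = pairs[:n_held], pairs[n_held:]

print(len(train_pairs), len(held_out))  # 85 15
```

With the datasets library from step 4, the equivalent is `dataset["train"].train_test_split(test_size=0.15, seed=42)`, which returns the two splits under the keys "train" and "test".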

The first time you run this, the goal should be to get the pipeline working end-to-end — not to build something production-ready. Get the pipeline working. Understand what happened at each step. Then build the real thing. The second run will be faster, better, and built on judgment rather than guesswork. That is where the real work begins.

· · ·

Questions Worth Answering Directly

  • Do I need a machine learning background? No. You need Python basics and a high tolerance for reading error messages carefully. The libraries handle all the mathematics. What they cannot do is decide what data to use, how to clean it, or what "working well" means for your specific application. Those are judgment calls, and judgment comes from the domain — not from knowing what a gradient is.
  • How much data do I need? 500 high-quality instruction-response pairs is a realistic floor. Below that, the training signal is too weak to produce consistent behavior. For production, 2,000–10,000 carefully cleaned examples is where reliable models live. Quality beats quantity without exception — 600 excellent examples will outperform 3,000 inconsistent ones every time.
  • Should I use LoRA or QLoRA? On free hardware (Colab, Kaggle), QLoRA. The 4-bit quantization is what fits the model in 15 gigabytes. On a paid instance with 40+ GB VRAM, standard LoRA at 16-bit precision trains faster and has slightly cleaner loss dynamics. For most real-world domain tasks, the final quality difference between the two is too small to measure in practice.
  • Can I fine-tune GPT-4o or Claude this way? No — they are closed-weight models. OpenAI offers fine-tuning via their API for GPT-4o, and Anthropic has enterprise fine-tuning for Claude, but both are expensive and you don't own the result. A fine-tuned Mistral 7B or Llama 3.1 8B is competitive with a fine-tuned closed model on most domain tasks — and you own it, self-host it, and pay nothing per inference call. For the vast majority of real applications, the open-source path is the right one.
  • How do I know the fine-tune actually worked? When the held-out eval accuracy is high, the training-to-eval accuracy gap is small, and a human unfamiliar with the training data cannot find consistent failure modes in 30 minutes of use. All three. Not just the first one. Training loss going down is a necessary condition, not a sufficient one.
  • Which model should I start with? Mistral 7B Instruct v0.2 or Llama 3.1 8B Instruct as of mid-2026. Both run on free hardware, both follow instructions well out of the box, and both have communities large enough that every problem you will hit has been solved publicly. Start with a 7B model. Complete a full fine-tune from dataset to deployment. Then decide if you need something larger. Most applications do not.