Unsloth has emerged as a game-changer in the world of large language model (LLM) fine-tuning, addressing what has long been a resource-intensive and technically complex challenge. Adapting models like LLaMA, Mistral, or Qwen used to require powerful GPU clusters, intricate engineering, and significant costs. Unsloth changes this narrative by enabling fast, memory-efficient, and accessible fine-tuning, even on a single consumer-grade GPU.

This guide walks you through Unsloth from the ground up, starting from dataset preparation, moving through fine-tuning strategies, quantization optimizations, vision-language training, and finally to mastering Qwen2.5-VL-7B model fine-tuning. Let’s dive into this powerful ecosystem.
- What is Unsloth?
- Detailed Answer to Why Do We Need Unsloth?
- Recent Updates in Unsloth (as of 2024–2025)
- Unsloth's Support for GGUFs
- Planning Your Dataset for Unsloth Fine-Tuning
- Fine-Tuning Qwen2.5-VL-7B on LaTeX-OCR using Unsloth
- Insights after Fine-Tuning Qwen2.5-VL-7B using Unsloth
- Conclusion
- References
What is Unsloth?
Unsloth is a modern Python library designed to speed up and optimize fine-tuning large language models (LLMs) like LLaMA, Mistral, Mixtral, and others. It makes model training and fine-tuning extremely fast, memory-efficient, and easy, especially on limited hardware like a single GPU or even consumer-grade setups.

It’s been gaining attention because it allows users to:
- Fine-tune 9B parameter models on 24GB VRAM using LoRA 16-bit and just 6.5GB VRAM when using QLoRA 4-bit quantization.
- Increase training speeds by 2x–5x compared to traditional Hugging Face methods.
- Reduce memory usage by optimizing model internals.
- Support techniques like QLoRA (Quantized LoRA), 8-bit and 4-bit training, gradient checkpointing, etc.
Unsloth's training architecture plays a major role in memory efficiency. It doesn't just make LoRA memory-light: at small scales the adapter overhead is practically invisible, and even at larger scales it stays contained. That's part of what makes Unsloth special: you don't just save compute, you avoid waste entirely.
In simple words:
Unsloth = Speed + Memory efficiency + Simplicity for fine-tuning LLMs.
Detailed Answer to Why Do We Need Unsloth?
Breaking Down the Fine-Tuning Barrier
In the past, fine-tuning large models required full-precision (FP32) computation. This meant 80GB+ VRAM GPUs and monstrous energy bills. Unsloth removes these barriers by combining several innovations:
- QLoRA (Quantized Low-Rank Adaptation) enables fine-tuning models in 4-bit precision, cutting memory requirements by 70–80% with minimal quality loss.
In the fine-tuning section later in this post, we will see near-zero additional memory usage from the LoRA adapters during the 3,000-sample run. That isn't just because LoRA is lightweight; it's also because Unsloth is aggressively optimized to manage memory with surgical precision.
- PEFT (Parameter-Efficient Fine-Tuning) allows you to inject lightweight LoRA adapters into only a few critical layers (like Q, V, and output projections), avoiding the need to retrain billions of parameters.
- SFTTrainer, a customized trainer, ensures loss computation only happens over assistant outputs, aligning fine-tuning closely with real-world usage (a short sketch of this appears right after this list).
- Dynamic Quantization 2.0 refines the GGUF model export by adapting quantization intelligently layer-by-layer, preserving quality while maximizing speed. Discussed in more detail later in the post.
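As an illustration of the "loss only over assistant outputs" point above, recent Unsloth releases ship a train_on_responses_only helper that masks out everything except the assistant turns. A minimal sketch is shown below; the marker strings are Llama-3 chat markers and are an assumption here, so substitute the role markers of your own chat template.
# Hedged sketch: compute the loss only on assistant responses.
# Assumes an already-built TRL SFTTrainer (`trainer`) and a Llama-3 style template;
# replace the marker strings with the ones your template actually uses.
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)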
Speed: Faster Fine-Tuning and Inference
- Traditional training (using Hugging Face, bitsandbytes, DeepSpeed) is often slow because:
- It uses unoptimized implementations for attention, MLP layers, and memory copies.
- There’s overhead in applying adapters like LoRA manually.
- Not fully utilizing Flash Attention 2 or PyTorch compile.
- Unsloth addresses this by:
- Rewriting LLM internals (attention, MLP, normalization) for speed.
- Using Flash Attention 2 directly.
- Using PyTorch's compiler (torch.compile) for backend graph optimization.
- Merging QLoRA operations into the model, avoiding bottlenecks.
- Result
- 2x–5x faster training.
- Faster fine-tuning even with large sequence lengths (e.g., 4k–128k tokens).
Memory Efficiency: Train Big Models on Small GPUs
- Big models like Llama 13B and Mixtral 8x7B easily require 60GB–100GB VRAM without optimization.
- Even using 8-bit precision isn’t always enough.
- QLoRA helps (4-bit quantization), but Hugging Face QLoRA is still memory-heavy.
- Unsloth’s Advantages:
- True 4-bit quantization is done more intelligently.
- Paged optimizers and gradient checkpointing are built in.
- CPU offloading is optional if memory is still not enough.
- No redundant tensor copies (Hugging Face models sometimes copy tensors during training).
- Result
- You can fine-tune a 7B model on 5GB VRAM (QLoRA 4-bit quantized).
- You can fine-tune 13B models on 8GB VRAM (QLoRA 4-bit quantized).
- Gemma 3 (27B) fine-tuning fits with Unsloth in under 22GB of VRAM. It's also 1.6x faster.
Simplicity for Fine-Tuning
- Setting up Hugging Face + bitsandbytes + DeepSpeed + PEFT + QLoRA = complex (5+ libraries to sync and versions to match).
- Unsloth provides:
- Single API to load quantized models.
- Single call to add LoRA adapters.
- Native tokenizer handling.
- Example scripts to plug-and-play.
model, tokenizer = FastLanguageModel.from_pretrained(...)
model = FastLanguageModel.get_peft_model(...)
- No manual:
- LoRA insertion
- Bitsandbytes handling
- DeepSpeed configs
- Special optimizer configs
- Result
- 10 minutes to set up instead of hours (a minimal setup sketch follows below).
- Less chance of bugs like layer norms not quantizing properly.
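To make the "two-line setup" concrete, here is a minimal text-only sketch. The model name and hyperparameters below are illustrative placeholders, not a recommendation; adjust them to your hardware and task.
from unsloth import FastLanguageModel
# Load a 4-bit quantized base model (placeholder model name).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
)
# Attach LoRA adapters to the usual projection layers.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    lora_dropout = 0,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)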
Hardware Requirements Lowered
- Old days:
- You needed clusters (A100s, H100s).
- Costs: thousands per month.
- With Unsloth:
- 4090 24GB or A6000 48GB is enough for most 7B–13B models.
- Even MacBooks (M3, M2) with Metal backend support fine-tuning smaller models.
- AMD GPUs (ROCm) now work too.
- Result:
- Fine-tuning costs become hundreds, not thousands.
- Single-GPU setups (freelancers, startups, students) become powerful enough.
Better Training Techniques
- Gradient Checkpointing: Save VRAM during backprop.
- Paged Optimizers: Handle large parameter counts more efficiently.
- Long Context Windows: 4k–128k tokens natively.
- Flash Attention 2: Ultra-efficient attention calculation.
- Mixed precision: Smart bfloat16 and float16 handling.
Access to New Model Types (MoE, Llama2, Mixtral, Gemma)
- MoE (Mixture of Experts) models like Mixtral need special handling (routing tokens to experts).
- Traditional libraries are not optimized for MoE fine-tuning yet.
- Unsloth supports these natively.
Feature | Details |
Faster LoRA | LoRA (Low Rank Adaptation) is a method for fine-tuning that’s made even faster and lighter. |
Better QLoRA | QLoRA = Quantized LoRA (using 4-bit precision). Unsloth’s QLoRA is up to 2x faster than Hugging Face’s reference implementation. |
Memory Optimization | Rewrites attention, MLP (feed-forward), normalization, etc., to be more memory-efficient. |
Flash Attention 2 | Uses Flash Attention 2 for fast, memory-efficient attention computation. |
PyTorch 2.1+ | Leverages PyTorch compiler modes like torch.compile for even faster speeds. |
Multi Backend Support | Supports CUDA, AMD ROCm, and Apple’s Metal (MPS). |
It also supports special features like:
- Paged optimizers.
- Gradient checkpointing (saves VRAM).
- CPU offloading if needed.
Need | Why Unsloth? |
Speed | 2x–5x faster training |
Memory Efficiency | 30–50% less VRAM usage |
Simplicity | Easy 2-line setup |
Hardware Requirements | Run 13B models on 24GB VRAM |
Cost Saving | 5x–10x cheaper fine-tuning |
New Models | MoE, Llama2, Gemma easily supported |
Long Sequence | 128k tokens training possible |
Recent Updates in Unsloth (as of 2024–2025)
- Full support for Mixtral 8x7B MoE models.
- Native support for 128k context lengths (insanely long prompts).
- Automatic True 4-bit training support.
- Full ROCm (AMD) compatibility.
- Apple Silicon (MPS) optimization started.
Unsloth’s Support for GGUFs
Unsloth's Dynamic Quantization 2.0 sets a new standard for post-training model export. Rather than applying a one-size-fits-all quantization (which hurts critical reasoning layers), Unsloth analyzes each layer's sensitivity to compression, using a calibration dataset ranging from 300K to 1.5M tokens. Unsloth has also integrated robust support for GGUF (the llama.cpp file format that succeeded GGML), enabling users to:
- Export fine-tuned models to GGUF: Unsloth provides methods like model.save_pretrained_gguf() and model.push_to_hub_gguf() to save models in GGUF format, facilitating deployment across various platforms (a short example follows this list).
- Utilize Dynamic Quantization: With the introduction of Unsloth Dynamic v2.0, Unsloth employs intelligent layer-specific quantization strategies, enhancing model performance and efficiency in GGUF exports.
- Ensure Compatibility with Inference Engines: Models exported in GGUF format via Unsloth are compatible with inference engines like llama.cpp, Ollama, and Open WebUI, broadening deployment options.
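To make the export bullet above concrete, a typical call looks like the sketch below. The directory, repository name, token, and quantization method are placeholders; the argument names follow the Unsloth notebooks, so verify them against your installed version.
# Hedged sketch: export the fine-tuned model to GGUF (placeholders throughout).
model.save_pretrained_gguf("gguf_model", tokenizer, quantization_method = "q4_k_m")
# Or push the GGUF file directly to the Hugging Face Hub (needs a write token).
model.push_to_hub_gguf("your-username/your-model-gguf", tokenizer,
                       quantization_method = "q4_k_m", token = "hf_xxx")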
Key Features Introduced in Unsloth Dynamic v2.0 GGUFs
- Revamped Layer Selection + Safetensors Support
- Unlike static quantization, Dynamic v2.0 chooses quant types per layer intelligently.
- This means it quantizes every possible layer differently, depending on sensitivity.
- Uses a smarter method than older static QLoRA or GGUF conversions.
- Also supports exporting in .safetensors when needed.
- Dynamic Quantization for All Models (Not Just MoEs)
- Initially used only for MoE (Mixture of Experts) like DeepSeek-R1.
- Now supports all model types, including LLaMA, Mistral, Gemma, Mixtral, and more.
- Confirmed: “Dynamic 2.0 quantization now works on all models (including MoEs)”
- New Calibration Dataset for GGUFs
- Calibration uses high-quality data ranging from 300K to 1.5M tokens.
- Calibrated for chat quality and instruction-following, not just loss metrics.
- Produces better quantized GGUFs with minimal performance degradation.
- Model-Specific Quantization Schemes
- Example: Layers quantized in Gemma-3 differ from those in LLaMA-4.
- Unsloth uses a tailored quant plan for each architecture.
- This improves cross-device performance (especially for non-NVIDIA hardware).
- Support for More GGUF Quant Formats
- Unsloth Dynamic 2.0 now exports and supports:
- Q4_K_M, Q4_K_S
- IQ4_NL, Q5_0, Q5_1, Q6_K, Q8_0
- With nonlinear encoding (IQ4_NL, etc.) for enhanced CPU inference (Apple M chips, ARM).
Before proceeding to the code section: all of the code covered in this post is collected in one place and can be downloaded by clicking the 'Download Code' button below.
Planning Your Dataset for Unsloth Fine-Tuning
Why Your Dataset Structure Matters
No matter how powerful your training framework is, a poorly structured dataset can doom your fine-tuning. In Unsloth, datasets need to be cleanly tokenizable, role-tagged (user vs assistant), and aligned with model expectations.
When designing your dataset, think carefully about:
- Purpose: Are you creating a conversational agent? A code assistant? A domain-specific expert model?
- Output Style: Should the model output Markdown, plain text, HTML, or programming code?
- Source of Data: Is your data curated from open sources, synthetically generated via GPT models, or manually annotated?
A well-formed dataset is the foundation upon which your fine-tuning success will rest.
Supported Dataset Formats in Unsloth
Unsloth supports multiple common data formats:
- Raw Corpus: Large blocks of text — books, articles — used for continued pretraining (CPT).
{
"text": "Pasta carbonara is a traditional Roman pasta dish..."
}
- Instruction Format (Alpaca-style): Triplets of instruction, optional input, and output.
{
"instruction": "Task we want the model to perform.",
"input": "Optional user query or context.",
"output": "Expected response or result."
}
- Conversation Format (ShareGPT style): Multi-turn conversations where each message is role-tagged (user or assistant).
{
"conversations": [
{"from": "human", "value": "Can you help me make pasta carbonara?"},
{"from": "gpt", "value": "Would you like the traditional Roman recipe..."},
...
]
}
- RLHF Datasets: Datasets containing ranked preferences between different model outputs.
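For reference, RLHF/DPO-style preference data is usually stored as a prompt paired with a preferred and a rejected response. The field names below ("prompt", "chosen", "rejected") are a common convention rather than a requirement, and the content is purely illustrative.
{
  "prompt": "Explain what overfitting is.",
  "chosen": "Overfitting happens when a model memorizes the training data instead of learning patterns that generalize...",
  "rejected": "Overfitting is when the GPU runs out of memory."
}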
Format | Description | Training Type |
Raw Corpus | Unstructured raw text from books, articles, etc. | Continued Pretraining (CPT) |
Instruct | Instruction + output samples (e.g., Alpaca style) | Supervised Fine-tuning (SFT) |
Conversation | Multi-turn chat between the user and the assistant | Supervised Fine-tuning (SFT) or Dialogue Modeling |
RLHF | Chat with response rankings by humans or scripts | Reinforcement Learning from Human Feedback (RLHF) |
Each format suits a different fine-tuning goal: conversational agents, task-specific instruction followers, general language models, or preference-trained models.
Understanding Tokenization and Chat Templates
Tokenization – splitting text into tokens that models can understand – is a subtle but crucial step.
Bad tokenization leads to models confusing user inputs with assistant responses, making them hallucinate or answer incorrectly. Unsloth integrates customized chat templates that structure conversations into clear, unambiguous formats.
A proper dataset requires a well-defined chat template and consistent tokenization so models can:
- Understand roles (user vs assistant)
- Learn context boundaries (system prompts, assistant replies)
- Predict the next appropriate token accurately
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(tokenizer, chat_template="mistral")
Whether you use ChatML, ShareGPT, Alpaca, OpenChat, or Vicuna, it must be consistent across the dataset. The template governs how the text gets:
- Segmented
- Labeled (user/assistant/system)
- Encoded (with special tokens)
The template you use directly impacts tokenization, and thus the final embeddings and what the model learns.
Applying Chat Templates with Unsloth
Supported Chat Templates in Unsloth
from unsloth.chat_templates import CHAT_TEMPLATES
print(list(CHAT_TEMPLATES.keys()))
The supported chat templates are:
['unsloth', 'zephyr', 'chatml', 'mistral', 'llama', 'vicuna', 'vicuna_old', 'vicuna old', 'alpaca', 'gemma', 'gemma_chatml', 'gemma2', 'gemma2_chatml', 'llama-3', 'llama3', 'phi-3', 'phi-35', 'phi-3.5', 'llama-3.1', 'llama-31', 'llama-3.2', 'llama-3.3', 'llama-32', 'llama-33', 'qwen-2.5', 'qwen-25', 'qwen25', 'qwen2.5', 'phi-4', 'gemma-3', 'gemma3']
Using a Chat Template
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
tokenizer,
chat_template = "gemma-3" # Adjust as needed
)
Formatting function – This function loops through your dataset, applying the chat template you defined to each sample.
def formatting_prompts_func(examples):
    # add_generation_prompt=False: the assistant reply is already present in each training sample
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in examples["conversations"]
    ]
    return {"text": texts}
Loading the dataset
# Import and load dataset
from datasets import load_dataset
dataset = load_dataset("repo_name/dataset_name", split = "train")
# Apply the formatting function to your dataset using the map method
dataset = dataset.map(formatting_prompts_func, batched = True,)
If your dataset uses the ShareGPT format with "from"/"value" keys instead of the ChatML "role"/"content" format, you can use the standardize_sharegpt function to convert it first. The revised code will now look as follows:
# Import dataset
from datasets import load_dataset
dataset = load_dataset("mlabonne/FineTome-100k", split = "train")
# Convert your dataset to the "role"/"content" format if necessary
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
# Apply the formatting function to your dataset using the map method
dataset = dataset.map(formatting_prompts_func, batched = True,)
ShareGPT to ChatML Conversion
The code below takes a list of ShareGPT messages and turns it into a clean ChatML string.
def sharegpt_to_chatml(sharegpt_conversation, system_prompt="You are a helpful assistant.", add_default_system_prompt_if_missing=True):
"""
Converts a ShareGPT style conversation (list of dicts) into a ChatML string.
Handles common ShareGPT role keys ('from', 'role') and content keys ('value', 'content').
Handles common ShareGPT roles ('human', 'user', 'gpt', 'assistant', 'system').
"""
chatml_parts = []
has_system_prompt_in_data = False
for turn in sharegpt_conversation:
role_key = 'role' if 'role' in turn else 'from'
if turn.get(role_key) == "system":
has_system_prompt_in_data = True
break
if add_default_system_prompt_if_missing and not has_system_prompt_in_data and system_prompt:
chatml_parts.append(f"<|system|>{system_prompt.strip()}<|end|>")
for turn in sharegpt_conversation:
role_key = 'role' if 'role' in turn else 'from'
content_key = 'content' if 'content' in turn else 'value'
if role_key not in turn or content_key not in turn:
print(f"Skipping turn due to missing keys: {turn}")
continue
role = turn[role_key]
content = turn[content_key].strip()
if role in ["user", "human"]:
chatml_parts.append(f"<|user|>{content}<|end|>")
elif role in ["assistant", "gpt", "model"]:
chatml_parts.append(f"<|assistant|>{content}<|end|>")
elif role == "system":
chatml_parts.append(f"<|system|>{content}<|end|>")
else:
raise ValueError(f"Unknown role: {role} in turn: {turn}")
return "\n".join(chatml_parts)
ChatML to ShareGPT Conversion
The code below extracts user and assistant turns from ChatML text back into a clean ShareGPT-style list.
import re
def chatml_to_sharegpt(
chatml_text,
include_system_messages=False,
role_key_name="role", # or "from"
content_key_name="content" # or "value"
):
"""
Converts a ChatML formatted string back into ShareGPT list format.
Allows configuration for including system messages and output key names.
"""
pattern = r"<\|(\w+)\|>(.*?)<\|end\|>"
matches = re.findall(pattern, chatml_text, flags=re.DOTALL)
sharegpt_conversation = []
for role, content in matches:
role_standardized = role.lower()
if role_standardized == "system" and not include_system_messages:
continue
sharegpt_conversation.append({
role_key_name: role_standardized, # Use the standardized role
content_key_name: content.strip()
})
return sharegpt_conversation
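A quick round-trip check using the two helpers defined above; the sample conversation is made up for illustration.
# Round-trip: ShareGPT -> ChatML -> ShareGPT, using the helpers defined above.
sample = [
    {"from": "human", "value": "Can you help me make pasta carbonara?"},
    {"from": "gpt", "value": "Sure! Start with guanciale, eggs, and pecorino."},
]
chatml_text = sharegpt_to_chatml(sample)
print(chatml_text)
# <|system|>You are a helpful assistant.<|end|>
# <|user|>Can you help me make pasta carbonara?<|end|>
# <|assistant|>Sure! Start with guanciale, eggs, and pecorino.<|end|>
recovered = chatml_to_sharegpt(chatml_text, role_key_name="from", content_key_name="value")
print(recovered)
# [{'from': 'user', 'value': '...'}, {'from': 'assistant', 'value': '...'}]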
Multi-Turn Conversations in Unsloth (for Alpaca-style datasets)
Alpaca format is single-turn – one instruction, one output.
But LLMs like ChatGPT are designed to handle multi-turn conversations.
Unsloth introduces the conversation_extension feature to simulate multi-turn conversations using single-turn Alpaca data.
What it does:
- Randomly picks N samples from the dataset
- Merges them into one structured conversation (simulated)
- Lets the model learn context and flow between turns
Example Before and After:
Before (Single Turn):
{ "instruction": "What is 2+2?", "output": "2 + 2 equals 4." }
{ "instruction": "How are you?", "output": "I'm doing fine!" }
{ "instruction": "Flip a coin.", "output": "I got heads!" }
After (conversation_extension = 3):
{
"instruction": "What is 2+2?",
"output": "2 + 2 equals 4."
},
{
"instruction": "Flip a coin.",
"output": "I got heads!"
},
{
"instruction": "How are you?",
"output": "I'm doing fine!"
}
It becomes a fake but plausible multi-turn chat, which significantly improves SFT (Supervised Fine-Tuning) quality for dialog.
How to Use It Practically
- Set conversation_extension = N, where N = the number of rows to stitch into one conversation.
- Set output_column_name = the name of the output column, usually "output" in Alpaca.
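In recent Unsloth notebooks these options are passed to the to_sharegpt helper, which also merges the instruction and optional input columns into a single user turn. The sketch below follows that notebook API and may differ slightly between versions, so treat the argument names as an assumption to verify.
# Hedged sketch: stitch single-turn Alpaca rows into simulated multi-turn conversations.
from unsloth import to_sharegpt
dataset = to_sharegpt(
    dataset,
    merged_prompt = "{instruction}[[\nYour input is:\n{input}]]",  # optional input wrapped in [[...]]
    output_column_name = "output",   # column holding the assistant reply
    conversation_extension = 3,      # stitch 3 random rows into one conversation
)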
Fine-Tuning Qwen2.5-VL-7B on LaTeX-OCR using Unsloth
To demonstrate Unsloth’s multi-modal fine-tuning capabilities in practice, we chose a real-world task that requires both visual and textual understanding: converting mathematical expressions in images to LaTeX.
For this experiment, we are going to use the LaTeX-OCR dataset, which pairs rendered math images with their corresponding LaTeX markup. This makes it an ideal benchmark for evaluating vision-language model performance in structured output generation. In one of our previous blog posts, we showed fine-tuning the Gemma 3 4B model on the same LaTeX-OCR dataset. However, a direct comparison isn't entirely fair because of the difference in parameter counts and the way Gemma 3 and Qwen2.5-VL handle their vision and attention layers.
Still, we will look at memory and time statistics from fine-tuning both models on the same training samples and with the same LoRA and SFT configs, which sheds some light on the fine-tuning efficiency gained by using Unsloth.
Loading the Qwen2.5-VL-7B Model with Unsloth
from unsloth import FastVisionModel # FastLanguageModel for LLMs
import torch
model, tokenizer = FastVisionModel.from_pretrained(
"unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit",
load_in_4bit = True, # Use 4bit to reduce memory use. False for 16bit LoRA.
use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
)
This snippet initializes the Qwen2.5-VL-7B-Instruct model using Unsloth's FastVisionModel, which is designed for vision-language fine-tuning. The model is loaded in 4-bit precision (load_in_4bit=True) to significantly reduce GPU memory usage. Additionally, use_gradient_checkpointing="unsloth" enables memory-efficient backpropagation, allowing longer input sequences without exceeding VRAM limits.
Applying LoRA with Fine-Tuning Control Over Vision & Language Layers
model = FastVisionModel.get_peft_model(
model,
finetune_vision_layers = True, # False if not finetuning vision layers
finetune_language_layers = True, # False if not finetuning language layers
finetune_attention_modules = True, # False if not finetuning attention layers
finetune_mlp_modules = True, # False if not finetuning MLP layers
r = 8, # The larger, the higher the accuracy, but might overfit
lora_alpha = 16, # Recommended alpha == r at least
lora_dropout = 0,
bias = "none",
random_state = 3407,
use_rslora = False, # We support rank stabilized LoRA
loftq_config = None, # And LoftQ
    target_modules = ["down_proj", "o_proj", "k_proj", "q_proj", "gate_proj", "up_proj", "v_proj"],  # pass a list, not a string
)
This snippet configures LoRA (Low-Rank Adaptation) for the Qwen2.5-VL model using get_peft_model. It enables fine-tuning of the vision, language, attention, and MLP layers. By setting use_rslora=False, it opts out of rank-stabilized LoRA. This setup gives full control over which parts of the model are fine-tuned while maintaining memory and performance efficiency.
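A quick sanity check after attaching the adapters is to print how many parameters are actually trainable. The returned model is PEFT-wrapped, so the standard PEFT helper below should be available; verify on your Unsloth version.
# Only the LoRA adapter weights should be trainable; the 4-bit base model stays frozen.
model.print_trainable_parameters()
# Illustrative output: trainable params: ~40M || all params: ~8B || trainable%: ~0.5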
Formatting Samples into Chat-Like Vision-Language Pairs
from datasets import load_dataset
dataset = load_dataset("unsloth/LaTeX_OCR", split = "train[:3000]")
instruction = "Write the LaTeX representation for this image."
def convert_to_conversation(sample):
conversation = [
{ "role": "user",
"content" : [
{"type" : "text", "text" : instruction},
{"type" : "image", "image" : sample["image"]} ]
},
{ "role" : "assistant",
"content" : [
{"type" : "text", "text" : sample["text"]} ]
},
]
return { "messages" : conversation }
pass
converted_dataset = [convert_to_conversation(sample) for sample in dataset]
This function converts a raw image-text pair into a ChatML-style message format compatible with Unsloth’s vision models. It simulates a user asking the assistant to write the LaTeX representation for the given image.
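Before moving on, it is worth spot-checking one converted sample to confirm the structure the vision data collator will receive; this is just an illustrative check on the list built above.
# Quick sanity check on the first converted sample.
first = converted_dataset[0]["messages"]
print(first[0]["role"], [c["type"] for c in first[0]["content"]])   # user ['text', 'image']
print(first[1]["role"], first[1]["content"][0]["text"][:60])        # assistant + first 60 chars of LaTeX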
Performing Inference with the Fine-Tuned Vision-Language Model
FastVisionModel.for_inference(model) # Enable for inference!
image = dataset[2]["image"]
instruction = "Write the LaTeX representation for this image."
messages = [
{"role": "user", "content": [
{"type": "image"},
{"type": "text", "text": instruction}
]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(
image,
input_text,
add_special_tokens = False,
return_tensors = "pt",
).to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
use_cache = True, temperature = 1.5, min_p = 0.1)
The above code block demonstrates how to run inference with the fine-tuned Qwen2.5-VL model. It first enables inference mode using FastVisionModel.for_inference(model), then constructs a multi-modal message that includes an image and a LaTeX instruction. The message is passed through apply_chat_template() to format it appropriately. After tokenizing the inputs, the model generates a response using .generate() with streaming output via TextStreamer. The temperature and min_p settings control sampling diversity and creativity.
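If you want the generated LaTeX as a string instead of streamed console output, you can decode the generated ids directly. This is a small variation on the snippet above, not a separate API.
# Variant without streaming: capture and decode the generated tokens.
outputs = model.generate(**inputs, max_new_tokens = 128, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
# Strip the prompt tokens so only the model's answer remains.
new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
generated_latex = tokenizer.batch_decode(new_tokens, skip_special_tokens = True)[0]
print(generated_latex)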
Configuring the Trainer for Vision-Language Fine-Tuning
from unsloth import is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig
FastVisionModel.for_training(model) # Enable for training!
trainer = SFTTrainer(
model = model,
tokenizer = tokenizer,
data_collator = UnslothVisionDataCollator(model, tokenizer),
train_dataset = converted_dataset,
args = SFTConfig(
per_device_train_batch_size = 1,
gradient_accumulation_steps = 4,
warmup_steps = 10,
#max_steps = 30,
num_train_epochs = 1,
learning_rate = 2e-4,
fp16 = not is_bf16_supported(),
bf16 = is_bf16_supported(),
logging_steps = 200,
save_strategy='steps',
save_steps=200,
save_total_limit=2,
optim = "adamw_8bit",
weight_decay = 0.01,
lr_scheduler_type = "linear",
seed = 3407,
output_dir = "outputs",
        report_to = "none",  # set to "wandb" to log to Weights & Biases
# You MUST put the below items for vision finetuning:
remove_unused_columns = False,
dataset_text_field = "",
dataset_kwargs = {"skip_prepare_dataset": True},
dataset_num_proc = 4,
max_seq_length = 2048,
),
)
The above snippet sets up the training loop using SFTTrainer from the TRL library, optimized for Unsloth's vision-language support. It enables training mode with FastVisionModel.for_training(model) and uses UnslothVisionDataCollator, which is required to batch multi-modal inputs (text + image) correctly. The training configuration includes memory-efficient options like the adamw_8bit optimizer, dynamic bfloat16/float16 precision handling, and a small batch size with gradient accumulation. Additional parameters specific to vision fine-tuning are provided, like remove_unused_columns=False and dataset_kwargs, which ensure proper image-text pairing during training.
Monitoring GPU Memory Usage Before Training
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
This code displays current GPU stats to help monitor memory availability before training. It fetches the total GPU memory and the amount already reserved by PyTorch using torch.cuda.
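Between this measurement cell and the one that follows, the training run itself is launched; its return value is the trainer_stats object the next snippet reads from.
# Launch fine-tuning. trainer_stats.metrics (e.g., 'train_runtime') is used by the reporting cell below.
trainer_stats = trainer.train()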
Tracking Final GPU Memory and Training Time Usage
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
The above code logs detailed resource statistics after training is complete. It calculates total GPU memory used, LoRA-specific memory usage, and their respective percentages of the total GPU capacity. It also prints the total training time in seconds and minutes.
Insights after Fine-Tuning Qwen2.5-VL-7B using Unsloth
Memory Usage before Fine-Tuning
GPU = Tesla T4. Max memory = 14.741 GB.
7.111 GB of memory reserved.
Before training, approximately 7.11 GB of GPU memory was reserved while loading the Qwen2.5-VL model in 4-bit precision. This is impressively low given that vision-language models typically require upwards of 15–20 GB even before training begins.
Memory Usage and Time Consumption During Fine-Tuning
Step Training Loss
200 0.254500
400 0.094800
600 0.090900
Unsloth: Will smartly offload gradients to save VRAM!
2921.7907 seconds used for training.
48.7 minutes used for training.
Peak reserved memory = 7.111 GB.
Peak reserved memory for training or used memory for lora = 0.0 GB.
Peak reserved memory % of max memory = 48.24 %.
Peak reserved memory for training % of max memory = 0.0 %.
The training metrics demonstrate exactly why Unsloth is uniquely suited for memory- and time-efficient fine-tuning of large models, even on mid-range GPUs like the Tesla T4.
- The peak reserved memory usage stayed under 50% of total available GPU capacity (14.7 GB), confirming that LoRA with 4-bit quantization significantly reduces the VRAM footprint.
- Memory used specifically by the LoRA modules was negligible (0.0 GB), further validating the parameter-efficient nature of PEFT-based fine-tuning in Unsloth.
- The training session was completed in just 48.7 minutes (≈2,922 seconds), highlighting how Unsloth's integration of gradient checkpointing, the 8-bit AdamW optimizer, and smart memory allocation offers speed without resource waste.
Despite Qwen2.5-VL being nearly twice the parameter size (7B vs 4B), Unsloth completed the full fine-tuning in roughly 49 minutes, whereas Gemma 3 4B took 1 hour and 2 minutes, over 20% longer, using a traditional TRL-based setup.
Even though Gemma 3 is smaller, model size alone does not guarantee faster or more efficient training; tooling matters. Architectural differences in how the two models handle their inner layers certainly play a role, but a model with roughly double the parameters, trained with the same configuration, still finished in less time and consumed far less memory than the Gemma 3 4B run. Unsloth's deeply optimized training pipeline clearly outperforms traditional setups, especially for resource-constrained environments or multi-modal tasks.
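Once you are satisfied with the results, the LoRA adapters (or a merged model) can be saved. The calls below follow the Unsloth notebooks, with placeholder directory names; verify the merged-export method against your installed version.
# Save only the LoRA adapters (small, quick to reload with FastVisionModel.from_pretrained).
model.save_pretrained("qwen2_5_vl_latex_lora")
tokenizer.save_pretrained("qwen2_5_vl_latex_lora")
# Or export a merged 16-bit model for standard Transformers inference.
model.save_pretrained_merged("qwen2_5_vl_latex_merged", tokenizer, save_method = "merged_16bit")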
Conclusion
Fine-tuning large language models, especially vision-language models, has traditionally been a high-resource, high-friction process. But tools like Unsloth are fundamentally changing that equation. Unsloth simplifies training with:
- 4-bit QLoRA quantization, dramatically reducing VRAM requirements without sacrificing performance,
- LoRA adapter injection into both vision and language components for efficient task-specific learning,
- Intelligent batching and data collators tailored for multi-modal datasets,
- And low-level optimizations like gradient checkpointing and paged optimizers that keep memory usage under control even during intensive training runs.
In our benchmarking experiment, training completed in under an hour, with memory usage consistently staying under 50% of GPU capacity and minimal overhead from the LoRA adapters. These results highlight that model size alone is not the bottleneck; the tools we use are just as critical. And Unsloth delivers on all fronts: speed, memory efficiency, modularity, and simplicity.