Unsloth: A Guide from Basics to Fine-Tuning Vision Models


Unsloth has emerged as a game-changer in the world of large language model (LLM) fine-tuning, addressing what has long been a resource-intensive and technically complex challenge. Adapting models like LLaMA, Mistral, or Qwen used to require powerful GPU clusters, intricate engineering, and significant costs. Unsloth changes this narrative by enabling fast, memory-efficient, and accessible fine-tuning, even on a single consumer-grade GPU.

This guide walks you through Unsloth from the ground up, starting from dataset preparation, moving through fine-tuning strategies, quantization optimizations, vision-language training, and finally to mastering Qwen2.5-VL-7B model fine-tuning. Let’s dive into this powerful ecosystem.

  1. What is Unsloth?
  2. Detailed Answer to Why Do We Need Unsloth?
    1. Breaking Down the Fine-Tuning Barrier
    2. Speed: Faster Fine-Tuning and Inference
    3. Memory Efficiency: Train Big Models on Small GPUs
    4. Simplicity for Fine-Tuning
    5. Hardware Requirements Lowered
    6. Better Training Techniques
    7. Access to New Model Types (MoE, Llama2, Mixtral, Gemma)
  3. Recent Updates in Unsloth (as of 2024–2025)
  4. Unsloth’s Support for GGUFs
    1. Key Features Introduced in Unsloth Dynamic v2.0 GGUFs
  5. Planning Your Dataset for Unsloth Fine-Tuning
    1. Why Your Dataset Structure Matters
    2. Supported Dataset Formats in Unsloth
    3. Understanding Tokenization and Chat Templates
    4. Applying Chat Templates with Unsloth
    5. ShareGPT to ChatML Conversion
    6. ChatML to ShareGPT Conversion
    7. Multi-Turn Conversations in Unsloth (for Alpaca-style datasets)
  6. Fine-Tuning Qwen2.5-VL-7B on LaTeX-OCR using Unsloth
  7. Insights after Fine-Tuning Qwen2.5-VL-7B using Unsloth
  8. Conclusion
  9. References

What is Unsloth?

Unsloth is a modern Python library designed to speed up and optimize fine-tuning large language models (LLMs) like LLaMA, Mistral, Mixtral, and others. It makes model training and fine-tuning extremely fast, memory-efficient, and easy, especially on limited hardware like a single GPU or even consumer-grade setups.

Fig 2. Unsloth AI Logo [Source]

It’s been gaining attention because it allows users to:

  • Fine-tune 9B parameter models on 24GB VRAM using LoRA 16-bit and just 6.5GB VRAM when using QLoRA 4-bit quantization.
  • Increase training speeds by 2x–5x compared to traditional Hugging Face methods.
  • Reduce memory usage by optimizing model internals.
  • Support techniques like QLoRA (Quantized LoRA), 8-bit and 4-bit training, gradient checkpointing, etc.

Unsloth’s training architecture plays a major role in memory efficiency. It doesn’t just make LoRA memory light, but it also makes it invisible at small scales, and contained even at larger scales. That’s part of what makes Unsloth special: you don’t just save compute, you avoid waste entirely.

In simple words:
Unsloth = Speed + Memory efficiency + Simplicity for fine-tuning LLMs.

Detailed Answer to Why Do We Need Unsloth?

Breaking Down the Fine-Tuning Barrier

In the past, fine-tuning large models required full-precision (FP32) computation. This meant 80GB+ VRAM GPUs and monstrous energy bills. Unsloth removes these barriers by combining several innovations:

  • QLoRA (Quantized Low-Rank Adaptation) enables fine-tuning models in 4-bit precision, cutting memory requirements by 70%-80% without quality loss.

In the fine-tuning section later in this blog post, we will see the near-zero memory usage for LoRA adapters during the 3,000-sample fine-tuning, which isn’t just due to LoRA being lightweight; it’s also because Unsloth is aggressively optimized to manage memory with surgical precision.

  • PEFT (Parameter-Efficient Fine-Tuning) allows you to inject lightweight LoRA adapters into only a few critical layers (like Q, V, and output projections), avoiding the need to retrain billions of parameters.
  • SFTTrainer, a customized trainer, ensures loss computation only happens over assistant outputs, aligning fine-tuning closely with real-world usage.
  • Dynamic Quantization 2.0 refines the GGUF model export by adapting quantization intelligently layer-by-layer, preserving quality while maximizing speed. Discussed in more detail later in the post.

Speed: Faster Fine-Tuning and Inference

  • Traditional training (using Hugging Face, bitsandbytes, DeepSpeed) is often slow because:
    • It uses unoptimized implementations for attention, MLP layers, and memory copies.
    • There’s overhead in applying adapters like LoRA manually.
    • It doesn’t fully utilize Flash Attention 2 or torch.compile.
  • Unsloth addresses this by:
    • Rewriting LLM internals (attention, MLP, normalization) for speed.
    • Using Flash Attention 2 directly.
    • Using PyTorch’s compiler torch.compile for backend graph optimization.
    • Merging QLoRA operations into the model, avoiding bottlenecks.
  • Result
    • 2x–5x faster training.
    • Faster fine-tuning even with large sequence lengths (e.g., 4k–128k tokens).

Memory Efficiency: Train Big Models on Small GPUs

  • Big models like Llama 13B and Mixtral 8x7B easily require 60GB–100GB of VRAM without optimization.
  • Even using 8-bit precision isn’t always enough.
  • QLoRA helps (4-bit quantization), but Hugging Face QLoRA is still memory-heavy.
  • Unsloth’s Advantages:
    • True 4-bit quantization is done more intelligently.
    • Paged optimizers and gradient checkpointing are built in.
    • CPU offloading is optional if memory is still not enough.
    • No redundant tensor copies (Hugging Face models sometimes copy tensors during training).
  • Result
    • You can fine-tune a 7B model on 5GB of VRAM (QLoRA 4-bit quantized).
    • You can fine-tune 13B models on 8GB of VRAM (QLoRA 4-bit quantized).
    • Gemma 3 (27B) finetuning fits with Unsloth in under 22GB of VRAM. It’s also 1.6x faster.

Simplicity for Fine-Tuning

  • Setting up Hugging Face + bitsandbytes + DeepSpeed + PEFT + QLoRA = complex (5+ libraries to sync and versions to match).
  • Unsloth provides:
    • Single API to load quantized models.
    • Single call to add LoRA adapters.
    • Native tokenizer handling.
    • Example scripts to plug and play (a fuller two-call loading sketch follows this list).
model, tokenizer = FastLanguageModel.from_pretrained(...)
model = FastLanguageModel.get_peft_model(...)
  • No manual:
    • LoRA insertion
    • Bitsandbytes handling
    • DeepSpeed configs
    • Special optimizer configs
  • Result
    • 10 minutes to set up instead of hours.
    • Less chance of bugs like layer norms not quantizing properly.
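For context, here is a minimal sketch of that two-call workflow. The model name and LoRA hyperparameters below are illustrative examples, not recommendations from this post:

from unsloth import FastLanguageModel

# Load a 4-bit quantized checkpoint in one call (model name is an example).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
)

# Attach LoRA adapters in a second call (hyperparameters are illustrative).
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    lora_dropout = 0,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)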

Hardware Requirements Lowered

  • Old days:
    • You needed clusters (A100s, H100s).
    • Costs: thousands per month.
  • With Unsloth:
    • An RTX 4090 (24GB) or an A6000 (48GB) is enough for most 7B–13B models.
    • Even MacBooks (M3, M2) with Metal backend support fine-tuning smaller models.
    • AMD GPUs (ROCm) now work too.
  • Result:
    • Fine-tuning costs become hundreds, not thousands.
    • Single-GPU setups (freelancers, startups, students) become powerful enough.

Better Training Techniques

  • Gradient Checkpointing: Save VRAM during backprop.
  • Paged Optimizers: Handle large parameter counts more efficiently.
  • Long Context Windows: 4k–128k tokens natively.
  • Flash Attention 2: Ultra-efficient attention calculation.
  • Mixed precision: Smart bfloat16 and float16 handling.

Access to New Model Types (MoE, Llama2, Mixtral, Gemma)

  • MoE (Mixture of Experts) models like Mixtral need special handling (routing tokens to experts).
  • Traditional libraries are not optimized for MoE fine-tuning yet.
  • Unsloth supports these natively.
Feature | Details
Faster LoRA | LoRA (Low-Rank Adaptation) is a fine-tuning method that’s made even faster and lighter.
Better QLoRA | QLoRA = Quantized LoRA (using 4-bit precision). Unsloth’s QLoRA is up to 2x faster than Hugging Face’s reference implementation.
Memory Optimization | Rewrites attention, MLP (feed-forward), normalization, etc., to be more memory-efficient.
Flash Attention 2 | Uses Flash Attention 2 directly for ultra-efficient attention computation.
PyTorch 2.1+ | Leverages PyTorch compiler modes like torch.compile for even faster speeds.
Multi-Backend Support | Supports CUDA, AMD ROCm, and Apple’s Metal (MPS).

It also supports special features like:

  • Paged optimizers.
  • Gradient checkpointing (saves VRAM).
  • CPU offloading if needed.
Need | Why Unsloth?
Speed | 2x–5x faster training
Memory Efficiency | 30–50% less VRAM usage
Simplicity | Easy 2-line setup
Hardware Requirements | Run 13B models on 24GB VRAM
Cost Saving | 5x–10x cheaper fine-tuning
New Models | MoE, Llama2, Gemma easily supported
Long Sequence | 128k-token training possible

Recent Updates in Unsloth (as of 2024–2025)

  • Full support for Mixtral 8x7B MoE models.
  • Native support for 128k context lengths (insanely long prompts).
  • Automatic True 4-bit training support.
  • Full ROCm (AMD) compatibility.
  • Apple Silicon (MPS) optimization started.

Unsloth’s Support for GGUFs

Unsloth’s Dynamic Quantization 2.0 sets a new standard for post-training model export. Rather than applying a one-size-fits-all quantization (which hurts critical reasoning layers), Unsloth analyzes each layer’s sensitivity to compression, using a calibration dataset ranging from 300K to 1.5M tokens. Unsloth has also integrated robust support for GGUF (the successor to the GGML file format used by llama.cpp), enabling users to:

  • Export fine-tuned models to GGUF: Unsloth provides methods like model.save_pretrained_gguf() and model.push_to_hub_gguf() to save models in GGUF format, facilitating deployment across various platforms (a minimal export sketch follows this list).
  • Utilize Dynamic Quantization: With the introduction of Unsloth Dynamic v2.0, Unsloth employs intelligent layer-specific quantization strategies, enhancing model performance and efficiency in GGUF exports.
  • Ensure Compatibility with Inference Engines: Models exported in GGUF format via Unsloth are compatible with inference engines like llama.cpp, Ollama, and Open WebUI, broadening deployment options.
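As a minimal sketch of those export calls (assuming a fine-tuned model and tokenizer are already loaded; the quantization method names follow Unsloth’s GGUF documentation, and the repo name is a placeholder):

# Save a GGUF copy locally with a 4-bit K-quant.
model.save_pretrained_gguf("model_gguf", tokenizer, quantization_method = "q4_k_m")

# Or push the GGUF directly to the Hugging Face Hub.
model.push_to_hub_gguf(
    "your-username/model-gguf",       # placeholder repo id
    tokenizer,
    quantization_method = "q4_k_m",
    token = "hf_...",                 # your Hugging Face token
)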

Key Features Introduced in Unsloth Dynamic v2.0 GGUFs

  • Revamped Layer Selection + Safetensors Support
    • Unlike static quantization, Dynamic v2.0 chooses quant types per layer intelligently.
    • This means it quantizes every possible layer differently, depending on sensitivity.
    • Uses a smarter method than older static QLoRA or GGUF conversions.
    • Also supports exporting in .safetensors when needed.
  • Dynamic Quantization for All Models (Not Just MoEs)
    • Initially used only for MoE (Mixture of Experts) like DeepSeek-R1.
    • Now supports all model types, including LLaMA, Mistral, Gemma, Mixtral, and more.
    • Confirmed: “Dynamic 2.0 quantization now works on all models (including MoEs)”
  • New Calibration Dataset for GGUFs
    • Calibration uses high-quality data ranging from 300K to 1.5M tokens.
    • Calibrated for chat quality and instruction-following, not just loss metrics.
    • Produces better quantized GGUFs with minimal performance degradation.
  • Model-Specific Quantization Schemes
    • Example: Layers quantized in Gemma-3 differ from those in LLaMA-4.
    • Unsloth uses a tailored quant plan for each architecture.
    • This improves cross-device performance (especially for non-NVIDIA hardware).
  • Support for More GGUF Quant Formats
    • Unsloth Dynamic 2.0 now exports and supports:
      • Q4_K_M, Q4_K_S
      • Q4_NL, Q5_0, Q5_1, Q6_K, Q8_0
      • With nonlinear encoding (IQ4_NL, etc.) for enhanced CPU inference (Apple M chips, ARM).

Before proceeding to the code sections, note that all of the code is available in one place and can be downloaded by clicking the ‘Download Code’ button below.


Planning Your Dataset for Unsloth Fine-Tuning

Why Your Dataset Structure Matters

No matter how powerful your training framework is, a poorly structured dataset can doom your fine-tuning. In Unsloth, datasets need to be cleanly tokenizable, role-tagged (user vs assistant), and aligned with model expectations.

When designing your dataset, think carefully about:

  • Purpose: Are you creating a conversational agent? A code assistant? A domain-specific expert model?
  • Output Style: Should the model output Markdown, plain text, HTML, or programming code?
  • Source of Data: Is your data curated from open sources, synthetically generated via GPT models, or manually annotated?

A well-formed dataset is the foundation upon which your fine-tuning success will rest.

Supported Dataset Formats in Unsloth

Unsloth supports multiple common data formats:

  • Raw Corpus: Large blocks of text — books, articles — used for continued pretraining (CPT).
{
  "text": "Pasta carbonara is a traditional Roman pasta dish..."
}
  • Instruction Format (Alpaca-style): Triplets of instruction, optional input, and output.
{
  "instruction": "Task we want the model to perform.",
  "input": "Optional user query or context.",
  "output": "Expected response or result."
}
  • Conversation Format (ShareGPT style): Multi-turn conversations where each message is role-tagged (user or assistant).
{
  "conversations": [
    {"from": "human", "value": "Can you help me make pasta carbonara?"},
    {"from": "gpt", "value": "Would you like the traditional Roman recipe..."},
    ...
  ]
}
  • RLHF Datasets: Datasets containing ranked preferences between different model outputs; one common layout is sketched after the table below.
Format | Description | Training Type
Raw Corpus | Unstructured raw text from books, articles, etc. | Continued Pretraining (CPT)
Instruct | Instruction + output samples (e.g., Alpaca style) | Supervised Fine-Tuning (SFT)
Conversation | Multi-turn chat between the user and the assistant | Supervised Fine-Tuning (SFT) or Dialogue Modeling
RLHF | Chats with response rankings by humans or scripts | Preference Fine-Tuning (RLHF / DPO)
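For reference, preference data for RLHF/DPO-style training is commonly stored as a prompt with a preferred and a rejected response. The layout below is illustrative; field names vary between datasets, and Unsloth does not mandate a specific schema:

{
  "prompt": "Explain photosynthesis in one sentence.",
  "chosen": "Photosynthesis is the process by which plants convert light into chemical energy.",
  "rejected": "Plants eat sunlight."
}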

Each format suits a different fine-tuning goal: conversational agents, task-specific instruction followers, general language models, or preference-trained models.

Understanding Tokenization and Chat Templates

Tokenization (splitting text into tokens that models can understand) is a subtle but crucial step.
Bad tokenization leads to models confusing user inputs with assistant responses, making them hallucinate or answer incorrectly. Unsloth integrates customized chat templates that structure conversations into clear, unambiguous formats. 

A proper dataset requires a well-defined chat template and consistent tokenization so models can:

  • Understand roles (user vs assistant)
  • Learn context boundaries (system prompts, assistant replies)
  • Predict the next appropriate token accurately
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(tokenizer, chat_template="chatml")

Whether you use ChatML, ShareGPT, Alpaca, OpenChat, or Vicuna, it must be consistent across the dataset. The template governs how the text gets:

  • Segmented
  • Labeled (user/assistant/system)
  • Encoded (with special tokens)

The template you use directly impacts tokenization, and thus the final embeddings and what the model learns.
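To make this concrete, here is a small sketch of what a ChatML-style template renders for a single exchange. The exact special tokens depend on the template you selected above; the output shown is illustrative:

messages = [
    {"role": "user", "content": "Can you help me make pasta carbonara?"},
    {"role": "assistant", "content": "Sure! You'll need eggs, pecorino, guanciale, and spaghetti."},
]
print(tokenizer.apply_chat_template(messages, tokenize = False))

# Illustrative ChatML output:
# <|im_start|>user
# Can you help me make pasta carbonara?<|im_end|>
# <|im_start|>assistant
# Sure! You'll need eggs, pecorino, guanciale, and spaghetti.<|im_end|>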

Applying Chat Templates with Unsloth

Supported Chat Templates in Unsloth

from unsloth.chat_templates import CHAT_TEMPLATES
print(list(CHAT_TEMPLATES.keys()))

The supported chat templates:

['unsloth', 'zephyr', 'chatml', 'mistral', 'llama', 'vicuna', 'vicuna_old', 'vicuna old', 'alpaca', 'gemma', 'gemma_chatml', 'gemma2', 'gemma2_chatml', 'llama-3', 'llama3', 'phi-3', 'phi-35', 'phi-3.5', 'llama-3.1', 'llama-31', 'llama-3.2', 'llama-3.3', 'llama-32', 'llama-33', 'qwen-2.5', 'qwen-25', 'qwen25', 'qwen2.5', 'phi-4', 'gemma-3', 'gemma3']

Using a Chat Template

from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3"  # Adjust as needed
)

Formatting function – This function loops through your dataset, applying the chat template you defined to each sample.

def formatting_prompts_func(examples):
    # No generation prompt is added: the assistant replies are already part of the training text.
    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
             for convo in examples["conversations"]]
    return {"text": texts}

Loading the dataset

# Import and load dataset
from datasets import load_dataset
dataset = load_dataset("repo_name/dataset_name", split = "train")

# Apply the formatting function to your dataset using the map method
dataset = dataset.map(formatting_prompts_func, batched = True,)

If your dataset uses the ShareGPT format with “from”/”value” keys instead of the ChatML “role”/”content” format, you can use the standardize_sharegpt function to convert it first. The revised code will now look as follows:

# Import dataset
from datasets import load_dataset
dataset = load_dataset("mlabonne/FineTome-100k", split = "train")

# Convert your dataset to the "role"/"content" format if necessary
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)

# Apply the formatting function to your dataset using the map method
dataset = dataset.map(formatting_prompts_func, batched = True,)

ShareGPT to ChatML Conversion

The code below converts a list of ShareGPT messages into a clean ChatML-formatted conversation string.

def sharegpt_to_chatml(sharegpt_conversation, system_prompt="You are a helpful assistant.", add_default_system_prompt_if_missing=True):
    """
    Converts a ShareGPT style conversation (list of dicts) into a ChatML string.
    Handles common ShareGPT role keys ('from', 'role') and content keys ('value', 'content').
    Handles common ShareGPT roles ('human', 'user', 'gpt', 'assistant', 'system').
    """
    chatml_parts = []
    has_system_prompt_in_data = False

    for turn in sharegpt_conversation:
        role_key = 'role' if 'role' in turn else 'from'
        if turn.get(role_key) == "system":
            has_system_prompt_in_data = True
            break
            
    if add_default_system_prompt_if_missing and not has_system_prompt_in_data and system_prompt:
        chatml_parts.append(f"<|system|>{system_prompt.strip()}<|end|>")
    
    for turn in sharegpt_conversation:
        role_key = 'role' if 'role' in turn else 'from'
        content_key = 'content' if 'content' in turn else 'value'

        if role_key not in turn or content_key not in turn:
            print(f"Skipping turn due to missing keys: {turn}") 
            continue

        role = turn[role_key]
        content = turn[content_key].strip()
        
        if role in ["user", "human"]:
            chatml_parts.append(f"<|user|>{content}<|end|>")
        elif role in ["assistant", "gpt", "model"]:
            chatml_parts.append(f"<|assistant|>{content}<|end|>")
        elif role == "system":
            chatml_parts.append(f"<|system|>{content}<|end|>")
        else:
            raise ValueError(f"Unknown role: {role} in turn: {turn}")
            
    return "\n".join(chatml_parts)

ChatML to ShareGPT Conversion

The code below extracts user and assistant turns from ChatML text back into a clean ShareGPT-style list.

import re

def chatml_to_sharegpt(
    chatml_text,
    include_system_messages=False,
    role_key_name="role",  # or "from"
    content_key_name="content" # or "value"
):
    """
    Converts a ChatML formatted string back into ShareGPT list format.
    Allows configuration for including system messages and output key names.
    """

    pattern = r"<\|(\w+)\|>(.*?)<\|end\|>"
    matches = re.findall(pattern, chatml_text, flags=re.DOTALL)
    
    sharegpt_conversation = []
    
    for role, content in matches:
        role_standardized = role.lower() 
        
        if role_standardized == "system" and not include_system_messages:
            continue  
        
        sharegpt_conversation.append({
            role_key_name: role_standardized, # Use the standardized role
            content_key_name: content.strip()
        })
    
    return sharegpt_conversation
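Running the two helpers back to back (continuing the example above) round-trips the conversation, with the roles standardized to user/assistant:

chatml_text = sharegpt_to_chatml(sharegpt_example)
print(chatml_to_sharegpt(chatml_text, role_key_name = "from", content_key_name = "value"))

# [{'from': 'user', 'value': 'Can you help me make pasta carbonara?'},
#  {'from': 'assistant', 'value': 'Would you like the traditional Roman recipe?'}]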

Multi-Turn Conversations in Unsloth (for Alpaca-style datasets)

Alpaca format is single-turn – one instruction, one output.
But LLMs like ChatGPT are designed to handle multi-turn conversations.

Unsloth introduces the conversation_extension feature to simulate multi-turn conversation using single-turn Alpaca data.

What it does:

  • Randomly picks N samples from the dataset
  • Merges them into one structured (simulated) conversation
  • Lets the model learn context and flow between turns

Example Before and After:

Before (Single Turn):

{ "instruction": "What is 2+2?", "output": "2 + 2 equals 4." }
{ "instruction": "How are you?", "output": "I'm doing fine!" }
{ "instruction": "Flip a coin.", "output": "I got heads!" }

After (conversation_extension = 3), the three rows are stitched into one simulated multi-turn conversation (shown here in a generic role/content layout for illustration):

{
  "conversations": [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2 + 2 equals 4."},
    {"role": "user", "content": "Flip a coin."},
    {"role": "assistant", "content": "I got heads!"},
    {"role": "user", "content": "How are you?"},
    {"role": "assistant", "content": "I'm doing fine!"}
  ]
}

It becomes a fake but plausible multi-turn chat, which significantly improves SFT (Supervised Fine-Tuning) quality for dialog.

How to Use It Practically

  • Set conversation_extension = N, where N = number of rows to stitch into one conversation
  • Set output_column_name = name of the output column, usually "output" in Alpaca-style data (see the sketch below).
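In code, this is typically driven through Unsloth’s to_sharegpt helper. The snippet below is a sketch based on Unsloth’s documentation; the dataset name is just an example, and the exact argument names may differ slightly between versions:

from unsloth import to_sharegpt
from datasets import load_dataset

# An Alpaca-style dataset with "instruction", "input", and "output" columns (example dataset).
dataset = load_dataset("vicgalle/alpaca-gpt4", split = "train")

dataset = to_sharegpt(
    dataset,
    merged_prompt = "{instruction}[[\nYour input is:\n{input}]]",  # optional input wrapped in [[ ]]
    output_column_name = "output",      # column holding the assistant reply
    conversation_extension = 3,         # stitch 3 random rows into one multi-turn chat
)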

Fine-Tuning Qwen2.5-VL-7B on LaTeX-OCR using Unsloth

To demonstrate Unsloth’s multi-modal fine-tuning capabilities in practice, we chose a real-world task that requires both visual and textual understanding: converting mathematical expressions in images to LaTeX.

For this experiment, we are going to use the LaTeX-OCR dataset, which pairs rendered math images with their corresponding LaTeX markup. This makes it an ideal benchmark for evaluating vision-language model performance in structured output generation. In one of our previous blog posts, we showed fine-tuning the Gemma 3 4B model on the same LaTeX-OCR dataset. However, a direct comparison isn’t ideal because the two models differ in parameter count and in how Gemma 3 and Qwen2.5-VL handle their vision and attention layers.

Still, we will look at memory and time statistics from fine-tuning both models on the same training samples and with the same LoRA and SFT configs, which sheds some light on the fine-tuning efficiency you get with Unsloth.

Loading the Qwen2.5-VL-7B Model with Unsloth

from unsloth import FastVisionModel # FastLanguageModel for LLMs
import torch

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit",
    load_in_4bit = True, # Use 4bit to reduce memory use. False for 16bit LoRA.
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
)

This snippet initializes the Qwen2.5-VL-7B-Instruct model using Unsloth’s FastVisionModel, which is designed for vision-language fine-tuning. The model is loaded in 4-bit precision (load_in_4bit=True) to significantly reduce GPU memory usage. Additionally, use_gradient_checkpointing="unsloth" enables memory-efficient backpropagation, allowing longer input sequences without exceeding VRAM limits.

Applying LoRA with Fine-Tuning Control Over Vision & Language Layers

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers     = True, # False if not finetuning vision layers
    finetune_language_layers   = True, # False if not finetuning language layers
    finetune_attention_modules = True, # False if not finetuning attention layers
    finetune_mlp_modules       = True, # False if not finetuning MLP layers

    r = 8,           # The larger, the higher the accuracy, but might overfit
    lora_alpha = 16,  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
    target_modules = "['down_proj', 'o_proj', 'k_proj', 'q_proj', 'gate_proj', 'up_proj', 'v_proj']", 
)

This snippet configures LoRA (Low-Rank Adaptation) for the Qwen2.5-VL model using get_peft_model. It enables fine-tuning of the vision, language, attention, and MLP layers. By setting use_rslora=False, it opts out of rank-stabilized LoRA. This setup ensures full control over which parts of the model are fine-tuned while maintaining memory and performance efficiency.

Formatting Samples into Chat-Like Vision-Language Pairs

from datasets import load_dataset
dataset = load_dataset("unsloth/LaTeX_OCR", split = "train[:3000]")

instruction = "Write the LaTeX representation for this image."
def convert_to_conversation(sample):
    conversation = [
        { "role": "user",
          "content" : [
            {"type" : "text",  "text"  : instruction},
            {"type" : "image", "image" : sample["image"]} ]
        },
        { "role" : "assistant",
          "content" : [
            {"type" : "text",  "text"  : sample["text"]} ]
        },
    ]
    return { "messages" : conversation }
pass

converted_dataset = [convert_to_conversation(sample) for sample in dataset]

This function converts a raw image-text pair into a ChatML-style message format compatible with Unsloth’s vision models. It simulates a user asking the assistant to write the LaTeX representation for the given image.

Performing Inference with the Fine-Tuned Vision-Language Model

FastVisionModel.for_inference(model) # Enable for inference!

image = dataset[2]["image"]
instruction = "Write the LaTeX representation for this image."

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

The above code block demonstrates how to run inference with the fine-tuned Qwen2.5-VL model. It first enables inference mode using FastVisionModel.for_inference(model), then constructs a multi-modal message that includes an image and a LaTeX instruction. The message is passed through apply_chat_template() to format it appropriately. After tokenizing the inputs, the model generates a response using .generate() with streaming output via TextStreamer. The temperature and min_p settings control sampling diversity and creativity.

Configuring the Trainer for Vision-Language Fine-Tuning

from unsloth import is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

FastVisionModel.for_training(model) # Enable for training!

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    data_collator = UnslothVisionDataCollator(model, tokenizer), 
    train_dataset = converted_dataset,
    args = SFTConfig(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        #max_steps = 30,
        num_train_epochs = 1, 
        learning_rate = 2e-4,
        fp16 = not is_bf16_supported(),
        bf16 = is_bf16_supported(),
        logging_steps = 200,
        save_strategy='steps',
        save_steps=200,
        save_total_limit=2,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",     # For Weights and Biases

        # You MUST put the below items for vision finetuning:
        remove_unused_columns = False,
        dataset_text_field = "",
        dataset_kwargs = {"skip_prepare_dataset": True},
        dataset_num_proc = 4,
        max_seq_length = 2048,
    ),
)

The above snippet sets up the training loop using SFTTrainer from the TRL library, optimized for Unsloth’s vision-language support. It enables training mode with FastVisionModel.for_training(model) and uses UnslothVisionDataCollator, which is required to batch multi-modal inputs (text + image) correctly. The training configuration includes memory-efficient options like the adamw_8bit optimizer, dynamic bfloat16/float16 precision handling, and a small batch size with gradient accumulation. Additional parameters specific to vision fine-tuning, like remove_unused_columns=False and dataset_kwargs, ensure proper image-text pairing during training.

Monitoring GPU Memory Usage Before Training

gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

This code displays current GPU stats to help monitor memory availability before training. It fetches the total GPU memory and the amount already reserved by PyTorch using torch.cuda.
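One note: the memory report in the next snippet reads from a trainer_stats object, which comes from the training call itself, run between these two monitoring snippets (as in the standard Unsloth notebooks):

trainer_stats = trainer.train()  # runs the fine-tuning and returns runtime metrics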

Tracking Final GPU Memory and Training Time Usage

# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

The above code logs detailed resource statistics after training is complete. It calculates total GPU memory used, LoRA-specific memory usage, and their respective percentages of the total GPU capacity. It also prints the total training time in seconds and minutes.

Insights after Fine-Tuning Qwen2.5-VL-7B using Unsloth

Memory Usage before Fine-Tuning

GPU = Tesla T4. Max memory = 14.741 GB.
7.111 GB of memory reserved.

Before training, approximately 7.11 GB of GPU memory was reserved while loading the Qwen2.5-VL model in 4-bit precision. This is impressively low given that vision-language models typically require upwards of 15–20 GB even before training begins.

Memory Usage and Time Consumption During Fine-Tuning

Step	Training Loss
200	0.254500
400	0.094800
600	0.090900
Unsloth: Will smartly offload gradients to save VRAM!
2921.7907 seconds used for training.
48.7 minutes used for training.
Peak reserved memory = 7.111 GB.
Peak reserved memory for training or used memory for lora = 0.0 GB.
Peak reserved memory % of max memory = 48.24 %.
Peak reserved memory for training % of max memory = 0.0 %.

The training metrics demonstrate exactly why Unsloth is uniquely suited for memory- and time-efficient fine-tuning of large models, even on mid-range GPUs like the Tesla T4.

  • The peak reserved memory usage stayed under 50% of total available GPU capacity (14.7 GB), confirming that LoRA with 4-bit quantization significantly reduces the VRAM footprint.
  • Memory used specifically by LoRA modules was negligible (0.0 GB), further validating the parameter-efficient nature of PEFT-based fine-tuning in Unsloth.
  • The training session was completed in just 48.7 minutes (≈2922 seconds), highlighting how Unsloth’s combination of gradient checkpointing, the 8-bit AdamW optimizer, and smart memory allocation offers speed without resource waste.

Despite Qwen2.5-VL being nearly twice the size in parameters (7B vs 4B), Unsloth completed the full fine-tuning in just 51 minutes, whereas Gemma 3 4B took 1 hour and 2 minutes, over 20% longer, using a traditional TRL-based setup.

Even though Gemma 3 is smaller, model size alone does not guarantee faster or more efficient training; tooling matters. Architectural differences in how the two models handle their inner layers do play a role, but a model with roughly twice the parameters, trained with the same configuration, still finished faster and consumed far less memory than the Gemma 3 4B run. Unsloth’s deeply optimized training pipeline clearly outperforms traditional setups, especially in resource-constrained environments or for multi-modal tasks.

Conclusion

Fine-tuning large language models, especially vision-language models, has traditionally been a high-resource, high-friction process. But tools like Unsloth are fundamentally changing that equation. Unsloth simplifies training with:

  • 4-bit QLoRA quantization, dramatically reducing VRAM requirements without sacrificing performance,
  • LoRA adapter injection into both vision and language components for efficient task-specific learning,
  • Intelligent batching and data collators tailored for multi-modal datasets,
  • And low-level optimizations like gradient checkpointing and paged optimizers that keep memory usage under control even during intensive training runs.

In our benchmarking experiment, training completed in under an hour, with memory usage consistently staying under 50% of GPU capacity and negligible overhead from the LoRA adapters. These results highlight that model size alone is not the bottleneck; the tools we use are just as critical. And Unsloth delivers on all fronts: speed, memory efficiency, modularity, and simplicity.

References


