Fine-tuning Stable Diffusion 3.5: UI images

Recently, the interest in fine-tuning Stable Diffusion models has surged among AI enthusiasts and professionals, driven by the need to incorporate these models into specific requirements. This article walks you through various aspects of fine-tuning a diffusion model on a flat-UI image dataset, providing a comprehensive guide for theoretical insights and practical implementation.


This resource is ideal for AI professionals working on image-generation diffusion models and for AI enthusiasts who want hands-on experience without feeling left behind on the programming front.
This article covers techniques like DreamBooth and LoRA, along with detailed scripts that walk through the flow of a fine-tuning pipeline. It equips readers with the knowledge and tools to customize diffusion models effectively for specialized applications.

  1. Dataset Preparation
  2. Dreambooth
  3. LoRA
  4. Various Fine-Tuning Tools
  5. Configuration File
  6. Fine-tuning
  7. Inferencing Results
  8. Improvements
  9. Generated UI images
  10. Key Takeaways
  11. Conclusion
  12. References

Dataset Preparation

Finding a good UI dataset was challenging for a few reasons:

  1. Limited availability.
  2. Inconsistent prompts.
  3. Some datasets contain images of individual icons only, instead of complete UIs.

For the above reasons, we created our own huggingface UI dataset. This dataset consists of 20 rows (we initially experimented with 10 rows) and two columns ("image" and "text"). We obtained images from dribbble.com, which offers exquisite, modern flat UI designs with more than 1000 samples available.

We generated prompts for UI images by calling Google’s Gemini 2.0 Flash Experimental API. 

If you want to go through the script to prepare this dataset, we recommend downloading the code by clicking the link below.

Download Code: To easily follow along with this tutorial, please download the code by clicking on the button below. It's FREE!

You can find this huggingface dataset here to perform your customization and training.

To load this dataset, you might use huggingface’s load_dataset function. This function makes working with various formats, such as parquet, CSV, JSON, and TXT, easier. This method is recommended because the dataset provided is in parquet format, which is advantageous if you use Arrow tools.
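As a minimal sketch, assuming the dataset name used later in our configuration file (bhomik7/flat-UI-dataset-small), loading it could look like this:

from datasets import load_dataset

# Load the parquet-backed UI dataset directly from the huggingface hub
dataset = load_dataset("bhomik7/flat-UI-dataset-small", split="train")

print(dataset)              # two columns: "image" and "text"
print(dataset[0]["text"])   # the prompt generated for the first UI image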

Dreambooth

Fig 1: Dreambooth Architecture

Let us briefly look at the DreamBooth fine-tuning technique and how it is used. 

DreamBooth helps personalize image generation with just a few training images (~3 to 5). Associating a unique identifier with the subject allows for precise and consistent outputs without losing the model's ability to create diverse images.

DreamBooth addresses the drifting issue commonly observed in diffusion models, where fine-tuning can lead to overfitting and compromise the model’s pre-existing knowledge. The authors identified this problem and introduced a class-specific prior preservation loss that encourages diversity and counters image/language drift. 

Sometimes, people struggle to provide unique identifiers linked to a specific object in their prompt. To address this issue, the paper’s authors found that short sequences work very well, typically consisting of two or three characters. You can find an example of how the unique identifiers are provided in the image below:

The above image shows that the unique identifier is just a single character, ‘V,’ enclosed in square brackets. This makes it a unique character for textual models to understand.
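In the diffusers DreamBooth scripts, the identifier prompt and the class-level prompt used for prior preservation are passed as command-line flags. A hedged, illustrative sketch (the prompt strings here are made up for our UI use case) might look like:

--instance_prompt="a [V] flat UI design"
--class_prompt="a flat UI design"
--with_prior_preservation
--prior_loss_weight=1.0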

Having given a brief introduction to DreamBooth, it is time to examine another widely recognized technique: LoRA (Low-Rank Adaptation). This method is typically employed when adapting the model to a specific style, such as an artistic or fantasy style.

To get a more comprehensive view of the Dreambooth fine-tuning technique, visit our blog on Dreambooth using diffusers.

LoRA

Fig 2: LoRA Architecture Simplified

LoRA (Low-Rank Adaptation) is an efficient fine-tuning technique for adapting large models like Stable Diffusion to a new set of data instances. Instead of updating all model parameters, LoRA introduces small trainable matrices into specific layers, such as attention or feedforward layers, while keeping the base model weights frozen.

This approach significantly reduces the number of trainable parameters, making fine-tuning faster and less resource-intensive. LoRA modules can be reused or combined for different tasks or styles. 

While adapting the model using LoRA, you must define a hyperparameter called 'rank,' denoted by 'r' in the above image.

The higher the r value, the closer the adaptation gets to fully fine-tuning the model.

Generally, an r value of 4 can produce results comparable to those obtained with much higher r values, while a higher r value also incurs more computational cost and longer fine-tuning time.
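To make the low-rank idea concrete, here is a minimal PyTorch sketch of the frozen weight plus trainable low-rank update; this is only an illustration, not the diffusers implementation:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # A frozen linear layer plus a trainable low-rank update: y = Wx + scale * B(Ax)
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # keep the pretrained weights frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Example: wrap a 768x768 attention projection with rank r = 4
layer = LoRALinear(nn.Linear(768, 768), r=4)
out = layer(torch.randn(2, 768))                 # only A and B receive gradients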

Results with different r values can be seen in the image below, taken from the original LoRA paper:

Fig 3: LoRA Rank Comparison

Various Fine-Tuning Tools

Tools like SimpleTuner from Bghira, Sd-Scripts from Kohya-ss, OneTrainer from Nerogar, and Diffusers from Huggingface are extensively used for implementing Stable-Diffusion 3.5 LoRA fine-tuning or any diffusion model in general. 

Let's briefly go over the Diffusers library, since it is the one we use.

Diffusers is an open-source library by Huggingface that provides a well-documented framework for working with diffusion models like Stable Diffusion. It includes extensive scripts for LoRA and DreamBooth fine-tuning, which can be launched directly with the accelerate library (covered in the next section). The Diffusers library simplifies configuration management, making it beginner-friendly yet powerful enough for advanced tasks.

Huggingface’s strong community support and continuous updates make Diffusers a go-to choice for many developers and researchers.

Configuration File

We must define several arguments for LoRA fine-tuning a diffusion model to guide it toward generating better sample quality and prompt adherence than the pre-trained model. Some of these arguments are not mandatory and take a default value without impacting the model's performance.

Let us now look at what a configuration file containing mandatory arguments looks like:

accelerate launch train_dreambooth_lora_sd3.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-3.5-medium" \
  --dataset_name="bhomik7/flat-UI-dataset-small" \
  --validation_prompt="Create a modern user interface design for a mobile \
application that focuses on fitness tracking. The layout should include a vibrant \
dashboard displaying key metrics such as steps taken, calories burned, and \
workout summaries. Incorporate interactive elements like buttons for starting \
workouts, viewing progress, and accessing nutrition information. Use a color \
palette that conveys energy and motivation, with clear typography and intuitive navigation." \
  --num_validation_images=5 \
  --validation_epochs=1 \
  --output_dir="sd_3_5m_dreambooth_lora_ft_FULL_RES" \
  --train_text_encoder \
  --rank=4 \
  --resolution=1024 \
  --train_batch_size=1 \
  --num_train_epochs=20 \
  --checkpointing_steps=500 \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-04 \
  --lr_warmup_steps=10000 \
  --report_to="wandb" \
  --mixed_precision="bf16" \
  --push_to_hub \
  --instance_prompt="A beautiful <TOK> UI for music app" \
  --caption_column="text"

The configuration above shows that the command starts with the word 'accelerate.' But what does 'accelerate' mean?

The official huggingface Accelerate documentation states

“It is a library that enables the same pytorch code to run across any distributed configuration by adding just four lines of code! In short, training and inference at scale are simple, efficient, and adaptable.”

For a better understanding, refer to the image comparison below, which shows how Accelerate can be implemented with very few lines of code: 

Fig 4: Code snippet of training a dataloader with accelerate module
Fig 5: Code snippet of training a dataloader without Accelerate module
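If the figures are hard to read, the gist of what Accelerate adds can be captured by a minimal sketch like the following (model, optimizer, and dataloader are assumed to be already defined, and the model is assumed to return an object with a .loss attribute):

from accelerate import Accelerator

accelerator = Accelerator()

# Wrap the existing objects; Accelerate handles device placement and, if configured, distribution
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    optimizer.zero_grad()
    loss = model(**batch).loss
    accelerator.backward(loss)   # replaces the usual loss.backward()
    optimizer.step()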

Such is the simplicity of the Accelerate library!

Coming back to our configuration file, let us dissect each argument provided in the image step-by-step:

Argument | Data Type | Default | Description
pretrained_model_name_or_path | String (required) | None | Path to a pretrained model or a model identifier from huggingface.
dataset_name | String | None | Name of the dataset (on the huggingface hub) containing the training images and captions.
validation_prompt | String | None | Prompt used for validation.
num_validation_images | Integer | 4 | Number of images generated during validation using '--validation_prompt'.
validation_epochs | Integer | 1 | Run a validation step after every '--validation_epochs' epochs.
output_dir | String | sd-model-finetuned-lora | Output directory where checkpoints and predictions will be written.
train_text_encoder | Flag | Not set | Whether to train the text encoder (CLIP text encoders only). If set, the text encoder should be in float32 precision.
rank | Integer | 4 | Dimension of the LoRA update matrices.
resolution | Integer | 512 | Input images in the train/validation dataset will be resized to this resolution.
train_batch_size | Integer | 4 | Batch size for the training dataloader.
num_train_epochs | Integer | 1 | Number of training epochs.
checkpointing_steps | Integer | 500 | Save a checkpoint of the training state every '--checkpointing_steps' updates. These checkpoints can be used as final checkpoints if they are better than the last one and are also suitable for resuming training via '--resume_from_checkpoint'.
gradient_accumulation_steps | Integer | 1 | Number of update steps to accumulate before performing a backward/update pass.
learning_rate | Float | 1e-4 | The initial learning rate (after the potential warmup period).
lr_warmup_steps | Integer | 500 | Number of steps for the warmup in the lr scheduler.
report_to | String | tensorboard | The integration to report results and logs to. Supported platforms are tensorboard (default), wandb, and comet_ml. Use "all" to report to all integrations.
mixed_precision | String | None | Whether to use mixed precision. Choose between 'fp16' and 'bf16'.
push_to_hub | Flag | Not set | If present, push the trained model to the huggingface hub.
instance_prompt | String (required) | None | The prompt with an identifier specifying the instance, e.g., 'photo of a TOK dog', 'in the style of TOK'.
caption_column | String | None | The column of the dataset containing the instance prompt for each image.

These are crucial arguments you’ll need to pass to produce good-quality images. 

Apart from these, you might encounter a few other arguments during fine-tuning, for example: 'max_image_seq_len', 'use_dynamic_shifting', 'use_exponential_sigmas', 'max_shift', 'use_beta_sigmas', 'invert_sigmas', and 'base_shift'. To learn more about these arguments, visit the official repository of the diffusers module.

Fine-tuning

Here comes the interesting part, in which you will get a general idea of how to fine-tune a diffusion model using the diffusers module, together with the configuration file and accelerate module we encountered in the previous section.

All experiments are performed on an Nvidia A6000 GPU on the vast.ai cloud platform.

Before moving further, it is essential to understand the memory requirements for fine-tuning the stable diffusion 3.5 medium model. 

  1. Before Generating Validation Images: 19.6GB out of 49GB, 182W power consumption
  2. During the Generation of Validation Images: 26.4GB out of 49GB, 298W power consumption

Let’s begin by installing the dependencies:

git clone https://github.com/huggingface/diffusers
cd diffusers
pip install -e .

pip install datasets

pip install wandb    # if you are using Weights & Biases for experiment tracking

cd examples/dreambooth
pip install -r requirements_sd3.txt

After this, you must log in to your huggingface account and to whichever experiment tracker you use; in our case, we proceed with wandb.
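Assuming the standard CLIs are installed, logging in typically looks like this:

huggingface-cli login
wandb login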

As all the dependencies are installed, along with logging into huggingface and wandb, we can proceed with initializing the accelerate config. 

Accelerate lets you accept default values for all configuration parameters instead of defining them manually, or you can configure everything explicitly. Here, we will use the default configuration.

accelerate config default

After getting everything set up, we run the configuration script (explained in the previous section) and wait for the model weights to download before training starts.

Once the training begins, you will notice that the progress bar says ‘x’ steps, but we have provided the number of epochs in the configuration. So, what is the difference between steps and epochs, and how do you calculate steps when epochs are given?

Step:  A step is one optimization update. Multiple steps are executed within a single epoch.

Epoch: One complete pass through the entire training dataset.

For Example,

Samples: 20,000 images

Batch Size: 100

1 epoch = 200 (20000/100) steps

But in the configuration we have also passed an argument called gradient_accumulation_steps, which specifies how many batches are processed before the optimizer performs a weight update. Hence, to calculate the effective number of optimizer steps, we proceed as follows:

Effective steps per epoch = (Number of Samples / batch_size) / gradient_accumulation_steps
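Plugging in the numbers from our configuration (20 images, train_batch_size=1, gradient_accumulation_steps=4), one epoch corresponds to (20/1)/4 = 5 optimizer updates, so 20 epochs amount to roughly 100 update steps shown on the progress bar.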

We have gone through the basic details of fine-tuning a stable diffusion model and the essential arguments that must be passed to the configuration file.

It’s time to look at the validation results in various argument settings.

Inferencing Results

Below are some results from the initial iteration of the LoRA fine-tuned Stable Diffusion 3.5 Medium model on the UI image dataset, prepared by collecting images from dribbble.com and converting them into a huggingface dataset.

The dataset used in this iteration is available on huggingface and contains 1000 rows. However, due to poor results, a new dataset was created using images from Dribbble and prompts generated by Gemini 2.0 Flash Experimental.

Configuration file:

Fig 6: Arguments

Inference Memory Requirement:  22.5GB is utilized out of 49GB on an A6000 GPU.
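Before calling the pipeline as shown below, the base model and the fine-tuned LoRA weights have to be loaded. A minimal sketch, assuming the output directory from our configuration holds the LoRA weights, might look like this:

import torch
from diffusers import StableDiffusion3Pipeline

# Load the base SD 3.5 Medium model and attach the fine-tuned LoRA weights
pipeline = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.bfloat16
).to("cuda")
pipeline.load_lora_weights("sd_3_5m_dreambooth_lora_ft_FULL_RES")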

Inference Parameters:

prompt = "A vibrant and energetic music app UI inspired by music festivals. The design incorporates bright colors, playful icons, and dynamic animations. The playback screen features a glowing soundwave visualizer in the background and festival-themed illustrations like stages, lights, and crowds. The playlist view is colorful and engaging, with icons representing different genres and moods."

image = pipeline(prompt=prompt,
                 guidance_scale=10,
                 num_inference_steps=50,
                 height=1024,
                 width=1024,
                 negative_prompt="bad text, gibberish text, distorted image, distorted figure, distorted text",
                 generator=torch.manual_seed(2957138076)).images[0]

Inference Result: 

Fig 7: First Iteration Result

Two major faults can be seen in the inference image above: 

  1. The image is blurry, even though the resolution provided is (1024,1024)
  2. Inconsistent UI: improper icons, buttons, and search bar

Further iterations were made to mitigate these two issues, using a configuration file different from the one used before. We'll talk about these amendments in the next section.

Improvements

The script to create your own huggingface dataset (just as we did for better results in our final iteration) and push it to the huggingface hub is provided with the article's downloadable code, so you can check it out and customize it to your requirements.

The new dataset contains only 20 rows.
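In essence, the script builds an image/text dataset and pushes it to the hub. A minimal sketch, with hypothetical file paths, captions, and repository name, could look like this:

from datasets import Dataset, Image

# Hypothetical image paths and Gemini-generated captions; replace with your own
records = {
    "image": ["ui_samples/music_app.png", "ui_samples/fitness_app.png"],
    "text": [
        "A beautiful <TOK> UI for music app",
        "A beautiful <TOK> UI for fitness tracking app",
    ],
}

ds = Dataset.from_dict(records).cast_column("image", Image())
ds.push_to_hub("your-username/flat-UI-dataset-extended")   # hypothetical repo id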


Configuration file:

accelerate launch train_dreambooth_lora_sd3.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-3.5-medium" \
  --dataset_name="bhomik7/flat-UI-dataset-extended" \
  --validation_prompt="Create a modern user interface design for a mobile application \
that focuses on fitness tracking. The layout should include a vibrant dashboard \
displaying key metrics such as steps taken, calories burned, and workout summaries. \
Incorporate interactive elements like buttons for starting workouts, viewing progress, \
and accessing nutrition information. Use a color palette that conveys energy and \
motivation, with clear typography and intuitive navigation." \
  --num_validation_images=5 \
  --validation_epochs=1 \
  --output_dir="sd_3_5m_dreambooth_lora_ft_HIGH_CONFIG" \
  --train_text_encoder \
  --rank=4 \
  --resolution=1024 \
  --train_batch_size=1 \
  --num_train_epochs=20 \
  --checkpointing_steps=500 \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-04 \
  --lr_warmup_steps=10000 \
  --report_to="wandb" \
  --mixed_precision="bf16" \
  --push_to_hub \
  --instance_prompt="A beautiful <TOK> UI for music app" \
  --caption_column="text"

Changes Implemented: 

  1. Validation prompt improved to produce better validation images.
  2. The --train_text_encoder argument is passed so that the text encoders understand the prompts better.
  3. num_train_epochs reduced to just 20 so that the model adapts strongly (overfits) on the limited dataset of 20 images.
  4. gradient_accumulation_steps increased to 4 so that training is completed in fewer optimizer steps.
  5. lr_warmup_steps set so that instability and divergence early in training are prevented.
  6. The instance prompt works better when the unique identifier is enclosed in angle brackets.
  7. The caption_column argument is provided so the model doesn't take the instance prompt as the prompt for every image in our dataset.

Inference Parameters: The guidance scale is increased slightly to improve the text in the image.

prompt = "A vibrant and energetic music app UI inspired by music festivals. The design incorporates bright colors, playful icons, and dynamic animations. The playback screen features a glowing soundwave visualizer in the background and festival-themed illustrations like stages, lights, and crowds. The playlist view is colorful and engaging, with icons representing different genres and moods."

image = pipeline(prompt=prompt,
                 guidance_scale=10,
                 num_inference_steps=50,
                 height=1024,
                 width=1024,
                 negative_prompt="bad text, gibberish text, distorted image, distorted figure, distorted text",
                 generator=torch.manual_seed(2957138076)).images[0]

Inference Result

Fig 8: Final Iteration Inference Results.

Comparing the above image with the one generated after the first iteration, the quality has clearly improved: there is no blurriness, icon generation is consistent, and the UI layout, such as the search bar and music tabs, is appropriate.

Validation image generated in the final iteration:

Fig 10: Validation Result of final iteration

Although the text might not be fully legible in the validation image, the LoRA fine-tuning has adapted to the style of flat UI images quite well, and the text discrepancies can be mitigated by adjusting inference parameters such as the guidance scale.

Generated UI images

Fig 11: “A modern e-commerce website UI for a grocery delivery service. The homepage features a search bar at the top for quick product search, followed by product categories like ‘Fruits,’ ‘Vegetables,’ and ‘Dairy’ in a grid layout. Product cards include images, names, prices, and an ‘Add to Cart’ button with quantity selectors. Use a clean white background with green and orange accents for a fresh, approachable look. Include a sticky navigation bar and a floating checkout button.”
Fig 12: “A colorful and creative e-commerce web UI for an art supplies store. The homepage features a vibrant hero banner with paint splashes and artistic elements, followed by a grid layout for products like brushes, paints, and sketchbooks. Each product card includes vivid images, prices, and a ‘Buy Now’ button with hover effects. The color palette uses bright, playful tones like yellow, red, and blue. Typography is creative yet readable, with artistic embellishments.”
Fig 13: “A vibrant flat mobile UI for a health and fitness app. The main dashboard should display daily activity tracking with colorful, animated progress rings for steps, calories burned, and workout time. Include motivational stats and achievements with bright accent colors like lime green, orange, and cyan on a clean white background. Use rounded cards for individual stats, clean sans-serif typography, and an intuitive top navigation bar with easy access to workout plans, progress charts, and a meal planner.”
Fig 14: “A trendy flat UI design for a streaming service app. The home screen should feature a sleek horizontal scroll of movie banners with glowing highlights. Use a chic color scheme of black, crimson, and gold for an elegant vibe. The player screen should showcase a minimalistic design with a progress bar at the bottom, a compact action button cluster, and large cover art in the center.”
Fig 15: Prompt: “A playful flat UI design for a kids’ educational app. The design should use bright primary colors like red, yellow, and blue, with soft, rounded elements. Include large, interactive buttons shaped like animals or toys and a home screen that showcases cartoonish illustrations. Add swipeable lesson cards with bold, child-friendly typography and fun sound effects for interactivity.”

Key Takeaways

  1. Dataset Creation: A custom UI dataset was built using Dribbble images and prompts from Google’s Gemini 2.0 API for better fine-tuning.
  2. Techniques: DreamBooth ensures personalized outputs, while LoRA enables efficient style transfer with resource-light fine-tuning.
  3. Tools: Huggingface’s Diffusers library simplifies fine-tuning with configurable parameters like batch size, learning rate, and resolution.
  4. Improvements: Refinements addressed blurry images and inconsistent UI through better prompts, fewer epochs, and optimized training steps.
  5. Results: The model produced high-quality UI designs for apps, achieving clarity, consistency, and style accuracy.

Conclusion 

The article provides a comprehensive overview of fine-tuning Stable Diffusion 3.5, highlighting the importance of dataset preparation, configuration settings, and iterative improvements for generating high-quality, consistent images in specific styles or domains.

References

Diffusers LoRA SD 3.5 fine-tuning script (huggingface/diffusers repository)

Dribbble

Thanks to Adam Lucek for explaining the LoRA fine-tuning script and arguments on his YouTube channel.

Stability AI LoRA fine-tuning tutorial

Google AI Studio

Accelerate documentation


