Stable Diffusion 3, released in June 2024 by Stability AI, is the third iteration in the Stable Diffusion family. The Large and Large Turbo variants of the SD3 family are Stability AI's most advanced open text-to-image models yet. These models are highly customizable for their size, run on consumer hardware, and are free for both commercial and non-commercial use under the permissive Stability AI Community License.
Much of their popularity comes from their ability to nearly solve the long-posed Generative Models Trilemma (mode coverage, sample quality, and sampling speed). No generative model has managed to solve all three together, but the SD3 models reduce sampling time while improving sample quality compared to their predecessors.
These models are based on the concept of Latent Diffusion, which works by iteratively denoising a latent representation, derived from a Variational Autoencoder, to generate a target image based on the provided prompt. One of the main advantages of Stable Diffusion 3 is its high level of optimization.
This article walks you through the concepts behind Stable Diffusion 3 and builds an in-depth, intuitive understanding of how it works. An inference script for the Stable Diffusion 3.5 Large Turbo model is provided near the end of the article. Before proceeding any further, let's take a moment to understand diffusion models.
- Defining Diffusion (and Variants) and the Reason for this Nomenclature
- Stable Diffusion
- Inference with SD3.5 Large Turbo, Large and Medium
- Comparison of SD3.5 Large Turbo with Flux Schnell
- Conclusion
- References
Defining Diffusion (and Variants) and the Reason for this Nomenclature
Diffusion in classical mechanics refers to the transfer of particles—or any tangible quantity—from an area of high concentration to an area of low concentration. Inspired by this concept, the authors of Denoising Diffusion Probabilistic Models (DDPM) introduce noise into a target image (which has very low noise) to align it with the noise data distribution. This process is known as Forward Diffusion, as described by the original authors of the diffusion paper. Conversely, when the target data distribution is derived from this noisy image, the process is referred to as Reverse Diffusion.
The image above shows how the forward and reverse diffusion processes work by representing them with probabilistic notation. These notations simply describe the data distributions at each timestamp. As we observe, the image becomes progressively noisier as we follow the forward path $q(x_t \mid x_{t-1})$. This indicates that noise is gradually diffusing into the target distribution. Essentially, our neural network learns the reverse distribution $p_\theta(x_{t-1} \mid x_t)$ between two timestamps during the reverse diffusion process by regressing its mean and variance.
It is not always necessary to regress both quantities; we can instead focus solely on regressing the noise by simplifying the diffusion loss function.
Let’s take a look at how the authors have defined the algorithm for Denoising Diffusion Probabilistic Models (DDPM):
Where the symbols follow the notation of the DDPM paper.
As mentioned, we can see that the model is trying to regress the noise $\epsilon_\theta$ (parameterized by $\theta$) by getting as close to the original noise $\epsilon$ as possible.
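To make the training objective concrete, here is a minimal sketch of a single DDPM training step in PyTorch. The `model` network and the `alphas_cumprod` schedule are placeholders for illustration, not the exact implementation used by any particular checkpoint.

```python
import torch
import torch.nn.functional as F

def ddpm_training_step(model, x0, alphas_cumprod, num_timesteps=1000):
    """One simplified DDPM training step: predict the noise added at a random timestep."""
    batch_size = x0.shape[0]
    # Sample a random timestep for each image in the batch
    t = torch.randint(0, num_timesteps, (batch_size,), device=x0.device)
    # Sample Gaussian noise epsilon
    noise = torch.randn_like(x0)
    # Forward diffusion: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
    alpha_bar_t = alphas_cumprod[t].view(batch_size, 1, 1, 1)
    x_t = alpha_bar_t.sqrt() * x0 + (1 - alpha_bar_t).sqrt() * noise
    # The network regresses the noise; the simplified loss is plain MSE
    noise_pred = model(x_t, t)
    return F.mse_loss(noise_pred, noise)
```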
Now that we have some knowledge of the diffusion process, we can move on to understanding Latent Diffusion Models (LDMs).
Latent Diffusion
Compared to vanilla diffusion, latent diffusion incorporates one key change that makes it more applicable to real-life scenarios. The idea is borrowed from Variational Autoencoders (VAEs).
The latent variable ‘Z’ serves as a low-dimensional representation of the original data. This representation is beneficial for generation, as it captures the essential attributes of the data without any extraneous information. Essentially, this vector encapsulates everything needed to reconstruct the original distribution using VAEs.
When generating our target distribution from noise, we utilize a significant amount of resources; in fact, diffusion (considering both the forward and reverse processes) is a resource-intensive process. To address this issue, we can apply the concept of latent spaces from Variational Autoencoders (VAEs).
Firstly, a VAE encodes the target data into a latent space, resulting in a lower-dimensional representation that is less complex than the original target distribution. Consequently, applying a diffusion framework or model to this latent vector makes the reverse process more efficient and faster.
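As a rough illustration of how much smaller the space the diffusion model has to work in becomes, the sketch below loads only the VAE from an SD3.5 checkpoint using the `diffusers` AutoencoderKL API and round-trips a dummy image through it. The model id is taken from the inference section later in this article, and the snippet omits the scaling/shift factors that the real pipeline applies around these calls.

```python
import torch
from diffusers import AutoencoderKL

# Load only the VAE component from the SD3.5 checkpoint (model id assumed from the inference section)
vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large-turbo", subfolder="vae", torch_dtype=torch.float16
).to("cuda")

# A dummy 1024x1024 RGB image scaled roughly to [-1, 1]
image = torch.randn(1, 3, 1024, 1024, dtype=torch.float16, device="cuda")

with torch.no_grad():
    # Encode: the latent is 8x smaller spatially, so diffusion operates on far fewer elements
    latents = vae.encode(image).latent_dist.sample()
    # Decode: map the latent back to pixel space
    # (the real pipeline also applies the VAE's scaling/shift factors around these calls)
    reconstruction = vae.decode(latents).sample

print(image.shape, "->", latents.shape)
```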
Now, let’s take a look at the architecture of latent diffusion models:
The authors of the DDPM paper employed a U-Net architecture as a neural network for the diffusion process, which has since become a standard for diffusion neural networks. This architecture offers several advantages for our purposes, including multi-scale feature extraction, versatility with conditional inputs, and efficient memory and computational usage.
In the image, we can see a section named Conditioning. It refers to the process by which the generated sample adheres to the instructions provided in the condition (the prompt supplied by the user) rather than being a random prediction from the diffusion model. This conditioning can take various forms, such as semantic maps, text, or images.
With a foundational understanding of diffusion and latent diffusion models, we can now begin our exploration of the work by Stability AI, specifically Stable Diffusion.
Stable Diffusion
Before understanding Stable Diffusion, we need to know what flow (particularly flow matching) means.
Flow Matching
The dotted lines shown in the figure above represent what is known as “flow.” This concept describes how a sample point from the latent space distribution transforms into a component of the target distribution. This may seem a bit ambiguous and difficult to grasp at first, but by the end of this section, a clear understanding of how flow matching works and its relation to Stable Diffusion will be established.
Specifically, the authors of the paper introduced flow matching as a simple and intuitive training objective aimed at regressing onto a target vector field that produces a desired probability path. A probability path refers to the route that a particular distribution (in our case, the noise distribution) takes to reach the state of another distribution. For readers with some knowledge of matrices and determinants, this process resembles a row-wise (or column-wise) matrix transformation.
All in all, we can say that flow matching, like a normalizing flow, transforms a simple distribution (a Gaussian) into a complex one through a series of invertible transformations.
Let’s go further into this concept by introducing some mathematical details.
FM Objective:
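The equation itself is not reproduced above, so here is a reconstruction of the FM objective following the notation of the flow matching paper:

$$
\mathcal{L}_{FM}(\theta) \;=\; \mathbb{E}_{\,t \sim \mathcal{U}[0,1],\; x \sim p_t(x)}\;\big\| v_\theta(x, t) - u_t(x) \big\|^2
$$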
The above expression is nothing but a simple mean squared error between two vector fields. It regresses the learned vector field $v_\theta(x, t)$, parameterized by $\theta$, towards a known target vector field $u_t(x)$. However, the challenge we face is that we do not have any information about what the target vector field $u_t$ looks like. Additionally, the objective as presented is impractical to use because we lack prior knowledge of what the appropriate $p_t$ and $u_t$ should be.
To address this, we need to get acquainted with mathematics so that it becomes more intuitive to understand how we can arrive at the final probability distribution or target image. This is where Conditional Flow Matching (CFM) comes into play.
- The authors have demonstrated that optimizing the Conditional Flow Matching (CFM) objective is equivalent to optimizing the Flow Matching (FM) objective. This effectively means that the gradients of the two loss functions with respect to the model parameters coincide, so minimizing one minimizes the other.
- Since we do not have a clear method for obtaining the marginal vector field $u_t(x)$, we will transform it into a conditional representation, $u_t(x \mid x_1)$, defined per data sample.
- Before proceeding, let’s define a few concepts: a probability path and a vector field are two distinct entities. A vector field assigns a vector to every point in space, while a path specifies a vector for only a subset of points that follow a particular curve (in our case, this will be a Gaussian curve). Assuming each path takes the form of a Gaussian, we can express it according to the equation shown below:
- Now, let's move points along these paths. Mathematically, this means shifting the mean of our data point to a new position and then scaling it by a standard deviation.
- To show mathematically how this point moves over time, we can use the push-forward equation; substituting these values into our original CFM loss, we get something like this:
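The equations referred to above are not reproduced here, so the following is a reconstruction using the flow matching paper's notation: the conditional Gaussian probability path, the flow (push-forward map) that moves a point along that path, and the resulting CFM loss.

$$
p_t(x \mid x_1) = \mathcal{N}\big(x \;\big|\; \mu_t(x_1),\, \sigma_t(x_1)^2 I\big), \qquad
\psi_t(x) = \sigma_t(x_1)\, x + \mu_t(x_1)
$$

$$
\mathcal{L}_{CFM}(\theta) = \mathbb{E}_{\,t,\; q(x_1),\; p(x_0)}\;
\Big\| v_\theta\big(\psi_t(x_0), t\big) - \tfrac{d}{dt}\psi_t(x_0) \Big\|^2
$$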
In the equations above, $\psi_t$ represents the flow, a function that defines how each point is transformed over time.
As the authors explore the optimal mean and standard deviation over time for transforming a point towards the target point, they deduce that the Optimal Transport method, characterized by a linear or straight flow, is the most effective approach to achieving the ultimate goal of reaching the final target distribution.
The final loss function for the Optimal Transport method is formulated as follows:
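A reconstruction of that loss, as given in the flow matching paper for the Optimal Transport path (with $\mu_t(x_1) = t\,x_1$ and $\sigma_t(x_1) = 1 - (1 - \sigma_{\min})\,t$):

$$
\mathcal{L}_{CFM}(\theta) = \mathbb{E}_{\,t,\; q(x_1),\; p(x_0)}\;
\Big\| v_\theta\big(\psi_t(x_0), t\big) - \big(x_1 - (1 - \sigma_{\min})\, x_0\big) \Big\|^2
$$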
As visible in the image shown above, the way a diffusion model travels towards the target distribution and the way flow matching approaches it are quite different. The authors of the paper showed that the Optimal Transport flow matching objective is the best practice for producing the most accurate samples in the most efficient way.
Timestamp Sampling
Finally, we have arrived at the last concept needed to understand the Stable Diffusion 3 model. Stable Diffusion 3 is essentially a refinement of the diffusion model, enhanced with flow matching and timestamp sampling techniques. In the previous discussion, we outlined how flow matching integrates into the diffusion process to ensure the shortest and most optimal path.
Until now, all our timestamps have been equally weighted and derived from a uniform distribution. However, as noted by the authors of the Stable Diffusion 3 paper, the differences between intermediate timestamps tend to be more significant compared to those at the beginning or end. Thus, it is essential to incorporate a weighted loss function that assigns greater importance to the intermediate timestamps.
In their research on timestamp sampling techniques, the authors concluded that Logit-Normal sampling yields the best results.
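A minimal sketch of logit-normal timestamp sampling is shown below; the location `m` and scale `s` values are illustrative defaults, not the exact values used by Stability AI.

```python
import torch

def sample_logit_normal_timestamps(batch_size, m=0.0, s=1.0):
    """Sample timestamps in (0, 1) with heavier weight on intermediate values."""
    u = torch.randn(batch_size) * s + m   # u ~ N(m, s^2)
    return torch.sigmoid(u)               # t = sigmoid(u) follows a logit-normal distribution

t = sample_logit_normal_timestamps(4)
print(t)  # values concentrated around 0.5 rather than spread uniformly over [0, 1]
```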
In summary, we can say that:
Diffusion + Rectified Flow (Flow Matching) + Logit-Normal Sampling leads us to Stable Diffusion 3.0.
Architecture
The architecture is based on the Diffusion Transformer (DiT) model, extended into the MM-DiT design. A VAE encoder produces the noisy latent variables the model operates on, and a VAE decoder upscales the model's output back to pixel space.
Let’s begin with the first layer, where captions (textual prompts) are processed and transformed into embeddings within an embedding space.
Textual Embedding
In Stable Diffusion 3, MM-DiT employs three text encoders: CLIP-G/14, CLIP-L/14, and a T5 XXL encoder with approximately 5 billion parameters. The rationale behind using three text encoders is to enhance performance. This configuration also offers flexibility during inference, allowing us to trade off model performance and memory efficiency by removing the T5 XXL encoder if desired.
We have the option to include the T5 XXL encoder during inference. Removing this encoder helps reduce memory consumption with only a minimal loss in accuracy. When the T5 XXL encoder is removed, its output becomes a tensor of zeros.
A simplified sketch of the idea appears as follows:
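This is a hedged illustration: the `build_prompt_embeddings` helper and the tensor shapes are assumptions for demonstration, not the exact pipeline code. It shows the core idea of padding the CLIP embeddings to the T5 width, replacing a dropped T5 encoder with zeros, and concatenating everything into one prompt sequence.

```python
import torch
import torch.nn.functional as F

def build_prompt_embeddings(clip_l_embeds, clip_g_embeds, t5_embeds=None, t5_dim=4096, t5_seq_len=256):
    """Combine the three text-encoder outputs into one sequence of prompt embeddings."""
    # Concatenate the two CLIP sequence embeddings along the feature dimension
    clip_embeds = torch.cat([clip_l_embeds, clip_g_embeds], dim=-1)
    # Pad the CLIP features up to the T5 feature width so both can share one sequence
    clip_embeds = F.pad(clip_embeds, (0, t5_dim - clip_embeds.shape[-1]))
    if t5_embeds is None:
        # T5 XXL removed: replace its output with zeros (small quality loss, large memory saving)
        batch = clip_embeds.shape[0]
        t5_embeds = torch.zeros(batch, t5_seq_len, t5_dim, dtype=clip_embeds.dtype)
    # Concatenate along the sequence dimension: [CLIP tokens, T5 tokens]
    return torch.cat([clip_embeds, t5_embeds], dim=1)

# Illustrative shapes: CLIP-L gives 768-dim and CLIP-G gives 1280-dim token features
clip_l = torch.randn(1, 77, 768)
clip_g = torch.randn(1, 77, 1280)
prompt_embeds = build_prompt_embeddings(clip_l, clip_g, t5_embeds=None)
print(prompt_embeds.shape)  # torch.Size([1, 333, 4096])
```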
Patchify
As the pooled text representations retain only coarse-grained information about the textual input, the network also needs insights from the sequence representation. To achieve this, we create patch encodings by flattening 2×2 patches and adding positional encodings. We then concatenate this with text encodings, ensuring both have the same dimensionality.
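Here is a minimal sketch of the patchify step, assuming a 2×2 patch size. The embedding dimension, the maximum token count, and the learned positional embedding are placeholders chosen for illustration.

```python
import torch
import torch.nn as nn

class Patchify(nn.Module):
    """Flatten 2x2 latent patches into tokens and add positional encodings."""
    def __init__(self, in_channels=16, patch_size=2, embed_dim=1536, max_tokens=4096):
        super().__init__()
        # A strided convolution flattens each 2x2 patch into one token embedding
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, max_tokens, embed_dim))

    def forward(self, latents):
        x = self.proj(latents)               # (B, D, H/2, W/2)
        x = x.flatten(2).transpose(1, 2)     # (B, num_patches, D)
        return x + self.pos_embed[:, : x.shape[1]]

latents = torch.randn(1, 16, 128, 128)       # e.g. the VAE latent of a 1024x1024 image
tokens = Patchify()(latents)
print(tokens.shape)                          # torch.Size([1, 4096, 1536])
```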
The factor by which we scale down our image (referred to as patch_number) depends on the scale factor, as shown in the image below:
DiT backbone
After the above two procedures, the MM-DiT model follows the DiT backbone and applies a sequence of Modulated Attention and MLP layers.
The upgrade that MM-DiT brings over vanilla DiT can be summarized in one sentence: DiT is the special case of MM-DiT with only one set of shared weights for all modalities.
One question you might ask after seeing the architecture is what do alpha, beta, gamma, etc. represent?
The answer lies in the way diffusion transformers implement attention mechanisms. The attention mechanism they are using is termed as modulated attention as we can see in the simplified architecture image.
The output of each sub-block, including the skip connection, can be expressed roughly as: input + gate * f(input * scale + shift), where f is the attention or MLP operation. The scales, shifts, and gates correspond to parameters such as alpha, beta, and gamma that are visible in the model architecture.
Similar to DiT, MMDiT employs adaptive layer normalization (adaLN) instead of traditional layer normalization (LN). The authors of the DiT paper explain that adaLN is the most compute-friendly option. Additionally, it is the only conditioning mechanism that applies the same function uniformly across all tokens.
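To make the scale/shift/gate idea concrete, here is a minimal sketch of adaLN-style modulation around an attention sub-block. Names such as `shift`, `scale`, and `gate` stand in for the alpha/beta/gamma parameters in the architecture diagram, and the block is a simplification written for this article rather than the exact MM-DiT code.

```python
import torch
import torch.nn as nn

def modulate(x, shift, scale):
    # Scale-and-shift the normalized tokens: x * (1 + scale) + shift
    return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

class ModulatedAttentionBlock(nn.Module):
    def __init__(self, dim=1536, num_heads=24):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)  # adaLN: no learned affine here
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # The conditioning vector (timestep + pooled text) produces shift, scale and gate
        self.adaLN = nn.Sequential(nn.SiLU(), nn.Linear(dim, 3 * dim))

    def forward(self, x, cond):
        shift, scale, gate = self.adaLN(cond).chunk(3, dim=-1)
        h = modulate(self.norm(x), shift, scale)
        attn_out, _ = self.attn(h, h, h)
        # Gated residual connection: x + gate * attention output
        return x + gate.unsqueeze(1) * attn_out

x = torch.randn(1, 4096, 1536)   # image tokens
cond = torch.randn(1, 1536)      # pooled conditioning vector
out = ModulatedAttentionBlock()(x, cond)
print(out.shape)
```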
Lastly, the authors discussed the separate sets of weights they used to understand the complexity of each modality individually. This approach is equivalent to having two transformers, one for each modality. While it might seem that this would lead to no correlation between the two modalities, combining the sequences from both for the joint attention operation addresses this issue.
Additionally, the architecture presented differs from the original MMDiT architecture by incorporating more self-attention blocks. This modification pertains to the Stable Diffusion 3.5 medium (MM-DiTx), which is a 2.5 billion parameter model. Because of this lower parameter count, it is less accurate; thus, the authors introduced additional layers of attention to enhance its ability to understand the relationships and semantic meanings within the sequences.
To get a better idea of how the model predicts noise and produces a sample, let's look at a simplified view of the part of the pipeline where it regresses noise.
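The snippet below is a hedged sketch written for this article in the spirit of the diffusers pipeline; the `denoising_loop` function and its argument names are illustrative rather than the exact source.

```python
import torch

def denoising_loop(transformer, scheduler, latents, prompt_embeds, timesteps, guidance_scale):
    """Simplified view of the denoising loop (illustrative, not the exact pipeline source)."""
    # prompt_embeds holds the unconditional and text embeddings stacked along the batch dimension
    for t in timesteps:
        # Run the unconditional and text-conditional branches in one batched forward pass
        latent_input = torch.cat([latents] * 2)
        noise_pred = transformer(latent_input, timestep=t, encoder_hidden_states=prompt_embeds)

        # Chunk the prediction into unconditional and textual noise estimates
        noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)

        # Classifier-free guidance: push the estimate towards the text condition
        noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

        # The scheduler combines the prediction with latent(t) to produce latent(t-1)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```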
Here we can see that, inside the denoising loop, the transformer outputs a noise prediction which is then chunked into two parts: the predicted unconditional noise and the predicted textual noise. Finally, the latent at timestep t-1 is generated by the scheduler from this prediction and the latent at timestep t.
Inference with SD 3.5 Large Turbo, Large, Medium
In the image above, Stability AI has outlined the compatibility of various models with different GPU architectures.
Next, let’s explore the inference process and take a closer look at the inference script.
Before we proceed, let's have a look at GPU consumption during the inference process. We have set up an instance with one A6000 GPU, which has a maximum memory capacity of 48 GB.
The following are the dependencies (such as diffusers) that we need to install for the Stable Diffusion 3.5 Large Turbo model.
!pip install -U diffusers
!pip install transformers
!pip install accelerate
!pip install sentencepiece
!pip install protobuf
Next, we proceed to import the necessary functions and create a pipeline for our Stable Diffusion 3.5 Large Turbo Model, which only requires the Model Id and Torch Dtype as arguments.
import torch
from diffusers import StableDiffusion3Pipeline
Now, let’s create an instance of Pipeline and call Stable Diffusion 3.5 Large Turbo Model.
pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3.5-large-turbo", torch_dtype = torch.bfloat16)  # load weights in bfloat16 to reduce memory usage
pipe = pipe.to("cuda")  # move the pipeline to the GPU
A manual seed is defined to ensure that we get the same sample every time we run the pipeline. Two other arguments control the generation:
- The number of inference steps: how many intermediate latents are generated before reaching the final output.
- The Guidance Scale: The degree to which we want our model to adhere to the prompt. A value of 0 means the model will generate a sample based on the prompt but will be flexible in its generation, thereby not strictly adhering to it.
Finally, let's use the pipeline instance to create our own synthesized image of a capybara.
torch.manual_seed(1)  # fix the seed so the same image is produced on every run
image = pipe("cute capybara holding a card on which 'HELLO WORLD' is written", num_inference_steps = 30, guidance_scale = 0.0).images[0]
image.save("capybara.png")  # write the generated image to disk
Comparison of SD3.5 Large Turbo with Flux Schnell
Model similarities and differences overview
- Flux Schnell model is a 12B parameter model whereas SD3.5 large-turbo is an 8.1B parameter model.
- Typically, the number of inference steps for Flux Schnell ranges between 3-6 and for SD3.5 large-turbo it’s fixed to 20.
- The Guidance Scale for both models is set to 0. However, the Flux Schnell model does not explicitly require a guidance scale parameter in its pipeline.
Conclusion
This article walked through a wide variety of concepts, such as latent diffusion and flow matching, covering the final mathematical loss equations and explaining the code blocks in detail. At the beginning, we also discussed the generative models trilemma and how Stable Diffusion models tackle this trio of problems.
To summarize this article, we covered:
- Diffusion and its training and sampling algorithms
- Latent Diffusion Models and their Architecture
- Probability Path and Flow Matching
- Most efficient path for diffusion models to follow
- Stable Diffusion models and their architecture
- Inference code blocks
- Sample comparison between Flux Schnell and SD3.5 Large Turbo
Generative AI has opened up many research opportunities, especially for diffusion models, whose practical applications keep increasing day by day, ranging from commercial content creation to artistic innovation.
Happy Learning 🙂
References
- Stability AI
- DDPM: Theory to Implementation
- SD3.5 source code
- Scaling Rectified Flow Transformers
- Flow Matching Paper Explanation
- Flow Based Deep Generative Models
- Introduction to Flow Matching
- Review: Adaptive Layer Normalization
- Model Quantization
- Hugging Face Documentation: Pipeline
- Flow Matching in Latent Space
- The Latent: Code The Math