Zero-shot anomaly detection (ZSAD) is a vital problem in computer vision, particularly in real-world scenarios where labeled anomalies are scarce or unavailable. Traditional vision-language models (VLMs) like CLIP fall short in this task because they are primarily trained for classification based on object semantics, not anomaly characteristics. This gap leads to poor generalization in unseen domains, where anomalies don’t align with known object labels.

AnomalyCLIP bridges this gap by introducing three innovations: object-agnostic prompt learning, local visual refinement via DPAM, and glocal context optimization. It leverages the strengths of CLIP while tailoring its behavior to detect anomalies across various domains, including manufacturing, healthcare, and security.
- Motivation and Challenges
- Why CLIP Alone is Insufficient
- What is AnomalyCLIP?
- Architecture Overview of AnomalyCLIP
- Glocal Context Optimization
- Training and Inference
- Performance Metrics
- Ablation Studies
- Cross-Domain Generalization
- Performance Gain: Object-Agnostic vs. Object-Aware Prompts
- Evaluating AnomalyCLIP on TN3K Dataset
- Insights from Zero-Shot and Fine-Tuned Performance on TN3K
- Why Did Our Fine-Tuned Model Outperform the Paper?
- Related Work
- Key Takeaways
- Conclusion
- References
Motivation and Challenges
ZSAD requires detecting unseen anomalies without any sample from the target domain. The primary challenges include:
- Domain variance: Anomalies vary in appearance across different domains (e.g., cracks on metals vs. tumors in MRIs).
- Semantic bias: Models trained to recognize “cats” and “cars” don’t naturally understand what a “defect” or “lesion” is.
- Fine-grained detection: Many anomalies are small, subtle, and not aligned with known object categories.
- Lack of labeled target data: Supervised anomaly detection requires pixel-wise masks or labels, which are rarely available in real-world applications.
AnomalyCLIP addresses these by shifting the focus from object semantics to generic normality and abnormality patterns.
Why CLIP Alone is Insufficient
CLIP excels at image-text alignment using class-based prompts (e.g., “a photo of a dog”), but it struggles in ZSAD because:
- It relies heavily on class names.
- Its embeddings emphasize object presence, not quality or abnormality.
- Its attention is distributed toward dominant visual tokens, which may not correlate with subtle anomalies.
What is AnomalyCLIP?
AnomalyCLIP is a zero-shot anomaly detection framework that adapts CLIP by:
- Replacing class-specific prompts with object-agnostic prompts (like “a damaged object”).
- Refining the text embeddings using multi-layer token tuning.
- Improving visual attention through Diagonally Prominent Attention Maps (DPAM).
- Training with a combined global and local objective called Glocal Context Optimization.
The result is a system that generalizes well across datasets, transferring from industrial inspection images to medical imaging scans.
Feature | Description |
---|---|
Object-Agnostic Prompt Learning | Learns generic “normal” and “abnormal” prompts instead of relying on class-specific semantics. |
Textual Space Refinement | Incorporates learnable prompt tokens across multiple layers of the CLIP text encoder. |
DPAM (Diagonally Prominent Attention Map) | Enhances local visual attention using modified self-attention mechanisms. |
Glocal Context Optimization | Combines global image-level and local pixel-level anomaly detection losses. |
Single Forward Pass | No need for extra decoders or handcrafted prompts; efficient inference. |
Architecture Overview of AnomalyCLIP
The architecture of AnomalyCLIP modifies CLIP only slightly but strategically:
Object-Agnostic Prompt Templates
- Instead of using object-specific prompts like “a photo of a screw with a crack,” AnomalyCLIP defines two general templates:
- `g_n`: "a normal object"
- `g_a`: "a damaged object"
- These prompts are not tied to specific object names, making them generalizable across domains.
Why this matters: It removes the model’s dependency on object categories, which are not always relevant to anomalies. Instead, the model learns the visual semantics of “normality” and “abnormality.”
Textual Prompt Refinement
- The prompt tokens are not fixed. AnomalyCLIP inserts learnable tokens into the first 9 layers of the CLIP text encoder. These tokens evolve during training.
- Enables deep semantic refinement.
- This helps the model understand prompts not just at the surface level, but deep inside the network’s processing layers. It enables the prompts to become richer and more informative.
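Below is a minimal PyTorch sketch of this idea; the class name, context length, and embedding dimension are illustrative assumptions rather than the repository's exact implementation. Learnable context vectors form the object-agnostic "normal"/"abnormal" templates, and additional learnable tokens are kept for the first text-encoder layers.

import torch
import torch.nn as nn

class ObjectAgnosticPrompts(nn.Module):
    # Illustrative sketch, not the official implementation.
    def __init__(self, ctx_len=12, embed_dim=768, refine_layers=9, tokens_per_layer=4):
        super().__init__()
        # Learnable context vectors shared by all object classes
        self.normal_ctx = nn.Parameter(0.02 * torch.randn(ctx_len, embed_dim))
        self.abnormal_ctx = nn.Parameter(0.02 * torch.randn(ctx_len, embed_dim))
        # Learnable tokens inserted into the first `refine_layers` text-encoder layers
        self.layer_tokens = nn.ParameterList(
            [nn.Parameter(0.02 * torch.randn(tokens_per_layer, embed_dim)) for _ in range(refine_layers)]
        )

    def forward(self, object_emb, damaged_object_emb):
        # object_emb / damaged_object_emb: token embeddings of "object" / "damaged object"
        g_n = torch.cat([self.normal_ctx, object_emb], dim=0)            # "normal" prompt tokens
        g_a = torch.cat([self.abnormal_ctx, damaged_object_emb], dim=0)  # "abnormal" prompt tokens
        return g_n, g_a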
Local Visual Space Enhancement with DPAM
CLIP’s visual encoder naturally attends to a few dominant tokens, often overlooking local anomalies. DPAM replaces the standard query-key self-attention with diagonally prominent attention maps, using one of these strategies:
Attention Type | Description |
---|---|
Q-Q | Query-to-query attention promotes horizontal expansion |
K-K | Key-to-key, vertical spread of focus |
V-V | Value-to-value, diagonal prominence, default in AnomalyCLIP |
- V-V attention helps the model recognize small but significant features (e.g., scratches, lesions) without being distracted by dominant object tokens.
- Promotes diagonally distributed attention to capture fine-grained features.
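To make the V-V idea concrete, here is a simplified sketch of replacing the usual query-key similarity with value-to-value similarity inside a multi-head attention block; this is a hedged illustration, not the repository's exact DPAM code.

import torch

def vv_attention(v, num_heads=8):
    # v: (batch, tokens, dim) value projections from a frozen CLIP attention block
    B, N, D = v.shape
    head_dim = D // num_heads
    h = v.view(B, N, num_heads, head_dim).transpose(1, 2)      # (B, heads, N, head_dim)
    # Value-to-value similarity replaces query-key similarity, keeping the
    # attention map diagonally prominent so each token mostly attends to itself
    attn = (h @ h.transpose(-2, -1)) * head_dim ** -0.5
    attn = attn.softmax(dim=-1)
    out = (attn @ h).transpose(1, 2).reshape(B, N, D)
    return out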
Glocal Context Optimization
To train AnomalyCLIP, the authors propose a dual-loss strategy that supervises both image-level and patch-level alignment between visual and textual features.
Global Loss
- Encourages the model to classify an image as “normal” or “abnormal.”
- Based on the similarity between the entire image embedding and `g_n` / `g_a`.
Local Loss
- Guides the model to detect where an anomaly occurs.
- Uses segmentation masks and calculates similarity at the patch level.
- Applies Focal and Dice loss to improve class imbalance handling.
By combining the two:
- The global loss helps with overall classification.
- The local loss helps with fine-grained segmentation.
- This strategy is referred to as Glocal Optimization.
Component | Role |
---|---|
Global Loss (Image-Level) | Cross-entropy loss that aligns whole-image features with normal/abnormal prompts. |
Local Loss (Pixel-Level) | Segmentation-aware loss using focal and Dice losses to align patch-level features. |
Combined, this enables AnomalyCLIP to localize and classify anomalies effectively.
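A compact sketch of how such a glocal objective can be wired together is given below; the focal and Dice terms are simplified, and all function names and weights are illustrative assumptions rather than the paper's exact code.

import torch
import torch.nn.functional as F

def focal_loss(prob_abnormal, mask, alpha=0.25, gamma=2.0):
    # prob_abnormal, mask: (B, H, W); mask is a binary segmentation ground truth
    p_t = prob_abnormal * mask + (1 - prob_abnormal) * (1 - mask)
    a_t = alpha * mask + (1 - alpha) * (1 - mask)
    return (-a_t * (1 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-6))).mean()

def dice_loss(prob_abnormal, mask, eps=1.0):
    inter = (prob_abnormal * mask).sum(dim=(1, 2))
    union = prob_abnormal.sum(dim=(1, 2)) + mask.sum(dim=(1, 2))
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def glocal_loss(image_logits, image_labels, patch_prob_abnormal, masks, lam=1.0):
    # Global term: image-level cross-entropy against the normal/abnormal prompts
    global_term = F.cross_entropy(image_logits, image_labels)
    # Local term: segmentation-aware losses on patch-level anomaly probabilities
    local_term = focal_loss(patch_prob_abnormal, masks) + dice_loss(patch_prob_abnormal, masks)
    return global_term + lam * local_term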
Training and Inference
Training
- Uses an auxiliary anomaly detection dataset (e.g., MVTec AD or ColonDB).
- Only prompt tokens, DPAM layers, and alignment losses are optimized.
- CLIP’s encoders remain frozen to preserve their generalization ability.
Inference
- Computes cosine similarity between image features and prompt embeddings.
- For pixel-wise output:
  - Generate similarity maps from intermediate layers
  - Average `s_n` and `s_a`, and apply Gaussian smoothing
Output | Description |
---|---|
Anomaly Score | Probability of the image being abnormal, based on similarity with the prompts. |
Anomaly Map | Pixel-wise prediction indicating abnormal regions. |
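In code, this inference step boils down to a softmax over cosine similarities plus smoothing, roughly as sketched below; the layer selection, the 100x logit scale, the square patch grid, and the smoothing sigma are assumptions for illustration, and upsampling to full image resolution is omitted.

import torch
import torch.nn.functional as F
from scipy.ndimage import gaussian_filter

def anomaly_score_and_map(image_feat, patch_feats, g_n, g_a, sigma=4):
    # image_feat: (D,) global image embedding; patch_feats: list of (N, D) per selected layer
    # g_n, g_a: (D,) text embeddings of the normal / abnormal prompts
    text = F.normalize(torch.stack([g_n, g_a]), dim=-1)              # (2, D)
    # Image-level anomaly score: softmax over cosine similarities, take "abnormal" probability
    img = F.normalize(image_feat, dim=-1)
    score = (100 * img @ text.T).softmax(dim=-1)[1].item()
    # Pixel-level map: average per-layer abnormal similarity maps, then Gaussian smoothing
    maps = []
    for feats in patch_feats:
        p = F.normalize(feats, dim=-1)
        s = (100 * p @ text.T).softmax(dim=-1)[:, 1]                 # abnormal probability per patch
        side = int(s.numel() ** 0.5)                                 # assumes a square patch grid
        maps.append(s.reshape(side, side))
    anomaly_map = torch.stack(maps).mean(dim=0).detach().cpu().numpy()
    return score, gaussian_filter(anomaly_map, sigma=sigma)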
Experimental Setup of AnomalyCLIP
Domain | Datasets Used |
---|---|
Industrial | MVTec AD, VisA, MPDD, BTAD, SDD, DAGM, DTD-Synthetic |
Medical | ISIC, CVC-ClinicDB, CVC-ColonDB, Kvasir, Endo, TN3K (Thyroid), HeadCT, BrainMRI, Br35H, COVID-19 |
- Each of these presents different challenges, from surface texture detection to organ lesion segmentation.
- Evaluated using AUROC, Average Precision (AP), and AUPRO.
- Compared against: CLIP, CLIP-AC, WinCLIP, CoOp, VAND.
Performance Metrics
Metric | Description |
---|---|
AUROC | Ability to distinguish between normal and abnormal samples |
AP | Average precision, based on the precision-recall curve |
AUPRO | Area under the per-region overlap curve, for segmentation tasks |
AnomalyCLIP shows SOTA results in nearly all settings, especially when generalizing across domains.
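For reference, image-level AUROC and AP can be computed directly from per-image anomaly scores with scikit-learn, as in this minimal example (the scores and labels below are toy values; AUPRO requires a per-region overlap computation over connected ground-truth components and is omitted here).

from sklearn.metrics import roc_auc_score, average_precision_score

# y_true: 1 for anomalous images, 0 for normal; y_score: predicted anomaly scores
y_true = [0, 0, 1, 1, 1]
y_score = [0.1, 0.3, 0.7, 0.8, 0.4]

print("AUROC:", roc_auc_score(y_true, y_score))
print("AP   :", average_precision_score(y_true, y_score))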
Ablation Studies
Module-wise Performance
Module | Role | Result |
---|---|---|
DPAM (T₁) | Refines local visual semantics | Boosts segmentation performance |
Prompt Learning (T₂) | Learns object-agnostic normal/abnormal prompts | Best gains in cross-domain generalization |
Textual Tuning (T₃) | Multi-layer refinement of the text encoder | Boosts both classification and segmentation by improving semantic clarity |
Context Optimization
Loss Setting | Result |
---|---|
Global Only | Good image-level detection, weak localization. |
Local Only | Good segmentation, weak classification. |
Glocal | Best of both worlds; superior combined performance. |
DPAM Strategy Comparison
Strategy | Observation |
---|---|
Q-Q (CLIPqq) | Good classification, weak segmentation. |
K-K (CLIPkk) | Balanced, but still lower than the default. |
V-V (default) | Best overall performance and consistency. |
Cross-Domain Generalization
From Industrial → Medical
- AnomalyCLIP trained on MVTec AD can generalize to unseen medical domains.
- Significantly outperforms WinCLIP and VAND on datasets like ISIC, COVID-19, and BrainMRI.
With Medical Fine-Tuning (ColonDB)
- Enhances segmentation on HeadCT and BrainMRI.
- Shows limitations on visually different domains (e.g., ISIC vs. ColonDB).
Key Observations:
- Performs strongly across domains even when trained on industrial data.
- Improves significantly when fine-tuned on similar domains (e.g., ColonDB → CVC, Kvasir).
- Slight drop when tested on visually dissimilar targets (e.g., ISIC skin images).
Performance Gain: Object-Agnostic vs. Object-Aware Prompts
Dataset | Image AUROC Gain | Pixel AUROC Gain | AUPRO Gain |
---|---|---|---|
MVTec AD | +0.5 | +0.2 | +0.2 |
VisA | +0.6 | +0.3 | +0.5 |
MPDD | +4.4 | +3.3 | +1.8 |
BTAD | +0.9 | +0.4 | +1.8 |
Why?
Object semantics are not always aligned with anomaly characteristics. Removing class labels helps the model focus purely on the “visual irregularity.”
Evaluating AnomalyCLIP on TN3K Dataset
About TN3K Dataset
The TN3K dataset is a medically oriented, pixel-level anomaly detection dataset curated for thyroid nodule segmentation. Unlike image-level datasets, pixel-level datasets provide detailed segmentation masks, enabling the evaluation of both detection and localization performance.
TN3K falls under the category of pixel-level medical AD datasets, and is therefore fundamentally different from image-level datasets like COVID-19, which only offer classification-level supervision.
Given that TN3K supports pixel-level annotations, our evaluation of AnomalyCLIP on this dataset will rely exclusively on pixel-level metrics such as AUPRO and pixel-level AUROC, which are more relevant for segmentation tasks.
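In practice, pixel-level AUROC boils down to flattening the predicted anomaly maps and the binary ground-truth masks into long vectors, as in the short sketch below; AUPRO additionally measures overlap per connected ground-truth region and is handled by the repository's metrics code.

import numpy as np
from sklearn.metrics import roc_auc_score

def pixel_auroc(anomaly_maps, gt_masks):
    # anomaly_maps, gt_masks: lists of (H, W) arrays; masks are binary {0, 1}
    scores = np.concatenate([m.ravel() for m in anomaly_maps])
    labels = np.concatenate([(g > 0).astype(np.uint8).ravel() for g in gt_masks])
    return roc_auc_score(labels, scores)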
Experimental Design
The evaluation is structured in two phases to measure the effect of domain adaptation and fine-tuning:
Phase 1: Zero-Shot Evaluation
- Use pre-trained AnomalyCLIP checkpoints
- No fine-tuning on TN3K
- Checkpoints trained on:
- MVTec AD (industrial dataset)
- VisA (industrial visual inspection dataset)
Phase 2: Fine-Tuning on TN3K
- Train AnomalyCLIP using TN3K’s segmentation annotations
- Evaluate on TN3K test split using pixel-level metrics
The comparison between these phases will highlight how well AnomalyCLIP generalizes from industrial to medical domains and the extent of gain achieved through domain-specific fine-tuning.
Repository Setup and Configuration
Official Repository Usage
To replicate these experiments, we begin by cloning the official AnomalyCLIP GitHub repository, which contains:
- Implementation files for training and inference
- Pre-trained checkpoint folders for MVTec AD and VisA
- Scripts for evaluation
git clone https://github.com/zqhang/AnomalyCLIP.git
However, this repository uses outdated versions of many dependencies. Thus, a few adjustments are required.
Important Note on Environment
The `requirements.txt` file in the original repo contains deprecated libraries. We recommend:
- Using a base conda environment created with the latest Python version (≥3.10)
- Installing all necessary libraries with the updated `requirements.txt` that we provide (downloadable below)
Additionally, a few libraries must be manually installed via pip:
pip install ftfy regex tabulate
File Modifications for TN3K Integration
To support TN3K as a dataset to be fine-tuned in AnomalyCLIP, several files in the repo must be modified. We’ve made the necessary changes and bundled them for convenience.
What We Need To Do:
- Clone the official AnomalyCLIP repository
- Download the TN3K dataset and extract it into the root directory of the cloned repo
- Click the Download Code button provided below to get all the modified files
- Replace the corresponding files in the repo with the ones from the downloaded bundle (ensure filenames remain the same)
This includes modifications in files such as:
- `/AnomalyCLIP/train.py`
- `/AnomalyCLIP/test.py`
- `/AnomalyCLIP/test.sh` (we will be using `test_before_fine_tuning.sh` and `test_after_fine_tuning.sh`)
- `/AnomalyCLIP/train.sh`
- `/AnomalyCLIP/metrics.py`
- `/AnomalyCLIP/requirements.txt`
- `/AnomalyCLIP/logger.py`
- `/AnomalyCLIP/AnomalyCLIP_lib/model_load.py`
- `/AnomalyCLIP/generate_dataset_json/tn3k.py`
All of the above files can be downloaded from the Download Code button.
How to Run the Evaluation
Once the repo is set up and files replaced, run the following:
Generate the Dataset JSON
cd generate_dataset_json
python tn3k.py
AnomalyCLIP requires a dataset-specific JSON file that defines the structure and category information.
The `if __name__ == '__main__'` block of `generate_dataset_json/tn3k.py` has been modified to point to the local TN3K dataset path as follows:
if __name__ == '__main__':
    runner = ClinicDBSolver(root='/home/shubham/Work/AnomalyCLIP/Thyroid_Dataset/tn3k')
    runner.run()
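For orientation, the generator writes a meta JSON that maps each test image to its mask, class name, and anomaly label, roughly along the lines of the hypothetical entry below; the field names and paths are placeholders for illustration, not necessarily the repo's exact schema.

import json

# Hypothetical example of a single test entry; paths and field names are placeholders
meta = {
    "test": {
        "thyroid": [
            {
                "img_path": "Thyroid_Dataset/tn3k/<test-image>.jpg",   # placeholder path
                "mask_path": "Thyroid_Dataset/tn3k/<test-mask>.jpg",   # placeholder path
                "cls_name": "thyroid",
                "anomaly": 1,   # 1 = contains a nodule, 0 = normal
            }
        ]
    }
}

with open("meta.json", "w") as f:
    json.dump(meta, f, indent=2)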
Run AnomalyCLIP
Once our dataset and JSON files are ready, we can either run AnomalyCLIP in zero-shot inference mode using pre-trained weights or fine-tune it on TN3K.
Zero-Shot Evaluation (Using Pretrained Weights)
Use the preconfigured shell script:
bash test_before_fine_tuning.sh
Make sure `test_before_fine_tuning.sh` is edited to include the correct paths to the pre-trained checkpoints for both the MVTec AD and VisA datasets. This will run AnomalyCLIP in inference mode and evaluate the pixel-level metrics.
Evaluation metric scores when evaluated with the MVTec AD checkpoint
25-07-01 17:06:40.674 - INFO: Logging test...
25-07-01 17:07:04.293 - INFO:
| objects | pixel_auroc | pixel_aupro |
|:----------|--------------:|--------------:|
| thyroid | 63.5 | 46.8 |
| mean | 63.5 | 46.8 |
Evaluation metric scores when evaluated with the VisA checkpoint
25-07-01 17:07:08.398 - INFO: Logging test...
25-07-01 17:07:32.191 - INFO:
| objects | pixel_auroc | pixel_aupro |
|:----------|--------------:|--------------:|
| thyroid | 63.4 | 39.8 |
| mean | 63.4 | 39.8 |
Fine-Tuning AnomalyCLIP on TN3K
To train AnomalyCLIP on TN3K with ground-truth segmentation masks:
bash train.sh
Fine-Tuning Logs
Upon completion, trained weights will be saved to `checkpoints/singlescale_tn3k`. These can be used for a second round of evaluation.
25-07-01 14:42:18.397 - INFO: epoch [1/15], loss:3.6556, image_loss:0.0433
25-07-01 14:46:37.770 - INFO: epoch [2/15], loss:3.3293, image_loss:0.0096
25-07-01 14:50:56.856 - INFO: epoch [3/15], loss:3.2725, image_loss:0.0083
25-07-01 14:55:15.565 - INFO: epoch [4/15], loss:3.2267, image_loss:0.0069
25-07-01 14:59:35.778 - INFO: epoch [5/15], loss:3.1975, image_loss:0.0064
25-07-01 15:03:55.017 - INFO: epoch [6/15], loss:3.1952, image_loss:0.0062
25-07-01 15:08:14.138 - INFO: epoch [7/15], loss:3.1792, image_loss:0.0063
25-07-01 15:12:35.894 - INFO: epoch [8/15], loss:3.1710, image_loss:0.0061
25-07-01 15:16:57.972 - INFO: epoch [9/15], loss:3.1683, image_loss:0.0064
25-07-01 15:21:19.763 - INFO: epoch [10/15], loss:3.1682, image_loss:0.0061
25-07-01 15:25:41.927 - INFO: epoch [11/15], loss:3.1566, image_loss:0.0059
25-07-01 15:30:04.917 - INFO: epoch [12/15], loss:3.1614, image_loss:0.0061
25-07-01 15:34:27.330 - INFO: epoch [13/15], loss:3.1604, image_loss:0.0064
25-07-01 15:38:49.085 - INFO: epoch [14/15], loss:3.1516, image_loss:0.0057
25-07-01 15:43:10.201 - INFO: epoch [15/15], loss:3.1504, image_loss:0.0060
Post-Training Evaluation
Once training is done, point the evaluation script at the newly trained checkpoint and run:
bash test_after_fine_tuning.sh
This step evaluates the fine-tuned model against TN3K’s segmentation ground truth and reports pixel-level AUROC and AUPRO.
Evaluation metric scores when evaluated with the fine-tuned TN3K checkpoint
25-07-01 17:14:28.583 - INFO: Logging test...
25-07-01 17:14:51.843 - INFO:
| objects | pixel_auroc | pixel_aupro |
|:----------|--------------:|--------------:|
| thyroid | 83.2 | 54.9 |
| mean | 83.2 | 54.9 |
This completes the full workflow for training and evaluating AnomalyCLIP on the TN3K dataset within the official repository’s structure.
Insights from Zero-Shot and Fine-Tuned Performance on TN3K
To comprehensively analyze AnomalyCLIP’s behavior on the TN3K thyroid nodule segmentation dataset, we evaluated three configurations using pixel-level AUROC and AUPRO. The results are compared with the numbers reported in the original AnomalyCLIP paper to provide context and highlight performance differences.
Evaluation Setup & Results
Configuration | Pixel AUROC | Pixel AUPRO |
---|---|---|
Zero-Shot (MVTec AD) | 63.5 | 46.8 |
Zero-Shot (VisA) | 63.4 | 39.8 |
Fine-Tuned on TN3K | 83.2 | 54.9 |
Official Paper (AnomalyCLIP) | 79.2 | 47.0 |
Summary of Experiments
- Experiment 1: Evaluation using MVTec AD checkpoint yielded moderate ZSAD scores (AUROC 63.5 / AUPRO 46.8)
- Experiment 2: Slightly lower AUPRO when evaluated using the VisA checkpoint (39.8)
- Experiment 3: Fine-tuning on TN3K achieved significantly better metrics (AUROC 83.2 / AUPRO 54.9)
- Experiment 4: The official paper’s result on TN3K (AUROC 79.2 / AUPRO 47.0) is lower than our fine-tuned model’s
Why Did Our Fine-Tuned Model Outperform the Paper?
While it is rare to outperform official benchmarks using the same architecture, several plausible factors may have contributed to our fine-tuned model’s stronger performance:
Updated Library Versions
Although the AnomalyCLIP repo uses older dependencies, our environment (Python, CUDA, PyTorch) might have introduced backend improvements (e.g., more stable loss behavior, faster convergence).
GPU Stability and Precision Handling
Hardware configurations also affect training stability. Differences in numerical precision or FP16 support could have contributed to improved convergence.
Related Work
AnomalyCLIP outperforms or complements many existing models:
Compared to CLIP-AD, ZOC, and ACR
- These methods require target-specific tuning or focus only on classification.
- AnomalyCLIP offers both classification and segmentation.
Compared to WinCLIP and VAND
- VAND uses projection learning, which weakens semantic alignment.
- AnomalyCLIP uses just two prompts and achieves better performance.
Compared to DenseCLIP and CoOp
- These need an additional decoder or object-specific prompts.
- AnomalyCLIP is fully prompt-based, efficient, and more general.
Model | Weakness | AnomalyCLIP Advantage |
---|---|---|
CLIP-AD, ZOC | Only support classification | Offers segmentation as well |
WinCLIP | Requires manual prompt engineering | Fully learnable with two prompts |
VAND | Projects features but struggles with semantics | Stronger semantic alignment with only two prompts |
CoOp, DenseCLIP | No segmentation, decoder-based | Decoder-free and efficient |
Key Takeaways
- Zero-shot detection from industrial datasets like MVTec AD or VisA provides only moderate transfer to TN3K.
- Fine-tuning specifically on TN3K boosts segmentation performance substantially.
- Our fine-tuned model outperforms the official AnomalyCLIP results — likely due to better dataset alignment, modern environments, or refined training configs.
Conclusion
AnomalyCLIP is a robust, flexible, and accurate framework for zero-shot anomaly detection. It addresses the limitations of existing VLM-based approaches by:
- Removing reliance on object semantics
- Refining prompts within the model’s text encoder
- Enhancing pixel-level attention
- Using glocal optimization for training
Extensive experiments show that AnomalyCLIP:
- Achieves top-tier performance in both classification and segmentation
- Works across domains from manufacturing to radiology
- Requires no handcrafted prompts or retraining for each dataset
AnomalyCLIP is a step forward in scalable, cross-domain anomaly detection using vision-language models.