Fine-Tuning AnomalyCLIP: Class-Agnostic Zero-Shot Anomaly Detection

Zero-shot anomaly detection (ZSAD) is a vital problem in computer vision, particularly in real-world scenarios where labeled anomalies are scarce or unavailable. Traditional vision-language models (VLMs) like CLIP fall short in this task because they are primarily trained for classification based on object semantics, not anomaly characteristics. This gap leads to poor generalization in unseen domains, where anomalies don’t align with known object labels.

AnomalyCLIP bridges this gap by introducing three innovations: object-agnostic prompt learning, local visual refinement via DPAM, and glocal context optimization. It leverages the strengths of CLIP while tailoring its behavior to detect anomalies across various domains, including manufacturing, healthcare, and security.

  1. Motivation and Challenges
  2. Why CLIP Alone is Insufficient
  3. What is AnomalyCLIP?
  4. Architecture Overview of AnomalyCLIP
    1. Object-Agnostic Prompt Templates
    2. Textual Prompt Refinement
    3. Local Visual Space Enhancement with DPAM
  5. Glocal Context Optimization
  6. Training and Inference
    1. Training
    2. Inference
    3. Experimental Setup of AnomalyCLIP
  7. Performance Metrics
  8. Ablation Studies
  9. Cross-Domain Generalization
  10. Performance Gain: Object-Agnostic vs. Object-Aware Prompts
  11. Evaluating AnomalyCLIP on TN3K Dataset
    1. Zero-Shot Evaluation (Using Pretrained Weights)
    2. Fine-Tuning AnomalyCLIP on TN3K
    3. Post-Training Evaluation
  12. Insights from Zero-Shot and Fine-Tuned Performance on TN3K
  13. Why Did Our Fine-Tuned Model Outperform the Paper?
  14. Related Work
  15. Key Takeaways
  16. Conclusion
  17. References

Motivation and Challenges

ZSAD requires detecting unseen anomalies without any sample from the target domain. The primary challenges include:

  • Domain variance: Anomalies vary in appearance across different domains (e.g., cracks on metals vs. tumors in MRIs).
  • Semantic bias: Models trained to recognize “cats” and “cars” don’t naturally understand what a “defect” or “lesion” is.
  • Fine-grained detection: Many anomalies are small, subtle, and not aligned with known object categories.
  • Lack of labeled target data: Supervised anomaly detection requires pixel-wise masks or labels, which are rarely available in real-world applications.

AnomalyCLIP addresses these by shifting the focus from object semantics to generic normality and abnormality patterns.

Why CLIP Alone is Insufficient

CLIP excels at image-text alignment using class-based prompts (e.g., “a photo of a dog”), but it struggles in ZSAD because:

  • It relies heavily on class names.
  • Its embeddings emphasize object presence, not quality or abnormality.
  • Its attention is distributed toward dominant visual tokens, which may not correlate with subtle anomalies.

What is AnomalyCLIP?

AnomalyCLIP is a zero-shot anomaly detection framework that adapts CLIP by:

  • Replacing class-specific prompts with object-agnostic prompts (like “a damaged object”).
  • Refining the text embeddings using multi-layer token tuning.
  • Improving visual attention through Diagonally Prominent Attention Maps (DPAM).
  • Training with a combined global and local objective called Glocal Context Optimization.

The result is a system that generalizes remarkably well across datasets, from industrial inspection images to medical imaging scans.

| Feature | Description |
|:--|:--|
| Object-Agnostic Prompt Learning | Learns generic “normal” and “abnormal” prompts instead of relying on class-specific semantics. |
| Textual Space Refinement | Incorporates learnable prompt tokens across multiple layers of the CLIP text encoder. |
| DPAM (Diagonally Prominent Attention Map) | Enhances local visual attention using modified self-attention mechanisms. |
| Glocal Context Optimization | Combines global image-level and local pixel-level anomaly detection losses. |
| Single Forward Pass | No need for extra decoders or handcrafted prompts; efficient inference. |

Architecture Overview of AnomalyCLIP

Fig 2. Architecture overview of AnomalyCLIP: an auxiliary image flows through the frozen vision encoder with DPAM layers, while object-agnostic text prompts flow through the multi-layer text encoder; global and local similarity scores between the two produce the anomaly maps, supervised by ground truth.

The architecture of AnomalyCLIP modifies CLIP only slightly but strategically:

Object-Agnostic Prompt Templates

  • Instead of using object-specific prompts like “a photo of a screw with a crack,” AnomalyCLIP defines two general templates:
    • g_n: “a normal object”
    • g_a: “a damaged object”
  • These prompts are not tied to specific object names, making them generalizable across domains.

Why this matters: It removes the model’s dependency on object categories, which are not always relevant to anomalies. Instead, the model learns the visual semantics of “normality” and “abnormality.”
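As a concrete illustration (our addition, using the stock openai/CLIP package rather than the repo’s code), this is how the two templates could be embedded; AnomalyCLIP goes further and replaces the fixed words with learnable token embeddings:

import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14@336px", device=device)

prompts = ["a normal object", "a damaged object"]  # g_n and g_a
tokens = clip.tokenize(prompts).to(device)

with torch.no_grad():
    text_features = model.encode_text(tokens)  # (2, d)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Any unit-norm image embedding can now be scored against g_n / g_a
# via cosine similarity to produce an anomaly score.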

Textual Prompt Refinement

  • The prompt tokens are not fixed: AnomalyCLIP inserts learnable tokens into the first 9 layers of the CLIP text encoder, and these tokens evolve during training.
  • This enables deep semantic refinement: the model understands the prompts not just at the surface level but deep inside the network’s processing layers, making them richer and more informative.
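The pattern resembles deep prompt tuning. A sketch of the idea follows; the module names, widths, and token-replacement scheme are illustrative assumptions, not the repo’s exact implementation:

import torch
import torch.nn as nn

class DeepPromptedTextEncoder(nn.Module):
    """Deep text-prompt tuning sketch: learnable tokens are injected into the
    first `depth` transformer blocks of a frozen text encoder. Assumes the
    prompt positions were reserved at the embedding stage."""
    def __init__(self, blocks, width=768, n_ctx=4, depth=9):
        super().__init__()
        self.blocks = blocks  # frozen CLIP text transformer blocks
        self.deep_prompts = nn.ParameterList(
            [nn.Parameter(torch.randn(n_ctx, width) * 0.02) for _ in range(depth)]
        )

    def forward(self, x):  # x: (seq_len, batch, width)
        n_ctx = self.deep_prompts[0].shape[0]
        for i, block in enumerate(self.blocks):
            if i < len(self.deep_prompts):
                # Overwrite the reserved positions with this layer's prompts.
                p = self.deep_prompts[i].unsqueeze(1).expand(-1, x.shape[1], -1)
                x = torch.cat([p, x[n_ctx:]], dim=0)
            x = block(x)
        return x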
Fig 3. Comparison of anomaly localization maps across text prompting strategies on auxiliary data (e.g., MVTec AD) and test data (e.g., VisA, Br35H, ColonDB): CLIP’s original prompts, WinCLIP’s tailored prompts, CoOp’s learnable prompts, and AnomalyCLIP’s object-agnostic prompts. The object-agnostic approach yields the sharpest, most accurate localization.

Local Visual Space Enhancement with DPAM

CLIP’s visual encoder naturally attends to a few key tokens, often overlooking local anomalies. DPAM replaces the standard self-attention with more uniform attention patterns using these strategies:

| Attention Type | Description |
|:--|:--|
| Q-Q | Query-to-query attention; promotes horizontal expansion of focus. |
| K-K | Key-to-key attention; vertical spread of focus. |
| V-V | Value-to-value attention; diagonal prominence. Default in AnomalyCLIP. |
  • V-V attention helps the model recognize small but significant features (e.g., scratches, lesions) without being distracted by dominant object tokens.
  • Promotes diagonally distributed attention to capture fine-grained features.
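As a toy illustration of the V-V idea (our own simplification, not the repo’s actual layer surgery), replacing the usual Q·K similarity with the similarity of the value projections with themselves produces the diagonally prominent attention pattern:

import torch
import torch.nn.functional as F

def vv_attention(x, w_v, w_out, scale=None):
    """x: (batch, tokens, dim); w_v, w_out: (dim, dim) projection weights."""
    v = x @ w_v                                   # value projection
    scale = scale or v.shape[-1] ** -0.5
    attn = F.softmax((v @ v.transpose(-2, -1)) * scale, dim=-1)  # V·Vᵀ scores
    return (attn @ v) @ w_out                     # aggregate values, project out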

Glocal Context Optimization

To train AnomalyCLIP, the authors propose a dual-loss strategy that supervises both image-level and patch-level alignment between visual and textual features.

Global Loss

  • Encourages the model to classify an image as “normal” or “abnormal.”
  • Based on the similarity between the entire image embedding and g_n / g_a.

Local Loss

  • Guides the model to detect where an anomaly occurs.
  • Uses segmentation masks and calculates similarity at the patch level.
  • Applies focal and Dice losses to handle the heavy class imbalance between normal and abnormal pixels.

By combining the two:

  • The global loss helps with overall classification.
  • The local loss helps with fine-grained segmentation.
  • This strategy is referred to as Glocal Optimization.
| Component | Role |
|:--|:--|
| Global Loss (Image-Level) | Cross-entropy loss that aligns whole-image features with normal/abnormal prompts. |
| Local Loss (Pixel-Level) | Segmentation-aware loss using focal and Dice losses to align patch-level features. |

Combined, this enables AnomalyCLIP to localize and classify anomalies effectively.
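A hedged sketch of how such a glocal objective can be assembled (the helper names and the exact focal/Dice formulations below are our stand-ins, not the paper’s verbatim losses):

import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1.0):
    # pred, target: probabilities / binary masks of the same shape
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def focal_loss(pred, target, alpha=0.25, gamma=2.0):
    # pred: probabilities in [0, 1]; target: binary ground truth
    bce = F.binary_cross_entropy(pred, target, reduction="none")
    p_t = pred * target + (1 - pred) * (1 - target)
    return (alpha * (1 - p_t) ** gamma * bce).mean()

def glocal_loss(img_logits, img_labels, pixel_probs, masks, lam=1.0):
    """img_logits: (B, 2) similarity to [g_n, g_a]; img_labels: (B,) 0/1;
    pixel_probs: (B, H, W) abnormal probability per patch; masks: (B, H, W)."""
    global_term = F.cross_entropy(img_logits, img_labels)
    local_term = focal_loss(pixel_probs, masks) + dice_loss(pixel_probs, masks)
    return global_term + lam * local_term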

Training and Inference

Training

  • Uses an auxiliary anomaly detection dataset (e.g., MVTec AD or ColonDB).
  • Only the learnable prompt tokens and the DPAM-related layers are optimized, supervised by the global and local alignment losses.
  • CLIP’s encoders remain frozen to preserve their generalization ability.

Inference

  • Computes cosine similarity between image features and prompt embeddings.
  • For pixel-wise output:
    • Generate similarity maps from intermediate layers
    • Average s_n and s_a, and apply Gaussian smoothing
| Output | Description |
|:--|:--|
| Anomaly Score | Probability of the image being abnormal, based on similarity with the prompts. |
| Anomaly Map | Pixel-wise prediction indicating abnormal regions. |
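The sketch below mirrors this recipe (our own simplification, with assumed tensor names; it keeps only the abnormal branch s_a, and the 518-pixel output size follows the repo’s default input resolution): patch embeddings from several intermediate layers are scored against the prompt embeddings, softmaxed into abnormal probabilities, upsampled, averaged, and Gaussian-smoothed.

import torch
import torch.nn.functional as F
from scipy.ndimage import gaussian_filter

def anomaly_map(patch_feats_per_layer, text_feats, out_size=518, sigma=4):
    """patch_feats_per_layer: list of (N, d) unit-norm patch embeddings;
    text_feats: (2, d) unit-norm embeddings of [g_n, g_a]."""
    maps = []
    for feats in patch_feats_per_layer:
        sims = 100.0 * feats @ text_feats.T        # (N, 2) scaled cosine sims
        probs = F.softmax(sims, dim=-1)[:, 1]      # abnormal probability s_a
        side = int(probs.numel() ** 0.5)           # patch grid is square
        m = probs.reshape(1, 1, side, side)
        maps.append(F.interpolate(m, size=out_size, mode="bilinear"))
    avg = torch.stack(maps).mean(dim=0)[0, 0]      # average across layers
    return gaussian_filter(avg.cpu().numpy(), sigma=sigma)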

Experimental Setup of AnomalyCLIP

| Domain | Datasets Used |
|:--|:--|
| Industrial | MVTec AD, VisA, MPDD, BTAD, SDD, DAGM, DTD-Synthetic |
| Medical | ISIC, CVC-ClinicDB, CVC-ColonDB, Kvasir, Endo, TN3K (Thyroid), HeadCT, BrainMRI, Br35H, COVID-19 |
  • Each of these presents different challenges, from surface texture detection to organ lesion segmentation.
  • Evaluated using AUROC, Average Precision (AP), and AUPRO.
  • Compared against: CLIP, CLIP-AC, WinCLIP, CoOp, VAND.

Performance Metrics

| Metric | Description |
|:--|:--|
| AUROC | Ability to distinguish between normal and abnormal samples. |
| AP | Average precision, based on the precision-recall curve. |
| AUPRO | Area under the per-region overlap curve, for segmentation tasks. |

AnomalyCLIP achieves state-of-the-art (SOTA) results in nearly all settings, especially when generalizing across domains.
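For readers who want to reproduce the image- and pixel-level numbers, here is a small sketch using scikit-learn (AUPRO requires a per-region overlap computation that the repo’s metrics.py implements; it is omitted here):

import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def image_metrics(labels, scores):
    """labels: (N,) 0/1 ground truth; scores: (N,) image-level anomaly scores."""
    return {"AUROC": roc_auc_score(labels, scores),
            "AP": average_precision_score(labels, scores)}

def pixel_auroc(masks, maps):
    """masks: (N, H, W) binary masks; maps: (N, H, W) anomaly maps."""
    return roc_auc_score(np.asarray(masks).ravel(), np.asarray(maps).ravel())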

Ablation Studies

Module-wise Performance

| Module | Role | Result |
|:--|:--|:--|
| DPAM (T₁) | Refines local visual semantics. | Boosts segmentation performance. |
| Prompt Learning (T₂) | Provides significant gain in cross-domain transfer. | Best cross-domain generalization. |
| Textual Tuning (T₃) | Multi-layer refinement of the text encoder. | Boosts both classification and segmentation by improving semantic clarity. |

Context Optimization

| Loss Setting | Result |
|:--|:--|
| Global Only | Good image-level detection, weak localization. |
| Local Only | Good segmentation, weak classification. |
| Glocal | Best of both worlds; superior combined performance. |

DPAM Strategy Comparison

| Strategy | Observation |
|:--|:--|
| Q-Q (CLIPqq) | Good classification, weak segmentation. |
| K-K (CLIPkk) | Balanced, but still below the default. |
| V-V (default) | Best overall performance and consistency. |

Cross-Domain Generalization

From Industrial → Medical

  • AnomalyCLIP trained on MVTec AD can generalize to unseen medical domains.
  • Significantly outperforms WinCLIP and VAND on datasets like ISIC, COVID-19, and BrainMRI.

With Medical Fine-Tuning (ColonDB)

  • Enhances segmentation on HeadCT and BrainMRI.
  • Shows limitations on visually different domains (e.g., ISIC vs. ColonDB).

Key Observations:

  • Performs strongly across domains even when trained on industrial data.
  • Improves significantly when fine-tuned on similar domains (e.g., ColonDB → CVC, Kvasir).
  • Slight drop when tested on visually dissimilar targets (e.g., ISIC skin images).

Performance Gain: Object-Agnostic vs. Object-Aware Prompts

| Dataset | Image AUROC Gain | Pixel AUROC Gain | AUPRO Gain |
|:--|--:|--:|--:|
| MVTec AD | +0.5 | +0.2 | +0.2 |
| VisA | +0.6 | +0.3 | +0.5 |
| MPDD | +4.4 | +3.3 | +1.8 |
| BTAD | +0.9 | +0.4 | +1.8 |

Why?

Object semantics are not always aligned with anomaly characteristics. Removing class labels helps the model focus purely on the “visual irregularity.”

Evaluating AnomalyCLIP on TN3K Dataset

About TN3K Dataset

The TN3K dataset is a medically oriented, pixel-level anomaly detection dataset curated for thyroid nodule segmentation. Unlike image-level datasets, pixel-level datasets provide detailed segmentation masks, enabling the evaluation of both detection and localization performance.

TN3K falls under the category of pixel-level medical AD datasets, and therefore, it is fundamentally different from image-level datasets like COVID-19 or ISIC, which only offer classification-level supervision.

Fig 4. Sample grayscale ultrasound images of thyroid nodules (top) and their corresponding binary segmentation masks (bottom) from the TN3K dataset, illustrating the pixel-wise supervision needed for AUPRO and pixel-level AUROC.

Given that TN3K supports pixel-level annotations, our evaluation of AnomalyCLIP on this dataset will rely exclusively on pixel-level metrics such as AUPRO and pixel-level AUROC, which are more relevant for segmentation tasks.

Experimental Design

The evaluation is structured in two phases to measure the effect of domain adaptation and fine-tuning:

Phase 1: Zero-Shot Evaluation

  • Use pre-trained AnomalyCLIP checkpoints
  • No fine-tuning on TN3K
  • Checkpoints trained on:
    • MVTec AD (industrial dataset)
    • VisA (industrial visual inspection dataset)

Phase 2: Fine-Tuning on TN3K

  • Train AnomalyCLIP using TN3K’s segmentation annotations
  • Evaluate on TN3K test split using pixel-level metrics

The comparison between these phases will highlight how well AnomalyCLIP generalizes from industrial to medical domains and the extent of gain achieved through domain-specific fine-tuning.

Repository Setup and Configuration

Official Repository Usage

To replicate these experiments, we begin by cloning the official AnomalyCLIP GitHub repository, which contains:

  • Implementation files for training and inference
  • Pre-trained checkpoint folders for MVTec AD and VisA
  • Scripts for evaluation
git clone https://github.com/zqhang/AnomalyCLIP.git

However, this repository uses outdated versions of many dependencies. Thus, a few adjustments are required.

Important Note on Environment

The requirements.txt file in the original repo contains deprecated libraries. We recommend:

  • Using a base conda environment created with the latest Python version (≥3.10)
  • Installing all necessary libraries with the updated requirements.txt that we provide (downloadable below)
Download Code: To easily follow along with this tutorial, please download the code by clicking the button below.

Additionally, a few libraries must be manually installed via pip:

pip install ftfy regex tabulate

File Modifications for TN3K Integration

To support TN3K as a dataset to be fine-tuned in AnomalyCLIP, several files in the repo must be modified. We’ve made the necessary changes and bundled them for convenience.

What We Need To Do:

  • Clone the official AnomalyCLIP repository
  • Download the TN3K dataset and extract it into the root directory of the cloned repo
  • Click the Download Code button provided below to get all the modified files
  • Replace the corresponding files in the repo with the ones from the downloaded bundle (ensure filenames remain the same)

This includes modifications in files such as:

  • /AnomalyCLIP/train.py
  • /AnomalyCLIP/test.py
  • /AnomalyCLIP/test.sh (we will be using test_before_fine_tuning.sh and test_after_fine_tuning.sh)
  • /AnomalyCLIP/train.sh
  • /AnomalyCLIP/metrics.py
  • /AnomalyCLIP/requirements.txt
  • /AnomalyCLIP/logger.py
  • /AnomalyCLIP/AnomalyCLIP_lib/model_load.py
  • /AnomalyCLIP/generate_dataset_json/tn3k.py

All of the above files can be downloaded from the Download Code button.

How to Run the Evaluation

Once the repo is set up and files replaced, run the following:

Generate the Dataset JSON

cd generate_dataset_json
python tn3k.py

AnomalyCLIP requires a dataset-specific JSON file that defines the structure and category information.

In tn3k.py, the entry point has been modified to point at the TN3K dataset as follows:

if __name__ == '__main__':
    # Update `root` to the path of the extracted TN3K dataset on your machine.
    runner = ClinicDBSolver(root='/home/shubham/Work/AnomalyCLIP/Thyroid_Dataset/tn3k')
    runner.run()
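Before training, it is worth sanity-checking the generated meta file. A minimal check (our addition), assuming the script writes a meta JSON, e.g., meta.json, into the dataset root as the repo’s other generators do; adjust the path and keys if your version differs:

import json

with open("Thyroid_Dataset/tn3k/meta.json") as f:  # file written by tn3k.py
    meta = json.load(f)

for split, classes in meta.items():          # e.g., "train" / "test"
    for cls, samples in classes.items():     # e.g., "thyroid"
        print(split, cls, len(samples), "samples")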

Run AnomalyCLIP

Once our dataset and JSON files are ready, we can either run AnomalyCLIP in zero-shot inference mode using pre-trained weights or fine-tune it on TN3K.

Zero-Shot Evaluation (Using Pretrained Weights)

Use the preconfigured shell script:

bash test_before_fine_tuning.sh

Make sure test_before_fine_tuning.sh is edited to include the correct paths to the pre-trained checkpoints for both MVTec AD and VisA. This will run AnomalyCLIP in inference mode and report the pixel-level metrics.

Evaluation metrics when using the MVTec AD checkpoint:

25-07-01 17:06:40.674 - INFO: Logging test...
25-07-01 17:07:04.293 - INFO: 
| objects   |   pixel_auroc |   pixel_aupro |
|:----------|--------------:|--------------:|
| thyroid   |          63.5 |          46.8 |
| mean      |          63.5 |          46.8 |

Evaluation metrics when using the VisA checkpoint:

25-07-01 17:07:08.398 - INFO: Logging test...
25-07-01 17:07:32.191 - INFO: 
| objects   |   pixel_auroc |   pixel_aupro |
|:----------|--------------:|--------------:|
| thyroid   |          63.4 |          39.8 |
| mean      |          63.4 |          39.8 |

Fine-Tuning AnomalyCLIP on TN3K

To train AnomalyCLIP on TN3K with ground-truth segmentation masks:

bash train.sh

Fine-Tuning Logs

Upon completion, trained weights will be saved to checkpoints/singlescale_tn3k. These can be used for a second round of evaluation.

25-07-01 14:42:18.397 - INFO: epoch [1/15], loss:3.6556, image_loss:0.0433
25-07-01 14:46:37.770 - INFO: epoch [2/15], loss:3.3293, image_loss:0.0096
25-07-01 14:50:56.856 - INFO: epoch [3/15], loss:3.2725, image_loss:0.0083
25-07-01 14:55:15.565 - INFO: epoch [4/15], loss:3.2267, image_loss:0.0069
25-07-01 14:59:35.778 - INFO: epoch [5/15], loss:3.1975, image_loss:0.0064
25-07-01 15:03:55.017 - INFO: epoch [6/15], loss:3.1952, image_loss:0.0062
25-07-01 15:08:14.138 - INFO: epoch [7/15], loss:3.1792, image_loss:0.0063
25-07-01 15:12:35.894 - INFO: epoch [8/15], loss:3.1710, image_loss:0.0061
25-07-01 15:16:57.972 - INFO: epoch [9/15], loss:3.1683, image_loss:0.0064
25-07-01 15:21:19.763 - INFO: epoch [10/15], loss:3.1682, image_loss:0.0061
25-07-01 15:25:41.927 - INFO: epoch [11/15], loss:3.1566, image_loss:0.0059
25-07-01 15:30:04.917 - INFO: epoch [12/15], loss:3.1614, image_loss:0.0061
25-07-01 15:34:27.330 - INFO: epoch [13/15], loss:3.1604, image_loss:0.0064
25-07-01 15:38:49.085 - INFO: epoch [14/15], loss:3.1516, image_loss:0.0057
25-07-01 15:43:10.201 - INFO: epoch [15/15], loss:3.1504, image_loss:0.0060

Post-Training Evaluation

Once training is done, edit test_after_fine_tuning.sh to reference your newly trained checkpoint and re-run the evaluation:

bash test_after_fine_tuning.sh

This step evaluates the fine-tuned model against TN3K’s segmentation ground truth and reports pixel-level AUROC and AUPRO.

Evaluation metrics when using the fine-tuned TN3K checkpoint:

25-07-01 17:14:28.583 - INFO: Logging test...
25-07-01 17:14:51.843 - INFO: 
| objects   |   pixel_auroc |   pixel_aupro |
|:----------|--------------:|--------------:|
| thyroid   |          83.2 |          54.9 |
| mean      |          83.2 |          54.9 |

This completes the full workflow for training and evaluating AnomalyCLIP on the TN3K dataset within the official repository’s structure.

Insights from Zero-Shot and Fine-Tuned Performance on TN3K

To comprehensively analyze AnomalyCLIP’s behavior on the TN3K thyroid nodule segmentation dataset, we evaluated three different configurations using the pixel-level AUROC and AUPRO metrics. These results are cross-referenced with the original AnomalyCLIP paper to aid interpretation and highlight performance differences.

Evaluation Setup & Results

| Configuration | Pixel AUROC | Pixel AUPRO |
|:--|--:|--:|
| Zero-Shot (MVTec AD) | 63.5 | 46.8 |
| Zero-Shot (VisA) | 63.4 | 39.8 |
| Fine-Tuned on TN3K | 83.2 | 54.9 |
| Official Paper (AnomalyCLIP) | 79.2 | 47.0 |

Summary of Experiments

  • Experiment 1: Evaluation using MVTec AD checkpoint yielded moderate ZSAD scores (AUROC 63.5 / AUPRO 46.8)
  • Experiment 2: Slightly lower AUPRO when evaluated using the VisA checkpoint (39.8)
  • Experiment 3: Fine-tuning on TN3K achieved significantly better metrics (AUROC 83.2 / AUPRO 54.9)
  • Reference: The official paper’s reported result on TN3K is lower than our fine-tuned version (AUROC 79.2 / AUPRO 47.0)

Why Did Our Fine-Tuned Model Outperform the Paper?

While it’s rare to outperform official benchmarks using the same architecture, several plausible factors may have contributed to our fine-tuned model’s stronger performance:

Updated Library Versions

Although the AnomalyCLIP repo uses older dependencies, our environment (Python, CUDA, PyTorch) might have introduced backend improvements (e.g., more stable loss behavior, faster convergence).

GPU Stability and Precision Handling

Hardware configurations also affect training stability. Differences in numerical precision or FP16 support could have contributed to improved convergence.

Related Work

AnomalyCLIP outperforms or complements many existing models:

Compared to CLIP-AD, ZOC, and ACR

  • These methods require target-specific tuning or focus only on classification.
  • AnomalyCLIP offers both classification and segmentation.

Compared to WinCLIP and VAND

  • VAND uses projection learning, which weakens semantic alignment.
  • AnomalyCLIP uses just two prompts and achieves better performance.

Compared to DenseCLIP and CoOp

  • These need an additional decoder or object-specific prompts.
  • AnomalyCLIP is fully prompt-based, efficient, and more general.
| Model | Weakness | AnomalyCLIP Advantage |
|:--|:--|:--|
| CLIP-AD, ZOC | Only support classification. | Offers segmentation as well. |
| WinCLIP | Requires manual prompt engineering. | Fully learnable with two prompts. |
| VAND | Projects features but struggles with semantics. | Fully learnable with two prompts. |
| CoOp, DenseCLIP | No segmentation; decoder-based. | Decoder-free and efficient. |

Key Takeaways

  • Zero-shot detection from industrial datasets like MVTec AD or VisA provides only moderate transfer to TN3K.
  • Fine-tuning specifically on TN3K boosts segmentation performance substantially.
  • Our fine-tuned model outperforms the official AnomalyCLIP results — likely due to better dataset alignment, modern environments, or refined training configs.

Conclusion

AnomalyCLIP is a robust, flexible, and accurate framework for zero-shot anomaly detection. It addresses the limitations of existing VLM-based approaches by:

  • Removing reliance on object semantics
  • Refining prompts within the model’s text encoder
  • Enhancing pixel-level attention
  • Using glocal optimization for training

Extensive experiments show that AnomalyCLIP:

  • Achieves top-tier performance in both classification and segmentation
  • Works across domains from manufacturing to radiology
  • Requires no handcrafted prompts or retraining for each dataset

AnomalyCLIP is a step forward in scalable, cross-domain anomaly detection using vision-language models.

References

  1. Zhou, Q., Pang, G., Tian, Y., He, S., and Chen, J. “AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection.” ICLR 2024. https://arxiv.org/abs/2310.18961
  2. Radford, A., et al. “Learning Transferable Visual Models From Natural Language Supervision.” ICML 2021. https://arxiv.org/abs/2103.00020
  3. Official AnomalyCLIP repository: https://github.com/zqhang/AnomalyCLIP
  4. Gong, H., et al. “Multi-task Learning for Thyroid Nodule Segmentation with Thyroid Region Prior.” ISBI 2021 (TN3K dataset).
