Zero-shot anomaly detection (ZSAD) is a vital problem in computer vision, particularly in real-world scenarios where labeled anomalies are scarce or unavailable. Traditional vision-language models (VLMs) like CLIP fall short in this task because they are primarily trained for classification based on object semantics, not anomaly characteristics. This gap leads to poor generalization in unseen domains, where anomalies don’t align with known object labels.

AnomalyCLIP bridges this gap by introducing three innovations: object-agnostic prompt learning, local visual refinement via DPAM, and glocal context optimization. It leverages the strengths of CLIP while tailoring its behavior to detect anomalies across various domains, including manufacturing, healthcare, and security.
- Motivation and Challenges
- Why CLIP Alone is Insufficient
- What is AnomalyCLIP?
- Architecture Overview of AnomalyCLIP
- Glocal Context Optimization
- Training and Inference
- Performance Metrics
- Ablation Studies
- Cross-Domain Generalization
- Performance Gain: Object-Agnostic vs. Object-Aware Prompts
- Evaluating AnomalyCLIP on TN3K Dataset
- Insights from Zero-Shot and Fine-Tuned Performance on TN3K
- Why Did Our Fine-Tuned Model Outperform the Paper?
- Related Work
- Key Takeaways
- Conclusion
- References
Motivation and Challenges
ZSAD requires detecting unseen anomalies without any sample from the target domain. The primary challenges include:
- Domain variance: Anomalies vary in appearance across different domains (e.g., cracks on metals vs. tumors in MRIs).
- Semantic bias: Models trained to recognize “cats” and “cars” don’t naturally understand what a “defect” or “lesion” is.
- Fine-grained detection: Many anomalies are small, subtle, and not aligned with known object categories.
- Lack of labeled target data: Supervised anomaly detection requires pixel-wise masks or labels, which are rarely available in real-world applications.
AnomalyCLIP addresses these by shifting the focus from object semantics to generic normality and abnormality patterns.
Why CLIP Alone is Insufficient
CLIP excels at image-text alignment using class-based prompts (e.g., “a photo of a dog”), but it struggles in ZSAD because:
- It relies heavily on class names.
- Its embeddings emphasize object presence, not quality or abnormality.
- Its attention is distributed toward dominant visual tokens, which may not correlate with subtle anomalies.
What is AnomalyCLIP?
AnomalyCLIP is a zero-shot anomaly detection framework that adapts CLIP by:
- Replacing class-specific prompts with object-agnostic prompts (like “a damaged object”).
- Refining the text embeddings using multi-layer token tuning.
- Improving visual attention through Diagonally Prominent Attention Maps (DPAM).
- Training with a combined global and local objective called Glocal Context Optimization.
The result is a system that generalizes well across datasets, transferring from industrial inspection images to medical imaging scans.
Feature | Description |
---|---|
Object-Agnostic Prompt Learning | Learns generic “normal” and “abnormal” prompts instead of relying on class-specific semantics. |
Textual Space Refinement | Incorporates learnable prompt tokens across multiple layers of the CLIP text encoder. |
DPAM (Diagonally Prominent Attention Map) | Enhances local visual attention using modified self-attention mechanisms. |
Glocal Context Optimization | Combines global image-level and local pixel-level anomaly detection losses. |
Single Forward Pass | No need for extra decoders or handcrafted prompts; efficient inference. |
Architecture Overview of AnomalyCLIP
The architecture of AnomalyCLIP modifies CLIP only slightly but strategically:
Object-Agnostic Prompt Templates
- Instead of using object-specific prompts like “a photo of a screw with a crack,” AnomalyCLIP defines two general templates:
- `g_n`: "a normal object"
- `g_a`: "a damaged object"
- These prompts are not tied to specific object names, making them generalizable across domains.
Why this matters: It removes the model’s dependency on object categories, which are not always relevant to anomalies. Instead, the model learns the visual semantics of “normality” and “abnormality.”
Textual Prompt Refinement
- The prompt tokens are not fixed. AnomalyCLIP inserts learnable tokens into the first 9 layers of the CLIP text encoder. These tokens evolve during training.
- Enables deep semantic refinement.
- This helps the model understand prompts not just at the surface level, but deep inside the network’s processing layers. It enables the prompts to become richer and more informative.
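Below is a minimal PyTorch sketch of this idea; the class name, context length, and embedding dimension are illustrative assumptions rather than the repository's exact implementation. Learnable context vectors form the object-agnostic "normal"/"abnormal" templates, and additional learnable tokens are kept for the first text-encoder layers.

import torch
import torch.nn as nn

class ObjectAgnosticPrompts(nn.Module):
    # Illustrative sketch, not the official implementation.
    def __init__(self, ctx_len=12, embed_dim=768, refine_layers=9, tokens_per_layer=4):
        super().__init__()
        # Learnable context vectors shared by all object classes
        self.normal_ctx = nn.Parameter(0.02 * torch.randn(ctx_len, embed_dim))
        self.abnormal_ctx = nn.Parameter(0.02 * torch.randn(ctx_len, embed_dim))
        # Learnable tokens inserted into the first `refine_layers` text-encoder layers
        self.layer_tokens = nn.ParameterList(
            [nn.Parameter(0.02 * torch.randn(tokens_per_layer, embed_dim)) for _ in range(refine_layers)]
        )

    def forward(self, object_emb, damaged_object_emb):
        # object_emb / damaged_object_emb: token embeddings of "object" / "damaged object"
        g_n = torch.cat([self.normal_ctx, object_emb], dim=0)            # "normal" prompt tokens
        g_a = torch.cat([self.abnormal_ctx, damaged_object_emb], dim=0)  # "abnormal" prompt tokens
        return g_n, g_a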
Local Visual Space Enhancement with DPAM
CLIP’s visual encoder naturally attends to a few dominant tokens, often overlooking local anomalies. DPAM replaces the standard query-key self-attention with diagonally prominent attention maps, using one of these strategies:
Attention Type | Description |
---|---|
Q-Q | Query-to-query attention promotes horizontal expansion |
K-K | Key-to-key, vertical spread of focus |
V-V | Value-to-value, diagonal prominence, default in AnomalyCLIP |
- V-V attention helps the model recognize small but significant features (e.g., scratches, lesions) without being distracted by dominant object tokens.
- Promotes diagonally distributed attention to capture fine-grained features.
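To make the V-V idea concrete, here is a simplified sketch of replacing the usual query-key similarity with value-to-value similarity inside a multi-head attention block; this is a hedged illustration, not the repository's exact DPAM code.

import torch

def vv_attention(v, num_heads=8):
    # v: (batch, tokens, dim) value projections from a frozen CLIP attention block
    B, N, D = v.shape
    head_dim = D // num_heads
    h = v.view(B, N, num_heads, head_dim).transpose(1, 2)      # (B, heads, N, head_dim)
    # Value-to-value similarity replaces query-key similarity, keeping the
    # attention map diagonally prominent so each token mostly attends to itself
    attn = (h @ h.transpose(-2, -1)) * head_dim ** -0.5
    attn = attn.softmax(dim=-1)
    out = (attn @ h).transpose(1, 2).reshape(B, N, D)
    return out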
Glocal Context Optimization
To train AnomalyCLIP, the authors propose a dual-loss strategy that supervises both image-level and patch-level alignment between visual and textual features.
Global Loss
- Encourages the model to classify an image as “normal” or “abnormal.”
- Based on the similarity between the entire image embedding and `g_n` / `g_a`.
Local Loss
- Guides the model to detect where an anomaly occurs.
- Uses segmentation masks and calculates similarity at the patch level.
- Applies Focal and Dice loss to improve class imbalance handling.
By combining the two:
- The global loss helps with overall classification.
- The local loss helps with fine-grained segmentation.
- This strategy is referred to as Glocal Optimization.
Component | Role |
---|---|
Global Loss (Image-Level) | Cross-entropy loss that aligns whole-image features with normal/abnormal prompts. |
Local Loss (Pixel-Level) | Segmentation-aware loss using focal and Dice losses to align patch-level features. |
Combined, this enables AnomalyCLIP to localize and classify anomalies effectively.
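A compact sketch of how such a glocal objective can be wired together is given below; the focal and Dice terms are simplified, and all function names and weights are illustrative assumptions rather than the paper's exact code.

import torch
import torch.nn.functional as F

def focal_loss(prob_abnormal, mask, alpha=0.25, gamma=2.0):
    # prob_abnormal, mask: (B, H, W); mask is a binary segmentation ground truth
    p_t = prob_abnormal * mask + (1 - prob_abnormal) * (1 - mask)
    a_t = alpha * mask + (1 - alpha) * (1 - mask)
    return (-a_t * (1 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-6))).mean()

def dice_loss(prob_abnormal, mask, eps=1.0):
    inter = (prob_abnormal * mask).sum(dim=(1, 2))
    union = prob_abnormal.sum(dim=(1, 2)) + mask.sum(dim=(1, 2))
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def glocal_loss(image_logits, image_labels, patch_prob_abnormal, masks, lam=1.0):
    # Global term: image-level cross-entropy against the normal/abnormal prompts
    global_term = F.cross_entropy(image_logits, image_labels)
    # Local term: segmentation-aware losses on patch-level anomaly probabilities
    local_term = focal_loss(patch_prob_abnormal, masks) + dice_loss(patch_prob_abnormal, masks)
    return global_term + lam * local_term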
Training and Inference
Training
- Uses an auxiliary anomaly detection dataset (e.g., MVTec AD or ColonDB).
- Only prompt tokens, DPAM layers, and alignment losses are optimized.
- CLIP’s encoders remain frozen to preserve their generalization ability.
Inference
- Computes cosine similarity between image features and prompt embeddings.
- For pixel-wise output:
  - Generate similarity maps from intermediate layers
  - Average `s_n` and `s_a`, and apply Gaussian smoothing
Output | Description |
---|---|
Anomaly Score | Probability of the image being abnormal, based on similarity with the prompts. |
Anomaly Map | Pixel-wise prediction indicating abnormal regions. |
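In code, this inference step boils down to a softmax over cosine similarities plus smoothing, roughly as sketched below; the layer selection, the 100x logit scale, the square patch grid, and the smoothing sigma are assumptions for illustration, and upsampling to full image resolution is omitted.

import torch
import torch.nn.functional as F
from scipy.ndimage import gaussian_filter

def anomaly_score_and_map(image_feat, patch_feats, g_n, g_a, sigma=4):
    # image_feat: (D,) global image embedding; patch_feats: list of (N, D) per selected layer
    # g_n, g_a: (D,) text embeddings of the normal / abnormal prompts
    text = F.normalize(torch.stack([g_n, g_a]), dim=-1)              # (2, D)
    # Image-level anomaly score: softmax over cosine similarities, take "abnormal" probability
    img = F.normalize(image_feat, dim=-1)
    score = (100 * img @ text.T).softmax(dim=-1)[1].item()
    # Pixel-level map: average per-layer abnormal similarity maps, then Gaussian smoothing
    maps = []
    for feats in patch_feats:
        p = F.normalize(feats, dim=-1)
        s = (100 * p @ text.T).softmax(dim=-1)[:, 1]                 # abnormal probability per patch
        side = int(s.numel() ** 0.5)                                 # assumes a square patch grid
        maps.append(s.reshape(side, side))
    anomaly_map = torch.stack(maps).mean(dim=0).detach().cpu().numpy()
    return score, gaussian_filter(anomaly_map, sigma=sigma)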
Experimental Setup of AnomalyCLIP
Domain | Datasets Used |
---|---|
Industrial | MVTec AD, VisA, MPDD, BTAD, SDD, DAGM, DTD-Synthetic |
Medical | ISIC, CVC-ClinicDB, CVC-ColonDB, Kvasir, Endo, TN3K (Thyroid), HeadCT, BrainMRI, Br35H, COVID-19 |
- Each of these presents different challenges, from surface texture detection to organ lesion segmentation.
- Evaluated using AUROC, Average Precision (AP), and AUPRO.
- Compared against: CLIP, CLIP-AC, WinCLIP, CoOp, VAND.
Performance Metrics
Metric | Description |
---|---|
AUROC | Ability to distinguish between normal and abnormal samples |
AP | Average precision, based on the precision-recall curve |
AUPRO | Area under the per-region overlap curve, for segmentation tasks |
AnomalyCLIP shows SOTA results in nearly all settings, especially when generalizing across domains.
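For reference, image-level AUROC and AP can be computed directly from per-image anomaly scores with scikit-learn, as in this minimal example (the scores and labels below are toy values; AUPRO requires a per-region overlap computation over connected ground-truth components and is omitted here).

from sklearn.metrics import roc_auc_score, average_precision_score

# y_true: 1 for anomalous images, 0 for normal; y_score: predicted anomaly scores
y_true = [0, 0, 1, 1, 1]
y_score = [0.1, 0.3, 0.7, 0.8, 0.4]

print("AUROC:", roc_auc_score(y_true, y_score))
print("AP   :", average_precision_score(y_true, y_score))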
Ablation Studies
Module-wise Performance
Module | Role | Result |
---|---|---|
DPAM (T₁) | Refines local visual semantics | Boosts segmentation performance |
Prompt Learning (T₂) | Learns object-agnostic normal/abnormal prompts | Best gains in cross-domain generalization |
Textual Tuning (T₃) | Multi-layer refinement of the text encoder | Boosts both classification and segmentation by improving semantic clarity |
Context Optimization
Loss Setting | Result |
---|---|
Global Only | Good image-level detection, weak localization. |
Local Only | Good segmentation, weak classification. |
Glocal | Best of both worlds; superior combined performance. |
DPAM Strategy Comparison
Strategy | Observation |
---|---|
Q-Q (CLIPqq) | Good classification, weak segmentation. |
K-K (CLIPkk) | Balanced, but still lower than the default. |
V-V (default) | Best overall performance and consistency. |
Cross-Domain Generalization
From Industrial → Medical
- AnomalyCLIP trained on MVTec AD can generalize to unseen medical domains.
- Significantly outperforms WinCLIP and VAND on datasets like ISIC, COVID-19, and BrainMRI.
With Medical Fine-Tuning (ColonDB)
- Enhances segmentation on HeadCT and BrainMRI.
- Shows limitations on visually different domains (e.g., ISIC vs. ColonDB).
Key Observations:
- Performs strongly across domains even when trained on industrial data.
- Improves significantly when fine-tuned on similar domains (e.g., ColonDB → CVC, Kvasir).
- Slight drop when tested on visually dissimilar targets (e.g., ISIC skin images).
Performance Gain: Object-Agnostic vs. Object-Aware Prompts
Dataset | Image AUROC Gain | Pixel AUROC Gain | AUPRO Gain |
---|---|---|---|
MVTec AD | +0.5 | +0.2 | +0.2 |
VisA | +0.6 | +0.3 | +0.5 |
MPDD | +4.4 | +3.3 | +1.8 |
BTAD | +0.9 | +0.4 | +1.8 |
Why?
Object semantics are not always aligned with anomaly characteristics. Removing class labels helps the model focus purely on the “visual irregularity.”
Evaluating AnomalyCLIP on TN3K Dataset
About TN3K Dataset
The TN3K dataset is a medically oriented, pixel-level anomaly detection dataset curated for thyroid nodule segmentation. Unlike image-level datasets, pixel-level datasets provide detailed segmentation masks, enabling the evaluation of both detection and localization performance.
TN3K falls under the category of pixel-level medical AD datasets, and is therefore fundamentally different from image-level datasets like COVID-19, which only offer classification-level supervision.
Given that TN3K supports pixel-level annotations, our evaluation of AnomalyCLIP on this dataset will rely exclusively on pixel-level metrics such as AUPRO and pixel-level AUROC, which are more relevant for segmentation tasks.
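In practice, pixel-level AUROC boils down to flattening the predicted anomaly maps and the binary ground-truth masks into long vectors, as in the short sketch below; AUPRO additionally measures overlap per connected ground-truth region and is handled by the repository's metrics code.

import numpy as np
from sklearn.metrics import roc_auc_score

def pixel_auroc(anomaly_maps, gt_masks):
    # anomaly_maps, gt_masks: lists of (H, W) arrays; masks are binary {0, 1}
    scores = np.concatenate([m.ravel() for m in anomaly_maps])
    labels = np.concatenate([(g > 0).astype(np.uint8).ravel() for g in gt_masks])
    return roc_auc_score(labels, scores)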
Experimental Design
The evaluation is structured in two phases to measure the effect of domain adaptation and fine-tuning:
Phase 1: Zero-Shot Evaluation
- Use pre-trained AnomalyCLIP checkpoints
- No fine-tuning on TN3K
- Checkpoints trained on:
- MVTec AD (industrial dataset)
- VisA (industrial visual inspection dataset)
Phase 2: Fine-Tuning on TN3K
- Train AnomalyCLIP using TN3K’s segmentation annotations
- Evaluate on TN3K test split using pixel-level metrics
The comparison between these phases will highlight how well AnomalyCLIP generalizes from industrial to medical domains and the extent of gain achieved through domain-specific fine-tuning.
Repository Setup and Configuration
Official Repository Usage
To replicate these experiments, we begin by cloning the official AnomalyCLIP GitHub repository, which contains:
- Implementation files for training and inference
- Pre-trained checkpoint folders for MVTec AD and VisA
- Scripts for evaluation
git clone https://github.com/zqhang/AnomalyCLIP.git
However, this repository uses outdated versions of many dependencies. Thus, a few adjustments are required.
Important Note on Environment
The `requirements.txt` file in the original repo contains deprecated libraries. We recommend:
- Using a base conda environment created with the latest Python version (≥3.10)
- Installing all necessary libraries with the updated `requirements.txt` that we provide (downloadable below)
Additionally, a few libraries must be manually installed via pip:
pip install ftfy regex tabulate
File Modifications for TN3K Integration
To support TN3K as a dataset to be fine-tuned in AnomalyCLIP, several files in the repo must be modified. We’ve made the necessary changes and bundled them for convenience.
What We Need To Do:
- Clone the official AnomalyCLIP repository
- Download the TN3K dataset and extract it into the root directory of the cloned repo
- Click the Download Code button provided below to get all the modified files
- Replace the corresponding files in the repo with the ones from the downloaded bundle (ensure filenames remain the same)
This includes modifications in files such as:
- `/AnomalyCLIP/train.py`
- `/AnomalyCLIP/test.py`
- `/AnomalyCLIP/test.sh` (we will be using `test_before_fine_tuning.sh` and `test_after_fine_tuning.sh`)
- `/AnomalyCLIP/train.sh`
- `/AnomalyCLIP/metrics.py`
- `/AnomalyCLIP/requirements.txt`
- `/AnomalyCLIP/logger.py`
- `/AnomalyCLIP/AnomalyCLIP_lib/model_load.py`
- `/AnomalyCLIP/generate_dataset_json/tn3k.py`
All of the above files can be downloaded from the Download Code button.
How to Run the Evaluation
Once the repo is set up and files replaced, run the following:
Generate the Dataset JSON
cd generate_dataset_json
python tn3k.py
AnomalyCLIP requires a dataset-specific JSON file that defines the structure and category information.
The `if __name__ == '__main__'` block of `generate_dataset_json/tn3k.py` has been modified to point to the local TN3K dataset path as follows:
if __name__ == '__main__':
    runner = ClinicDBSolver(root='/home/shubham/Work/AnomalyCLIP/Thyroid_Dataset/tn3k')
    runner.run()
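For orientation, the generator writes a meta JSON that maps each test image to its mask, class name, and anomaly label, roughly along the lines of the hypothetical entry below; the field names and paths are placeholders for illustration, not necessarily the repo's exact schema.

import json

# Hypothetical example of a single test entry; paths and field names are placeholders
meta = {
    "test": {
        "thyroid": [
            {
                "img_path": "Thyroid_Dataset/tn3k/<test-image>.jpg",   # placeholder path
                "mask_path": "Thyroid_Dataset/tn3k/<test-mask>.jpg",   # placeholder path
                "cls_name": "thyroid",
                "anomaly": 1,   # 1 = contains a nodule, 0 = normal
            }
        ]
    }
}

with open("meta.json", "w") as f:
    json.dump(meta, f, indent=2)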
Run AnomalyCLIP
Once our dataset and JSON files are ready, we can either run AnomalyCLIP in zero-shot inference mode using pre-trained weights or fine-tune it on TN3K.
Zero-Shot Evaluation (Using Pretrained Weights)
Use the preconfigured shell script:
bash test_before_fine_tuning.sh
Make sure `test_before_fine_tuning.sh` is edited to include the correct paths to the pre-trained checkpoints for both the MVTec AD and VisA datasets. This will run AnomalyCLIP in inference mode and evaluate the pixel-level metrics.
Evaluation metric scores when evaluated with the MVTec AD checkpoint
25-07-01 17:06:40.674 - INFO: Logging test...
25-07-01 17:07:04.293 - INFO:
| objects | pixel_auroc | pixel_aupro |
|:----------|--------------:|--------------:|
| thyroid | 63.5 | 46.8 |
| mean | 63.5 | 46.8 |
Evaluation metric scores when evaluated with the VisA checkpoint
25-07-01 17:07:08.398 - INFO: Logging test...
25-07-01 17:07:32.191 - INFO:
| objects | pixel_auroc | pixel_aupro |
|:----------|--------------:|--------------:|
| thyroid | 63.4 | 39.8 |
| mean | 63.4 | 39.8 |
Fine-Tuning AnomalyCLIP on TN3K
To train AnomalyCLIP on TN3K with ground-truth segmentation masks:
bash train.sh
Fine-Tuning Logs
Upon completion, trained weights will be saved to `checkpoints/singlescale_tn3k`. These can be used for a second round of evaluation.
25-07-01 14:42:18.397 - INFO: epoch [1/15], loss:3.6556, image_loss:0.0433
25-07-01 14:46:37.770 - INFO: epoch [2/15], loss:3.3293, image_loss:0.0096
25-07-01 14:50:56.856 - INFO: epoch [3/15], loss:3.2725, image_loss:0.0083
25-07-01 14:55:15.565 - INFO: epoch [4/15], loss:3.2267, image_loss:0.0069
25-07-01 14:59:35.778 - INFO: epoch [5/15], loss:3.1975, image_loss:0.0064
25-07-01 15:03:55.017 - INFO: epoch [6/15], loss:3.1952, image_loss:0.0062
25-07-01 15:08:14.138 - INFO: epoch [7/15], loss:3.1792, image_loss:0.0063
25-07-01 15:12:35.894 - INFO: epoch [8/15], loss:3.1710, image_loss:0.0061
25-07-01 15:16:57.972 - INFO: epoch [9/15], loss:3.1683, image_loss:0.0064
25-07-01 15:21:19.763 - INFO: epoch [10/15], loss:3.1682, image_loss:0.0061
25-07-01 15:25:41.927 - INFO: epoch [11/15], loss:3.1566, image_loss:0.0059
25-07-01 15:30:04.917 - INFO: epoch [12/15], loss:3.1614, image_loss:0.0061
25-07-01 15:34:27.330 - INFO: epoch [13/15], loss:3.1604, image_loss:0.0064
25-07-01 15:38:49.085 - INFO: epoch [14/15], loss:3.1516, image_loss:0.0057
25-07-01 15:43:10.201 - INFO: epoch [15/15], loss:3.1504, image_loss:0.0060
Post-Training Evaluation
Once training is done, point the evaluation script at the newly trained checkpoint and run:
bash test_after_fine_tuning.sh
This step evaluates the fine-tuned model against TN3K’s segmentation ground truth and reports pixel-level AUROC and AUPRO.
Evaluation metric scores when evaluated with the fine-tuned TN3K checkpoint
25-07-01 17:14:28.583 - INFO: Logging test...
25-07-01 17:14:51.843 - INFO:
| objects | pixel_auroc | pixel_aupro |
|:----------|--------------:|--------------:|
| thyroid | 83.2 | 54.9 |
| mean | 83.2 | 54.9 |
This completes the full workflow for training and evaluating AnomalyCLIP on the TN3K dataset within the official repository’s structure.
Insights from Zero-Shot and Fine-Tuned Performance on TN3K
To comprehensively analyze AnomalyCLIP’s behavior on the TN3K thyroid nodule segmentation dataset, we evaluated three configurations using pixel-level AUROC and AUPRO. The results are compared with the numbers reported in the original AnomalyCLIP paper to provide context and highlight performance differences.
Evaluation Setup & Results
Configuration | Pixel AUROC | Pixel AUPRO |
---|---|---|
Zero-Shot (MVTec AD) | 63.5 | 46.8 |
Zero-Shot (VisA) | 63.4 | 39.8 |
Fine-Tuned on TN3K | 83.2 | 54.9 |
Official Paper (AnomalyCLIP) | 79.2 | 47.0 |
Summary of Experiments
- Experiment 1: Evaluation using MVTec AD checkpoint yielded moderate ZSAD scores (AUROC 63.5 / AUPRO 46.8)
- Experiment 2: Slightly lower AUPRO when evaluated using the VisA checkpoint (39.8)
- Experiment 3: Fine-tuning on TN3K achieved significantly better metrics (AUROC 83.2 / AUPRO 54.9)
- Experiment 4: The official paper’s result on TN3K (AUROC 79.2 / AUPRO 47.0) is lower than our fine-tuned model’s
Why Did Our Fine-Tuned Model Outperform the Paper?
While it is rare to outperform official benchmarks using the same architecture, several plausible factors may have contributed to our fine-tuned model’s stronger performance:
Updated Library Versions
Although the AnomalyCLIP repo uses older dependencies, our environment (Python, CUDA, PyTorch) might have introduced backend improvements (e.g., more stable loss behavior, faster convergence).
GPU Stability and Precision Handling
Hardware configurations also affect training stability. Differences in numerical precision or FP16 support could have contributed to improved convergence.
Related Work
AnomalyCLIP outperforms or complements many existing models:
Compared to CLIP-AD, ZOC, and ACR
- These methods require target-specific tuning or focus only on classification.
- AnomalyCLIP offers both classification and segmentation.
Compared to WinCLIP and VAND
- VAND uses projection learning, which weakens semantic alignment.
- AnomalyCLIP uses just two prompts and achieves better performance.
Compared to DenseCLIP and CoOp
- These need an additional decoder or object-specific prompts.
- AnomalyCLIP is fully prompt-based, efficient, and more general.
Model | Weakness | AnomalyCLIP Advantage |
---|---|---|
CLIP-AD, ZOC | Only support classification | Offers segmentation as well |
WinCLIP | Requires manual prompt engineering | Fully learnable with two prompts |
VAND | Projects features but struggles with semantics | Stronger semantic alignment with only two prompts |
CoOp, DenseCLIP | No segmentation, decoder-based | Decoder-free and efficient |
Key Takeaways
- Zero-shot detection from industrial datasets like MVTec AD or VisA provides only moderate transfer to TN3K.
- Fine-tuning specifically on TN3K boosts segmentation performance substantially.
- Our fine-tuned model outperforms the official AnomalyCLIP results — likely due to better dataset alignment, modern environments, or refined training configs.
Conclusion
AnomalyCLIP is a robust, flexible, and accurate framework for zero-shot anomaly detection. It addresses the limitations of existing VLM-based approaches by:
- Removing reliance on object semantics
- Refining prompts within the model’s text encoder
- Enhancing pixel-level attention
- Using glocal optimization for training
Extensive experiments show that AnomalyCLIP:
- Achieves top-tier performance in both classification and segmentation
- Works across domains from manufacturing to radiology
- Requires no handcrafted prompts or retraining for each dataset
AnomalyCLIP is a step forward in scalable, cross-domain anomaly detection using vision-language models.