OmniParser: Vision Based GUI Agent

In this article, we explore OmniParser, a UI screen parsing pipeline that combines a fine-tuned YOLO model for icon detection with Florence-2 for icon recognition and icon description generation.

The rapid advancement of Vision-Language Models (VLMs) has significantly improved the ability of AI systems to interact with graphical user interfaces (GUIs). However, existing models often struggle with action grounding. Models like GPT-4V cannot accurately map actions to specific UI elements across diverse applications and operating systems. To bridge this gap, Microsoft OmniParser introduces a pure vision-based screen parsing approach that extracts structured elements from UI screenshots, enhancing the action prediction capabilities of large multimodal models like GPT-4V.

OmniParser – Vision Based GUI Agent.
Figure 1. OmniParser – Vision Based GUI Agent.

OmniParser integrates multiple fine-tuned models into a single pipeline that detects interactable regions, describes UI elements, and overlays structured bounding boxes on the UI screenshot. This robust methodology allows AI agents to perform UI tasks without relying on additional metadata such as HTML or view hierarchies. This article provides an in-depth analysis of OmniParser’s methodology, pipeline, training strategies, and its impact on Vision-Language Models.

OmniParser Methodology 

OmniParser employs a structured screen parsing approach consisting of three major components:

  • Interactable Region Detection
  • Icon and Text Semantics Extraction
  • Bounding Box Integration with Unique Labels

Each component plays a crucial role in translating UI screenshots into actionable structured data.

Interactable Region Detection

OmniParser’s first step involves detecting interactable elements such as icons, buttons, text fields, and clickable links. 

Icons, texts, and interactable elements detected by the object detection model of OmniParser.
Figure 2. Icons, texts, and interactable elements detected by the object detection model of OmniParser.

Instead of relying on predefined HTML structures, it uses a fine-tuned object detection model trained on a curated dataset of UI elements extracted from popular web pages. The Set-of-Marks (SoM) approach is used to overlay bounding boxes on detected elements, ensuring precise action mapping.

Key features of the detection model are listed below, followed by a minimal detection sketch:

  • Trained on 67K unique UI screenshots with bounding boxes derived from DOM trees.
  • Integration with OCR to extract bounding boxes for text elements.
  • Adaptive bounding box merging to remove redundant overlaps and ensure clarity.
  • YOLOv8 Nano model fine-tuned for optimized screen parsing performance.
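
To make the detection stage more concrete, here is a minimal sketch of running a fine-tuned YOLOv8 Nano icon detector with the Ultralytics API. The weight path, confidence threshold, and IoU threshold are illustrative assumptions, not OmniParser's official defaults.

from ultralytics import YOLO

# Load the fine-tuned icon-detection weights (path is an assumption;
# the OmniParser repository ships them under weights/icon_detect).
model = YOLO("weights/icon_detect/model.pt")

# Detect candidate interactable regions in a UI screenshot.
results = model.predict("screenshot.png", conf=0.05, iou=0.7)

for box in results[0].boxes:
    x1, y1, x2, y2 = box.xyxy[0].tolist()
    print(f"region ({x1:.0f}, {y1:.0f}) -> ({x2:.0f}, {y2:.0f}), conf={box.conf.item():.2f}")

Each detected box can then be overlaid on the screenshot with a numeric ID, following the Set-of-Marks approach described above.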

Incorporating Local Semantics for Functionality

Once interactable elements are identified, OmniParser enhances their representation by generating localized semantic descriptions. This process mitigates the cognitive burden on GPT-4V by enriching the UI understanding with functional descriptions.

Icon classification and description generated by the fine-tuned Florence-2 or BLIP-2 model after the detection stage.
Figure 3. Icon classification and description generated by the fine-tuned Florence-2 or BLIP-2 model after the detection stage.

The following models and methodologies are used for icon description and icon identification, followed by a short captioning sketch:

  • Fine-tuned Florence-2 model for icon descriptions, replacing the earlier BLIP-2 approach.
  • PaddleOCR for extracting text from UI elements, including text below icons.
  • Context-aware icon and UI element description generation to distinguish between similar-looking components in different contexts.
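
As a rough illustration of the captioning step, the sketch below runs the base Florence-2 model on a cropped icon with its captioning task prompt. OmniParser ships its own fine-tuned weights (placed under weights/icon_caption_florence later in this article), so the model ID, prompt token, and generation settings here are assumptions for illustration only.

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Base Florence-2 model; OmniParser uses a fine-tuned variant of this family.
model_id = "microsoft/Florence-2-base"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Crop of a single detected icon (hypothetical file name).
icon = Image.open("icon_crop.png").convert("RGB")
inputs = processor(text="<CAPTION>", images=icon, return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=64, num_beams=3)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])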

Structured Representation of UI Screens

OmniParser produces a structured, DOM-like representation of UI elements, which includes the following (see the formatting sketch after this list):

  • UI screenshots overlaid with bounding boxes and numeric IDs.
  • Functional descriptions of detected icons and text elements.
  • Consolidated screen parsing output for direct use in downstream action prediction models.
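
To give an idea of how such a consolidated output can be handed to an action-prediction model, here is a small sketch that turns the parsed elements into a numbered text listing whose IDs match the Set-of-Marks labels on the screenshot. The exact prompt format used by OmniParser may differ; the dictionary keys mirror the parsed output shown later in this article.

# Build a numbered listing of elements for the downstream LLM
# (format is illustrative, not OmniParser's exact prompt).
def format_elements(parsed_elements):
    lines = []
    for idx, element in enumerate(parsed_elements):
        kind = "interactable icon" if element["interactivity"] else element["type"]
        lines.append(f"ID {idx} ({kind}): {element['content'].strip()}")
    return "\n".join(lines)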

The following image shows what full-screen icon detection, along with the internal icon parsing and descriptions, looks like.

Screen parsing result in the OmniParser pipeline.
Figure 4. Screen parsing result in the OmniParser pipeline.

Model Pipeline and Training Strategy

The OmniParser pipeline consists of multiple trained models, each contributing to robust screen parsing. The training pipeline includes dataset curation, fine-tuning of detection and description models, and post-processing strategies.

Dataset Curation

To ensure high accuracy in screen parsing, Microsoft curated datasets for both detection and description tasks:

  • Interactable Region Detection Dataset: Collected from 100K popular webpages, extracting bounding boxes from DOM structures.
  • Icon Description Dataset: Created using GPT-4o. The model assisted in the annotation of 7K icon-description pairs, ensuring diverse UI element representations.

Model Training

  • Detection Model: Fine-tuned YOLOv8 Nano on the curated detection dataset to efficiently detect and label interactable UI elements across different platforms and applications.
  • OCR Module: Utilizes PaddleOCR to accurately extract and merge text from UI elements, including textual information present below icons (see the OCR sketch after this list).
  • Icon Description Model: The fine-tuned Florence model generates high-quality functional descriptions for UI icons, images, graphs, and tables, ensuring accurate interpretation of screen elements in various application environments.
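
The sketch below shows how text boxes might be extracted before being merged with the detector output. It assumes the PaddleOCR 2.x Python API; the language setting and file name are illustrative.

from paddleocr import PaddleOCR

# English OCR; OmniParser merges these text boxes with the YOLO detections.
ocr = PaddleOCR(lang="en", use_angle_cls=False)
result = ocr.ocr("screenshot.png", cls=False)

text_boxes = []
for box_points, (text, confidence) in result[0]:
    text_boxes.append({"type": "text", "bbox": box_points, "content": text})
    print(f"{text!r} (conf={confidence:.2f})")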

Performance and Evaluation

The authors evaluated OmniParser on multiple benchmarks, demonstrating superior performance over existing models.

SeeAssign Task

SeeAssign is a benchmark designed to test bounding box ID prediction accuracy across mobile, desktop, and web platforms.

There is a task associated with each screenshot. After the screen parsing and icon detection step, the GPT-4V model is fed the output along with the task. It has to correctly predict which box ID to click.

The SeeAssign benchmark.
Figure 5. The SeeAssign benchmark.

OmniParser improved GPT-4V’s accuracy from 70.5% to 93.8% by incorporating local semantics.

ScreenSpot Benchmark

The ScreenSpot dataset is a benchmark consisting of over 600 interface screenshots from mobile, desktop, and web platforms. OmniParser’s structured screen parsing approach significantly outperformed baselines in UI understanding tasks:

  • Text/icon widget recognition accuracy improved by over 20% compared to GPT-4V.
  • Surpassed fine-tuned GUI models (e.g., SeeClick, CogAgent, Fuyu) on multi-platform UI tasks.

Mind2Web Benchmark

Mind2Web is a benchmark designed for evaluating web navigation models. It consists of tasks that require models to interact with and navigate through various real-world websites, simulating user interactions. Unlike traditional approaches relying on HTML parsing, OmniParser integrates its screen parsing results with GPT-4V, allowing for a purely vision-based interaction approach. This methodology:

  • Outperformed HTML-based methods using only UI screenshots.
  • Improved task success rate by 4.1%-5.2% in cross-website and cross-domain tasks, demonstrating OmniParser’s ability to generalize across diverse web environments.

AITW Benchmark (Mobile UI Task Navigation)

The Android in the Wild (AITW) benchmark is designed to evaluate AI-driven interactions with mobile applications. It includes a collection of real-world mobile UI tasks requiring models to detect, interpret, and interact with various UI components across different applications and platforms. By leveraging OmniParser’s vision-based parsing pipeline, GPT-4V achieved:

  • 4.7% increase over GPT-4V baselines in mobile app navigation.
  • Successful detection and interaction with UI elements across multiple mobile operating systems without relying on additional metadata, such as Android view hierarchies.

Discussion and Future Improvements

Despite its strong performance, OmniParser faces challenges such as:

  • Ambiguity in repeated UI elements (e.g., multiple “More” buttons on the same screen).
  • Coarse bounding box predictions affecting fine-grained click targets.
  • Contextual misinterpretation of icons due to isolated description generation.

Future enhancements include:

  • Training a joint OCR-Interactable Detection model to improve clickable text localization.
  • Context-aware icon description models leveraging full-screen context.
  • Adaptive hierarchical bounding box refinement to improve interaction accuracy.

UI Parsing Inference using OmniParser

The official OmniParser repository provides a Gradio demo for UI parsing. We can upload any UI image and get all the detection results for icons, text boxes, and other UI elements. 

We will run the demo here, which will give us first-hand experience of how OmniParser works.

Setting Up OmniParser

Make sure you have either Anaconda or Miniconda installed on your system before moving further with the installation steps. The following steps were tested on an Ubuntu machine.

Clone the OmniParser repository:

git clone https://github.com/microsoft/OmniParser.git
cd OmniParser

Create a new conda environment and activate it:

conda create -n "omni" python==3.12
conda activate omni

Install the requirements:

pip install -r requirements.txt

This will install all the necessary libraries.

The final step is to download the pretrained models. Run the following command in your terminal inside the OmniParser directory.

for f in icon_detect/{train_args.yaml,model.pt,model.yaml} icon_caption/{config.json,generation_config.json,model.safetensors}; do huggingface-cli download microsoft/OmniParser-v2.0 "$f" --local-dir weights; done
mv weights/icon_caption weights/icon_caption_florence

It will download the YOLOv8 Nano model trained for icon detection and the fine-tuned Florence-2 model for icon caption generation.

We can launch the Gradio demo using the following command:

python gradio_demo.py

This will open the Gradio demo app where we can upload images and parse the screen elements. 

OmniParser Gradio demo result.
Figure 6. OmniParser Gradio demo result.

With each UI element detection result, the demo also provides a text result of the parsed detection. This helps us understand how well the combination of YOLO, PaddleOCR, and Florence-2 understands the image.

Let’s discuss some results.

OmniParser icon detection result on a Google Document image.
Figure 7. OmniParser icon detection result on a Google Document image.

The first result that we are discussing here is the parsed result of a Google Document page. It contains a combination of text, headings, icons, and document tool elements. The YOLOv8 model did a good job of detecting most of the items, including the Table of Contents in the left tab. However, in some instances, it only partially detects a line of text.

Let’s take a look at the parsed text result returned by the pipeline.

icon 0: {'type': 'text', 'bbox': [0.015729453414678574, 0.007022472098469734, 0.05741250514984131, 0.02738764137029648], 'interactivity': False, 'content': 'OmniParser', 'source': 'box_ocr_content_ocr'}
icon 1: {'type': 'text', 'bbox': [0.017695635557174683, 0.023876404389739037, 0.08847817778587341, 0.042134832590818405], 'interactivity': False, 'content': 'File Edit View Insert', 'source': 'box_ocr_content_ocr'}
icon 2: {'type': 'text', 'bbox': [0.13212740421295166, 0.02738764137029648, 0.1635863184928894, 0.04073033854365349], 'interactivity': False, 'content': 'Extensions', 'source': 'box_ocr_content_ocr'}
.
.
.
icon 50: {'type': 'icon', 'bbox': [0.019255181774497032, 0.6464248895645142, 0.10337236523628235, 0.6727869510650635], 'interactivity': True, 'content': 'Key Features ', 'source': 'box_yolo_content_ocr'}
icon 51: {'type': 'icon', 'bbox': [0.022174695506691933, 0.39646413922309875, 0.10094291716814041, 0.41684794425964355], 'interactivity': True, 'content': 'Performance and Evaluat. ', 'source': 'box_yolo_content_ocr'}

Each element is recognized as either text or an icon. For text boxes, the pipeline also returns the content, and it does the same for icons that contain text. For icons, however, a major part is determining whether the element is interactable or not, which the interactivity attribute signifies.
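
As a quick illustration of how a downstream agent might consume this output, the sketch below filters the parsed list down to clickable candidates. The sample entries are a hypothetical, truncated subset of the output above; the dictionary keys are the same.

# Hypothetical subset of the parsed output shown above.
parsed_elements = [
    {"type": "text", "bbox": [0.016, 0.007, 0.057, 0.027], "interactivity": False,
     "content": "OmniParser", "source": "box_ocr_content_ocr"},
    {"type": "icon", "bbox": [0.019, 0.646, 0.103, 0.673], "interactivity": True,
     "content": "Key Features", "source": "box_yolo_content_ocr"},
]

# Only interactable elements are candidates for click actions.
clickable = [e for e in parsed_elements if e["interactivity"]]
for element in clickable:
    print(f"clickable: {element['content']!r} at {element['bbox']}")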

Let’s check another example containing a UI from a mobile application.

OmniParser icon detection on a Mobile UI page.
Figure 8. OmniParser icon detection on a Mobile UI page.

Following are the parsed contents.

icon 0: {'type': 'text', 'bbox': [0.21107783913612366, 0.14407436549663544, 0.314371258020401, 0.16266460716724396], 'interactivity': False, 'content': 'prime', 'source': 'box_ocr_content_ocr'}
icon 1: {'type': 'text', 'bbox': [0.18113772571086884, 0.6762199997901917, 0.44760480523109436, 0.7041053175926208], 'interactivity': False, 'content': 'FOSSIL', 'source': 'box_ocr_content_ocr'}
icon 2: {'type': 'text', 'bbox': [0.19760479032993317, 0.7079783082008362, 0.4341317415237427, 0.7288923263549805], 'interactivity': False, 'content': 'KEEP IT CLASSIC', 'source': 'box_ocr_content_ocr'}
icon 3: {'type': 'text', 'bbox': [0.1002994030714035, 0.8187451362609863, 0.6871257424354553, 0.8396591544151306], 'interactivity': False, 'content': 'Insnired hv vour Wish I ist', 'source': 'box_ocr_content_ocr'}
icon 4: {'type': 'icon', 'bbox': [0.06284164637327194, 0.1702609360218048, 0.799872875213623, 0.21801446378231049], 'interactivity': True, 'content': 'Search ', 'source': 'box_yolo_content_ocr'}
icon 5: {'type': 'icon', 'bbox': [0.04233129322528839, 0.8364495038986206, 0.18125209212303162, 0.9036354422569275], 'interactivity': True, 'content': 'Home ', 'source': 'box_yolo_content_ocr'}
icon 6: {'type': 'icon', 'bbox': [0.177901029586792, 0.8357771635055542, 0.32755449414253235, 0.9026133418083191], 'interactivity': True, 'content': 'Laptop ', 'source': 'box_yolo_content_ocr'}
icon 7: {'type': 'icon', 'bbox': [0.33590322732925415, 0.835365891456604, 0.48725128173828125, 0.9017225503921509], 'interactivity': True, 'content': 'Mobile ', 'source': 'box_yolo_content_ocr'}
icon 8: {'type': 'icon', 'bbox': [0.8018407821655273, 0.8358777165412903, 0.9442757368087769, 0.9030740261077881], 'interactivity': True, 'content': 'Menu ', 'source': 'box_yolo_content_ocr'}
icon 9: {'type': 'icon', 'bbox': [0.03968415409326553, 0.33482760190963745, 0.9533154368400574, 0.41663098335266113], 'interactivity': True, 'content': 'amazon Win a Nokia 8* DOWNLOAD THE APP & SIGN IN 08:00 ', 'source': 'box_yolo_content_ocr'}
icon 10: {'type': 'icon', 'bbox': [0.6559209823608398, 0.8375254273414612, 0.8008942008018494, 0.901656985282898], 'interactivity': True, 'content': 'Deals ', 'source': 'box_yolo_content_ocr'}
icon 11: {'type': 'icon', 'bbox': [0.4851638972759247, 0.8353713154792786, 0.6618077754974365, 0.901587963104248], 'interactivity': True, 'content': 'TV Television ', 'source': 'box_yolo_content_ocr'}
icon 12: {'type': 'icon', 'bbox': [0.04066063091158867, 0.41592150926589966, 0.9421791434288025, 0.6008225679397583], 'interactivity': True, 'content': '20 CARNIV 19th 21 February UPTO3,000OFF InduslndBank 10Instant discount on Indusind Bank Debit/Credit cards *T&CApply ', 'source': 'box_yolo_content_ocr'}
icon 13: {'type': 'icon', 'bbox': [0.2405502200126648, 0.23145084083080292, 0.4399498701095581, 0.2801952064037323], 'interactivity': True, 'content': 'Wish List ', 'source': 'box_yolo_content_ocr'}
icon 14: {'type': 'icon', 'bbox': [0.43703392148017883, 0.2374911904335022, 0.587538480758667, 0.2797139883041382], 'interactivity': True, 'content': 'Deals ', 'source': 'box_yolo_content_ocr'}
icon 15: {'type': 'icon', 'bbox': [0.05106184259057045, 0.2260615974664688, 0.24924203753471375, 0.279988557100296], 'interactivity': True, 'content': 'Shop By Category ', 'source': 'box_yolo_content_ocr'}
icon 16: {'type': 'icon', 'bbox': [0.042542532086372375, 0.27909427881240845, 0.9558238983154297, 0.3390219211578369], 'interactivity': True, 'content': 'Deliver to Shantanu GURUGRAM 122002 ', 'source': 'box_yolo_content_ocr'}
icon 17: {'type': 'icon', 'bbox': [0.5791370272636414, 0.2366282194852829, 0.6990264058113098, 0.27986839413642883], 'interactivity': True, 'content': 'Sell ', 'source': 'box_yolo_content_ocr'}
icon 18: {'type': 'icon', 'bbox': [0.0757032185792923, 0.11301160603761673, 0.3205451965332031, 0.1488722413778305], 'interactivity': True, 'content': 'amazon.in ', 'source': 'box_yolo_content_ocr'}
icon 19: {'type': 'icon', 'bbox': [0.79616779088974, 0.1689583659172058, 0.931917130947113, 0.21612757444381714], 'interactivity': True, 'content': 'Find', 'source': 'box_yolo_content_yolo'}
icon 20: {'type': 'icon', 'bbox': [0.8107972145080566, 0.11833497136831284, 0.9111638069152832, 0.16139522194862366], 'interactivity': True, 'content': 'shopping cart', 'source': 'box_yolo_content_yolo'}
icon 21: {'type': 'icon', 'bbox': [0.6982802748680115, 0.1165347620844841, 0.7747448086738586, 0.16223861277103424], 'interactivity': True, 'content': 'User profile', 'source': 'box_yolo_content_yolo'}
icon 22: {'type': 'icon', 'bbox': [0.05113839730620384, 0.7825599312782288, 0.9413369297981262, 0.8036486506462097], 'interactivity': True, 'content': 'Slide', 'source': 'box_yolo_content_yolo'}

The above represents a more real-life use case, where a user may ask the agent to add an item to the cart and proceed to checkout. Here, most of the elements are interactable icons, which the pipeline has predicted correctly.

Later in the article, we will see how to use the above pipeline as a GUI agent for computer use.

OmniTool: A Local Windows VM for Computer Use

The latest OmniParser repository provides OmniTool, a comprehensive system that enables AI-driven GUI interactions within a local Windows 11 VM. It integrates OmniParser V2 with various vision models, providing an efficient testing and execution environment for agent-based automation. 

OmniTool logo.
Figure 9. OmniTool logo.

This allows us to try out computer use on our own machines.

Components

  1. OmniParserServer – A FastAPI server running OmniParser V2 (see the client sketch after this list).
  2. OmniBox – A lightweight Windows 11 VM running inside a Docker container.
  3. Gradio UI – A web-based interface for issuing commands and monitoring AI agent execution.
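
As a rough sketch of how these pieces talk to each other, the snippet below posts a screenshot to a locally running OmniParserServer. The port, endpoint path, and payload field are assumptions for illustration; check the code under the omnitool directory for the actual FastAPI routes before relying on this.

import base64

import requests

# Encode a screenshot and send it to the local OmniParserServer.
# URL, route, and JSON field name are hypothetical placeholders.
with open("screenshot.png", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    "http://localhost:8000/parse/",
    json={"base64_image": b64_image},
    timeout=120,
)
response.raise_for_status()
print(response.json())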

Key Features

  • OmniParser V2 is 60% faster than its predecessor and supports a broader range of OS and app icons.
  • OmniBox reduces disk space usage by 50% compared to traditional Windows VMs.
  • Supports major vision models, including OpenAI (4o/o1/o3-mini), DeepSeek (R1), Qwen (2.5VL), and Anthropic Computer Use.

Setup Instructions

The repository provides detailed setup instructions for OmniTool in the README file inside the omnitool directory.

It is recommended to follow the instructions and set it up before carrying out your own experiments.

Here, we will discuss a few experiments that our team carried out.

We used OpenAI GPT-4o for all experiments. The experiments here mostly involve browser use through the agent rather than internal system use.

OmniTool Experiment 1: Downloading the OpenCV GitHub Zip File

For the first experiment, we asked the OmniTool agent to download the zip file for the OpenCV GitHub repository.

Video 1. OmniTool demo where we ask the agent to download the zip file from the OpenCV GitHub page.

After initializing the process, the agent carried out the following steps:

  • Opened the Google Chrome web browser.
  • Navigated to the OpenCV GitHub repository.
  • Clicked on the Code button and downloaded the zip file.

All the while, the left tab showed the screenshots of the parsed screens and, in text, the steps taken by the LLM.

However, after downloading the file, the agent loop did not end. It kept downloading the file repeatedly, and we had to kill the process manually.

We can say that the process was a 90% success; it would have been great to see the agent end the loop on its own.

OmniTool Experiment 2: Adding Items to Cart on Amazon

Next, we gave the OmniTool a more complex task. We asked it to go to the Amazon website, add a Dell Alienware laptop to the cart, and proceed to checkout.

Video 2. OmniTool demo 2. Here, we ask the agent to add a laptop to the cart on the Amazon website and proceed to checkout.

We observed several interesting actions by the agent here.

Firstly, when the agent navigated to the Amazon website, it was greeted with a captcha screen, which we were not expecting. To our surprise, however, it was able to pass the test successfully.

Secondly, after some trial and error, it was able to correctly navigate to the Amazon search bar and search for the laptop. However, rather than picking the laptop we asked for, it clicked on the very first link it saw. This shows an inability to keep minute details in memory when carrying out complex tasks.

Nonetheless, it proceeded. However, instead of an “Add to Cart” button, the page contained a “See All Buying Options” button. The agent kept searching for the “Add to Cart” button and kept scrolling down the page, which was also shown in the left-side tab. After multiple such scrolls, we killed the operation, as the button was never going to appear at the bottom of the page.

Key Observations from OmniTool Agent Usage

In both cases, although deemed failures, we observed some interesting scenarios.

  • In the first case, the model was able to download the zip file but did not end the agentic loop. Prompting with an explicit ending instruction would probably have fixed this.
  • In the second case, the model was presented with a sudden captcha test which, surprisingly, it was able to pass. However, it could not add the correct laptop to the cart.

In both cases, we observed failures as well as some intelligent moments. This shows that agentic AI and computer use, although good for simple use cases, still have a long way to go.

Summary

In this article, we covered OmniParser, a UI screen parsing pipeline that helps autonomous agents with computer use. It is paired with OmniTool, which integrates the results from OmniParser with several VLMs to provide an autonomous agent for computer use running in a VM.

We also discussed some interesting use cases of OmniTool, which areas it aced, and where it failed.

Do give this a try on your own with some simple use cases. Maybe you will find something interesting which is worth sharing in the comment section below.


