
AI Agents are usually API-bound workflows, designed to execute specific tasks with minimal human intervention. But when it comes to generic, open-ended automation, we’re still in the very early days. Imagine an AI agent that can handle any task, simply from a natural language instruction. Wouldn’t that be far more powerful?
In this blog post, we’ll walk through how to build a GUI-native, local, VLM-based AI agent from scratch using Moondream3 and Gemini.
What you’ll learn in this post:
- Explore the capabilities of the Moondream3 Preview model.
- Learn how to build an open-ended agent that understands and executes natural instructions.
- See the agent in action with real applications and use cases.
Table of Contents:
- Why an Open-Ended, Local VLM-Based AI Agent?
- System Configurations for Running Moondream3 VLM Locally
- Moondream3 Preview Vision Language Model
- AI Agent Building Pipeline
- Action Schema for Building the AI Agent: Code Explanation
- Main Agentic Workflow Script
- Results of the VLM-Based AI Agent
- Conclusion: Building AI Agents using VLMs
Why an Open-Ended, Local VLM-Based AI Agent?
The objective is simple. We want an agent that can process natural language input. It should be open-ended and generic, and it should not rely on app-specific APIs.
Object detection models are efficient, but they can only detect the classes they were trained on. Hence, we need VLMs (Vision Language Models), which can understand a scene from both text and images. Open-ended automation becomes possible with VLMs because they can perform zero-shot grounding.
1.1 Why Moondream3 VLM?
VLMs in general are heavy and resource-intensive, and reducing the parameter count to make them lighter usually costs accuracy. We have been testing many lightweight VLMs for detection and pointing – such as FastVLM from Apple, SmolVLM from Hugging Face, InternVL, LiquidAI, Qwen2.5-VL, and Moondream2 – but none of them is accurate enough to locate objects on screen. With the recent release of the Moondream3 Preview model, however, we have high hopes. The authors claim that it is even superior to GPT-5 and Gemini.
So, will it be able to precisely locate GUI elements on screen? Continue reading to find out.
System Configurations for Running Moondream3 VLM Locally
We are using Windows 11 with an Nvidia RTX 3060 12 GB GPU and 16 GB of RAM to run the pipeline. I mention Windows 11 specifically because the pipeline does not work on the Ubuntu instance on the same PC.
Other Ubuntu test setups used so far are as follows:
- Nvidia RTX 3080 Ti, 12 GB, 32 GB RAM
- Nvidia RTX 3060, 6 GB
The Moondream3 Preview model at fp16 precision requires about 19 GB of VRAM. As you can see, my systems fall well short of that. However, the Windows instance handles it well by offloading the remaining weights to system RAM.
Note: I am yet to find out how this is happening. If anyone knows, please comment. cc: Vikhyat Korrapati. We tried doing the same with the load_in_cpu flag on the 3080 Ti setup (Ubuntu); it throws a tensor mismatch error.
Moondream3 Preview Vision Language Model
Moondream 3 (Preview) is a major step up from its earlier versions, shifting from small dense models to a more advanced design. It is a fine-grained sparse mixture-of-experts model with 64 experts, of which 8 are activated for each token.
It scales to 9B total parameters, but only 2B active at a time. On the vision side, it continues to rely on a SigLIP-based ViT encoder, but with improved multi-crop channel concatenation for handling high-resolution images more efficiently.
Moondream 3 also extends the context length up to 32K tokens, compared to 4K tokens in v1 and v2. As per the release notes, it has the potential to perform few-shot prompting and agentic workflows with tool use. We will find out how it goes after the stable release. Functionally, the preview release shows strong improvements in OCR, document understanding, structured output, visual reasoning, and grounding. Let’s go ahead and run the model on some example images as shown below.

3.1 Install and Import Dependencies for Moondream 3
The dependencies are very straightforward. We need Hugging Face Hub, Transformers, PyTorch with CUDA, and a few other dependent libraries shown in the code below. Additionally, you can find the full set of dependencies in the requirements.txt file in the downloaded code.
!pip install torch torchvision torchaudio accelerate huggingface_hub transformers pillow matplotlib
import torch
import time
from PIL import Image, ImageDraw
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['image.cmap'] = 'gray'
from transformers import AutoModelForCausalLM, AutoTokenizer
Note that it is a gated model, so you have to log in to Hugging Face, accept the license, and authenticate using a generated access token. Check out the blog post VLM on Jetson Nano, where the authentication part is explained.
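For reference, a minimal authentication sketch (assuming you have already generated a read token in your Hugging Face account settings and accepted the model license):
from huggingface_hub import login

login(token="hf_xxxxxxxxxxxxxxxx")  # hypothetical placeholder; replace with your own token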
3.2 Load Moondream3 Model and Compile
We are loading the model in bfloat16 precision, which is the default. Quantized int8 and int4 variants are not yet provided by the developers.
moondream = AutoModelForCausalLM.from_pretrained(
    "moondream/moondream3-preview",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
moondream.compile()
3.3 Check Parameters and Approx Model Size
Let’s verify the parameter count and the approximate size of the model in memory. We can see that it has roughly 9B parameters and occupies about 17.26 GB.
num_params = sum(p.numel() for p in moondream.parameters())
print("Parameters:", num_params)
dtype = next(moondream.parameters()).dtype
bytes_per_param = {torch.float32: 4, torch.float16: 2, torch.bfloat16: 2}.get(dtype, 4)
size_gb = num_params * bytes_per_param / (1024**3)
print(f"Approx size in memory: {size_gb:.2f} GB")
3.4 Load Images and Visualize
Moondream3 can do much more than grounding. However, within the scope of this blog post, we will focus only on pointing to objects. You can check out tests on other tasks, such as query, reasoning, and OCR, in the notebook enclosed within the download code.
apple_img = Image.open("../../tasks/apples.png")
icon_image = Image.open("../../tasks/icons.png")
# Display the image
plt.figure(figsize=(14,7))
plt.subplot(121); plt.imshow(apple_img); plt.title("Apples");
plt.axis('off');
plt.subplot(122); plt.imshow(icon_image); plt.title("Icons Apps"); plt.axis('off');
3.5 Point To Objects using Moondream3
In the following snippet of code, we are finding the coordinates of the apples in the input image. The P(x,y) points are annotated on the image for better visualization.
# Find points for specific objects
result = moondream.point(apple_img, "Apples in the image")
width = apple_img.width
height = apple_img.height
annotated_img = apple_img.copy()
draw = ImageDraw.Draw(annotated_img)
rad = 5
# Points are normalized coordinates (0-1)
for i, point in enumerate(result["points"]):
    abs_x = int(point['x'] * width)
    abs_y = int(point['y'] * height)
    print(f"Point {i+1}: x={abs_x}, y={abs_y}")
    draw.ellipse((abs_x - rad, abs_y - rad, abs_x + rad, abs_y + rad), outline="red", fill="red")
    draw.ellipse((abs_x - rad*5, abs_y - rad*5, abs_x + rad*5, abs_y + rad*5), outline="blue")
plt.figure(figsize=(14,7))
plt.imshow(annotated_img); plt.title("Apples detected");
plt.axis("off")

We will use the same call on the icon image but to find the Netflix icon this time.
# Find points for specific objects
result = moondream.point(icon_image, "Netflix icon in the image")

The pointing results are decent. We will be able to utilize this feature to locate GUI elements like buttons and icons in the application. Now, let’s go through the VLM agent building pipeline.
AI Agent Building Pipeline
Let’s assume that Moondream3 can precisely pinpoint elements. How do we use that information? Let’s try to understand via some example problem statements.
Input 1: “Hi, please keep an eye on the classification training notebook on the colab window, send results to Kukil Kashyap on WhatsApp once done, then shutdown the system”.
Input 2: “Hi, please search for top 3 posts on Vision Language Models on Edge devices. Prepare a summary and note down in the text editor, save the file, and send it to me on Telegram and GMail”.
In both cases, our brain plans the sequence of events. Then we operate the applications, providing inputs through the keyboard and mouse. Based on what appears on screen after each step, we decide what to do next. We are replicating the same process here.
Here, Gemini takes the lead on reasoning, planning, OCR, and code generation, while Moondream3 shines in grounding tasks like object detection and pointing at elements. Together, they enable zero-shot action generation and execution through a set of custom action functions, bringing agentic automation directly to your desktop. We will discuss this in detail, along with the code, in a bit. Now, let’s go ahead and define the action schema.
Action Schema for Building the AI Agent: Code Explanation
The set of actions that we perform on a computer screen is more or less fixed. These are mostly mouse clicks (left click, double click, scroll, right click), key presses, holds, move-to (x, y), hotkeys, etc. We have defined a helper script named actions.py, where the entire action schema is defined.
Then we have the openAgent.py file, which contains the event planner and the task-execution pipeline. We are using Gemini 2.5 Flash for creating event instructions from user queries; it is also used for reading critical information on screen. The Moondream3-based point-locator function is called whenever a particular element needs to be located on screen.
5.1 PyAutoGUI Events for AI Agent
Let’s go through the predefined action functions covering what we generally do on a computer. We are adding only a few necessary actions to keep the article uncluttered. The functions are pretty much self-explanatory; we are using PyAutoGUI to simulate the events.
import pyautogui

def scroll(amount):
    pyautogui.scroll(amount)

def move_mouse(x, y, duration=0.2):
    pyautogui.moveTo(x, y, duration=duration)

def click(x=None, y=None, button="left"):
    pyautogui.click(x, y, button=button)

def double_click(x=None, y=None):
    pyautogui.doubleClick(x, y)

def right_click(x=None, y=None):
    pyautogui.rightClick(x, y)

def drag_and_drop(x1, y1, x2, y2, duration=0.5):
    pyautogui.moveTo(x1, y1)
    pyautogui.dragTo(x2, y2, duration=duration)

def press_key(key):
    pyautogui.press(key)

def hotkey(*keys):
    pyautogui.hotkey(*keys)

def type_text(text):
    pyautogui.typewrite(text, interval=0.05)

def clear_field():
    pyautogui.hotkey("ctrl", "a")
    pyautogui.press("backspace")
5.2 Process Handling and App Loading Events
import time
import psutil

def is_process_running(name):
    return any(name.lower() in p.name().lower() for p in psutil.process_iter())

def kill_process(name):
    for p in psutil.process_iter():
        if name.lower() in p.name().lower():
            p.terminate()
            return True
    return False

def sleep(seconds=None):
    if seconds is not None:
        time.sleep(seconds)
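A small usage sketch for the process helpers (assuming Chrome is the app of interest):
if is_process_running("chrome"):
    kill_process("chrome")   # terminate any running Chrome instance
sleep(2)                     # give the OS a moment before the next step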
Next, we have the launch_app() function. There are a few predefined app paths (we added them for testing). The function loads applications using the os.startfile() method on Windows (and the open / xdg-open commands on macOS / Linux). If it fails to load an application this way, a fallback method searches for and opens the app from the OS search menu. It is designed to handle app loading on Windows, macOS, and Linux.
Disclaimer: Not yet tested on macOS and Linux. Note that delays have been added to allow applications to load.
import os
import platform

predefined_paths = {
    "whatsapp": "whatsapp://",
    "chrome": "C:/Program Files/Google/Chrome/Application/chrome.exe",
    "edge": "C:/Program Files (x86)/Microsoft/Edge/Application/msedge.exe",
    "sublime text": "C:/Program Files/Sublime Text 3/sublime_text.exe"
}

def launch_app(path):
    resolved_path = predefined_paths.get(path.lower(), path)
    system = platform.system()
    try:
        if system == "Windows":
            os.startfile(resolved_path)
        elif system == "Darwin":  # macOS
            os.system(f"open -a '{resolved_path}'")
        elif system == "Linux":
            os.system(f"xdg-open '{resolved_path}'")
        else:
            raise Exception("Unsupported OS")
        print(f"✅ Launched {path} via predefined path")
        time.sleep(2)
        return True
    except Exception as e:
        print(f"❌ Could not launch {path} via predefined path: {e}")
        print("⚡ Attempting search fallback...")
        try:
            if system == "Windows":
                pyautogui.press("win")
                time.sleep(1)
                pyautogui.typewrite(path, interval=0.05)
                time.sleep(0.5)
                pyautogui.press("enter")
            elif system == "Darwin":  # macOS Spotlight
                pyautogui.hotkey("command", "space")
                time.sleep(1)
                pyautogui.typewrite(path, interval=0.05)
                time.sleep(0.5)
                pyautogui.press("enter")
            elif system == "Linux":  # Ubuntu GNOME/KDE (Super key opens Activities)
                pyautogui.press("win")
                time.sleep(1)
                pyautogui.typewrite(path, interval=0.05)
                time.sleep(0.5)
                pyautogui.press("enter")
            time.sleep(3)
            return True
        except Exception as ex:
            print(f"❌ Fallback failed for {path}: {ex}")
            return False
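A quick usage sketch: launch Chrome from its predefined path, then try an app that is not predefined so the search fallback kicks in.
launch_app("chrome")   # resolves to the predefined Chrome path
launch_app("notepad")  # not predefined, so it is searched from the Start menu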
Main Agentic Workflow Script

This is the primary event generator and action handler file. As discussed before, the script contains the following functions.
- get_action_plan()
- execute_actions()
- read_text_from_image_gemini()
- locate_object_moondream()
Note: We will go through the major functions only, as it is not possible to cover all aspects of a full-blown agent within the scope of a single blog post. However, the downloaded file contains the full code.
Let’s start with the main function to understand the workflow better. The prompt is passed in through the argument parser. Here, we initialize the genai client for calling the Gemini models and then load Moondream3. The action plan is generated by Gemini, followed by the task execution.
6.1 Main Function
if __name__ == "__main__":
    time.sleep(0.5)
    parser = argparse.ArgumentParser(description="Agentic Action Plan Executor")
    parser.add_argument(
        "--task",
        type=str,
        required=True,
        help="Task prompt in natural language, e.g., 'Summarize screen content and send report via WhatsApp'"
    )
    args = parser.parse_args()
    task_to_do = args.task

    # Setup Genai client for using Google Gemini Models
    client = genai.Client()

    # Load Moondream model
    model_name = "moondream/moondream3-preview"
    moondream = AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        dtype=torch.bfloat16,
        device_map={"": "cuda"}
    )
    moondream.compile()

    # Get action plan from Gemini
    print("📝 Generating action plan ...")
    action_pln = get_action_plan(client, task_to_do)
    print("📋 Action plan:", action_pln)

    # Execute actions
    execute_actions(action_pln)
6.2 Get Action Plan Function for AI Agent Event Planner
This is the brain, or mastermind, of the AI agent. It orchestrates an action plan that contains events from the actions.py schema. It returns the methods as a JSON string, which is parsed later in the execute_actions() method. For the natural-language input “Summarize screen content and send via WhatsApp”, the JSON plan looks like the output below.
[
    {"action": "read_text_from_image_gemini", "args": {"query": "Summarize the screen"}},
    {"action": "launch_app", "args": {"path": "whatsapp"}},
    {"action": "type_text", "args": {"text": "<OUTPUT_FROM_read_text_from_image_gemini>"}},
    {"action": "press_key", "args": {"key": "enter"}}
]

The effectiveness of the action plan depends on how well the system prompt has been designed. This is a zero-shot task. As you can see below, we have written the prompt in natural language, mentioning all contingencies and the general steps to be followed. Gemini 2.5 Flash takes care of the rest.
def get_action_plan(client, prompt):
    # Here goes the system prompt (shown in full in section 6.3 below)
    system_prompt = <SYSTEM_PROMPT>

    print('Generating Action Plan. Please wait ...')

    # Call Gemini
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[{"role": "user", "parts": [{"text": system_prompt}]}],
    )
    print('✅ Done.')

    # Access text properly
    try:
        raw_text = response.candidates[0].content.parts[0].text.strip()
        raw_text = re.sub(r"^```(?:json)?|```$", "", raw_text, flags=re.MULTILINE).strip()
        action_plan = json.loads(raw_text)

        # Validate schema
        for i, step in enumerate(action_plan):
            if "action" not in step or "args" not in step:
                print(f"⚠️ Step {i} missing required keys:", step)
        return action_plan
    except Exception as e:
        print("❌ Error generating action plan:", e)
        print("Response text:", raw_text[:500])
        return []
6.3 Designing System Prompt to Retrieve Agentic Action Plan from Gemini
system_prompt = f"""
You are a gui-native agent planner. Convert the user's instruction into a JSON action plan.
Use only the following action schema:
- read_text_from_image_gemini(client, query)
- locate_object_moondream(target_obj, model)
- click(x=None, y=None, button="left")
- double_click(x=None, y=None)
- right_click(x=None, y=None)
- move_mouse(x, y, duration=0.2)
- press_key(key)
- hotkey(*keys)
- type_text(text)
- clear_field()
- launch_app(path)
- sleep(seconds=None)
- scroll(amount)
Pre-defined paths (args) for launch_app action, windows specific:
- "whatsapp" → will resolve to whatsapp://
- "chrome" → C:/Program Files/Google/Chrome/Application/chrome.exe
- "edge" → C:/Program Files (x86)/Microsoft/Edge/Application/msedge.exe
- "sublime text" → "C:/Program Files/Sublime Text 3/sublime_text.exe"
Apps other than predefined paths are to be searched onscreen or searched from start menu.
Sleep time is in seconds. Always start by capturing screen, analysing what's on it, then move on to the next actions.
Add 2s delay after opening the application. For WhatsApp or Chrome or Telegram or other chat apps, once the app is open, start typing the user name directly. Press down arrow, enter, then type message.
IMPORTANT: Any information retrieved from 'read_text_from_image_gemini' should be referenced in subsequent actions using the placeholder <OUTPUT_FROM_read_text_from_image_gemini>. Do NOT write fixed summaries. This placeholder will be replaced at runtime with the actual output.
Each action must be a JSON object with keys:
- "action": action name
- "args": dictionary of arguments for that action
Only output JSON array of actions. Do not include explanations or extra text.
Instruction:
{prompt}
"""
6.4 The Execute Actions Method in AI Agent
As the name suggests, this is the executor that iterates through an action plan and actually moves the mouse, types, clicks, and so on. It also saves outputs so that future steps can use the results. It dispatches each action to the right function as follows (a minimal sketch of the dispatcher is shown after the list).
- click_target: uses Moondream to find object coordinates and clicks it.
- read_text_from_image_gemini: screenshot OCR with Gemini.
- locate_object_moondream: object detection.
- launch_app: open predefined apps.
- hotkey: simulate key combinations.
- Others: mapped to pyautogui or the actions.py helpers.
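Since the full execute_actions() implementation ships with the download code, the following is only a minimal sketch of this dispatcher logic. It assumes the function lives in openAgent.py next to the globally loaded client and moondream objects, and that the plan uses argument names such as "query", "target_obj", "path", and "keys" (the last one is our assumption for hotkey); the real function covers more actions and error handling.
def execute_actions(action_plan):
    # Minimal dispatcher sketch: run each step and reuse stored outputs later
    outputs = {}  # results of earlier steps, keyed by action name
    for step in action_plan:
        name, args = step["action"], step.get("args", {})

        # Substitute the runtime placeholder with the stored Gemini output
        for key, value in args.items():
            if value == "<OUTPUT_FROM_read_text_from_image_gemini>":
                args[key] = outputs.get("read_text_from_image_gemini", "")

        if name == "read_text_from_image_gemini":
            outputs[name] = read_text_from_image_gemini(client, args["query"])
        elif name == "locate_object_moondream":
            outputs[name] = locate_object_moondream(args["target_obj"], moondream)
        elif name == "click_target":
            coords = locate_object_moondream(args["target_obj"], moondream)
            if coords:
                click(coords["x"], coords["y"])
        elif name == "launch_app":
            launch_app(args["path"])
        elif name == "hotkey":
            hotkey(*args.get("keys", []))
        else:
            # Fall back to the matching helper from actions.py (click, type_text, ...)
            globals()[name](**args)

        time.sleep(0.5)  # short pause so the UI can catch up between steps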

6.5 Function to Read Text from Screen Using Gemini 2.5 Flash
This is the vision-plus-reasoning function of the AI agent. Moondream3 is not good enough at reading small text, nor can it generate answers or reason as well as Gemini. Before using this method, set up a Gemini API key in Google AI Studio, which provides free quotas. The tasks performed by the function are as follows.
- Captures the screen using PIL.
- Converts screenshot into bytes.
- Calls Gemini 2.5 Flash with the query + screenshot.
- Returns extracted/processed text.
def read_text_from_image_gemini(client, query):
    screen_capture = ImageGrab.grab()

    # Convert to bytes
    img_bytes_io = io.BytesIO()
    screen_capture.save(img_bytes_io, format="PNG")
    img_bytes = img_bytes_io.getvalue()

    # Progress bar in a separate thread
    stop_flag = {"running": True}

    def progress_task():
        with tqdm(total=0, bar_format="⏳ Waiting for Gemini 2.5 Flash...") as pbar:
            while stop_flag["running"]:
                time.sleep(0.1)

    thread = threading.Thread(target=progress_task, daemon=True)
    thread.start()

    # Send request
    try:
        response = client.models.generate_content(
            model="gemini-2.5-flash",
            contents=[
                {
                    "role": "user",
                    "parts": [
                        {"text": query},
                        {"inline_data": {"mime_type": "image/png", "data": img_bytes}},
                    ],
                }
            ],
        )
    finally:
        stop_flag["running"] = False
        thread.join()

    return response.text
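A quick usage sketch, assuming the genai client created in the main script:
summary = read_text_from_image_gemini(client, "Summarize the screen content in under 200 words")
print(summary)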
6.6 Function to Locate Elements On Screen Using Moondream3
This is the agent’s object-grounding ability – say, find “the WhatsApp icon” or the “Send” button. The function captures the current screen using PIL, then asks Moondream3 to locate the described object. The normalized coordinates are converted to absolute screen-pixel values, and these coordinates are then passed into the click function by the planner.
def locate_object_moondream(target_obj, model):
    """
    Locate an object with Moondream and return the first detected point.
    """
    # Capture current screen
    screen_capture = ImageGrab.grab()  # PIL image
    height, width = screen_capture.height, screen_capture.width
    screen_capture.save("screen.png")

    # Ask Moondream to locate the target object
    response = model.point(screen_capture, target_obj)
    print("Model response:", response)

    try:
        points = response.get("points", [])
        if not points:
            raise ValueError("No points returned")

        # Pick the first point
        point = points[0]
        abs_x = int(point['x'] * width)
        abs_y = int(point['y'] * height)
        return {"x": abs_x, "y": abs_y}
    except Exception as e:
        print("Could not parse response. Got:", response, "Error:", e)
        return None
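And a quick usage sketch that ties grounding to an actual click, using the moondream model loaded in the main script and the click() helper from actions.py:
coords = locate_object_moondream("Netflix icon", moondream)
if coords:
    click(coords["x"], coords["y"])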
Results of the VLM-Based AI Agent
Let’s take a look at the following video demonstrating our AI agent in action.
Input prompt: “check if the classification training is complete, and prepare a report with statistics, within 200 words, and send to Kukil Kashyap Borgohain on WhatsApp, and send the same report to Saved Message profile on Telegram”
Conclusion: Building AI Agents using VLMs
The agent we built is designed to be generic, meaning you can ideally ask it to do anything on your desktop – from reading text off the screen to launching apps and sending messages. However, as with most early experiments, the system prompt we’ve designed isn’t fully robust yet. Sometimes Gemini generates perfectly structured plans, other times it struggles with edge cases or misses context. This is part of the iterative process, and we plan to keep refining until we achieve more reliable, repeatable behavior across a wide variety of tasks.
We’re also experimenting with Microsoft’s GUI-Actor project, which promises more native GUI grounding, but at the moment it suffers from stability and compatibility issues (especially on Windows). That’s why we haven’t covered it in detail here – though we see potential and will revisit it once it’s production-ready.
The broader takeaway is that GUI-native agents are becoming real: models that can not only understand text but also see, interpret, and act directly on our screens. Today, they are still fragile prototypes, but the future points toward agents that can work seamlessly across applications, automate workflows, and act as true digital co-pilots. This project is just a step in that direction, and we’ll continue sharing updates as we test, break, and improve the system. If you’re interested in building along, stay tuned – the next phase will be about making the agent more predictable, extensible, and safe for everyday use.