
Nanonets-OCR-s: Enabling Rich, Structured Markdown for Document Understanding

Traditional Optical Character Recognition (OCR) systems are primarily designed to extract plain text from scanned documents or images. While useful, such systems often ignore semantic structure, layout, and visual cues like images, watermarks, and tables, limiting their utility in modern AI pipelines.

Enter Nanonets-OCR-s – a groundbreaking Vision-Language Model (VLM) that sets a new standard for intelligent document understanding. Whether you’re digitizing academic papers, parsing contracts, or building searchable enterprise knowledge bases, Nanonets-OCR-s delivers an unmatched combination of accuracy, structure, and intelligence.

  1. The Evolution Beyond Traditional OCR
  2. What is Nanonets-OCR-s?
  3. Core Innovations & Technologies Used in Nanonets-OCR-s
  4. Training and Dataset Composition
  5. Semantic Features and Output Capabilities of Nanonets-OCR-s
  6. Markdown Output Comparison: Nanonets-OCR-s vs Donut vs Dolphin
    1. Verdict
  7. Technical Strengths and Limitations of Nanonets-OCR-s
  8. Benchmark Comparison
  9. Conclusion
  10. References

The Evolution Beyond Traditional OCR

Conventional OCR engines like Tesseract (see LearnOpenCV’s blog post on Deep Learning Based OCR Text Recognition Using Tesseract and OpenCV) or cloud OCR APIs are adept at pulling plain text from scanned pages. However, they ignore vital structural and contextual elements – images, plots, LaTeX equations, watermarks, checkboxes, and tables – which are essential for meaningful interpretation. This limitation severely hampers their utility in LLM-centric pipelines where semantic richness and structural fidelity matter.

Fig 2. From OCR to Intelligent Data Processing: The shift towards template-free automation [Source]

Nanonets-OCR-s changes the narrative by integrating multimodal understanding into the core of its architecture. It doesn’t just extract text – it interprets, organizes, and formats information into clean, contextualized Markdown enriched with semantic tags.

What is Nanonets-OCR-s?

Nanonets-OCR-s is an open-weight Vision-Language Model (VLM) released on June 12, 2025 by Nano Net Technologies. It is a 3.75B-parameter multimodal transformer based on Qwen2.5-VL-3B, fine-tuned for image-to-markdown OCR, and is so far the only model in the Nanonets OCR line with publicly released weights.

Fig 3. Nanonets OCR: AI-first approach to document intelligence

Unlike traditional OCR systems that extract plain text, Nanonets-OCR-s understands document layout, content type, and context – embedding elements like headings, math formulas, tables, and even image descriptions into Markdown or HTML-like syntax. It goes beyond simple OCR, generating structured Markdown output with semantic tagging, ready for LLM pipelines.

Core Innovations & Technologies Used in Nanonets-OCR-s

The model applies modern vision-language modeling techniques on top of a multimodal transformer backbone (Qwen2.5-VL-3B-Instruct, a 3B-parameter VLM). It integrates:

  • Visual encoding: CNN/ViT-based encoders to perceive layout, fonts, symbols, and graphics
  • Language modeling: Transformer-based decoder trained to output Markdown, HTML, and structured tags
  • Fine-tuning strategy: Aligned to produce structured text output with a mixture of real-world and synthetic documents
| Component | Detail |
|---|---|
| Base Model | Qwen2.5-VL-3B, a 3-billion-parameter VLM |
| Model Type | Transformer-based multimodal encoder-decoder |
| OCR Strategy | OCR-free (similar to Donut), trained end-to-end from image to structured Markdown |
| Semantic Tagging | Outputs structured elements like <signature>, <watermark>, <img>, etc. |
| Target Format | Markdown with embedded semantic structure, LaTeX, and HTML-like tags |
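
Putting these pieces together, running the model locally follows the standard Hugging Face image-text-to-text workflow. The snippet below is a minimal sketch: it assumes the open weights are published as nanonets/Nanonets-OCR-s, that a recent transformers release with Qwen2.5-VL support is installed, and it uses an illustrative instruction prompt (the officially recommended prompt is documented in the model card).

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

# Assumption: the open weights live at "nanonets/Nanonets-OCR-s" and a recent
# transformers release with Qwen2.5-VL support is installed.
model_id = "nanonets/Nanonets-OCR-s"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("invoice_page.png")  # any scanned page or document photo

# Illustrative instruction; the exact recommended prompt is in the model card.
prompt = (
    "Extract the text from the above document as if you were reading it naturally. "
    "Return tables in HTML and equations in LaTeX. Wrap watermarks in <watermark> tags "
    "and describe any images inside <img></img> tags."
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }
]

# Build the chat-formatted prompt, then tokenize text and image together.
chat = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[chat], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=False)

# Decode only the newly generated tokens (the structured Markdown).
markdown = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(markdown)
```

On a single modern GPU this runs comfortably in bfloat16; for CPU-only experimentation, drop torch_dtype and device_map and expect much slower generation.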

Training and Dataset Composition

To build a model capable of such nuanced understanding, the team at Nanonets curated a diverse and extensive dataset of over 250,000 document pages. This dataset includes:

  • Academic papers with complex mathematical notation
  • Legal documents with signatures and structured clauses
  • Financial documents such as invoices and tax forms
  • Healthcare forms with checkboxes and annotations

The training strategy employed a two-step pipeline:

  • Pretraining on synthetic datasets to establish visual-linguistic grounding
  • Fine-tuning on manually annotated, real-world documents to refine accuracy and domain robustness

The final result is a VLM capable of tackling real-world document variability while producing LLM-ready, semantically structured text.

Semantic Features and Output Capabilities of Nanonets-OCR-s

Nanonets-OCR-s excels by not merely converting text from images but by encoding deep structural awareness into its output. Its Markdown generation is augmented by semantic tagging that aligns with both human readability and machine ingestion.

Some of its key capabilities include:

  • LaTeX Equation Recognition: Automatically converts mathematical equations into LaTeX syntax, distinguishing between inline and block equations for Markdown compatibility.
    • Detects math regions and converts them to:
      • Inline syntax → $...$
      • Display syntax → $$...$$
    • Trained to identify not just symbols but layout context (centered math vs. inline usage)
    • Ideal for academic, scientific, and engineering documents (see the example snippet below)

We’ve also published a blog post on Fine-Tuning Gemma-3 for LaTeX Equation Generation, where the results turned out to be remarkably accurate and robust. Explore the details and results by visiting the above link.
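
To make the inline versus display distinction concrete, a recognized equation would appear in the generated Markdown roughly as follows (an illustrative snippet, not output from an actual run):

```
The update rule is $w_{t+1} = w_t - \eta \nabla L(w_t)$ applied at every step.

$$
L(w) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - w^\top x_i \right)^2
$$
```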

  • Intelligent Image Description: Describes charts, plots, and diagrams using <img> tags with context-aware captions that make visual elements digestible for LLMs.
    • Describes semantic content, visual style, and context
    • Enables downstream multimodal LLMs to reason about visual elements
    • Parses charts, plots, logos, diagrams, and wraps them in:
<img>
  Description of the visual content
</img>
  • Signature Detection & Isolation: Accurately identifies and tags signatures with <signature>, essential for legal and formal documents.
    • Identifies signatures from handwritten or cursive blocks
    • Useful in legal, financial, or HR documents
    • Tags them like:
<signature>John Doe</signature>
  • Watermark Extraction: Detects and isolates document watermarks, embedding them within <watermark> tags to retain provenance and classification.
    • Detects semi-transparent or background text overlays
    • Helps preserve document provenance or classification labels
    • Extracts and wraps them in:
<watermark>Confidential</watermark>
  • Smart Checkbox Handling: Converts checkboxes and radio buttons into standardized Unicode symbols such as ☑, ☒, and ☐, ensuring logical consistency.
    • Detects checkboxes and radio buttons
    • Converts them to standardized Unicode symbols:
      • ☑ checked
      • ☒ crossed
      • ☐ empty
    • This enables structured form processing and reliable boolean tagging (see the parsing sketch after this list)
  • Complex Table Extraction: Translates multi-row, multi-column tables into Markdown and HTML with preserved alignment, headers, and merged cells.
    • Captures multi-row, multi-column, nested tables
    • Converts them into Markdown tables or HTML table syntax
    • Maintains alignment, merged cells, header context
    • Crucial for financial, legal, and research documents

Markdown Output Comparison: Nanonets-OCR-s vs Donut vs Dolphin

To provide a clear and fair evaluation of how different models interpret and convert document images into Markdown, we compared the outputs of three notable models – Nanonets-OCR-s, Donut, and Dolphin – on the same product detail image. Implementation code for each model, taken from the official sources, is linked in the References section.

Brief Overview of the Compared Models

  • Nanonets-OCR-s is a 3.75B vision-language model designed specifically for semantic OCR and Markdown generation. It is trained to extract structured content directly from document images without relying on external OCR engines.
  • Donut (Document Understanding Transformer), developed by NAVER AI Lab, is an OCR-free model capable of processing document images and producing JSON or Markdown-style outputs. It is often used for form extraction and document layout understanding.
  • Dolphin is a lightweight OCR model designed for visual language tasks. It typically performs well on structured layouts and image captions but may not match the deep semantic structuring of larger VLMs.
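
For reference, Donut follows the standard Hugging Face VisionEncoderDecoder recipe: the image is encoded once and the decoder is steered by a task-specific start token rather than a natural-language prompt. The sketch below assumes the publicly available CORD receipt fine-tune (naver-clova-ix/donut-base-finetuned-cord-v2), whose menu/price schema matches the JSON style seen in the Donut output further down; see the References section for the official implementation links.

```python
import re
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# Assumption: the CORD-finetuned checkpoint; swap in another Donut fine-tune as needed.
ckpt = "naver-clova-ix/donut-base-finetuned-cord-v2"
processor = DonutProcessor.from_pretrained(ckpt)
model = VisionEncoderDecoderModel.from_pretrained(ckpt)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

image = Image.open("product_page.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)

# Donut is prompted with a task token, not natural language.
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids.to(device)

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    use_cache=True,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
    return_dict_in_generate=True,
)

sequence = processor.batch_decode(outputs.sequences)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(
    processor.tokenizer.pad_token, ""
)
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # drop the task start token
print(processor.token2json(sequence))  # structured JSON similar to the sample below
```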

Input Image Description

Fig 4. Sample product page used for OCR model comparison

The test image is a commercial product detail page for an RPLidar device, shown in Fig 4 above. It includes the product name, SKU, price (with and without GST), stock status, social media icons, and a chatbot-style footer asking “Have a question? Tell me more.”

Markdown Generation Comparison

Nanonets-OCR-s Generated Markdown Output

Page 1 of 1
RPLidar A1M8 360 Degree Laser Scanner Kit - 12M Range

SKU: RB-RPL-0001 Brand Reward Points 0
Excl. GST: ₹7,099.00 Incl. GST: ₹8,376.82
In Stock
Categories: Product Brand SLAMTEC Topic: Topic

<img>Product image</img>
<watermark>RPLIDAR</watermark>

Quantity:
-
1
+

Add to Cart Heart

<img>Facebook icon</img> <img>Twitter icon</img> <img>LinkedIn icon</img> <img>Google+ icon</img> <img>Email icon</img>

Hi! How can we help?
I have a question
Tell me more

Highlights:

  • Semantic segmentation of elements
  • Inline image tags with descriptions
  • Markdown headings and labels structured as readable content
  • Recognition of watermark, chatbot dialogue, and social icons

Donut Generated Markdown Output

{
  "menu": [
    {
      "nm": "RPLidar AIM8 360 Degree Laser Scanner Kit-12M",
      "unitprice": "Range",
      "cnt": "0",
      "price": "0"
    },
    {
      "nm": "SOLI。88-MV-COOL Band version Punto O",
      "unitprice": "709,900",
      "cnt": "0",
      "price": "0"
    },
    {
      "cnt": "0",
      "price": "0"
    }
  ],
  "total": {
    "total_price": "68,078.02"
  }
}

Highlights:

  • Focused on tabular interpretation
  • Unable to semantically segment non-tabular text
  • Includes hallucinated or misread values (e.g., unrelated product entry)

Dolphin Generated Markdown Output

RPLidar A1M8 360 Degree Laser Scanner Kit – 12M Range

![Figure 1](data:image/png;base64,...)

SKU: RB-RPL-0001 Brand Reward Points 0

-1+

![Figure 4](data:image/png;base64,...)

↑. International Olympic Committee. [2022-09-22].

f > in G =

have a question
Tell me more

Highlights:

  • Basic OCR extraction
  • Embedded raw image references
  • Partially captures relevant information, but lacks structure or Markdown styling
  • Contains spurious text from image background noise

Summary

| Feature | Nanonets-OCR-s | Donut | Dolphin |
|---|---|---|---|
| Markdown Formatting | ✅ Structured and clean | ❌ JSON-formatted | ⚠️ Incomplete, noisy |
| Image Context Tagging | ✅ <img> tags | ❌ | ⚠️ Raw base64 blobs |
| Watermarks, Chat UI | ✅ Detected & tagged | ❌ | ⚠️ Partially mixed in text |
| Semantic Structure | ✅ Headings, labels, layout | ⚠️ Table-centric | ❌ Flat OCR dump |
| Hallucination Risk | Low | Medium | High |

Verdict

Nanonets-OCR-s delivers the most readable, LLM-friendly, and semantically tagged Markdown output. Donut provides structured JSON suited to forms but misses document-wide cohesion. Dolphin captures raw text but lacks context or usable structure.

This comparison illustrates the value of semantic-aware VLMs in transforming complex visual content into meaningful structured data, particularly for downstream AI processing.

Technical Strengths and Limitations of Nanonets-OCR-s

Strengths

  • Outputs clean, LLM-compatible Markdown with embedded semantic tags
  • No reliance on third-party OCR engines
  • Recognizes and structures multiple content modalities including equations, tables, images, and checkboxes
  • Runs on open-weight models like Qwen2.5-VL-3B, enabling local deployment and fine-tuning

Limitations

  • Currently not trained on handwritten content, limiting its applicability in informal or cursive-text scenarios
  • May occasionally hallucinate content, especially when visual context is ambiguous or degraded
  • Requires modern GPUs (e.g., A100, RTX 4090) for real-time or batch-scale processing

Benchmark Comparison

When benchmarked against other leading solutions such as Donut and Dolphin, Nanonets-OCR-s demonstrates superior structured output quality:

| Feature | Nanonets-OCR-s | Donut | GPT-4V / Gemini | Dolphin |
|---|---|---|---|---|
| Output Format | Markdown + Tags | JSON/Markdown | Freeform Text | Raw Text/Image |
| Visual Element Support | ✅ | Limited | ✅ (Prompted) | ⚠️ Basic OCR |
| Semantic Structuring | ✅ | Medium | Variable | ❌ |
| LLM Compatibility | High | Medium | Prompt-Dependent | Low |
| Open-source Deployment | ✅ | ✅ | ❌ (API only) | ✅ |
| Image Tagging | ✅ <img> tags | ❌ | ✅ (Prompted) | ⚠️ Base64 blob |
| Chat / UI Detection | ✅ | ❌ | ✅ (Prompted) | ⚠️ Partial |
| Hallucination Risk | Low | Medium | Medium | High |

Nanonets-OCR-s leads in producing consistent, semantically tagged Markdown without complex prompt engineering, making it ideal for automated pipelines.

Conclusion

Nanonets-OCR-s represents a substantial leap forward in the evolution of document intelligence. By transcending traditional OCR limitations and embracing vision-language modeling, it offers a truly modern solution for extracting meaning from complex visual content. Whether you’re automating financial paperwork, digitizing academic research, or building AI-powered knowledge bases, this model brings unmatched precision, structure, and adaptability.

As we step into a world where LLMs and multimodal AI systems require structured, semantically rich data, Nanonets-OCR-s is not just a tool—it’s a foundational component in building intelligent, scalable, and explainable AI systems.

References


