🧠 OmniParser

by Microsoft · Model ID: hf-model--microsoft--omniparser
FNI 5.6 · Top 79%

"๐Ÿ“ข [Project Page] [Blog Post] [Demo] OmniParser is a general screen parsing tool, which interprets/converts UI screenshot to structured format, to improve existing LLM based UI agent. Training Datasets include: 1) an interactable icon detection dataset, which was curated from popular web pages and a..."

๐Ÿ”— View Source
FNI Score: 5.6 (Audited)
Params: - (Tiny)
Context: -
Downloads: 488

⚡ Quick Commands

🤗 HF Download
huggingface-cli download microsoft/omniparser
📦 Install Lib
pip install -U transformers
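For scripted environments, the same download can be done from Python via huggingface_hub. This is a minimal sketch using the repo ID shown above; note that the canonical Hugging Face repo may use different capitalization (e.g. microsoft/OmniParser), so verify the exact ID first.

# Minimal sketch: programmatic download of the model files.
# Assumes the repo ID listed on this page; adjust if the canonical repo differs.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="microsoft/omniparser",  # repo ID as shown above
    revision="main",                 # optionally pin a specific revision
)
print("Model files downloaded to:", local_dir)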
📊 Engineering Specs

⚡ Hardware

Parameters
-
Architecture
Blip2ForConditionalGeneration
Context Length
-
Model Size
15.1GB

🧠 Lifecycle

Library
-
Precision
float16
Tokenizer
-

🌐 Identity

Source
HuggingFace
License
Open Access
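
Putting the specs above together (Blip2ForConditionalGeneration architecture, float16 precision, roughly 15.1GB of weights), the captioning component can in principle be loaded with the standard transformers BLIP-2 classes. This is a hedged sketch: the subfolder name icon_caption_blip2 is taken from the README's naming, and the actual repo layout should be verified before use.

# Minimal sketch: loading the BLIP-2 captioning weights in float16, matching the
# architecture and precision listed above. The subfolder name is an assumption
# based on the README; check the actual repo layout on Hugging Face.
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

repo_id = "microsoft/omniparser"
processor = Blip2Processor.from_pretrained(repo_id, subfolder="icon_caption_blip2")
model = Blip2ForConditionalGeneration.from_pretrained(
    repo_id,
    subfolder="icon_caption_blip2",
    torch_dtype=torch.float16,
).to("cuda")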

🕸️ Neural Mesh Hub

Interconnecting Research, Data & Ecosystem

📈 Interest Trend

* Real-time activity index across HuggingFace, GitHub and Research citations.

No similar models found.


🖥️ Hardware Compatibility

Multi-Tier Validation Matrix

🎮 Compatible: RTX 3060 / 4060 Ti (Entry, 8GB VRAM)
🎮 Compatible: RTX 4070 Super (Mid, 12GB VRAM)
💻 Compatible: RTX 4080 / Mac M3 (High, 16GB VRAM)
🚀 Compatible: RTX 3090 / 4090 (Pro, 24GB VRAM)
🏗️ Compatible: RTX 6000 Ada (Workstation, 48GB VRAM)
🏭 Compatible: A100 / H100 (Datacenter, 80GB VRAM)
ℹ️ Pro Tip: Compatibility is estimated for 4-bit quantization (Q4). High-precision (FP16) or ultra-long context windows will significantly increase VRAM requirements.
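
For reference, a 4-bit load along the lines assumed by the matrix above can be expressed with transformers' BitsAndBytesConfig. This is a sketch under assumptions, not an officially documented configuration for OmniParser; the subfolder name again follows the README's naming.

# Minimal sketch: 4-bit (Q4-style) loading with bitsandbytes, roughly matching the
# assumption behind the compatibility matrix. Requires a CUDA GPU and the
# bitsandbytes package; the subfolder name is an assumption from the README.
import torch
from transformers import Blip2ForConditionalGeneration, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = Blip2ForConditionalGeneration.from_pretrained(
    "microsoft/omniparser",
    subfolder="icon_caption_blip2",
    quantization_config=bnb_config,
    device_map="auto",
)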

README

📢 [Project Page] [Blog Post] [Demo]

Model Summary

OmniParser is a general screen-parsing tool that interprets/converts UI screenshots into a structured format to improve existing LLM-based UI agents. Training datasets include: 1) an interactable icon detection dataset, curated from popular web pages and automatically annotated to highlight clickable and actionable regions, and 2) an icon description dataset designed to associate each UI element with its corresponding function.

This model hub includes a YOLOv8 model finetuned on the icon detection dataset and a BLIP-2 model finetuned on the icon description dataset. For more details on the models and the finetuning procedure, please refer to the paper.
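
To make the two-stage design concrete, the sketch below chains a detector with the BLIP-2 captioner: detect candidate interactable regions, crop them, and caption each crop. The weight paths, subfolder name, and generation settings are illustrative assumptions rather than the official OmniParser inference code; the project repository ships its own scripts.

# Illustrative sketch of the two-stage flow described above (not the official pipeline):
# stage 1 detects interactable regions, stage 2 captions each cropped region.
import torch
from PIL import Image
from ultralytics import YOLO
from transformers import Blip2Processor, Blip2ForConditionalGeneration

screenshot = Image.open("screenshot.png")            # hypothetical input screenshot

# Stage 1: interactable-region detection with the finetuned YOLOv8 weights
detector = YOLO("weights/icon_detect/model.pt")      # assumed local weight path
boxes = detector(screenshot)[0].boxes.xyxy.tolist()  # [x1, y1, x2, y2] per detected region

# Stage 2: caption each detected region with the finetuned BLIP-2 model
processor = Blip2Processor.from_pretrained("microsoft/omniparser", subfolder="icon_caption_blip2")
captioner = Blip2ForConditionalGeneration.from_pretrained(
    "microsoft/omniparser", subfolder="icon_caption_blip2", torch_dtype=torch.float16
).to("cuda")

elements = []
for x1, y1, x2, y2 in boxes:
    crop = screenshot.crop((x1, y1, x2, y2))
    inputs = processor(images=crop, return_tensors="pt").to("cuda", torch.float16)
    out = captioner.generate(**inputs, max_new_tokens=30)
    caption = processor.decode(out[0], skip_special_tokens=True)
    elements.append({"bbox": [x1, y1, x2, y2], "caption": caption})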

Responsible AI Considerations

Intended Use

  • OmniParser is designed to convert an unstructured screenshot image into a structured list of elements, including the locations of interactable regions and captions describing each icon's potential functionality (see the sketch after this list).
  • OmniParser is intended to be used in settings where users are already trained in responsible analytic approaches and critical reasoning is expected. OmniParser can extract information from a screenshot, but human judgement is needed to review its output.
  • OmniParser is intended to be used on a variety of screenshots, from both PC and phone, and across various applications.
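
A hypothetical example of the kind of structured element list described above. The field names and values are illustrative assumptions, not OmniParser's documented output schema.

# Hypothetical structured output for one screenshot; field names are illustrative
# assumptions, not OmniParser's documented schema.
parsed_elements = [
    {"bbox": [102, 48, 134, 80], "interactable": True,
     "caption": "settings gear icon, likely opens the application settings"},
    {"bbox": [12, 640, 348, 690], "interactable": True,
     "caption": "blue 'Sign in' button, submits the login form"},
]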

Limitations

  • OmniParser is designed to faithfully convert a screenshot image into structured elements describing interactable regions and the semantics of the screen. It does not detect harmful content in its input (just as users are free to choose the input of any LLM), so users are expected to provide input to OmniParser that is not harmful.
  • While OmniParser only converts a screenshot image into text, it can be used to construct an actionable, LLM-based GUI agent. When developing and operating such an agent with OmniParser, developers need to act responsibly and follow common safety standards.
  • OmniParser-BLIP2 may incorrectly infer the gender or other sensitive attributes (e.g., race, religion) of individuals in icon images. Inferences of sensitive attributes may rely on stereotypes and generalizations rather than on information about specific individuals, and are more likely to be incorrect for marginalized people. Incorrect inferences may result in significant physical or psychological injury, or may restrict, infringe upon, or undermine the ability to realize an individual's human rights. We do not recommend use of OmniParser in any workplace-like use-case scenario.

License

Please note that the icon_detect model is under the AGPL license, while icon_caption_blip2 and icon_caption_florence are under the MIT license. Please refer to the LICENSE file in the folder of each model.


📝 Limitations & Considerations

  • Benchmark scores may vary based on evaluation methodology and hardware configuration.
  • VRAM requirements are estimates; actual usage depends on quantization and batch size.
  • FNI scores are relative rankings and may change as new models are added.
  • ⚠ License: icon_detect is AGPL-licensed and the icon_caption models are MIT-licensed (see the README); verify licensing terms before commercial use.
  • Source: HuggingFace
📜 Cite this model

Academic & Research Attribution

BibTeX
@misc{hf_model__microsoft__omniparser,
  author = {microsoft},
  title = {OmniParser},
  year = {2026},
  howpublished = {\url{https://huggingface.co/microsoft/omniparser}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}
APA Style
microsoft. (2026). OmniParser [Model]. Free2AITools. https://huggingface.co/microsoft/omniparser
🔄 Daily sync (03:00 UTC)

AI Summary: Based on Hugging Face metadata. Not a recommendation.


🛡️ Model Transparency Report

Verified data manifest for traceability and transparency.

100% Data Disclosure Active

🆔 Identity & Source

id
hf-model--microsoft--omniparser
author
microsoft
tags
transformers, safetensors, blip-2, image-to-text, image-text-to-text, arxiv:2408.00203, license:mit, endpoints_compatible, region:us

⚙️ Technical Specs

architecture
Blip2ForConditionalGeneration
params billions
null
context length
null

📊 Engagement & Metrics

likes
1,697
downloads
488
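
The engagement figures above are taken from Hugging Face metadata (see the AI Summary note); they can be re-checked programmatically with huggingface_hub, as in this minimal sketch.

# Minimal sketch: re-checking the metadata above directly from the Hugging Face Hub.
from huggingface_hub import HfApi

info = HfApi().model_info("microsoft/omniparser")
print("likes:", info.likes)          # snapshot value on this page: 1,697
print("downloads:", info.downloads)  # rolling download counter, changes over time
print("tags:", info.tags)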

Free2AITools Constitutional Data Pipeline: Curated disclosure mode active. (V15.x Standard)