GroundCUA is a large and diverse dataset of real UI screenshots paired with structured annotations for building multimodal computer use agents. It covers 87 software platforms across productivity tools, browsers, creative tools, communication apps, development environments, and system utilities. GroundCUA is designed for research on GUI grounding, UI perception, and vision-language-action models that interact with computers.
Highlights
87 platforms spanning Windows, macOS, Linux, and cross-platform apps
Annotated UI elements with bounding boxes, text, and coarse semantic categories
SHA-256 file pairing between screenshots and JSON annotations
Supports research on GUI grounding, multimodal agents, and UI understanding
MIT license for broad academic and open source use
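The screenshot-to-annotation pairing noted above can be sketched as follows. This is a minimal sketch that assumes each screenshot and its JSON annotation share the same file stem (for example, a SHA-256 hex string) and differ only in extension. The directory layout, the image extensions, and the helper name `pair_by_stem` are illustrative assumptions, not part of the dataset's documented tooling.

```python
from pathlib import Path


def pair_by_stem(root, image_exts=(".png", ".jpg", ".jpeg")):
    """Return {stem: (image_path, json_path)} for files sharing a stem.

    Assumption: a screenshot and its annotation have identical stems
    (e.g. the same hash string) and differ only in file extension.
    """
    root = Path(root)
    images = {p.stem: p for p in root.rglob("*") if p.suffix.lower() in image_exts}
    annotations = {p.stem: p for p in root.rglob("*.json")}
    # Keep only stems that have both a screenshot and an annotation file.
    shared = sorted(images.keys() & annotations.keys())
    return {stem: (images[stem], annotations[stem]) for stem in shared}
```

Files without a counterpart are silently skipped here; a stricter loader might report them instead.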
text: Visible text or a short description of the element.
category: Coarse UI type label. Present only for some elements.
id: Unique identifier for the annotation entry.
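As a rough illustration of reading these fields, the sketch below parses annotation entries into small records. It assumes an annotation file is a JSON list of objects keyed by `text`, `category`, and `id`; the top-level layout and the `UIElement` class are assumptions for illustration. The bounding box field is left out because its key name is not given here.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class UIElement:
    id: str
    text: str
    category: Optional[str] = None  # absent for some elements


def parse_elements(raw_json: str) -> list[UIElement]:
    """Parse a JSON list of annotation entries (assumed layout)."""
    entries = json.loads(raw_json)
    return [
        UIElement(
            id=str(entry["id"]),
            text=entry.get("text", ""),
            category=entry.get("category"),  # None when the label is missing
        )
        for entry in entries
    ]
```

Treating `category` as optional keeps the loader from failing on elements that carry no label.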
UI Element Categories
Categories are approximate and not guaranteed for all elements. Examples include:
Button
Menu
Input Elements
Navigation
Sidebar
Visual Elements
Information Display
Others
These labels provide light structure for UI grounding tasks but do not form a full ontology.
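Because labels are optional, any per-category statistic needs an explicit bucket for unlabeled elements. The sketch below tallies categories over a list of element dictionaries; the `category` key follows the field name above, while the dictionary layout and the `(unlabeled)` bucket name are assumptions.

```python
from collections import Counter


def category_counts(elements, missing_label="(unlabeled)"):
    """Count elements per category, grouping missing or empty labels together."""
    return Counter(e.get("category") or missing_label for e in elements)
```

For example, `category_counts(elements).most_common()` lists categories from most to least frequent, with unlabeled elements reported as their own group.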
Example Use Cases
GroundCUA can be used for:
Training computer use agents to perceive and understand UI layouts
Building GUI grounding modules for VLA agents
Pretraining screen parsing and UI element detectors
Benchmarking OCR, layout analysis, and cross-platform UI parsing
Developing models that map UI regions to natural language or actions
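For the grounding use cases above, one common way to score a model is to check whether its predicted click point falls inside the target element's bounding box. The sketch below assumes boxes are `(x1, y1, x2, y2)` pixel coordinates with the origin at the top-left; that format and the function names are assumptions, and this is not necessarily the evaluation protocol used by the GroundCUA authors.

```python
def bbox_center(bbox):
    """Center point of an (x1, y1, x2, y2) box, usable as a click target."""
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def point_in_bbox(point, bbox):
    """True if the point lies inside the box, edges included."""
    x, y = point
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2


def grounding_accuracy(predicted_points, target_bboxes):
    """Fraction of predicted points that land inside their target box."""
    if not target_bboxes:
        return 0.0
    hits = sum(point_in_bbox(p, b) for p, b in zip(predicted_points, target_bboxes))
    return hits / len(target_bboxes)
```

`bbox_center` also gives a simple way to turn an annotated element into a click action when building training targets.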
Citation
If you use GroundCUA in your research, please cite our work:
@misc{feizi2025groundingcomputeruseagents,
title={Grounding Computer Use Agents on Human Demonstrations},
author={Aarash Feizi and Shravan Nayak and Xiangru Jian and Kevin Qinghong Lin and Kaixin Li and Rabiul Awal and Xing Han Lù and Johan Obando-Ceron and Juan A. Rodriguez and Nicolas Chapados and David Vazquez and Adriana Romero-Soriano and Reihaneh Rabbany and Perouz Taslakian and Christopher Pal and Spandana Gella and Sai Rajeswar},
year={2025},
eprint={2511.07332},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2511.07332},
}
License
GroundCUA is released under the MIT License. Users are responsible for ensuring compliance with all applicable laws and policies.