import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub
# Load the dataset
# Note: other available arguments include 'max_samples', etc.
dataset = load_from_hub("Voxel51/Egocentric_10K_Evaluation")
# Launch the App
session = fo.launch_app(dataset)
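
For quick experiments, the 'max_samples' argument noted in the snippet above can be used to load only a subset of the data; a minimal sketch (the sample count is arbitrary):

# Load only a small preview of the evaluation set
preview = load_from_hub(
    "Voxel51/Egocentric_10K_Evaluation",
    max_samples=100,  # arbitrary preview size, for illustration only
)
print(preview)  # summary of the loaded samples and their fields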
Dataset Details
Dataset Description
Egocentric-10K-Evaluation is a benchmark evaluation set and analysis protocol for large-scale egocentric (first-person) video datasets. It focuses on measuring hand visibility and active object manipulation in real-world, in-the-wild scenarios, which is especially relevant to robotics, computer vision, and the training of AI agents on manipulation tasks.[1][2][3]
This dataset is intended for benchmarking egocentric video data with respect to hand presence and active object manipulation, enabling standardized analysis, dataset comparison, and the development/evaluation of perception and robotics models centered on real-world human skill tasks.
Dataset Structure
Egocentric-10K-Evaluation consists of 10,000 frames sampled from factory egocentric video, plus comparable samples from other major datasets (Ego4D, EPIC-KITCHENS). Each sample includes JSON metadata, a hand-count annotation (0, 1, or 2 visible hands), and a binary label for the presence or absence of active manipulation. The splits are standardized, and additional metadata includes dataset, worker, and video index references.
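
As a sketch of how these per-sample labels could be queried after loading into FiftyOne (the field names hand_count and active_manipulation are assumptions; check the dataset's actual schema):

from fiftyone import ViewField as F

# Inspect the actual field names shipped with the dataset
print(dataset.get_field_schema())

# Hypothetical query: frames with both hands visible and active manipulation
# ("hand_count" and "active_manipulation" are assumed field names, and the
# manipulation label is assumed to be stored as a boolean)
two_hand_manip = dataset.match(
    (F("hand_count") == 2) & (F("active_manipulation") == True)
)
print(len(two_hand_manip))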
Dataset Creation
Curation Rationale
To create a standardized benchmark for hand visibility and manipulation, facilitating research on manipulation-heavy tasks in robotics and AI using real industrial and skill-focused footage.
Source Data
Data Collection and Processing
The evaluation set comprises frames drawn from the primary Egocentric-10K dataset (real-world factory footage collected via head-mounted cameras), as well as standardized samples from the open egocentric datasets Ego4D and EPIC-KITCHENS for comparison. Data is provided as 1080p, 30 FPS, H.265-encoded MP4 files with structured JSON metadata and hand/manipulation annotations.
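
As a rough sketch of working with clips in this format, one frame per second could be extracted with OpenCV (the file path is a placeholder, and the local OpenCV/FFmpeg build must support H.265/HEVC decoding):

import cv2

cap = cv2.VideoCapture("clip.mp4")  # placeholder path to one of the MP4s
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # clips are nominally 30 FPS

frame_idx = 0
saved = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % int(round(fps)) == 0:  # keep roughly one frame per second
        cv2.imwrite(f"frame_{saved:06d}.jpg", frame)
        saved += 1
    frame_idx += 1

cap.release()
print(f"Saved {saved} frames")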
Who are the source data producers?
Egocentric-10K's original video data was produced by real factory workers wearing head-mounted cameras, performing natural work-line activities. Annotation was performed following strict guidelines as described in the evaluation schema.
Annotations
Annotation process
Each sampled frame is annotated for the number of visible hands (0, 1, or 2, with detailed rules provided) and for whether the hands are engaged in active manipulation ("yes"/"no" per an explicit definition). The annotation schema and rules are detailed in the benchmark documentation.
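
A minimal sketch of summarizing these labels with FiftyOne aggregations (again assuming the field names hand_count and active_manipulation; confirm them against the actual schema):

# Distribution of the hand-count and manipulation labels
# ("hand_count" and "active_manipulation" are assumed field names)
hand_counts = dataset.count_values("hand_count")
manipulation = dataset.count_values("active_manipulation")
print("Visible-hand distribution:", hand_counts)
print("Active-manipulation labels:", manipulation)

# Fraction of frames with at least one visible hand
total = dataset.count()
with_hands = total - hand_counts.get(0, 0)
print(f"Hand visibility rate: {with_hands / total:.1%}")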