A multi-domain corpus of 139,756 real and AI-generated images across 5 domains (faces, scenes, documents, scene_text, id_cards) and 66 distinct generators, paired with structured 3-step reasoning traces annotated by Gemini-2.5-Pro.
Status
🚧 Annotation in progress. Currently ~85K of the 110K remaining images are annotated. New traces are pushed every ~30 minutes until complete.
Single-VLM annotator. All traces are from Gemini-2.5-Pro. Reviewers should treat 5-vote consensus as variance reduction on one annotator, not a true ensemble. A 3-VLM heterogeneous re-annotation on a 2K subset is in progress for a v2 release.
~10–15% drift between Step-1 and the label on real samples (Gemini's zero-shot impression sometimes calls real photos with noisy backgrounds "fake"). Step-3 traces are label-conditioned and recover, but the inconsistency is reported here transparently.
Region grounding via post-hoc detector pass. Step-1 anomalies do not yet contain bounding boxes; a GroundingDINO-1.5 retrofit using the existing artifact phrases is planned.
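The two caveats above (5-vote consensus from a single annotator, and Step-1 drift on real samples) can be checked on a local copy of the traces. The snippet below is a minimal sketch, not the release tooling: the record layout and the field names `label` and `step1_verdict` are assumptions made for illustration and may not match the published schema.

```python
from collections import Counter


def majority_vote(votes):
    """Return (winning label, fraction of votes agreeing with it).

    With a single annotator model, the agreement fraction measures
    sampling variance of that model, not cross-model agreement.
    """
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    return label, n / len(votes)


def step1_drift_rate(records):
    """Fraction of real samples whose Step-1 impression disagrees with the label.

    `label` and `step1_verdict` are hypothetical field names.
    """
    real = [r for r in records if r["label"] == "real"]
    if not real:
        return 0.0
    drifted = sum(1 for r in real if r["step1_verdict"] != r["label"])
    return drifted / len(real)


# Toy records in the assumed layout (not taken from the dataset).
records = [
    {"label": "real", "step1_verdict": "real"},
    {"label": "real", "step1_verdict": "fake"},
    {"label": "real", "step1_verdict": "real"},
    {"label": "real", "step1_verdict": "real"},
    {"label": "fake", "step1_verdict": "fake"},
]

consensus, agreement = majority_vote(["fake", "fake", "real", "fake", "fake"])
drift = step1_drift_rate(records)
```

A low agreement fraction flags samples where the single annotator is unstable across its 5 votes; it says nothing about whether a different VLM would agree.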
License
Annotations: CC-BY-NC-4.0.
Image bytes are sourced from upstream public datasets (HydraFake, GenImage, FantasyID, OSTF, DocBank, Flickr/Wikimedia, etc.) and our own document-tampering pipeline. Each image inherits its upstream license; redistribution is provided under the upstream terms. See IMAGE_PROVENANCE.md for the source map.