Model

GOT-OCR2.0

by StepFun AI · ID: hf-model--stepfun-ai--got-ocr2_0
Scale 716,033,280 params (≈0.72B)
FNI Rank 31
Percentile Top 5%
Activity: 0.0%


Audited 31 FNI Score
716,033,280 Params (≈0.72B)
4k Context
16.1K Downloads
~3 GB Est. VRAM (4-bit)
Model Information Summary
Entity Passport
Registry ID hf-model--stepfun-ai--got-ocr2_0
Provider huggingface
💾 Compute Threshold

~3 GB VRAM (estimated)

* Estimated for 4-Bit Quantization. Actual usage varies by context length and parallel batching.
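For context, the snippet below is a minimal sketch of what 4-bit loading could look like with Hugging Face Transformers and bitsandbytes. GOT-OCR2.0 ships custom remote code, so whether this quantized path works for it is an assumption, not something stated on this page.

# Illustrative 4-bit load via bitsandbytes; GOT-OCR2.0 support for this path is assumed, not verified here.
import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained('stepfun-ai/GOT-OCR2_0', trust_remote_code=True)
model = AutoModel.from_pretrained(
    'stepfun-ai/GOT-OCR2_0',
    trust_remote_code=True,
    quantization_config=quant_config,
    device_map='auto',
)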

πŸ•ΈοΈ Neural Mesh Hub

Interconnecting Research, Data & Ecosystem

πŸ•ΈοΈ

Intelligence Hive

Multi-source Relation Matrix

Live Index
πŸ“œ

Cite this model

Academic & Research Attribution

BibTeX
@misc{hf_model__stepfun_ai__got_ocr2_0,
  author = {StepFun AI},
  title = {GOT-OCR2.0 Model},
  year = {2026},
  howpublished = {\url{https://huggingface.co/stepfun-ai/GOT-OCR2_0}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}
APA Style
StepFun AI. (2026). GOT-OCR2.0 [Model]. Free2AITools. https://huggingface.co/stepfun-ai/GOT-OCR2_0

🔬 Technical Deep Dive

Full Specifications

⚡ Quick Commands

🤗 HF Download
huggingface-cli download stepfun-ai/GOT-OCR2_0
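The same download can also be done from Python with huggingface_hub; the local_dir below is an illustrative choice, not a required path.

# Python alternative to the CLI command above.
from huggingface_hub import snapshot_download

# Omit local_dir to use the default Hugging Face cache directory.
snapshot_download(repo_id='stepfun-ai/GOT-OCR2_0', local_dir='./GOT-OCR2_0')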

βš–οΈ Free2AI Nexus Index

Methodology β†’ πŸ“˜ What is FNI?
31.0
Top 5% Overall Impact
πŸ”₯ Popularity (P) 0
πŸš€ Velocity (V) 0
πŸ›‘οΈ Credibility (C) 0
πŸ”§ Utility (U) 0
Nexus Verified Data

πŸ’¬ Why this score?

This Got Ocr2 0 has a P score of 0 (popularity from downloads/likes), V of 0 (growth velocity), C of 0 (credibility from citations), and U of 0 (utility/deploy support).

Data Verified · Last Updated: Not calculated
Free2AI Nexus Index | Fair · Transparent · Explainable | Full Methodology
---

🚀 What's Next?

README


pipeline_tag: image-text-to-text
language:
  - multilingual
tags:
  - got
  - vision-language
  - ocr2.0
  - custom_code
license: apache-2.0

General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

🔋 Online Demo | 🌟 GitHub | 📜 Paper

Haoran Wei*, Chenglong Liu*, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, Xiangyu Zhang

[figure: GOT-OCR2.0 overview image (jpeg)]

Usage

Inference using Hugging Face Transformers on NVIDIA GPUs. Requirements tested on Python 3.10:

torch==2.0.1
torchvision==0.15.2
transformers==4.37.2
tiktoken==0.6.0
verovio==4.3.1
accelerate==0.28.0
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ucaslcl/GOT-OCR2_0', trust_remote_code=True)
model = AutoModel.from_pretrained('ucaslcl/GOT-OCR2_0', trust_remote_code=True, low_cpu_mem_usage=True, device_map='cuda', use_safetensors=True, pad_token_id=tokenizer.eos_token_id)
model = model.eval().cuda()


# input your test image
image_file = 'xxx.jpg'

# plain texts OCR
res = model.chat(tokenizer, image_file, ocr_type='ocr')

# format texts OCR:
# res = model.chat(tokenizer, image_file, ocr_type='format')

# fine-grained OCR:
# res = model.chat(tokenizer, image_file, ocr_type='ocr', ocr_box='')
# res = model.chat(tokenizer, image_file, ocr_type='format', ocr_box='')
# res = model.chat(tokenizer, image_file, ocr_type='ocr', ocr_color='')
# res = model.chat(tokenizer, image_file, ocr_type='format', ocr_color='')

# multi-crop OCR:
# res = model.chat_crop(tokenizer, image_file, ocr_type='ocr')
# res = model.chat_crop(tokenizer, image_file, ocr_type='format')

# render the formatted OCR results:
# res = model.chat(tokenizer, image_file, ocr_type='format', render=True, save_render_file = './demo.html')

print(res)

More details about 'ocr_type', 'ocr_box', 'ocr_color', and 'render' can be found at our GitHub.
Our training codes are available at our GitHub.
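As a small extension of the usage snippet above, the sketch below loops the same plain-text model.chat call over a folder of images; the folder path and .jpg extension are illustrative assumptions, and it reuses the model and tokenizer already loaded above.

# Sketch: plain-text OCR over a folder of images, reusing the model and
# tokenizer created in the usage snippet above. The path is illustrative.
from pathlib import Path

image_dir = Path('./images')
for image_path in sorted(image_dir.glob('*.jpg')):
    text = model.chat(tokenizer, str(image_path), ocr_type='ocr')
    print(f'--- {image_path.name} ---')
    print(text)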

More Multimodal Projects

πŸ‘ Welcome to explore more multimodal projects of our team:

Vary | Fox | OneChart

Citation

If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️!

@article{wei2024general,
  title={General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model},
  author={Wei, Haoran and Liu, Chenglong and Chen, Jinyue and Wang, Jia and Kong, Lingyu and Xu, Yanming and Ge, Zheng and Zhao, Liang and Sun, Jianjian and Peng, Yuang and others},
  journal={arXiv preprint arXiv:2409.01704},
  year={2024}
}
@article{liu2024focus,
  title={Focus Anywhere for Fine-grained Multi-page Document Understanding},
  author={Liu, Chenglong and Wei, Haoran and Chen, Jinyue and Kong, Lingyu and Ge, Zheng and Zhu, Zining and Zhao, Liang and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2405.14295},
  year={2024}
}
@article{wei2023vary,
  title={Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models},
  author={Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yang, Jinrong and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2312.06109},
  year={2023}
}

πŸ“ Limitations & Considerations

  • β€’ Benchmark scores may vary based on evaluation methodology and hardware configuration.
  • β€’ VRAM requirements are estimates; actual usage depends on quantization and batch size.
  • β€’ FNI scores are relative rankings and may change as new models are added.
  • β€’ Source: Unknown
Top Tier

Social Proof

HuggingFace Hub
1.5K Likes
16.1K Downloads
🔄 Daily sync (03:00 UTC)

AI Summary: Based on Hugging Face metadata. Not a recommendation.

📊 FNI Methodology | 📚 Knowledge Base | ℹ️ Verify with original source

πŸ›‘οΈ Model Transparency Report

Verified data manifest for traceability and transparency.

100% Data Disclosure Active

🆔 Identity & Source

id
hf-model--stepfun-ai--got-ocr2_0
source
huggingface
author
Stepfun Ai
tags
safetensors · got · vision-language · ocr2.0 · custom_code · image-text-to-text · multilingual · arxiv:2409.01704 · arxiv:2405.14295 · arxiv:2312.06109 · license:apache-2.0 · region:us

βš™οΈ Technical Specs

architecture
GOT
parameters
716,033,280 (≈0.72B)
context length
4,096
pipeline tag
image-text-to-text
vram gb
≈3.0
vram is estimated
true
vram formula
VRAM (GB) ≈ (params in billions × 0.75) + 2 GB (KV cache) + 0.5 GB (overhead)
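A minimal worked example of this formula for GOT-OCR2.0's parameter count, assuming the params term is expressed in billions:

# Worked example of the VRAM estimate stated above.
params_billions = 716_033_280 / 1e9  # ≈0.716B parameters
vram_gb = params_billions * 0.75 + 2.0 + 0.5  # weights + KV cache + overhead
print(f'Estimated VRAM: {vram_gb:.2f} GB')  # ≈3.04 GB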

📊 Engagement & Metrics

likes
1,526
downloads
16,073

Free2AITools Constitutional Data Pipeline: Curated disclosure mode active. (V15.x Standard)