@misc{hf_dataset__opencsg__fineweb_edu_chinese_v2.1,
author = {opencsg},
title = {Fineweb Edu Chinese V2.1 Dataset},
year = {2026},
howpublished = {\url{https://huggingface.co/datasets/opencsg/fineweb-edu-chinese-v2.1}},
note = {Accessed via Free2AITools Knowledge Fortress}
}
APA Style
opencsg. (2026). Fineweb Edu Chinese V2.1 [Dataset]. Free2AITools. https://huggingface.co/datasets/opencsg/fineweb-edu-chinese-v2.1
The Chinese Fineweb Edu Dataset V2.1 is an enhanced version of the V2 dataset, designed specifically for natural language processing (NLP) tasks in the education sector. This version introduces two new data sources, map-cc and opencsg-cc, and retains data with scores ranging from 2 to 3. Entries are organized into folders by score, so users can select the data that fits their training time and compute budget.
Expanded Data Sources
Key Features
New Data Sources:
map-cc
opencsg-cc
Score-Based Data Organization:
Data entries are categorized into different folders based on their scores:
4-5: High-quality educational content with clear and coherent writing.
3-4: Suitable educational content with some minor issues in coherence or relevance.
2-3: Potentially useful educational content with notable limitations.
Data Volume:
4-5: 70 GB, approximately 46 billion tokens, 17,790,513 lines.
3-4: 800 GB, approximately 530 billion tokens, 289,975,835 lines.
2-3: 1.4 TB, approximately 930 billion tokens, 649,842,063 lines.
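To put these figures side by side, the short calculation below derives the approximate average tokens per line for each score range, along with the combined totals. It uses only the numbers listed above; the names `VOLUMES` and `summarize` are illustrative and not part of any dataset tooling.

```python
# Figures copied from the Data Volume list; token counts are approximate.
VOLUMES = {
    "4-5": {"tokens": 46e9, "lines": 17_790_513},
    "3-4": {"tokens": 530e9, "lines": 289_975_835},
    "2-3": {"tokens": 930e9, "lines": 649_842_063},
}


def summarize(volumes):
    """Return average tokens per line per score range, plus overall totals."""
    per_line = {k: v["tokens"] / v["lines"] for k, v in volumes.items()}
    total_tokens = sum(v["tokens"] for v in volumes.values())
    total_lines = sum(v["lines"] for v in volumes.values())
    return per_line, total_tokens, total_lines


per_line, total_tokens, total_lines = summarize(VOLUMES)
for score_range, avg in per_line.items():
    print(f"{score_range}: ~{avg:,.0f} tokens per line")
print(f"total: ~{total_tokens / 1e9:,.0f}B tokens, {total_lines:,} lines")
```

On these numbers, entries in the 4-5 range average roughly 2,600 tokens per line, against roughly 1,400 in the 2-3 range, and the three ranges together hold about 1.5 trillion tokens.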
Flexible Training:
The dataset organization allows for selective use of data based on the available time and computational resources.
Researchers and developers can choose specific score ranges to train their models, optimizing for different scenarios.
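As a rough illustration of choosing score ranges under a compute constraint, the sketch below fills a token budget from the highest-scored range downward. This greedy, quality-first policy is only one possible assumption, not a method prescribed by the dataset authors. The function name `plan_training_mix` and the budget value are hypothetical.

```python
# Approximate token counts per score range, from the Data Volume list.
TOKENS_BY_SCORE = [
    ("4-5", 46e9),
    ("3-4", 530e9),
    ("2-3", 930e9),
]


def plan_training_mix(token_budget):
    """Greedy, quality-first plan: fill the budget from the top score range down.

    Returns a list of (score_range, fraction_of_range_to_use) pairs.
    """
    plan = []
    remaining = token_budget
    for score_range, available in TOKENS_BY_SCORE:
        if remaining <= 0:
            break
        used = min(available, remaining)
        plan.append((score_range, used / available))
        remaining -= used
    return plan


# Hypothetical budget of 100 billion training tokens.
print(plan_training_mix(100e9))
```

With a 100-billion-token budget, this plan uses all of the 4-5 range and about a tenth of the 3-4 range. A larger budget would draw on the 2-3 range as well.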
Data Distribution by Score
[Figures: data distribution for score ranges 4-5, 3-4, and 2-3]
We warmly invite developers and researchers interested in this field to follow and engage with the community, working together to advance the technology. Stay tuned for the open-source release of the dataset!
License Agreement
Usage of the Chinese Fineweb Edu dataset requires adherence to the OpenCSG Community License. The Chinese Fineweb Edu dataset supports commercial use. If you plan to use the OpenCSG model or its derivatives for commercial purposes, you must comply with the terms and conditions outlined in the OpenCSG Community License as well as the Apache 2.0 License. For commercial use, please send an email to [email protected] and obtain permission.
Citation
@misc{yu2025opencsgchinesecorpusseries,
title={OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training},
author={Yijiong Yu and Ziyun Dai and Zekun Wang and Wei Wang and Ran Chen and Ji Pei},
year={2025},
eprint={2501.08197},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.08197},
}