Fineweb Edu Chinese V2.1

Name: Fineweb Edu Chinese V2.1
Creator: opencsg
License: Apache-2.0

Nexus Index

36.1 Top 100%

S: Semantic 50

A: Authority 0

P: Popularity 56

R: Recency 64

Q: Quality 30

Dataset Information Summary
Entity Passport
Registry ID	hf-dataset--opencsg--fineweb-edu-chinese-v2.1
License	Apache-2.0
Provider	huggingface

📜

Cite this dataset

Academic & Research Attribution

BibTeX

@misc{hf_dataset__opencsg__fineweb_edu_chinese_v2.1,
  author = {opencsg},
  title = {Fineweb Edu Chinese V2.1 Dataset},
  year = {2026},
  howpublished = {\url{https://huggingface.co/datasets/opencsg/fineweb-edu-chinese-v2.1}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}

APA Style

opencsg. (2026). Fineweb Edu Chinese V2.1 [Dataset]. Free2AITools. https://huggingface.co/datasets/opencsg/fineweb-edu-chinese-v2.1

Dataset Specification

Chinese Fineweb Edu Dataset V2.1

📖Technical Report

The Chinese Fineweb Edu Dataset V2.1 is an enhanced version of the V2 dataset, designed specifically for natural language processing (NLP) tasks in the education sector. This version introduces two new data sources, map-cc and opencsg-cc, and retains data with scores ranging from 2 to 3. The dataset entries are organized into different folders based on their scores, allowing for flexible selection of data according to time and computational power requirements during training.

Expanded Data Sources

Key Features

New Data Sources:
- map-cc
- opencsg-cc
Score-Based Data Organization:
- Data entries are categorized into different folders based on their scores:
  - 4-5: High-quality educational content with clear and coherent writing.
  - 3-4: Suitable educational content with some minor issues in coherence or relevance.
  - 2-3: Potentially useful educational content with notable limitations.
Data Volume:
- 4-5: 70 GB, approximately 46 billion tokens, 17,790,513 lines.
- 3-4: 800 GB, approximately 530 billion tokens, 289,975,835 lines.
- 2-3: 1.4 TB, approximately 930 billion tokens, 649,842,063 lines.
Flexible Training:
- The dataset organization allows for selective use of data based on the available time and computational resources.
- Researchers and developers can choose specific score ranges to train their models, optimizing for different scenarios.
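
The selective-use idea above can be sketched as a small budget calculation. This is only an illustration, not part of the dataset's tooling: the `ScoreBucket` class and `select_buckets` function are hypothetical names, the figures are copied from the Data Volume list, and 1.4 TB is assumed to mean 1,400 GB.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ScoreBucket:
    name: str              # score range as named in the dataset card
    size_gb: float         # approximate size on disk
    tokens_billion: float  # approximate token count, in billions
    lines: int             # number of lines


# Figures copied from the Data Volume list; ordered from highest to lowest score.
# Assumption: 1.4 TB is treated as 1,400 GB.
BUCKETS = [
    ScoreBucket("4-5", 70, 46, 17_790_513),
    ScoreBucket("3-4", 800, 530, 289_975_835),
    ScoreBucket("2-3", 1400, 930, 649_842_063),
]


def select_buckets(token_budget_billion: float) -> List[str]:
    """Greedily take whole score folders, best scores first, within the budget."""
    chosen: List[str] = []
    used = 0.0
    for bucket in BUCKETS:
        if used + bucket.tokens_billion > token_budget_billion:
            break  # the next folder would exceed the budget; stop here
        chosen.append(bucket.name)
        used += bucket.tokens_billion
    return chosen
```

For example, `select_buckets(600)` returns `['4-5', '3-4']`, since 46 + 530 = 576 billion tokens fits the budget while adding the 2-3 folder would not. Taking whole folders is a simplification; a real training run might instead sample part of a lower-scored folder.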

Data Distribution by Score

[Figures: data distribution for score ranges 4-5, 3-4, and 2-3]

We warmly invite developers and researchers interested in this field to follow and engage with the community, working together to advance the technology. Stay tuned for the open-source release of the dataset!

License Agreement

Usage of the Chinese Fineweb Edu dataset requires adherence to the OpenCSG Community License. The Chinese Fineweb Edu dataset supports commercial use. If you plan to use the OpenCSG model or its derivatives for commercial purposes, you must comply with the terms and conditions outlined in the OpenCSG Community License as well as the Apache 2.0 License. For commercial use, please send an email to [email protected] and obtain permission.

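Introduction to the Chinese Fineweb Edu V2.1 Dataset

**Chinese Fineweb Edu Dataset V2.1** is an enhanced version of the V2 dataset, designed and optimized for natural language processing (NLP) tasks in the education sector. This version introduces two new data sources, **map-cc** and **opencsg-cc**, and retains data scored from 2 to 3. Entries are stored in separate folders by score, so users can select training data flexibly according to their time and compute constraints.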

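Expanded Data Filtering Scope

New data sources:
- map-cc
- opencsg-cc
Score-based data organization:
- Entries are stored in separate folders by score:
  - 4-5: High-quality educational content with clear, coherent writing.
  - 3-4: Content suitable for educational use, possibly with minor issues in coherence or relevance.
  - 2-3: Potentially useful educational content with notable limitations.
Data volume:
- 4-5: 70 GB, approximately 46 billion tokens, 17,790,513 lines.
- 3-4: 800 GB, approximately 530 billion tokens, 289,975,835 lines.
- 2-3: 1.4 TB, approximately 930 billion tokens, 649,842,063 lines.
Flexible training:
- The dataset's organization lets users train on specific score ranges according to their available time and compute, optimizing for different scenarios.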

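Data Distribution by Score

[Figures: data distribution for score ranges 4-5, 3-4, and 2-3]

We warmly invite developers and researchers interested in this field to follow and contact the community, and to work together to advance the technology. Stay tuned for the open-source release of the dataset!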

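License Agreement

Use of the Chinese Fineweb Edu V2 dataset requires adherence to the OpenCSG Community License. The Chinese Fineweb Edu V2 dataset supports commercial use. If you plan to use the OpenCSG model or its derivatives for commercial purposes, you must comply with the terms and conditions of both the OpenCSG Community License and the Apache 2.0 License. For commercial use, send an email to [email protected] and obtain permission.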

Citation

@misc{yu2025opencsgchinesecorpusseries,
      title={OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training}, 
      author={Yijiong Yu and Ziyun Dai and Zekun Wang and Wei Wang and Ran Chen and Ji Pei},
      year={2025},
      eprint={2501.08197},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.08197}, 
}

📊 Structured Schema (Zero-Fabrication)

Feature Key	Data Type
`text`	`string`
`score`	`float64`
`source`	`string`

Estimated Rows: 957,608,411
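
As a rough illustration of how the schema and the row estimate fit together, the sketch below checks a record against the three listed fields and confirms that the per-folder line counts from the Data Volume list add up to the estimated row total. The helper names `validate_record` and `bucket_for` are hypothetical, and the handling of boundary scores (whether 3.0 belongs to the 3-4 folder) is an assumption that the dataset card does not state.

```python
from typing import Any, Dict

# Field names and types from the schema table above.
SCHEMA = {"text": str, "score": float, "source": str}

# Line counts per score folder, from the Data Volume list.
LINES_PER_BUCKET = {"4-5": 17_790_513, "3-4": 289_975_835, "2-3": 649_842_063}
ESTIMATED_ROWS = 957_608_411


def validate_record(record: Dict[str, Any]) -> bool:
    """Return True if the record has exactly the schema's fields and types."""
    if set(record) != set(SCHEMA):
        return False
    return all(isinstance(record[key], typ) for key, typ in SCHEMA.items())


def bucket_for(score: float) -> str:
    """Map a score to a score-range folder name.

    Assumption: lower bounds are inclusive (3.0 maps to "3-4");
    the dataset card does not specify boundary handling.
    """
    if not 2.0 <= score <= 5.0:
        raise ValueError(f"score {score} is outside the retained range 2-5")
    if score >= 4.0:
        return "4-5"
    if score >= 3.0:
        return "3-4"
    return "2-3"


# The three folders together account for the full estimated row count.
assert sum(LINES_PER_BUCKET.values()) == ESTIMATED_ROWS
```

A record such as `{"text": "示例", "score": 3.5, "source": "map-cc"}` passes `validate_record`, and `bucket_for(3.5)` places it in the 3-4 range.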

🛡️ Dataset Transparency Report

Technical metadata sourced from upstream repositories.

Open Metadata

🆔 Identity & Source

id: hf-dataset--opencsg--fineweb-edu-chinese-v2.1
slug: opencsg--fineweb-edu-chinese-v2.1
source: huggingface
author: opencsg
license: Apache-2.0
tags: task_categories:text-generation, language:zh, license:apache-2.0, size_categories:100m<n<1b, format:parquet, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, arxiv:2501.08197, region:us

📊 Engagement & Metrics

downloads: 63,600
stars: 67
forks: 0

Data indexed from public sources. Updated daily.
