MNBVC

Name: MNBVC
Creator: liwu
License: MIT

Nexus Index

43.0 Top 100%

S: Semantic 50

A: Authority 0

P: Popularity 63

R: Recency 85

Q: Quality 30

Dataset Information Summary
Entity Passport
Registry ID	hf-dataset--liwu--mnbvc
License	MIT
Provider	huggingface

📜

Cite this dataset

Academic & Research Attribution

BibTeX

@misc{hf_dataset__liwu__mnbvc,
  author = {liwu},
  title = {MNBVC Dataset},
  year = {2026},
  howpublished = {\url{https://huggingface.co/datasets/liwu/mnbvc}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}

APA Style

liwu. (2026). MNBVC [Dataset]. Free2AITools. https://huggingface.co/datasets/liwu/mnbvc

⬇️

Downloads

201,483

Dataset Specification

Dataset Card for MNBVC

Dataset Description

Homepage: http://mnbvc.253874.net/
Repository: https://github.com/esbatmop/MNBVC
Paper: N/A
Leaderboard: N/A
Point of Contact: N/A

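Dataset Introduction

On 2023-01-01, the MOP Liwu community, the oldest and most mysterious community on the Chinese internet (bar none), solemnly announced:

Under the wise and valiant leadership of the MOP moderators, the community is determined to play to its strengths (it is strong everywhere) and help the open-source community maintain, over the long term, the largest corpus of Chinese internet text.

The MNBVC dataset on Hugging Face is being updated gradually. For additional data that has not yet been fully cleaned, see https://github.com/esbatmop/MNBVC.

The dataset can be loaded with the following script: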

python

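from datasets import load_dataset

# Load a subset by passing its config label as the second argument.
# The labels are listed in the label column of the table below.
# Example: item 1.1, text from arXiv papers, label: academic_paper
dataset_arxiv = load_dataset("liwu/MNBVC", "academic_paper", split="train", streaming=True)
# Example: item 9.1, legal judgment texts, label: law_judgement
dataset_law_judgement = load_dataset("liwu/MNBVC", "law_judgement", split="train", streaming=True)

# Fetch the first record of the streaming dataset
next(iter(dataset_arxiv))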
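
Because the script above loads with `streaming=True`, each split behaves as a lazy iterable rather than an in-memory table. The following is a minimal sketch of how one might peek at the first few records without downloading everything. The `take` helper and the `fake_stream` generator are hypothetical names introduced here for illustration; `fake_stream` stands in for a real split so the example runs offline, and its `text` and `meta` fields are borrowed from the legacy format described in the data format section (actual subsets may expose different fields).

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def take(records: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Return at most the first n records, reading no more than n from the source."""
    return list(islice(records, n))


def fake_stream() -> Iterator[Dict[str, Any]]:
    """Hypothetical stand-in for a streaming split: yields records lazily, forever."""
    i = 0
    while True:
        yield {"text": f"sample document {i}", "meta": "{}"}
        i += 1


if __name__ == "__main__":
    # With the real data, the argument would instead be something like:
    #   load_dataset("liwu/MNBVC", "law_judgement", split="train", streaming=True)
    for record in take(fake_stream(), 3):
        print(record["text"])
```

Using `islice` keeps the access lazy, which matters for a corpus of this size: only the requested records are pulled from the source.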

Data Subsets

The MNBVC dataset contains several subsets:

No.	First-level directory	Second-level directory	Description	Label	Notes
1	`academic_paper`	-	Text from academic literature.	-	-
1.1	`academic_paper`	`arxiv`	Text from arXiv papers.	`academic_paper`	-
2	`blog`	-	Blog corpus directory.	`blog`	-
2.1	`blog`	`163_blog`	-	`blog`	-
2.2	`blog`	`ai_blog`	-	`blog`	-
2.3	`blog`	`it_blog`	-	`blog`	-
3	`book`	-	Book corpus directory.	`book`	-
3.1	`book`	`InfoSec`	-	`book`	-
4	`co_ann_report`	-	Text of corporate annual reports.	`co_ann_report`	-
4.1	`code`	-	Text from source code.	-	-
4.2	`code`	`metadata`	Code metadata of GitHub repositories.	`code`	-
4.3	`code`	`googlecode`	Data from various repositories on Google Code.	`code`	-
4.4	`code`	`githubcode`	Data from various repositories on GitHub.	`code`	-
5	`crawler`	-	Web-crawl corpus directory.	-	-
5.1	`crawler`	`oscar`	General text data cleaned from CommonCrawl.	`crawler_oscar`	-
6	`forum`	-	Forum corpus directory.	`forum`	-
7	`game`	-	Parallel corpus data from several games.	-	-
7.1	`game`	`Baldurs_Gate_3`	Baldur's Gate 3	`game`	-
7.2	`game`	`DarkSouls3`	Dark Souls III	`game`	-
7.3	`game`	`do_not_starve`	Don't Starve	`game`	-
7.4	`game`	`EldenRing`	Elden Ring	`game`	-
7.5	`game`	`Genshin_Anime`	Genshin Impact	`game`	-
7.6	`game`	`GTA`	Grand Theft Auto IV and Grand Theft Auto V	`game`	-
7.7	`game`	`Hogwarts_legacy`	Hogwarts Legacy	`game`	-
7.8	`game`	`hades`	Hades	`game`	-
7.9	`game`	`Ib`	Ib	`game`	-
7.10	`game`	`RDR2RE`	Red Dead Redemption 2	`game`	-
7.11	`game`	`sekiro`	Sekiro	`game`	-
7.12	`game`	`Sid_Meiers_CivilizationVI`	Civilization VI	`game`	-
7.13	`game`	`slay_the_spire`	Slay the Spire	`game`	-
7.14	`game`	`StarRail`	Honkai: Star Rail	`game`	-
7.15	`game`	`stellaris`	Stellaris	`game`	-
7.16	`game`	`Terraria`	Terraria	`game`	-
7.17	`game`	`The_Wither_3`	The Witcher 3	`game`	-
7.18	`game`	`Turing_Complete`	Turing Complete	`game`	-
7.19	`game`	`witchspring`	WitchSpring R	`game`	-
7.20	`game`	`Wuthering`	Wuthering Waves	`game`	-
7.21	`game`	`Yakuza`	Yakuza	`game`	-
8	`gov`	-	Government materials directory.	-	-
8.1	`gov`	`xuexiqiangguo`	Text from Xuexi Qiangguo (学习强国).	`gov_xuexiqiangguo`	-
8.2	`gov`	`gov_report`	Text from government work reports.	`gov_report`	-
9	`law`	-	Text from legal documents.	-	-
9.1	`law`	`judgement`	Text of legal judgments.	`law_judgement`	-
10	`math`	-	Chinese corpus related to mathematics.	`math`	-
10.1	`math`	`qa`	Question-and-answer data from the field of mathematics.	`math_qa`	-
10.2	`math`	`emath`	Corpus from a forum for Chinese mathematics enthusiasts.	`emath`	-
10.3	`math`	`chat`	Dialogue data from the field of mathematics, which can help improve a model's chain-of-thought ability.	`math_chat`	-
11	`new`	-	Text data from news.	`new`	-
11.1	`news`	`peoples_daily`	Text data from People's Daily.	`new`	-
12	`parallel`	-	Parallel corpus directory.	-	-
12.1	`parallel`	`subtitle`	Subtitle corpus.	-	-
12.1.1	`parallel`	`subtitle` \ `yyets`	YYeTs (人人影视).	`parallel_subtitle_yyets`	-
12.1.2	`parallel`	`subtitle` \ `shooter.cn`	Shooter.cn (射手网).	`parallel_subtitle_shooter`	-
12.2	`parallel`	`united_nations`	United Nations parallel corpus.	`parallel_united_nations`	-
13	`patent`	-	Patent text data directory.	-	-
14	`qa`	-	Data from various question-and-answer corpora.	-	-
14.1	`qa`	`chatgpt`	Q&A corpus constructed with ChatGPT; thanks to genggui001 for contributing it.	`qa_chatgpt`	-
14.2	`qa`	`mfa`	Q&A data from the Ministry of Foreign Affairs.	`qa_mfa`	-
14.3	`qa`	`quora`	Q&A corpus from the Quora website.	`qa_quora`	-
14.4	`qa`	`stackexchange`	Q&A data from StackExchange.	`qa_stackexchange`	-
14.5	`qa`	`wikihow`	Q&A data from wikiHow.	`qa_wikihow`	-
14.6	`qa`	`zhihu`	Q&A data from Zhihu.	`qa_zhihu`	-
15	`wikipedia`	-	Text data from Wikipedia.	`wikipedia`	-
16	`medical`	-	Text from drug package inserts.	`medical`	-
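
The table above pairs each directory with the label that is passed to `load_dataset` as the config name. As a rough sketch, that pairing can be kept in a small lookup so a script selects subsets by directory rather than by hard-coded label. The `CONFIG_LABELS` mapping and the `config_label` function are hypothetical names introduced here, and only a handful of rows from the table are copied in; whether every label is currently available on Hugging Face is not something this sketch checks.

```python
from typing import Dict, Optional, Tuple

# (first-level directory, second-level directory) -> label, copied from a few
# rows of the subset table. None means the row has no second-level directory.
CONFIG_LABELS: Dict[Tuple[str, Optional[str]], str] = {
    ("academic_paper", "arxiv"): "academic_paper",
    ("co_ann_report", None): "co_ann_report",
    ("crawler", "oscar"): "crawler_oscar",
    ("gov", "xuexiqiangguo"): "gov_xuexiqiangguo",
    ("gov", "gov_report"): "gov_report",
    ("law", "judgement"): "law_judgement",
    ("math", "qa"): "math_qa",
    ("qa", "zhihu"): "qa_zhihu",
    ("wikipedia", None): "wikipedia",
}


def config_label(first: str, second: Optional[str] = None) -> str:
    """Look up the label for a directory pair; raise KeyError if it is not listed."""
    try:
        return CONFIG_LABELS[(first, second)]
    except KeyError:
        raise KeyError(f"no label listed for {first!r}/{second!r}") from None


if __name__ == "__main__":
    print(config_label("law", "judgement"))
    print(config_label("wikipedia"))
```

The returned string would then be used as the second argument in a call such as `load_dataset("liwu/MNBVC", config_label("law", "judgement"), split="train", streaming=True)`.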

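Data Format

The MNBVC dataset currently contains the following types of data:

General text
Q&A corpus
Code corpus
Multi-turn dialogue
Forum corpus
Parallel corpus

The exact format of each data type is documented on the MNBVC wiki page.

Data uploaded early in the project uses the format below. This format will be deprecated, and the corresponding data will be re-uploaded: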

json

{
    "text": datasets.Value("string"),
    "meta": datasets.Value("string")
}
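
Since the legacy layout above is scheduled for replacement, downstream code may want to detect which layout a record uses before processing it. The following is a minimal sketch under the assumption that a legacy record is a plain mapping with exactly the two string fields `text` and `meta`; `is_legacy_record` and `LEGACY_FIELDS` are hypothetical names, and the newer per-type formats are documented on the MNBVC wiki rather than modeled here.

```python
from typing import Any, Mapping

# Field names taken from the legacy format shown above; both are string-valued.
LEGACY_FIELDS = ("text", "meta")


def is_legacy_record(record: Mapping[str, Any]) -> bool:
    """Return True if the record has exactly the legacy fields, all of type str."""
    if set(record) != set(LEGACY_FIELDS):
        return False
    return all(isinstance(record[name], str) for name in LEGACY_FIELDS)


if __name__ == "__main__":
    print(is_legacy_record({"text": "some document", "meta": "{}"}))
    print(is_legacy_record({"text": "some document", "extra": 1}))
```

A check like this lets a pipeline route legacy records to a compatibility path and leave everything else to format-specific handling.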

Contributions

Thanks to the Liwu community for constructing this dataset. Thanks to silver, jiaming, and Mark Leung for adding and uploading this dataset to Hugging Face.

Citation

Please cite the repo if you use the data or code in this repo.

text

@misc{mnbvc,
  author = {{MOP-LIWU Community} and {MNBVC Team}},
  title = {MNBVC: Massive Never-ending BT Vast Chinese corpus},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/esbatmop/MNBVC}},
}

🛡️ Dataset Transparency Report

Technical metadata sourced from upstream repositories.

Open Metadata

🆔 Identity & Source

id: hf-dataset--liwu--mnbvc
slug: liwu--mnbvc
source: huggingface
author: liwu
license: ["mit"]
tags: task_categories:text-generation, task_categories:fill-mask, task_ids:language-modeling, task_ids:masked-language-modeling, annotations_creators:other, language_creators:other, multilinguality:monolingual, source_datasets:original, language:zh, license:mit, region:us

📊 Engagement & Metrics

downloads: 201,483
stars: 609
forks: 0

Data indexed from public sources. Updated daily.
