📊
Dataset

Github Code 2025 Language Split

by lumees hf-dataset--lumees--github-code-2025-language-split
Nexus Index
32.8 Top 100%
S: Semantic 50
A: Authority 0
P: Popularity 62
R: Recency 48
Q: Quality 30
Format: Parquet
Registry ID hf-dataset--lumees--github-code-2025-language-split
License ["other"]
Provider huggingface
📜

Cite this dataset

Academic & Research Attribution

BibTeX
@misc{hf_dataset__lumees__github_code_2025_language_split,
  author = {lumees},
  title = {Github Code 2025 Language Split Dataset},
  year = {2026},
  howpublished = {\url{https://huggingface.co/datasets/lumees/github-code-2025-language-split}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}
APA Style
lumees. (2026). Github Code 2025 Language Split [Dataset]. Free2AITools. https://huggingface.co/datasets/lumees/github-code-2025-language-split

🔬 Technical Deep Dive

âš–ī¸ Nexus Index V2.0

32.8
TOP 100% SYSTEM IMPACT
Semantic (S) 50
Authority (A) 0
Popularity (P) 62
Recency (R) 48
Quality (Q) 30

đŸ’Ŧ Index Insight

FNI V2.0 for Github Code 2025 Language Split: Semantic (S:50), Authority (A:0), Popularity (P:62), Recency (R:48), Quality (Q:30).

⬇️
Downloads
218,748

đŸ‘ī¸ Data Preview

📊

Row-level preview not available for this dataset.

Schema structure is shown in the Field Logic panel when available.

🔗 Explore Full Dataset ↗

đŸ§Ŧ Field Logic

đŸ§Ŧ

Schema not yet indexed for this dataset.

Dataset Specification

📜 Source Data & Attribution

This dataset is a processed derivative of nick007x/github-code-2025.

Origination

The original data was aggregated by nick007x from public GitHub repositories. We have retained the original content, file paths, and metadata while restructuring the format for easier consumption by language-specific models.

Processing Steps

To create this dataset, we performed the following processing on the source data:

  1. Language Identification: We mapped file extensions (e.g., .py, .rs, .ts) to their respective programming languages using a comprehensive extension map.
  2. Splitting: The dataset was sharded and split into separate sub-directories/categories by programming language to allow for targeted loading (e.g., loading only Python or Rust data).
  3. Filtering: Binary files and ambiguous extensions were categorized as "Unknown" or removed to ensure text-based model compatibility.
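
The three processing steps above can be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used to build the dataset: the extension map, the set of binary extensions, and the function names `identify_language` and `split_by_language` are all hypothetical, and a real extension map would be far larger.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical, deliberately tiny extension map (assumption: the real map
# used for this dataset is much more comprehensive and is not shown here).
EXTENSION_TO_LANGUAGE = {
    ".py": "Python",
    ".rs": "Rust",
    ".ts": "TypeScript",
}

# Hypothetical set of extensions treated as binary and dropped.
BINARY_EXTENSIONS = {".png", ".jpg", ".zip", ".exe"}


def identify_language(file_path):
    """Return a language name, "Unknown", or None if the file is dropped."""
    ext = PurePosixPath(file_path).suffix.lower()
    if ext in BINARY_EXTENSIONS:
        return None  # step 3: binary files are removed
    # step 1: map the extension; unmapped or missing extensions are "Unknown"
    return EXTENSION_TO_LANGUAGE.get(ext, "Unknown")


def split_by_language(records):
    """Step 2: group records into one bucket per identified language."""
    buckets = defaultdict(list)
    for record in records:
        language = identify_language(record["file_path"])
        if language is not None:
            buckets[language].append(record)
    return dict(buckets)
```

Each bucket would then correspond to one per-language sub-directory, so a consumer interested only in Rust, for example, reads a single bucket rather than scanning the whole corpus.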

Licensing Information

The data contained in this dataset belongs to the original authors of the code repositories on GitHub.

  • Source Aggregation: The aggregation was provided by nick007x/github-code-2025.
  • Individual Code Files: Each file typically retains the license of its original repository (MIT, Apache 2.0, BSD, etc.). Users of this dataset are responsible for adhering to the license terms of the individual code files contained within.

Citation

If you use this dataset, please cite the original source:

BibTeX
@misc{github-code-2025,
  author = {nick007x},
  title = {GitHub Code 2025 Dataset},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/nick007x/github-code-2025}}
}

📊 Structured Schema (Zero-Fabrication)

Feature Key    Data Type
repo_id        string
size           int64
file_path      string
content        string

Estimated Rows: 494,903
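
As a rough illustration of the schema above, a row can be modelled and checked in Python as shown below. The names `CodeRecord`, `EXPECTED_TYPES`, and `is_valid_record` are hypothetical helpers introduced for this sketch. It assumes rows arrive as plain dictionaries with `int64` surfaced as a Python `int`, and it makes no assumption about what unit `size` is measured in.

```python
from typing import TypedDict


class CodeRecord(TypedDict):
    """One row, mirroring the four fields listed in the schema."""
    repo_id: str
    size: int  # int64 in the Parquet schema
    file_path: str
    content: str


# Field name -> expected Python type (hypothetical helper for validation).
EXPECTED_TYPES = {
    "repo_id": str,
    "size": int,
    "file_path": str,
    "content": str,
}


def is_valid_record(row):
    """Return True if the row has exactly the schema fields with matching types."""
    if set(row) != set(EXPECTED_TYPES):
        return False
    # Exact type comparison so that bool is not accepted in place of int.
    return all(type(row[name]) is expected for name, expected in EXPECTED_TYPES.items())
```

A check like this is mainly useful as a guard when rows pass through intermediate tooling that may drop or rename columns.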

Social Proof

HuggingFace Hub
218.7K Downloads
🔄 Daily sync (03:00 UTC)

AI Summary: Based on Hugging Face metadata. Not a recommendation.


đŸ›Ąī¸ Dataset Transparency Report

Technical metadata sourced from upstream repositories.

Open Metadata

🆔 Identity & Source

id: hf-dataset--lumees--github-code-2025-language-split
slug: lumees--github-code-2025-language-split
source: huggingface
author: lumees
license: ["other"]
tags: source_datasets:nick007x/github-code-2025, license:other, size_categories:100m<n<1b, format:parquet, modality:text, library:datasets, library:dask, library:polars, library:mlcroissant, region:us

âš™ī¸ Technical Specs

architecture
null
params billions
null
context length
null
pipeline tag

📊 Engagement & Metrics

downloads: 218,748
stars: 6
forks: 0

Data indexed from public sources. Updated daily.