Propella Annotations
Cite this dataset
Academic & Research Attribution
@misc{hf_dataset__openeurollm__propella_annotations,
author = {openeurollm},
title = {Propella Annotations Dataset},
year = {2026},
howpublished = {\url{https://huggingface.co/datasets/openeurollm/propella-annotations}},
note = {Accessed via Free2AITools Knowledge Fortress}
}
Dataset Specification
license: cc-by-4.0
language:
- ara
- ben
- bos
- bul
- cat
- ces
- dan
- deu
- ell
- eng
- est
- eus
- fas
- fin
- fra
- gle
- glg
- gsw
- heb
- hin
- hrv
- hun
- ind
- isl
- ita
- jpn
- kat
- kor
- lat
- lav
- lit
- ltg
- mkd
- mlt
- nld
- nno
- nob
- pol
- por
- ron
- rus
- slk
- slv
- spa
- sqi
- srp
- swa
- swe
- tha
- tur
- ukr
- urd
- vie
- yue
- zho
tags:
- propella
- data
- annotation
- filtering
- curation
- quality
- fineweb
- finepdfs
- nemotron
- german-commons
- metadata
pretty_name: propella annotations
size_categories:
- 1B<n<10B
configs:
- config_name: fineweb-2
  default: true
  data_files:
  - split: all
    path:
    - data/propella-1-4b/fineweb-2//.parquet
  - split: deu_Latn
    path:
    - data/propella-1-4b/fineweb-2/deu_Latn/*.parquet
  - split: fin_Latn
    path:
    - data/propella-1-4b/fineweb-2/fin_Latn/*.parquet
  - split: fra_Latn
    path:
    - data/propella-1-4b/fineweb-2/fra_Latn/*.parquet
  - split: ita_Latn
    path:
    - data/propella-1-4b/fineweb-2/ita_Latn/*.parquet
  - split: spa_Latn
    path:
    - data/propella-1-4b/fineweb-2/spa_Latn/*.parquet
  - split: swe_Latn
    path:
    - data/propella-1-4b/fineweb-2/swe_Latn/*.parquet
- config_name: finepdfs
  data_files:
  - split: all
    path:
    - data/propella-1-4b/finepdfs//.parquet
  - split: ces_Latn
    path:
    - data/propella-1-4b/finepdfs/ces_Latn/*.parquet
  - split: dan_Latn
    path:
    - data/propella-1-4b/finepdfs/dan_Latn/*.parquet
  - split: deu_Latn
    path:
    - data/propella-1-4b/finepdfs/deu_Latn/*.parquet
  - split: eng_Latn
    path:
    - data/propella-1-4b/finepdfs/eng_Latn/*.parquet
  - split: fin_Latn
    path:
    - data/propella-1-4b/finepdfs/fin_Latn/*.parquet
  - split: fra_Latn
    path:
    - data/propella-1-4b/finepdfs/fra_Latn/*.parquet
  - split: ita_Latn
    path:
    - data/propella-1-4b/finepdfs/ita_Latn/*.parquet
  - split: hun_Latn
    path:
    - data/propella-1-4b/finepdfs/hun_Latn/*.parquet
  - split: nld_Latn
    path:
    - data/propella-1-4b/finepdfs/nld_Latn/*.parquet
  - split: nob_Latn
    path:
    - data/propella-1-4b/finepdfs/nob_Latn/*.parquet
  - split: pol_Latn
    path:
    - data/propella-1-4b/finepdfs/pol_Latn/*.parquet
  - split: por_Latn
    path:
    - data/propella-1-4b/finepdfs/por_Latn/*.parquet
  - split: ron_Latn
    path:
    - data/propella-1-4b/finepdfs/ron_Latn/*.parquet
  - split: spa_Latn
    path:
    - data/propella-1-4b/finepdfs/spa_Latn/*.parquet
  - split: swe_Latn
    path:
    - data/propella-1-4b/finepdfs/swe_Latn/*.parquet
- config_name: hplt-3
  data_files:
  - split: all
    path:
    - data/propella-1-4b/hplt-3//.parquet
  - split: fin_Latn
    path:
    - data/propella-1-4b/hplt-3/fin_Latn/*.parquet
  - split: deu_Latn
    path:
    - data/propella-1-4b/hplt-3/deu_Latn/*.parquet
- config_name: finewiki
  data_files:
  - split: all
    path:
    - data/propella-1-4b/finewiki/*.parquet
- config_name: SYNTH
  data_files:
  - split: all
    path:
    - data/propella-1-4b/SYNTH/*.parquet
- config_name: nemotron-cc
  data_files:
  - split: all
    path:
    - data/propella-1-4b/nemotron-cc//.parquet
  - split: high_actual
    path:
    - data/propella-1-4b/nemotron-cc/high-actual/*.parquet
- config_name: nemotron-cc-10k-sample
  data_files:
  - split: all
    path:
    - data/propella-1-4b/nemotron-cc-10k-sample/*.parquet
- config_name: german-commons
  data_files:
  - split: all
    path:
    - data/propella-1-4b/german-commons/*.parquet
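The per-language splits in the metadata above follow a regular path pattern. As a minimal sketch, assuming that pattern holds for every language split, the glob for a split can be derived from its config and split name. `split_glob` is a hypothetical helper for illustration, not part of any library, and it does not cover splits whose directory name differs from the split name (for example `high_actual`, which is stored under `high-actual`).

```python
def split_glob(config: str, split: str, model: str = "propella-1-4b") -> str:
    """Build the Parquet glob for a per-language split.

    Hypothetical helper: it mirrors the `data/<model>/<config>/<split>/*.parquet`
    layout listed in the metadata and assumes the directory is named like the split.
    """
    return f"data/{model}/{config}/{split}/*.parquet"


german_fineweb = split_glob("fineweb-2", "deu_Latn")
print(german_fineweb)
```

A glob like this is relative to the repository root, so tools that read files from the Hub directly would typically need it prefixed with the repository location.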
This dataset contains document annotations produced with propella-1-4b, a small multilingual LLM that annotates text documents across six categories: core content, classification, quality & value, audience & purpose, safety & compliance, and geographic relevance. The annotations can be used to filter, select, and curate LLM training data at scale.
Properties
Each document is annotated across 18 properties organized into six categories:
| Category | Property | Description |
|---|---|---|
| Core Content | Content Integrity | Completeness and technical quality of the content |
| | Content Ratio | Proportion of content vs. navigation/UI elements |
| | Content Length | Amount of substantive content |
| Classification | One-Sentence Description | Ultra-short neutral description of the document |
| | Content Type | Functional structure and purpose |
| | Business Sector | Industry domain relevance |
| | Technical Content | Type and intensity of specialized knowledge |
| Quality & Value | Content Quality | Overall writing and presentation quality |
| | Information Density | Ratio of valuable information to redundancy |
| | Educational Value | Potential for teaching and learning |
| | Reasoning Indicators | Presence of logical reasoning and analysis |
| Audience & Purpose | Audience Level | Target sophistication level |
| | Commercial Bias | Commercial influence on objectivity |
| | Time-Sensitivity | How content value changes over time |
| Safety & Compliance | Content Safety | Presence of inappropriate or harmful content |
| | PII Presence | Contains personally identifiable information |
| Geographic | Regional Relevance | Primary regional/cultural context |
| | Country Relevance | Specific country relevance |
Read the property reference for detailed definitions and enum values.
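Several properties can be combined into one filter. The sketch below builds a predicate from a mapping of property name to allowed values. Only `educational_value` with the value `"high"` is taken from the usage example in this card; the `content_quality` column name and its values are hypothetical placeholders, so check the property reference for the real names and enum values.

```python
from typing import Callable, Collection, Mapping


def make_filter(allowed: Mapping[str, Collection[str]]) -> Callable[[Mapping[str, str]], bool]:
    """Return a predicate that keeps a row only if every listed property has an allowed value."""
    def keep(row: Mapping[str, str]) -> bool:
        # A missing property yields None, which is never in the allowed values.
        return all(row.get(prop) in values for prop, values in allowed.items())
    return keep


keep = make_filter({
    "educational_value": {"high"},
    # Hypothetical column name and values, for illustration only.
    "content_quality": {"good", "excellent"},
})

rows = [
    {"id": "a", "educational_value": "high", "content_quality": "good"},
    {"id": "b", "educational_value": "high", "content_quality": "poor"},
    {"id": "c", "educational_value": "low", "content_quality": "excellent"},
]
kept_ids = [row["id"] for row in rows if keep(row)]
print(kept_ids)
```

The same predicate could be passed to `Dataset.filter` from the `datasets` library instead of being applied to a list of dictionaries.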
Dataset Overview
This dataset is a work in progress.
We plan to add more annotations over time.
Want to suggest a dataset to be annotated next?
Feel free to open a discussion in the community tab.
Want to contribute significant compute for more annotations?
Get in touch.
Currently, we provide annotations for the following datasets:
fineweb-2
Source: FineWeb-2
| Language | Annotations |
|---|---|
| deu_Latn | 496_029_661 |
| spa_Latn | 441_303_178 |
| fra_Latn | 360_041_218 |
| ita_Latn | 239_025_466 |
| swe_Latn | 59_509_998 |
| fin_Latn | 36_741_214 |
| Total | 1_632_650_735 |
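The underscore-separated counts in these tables are valid Python integer literals, so a total can be checked directly. A small sketch using the FineWeb-2 rows above:

```python
# Per-language annotation counts, copied from the FineWeb-2 table.
fineweb2_counts = {
    "deu_Latn": 496_029_661,
    "spa_Latn": 441_303_178,
    "fra_Latn": 360_041_218,
    "ita_Latn": 239_025_466,
    "swe_Latn": 59_509_998,
    "fin_Latn": 36_741_214,
}

total = sum(fineweb2_counts.values())
# Format with underscores to match the notation used in the tables.
print(f"{total:_}")
```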
finepdfs
Source: FinePDFs
| Language | Annotations |
|---|---|
| eng_Latn | 206_917_553 |
| deu_Latn | 36_121_915 |
| fra_Latn | 27_312_269 |
| spa_Latn | 25_629_014 |
| ita_Latn | 17_451_182 |
| por_Latn | 12_045_013 |
| pol_Latn | 9_692_213 |
| nld_Latn | 7_795_696 |
| ces_Latn | 5_651_529 |
| swe_Latn | 4_125_120 |
| ron_Latn | 3_265_132 |
| hun_Latn | 3_145_494 |
| dan_Latn | 2_415_047 |
| fin_Latn | 1_980_522 |
| nob_Latn | 1_501_170 |
| Total | 365_048_869 |
hplt-3
Source: HPLT3.0
| Language | Annotations |
|---|---|
| deu_Latn | 645_362_388 |
| fin_Latn | 49_558_089 |
| Total | 694_920_477 |
finewiki
Source: finewiki
| split | Annotations |
|---|---|
| all | 43_097_138 |
SYNTH
Source: PleIAs/SYNTH
Note: text = f"{row['query']}\n\n{row['synthetic_reasoning']}\n\n{row['synthetic_answer']}"
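The note above describes how the annotated text was assembled from each SYNTH row. A small sketch of that concatenation, assuming each row exposes the three fields named in the note as strings; `build_synth_text` is a hypothetical helper name.

```python
def build_synth_text(row: dict) -> str:
    """Join query, reasoning and answer with blank lines, following the note above."""
    return f"{row['query']}\n\n{row['synthetic_reasoning']}\n\n{row['synthetic_answer']}"


# Placeholder row for illustration, not taken from the dataset.
example_row = {
    "query": "Q",
    "synthetic_reasoning": "R",
    "synthetic_answer": "A",
}
synth_text = build_synth_text(example_row)
print(repr(synth_text))
```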
| split | Annotations |
|---|---|
| all | 77_908_583 |
nemotron-cc
Source: Nemotron-CC
Note: This is only a subset of the high-actual-actual split.
| split | Annotations |
|---|---|
| high_actual | 155_688_999 |
nemotron-cc-10k-sample
Source: nemotron-cc-10K-sample
A sample from nemotron-cc, containing 10k documents from each of the five quality categories.
| Language | Annotations |
|---|---|
| eng_Latn | 50_000 |
german-commons
Source: German Commons
| split | Annotations |
|---|---|
| all | 35_716_016 |
Usage
import datasets as hfds
# load annotations for German FineWeb-2
annotations = hfds.load_dataset("openeurollm/propella-annotations", "fineweb-2", split="deu_Latn")
# example filter: high educational value
high_edu_ids = set(
    annotations
    .filter(lambda x: x["educational_value"] == "high")
    ["id"]
)
# filter German FineWeb-2 by matching ids
ds = hfds.load_dataset("HuggingFaceFW/fineweb-2", "deu_Latn", split="train", streaming=True)
filtered = ds.filter(lambda x: x["id"] in high_edu_ids)
for doc in filtered:
    print(doc["text"][:500])
    break
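The example above only keeps or drops documents. To carry an annotation value along with each document, the same id matching can be written as a lookup join. This is a sketch on plain dictionaries, assuming, as the usage example implies, that `id` values in the annotations match those in the source dataset; `attach_annotation` is a hypothetical helper name.

```python
from typing import Iterable, Iterator


def attach_annotation(docs: Iterable[dict], annotations: Iterable[dict], field: str) -> Iterator[dict]:
    """Yield each document that has an annotation, with the chosen annotation field added."""
    # Build an id -> value lookup once, then stream over the documents.
    lookup = {a["id"]: a[field] for a in annotations}
    for doc in docs:
        if doc["id"] in lookup:
            yield {**doc, field: lookup[doc["id"]]}


# Placeholder records for illustration, not taken from the dataset.
docs = [{"id": "1", "text": "foo"}, {"id": "2", "text": "bar"}]
annotations = [{"id": "2", "educational_value": "high"}]

joined = list(attach_annotation(docs, annotations, "educational_value"))
print(joined)
```

The lookup holds one entry per annotation in memory, so for the larger splits it may be worth filtering the annotations first and joining only the rows that survive.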
License
The annotation data in this repository is released under the CC-BY-4.0 license.
Citation
TBA
Acknowledgements
- This project used compute resources made available via the EuroHPC Joint Undertaking (EuroHPC JU) AI Factories initiative (AI for Industrial Innovation - Large Scale Access) on the EuroHPC supercomputer LEONARDO operated by CINECA and the LEONARDO consortium.
- This project used compute resources made available via the EuroHPC Joint Undertaking (EuroHPC JU) AI Factories initiative (AI for Industrial Innovation - Large Scale Access) on the EuroHPC supercomputer MareNostrum 5 operated by the Barcelona Supercomputing Center (BSC).
- This project is supported by the OpenEuroLLM project, co-funded by the Digital Europe Programme under GA no. 101195233. For more information see openeurollm.eu.
- This project is supported by the LLMs4EU project, co-funded by the Digital Europe Programme under GA no. 101198470. For more information see the LLMs4EU website.
- ellamind is supported by the German Federal Ministry for Economic Affairs and Energy (BMWE) under the soofi (Sovereign Open Source Foundation Models for European Intelligence) project.
- ellamind thanks the AI Service Center for Sensitive and Critical Infrastructures (KISSKI), operated by GWDG, for additional compute access.
