Legal RoBERTa Base
| Entity Passport | |
|---|---|
| Registry ID | hf-model--lexlms--legal-roberta-base |
| Provider | huggingface |
LexLM base
This model was further pre-trained from RoBERTa base (https://huggingface.co/roberta-base) on the LeXFiles corpus (https://huggingface.co/datasets/lexlms/lexfiles).
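Because the checkpoint is a standard RoBERTa-style masked language model, it can be queried directly with the Transformers fill-mask pipeline. The snippet below is a minimal usage sketch; it assumes the checkpoint is published on the Hugging Face Hub as `lexlms/legal-roberta-base`, and the example sentence is illustrative only.

```python
from transformers import pipeline

# Minimal usage sketch: query the masked-language-model head directly.
# RoBERTa tokenizers use "<mask>" as the mask token.
fill_mask = pipeline("fill-mask", model="lexlms/legal-roberta-base")
predictions = fill_mask("The applicant alleged a violation of Article 6 of the <mask>.")
for p in predictions:
    print(p["token_str"], round(p["score"], 3))
```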
Model description
LexLM (Base/Large) are our newly released RoBERTa models. We follow a series of best practices in language model development:
- We warm-start (initialize) our models from the original RoBERTa checkpoints (base or large) of Liu et al. (2019).
- We train a new tokenizer of 50k BPEs, but we reuse the original embeddings for all lexically overlapping tokens (Pfeiffer et al., 2021); a sketch of this embedding reuse appears after this list.
- We continue pre-training our models on the diverse LeXFiles corpus for an additional 1M steps with batches of 512 samples, and a 20%/30% masking rate (Wettig et al., 2022) for the base/large models, respectively.
- We use a sentence sampler with exponential smoothing of the sub-corpora sampling rates following Conneau et al. (2019), since the sub-corpora contribute disparate proportions of tokens and we aim to preserve per-corpus capacity (i.e., avoid overfitting to the largest sub-corpora); a sampling-rate sketch also follows the list.
- We use mixed-case models, similar to all recently developed large PLMs.
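The following is a minimal sketch of the embedding-reuse step mentioned above, assuming a newly trained 50k BPE tokenizer is available locally; `path/to/new-legal-tokenizer` is a placeholder, not a real repository, and the exact initialization used by the authors may differ. Rows for tokens shared with the original `roberta-base` vocabulary are copied over, while genuinely new tokens keep a fresh initialization.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

old_tok = AutoTokenizer.from_pretrained("roberta-base")
new_tok = AutoTokenizer.from_pretrained("path/to/new-legal-tokenizer")  # placeholder path
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

old_emb = model.get_input_embeddings().weight.data              # (old_vocab_size, hidden_size)
new_emb = old_emb.new_empty(len(new_tok), old_emb.size(1))
new_emb.normal_(mean=0.0, std=0.02)                              # fresh init for unseen tokens

old_vocab = old_tok.get_vocab()
for token, new_id in new_tok.get_vocab().items():
    old_id = old_vocab.get(token)
    if old_id is not None:                                       # lexically overlapping token
        new_emb[new_id] = old_emb[old_id]

# Resize the model to the new vocabulary and load the merged embedding matrix.
model.resize_token_embeddings(len(new_tok))
model.get_input_embeddings().weight.data.copy_(new_emb)
```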
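And here is a small sketch of exponentially smoothed sub-corpus sampling rates in the spirit of Conneau et al. (2019); the smoothing exponent `alpha` and the token counts are illustrative values, not the ones used for LexLM.

```python
import numpy as np

def smoothed_sampling_rates(token_counts, alpha=0.5):
    """Exponentially smoothed sub-corpus sampling rates.

    token_counts: dict mapping sub-corpus name -> token count.
    alpha=1.0 keeps the raw token proportions; smaller alpha moves towards
    uniform sampling, up-weighting small sub-corpora.
    """
    names = list(token_counts)
    p = np.array([token_counts[n] for n in names], dtype=float)
    p /= p.sum()                      # raw token proportions
    q = p ** alpha
    q /= q.sum()                      # smoothed sampling probabilities
    return dict(zip(names, q))

# Toy example: smoothing noticeably up-weights the small sub-corpus.
print(smoothed_sampling_rates({"us-case-law": 30_000_000_000, "eu-legislation": 500_000_000}))
```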
Intended uses & limitations
More information needed
Training and evaluation data
The model was trained on the LeXFiles corpus (https://huggingface.co/datasets/lexlms/lexfiles). For evaluation results, please consider our work "LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development" (Chalkidis et al., 2023).
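For reference, the corpus itself can be streamed with the `datasets` library. The configuration name below is an assumption used only for illustration; check the dataset card at https://huggingface.co/datasets/lexlms/lexfiles for the actual sub-corpus names.

```python
from datasets import load_dataset

# Streaming avoids downloading the full corpus; "eu_legislation" is a hypothetical
# configuration name used only for illustration.
lexfiles = load_dataset("lexlms/lexfiles", "eu_legislation", split="train", streaming=True)
print(next(iter(lexfiles)))
```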
Training procedure
Training hyperparameters
The following hyperparameters were used during training (a Trainer-style reconstruction appears after this list):
- learning_rate: 0.0001
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- distributed_type: tpu
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 512
- total_eval_batch_size: 256
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.05
- training_steps: 1000000
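The block below is an approximate reconstruction of these settings with the Transformers `TrainingArguments` API, not the authors' actual TPU/XLA training script; the output directory is a placeholder, and the 20% masking rate from the model description is expressed through the data collator.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("lexlms/legal-roberta-base")

# 20% masking rate for the base model (see the model description above).
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.20
)

args = TrainingArguments(
    output_dir="legal-roberta-base-mlm",   # placeholder output path
    learning_rate=1e-4,
    per_device_train_batch_size=32,        # 32 x 8 devices x 2 accumulation steps = 512 total
    per_device_eval_batch_size=32,         # 32 x 8 devices = 256 total eval batch size
    gradient_accumulation_steps=2,
    max_steps=1_000_000,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    seed=42,
)
```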
Training results
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 1.0389 | 0.05 | 50000 | 0.9802 |
| 0.9685 | 0.1 | 100000 | 0.9021 |
| 0.9337 | 0.15 | 150000 | 0.8752 |
| 0.9106 | 0.2 | 200000 | 0.8558 |
| 0.8981 | 0.25 | 250000 | 0.8512 |
| 0.8813 | 1.03 | 300000 | 0.8203 |
| 0.8899 | 1.08 | 350000 | 0.8286 |
| 0.8581 | 1.13 | 400000 | 0.8148 |
| 0.856 | 1.18 | 450000 | 0.8141 |
| 0.8527 | 1.23 | 500000 | 0.8034 |
| 0.8345 | 2.02 | 550000 | 0.7763 |
| 0.8342 | 2.07 | 600000 | 0.7862 |
| 0.8147 | 2.12 | 650000 | 0.7842 |
| 0.8369 | 2.17 | 700000 | 0.7766 |
| 0.814 | 2.22 | 750000 | 0.7737 |
| 0.8046 | 2.27 | 800000 | 0.7692 |
| 0.7941 | 3.05 | 850000 | 0.7538 |
| 0.7956 | 3.1 | 900000 | 0.7562 |
| 0.8068 | 3.15 | 950000 | 0.7512 |
| 0.8066 | 3.2 | 1000000 | 0.7516 |
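Since the validation loss is the cross-entropy over masked tokens, it can be translated into a rough pseudo-perplexity for a quick sanity check; the snippet below simply exponentiates the final value from the table.

```python
import math

final_val_loss = 0.7516          # last row of the table above
print(math.exp(final_val_loss))  # ~2.12 pseudo-perplexity over masked positions
```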
Framework versions
- Transformers 4.20.0
- PyTorch 1.12.0+cu102
- Datasets 2.6.1
- Tokenizers 0.12.0
Citation
@inproceedings{chalkidis-garneau-etal-2023-lexlms,
title = {{LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development}},
author = "Chalkidis*, Ilias and
Garneau*, Nicolas and
Goanta, Catalina and
Katz, Daniel Martin and
SΓΈgaard, Anders",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics",
    month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/2305.07507",
}