Legal RoBERTa Base
| Entity Passport | |
|---|---|
| Registry ID | hf-model--lexlms--legal-roberta-base |
| Provider | huggingface |
LexLM base
This model was further pre-trained from RoBERTa base (https://huggingface.co/roberta-base) on the LeXFiles corpus (https://huggingface.co/datasets/lexlms/lexfiles).
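Because the checkpoint is a standard RoBERTa-style masked language model, it can be queried directly with the Transformers fill-mask pipeline. The snippet below is a minimal usage sketch; it assumes the checkpoint is published on the Hugging Face Hub as `lexlms/legal-roberta-base`, and the example sentence is illustrative only.

```python
from transformers import pipeline

# Minimal usage sketch: query the masked-language-model head directly.
# RoBERTa tokenizers use "<mask>" as the mask token.
fill_mask = pipeline("fill-mask", model="lexlms/legal-roberta-base")
predictions = fill_mask("The applicant alleged a violation of Article 6 of the <mask>.")
for p in predictions:
    print(p["token_str"], round(p["score"], 3))
```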
Model description
LexLM (Base/Large) are our newly released RoBERTa models. We follow a series of best practices in language model development:
- We warm-start (initialize) our models from the original RoBERTa checkpoints (base or large) of Liu et al. (2019).
- We train a new tokenizer of 50k BPEs, but we reuse the original embeddings for all lexically overlapping tokens (Pfeiffer et al., 2021); a sketch of this embedding reuse appears after this list.
- We continue pre-training our models on the diverse LeXFiles corpus for an additional 1M steps with batches of 512 samples, and a 20%/30% masking rate (Wettig et al., 2022) for the base/large models, respectively.
- We use a sentence sampler with exponential smoothing of the sub-corpora sampling rates following Conneau et al. (2019), since the sub-corpora contribute disparate proportions of tokens and we aim to preserve per-corpus capacity (i.e., avoid overfitting to the largest sub-corpora); a sampling-rate sketch also follows the list.
- We use mixed-case models, similar to all recently developed large PLMs.
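The following is a minimal sketch of the embedding-reuse step mentioned above, assuming a newly trained 50k BPE tokenizer is available locally; `path/to/new-legal-tokenizer` is a placeholder, not a real repository, and the exact initialization used by the authors may differ. Rows for tokens shared with the original `roberta-base` vocabulary are copied over, while genuinely new tokens keep a fresh initialization.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

old_tok = AutoTokenizer.from_pretrained("roberta-base")
new_tok = AutoTokenizer.from_pretrained("path/to/new-legal-tokenizer")  # placeholder path
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

old_emb = model.get_input_embeddings().weight.data              # (old_vocab_size, hidden_size)
new_emb = old_emb.new_empty(len(new_tok), old_emb.size(1))
new_emb.normal_(mean=0.0, std=0.02)                              # fresh init for unseen tokens

old_vocab = old_tok.get_vocab()
for token, new_id in new_tok.get_vocab().items():
    old_id = old_vocab.get(token)
    if old_id is not None:                                       # lexically overlapping token
        new_emb[new_id] = old_emb[old_id]

# Resize the model to the new vocabulary and load the merged embedding matrix.
model.resize_token_embeddings(len(new_tok))
model.get_input_embeddings().weight.data.copy_(new_emb)
```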
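And here is a small sketch of exponentially smoothed sub-corpus sampling rates in the spirit of Conneau et al. (2019); the smoothing exponent `alpha` and the token counts are illustrative values, not the ones used for LexLM.

```python
import numpy as np

def smoothed_sampling_rates(token_counts, alpha=0.5):
    """Exponentially smoothed sub-corpus sampling rates.

    token_counts: dict mapping sub-corpus name -> token count.
    alpha=1.0 keeps the raw token proportions; smaller alpha moves towards
    uniform sampling, up-weighting small sub-corpora.
    """
    names = list(token_counts)
    p = np.array([token_counts[n] for n in names], dtype=float)
    p /= p.sum()                      # raw token proportions
    q = p ** alpha
    q /= q.sum()                      # smoothed sampling probabilities
    return dict(zip(names, q))

# Toy example: smoothing noticeably up-weights the small sub-corpus.
print(smoothed_sampling_rates({"us-case-law": 30_000_000_000, "eu-legislation": 500_000_000}))
```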
Intended uses & limitations
More information needed
Training and evaluation data
The model was trained on the LeXFiles corpus (https://huggingface.co/datasets/lexlms/lexfiles). For evaluation results, please consider our work "LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development" (Chalkidis et al., 2023).
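For reference, the corpus itself can be streamed with the `datasets` library. The configuration name below is an assumption used only for illustration; check the dataset card at https://huggingface.co/datasets/lexlms/lexfiles for the actual sub-corpus names.

```python
from datasets import load_dataset

# Streaming avoids downloading the full corpus; "eu_legislation" is a hypothetical
# configuration name used only for illustration.
lexfiles = load_dataset("lexlms/lexfiles", "eu_legislation", split="train", streaming=True)
print(next(iter(lexfiles)))
```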
Training procedure
Training hyperparameters
The following hyperparameters were used during training (a Trainer-style reconstruction appears after this list):
- learning_rate: 0.0001
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- distributed_type: tpu
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 512
- total_eval_batch_size: 256
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.05
- training_steps: 1000000
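The block below is an approximate reconstruction of these settings with the Transformers `TrainingArguments` API, not the authors' actual TPU/XLA training script; the output directory is a placeholder, and the 20% masking rate from the model description is expressed through the data collator.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("lexlms/legal-roberta-base")

# 20% masking rate for the base model (see the model description above).
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.20
)

args = TrainingArguments(
    output_dir="legal-roberta-base-mlm",   # placeholder output path
    learning_rate=1e-4,
    per_device_train_batch_size=32,        # 32 x 8 devices x 2 accumulation steps = 512 total
    per_device_eval_batch_size=32,         # 32 x 8 devices = 256 total eval batch size
    gradient_accumulation_steps=2,
    max_steps=1_000_000,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    seed=42,
)
```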
Training results
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 1.0389 | 0.05 | 50000 | 0.9802 |
| 0.9685 | 0.1 | 100000 | 0.9021 |
| 0.9337 | 0.15 | 150000 | 0.8752 |
| 0.9106 | 0.2 | 200000 | 0.8558 |
| 0.8981 | 0.25 | 250000 | 0.8512 |
| 0.8813 | 1.03 | 300000 | 0.8203 |
| 0.8899 | 1.08 | 350000 | 0.8286 |
| 0.8581 | 1.13 | 400000 | 0.8148 |
| 0.856 | 1.18 | 450000 | 0.8141 |
| 0.8527 | 1.23 | 500000 | 0.8034 |
| 0.8345 | 2.02 | 550000 | 0.7763 |
| 0.8342 | 2.07 | 600000 | 0.7862 |
| 0.8147 | 2.12 | 650000 | 0.7842 |
| 0.8369 | 2.17 | 700000 | 0.7766 |
| 0.814 | 2.22 | 750000 | 0.7737 |
| 0.8046 | 2.27 | 800000 | 0.7692 |
| 0.7941 | 3.05 | 850000 | 0.7538 |
| 0.7956 | 3.1 | 900000 | 0.7562 |
| 0.8068 | 3.15 | 950000 | 0.7512 |
| 0.8066 | 3.2 | 1000000 | 0.7516 |
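Since the validation loss is the cross-entropy over masked tokens, it can be translated into a rough pseudo-perplexity for a quick sanity check; the snippet below simply exponentiates the final value from the table.

```python
import math

final_val_loss = 0.7516          # last row of the table above
print(math.exp(final_val_loss))  # ~2.12 pseudo-perplexity over masked positions
```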
Framework versions
- Transformers 4.20.0
- PyTorch 1.12.0+cu102
- Datasets 2.6.1
- Tokenizers 0.12.0
Citation
@inproceedings{chalkidis-garneau-etal-2023-lexlms,
title = {{LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development}},
author = "Chalkidis*, Ilias and
Garneau*, Nicolas and
Goanta, Catalina and
Katz, Daniel Martin and
SΓΈgaard, Anders",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics",
    month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/2305.07507",
}