BibTeX
@misc{hf_model__kojima_lab__molcrawl_molecule_nat_lang_bert_small,
author = {Kojima Lab},
title = {Molcrawl Molecule Nat Lang Bert Small Model},
year = {2026},
howpublished = {\url{https://huggingface.co/kojima-lab/molcrawl-molecule-nat-lang-bert-small}},
}
APA Style
Kojima Lab. (2026). Molcrawl Molecule Nat Lang Bert Small [Model]. Hugging Face. https://huggingface.co/kojima-lab/molcrawl-molecule-nat-lang-bert-small
BERT-small foundation model pre-trained on molecule-related natural language text with a masked language modeling (MLM) objective.
Usage
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

model = AutoModelForMaskedLM.from_pretrained("kojima-lab/molcrawl-molecule-nat-lang-bert-small")
tokenizer = AutoTokenizer.from_pretrained("kojima-lab/molcrawl-molecule-nat-lang-bert-small")

# Predict a masked token.
# Use tokenizer.mask_token instead of a hardcoded "[MASK]":
# the mask string varies across BERT-style tokenizers ("[MASK]", "<mask>", etc.).
if tokenizer.mask_token is None:
    raise ValueError("This tokenizer has no mask_token; masked LM inference is not supported.")

prompt = "your input {MASK} sequence".replace("{MASK}", tokenizer.mask_token)
inputs = tokenizer(prompt, return_tensors="pt")

# Locate the position of the mask token in the encoded input.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    outputs = model(**inputs)

# Take the highest-scoring vocabulary entry at the masked position.
logits = outputs.logits
predicted_token_id = logits[0, mask_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
result = prompt.replace(tokenizer.mask_token, predicted_token)
print(f"Predicted: {result}")
Training
This model was trained with the RIKEN Foundation Model pipeline.
For more details, please refer to the training configuration files included in this repository.
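Those files can be located without cloning the repository by enumerating its contents with huggingface_hub. A minimal sketch; the training-config file names are whatever the repository actually contains (only config.json is guaranteed by the transformers checkpoint format):

from huggingface_hub import hf_hub_download, list_repo_files

repo_id = "kojima-lab/molcrawl-molecule-nat-lang-bert-small"

# List every file shipped with the checkpoint, including any training configs.
for filename in list_repo_files(repo_id):
    print(filename)

# Download a specific file to the local cache; config.json always exists.
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")
print(config_path)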
License
This model is released under the Apache-2.0 license.