BidirLM-Omni-2.5B-Embedding
| Entity Passport | |
|---|---|
| Registry ID | hf-model--bidirlm--bidirlm-omni-2.5b-embedding |
| License | Apache-2.0 |
| Provider | huggingface |
Compute Threshold
~3.2 GB VRAM
* Static estimate for 4-bit quantization.
Cite this model
Academic & Research Attribution

```bibtex
@misc{hf_model__bidirlm__bidirlm_omni_2.5b_embedding,
  author = {BidirLM},
  title = {Bidirlm Omni 2.5b Embedding Model},
  year = {2026},
  howpublished = {\url{https://huggingface.co/bidirlm/bidirlm-omni-2.5b-embedding}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}
```
Quick Commands

```shell
ollama run bidirlm-omni-2.5b-embedding
huggingface-cli download bidirlm/bidirlm-omni-2.5b-embedding
pip install -U transformers
```

Nexus Index V2.0
Index Insight
FNI V2.0 for BidirLM-Omni-2.5B-Embedding: Semantic (S:50), Authority (A:0), Popularity (P:38), Recency (R:98), Quality (Q:65).
Technical Deep Dive
BidirLM-Omni-2.5B
BidirLM-Omni is the omnimodal variant of the BidirLM family: a 2.5B-parameter bidirectional encoder that jointly embeds text, images, and audio into a shared representation space, enabling state-of-the-art embedding performance.

Supported Tasks
Multimodal embeddings (via Sentence Transformers): cross-modal retrieval (text ↔ image, text ↔ audio), multimodal semantic similarity, clustering, and classification across text, image, and audio modalities.
Text-only downstream fine-tuning (via Transformers): sequence classification (e.g. MNLI, XNLI), token classification (e.g. NER), sequence regression.
Supported Languages
Multilingual support across 119 languages, inherited from the Qwen3 base model and reinforced through contrastive training covering 87 of them.
Usage
Sentence Transformers
Pass text strings, PIL.Image objects, or audio dicts (with "array" and "sampling_rate" keys) directly to encode(). All modalities produce embeddings in the same 2048-dimensional space and can be compared cross-modally.
BidirLM-Omni-2.5B-Embedding: Cross-Modal Similarity Demo
Setup

```python
import numpy as np
import PIL.Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BidirLM/BidirLM-Omni-2.5B-Embedding", trust_remote_code=True)
```
Inputs

```python
# Text queries
texts = [
    "An image with a red background.",
    "An image with a blue background.",
    "A deep bass sound.",
    "A high-pitched sound.",
]

# Images: synthetic solid-color 256×256 images
images = [
    PIL.Image.fromarray(np.full((256, 256, 3), (220, 30, 30), dtype=np.uint8)),  # red
    PIL.Image.fromarray(np.full((256, 256, 3), (30, 30, 220), dtype=np.uint8)),  # blue
]

# Audio: synthetic sine waves at 16 kHz, 2 seconds each
sr = 16000
t = np.linspace(0, 2.0, sr * 2, endpoint=False, dtype=np.float32)
audios = [
    {"array": np.sin(2 * np.pi * 80 * t), "sampling_rate": sr},    # 80 Hz, bass
    {"array": np.sin(2 * np.pi * 7500 * t), "sampling_rate": sr},  # 7500 Hz, high
]
```
Encoding & Similarity

```python
text_embeddings = model.encode(texts)
image_embeddings = model.encode(images)
audio_embeddings = model.encode(audios)

print(model.similarity(text_embeddings, image_embeddings))
print(model.similarity(text_embeddings, audio_embeddings))
```
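For orientation, `model.similarity` in Sentence Transformers defaults to cosine similarity unless the model configures a different function, so the matrices above can be reproduced by hand. A minimal sketch with toy 4-dimensional vectors standing in for the real 2048-dimensional embeddings (the numbers are illustrative, not model output):

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between two sets of row vectors."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Toy "embeddings": two text vectors and two image vectors
text_emb = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0, 0.0]])
image_emb = np.array([[0.9, 0.1, 0.0, 0.0],
                      [0.1, 0.9, 0.0, 0.0]])

print(cosine_similarity_matrix(text_emb, image_emb))
```

Row i, column j is the similarity between text i and image j; matching pairs dominating their row is exactly the pattern in the result tables.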
Results
Text ↔ Image Similarity

| Text | Red image | Blue image | Best match |
|---|---|---|---|
| "An image with a red background." | +0.6918 | +0.3199 | Red ✓ |
| "An image with a blue background." | +0.4255 | +0.6498 | Blue ✓ |
| "A deep bass sound." | +0.1508 | +0.2302 | – (low) |
| "A high-pitched sound." | +0.1404 | +0.1816 | – (low) |
Text ↔ Audio Similarity

| Text | 80 Hz (bass) | 7500 Hz (high) | Best match |
|---|---|---|---|
| "An image with a red background." | +0.0022 | +0.0422 | – (low) |
| "An image with a blue background." | +0.0517 | +0.0642 | – (low) |
| "A deep bass sound." | +0.5448 | +0.4217 | Bass ✓ |
| "A high-pitched sound." | +0.4003 | +0.5170 | High ✓ |
Audio inputs are automatically resampled to the model's native sampling rate if needed; any source rate is accepted.
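The resampling step can be pictured with a naive linear-interpolation version. The model itself relies on librosa's higher-quality resampler internally; this sketch only illustrates the idea:

```python
import numpy as np

def resample_linear(audio, orig_sr, target_sr):
    """Naive linear-interpolation resampling (illustration only)."""
    duration = len(audio) / orig_sr
    n_target = int(round(duration * target_sr))
    t_orig = np.arange(len(audio)) / orig_sr
    t_target = np.arange(n_target) / target_sr
    return np.interp(t_target, t_orig, audio)

# 1 second of a 440 Hz tone recorded at 44.1 kHz, brought down to 16 kHz
sr_in, sr_out = 44100, 16000
t = np.linspace(0, 1.0, sr_in, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t).astype(np.float32)
resampled = resample_linear(tone, sr_in, sr_out)
print(len(resampled))  # 16000 samples
```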
Manual Tokenization with Chat Template
Use AutoProcessor directly to build inputs from a conversation dict, giving full control over the prompt before encoding.
```python
import numpy as np
import PIL.Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "BidirLM/BidirLM-Omni-2.5B-Embedding", trust_remote_code=True
)

# ── Text-only ─────────────────────────────────────────────────────────────────
conversation_text = [
    {"role": "user", "content": [{"type": "text", "text": "An image with a red background."}]}
]

# ── Text + Image ──────────────────────────────────────────────────────────────
image = PIL.Image.fromarray(
    np.full((256, 256, 3), (220, 30, 30), dtype=np.uint8)  # red
)
conversation_image = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# ── Text + Audio ──────────────────────────────────────────────────────────────
sr = 16000
t = np.linspace(0, 2.0, sr * 2, endpoint=False, dtype=np.float32)
audio_array = np.sin(2 * np.pi * 80 * t)  # 80 Hz bass tone

conversation_audio = [
    {
        "role": "user",
        "content": [
            {"type": "audio"},
            {"type": "text", "text": "Describe this sound."},
        ],
    }
]

# ── Apply chat template and tokenize ──────────────────────────────────────────
text = processor.apply_chat_template(conversation_text, tokenize=False, add_generation_prompt=False)
inputs_text = processor(text=text, return_tensors="pt")

text = processor.apply_chat_template(conversation_image, tokenize=False, add_generation_prompt=False)
inputs_image = processor(text=text, images=image, return_tensors="pt")

text = processor.apply_chat_template(conversation_audio, tokenize=False, add_generation_prompt=False)
inputs_audio = processor(text=text, audio=[audio_array], return_tensors="pt")
```
Fine-tuning for Downstream Tasks

```python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
)

tokenizer = AutoTokenizer.from_pretrained(
    "BidirLM/BidirLM-Omni-2.5B-Embedding", trust_remote_code=True
)

# Sequence classification (e.g., NLI)
seq_model = AutoModelForSequenceClassification.from_pretrained(
    "BidirLM/BidirLM-Omni-2.5B-Embedding",
    trust_remote_code=True,
    num_labels=3,
)

# Token classification (e.g., NER)
tok_model = AutoModelForTokenClassification.from_pretrained(
    "BidirLM/BidirLM-Omni-2.5B-Embedding",
    trust_remote_code=True,
    num_labels=7,
)

# Fine-tune with the Hugging Face Trainer as usual.
```
Requirements

```
transformers>=5.5.0
sentence-transformers>=5.2.0
```

Optional dependency for audio inputs at non-native sample rates:

```
librosa>=0.10.0
```
FAQ
1. What pooling strategy does this model use?
The model uses mean pooling across all modalities. This is handled automatically when using Sentence Transformers.
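That answer can be sketched in plain NumPy: mask-aware mean pooling averages only the non-padding token vectors. The `hidden` tensor below is stand-in data, not real model output:

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Mask-aware mean pooling: average token vectors, ignoring padding."""
    mask = attention_mask[..., None].astype(hidden_states.dtype)  # (B, T, 1)
    summed = (hidden_states * mask).sum(axis=1)                   # (B, H)
    counts = np.maximum(mask.sum(axis=1), 1e-9)                   # (B, 1)
    return summed / counts

# Batch of 2 sequences, 4 tokens, hidden size 3; second sequence has 2 pad tokens
hidden = np.arange(24, dtype=np.float32).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 1], [1, 1, 0, 0]])
print(mean_pool(hidden, mask))
```

For the first sequence this is a plain average over all four tokens; for the second, the two padding positions contribute nothing.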
2. Do I need `trust_remote_code=True`?
Yes. BidirLM-Omni uses a custom bidirectional omnimodal architecture that requires loading custom code from the repository.
3. Can I compare embeddings across modalities?
Yes. Text, image, and audio embeddings live in the same 2048-dimensional space and can be compared directly using cosine similarity.
4. What audio formats and sample rates are supported?
Any sample rate is accepted; the model resamples internally using librosa when the source rate differs from the native rate. Any audio format readable by standard libraries (WAV, MP3, FLAC, etc.) can be used by loading it into a NumPy array first.
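As a sketch of that last point, the standard-library `wave` module is enough to turn a 16-bit PCM WAV (here synthesized in memory) into the `{"array", "sampling_rate"}` dict shown earlier; the tone parameters are illustrative:

```python
import io
import wave
import numpy as np

# Write a 0.5 s, 16 kHz mono 440 Hz tone as 16-bit PCM WAV into memory
sr = 16000
t = np.linspace(0, 0.5, sr // 2, endpoint=False)
pcm = (np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)

buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)   # 16-bit samples
    w.setframerate(sr)
    w.writeframes(pcm.tobytes())

# Read it back and convert to the float dict format the model expects
buf.seek(0)
with wave.open(buf, "rb") as w:
    frames = w.readframes(w.getnframes())
    rate = w.getframerate()
array = np.frombuffer(frames, dtype=np.int16).astype(np.float32) / 32768.0
audio_input = {"array": array, "sampling_rate": rate}
```

For MP3 or FLAC you would use a decoder such as soundfile or librosa instead of `wave`, ending up with the same dict.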
Citation

```bibtex
@misc{boizard2026bidirlmtextomnimodalbidirectional,
  title={BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs},
  author={Nicolas Boizard and Théo Deschamps-Berger and Hippolyte Gisserot-Boukhlef and Céline Hudelot and Pierre Colombo},
  year={2026},
  eprint={2604.02045},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.02045},
}
```
⚠️ Incomplete Data
Some information about this model is not available. Use with caution: verify details from the original source before relying on this data.
View Original Source

Limitations & Considerations
- Benchmark scores may vary based on evaluation methodology and hardware configuration.
- VRAM requirements are estimates; actual usage depends on quantization and batch size.
- FNI scores are relative rankings and may change as new models are added.
- License: listed as Apache-2.0 elsewhere on this page; verify licensing terms before commercial use.
AI Summary: Based on Hugging Face metadata. Not a recommendation.
Model Transparency Report
Technical metadata sourced from upstream repositories.
Identity & Source
- id: hf-model--bidirlm--bidirlm-omni-2.5b-embedding
- slug: bidirlm--bidirlm-omni-2.5b-embedding
- source: huggingface
- author: BidirLM
- license: Apache-2.0
- tags: sentence-transformers, safetensors, bidirlm_omni, mteb, transformers, embedding, bidirectional, multilingual, sentence-similarity, custom_code, af, am, ar, az, be, bg, bn, bs, ca, ceb, cs, cy, da, de, el, en, es, et, eu, fa, fi, fr, ga, gl, gu, ha, he, hi, hr, ht, hu, hy, id, ig, is, it, ja, jv, ka, kk, kn, ko, ky, lt, lv, mg, mk, ml, mr, ms, mt, my, nb, ne, nl, nso, ny, pa, pl, ps, pt, ro, ru, sd, si, sk, sl, sn, so, sq, sr, su, sv, sw, ta, te, th, tl, tr, uk, ur, vi, wo, xh, yo, zh, zu, arxiv:
Technical Specs
- architecture: null
- params (billions): 2.5
- context length: 4,096
- pipeline tag: sentence-similarity
- vram (gb): 3.2
- vram is estimated: true
- vram formula: VRAM ≈ (params × 0.75) + 0.8 GB (KV) + 0.5 GB (OS)
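As a sanity check, plugging the listed parameter count into that formula reproduces the reported ~3.2 GB figure:

```python
# Walking through the card's VRAM estimation formula with its own numbers
params_b = 2.5                  # parameters, in billions
weights_gb = params_b * 0.75    # weight memory per the formula's 0.75 factor
kv_gb = 0.8                     # KV-cache headroom
os_gb = 0.5                     # runtime / OS overhead
estimate = weights_gb + kv_gb + os_gb
print(estimate)  # ~3.175, rounded to 3.2 GB on the card
```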
Engagement & Metrics
- downloads: 8,004
- stars: 0
- forks: 0

Data indexed from public sources. Updated daily.