Sentence Transformers v5.4 ships native multimodal support — text, image, audio, and video now share one embedding space via a familiar Python API.
Hugging Face released Sentence Transformers v5.4, adding multimodal embedding and reranker support across text, images, audio, and video. The update includes support for VLM-based models like Qwen3-VL-Embedding-2B and 8B variants, plus legacy CLIP models for low-resource hardware. The API is backward-compatible, with one deprecation: 'tokenizer_kwargs' is renamed to 'processor_kwargs'. Models load via the existing SentenceTransformer class with optional modality-specific pip extras.
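The practical upshot of a shared embedding space is that vectors from different modalities are directly comparable with cosine similarity. A minimal sketch of that idea with stand-in unit vectors (the arrays below are toy data, not real model outputs, and the helper name is illustrative):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for embeddings a multimodal model would return
# for a text query and two candidate images in the same space.
text_query = np.array([0.9, 0.1, 0.0])
image_match = np.array([0.8, 0.2, 0.1])   # semantically close candidate
image_other = np.array([0.0, 0.1, 0.9])   # unrelated candidate

# Cross-modal retrieval then reduces to ranking candidates by similarity.
print(cosine_sim(text_query, image_match) > cosine_sim(text_query, image_other))  # True
```

Because every modality lands in one space, the same nearest-neighbor index can serve text, image, audio, and video queries.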
With Sentence Transformers v5.4 you no longer need separate pipelines for text and image retrieval — one model, one API, one embedding space. The Qwen3-VL-2B model requires ~8GB VRAM, so it's GPU-only, but CLIP variants still run on CPU for lightweight use cases. The one deprecation (tokenizer_kwargs renamed to processor_kwargs) means existing code needs at most a one-line fix, not a rewrite.
Swap the text-only embedding model behind your retrieval index for Qwen3-VL-Embedding-2B on any pipeline where users query documents with screenshots or product images, then re-index — measure recall@5 before and after to validate the upgrade.
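Recall@5 is the fraction of queries whose relevant document appears in the top five retrieved results. A minimal sketch of the before/after measurement (function and variable names are illustrative, not from the library; the result lists are toy data):

```python
def recall_at_k(ranked_ids_per_query, relevant_id_per_query, k=5):
    """Fraction of queries whose relevant document id appears in the top-k results."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids_per_query, relevant_id_per_query)
        if relevant in ranked[:k]
    )
    return hits / len(relevant_id_per_query)

# Run the same query set against the old (text-only) and new (multimodal) index.
old_results = [["d3", "d7", "d1", "d9", "d2"], ["d5", "d8", "d6", "d0", "d2"]]
new_results = [["d1", "d3", "d7", "d2", "d9"], ["d4", "d5", "d8", "d0", "d6"]]
relevant = ["d1", "d4"]

print(recall_at_k(old_results, relevant))  # 0.5 — second query misses d4
print(recall_at_k(new_results, relevant))  # 1.0 — both queries hit in the top 5
```

Holding the query set and relevance labels fixed isolates the effect of the model swap on retrieval quality.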
Run: pip install -U 'sentence-transformers[image]'