Sentence Transformers v5.4 ships native multimodal support — text, image, audio, and video now share one embedding space via a familiar Python API.
Hugging Face released Sentence Transformers v5.4, adding multimodal embedding and reranker support across text, images, audio, and video. The update includes support for VLM-based models like Qwen3-VL-Embedding-2B and 8B variants, plus legacy CLIP models for low-resource hardware. The API is backward-compatible, with one deprecation: 'tokenizer_kwargs' is renamed to 'processor_kwargs'. Models load via the existing SentenceTransformer class with optional modality-specific pip extras.
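The practical upshot of a shared embedding space is that vectors from different modalities are directly comparable with cosine similarity. A minimal sketch of that idea with stand-in unit vectors (the arrays below are toy data, not real model outputs, and the helper name is illustrative):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for embeddings a multimodal model would return
# for a text query and two candidate images in the same space.
text_query = np.array([0.9, 0.1, 0.0])
image_match = np.array([0.8, 0.2, 0.1])   # semantically close candidate
image_other = np.array([0.0, 0.1, 0.9])   # unrelated candidate

# Cross-modal retrieval then reduces to ranking candidates by similarity.
print(cosine_sim(text_query, image_match) > cosine_sim(text_query, image_other))  # True
```

Because every modality lands in one space, the same nearest-neighbor index can serve text, image, audio, and video queries.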
With Sentence Transformers v5.4 you no longer need separate pipelines for text and image retrieval — one model, one API, one embedding space. The Qwen3-VL-2B model requires ~8GB VRAM, so it's GPU-only, but CLIP variants still run on CPU for lightweight use cases. The one deprecation (tokenizer_kwargs renamed to processor_kwargs) means existing code needs at most a one-line fix, not a rewrite.
Swap the text-only embedding model behind your retrieval index for Qwen3-VL-Embedding-2B on any pipeline where users query documents with screenshots or product images, then re-index — measure recall@5 before and after to validate the upgrade.
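Recall@5 is the fraction of queries whose relevant document appears in the top five retrieved results. A minimal sketch of the before/after measurement (function and variable names are illustrative, not from the library; the result lists are toy data):

```python
def recall_at_k(ranked_ids_per_query, relevant_id_per_query, k=5):
    """Fraction of queries whose relevant document id appears in the top-k results."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids_per_query, relevant_id_per_query)
        if relevant in ranked[:k]
    )
    return hits / len(relevant_id_per_query)

# Run the same query set against the old (text-only) and new (multimodal) index.
old_results = [["d3", "d7", "d1", "d9", "d2"], ["d5", "d8", "d6", "d0", "d2"]]
new_results = [["d1", "d3", "d7", "d2", "d9"], ["d4", "d5", "d8", "d0", "d6"]]
relevant = ["d1", "d4"]

print(recall_at_k(old_results, relevant))  # 0.5 — second query misses d4
print(recall_at_k(new_results, relevant))  # 1.0 — both queries hit in the top 5
```

Holding the query set and relevance labels fixed isolates the effect of the model swap on retrieval quality.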
Run: pip install -U 'sentence-transformers[image]'