Sentence Transformers released a guide and pipeline for fine-tuning multimodal embedding and reranker models on custom data, with a worked example whose fine-tuned model beats VDR models up to 4x its size.
The Sentence Transformers library now supports training and fine-tuning multimodal embedding and reranker models that handle text, images, audio, and video. A worked example fine-tunes Qwen3-VL-Embedding-2B on Visual Document Retrieval (VDR), improving NDCG@10 from 0.888 to 0.947 and outperforming all tested VDR models, including those up to 4x larger. The pipeline reuses the existing SentenceTransformerTrainer, so existing text-only training code transfers almost directly. Training scripts and the fine-tuned model (tomaarsen/Qwen3-VL-Embedding-2B-vdr) are publicly available on Hugging Face.
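For context on what the reported numbers mean, NDCG@10 rewards placing relevant documents near the top of the ranking and normalizes by the best possible ordering. A minimal, self-contained sketch of the metric (not the benchmark's exact evaluation harness):

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for a single query.

    `relevances` are graded relevance labels in ranked order
    (the label of the highest-scored document first).
    """
    def dcg(rels):
        # Discounted cumulative gain: later positions count less.
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))  # best possible ordering
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A perfect ranking scores 1.0; the benchmark score is this value averaged over all queries, which is why a jump from 0.888 to 0.947 reflects relevant pages moving meaningfully closer to rank 1.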
Sentence Transformers' SentenceTransformerTrainer now handles multimodal inputs (text, images, audio, video) with the same API as text-only training. The Qwen3-VL-Embedding-2B fine-tune hitting NDCG@10 of 0.947 on VDR is a concrete proof point: domain-specific fine-tuning on a small model beats general-purpose giants. If you're building any RAG pipeline over document images, PDFs, or mixed-modal corpora, the existing tooling now removes the custom training loop entirely.
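Because the trainer API is the same as for text-only training, a multimodal fine-tune looks like any other Sentence Transformers run. A hedged sketch under assumptions: the model ID and dataset name below are illustrative placeholders, not the released training script, and the dataset is assumed to hold (text query, document page image) pairs.

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Assumed model ID for illustration; check the model card for the exact name.
model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")

# Hypothetical dataset of (query, positive page image) pairs.
train_dataset = load_dataset("my-org/my-vdr-pairs", split="train")

# In-batch negatives: every other positive in the batch is a negative.
loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="qwen3-vl-embedding-2b-vdr",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
model.save_pretrained("qwen3-vl-embedding-2b-vdr/final")
```

The point of the sketch is the shape: swapping a text corpus for an image-bearing one changes the dataset, not the trainer, arguments, or loss setup.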
Pull the VDR training script from Hugging Face this week, swap in your own document screenshot dataset, and benchmark NDCG@10 against your current embedding model — a 2B model fine-tuned on your domain will likely beat the off-the-shelf 7B you're paying for.
Run: pip install sentence-transformers and clone the VDR training example from https://huggingface.co/tomaarsen/Qwen3-VL-Embedding-2B-vdr