Sentence Transformers released a guide and pipeline for fine-tuning multimodal embedding and reranker models on custom data, with a worked example whose fine-tuned model beats VDR models up to 4x its size.
The Sentence Transformers library now supports training and fine-tuning multimodal embedding and reranker models that handle text, images, audio, and video. A worked example fine-tunes Qwen3-VL-Embedding-2B on Visual Document Retrieval (VDR), improving NDCG@10 from 0.888 to 0.947 and outperforming all tested VDR models, including those up to 4x larger. The pipeline reuses the existing SentenceTransformerTrainer, so existing text-only training code transfers almost directly. Training scripts and the fine-tuned model (tomaarsen/Qwen3-VL-Embedding-2B-vdr) are publicly available on Hugging Face.
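For context on what the reported numbers mean, NDCG@10 rewards placing relevant documents near the top of the ranking and normalizes by the best possible ordering. A minimal, self-contained sketch of the metric (not the benchmark's exact evaluation harness):

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for a single query.

    `relevances` are graded relevance labels in ranked order
    (the label of the highest-scored document first).
    """
    def dcg(rels):
        # Discounted cumulative gain: later positions count less.
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))  # best possible ordering
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A perfect ranking scores 1.0; the benchmark score is this value averaged over all queries, which is why a jump from 0.888 to 0.947 reflects relevant pages moving meaningfully closer to rank 1.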
Sentence Transformers' SentenceTransformerTrainer now handles multimodal inputs (text, images, audio, video) with the same API as text-only training. The Qwen3-VL-Embedding-2B fine-tune hitting NDCG@10 of 0.947 on VDR is a concrete proof point: domain-specific fine-tuning on a small model beats general-purpose giants. If you're building any RAG pipeline over document images, PDFs, or mixed-modal corpora, the existing tooling now removes the custom training loop entirely.
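Because the trainer API is the same as for text-only training, a multimodal fine-tune looks like any other Sentence Transformers run. A hedged sketch under assumptions: the model ID and dataset name below are illustrative placeholders, not the released training script, and the dataset is assumed to hold (text query, document page image) pairs.

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Assumed model ID for illustration; check the model card for the exact name.
model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")

# Hypothetical dataset of (query, positive page image) pairs.
train_dataset = load_dataset("my-org/my-vdr-pairs", split="train")

# In-batch negatives: every other positive in the batch is a negative.
loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="qwen3-vl-embedding-2b-vdr",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
model.save_pretrained("qwen3-vl-embedding-2b-vdr/final")
```

The point of the sketch is the shape: swapping a text corpus for an image-bearing one changes the dataset, not the trainer, arguments, or loss setup.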
Pull the VDR training script from Hugging Face this week, swap in your own document screenshot dataset, and benchmark NDCG@10 against your current embedding model — a 2B model fine-tuned on your domain will likely beat the off-the-shelf 7B you're paying for.
Run: pip install sentence-transformers and clone the VDR training example from https://huggingface.co/tomaarsen/Qwen3-VL-Embedding-2B-vdr