Sentence transformers vs raw transformers
Why this matters
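Choosing the right tool prevents you from building slow similarity systems or misusing models for tasks they weren't designed for. Sentence Transformers give you search-ready sentence embeddings from a single call, which is what semantic search and clustering need. Raw transformers give you per-token hidden states, so you must pool (and usually normalize) them yourself before they are usable as embeddings.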
Explanation
Sentence Transformers is a specialized library built on top of transformers. Its models are encoders (such as BERT) fine-tuned to produce meaningful sentence embeddings directly: you pass in text and get back one fixed-size vector per input (typically 384–768 dims). Many checkpoints, including all-MiniLM-L6-v2 used below, also L2-normalize that vector to unit length so it is ready for cosine or dot-product similarity; for the rest, pass normalize_embeddings=True to encode().
Raw transformers (via the transformers library) output hidden states for every token: you get shape [batch, seq_len, hidden_dim] and must pool manually (mean, CLS, or max) and normalize if you want embeddings.
Mechanically, a Sentence Transformers model applies a pooling layer and an optional normalization layer during the forward pass, and it was trained with a sentence-level objective, often a contrastive one (InfoNCE-style). Raw transformers have no opinion on how to aggregate tokens; they assume you know what you want.
When to use what: Sentence Transformers for semantic search, clustering, de-duplication, and any task that asks "which texts are similar?"; raw transformers for fine-tuning on downstream tasks (classification, QA, generation) or when you need token-level control.
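To make the pooling step concrete, here is a minimal sketch of the three strategies named above, followed by L2 normalization. It is an illustration under assumptions, not the library's implementation: plain Python lists stand in for one sentence's slice of the [batch, seq_len, hidden_dim] tensor, the token values are made up, and the helper names (cls_pool, mean_pool, max_pool, l2_normalize) are hypothetical.

```python
import math

# Toy "last hidden state" for ONE sentence: 4 token positions x 3 hidden dims.
# Values are invented for illustration; the last position is padding.
token_vectors = [
    [1.0, 0.0, 2.0],  # [CLS]
    [3.0, 4.0, 0.0],  # a real word token
    [2.0, 2.0, 1.0],  # [SEP]
    [9.0, 9.0, 9.0],  # [PAD], should not influence the result
]
attention_mask = [1, 1, 1, 0]  # 1 = real token, 0 = padding


def cls_pool(vectors):
    """Use the first token's vector as the sentence representation."""
    return list(vectors[0])


def mean_pool(vectors, mask):
    """Average only the positions whose mask value is 1."""
    kept = [v for v, m in zip(vectors, mask) if m == 1]
    return [sum(column) / len(kept) for column in zip(*kept)]


def max_pool(vectors, mask):
    """Take the per-dimension maximum over the unmasked positions."""
    kept = [v for v, m in zip(vectors, mask) if m == 1]
    return [max(column) for column in zip(*kept)]


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


pooled = {
    "cls": cls_pool(token_vectors),
    "mean": mean_pool(token_vectors, attention_mask),
    "max": max_pool(token_vectors, attention_mask),
}
for name, vector in pooled.items():
    print(name, [round(x, 3) for x in l2_normalize(vector)])
```

Each strategy collapses the token axis into a single vector of size hidden_dim, but the three vectors point in different directions, which is why the choice of pooling affects similarity scores.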
Analogy
A raw transformer is like a detailed ingredient analysis machine: it tells you everything about each component (token) but doesn't make a decision. Sentence Transformers are like a sommelier who tastes wine and says "this goes with fish": it processes the whole thing and gives you a ready-to-use judgment (the embedding).
Code
import torch
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The cat sat on the mat",
    "A feline rested on the rug",
    "The weather is sunny today",
]

# --- Sentence Transformers: pooling and normalization are part of the model ---
print("=== SENTENCE TRANSFORMERS ===")
sentence_model = SentenceTransformer('all-MiniLM-L6-v2')
sentence_embeddings = sentence_model.encode(sentences, convert_to_tensor=True)
print(f"Shape: {sentence_embeddings.shape}")
print(f"Vector 0 (normalized): {sentence_embeddings[0][:5]}")  # First 5 dims
print(f"L2 norm: {torch.norm(sentence_embeddings[0]):.4f}")
# Compare sentence 0 against sentences 1 and 2
sim_sent = cosine_similarity([sentence_embeddings[0].cpu().numpy()],
                             sentence_embeddings[1:].cpu().numpy())
print(f"Similarity [0] to [1] (cat-feline): {sim_sent[0][0]:.4f}")
print(f"Similarity [0] to [2] (cat-weather): {sim_sent[0][1]:.4f}")

# --- Raw transformers: same checkpoint, but we get per-token hidden states ---
print("\n=== RAW TRANSFORMERS ===")
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    outputs = model(**encoded)
hidden_states = outputs.last_hidden_state  # [batch, seq_len, hidden_dim]
print(f"Shape: {hidden_states.shape}")
# seq_len is the padded length shared by every sentence in the batch
print(f"Token count per sentence: {hidden_states.shape[1]}")
# Naive mean over the sequence dimension: any padding positions are averaged in too
raw_embeddings_mean = hidden_states.mean(dim=1)
raw_embeddings_mean = torch.nn.functional.normalize(raw_embeddings_mean, p=2, dim=1)
print(f"\nAfter manual mean pooling + L2 norm shape: {raw_embeddings_mean.shape}")
print(f"Vector 0 (normalized): {raw_embeddings_mean[0][:5]}")
sim_raw = cosine_similarity([raw_embeddings_mean[0].numpy()],
                            raw_embeddings_mean[1:].numpy())
print(f"Similarity [0] to [1] (cat-feline): {sim_raw[0][0]:.4f}")
print(f"Similarity [0] to [2] (cat-weather): {sim_raw[0][1]:.4f}")
print("\n=== COMPARISON ===")
# NOTE: the timings below are hardcoded, illustrative figures; this script does not measure them.
print("Sentence Transformers inference: ~1-5ms (optimized + pooling built-in)")
print("Raw Transformers inference: ~2-8ms (same model, manual steps add overhead)")
Output
=== SENTENCE TRANSFORMERS ===
Shape: torch.Size([3, 384])
Vector 0 (normalized): tensor([-0.0532, -0.0123, 0.0289, -0.0401, -0.0178])
L2 norm: 1.0000
Similarity [0] to [1] (cat-feline): 0.8476
Similarity [0] to [2] (cat-weather): 0.2651

=== RAW TRANSFORMERS ===
Shape: torch.Size([3, 18, 384])
Token count per sentence: 18

After manual mean pooling + L2 norm shape: torch.Size([3, 384])
Vector 0 (normalized): tensor([-0.0312, -0.0087, 0.0198, -0.0289, -0.0098])
Similarity [0] to [1] (cat-feline): 0.7923
Similarity [0] to [2] (cat-weather): 0.1847

=== COMPARISON ===
Sentence Transformers inference: ~1-5ms (optimized + pooling built-in)
Raw Transformers inference: ~2-8ms (same model, manual steps add overhead)
What just happened?
We loaded the same underlying model (all-MiniLM-L6-v2) two ways: via Sentence Transformers (which pools and normalizes for us) and via raw transformers (which returns a [3, 18, 384] tensor: one 384-dim vector per token position, padded to a common length of 18, that we must pool and normalize ourselves). Sentence Transformers gave us 384-dim vectors with L2 norm = 1.0, ready for cosine similarity. With raw transformers we had to mean-pool over the sequence dimension and then normalize. Both detected the semantic similarity between the "cat" and "feline" sentences (0.85 vs 0.79) and both ranked the "weather" sentence as dissimilar (0.27 vs 0.18). The rankings agree, but the raw approach required extra work, and its scores are not guaranteed to match: our naive mean averages padding positions along with real tokens, whereas Sentence Transformers masks padding out before averaging.
Common gotcha
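Many developers assume raw transformers' pooled embeddings are automatically normalized. They're not. Cosine similarity itself is unaffected, because it divides by the vector norms anyway (sklearn's cosine_similarity does this internally). What breaks is anything that assumes unit-length vectors: dot-product scoring in a vector index, Euclidean-distance clustering such as k-means, or any code that treats a raw dot product as a cosine score. If you skip torch.nn.functional.normalize(), those scores are skewed by vector length. Note also that cosine similarity ranges from -1 to 1, not 0 to 1. Raw transformers never normalize; Sentence Transformers normalize only when the model includes a normalization layer (all-MiniLM-L6-v2 does) or when you pass normalize_embeddings=True to encode(). Also, the choice of pooling (mean vs CLS token vs max) matters: different strategies give different semantic quality. Sentence Transformers models are trained specifically for a good embedding space; a raw transformer that was only pretrained was never optimized for that task.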
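The difference between cosine similarity and a raw dot product can be checked with a few lines of arithmetic. The sketch below uses made-up 2-dim vectors and hypothetical helper names (dot, norm, cosine, l2_normalize); it only illustrates the scale behavior and says nothing about any particular model.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(dot(a, a))


def cosine(a, b):
    """Cosine similarity: the dot product divided by both lengths."""
    return dot(a, b) / (norm(a) * norm(b))


def l2_normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]


query = [3.0, 4.0]
doc = [4.0, 3.0]
doc_scaled = [40.0, 30.0]  # same direction as doc, ten times longer

cos_plain = cosine(query, doc)
cos_scaled = cosine(query, doc_scaled)   # unchanged by the rescaling
dot_plain = dot(query, doc)
dot_scaled = dot(query, doc_scaled)      # ten times larger
dot_unit = dot(l2_normalize(query), l2_normalize(doc_scaled))  # equals the cosine
cos_opposite = cosine([1.0, 0.0], [-1.0, 0.0])  # opposite directions give a negative score

print(cos_plain, cos_scaled, dot_plain, dot_scaled, dot_unit, cos_opposite)
```

Once both vectors are unit length, the dot product and the cosine coincide, which is why normalized embeddings can be scored with a plain matrix multiplication.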
Error recovery
OutOfMemoryError with raw transformers
cosine_similarity shape mismatch
L2 norm not 1.0 after manual normalization
Sentence Transformers model not found
Experienced dev note
In production, Sentence Transformers wins for semantic search/clustering not because the underlying model is better, but because the library is purpose-built: pooling, normalization, and batched inference are baked in. If you're tempted to use raw transformers for embeddings, expect to write and maintain that infrastructure yourself (pooling strategy, normalization, batching). The one exception: if you need token-level outputs (NER, tagging) *and* sentence embeddings, use raw transformers with a custom pooling head rather than trying to retrofit Sentence Transformers. Also, the default pooling in most Sentence Transformers models is mean pooling with attention masking, which ignores padding tokens and is therefore more robust than a naive mean. Combined with contrastive fine-tuning, this is why Sentence Transformers embeddings often outperform pooled embeddings from a raw pretrained transformer on similarity tasks.
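The effect of attention masking is easiest to see by pooling the same sentence at two different padded lengths. This is a minimal sketch with invented numbers: the pad vector is a hypothetical hidden state at a padding position, and naive_mean and masked_mean are illustrative helpers, not library functions.

```python
def naive_mean(vectors):
    """Average every position, padding included."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def masked_mean(vectors, mask):
    """Average only the positions flagged as real tokens (mask == 1)."""
    kept = [v for v, m in zip(vectors, mask) if m == 1]
    return [sum(column) / len(kept) for column in zip(*kept)]


real_tokens = [[1.0, 3.0], [3.0, 1.0]]  # two real token vectors (made up)
pad = [4.0, 0.0]                         # hypothetical hidden state at a padding position

# The same sentence, padded to length 3 in one batch and length 6 in another.
padded_to_3 = real_tokens + [pad]
padded_to_6 = real_tokens + [pad] * 4
mask_3 = [1, 1, 0]
mask_6 = [1, 1, 0, 0, 0, 0]

naive_3 = naive_mean(padded_to_3)
naive_6 = naive_mean(padded_to_6)
masked_3 = masked_mean(padded_to_3, mask_3)
masked_6 = masked_mean(padded_to_6, mask_6)

print("naive :", naive_3, naive_6)    # changes with the amount of padding
print("masked:", masked_3, masked_6)  # identical for both padded lengths
```

With a naive mean, a sentence's embedding depends on which other sentences happen to share its batch; with masking it does not.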
Check your understanding
You have a raw transformer model outputting shape [batch=2, seq_len=50, hidden=768]. You want embeddings for clustering. Why can't you just take the CLS token (first token) and skip normalization, and what would break?
Answer hint
A correct answer explains: (1) in a raw pretrained transformer, the CLS token was never trained to be a good sentence representation (unlike a model fine-tuned for sentence similarity, e.g. on STS), so you lose semantic information; (2) without L2 normalization, distance-based clustering (k-means, or anything using Euclidean distance or raw dot products) is influenced by vector length rather than direction alone, so cluster assignments become unreliable; (3) Sentence Transformers combine a fixed pooling step (usually attention-masked mean) with contrastive training of the encoder, so the whole pooled vector carries sentence meaning, not just one token.
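Why normalization matters for clustering can be sketched with three made-up 2-dim vectors. Euclidean distance, which k-means relies on, is sensitive to vector length, so the nearest neighbor under raw Euclidean distance can disagree with the nearest neighbor under cosine similarity. After L2 normalization the two agree, because for unit vectors the squared distance equals 2 - 2·cos. All names and values below are assumptions chosen for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def l2_normalize(a):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in a))
    return [x / n for x in a]


anchor = [1.0, 0.0]
candidates = {
    "same_direction": [10.0, 0.0],   # cosine 1.0 to the anchor, but much longer
    "other_direction": [0.8, 0.6],   # cosine 0.8 to the anchor, similar length
}

# Nearest neighbor under each measure, before normalization.
nearest_raw = min(candidates, key=lambda k: euclidean(anchor, candidates[k]))
nearest_cos = max(candidates, key=lambda k: cosine(anchor, candidates[k]))

# Nearest neighbor by Euclidean distance after normalizing everything.
unit = {k: l2_normalize(v) for k, v in candidates.items()}
nearest_unit = min(unit, key=lambda k: euclidean(l2_normalize(anchor), unit[k]))

# For unit vectors: squared Euclidean distance == 2 - 2 * cosine.
squared_distance = euclidean(anchor, unit["other_direction"]) ** 2
identity_rhs = 2 - 2 * cosine(anchor, unit["other_direction"])

print(nearest_raw, nearest_cos, nearest_unit)
```

In this toy setup the raw Euclidean neighbor is the vector pointing in a different direction, while cosine similarity and normalized Euclidean distance both pick the vector with the same direction.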