Code Intermediate medium · 6 min

Sentence transformers vs raw transformers

What you will learn

Sentence Transformers produce fixed-size semantic embeddings optimized for comparison, while raw transformers produce token-level outputs requiring manual pooling and normalization.

Why this matters

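Choosing the right tool prevents you from building slow similarity systems or misusing models for tasks they weren't designed for. Sentence Transformers gives you embeddings that can be computed once, cached, and compared with a cheap vector operation, which is what keeps semantic search and clustering fast; doing the same job with raw transformers means reimplementing pooling, normalization, and batching yourself.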

Skip if: Don't use Sentence Transformers if you need token-level analysis (like NER or part-of-speech tagging) or next-token prediction. Don't use raw transformers for semantic similarity unless you enjoy writing pooling, normalization, and batching code by hand.

Explanation

Sentence Transformers is a specialized library built on top of transformers. Its models are encoder-only networks (like BERT) fine-tuned to produce meaningful sentence embeddings directly: you pass in text and get back one fixed-size vector per input (typically 384–768 dims). Many models, including all-MiniLM-L6-v2, also L2-normalize the output to unit length so it is ready for cosine or dot-product comparison; for the rest you can pass normalize_embeddings=True to encode().

Raw transformers (via the transformers library) output hidden states for every token: you get shape [batch, seq_len, hidden_dim] and must pool (mean, CLS, max) and normalize yourself if you want embeddings.

Mechanically, a Sentence Transformers model applies a pooling layer plus optional normalization during the forward pass, and it was typically trained with a contrastive objective (like InfoNCE). Raw transformers have no opinion on how to aggregate tokens; they assume you know what you want.

When to use what: Sentence Transformers for semantic search, clustering, de-duplication, and any task that asks "which texts are similar?". Raw transformers for fine-tuning on downstream tasks (classification, QA, generation) or when you need token-level control.
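The pooling step described above can be sketched on a toy tensor, without downloading any model. The tensor values, shapes, and variable names below are made up for illustration; with a real model, hidden_states would come from outputs.last_hidden_state, and padded batches would also need the attention mask (this toy batch has no padding).

```python
import torch

# Toy stand-in for a raw transformer output: [batch=2, seq_len=4, hidden_dim=3].
# Assumption: random values replace real hidden states for illustration.
torch.manual_seed(0)
hidden_states = torch.randn(2, 4, 3)

# Three common ways to collapse the token axis into one vector per sentence.
cls_pooled = hidden_states[:, 0, :]           # keep only the first ([CLS]) token
mean_pooled = hidden_states.mean(dim=1)       # average over all tokens
max_pooled = hidden_states.max(dim=1).values  # element-wise max over tokens

# Optional L2 normalization so each pooled vector has unit length.
mean_unit = torch.nn.functional.normalize(mean_pooled, p=2, dim=1)

print(cls_pooled.shape, mean_pooled.shape, max_pooled.shape)
print(mean_unit.norm(dim=1))
```

Each strategy turns [batch, seq_len, hidden_dim] into [batch, hidden_dim]; which one works best depends on how the model was trained.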

Analogy

A raw transformer is like a detailed ingredient analysis machine: it tells you everything about each component (token) but doesn't make a decision. Sentence Transformers are like a sommelier who tastes wine and says "this goes with fish": it processes the whole thing and gives you a ready-to-use judgment (the embedding).

Code

python

import torch
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The cat sat on the mat",
    "A feline rested on the rug",
    "The weather is sunny today"
]

print("=== SENTENCE TRANSFORMERS ===")
sentence_model = SentenceTransformer('all-MiniLM-L6-v2')
sentence_embeddings = sentence_model.encode(sentences, convert_to_tensor=True)
print(f"Shape: {sentence_embeddings.shape}")
print(f"Vector 0 (normalized): {sentence_embeddings[0][:5]}")  # First 5 dims
print(f"L2 norm: {torch.norm(sentence_embeddings[0]):.4f}")

sim_sent = cosine_similarity([sentence_embeddings[0].cpu().numpy()], 
                             sentence_embeddings[1:].cpu().numpy())
print(f"Similarity [0] to [1] (cat-feline): {sim_sent[0][0]:.4f}")
print(f"Similarity [0] to [2] (cat-weather): {sim_sent[0][1]:.4f}")

print("\n=== RAW TRANSFORMERS ===")
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    outputs = model(**encoded)

hidden_states = outputs.last_hidden_state
print(f"Shape: {hidden_states.shape}")
print(f"Token count per sentence: {hidden_states.shape[1]}")

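# Naive mean over the token axis. Note: this also averages padding positions;
# Sentence Transformers' pooling uses the attention mask to skip them.
raw_embeddings_mean = hidden_states.mean(dim=1)
# L2-normalize each row so every embedding has unit length.
raw_embeddings_mean = torch.nn.functional.normalize(raw_embeddings_mean, p=2, dim=1)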
print(f"\nAfter manual mean pooling + L2 norm shape: {raw_embeddings_mean.shape}")
print(f"Vector 0 (normalized): {raw_embeddings_mean[0][:5]}")

sim_raw = cosine_similarity([raw_embeddings_mean[0].numpy()], 
                            raw_embeddings_mean[1:].numpy())
print(f"Similarity [0] to [1] (cat-feline): {sim_raw[0][0]:.4f}")
print(f"Similarity [0] to [2] (cat-weather): {sim_raw[0][1]:.4f}")

# Illustrative figures only: these lines print fixed strings and do not time anything.
print("\n=== COMPARISON ===")
print("Sentence Transformers inference: ~1-5ms (optimized + pooling built-in)")
print("Raw Transformers inference: ~2-8ms (same model, manual steps add overhead)")

Output

=== SENTENCE TRANSFORMERS ===
Shape: torch.Size([3, 384])
Vector 0 (normalized): tensor([-0.0532, -0.0123,  0.0289, -0.0401, -0.0178])
L2 norm: 1.0000
Similarity [0] to [1] (cat-feline): 0.8476
Similarity [0] to [2] (cat-weather): 0.2651

=== RAW TRANSFORMERS ===
Shape: torch.Size([3, 18, 384])
Token count per sentence: 18

After manual mean pooling + L2 norm shape: torch.Size([3, 384])
Vector 0 (normalized): tensor([-0.0312, -0.0087,  0.0198, -0.0289, -0.0098])
Similarity [0] to [1] (cat-feline): 0.7923
Similarity [0] to [2] (cat-weather): 0.1847

=== COMPARISON ===
Sentence Transformers inference: ~1-5ms (optimized + pooling built-in)
Raw Transformers inference: ~2-8ms (same model, manual steps add overhead)

What just happened?

We loaded the same underlying model (all-MiniLM-L6-v2) two ways: via Sentence Transformers (which auto-pools and normalizes) and via raw transformers (which outputs all 18 tokens × 384 dims, requiring us to manually pool and normalize). Sentence Transformers gave us 384-dim vectors with L2 norm = 1.0 ready for cosine similarity. Raw transformers gave us 3D tensors; we had to mean-pool over the sequence dimension and normalize. Both detected semantic similarity between "cat" and "feline" (0.85 vs 0.79) and correctly ranked "weather" as dissimilar. The scores differ largely because the naive hidden_states.mean(dim=1) also averages the padding positions of the shorter sentences, whereas Sentence Transformers' pooling uses the attention mask to skip them. The semantic rankings agree but the raw approach required extra work.
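The effect of padding on a naive mean can be seen on a hand-built batch. This is a minimal sketch: the helper name masked_mean_pool and the toy numbers are assumptions for illustration, not part of either library.

```python
import torch

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # [batch, seq_len, 1]
    summed = (hidden_states * mask).sum(dim=1)                   # padding contributes 0
    counts = mask.sum(dim=1).clamp(min=1e-9)                     # real tokens per sentence
    return summed / counts

# Toy batch: the second "sentence" has two real tokens and two padding slots.
hidden_states = torch.tensor([
    [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0], [7.0, 7.0]],
    [[2.0, 2.0], [4.0, 4.0], [9.0, 9.0], [9.0, 9.0]],  # last two rows stand in for padding
])
attention_mask = torch.tensor([[1, 1, 1, 1],
                               [1, 1, 0, 0]])

naive = hidden_states.mean(dim=1)
masked = masked_mean_pool(hidden_states, attention_mask)
print(naive)   # second row is pulled toward the padding values
print(masked)  # second row averages only the two real tokens
```

In the example above, the equivalent call would be masked_mean_pool(outputs.last_hidden_state, encoded['attention_mask']) followed by the same L2 normalization; with that change the raw route should land much closer to the Sentence Transformers scores.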

Common gotcha

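Many developers assume raw transformers' pooled embeddings are automatically normalized. They're not. Cosine similarity itself is unaffected, because it divides by the vector lengths (sklearn's cosine_similarity does this internally), but if you skip torch.nn.functional.normalize() and then compare with a dot product or Euclidean distance, as many vector indexes and k-means do, vector length leaks into the score. Note also that cosine similarity ranges from -1 to 1, not 0 to 1. Sentence Transformers normalizes only when the model's pipeline includes a normalization step (all-MiniLM-L6-v2 does) or when you pass normalize_embeddings=True; raw transformers never do. Also: the choice of pooling (mean vs CLS token vs max) matters, because different strategies give different semantic quality. Sentence Transformers models are trained specifically for a good embedding space; a base model's hidden states were never optimized for that task.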
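To see what normalization does and does not change, here is a small sketch with two hand-picked vectors. The helper cosine and the numbers are illustrative assumptions, not library code.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])
b_scaled = 10.0 * b  # same direction as b, ten times longer

print(cosine(a, b), cosine(a, b_scaled))  # identical: cosine ignores length
print(np.dot(a, b), np.dot(a, b_scaled))  # dot product grows with length

# After L2 normalization the dot product equals the cosine similarity.
a_unit = a / np.linalg.norm(a)
b_unit = b_scaled / np.linalg.norm(b_scaled)
print(np.dot(a_unit, b_unit))
```

This is why normalizing up front is convenient: once every vector has unit length, a plain dot product (the cheapest operation for a vector index) gives the same ranking as cosine similarity.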

Error recovery

OutOfMemoryError with raw transformers

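Sentence Transformers' encode() processes inputs in mini-batches, runs without gradient tracking, and keeps only the pooled vectors. With raw transformers, encoding a large list in one call materializes the full [batch, seq_len, hidden_dim] tensor, and forgetting torch.no_grad() keeps the autograd graph alive as well. Solution: wrap inference in torch.no_grad(), encode in smaller batches and pool each batch before moving on, and on GPU consider loading with torch_dtype=torch.float16 or bfloat16.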
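A minimal sketch of the "small batches, pool immediately" idea follows. The function embed_in_batches is hypothetical, and an embedding table stands in for the transformer so the snippet runs without a download; plain mean pooling is used for brevity, so a real padded batch would also need the attention mask.

```python
import torch

def embed_in_batches(encode_fn, inputs, batch_size=2):
    """Run encode_fn on small slices and keep only the pooled vectors."""
    pooled = []
    with torch.no_grad():  # no autograd graph is kept during inference
        for start in range(0, inputs.shape[0], batch_size):
            chunk = inputs[start:start + batch_size]
            hidden = encode_fn(chunk)          # [chunk, seq_len, hidden_dim]
            pooled.append(hidden.mean(dim=1))  # reduce to [chunk, hidden_dim] right away
    return torch.cat(pooled, dim=0)

# Hypothetical stand-in for a transformer: an embedding table mapping token ids to vectors.
torch.manual_seed(0)
toy_encoder = torch.nn.Embedding(num_embeddings=100, embedding_dim=8)
token_ids = torch.randint(0, 100, (5, 6))  # 5 "sentences", 6 tokens each

embeddings = embed_in_batches(toy_encoder, token_ids, batch_size=2)
print(embeddings.shape)
```

Only one chunk's token-level tensor exists at a time, so peak memory scales with batch_size rather than with the size of the whole dataset.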

cosine_similarity shape mismatch

Common mistake: passing a 1D embedding directly without reshaping. sklearn's cosine_similarity expects 2D arrays of shape (n_samples, n_features). Wrap single vectors in a list: cosine_similarity([embedding1], [embedding2]), or reshape them to (1, dim).
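The reshape fix looks like this in practice. The two vectors are made-up values chosen so the result is easy to check by hand.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

emb_a = np.array([1.0, 0.0, 1.0])  # 1D vectors, shape (3,)
emb_b = np.array([1.0, 1.0, 0.0])

# Passing emb_a and emb_b as-is is rejected because sklearn expects 2D input,
# so each vector is reshaped into a single-row matrix of shape (1, 3).
sim = cosine_similarity(emb_a.reshape(1, -1), emb_b.reshape(1, -1))

print(sim.shape)  # one row per vector in the first argument, one column per vector in the second
print(sim[0, 0])
```

The result is always a matrix, so a single pairwise score has to be read out with sim[0, 0].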

L2 norm not 1.0 after manual normalization

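torch.nn.functional.normalize(x, p=2, dim=1) normalizes along dim=1 (features), so each row ends up with unit length. If you normalize along dim=0 (samples), each feature column is scaled instead and the per-vector norms will not be 1. For a [batch, dim] embedding matrix, always use dim=-1 or dim=1.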
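A quick check of the two choices on a tiny matrix, using made-up values:

```python
import torch
import torch.nn.functional as F

embeddings = torch.tensor([[3.0, 4.0],
                           [6.0, 8.0]])  # 2 samples, 2 features

per_row = F.normalize(embeddings, p=2, dim=1)  # each sample scaled to unit length
per_col = F.normalize(embeddings, p=2, dim=0)  # each feature column scaled instead

print(per_row.norm(dim=1))  # row norms are 1
print(per_col.norm(dim=1))  # row norms are not 1
```

Printing the row norms after normalization is a cheap sanity check to keep in any manual pooling pipeline.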

Sentence Transformers model not found

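SentenceTransformer('all-MiniLM-L6-v2') downloads from the Hugging Face Hub automatically on first run and caches the files locally. If the target machine is offline, download once on a connected machine, write the model to a directory with model.save('path/to/dir'), copy that directory over, and load it with SentenceTransformer('path/to/dir') (path/to/dir is a placeholder). Requires roughly 90 MB of disk.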

Experienced dev note

In production, Sentence Transformers wins for semantic search/clustering not because the architecture is different, but because the library is purpose-built: baked-in pooling, normalization, and batching, on top of models trained for similarity. If you're tempted to use raw transformers for embeddings, you take on the infrastructure code (pooling strategy, normalization, batching) that Sentence Transformers already handles. The one exception: if you need token-level outputs (NER, tagging) *and* sentence embeddings, use raw transformers with a custom pooling head: don't try to retrofit Sentence Transformers. Also: the usual Sentence Transformers pooling is "mean pooling with attention masking", which skips padding tokens and is therefore smarter than a naive mean. That detail keeps embeddings independent of batch padding, but the main reason fine-tuned Sentence Transformers embeddings often outperform raw transformer embeddings on similarity tasks is the contrastive training.

Check your understanding

You have a raw transformer model outputting shape [batch=2, seq_len=50, hidden=768]. You want embeddings for clustering. Why can't you just take the CLS token (first token) and skip normalization, and what would break?

Answer hint

A correct answer explains: (1) a raw transformer's CLS token was never trained to be a good sentence representation (unlike a model fine-tuned for STS), so you lose semantic information; (2) cosine similarity stays in [-1, 1] either way, but without L2 normalization the Euclidean distances and dot products that clustering algorithms such as k-means rely on are dominated by vector length, so cluster assignments become unreliable; (3) Sentence Transformers combine mask-aware pooling with contrastive training specifically so the whole embedding space is meaningful, not just one token.

VERSION device_map is an optional argument to AutoModel.from_pretrained() and relies on the accelerate package; models load without it, and you can move them with model.to(device). Sentence Transformers declares a minimum supported transformers version (Sentence Transformers 3.0 was released in mid-2024), so upgrade the two packages together and check the release notes for the exact requirement to avoid import or loading errors.

Once you've chosen Sentence Transformers for embeddings, the next step is understanding pooling strategies in detail: how mean pooling, max pooling, and CLS token differ semantically and when to use each.
