Sentence Transformers Cheat Sheet — Embeddings & Similarity
from sentence_transformers import SentenceTransformer, util, losses
from sentence_transformers import InputExample, models
Pre-trained transformers that turn sentences into dense numeric vectors for semantic search.
Like a semantic ZIP code system: each sentence gets a numerical address. Sentences with similar meanings have addresses in the same neighborhood, so you can find similar sentences by measuring the distance between addresses.
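The address analogy can be made concrete with plain NumPy. The sketch below uses made-up 3-dimensional vectors (illustrative values, not real model output) and a hypothetical helper, cosine_similarity, to show that sentences in the same "neighborhood" score higher than distant ones.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "addresses" (assumed values for illustration, not produced by a model)
cat = np.array([0.9, 0.1, 0.0])
feline = np.array([0.8, 0.2, 0.1])
dog_park = np.array([0.1, 0.9, 0.3])

near = cosine_similarity(cat, feline)    # similar meaning, nearby address
far = cosine_similarity(cat, dog_park)   # different meaning, distant address
print(near > far)  # True
```

Real embeddings from all-MiniLM-L6-v2 have 384 dimensions, but the comparison works the same way.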
Key Concepts
Sentence Transformers Patterns
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [
"The cat sat on the mat.",
"A feline rested on fabric.",
"The dog ran in the park."
]
embeddings = model.encode(sentences)
print(embeddings.shape) # (3, 384)
Embeddings: array of shape (3, 384). Each row is a 384-dim vector.
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')
corpus = [
"Python is a programming language.",
"Dogs are loyal pets.",
"Java is used for backend development.",
"Cats are independent animals."
]
query = "What is Python?"
query_embedding = model.encode(query, convert_to_tensor=True)
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(f"{corpus[hit['corpus_id']]}: {hit['score']:.4f}")
Output:
Python is a programming language.: 0.8234
Java is used for backend development.: 0.6521
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [
"The sky is blue.",
"The ocean is blue.",
"The grass is green."
]
embeddings = model.encode(sentences, convert_to_tensor=True)
similarity_matrix = util.pytorch_cos_sim(embeddings, embeddings)
print(similarity_matrix)
# tensor([[1.0000, 0.8234, 0.2341],
# [0.8234, 1.0000, 0.1923],
# [0.2341, 0.1923, 1.0000]])
2D tensor (N x N) where [i][j] = similarity between sentence i and j.
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from torch.utils.data import DataLoader
model = SentenceTransformer('all-MiniLM-L6-v2')
train_examples = [
InputExample(texts=['Query: diabetes treatment', 'Insulin therapy for diabetes'], label=0.9),
InputExample(texts=['Query: blood pressure', 'Unrelated: machine learning'], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)
model.fit(
train_objectives=[(train_dataloader, train_loss)],
epochs=1,
warmup_steps=100,
output_path='./my-finetuned-model'
)
Trained model saved to ./my-finetuned-model with improved domain-specific embeddings.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
model = SentenceTransformer('all-MiniLM-L6-v2')
documents = [
"Python tutorial for beginners",
"Java programming guide",
"Python advanced topics",
"C++ learning path"
]
embeddings = model.encode(documents)
clusterer = KMeans(n_clusters=2)
labels = clusterer.fit_predict(embeddings)
for doc, label in zip(documents, labels):
    print(f"Cluster {label}: {doc}")
Output (printed in document order; cluster IDs are arbitrary and may differ between runs):
Cluster 0: Python tutorial for beginners
Cluster 1: Java programming guide
Cluster 0: Python advanced topics
Cluster 1: C++ learning path
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('all-MiniLM-L6-v2')
# Process 1M texts in batches
texts = [f"Document {i}" for i in range(1000000)]
all_embeddings = []
for batch_start in range(0, len(texts), 32768):
    batch = texts[batch_start:batch_start + 32768]
    embeddings = model.encode(
        batch,
        batch_size=256,
        show_progress_bar=True,
        convert_to_numpy=True
    )
    all_embeddings.append(embeddings)
all_embeddings = np.vstack(all_embeddings)
print(all_embeddings.shape) # (1000000, 384)
NumPy array of shape (1000000, 384) with all embeddings.
from sentence_transformers import SentenceTransformer
from sentence_transformers import CrossEncoder, util
# Assumes `corpus` is a list of strings, e.g. the corpus from the semantic search example
# Step 1: Fast semantic search to get top-100
model = SentenceTransformer('all-MiniLM-L6-v2')
query = "Best Python books"
query_emb = model.encode(query, convert_to_tensor=True)
corpus_embs = model.encode(corpus, convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_embs, top_k=100)[0]
# Step 2: Re-rank top-100 with cross-encoder
ce_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
sentence_pairs = [[query, corpus[hit['corpus_id']]] for hit in hits]
scores = ce_model.predict(sentence_pairs)
for idx, score in sorted(enumerate(scores), key=lambda x: x[1], reverse=True)[:5]:
    print(f"{corpus[hits[idx]['corpus_id']]}: {score:.4f}")
Top-5 re-ranked results with higher accuracy than semantic search alone.
Sentence Transformers Comparison
| Model Name | Dims | Speed | Use Case | Size |
|---|---|---|---|---|
Common Errors & Fixes
RuntimeError: CUDA out of memory
Cause: Batch size too large for GPU. Default batch_size=32 tries to fit 32 sentences on GPU at once.
Reduce batch_size in encode() or move model to CPU:
model = SentenceTransformer('all-MiniLM-L6-v2', device='cpu')
# Or reduce batch size:
embeddings = model.encode(sentences, batch_size=8)
# Or stay on GPU with a smaller batch (the default is 32):
model = SentenceTransformer('all-MiniLM-L6-v2', device='cuda')
embeddings = model.encode(sentences, batch_size=16)
ValueError: You must install PyTorch to use SentenceTransformer
Cause: PyTorch is missing or broken in the active environment. sentence-transformers declares torch as a dependency, so pip normally installs it, but the default build may not match your CUDA version.
Install torch explicitly:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Or install both together:
pip install sentence-transformers torch
AttributeError: 'numpy.ndarray' object has no attribute 'to'
Cause: A NumPy array was passed where a PyTorch tensor is expected, for example to util.semantic_search(). encode() returns NumPy arrays unless convert_to_tensor=True is set.
Convert embeddings to tensor before passing to semantic_search():
from sentence_transformers import SentenceTransformer, util
import torch
embeddings = model.encode(sentences) # Returns NumPy
embeddings_tensor = torch.from_numpy(embeddings).float()
hits = util.semantic_search(query_emb, embeddings_tensor, top_k=5)
# Or encode directly to tensor:
embeddings = model.encode(sentences, convert_to_tensor=True)
FileNotFoundError: /root/.cache/huggingface/hub/... does not exist
Cause: Model not downloaded or cache corrupted. sentence-transformers auto-downloads from the Hugging Face Hub on first use.
Pre-download model or specify cache directory:
# Pre-download:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2') # Auto-downloads
# Specify cache location:
import os
os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/path/to/cache'
model = SentenceTransformer('all-MiniLM-L6-v2')
# Offline mode (set the variable before importing sentence_transformers,
# because the Hub client reads it at import time):
import os
os.environ['HF_HUB_OFFLINE'] = '1'
model = SentenceTransformer('/local/path/to/model')
AssertionError: Labels must be floats between 0 and 1
Cause: Fine-tuning InputExample label is not in [0, 1] range.
Normalize labels to [0, 1] when creating InputExample:
from sentence_transformers import InputExample
# Wrong:
train_examples = [InputExample(texts=['A', 'B'], label=2)] # label > 1
# Correct:
train_examples = [
InputExample(texts=['A', 'B'], label=0.9), # Highly similar
InputExample(texts=['C', 'D'], label=0.1), # Not similar
]
# For binary similar/dissimilar labels (0 or 1), consider OnlineContrastiveLoss;
# for (anchor, positive, negative) triplets, consider TripletLoss.
Production Gotchas
sentence-transformers caches models in ~/.cache/huggingface/hub/. If you update code but the model doesn't change output, it's likely using cached weights. Clear cache with: rm -rf ~/.cache/huggingface/hub/models--sentence-transformers* or set SENTENCE_TRANSFORMERS_HOME=/tmp before loading.
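model.encode() returns unnormalized embeddings by default. Cosine similarity is scale-invariant, so util.pytorch_cos_sim() works on them directly (it normalizes internally). If you score with a plain dot product instead, for example in a vector index, normalize first: from sklearn.preprocessing import normalize; embeddings = normalize(embeddings, norm='l2'). Alternatively, pass normalize_embeddings=True to encode().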
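A small NumPy sketch of why normalization matters when scores are computed as dot products. The vectors are illustrative values rather than model output, and l2_normalize is a hypothetical helper equivalent in spirit to sklearn's normalize(..., norm='l2').

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length (guarding against zero-length rows)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

# Rows 0 and 1 point in the same direction but have different lengths;
# row 2 is perpendicular to row 0.
emb = np.array([[3.0, 4.0], [6.0, 8.0], [4.0, -3.0]])

raw_dot = emb @ emb.T        # depends on vector length
unit = l2_normalize(emb)
cosine = unit @ unit.T       # dot product of unit vectors = cosine similarity

print(raw_dot[0, 1])             # 50.0
print(round(cosine[0, 1], 6))    # 1.0
print(round(abs(cosine[0, 2]), 6))  # 0.0
```

After normalization the dot product and cosine similarity agree, which is what makes pre-normalized embeddings convenient for indexes that only support inner-product search.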
The default batch_size=32 may be too large for GPU or too small for CPU. Profile your setup: start with batch_size=8 on GPU, batch_size=128 on CPU, then increase until OOM. batch_size affects both speed and memory, not accuracy.
CrossEncoder takes [query, document] pairs and returns a single score per pair. It produces no reusable document embeddings, so you cannot encode a corpus once and reuse it: every new query requires scoring each candidate pair again. Use bi-encoders (SentenceTransformer) for corpus encoding and cross-encoders only for re-ranking the top-K.
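The cost difference can be sketched as simple arithmetic on the number of model forward passes. The function names and the workload numbers below are hypothetical, and the count ignores batching and the (much cheaper) similarity search itself.

```python
def bi_encoder_passes(num_docs: int, num_queries: int) -> int:
    """Each document and each query is encoded exactly once."""
    return num_docs + num_queries

def cross_encoder_passes(num_docs: int, num_queries: int) -> int:
    """Every (query, document) pair needs its own forward pass."""
    return num_docs * num_queries

def retrieve_then_rerank_passes(num_docs: int, num_queries: int, top_k: int) -> int:
    """Bi-encoder retrieval plus cross-encoder scoring of top_k hits per query."""
    rerank = num_queries * min(top_k, num_docs)
    return bi_encoder_passes(num_docs, num_queries) + rerank

docs, queries, k = 100_000, 1_000, 100  # assumed workload for illustration

print(bi_encoder_passes(docs, queries))               # 101000
print(cross_encoder_passes(docs, queries))            # 100000000
print(retrieve_then_rerank_passes(docs, queries, k))  # 201000
```

Under these assumptions, retrieve-then-rerank costs roughly twice the bi-encoder alone, while scoring every pair with the cross-encoder costs about a thousand times more.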
Fine-tuning on unbalanced or unrelated data can degrade performance on general tasks. Always evaluate on a held-out validation set. Start with a small learning rate (1e-5) and monitor validation similarity scores. More data ≠ better; quality matters.
GPU is faster for inference but slower to initialize (CUDA kernel loading ~2-5 seconds). For <1000 texts, CPU is often faster end-to-end. For >100k texts, GPU wins. Set device='cuda' once in the constructor rather than on each encode() call.
Some models use CLS token, others use mean pooling. This is fixed per model: you cannot change it in encode(). If you need custom pooling, save the transformer part and add your own pooling layer.
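To illustrate the difference between the two pooling strategies, here is a NumPy sketch over toy token vectors. The helpers cls_pooling and mean_pooling are hypothetical and only mimic what a pooling layer computes; they are not the library's implementation.

```python
import numpy as np

def cls_pooling(token_embeddings: np.ndarray) -> np.ndarray:
    """Use the first token's vector as the sentence embedding."""
    return token_embeddings[:, 0, :]

def mean_pooling(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors, ignoring padding positions (mask == 0)."""
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# One sentence, 3 token positions (the last is padding), 2 dimensions.
# Values are made up for illustration.
tokens = np.array([[[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0]])

print(cls_pooling(tokens))         # [[1. 0.]]
print(mean_pooling(tokens, mask))  # [[2. 1.]]
```

The padding vector [9, 9] does not affect the mean because the attention mask excludes it; the two strategies give different sentence vectors from the same token outputs, which is why a model should be used with the pooling it was trained with.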
all-MiniLM-L6-v2 is English-only and faster. multilingual-e5-base works in 100+ languages but is slower and larger. Use English-only models if your corpus is English; use multilingual only if you need cross-lingual search.
Complete Production Example: Semantic Search with Re-ranking
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import numpy as np
# Load models
bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
ce_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
# Sample corpus
corpus = [
"Python is a high-level programming language.",
"Machine learning with Python libraries like scikit-learn.",
"Data science using Python and pandas.",
"Java is used for enterprise applications.",
"C++ is a compiled language for systems programming."
]
# Encode corpus once
corpus_embeddings = bi_encoder.encode(
corpus,
batch_size=32,
convert_to_tensor=True
)
def semantic_search_with_rerank(query, top_k_retrieval=10, top_k_final=3):
    # Step 1: Fast semantic search
    query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(
        query_embedding,
        corpus_embeddings,
        top_k=min(top_k_retrieval, len(corpus))
    )[0]
    # Step 2: Re-rank with cross-encoder
    candidates = [corpus[hit['corpus_id']] for hit in hits]
    sentence_pairs = [[query, doc] for doc in candidates]
    cross_scores = ce_model.predict(sentence_pairs)
    # Combine and sort
    results = [
        {"text": candidates[i], "score": float(cross_scores[i])}
        for i in range(len(candidates))
    ]
    results = sorted(results, key=lambda x: x['score'], reverse=True)[:top_k_final]
    return results
# Query
query = "Python machine learning"
results = semantic_search_with_rerank(query)
for i, result in enumerate(results, 1):
    print(f"{i}. [{result['score']:.4f}] {result['text']}")