Cheat Sheet intermediate · 8 min read

Sentence Transformers Cheat Sheet — Embeddings & Similarity

version 2.7.x

Semantic embeddings from any text in seconds

install pip install sentence-transformers torch

core imports

python

from sentence_transformers import SentenceTransformer, util, losses
from sentence_transformers import InputExample, models

Mental model

Pre-trained transformers that turn sentences into dense numeric vectors for semantic search.

Like a semantic ZIP code system: each sentence gets a unique numerical address. Sentences with the same meaning have addresses in the same neighborhood. You can find similar sentences by measuring distance between addresses.

Key Concepts

Embedding

A fixed-size vector (typically 384-1024 dimensions; all-MiniLM-L6-v2 produces 384) that numerically represents the semantic meaning of a sentence or document.

Semantic Similarity

A score measuring how similar two texts are in meaning, computed as the cosine similarity between their embeddings. The theoretical range is -1 to 1; for natural-language text, scores usually fall between 0 and 1.
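The score itself is plain vector arithmetic. The following is a minimal, dependency-free sketch of the cosine formula, not the library's implementation; the helper name `cosine_similarity` and the toy vectors are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (hypothetical values, not model output).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])       # 0
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])      # close to -1
```

Vector length does not affect the score; only direction does.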

Contrastive Learning

Training method that pushes similar sentence pairs close together and dissimilar pairs apart in embedding space.
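As a rough illustration of the idea, here is the classic pairwise contrastive loss for a single pair, written in plain Python. It is a sketch under assumptions: Euclidean distance and a margin of 1.0 are chosen for the example, and the library's own loss classes may use a different distance and margin.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(a, b, label, margin=1.0):
    """Pairwise contrastive loss for one pair.

    label=1 (similar): the loss grows with distance, pulling the pair together.
    label=0 (dissimilar): the loss is non-zero only while the pair is closer
    than `margin`, pushing it apart until the margin is reached.
    """
    d = euclidean(a, b)
    return 0.5 * (label * d ** 2 + (1 - label) * max(0.0, margin - d) ** 2)

anchor = [0.0, 0.0]
far = [3.0, 4.0]    # distance 5.0 from the anchor
near = [0.3, 0.4]   # distance 0.5 from the anchor

loss_similar_far = contrastive_loss(anchor, far, label=1)       # large: should be closer
loss_dissimilar_far = contrastive_loss(anchor, far, label=0)    # zero: already far apart
loss_dissimilar_near = contrastive_loss(anchor, near, label=0)  # positive: too close
```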

Pooling Strategy

Mechanism to reduce token embeddings (one per token, which is often a subword rather than a whole word) into a single sentence embedding: mean, max, or CLS token pooling.
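The three strategies can be sketched on a toy sequence. This is an illustrative, dependency-free version; the function names, the 2-dimensional token vectors, and the attention mask are assumptions for the example, not library code.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, skipping padding (mask == 0)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(len(kept[0]))]

def max_pool(token_vectors, attention_mask):
    """Take the per-dimension maximum over real tokens."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    return [max(vec[d] for vec in kept) for d in range(len(kept[0]))]

def cls_pool(token_vectors):
    """Use the first token's vector ([CLS] in BERT-style models)."""
    return list(token_vectors[0])

# Hypothetical token embeddings for one padded sentence.
tokens = [
    [1.0, 4.0],  # [CLS]
    [3.0, 0.0],  # "hello"
    [5.0, 2.0],  # "world"
    [9.0, 9.0],  # [PAD], ignored by mean and max pooling
]
mask = [1, 1, 1, 0]

mean_vec = mean_pool(tokens, mask)  # [3.0, 2.0]
max_vec = max_pool(tokens, mask)    # [5.0, 4.0]
cls_vec = cls_pool(tokens)          # [1.0, 4.0]
```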

Fine-tuning

Adapting a pre-trained sentence transformer to your domain by training on your labeled sentence pairs or triplets.

Inference

Converting input text into embeddings; fast operation that happens at prediction time, not training.

Sentence Transformers Patterns

01 Basic Sentence Encoding

Convert text to vectors for similarity or clustering.

python

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [
    "The cat sat on the mat.",
    "A feline rested on fabric.",
    "The dog ran in the park."
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 384)

output Embeddings: array of shape (3, 384). Each row is a 384-dim vector.

encode() returns NumPy arrays by default. For GPU inference, pass device='cuda' to the constructor; encode() also accepts a device argument for a per-call override.

02 Semantic Search / Most Similar

Find top-K most similar sentences from a corpus.

python

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
corpus = [
    "Python is a programming language.",
    "Dogs are loyal pets.",
    "Java is used for backend development.",
    "Cats are independent animals."
]

query = "What is Python?"
query_embedding = model.encode(query, convert_to_tensor=True)
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(f"{corpus[hit['corpus_id']]}: {hit['score']:.4f}")

output

Python is a programming language.: 0.8234
Java is used for backend development.: 0.6521

semantic_search() works on PyTorch tensors, so encode with convert_to_tensor=True (NumPy arrays are converted internally). Encode the corpus once and reuse corpus_embeddings across queries to avoid recomputing.
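Conceptually, top-K retrieval scores the query against every corpus vector and keeps the best K. The sketch below is a plain-Python stand-in for that idea, not the library's implementation; `top_k` and the toy vectors are hypothetical, and the result dicts only mirror the `corpus_id`/`score` keys used in the pattern above.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, corpus_vecs, k):
    """Return the k highest-scoring corpus entries, best first."""
    scored = ((cosine(query_vec, vec), idx) for idx, vec in enumerate(corpus_vecs))
    best = heapq.nlargest(k, scored)
    return [{"corpus_id": idx, "score": score} for score, idx in best]

# Hypothetical precomputed corpus embeddings, reused for every query.
corpus_vecs = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6], [-1.0, 0.0]]
hits = top_k([1.0, 0.1], corpus_vecs, k=2)
```

Because `corpus_vecs` is computed once, each new query costs one encoding plus the scoring pass.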

03 Pairwise Similarity Matrix

Compute similarity between all pairs in a batch (clustering, ranking).

python

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [
    "The sky is blue.",
    "The ocean is blue.",
    "The grass is green."
]

embeddings = model.encode(sentences, convert_to_tensor=True)
similarity_matrix = util.pytorch_cos_sim(embeddings, embeddings)
print(similarity_matrix)
# tensor([[1.0000, 0.8234, 0.2341],
#         [0.8234, 1.0000, 0.1923],
#         [0.2341, 0.1923, 1.0000]])

output 2D tensor (N x N) where [i][j] = similarity between sentence i and j.

pytorch_cos_sim() returns a PyTorch tensor, not NumPy. Call .cpu().numpy() to convert the result. util.cos_sim() is the same function under its current name; both accept NumPy inputs but still return a tensor.

04 Fine-tune on Custom Domain Data

Improve accuracy for your specific domain (medical, legal, e-commerce).

python

from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from torch.utils.data import DataLoader

model = SentenceTransformer('all-MiniLM-L6-v2')

train_examples = [
    InputExample(texts=['Query: diabetes treatment', 'Insulin therapy for diabetes'], label=0.9),
    InputExample(texts=['Query: blood pressure', 'Unrelated: machine learning'], label=0.1),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    output_path='./my-finetuned-model'
)

output Trained model saved to ./my-finetuned-model with improved domain-specific embeddings.

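For CosineSimilarityLoss, InputExample labels should be floats, typically in [0, 1]. Use CosineSimilarityLoss when you have graded similarity scores (regression); OnlineContrastiveLoss for binary similar/dissimilar pairs (labels 0 or 1); TripletLoss for (anchor, positive, negative) triplets, which need no label.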
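To see why float labels fit a cosine-based regression loss, here is a plain-Python sketch of the objective: the mean squared error between each pair's cosine similarity and its label. The helper names and toy vectors are assumptions for illustration; real training operates on model outputs and gradients.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def cosine_mse_loss(pairs):
    """Mean squared error between cos(u, v) and the target label."""
    errors = [(cosine(u, v) - label) ** 2 for u, v, label in pairs]
    return sum(errors) / len(errors)

# (embedding_a, embedding_b, label) with hypothetical 2-dim embeddings.
pairs = [
    ([1.0, 0.0], [1.0, 0.0], 1.0),  # identical vectors, labeled similar: error 0
    ([1.0, 0.0], [0.0, 1.0], 0.5),  # orthogonal (cosine 0), label 0.5: error 0.25
]
loss = cosine_mse_loss(pairs)  # (0 + 0.25) / 2 = 0.125
```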

05 Document Clustering

Group similar documents without predefined labels.

python

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('all-MiniLM-L6-v2')
documents = [
    "Python tutorial for beginners",
    "Java programming guide",
    "Python advanced topics",
    "C++ learning path"
]

embeddings = model.encode(documents)
clusterer = KMeans(n_clusters=2, n_init=10, random_state=42)  # fixed seed for repeatable clusters
labels = clusterer.fit_predict(embeddings)

for doc, label in zip(documents, labels):
    print(f"Cluster {label}: {doc}")

output

Cluster 0: Python tutorial for beginners
Cluster 0: Python advanced topics
Cluster 1: Java programming guide
Cluster 1: C++ learning path

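KMeans requires choosing n_clusters up front; use silhouette_score or the elbow method to pick k. Cluster ids are arbitrary, and the grouping shown above is illustrative. Normalizing embeddings to unit length first makes Euclidean KMeans behave like cosine-based clustering.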
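The reason normalization helps: for unit-length vectors, squared Euclidean distance equals 2 - 2 * cosine similarity, so the nearest point by distance is also the most similar by cosine. A small numeric check of that identity, with made-up vectors and helper names:

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

cos_ab = dot(a, b)                # for unit vectors, dot product == cosine
sq_dist = squared_distance(a, b)  # should equal 2 - 2 * cos_ab
identity_gap = abs(sq_dist - (2 - 2 * cos_ab))
```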

06 Efficient Batch Inference at Scale

Encode thousands/millions of texts with memory efficiency.

python

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

# Process 1M texts in batches
texts = [f"Document {i}" for i in range(1000000)]

all_embeddings = []
for batch_start in range(0, len(texts), 32768):
    batch = texts[batch_start:batch_start + 32768]
    embeddings = model.encode(
        batch,
        batch_size=256,
        show_progress_bar=True,
        convert_to_numpy=True
    )
    all_embeddings.append(embeddings)

all_embeddings = np.vstack(all_embeddings)
print(all_embeddings.shape)  # (1000000, 384)

output NumPy array of shape (1000000, 384) with all embeddings.

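batch_size=256 is the encoder batch size; use a smaller value if you hit OOM. convert_to_numpy=True (the default) returns embeddings as NumPy arrays in CPU memory, so results don't accumulate on the GPU. One million 384-dim float32 vectors take about 1.5 GB of RAM. Index the embeddings with Faiss or Pinecone afterwards.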
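The outer loop in this pattern is ordinary chunking. A minimal sketch of it as a reusable generator follows; `iter_chunks` is a hypothetical helper and `fake_encode` is a stand-in for `model.encode()` so the example runs without a model.

```python
def iter_chunks(items, chunk_size):
    """Yield consecutive slices of at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]

def fake_encode(batch):
    """Stand-in for model.encode(): one toy 2-dim vector per text."""
    return [[float(len(text)), 1.0] for text in batch]

texts = [f"Document {i}" for i in range(10)]

chunk_sizes = []
all_vectors = []
for chunk in iter_chunks(texts, chunk_size=4):
    chunk_sizes.append(len(chunk))
    all_vectors.extend(fake_encode(chunk))  # only one chunk is "in flight" at a time
```

With a real model, each chunk's result could also be written to disk or pushed to an index instead of being accumulated in a list.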

07 Re-ranking with Cross-Encoders (Higher Accuracy)

Re-rank top-K results from semantic search for final ranking.

python

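from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Small illustrative corpus; replace with your own documents
corpus = [
    "A review of the best books for learning Python.",
    "Java frameworks for backend development.",
    "Python recipes for data processing.",
]

# Step 1: Fast semantic search to get up to 100 candidates
model = SentenceTransformer('all-MiniLM-L6-v2')
query = "Best Python books"
query_emb = model.encode(query, convert_to_tensor=True)
corpus_embs = model.encode(corpus, convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_embs, top_k=min(100, len(corpus)))[0]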

# Step 2: Re-rank top-100 with cross-encoder
ce_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
sentence_pairs = [[query, corpus[hit['corpus_id']]] for hit in hits]
scores = ce_model.predict(sentence_pairs)

for idx, score in sorted(enumerate(scores), key=lambda x: x[1], reverse=True)[:5]:
    print(f"{corpus[hits[idx]['corpus_id']]}: {score:.4f}")

output Top-5 re-ranked results with higher accuracy than semantic search alone.

Cross-encoders are slow (compute score per pair) but accurate. Use bi-encoders for retrieval, cross-encoders only for top-K re-ranking.

Sentence Transformers Comparison

Model Name	Dims	Speed	Use Case	Size

Common Errors & Fixes

01 RuntimeError: CUDA out of memory

Cause: Batch size too large for GPU. Default batch_size=32 tries to fit 32 sentences on GPU at once.

Fix:

Reduce batch_size in encode() or move the model to CPU:

python

from sentence_transformers import SentenceTransformer

sentences = ["The cat sat on the mat.", "The dog ran in the park."]

# Option 1: run on CPU
model = SentenceTransformer('all-MiniLM-L6-v2', device='cpu')
embeddings = model.encode(sentences)

# Option 2: stay on GPU with a batch smaller than the default 32
model = SentenceTransformer('all-MiniLM-L6-v2', device='cuda')
embeddings = model.encode(sentences, batch_size=8)
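If the right batch size is not known in advance, one option is to retry with a halved batch size whenever encoding runs out of memory. The sketch below is an assumption-laden illustration: `encode_with_backoff` is a hypothetical helper, `fake_encode` simulates an encoder that fails above batch size 8, and with a real model you would pass the exception type your setup raises (the error above is reported as a RuntimeError) as `oom_error`.

```python
def encode_with_backoff(encode_fn, texts, batch_size=32, min_batch_size=1,
                        oom_error=MemoryError):
    """Call encode_fn, halving batch_size each time it raises oom_error.

    Returns (embeddings, batch_size_that_worked). Re-raises the error once
    min_batch_size has been tried and still fails.
    """
    while True:
        try:
            return encode_fn(texts, batch_size=batch_size), batch_size
        except oom_error:
            if batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)

def fake_encode(texts, batch_size):
    """Stand-in encoder that 'runs out of memory' above batch size 8."""
    if batch_size > 8:
        raise MemoryError("simulated out-of-memory")
    return [[0.0] for _ in texts]

vectors, used_batch_size = encode_with_backoff(fake_encode, ["a", "b", "c"])
```

Here the helper tries 32, then 16, then succeeds at 8.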

02 ValueError: You must install PyTorch to use SentenceTransformer

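Cause: PyTorch is missing or not importable. sentence-transformers declares torch as a dependency, so pip normally installs it, but a broken environment or a build that doesn't match your CUDA version can leave it unusable.

Fix:

Install torch explicitly:

bash

# CUDA 11.8 build; pick the index URL that matches your CUDA version
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Or install both together:
pip install sentence-transformers torch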

03 AttributeError: 'numpy.ndarray' has no attribute 'to'

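Cause: Code that expects PyTorch tensors (for example, a .to(device) call) received the NumPy array that encode() returns by default.

Fix:

Convert embeddings to tensors, or encode directly to tensors:

python

from sentence_transformers import SentenceTransformer, util
import torch

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ["Dogs are loyal pets.", "Cats are independent animals."]
query_emb = model.encode("Which pets are loyal?", convert_to_tensor=True)

# Option 1: convert the NumPy output to a tensor
embeddings = model.encode(sentences)  # Returns NumPy
embeddings_tensor = torch.from_numpy(embeddings).float()
hits = util.semantic_search(query_emb, embeddings_tensor, top_k=2)

# Option 2: encode directly to a tensor
embeddings = model.encode(sentences, convert_to_tensor=True)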

04 FileNotFoundError: /root/.cache/huggingface/hub/... does not exist

Cause: Model not downloaded or cache corrupted. sentence-transformers auto-downloads from HuggingFace hub on first use.

Fix:

Pre-download the model or specify a cache directory:

python

import os

# Specify the cache location before loading the model
os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/path/to/cache'

from sentence_transformers import SentenceTransformer

# Pre-download: the first load fetches the model into the cache
model = SentenceTransformer('all-MiniLM-L6-v2')

# Offline mode: set HF_HUB_OFFLINE=1 in the environment before starting Python
# (HF_DATASETS_OFFLINE only affects the datasets library), then load a local copy
model = SentenceTransformer('/local/path/to/model')

05 AssertionError: Labels must be floats between 0 and 1

Cause: Fine-tuning InputExample label is not in [0, 1] range.

Fix:

Normalize labels to [0, 1] when creating InputExample:

python

from sentence_transformers import InputExample

# Wrong: label outside [0, 1]
train_examples = [InputExample(texts=['A', 'B'], label=2)]

# Correct: float labels in [0, 1]
train_examples = [
    InputExample(texts=['A', 'B'], label=0.9),  # Highly similar
    InputExample(texts=['C', 'D'], label=0.1),  # Not similar
]

# For binary similar/dissimilar pairs, use OnlineContrastiveLoss;
# for (anchor, positive, negative) triplets, use TripletLoss.

Production Gotchas

⚠ Model Caching Can Cause Stale Behavior

sentence-transformers caches models in ~/.cache/huggingface/hub/. If you expect new model weights but the output doesn't change, a cached copy is probably being used. Clear the cache with: rm -rf ~/.cache/huggingface/hub/models--sentence-transformers* or set SENTENCE_TRANSFORMERS_HOME=/tmp before loading.

⚠ Embeddings Are Not Normalized by Default

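model.encode() does not normalize embeddings unless you pass normalize_embeddings=True (or normalize afterwards: from sklearn.preprocessing import normalize; embeddings = normalize(embeddings, norm='l2')). Cosine similarity, including util.pytorch_cos_sim(), normalizes internally, so it is unaffected. Normalization matters when you score with dot product or Euclidean distance, for example in an inner-product vector index or KMeans.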
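A small numeric sketch of the difference, using made-up vectors and helper names: a raw dot product rewards vector length, cosine does not, and after L2 normalization the two agree.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query = [1.0, 0.0]
short = [1.0, 1.0]
long = [10.0, 10.0]  # same direction as `short`, ten times the length

raw_dot_short = dot(query, short)  # 1.0
raw_dot_long = dot(query, long)    # 10.0: magnitude inflates the score
cos_short = cosine(query, short)   # same value as cos_long
cos_long = cosine(query, long)
norm_dot_long = dot(l2_normalize(query), l2_normalize(long))  # matches cos_long
```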

⚠ Batch Size Varies by Model & Hardware

The default batch_size=32 may be too large for a small GPU or leave larger hardware underused. Profile your setup: start with batch_size=8 on GPU or batch_size=128 on CPU, then increase until throughput stops improving or you hit OOM. batch_size affects speed and memory, not accuracy.

⚠ Cross-Encoders Are Not Drop-in Replacements for Bi-Encoders

CrossEncoder takes [query, document] pairs and returns a single score per pair. You cannot use CrossEncoder for embedding a corpus once: you must re-compute for every new query. Use bi-encoders (SentenceTransformer) for corpus encoding, cross-encoders only for re-ranking top-K.
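The scaling difference is easy to see by counting model forward passes. The sketch below is back-of-the-envelope arithmetic, not a benchmark: the function names and workload sizes are made up for illustration, and it ignores that a cross-encoder pass and a bi-encoder pass do not cost the same.

```python
def bi_encoder_passes(num_docs, num_queries):
    """Each document is encoded once; each query is encoded once."""
    return num_docs + num_queries

def cross_encoder_passes(num_docs, num_queries):
    """Every (query, document) pair needs its own forward pass."""
    return num_docs * num_queries

def rerank_passes(num_docs, num_queries, top_k):
    """Bi-encoder retrieval, then cross-encoder scoring of top_k hits per query."""
    return bi_encoder_passes(num_docs, num_queries) + num_queries * min(top_k, num_docs)

docs, queries = 100_000, 1_000  # hypothetical workload

bi_only = bi_encoder_passes(docs, queries)            # 101,000
cross_only = cross_encoder_passes(docs, queries)      # 100,000,000
two_stage = rerank_passes(docs, queries, top_k=100)   # 201,000
```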

⚠ Fine-tuning Requires Balanced, Domain-Specific Data

Fine-tuning on unbalanced or unrelated data can degrade performance on general tasks. Always evaluate on a held-out validation set. Start with a small learning rate (1e-5) and monitor validation similarity scores. More data ≠ better; quality matters.

⚠ GPU vs CPU Trade-off

GPU is faster for inference but slower to initialize (CUDA kernel loading ~2-5 seconds). For <1000 texts, CPU is often faster end-to-end. For >100k texts, GPU wins. Set device='cuda' in the constructor, or pass device to encode() for a one-off override.

⚠ Different Models Have Different Pooling Strategies

Some models use the CLS token, others use mean pooling. This is fixed per model: you cannot change it in encode(). If you need custom pooling, build the model from modules: a models.Transformer layer followed by a models.Pooling layer configured with the pooling mode you want.

⚠ Multilingual Models Are Not Always Better

all-MiniLM-L6-v2 is English-only and faster. multilingual-e5-base works in 100+ languages but is slower and larger. Use English-only models if your corpus is English; use multilingual only if you need cross-lingual search.

Complete Production Example: Semantic Search with Re-ranking

python

from sentence_transformers import SentenceTransformer, CrossEncoder, util
import numpy as np

# Load models
bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
ce_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Sample corpus
corpus = [
    "Python is a high-level programming language.",
    "Machine learning with Python libraries like scikit-learn.",
    "Data science using Python and pandas.",
    "Java is used for enterprise applications.",
    "C++ is a compiled language for systems programming."
]

# Encode corpus once
corpus_embeddings = bi_encoder.encode(
    corpus,
    batch_size=32,
    convert_to_tensor=True
)

def semantic_search_with_rerank(query, top_k_retrieval=10, top_k_final=3):
    # Step 1: Fast semantic search
    query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(
        query_embedding,
        corpus_embeddings,
        top_k=min(top_k_retrieval, len(corpus))
    )[0]
    
    # Step 2: Re-rank with cross-encoder
    candidates = [corpus[hit['corpus_id']] for hit in hits]
    sentence_pairs = [[query, doc] for doc in candidates]
    cross_scores = ce_model.predict(sentence_pairs)
    
    # Combine and sort
    results = [
        {"text": candidates[i], "score": float(cross_scores[i])}
        for i in range(len(candidates))
    ]
    results = sorted(results, key=lambda x: x['score'], reverse=True)[:top_k_final]
    return results

# Query
query = "Python machine learning"
results = semantic_search_with_rerank(query)
for i, result in enumerate(results, 1):
    print(f"{i}. [{result['score']:.4f}] {result['text']}")

Verified 2026-04 · v2.7.x
