How-to · beginner · 4 min read

How to find similar sentences using sentence transformers

Quick answer
Use the sentence-transformers library (SBERT, distributed via the Hugging Face Hub) to encode sentences into dense embeddings, then compute cosine similarity between those embeddings to find similar sentences. This approach leverages pretrained transformer models fine-tuned for semantic similarity tasks.

PREREQUISITES

  • Python 3.8+
  • pip install sentence-transformers
  • pip install scikit-learn

Setup

Install the sentence-transformers library, which provides pretrained models for generating sentence embeddings. Also install scikit-learn for similarity computations.

bash
pip install sentence-transformers scikit-learn

Step by step

This example encodes a list of sentences using a pretrained model, then finds the most similar sentence to a query sentence by cosine similarity.

python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load pretrained sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# List of sentences to compare
sentences = [
    "The cat sits outside.",
    "A man is playing guitar.",
    "The new movie is awesome.",
    "A woman watches TV.",
    "The dog plays in the garden."
]

# Encode sentences to embeddings
embeddings = model.encode(sentences)

# Query sentence
query = "A person is playing a musical instrument."
query_embedding = model.encode([query])

# Compute cosine similarity between query and all sentences
similarities = cosine_similarity(query_embedding, embeddings)[0]

# Find the index of the most similar sentence
most_similar_idx = similarities.argmax()

print(f"Query: {query}")
print(f"Most similar sentence: {sentences[most_similar_idx]}")
print(f"Similarity score: {similarities[most_similar_idx]:.4f}")
output
Query: A person is playing a musical instrument.
Most similar sentence: A man is playing guitar.
Similarity score: 0.8235
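The example above returns only the single best match. To rank all candidates, sort the similarity array in descending order. A minimal sketch, using placeholder scores in place of the cosine_similarity output above:

```python
import numpy as np

sentences = [
    "The cat sits outside.",
    "A man is playing guitar.",
    "The new movie is awesome.",
]
# Placeholder scores standing in for the cosine_similarity output above
similarities = np.array([0.12, 0.82, 0.05])

top_k = 2
top_idx = np.argsort(-similarities)[:top_k]  # indices in descending score order
for i in top_idx:
    print(f"{sentences[i]} ({similarities[i]:.2f})")
```

Negating the array before argsort is a common idiom for descending order; np.argpartition is faster when you need only the top k of a very large array.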

Common variations

You can use different pretrained models from sentence-transformers like all-mpnet-base-v2 for higher accuracy or paraphrase-MiniLM-L3-v2 for faster inference. For large datasets, use approximate nearest neighbor libraries like FAISS for efficient similarity search.

python
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Load a different model
model = SentenceTransformer('all-mpnet-base-v2')

sentences = ["Sentence one.", "Sentence two.", "Sentence three."]
embeddings = model.encode(sentences, convert_to_numpy=True)

# Build FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)  # Inner product for cosine similarity
faiss.normalize_L2(embeddings)  # Normalize embeddings
index.add(embeddings)

# Query embedding
query = "Example sentence."
query_embedding = model.encode([query], convert_to_numpy=True)
faiss.normalize_L2(query_embedding)

# Search top 2 similar sentences
distances, indices = index.search(query_embedding, k=2)
print(indices)
print(distances)
output
[[0 1]]
[[0.95 0.87]]
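IndexFlatIP with L2-normalized vectors is exactly brute-force cosine similarity: once every vector has unit length, the inner product equals the cosine of the angle between them. A small numpy sketch of the same search, using toy 2-dimensional vectors instead of real embeddings:

```python
import numpy as np

def l2_normalize(x):
    # Divide each row by its L2 norm, mirroring faiss.normalize_L2
    return x / np.linalg.norm(x, axis=1, keepdims=True)

corpus = l2_normalize(np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]))
query = l2_normalize(np.array([[0.9, 0.1]]))

scores = (query @ corpus.T)[0]  # inner product == cosine after normalization
order = np.argsort(-scores)     # indices sorted by descending similarity
print(order[:2], scores[order[:2]])
```

This is what FAISS does internally for a flat index; the payoff of FAISS comes from its optimized kernels and approximate index types (e.g., IndexIVFFlat, IndexHNSWFlat) on large corpora.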

Troubleshooting

  • If embeddings seem poor, try a larger or more domain-specific model.
  • Ensure sentences are preprocessed consistently (e.g., lowercased if model expects it).
  • If similarity scores are unexpectedly low, check whether you are computing a raw inner product on unnormalized embeddings: scikit-learn's cosine_similarity normalizes internally, but FAISS's IndexFlatIP does not, so L2-normalize vectors before adding them to the index.
  • For large datasets, use FAISS or similar to avoid slow brute-force searches.
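The normalization point above is easy to verify with a quick numpy check: a raw inner product scales with vector magnitude, while cosine similarity does not.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = 2 * a  # same direction, twice the magnitude

raw_dot = float(a @ b)  # grows with magnitude: 28.0
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # direction only: 1.0

print(raw_dot, cosine)
```

The two vectors point in exactly the same direction, so cosine similarity is 1.0 even though the raw dot product is much larger; after L2 normalization the two measures coincide.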

Key Takeaways

  • Use sentence-transformers to convert sentences into dense vector embeddings.
  • Compute cosine similarity between embeddings to find semantically similar sentences.
  • For large-scale similarity search, integrate FAISS for efficient nearest neighbor queries.
Verified 2026-04 · all-MiniLM-L6-v2, all-mpnet-base-v2, paraphrase-MiniLM-L3-v2