How to find similar sentences using sentence transformers
Quick answer
Use Hugging Face's sentence-transformers library to encode sentences into embeddings, then compute cosine similarity between those embeddings to find similar sentences. This approach leverages pretrained transformer models optimized for semantic similarity tasks.
Prerequisites
- Python 3.8+
- pip install sentence-transformers
- pip install scikit-learn
Setup
Install the sentence-transformers library, which provides pretrained models for generating sentence embeddings. Also install scikit-learn for similarity computations.
pip install sentence-transformers scikit-learn
Step by step
This example encodes a list of sentences using a pretrained model, then finds the most similar sentence to a query sentence by cosine similarity.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
# Load pretrained sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')
# List of sentences to compare
sentences = [
"The cat sits outside.",
"A man is playing guitar.",
"The new movie is awesome.",
"A woman watches TV.",
"The dog plays in the garden."
]
# Encode sentences to embeddings
embeddings = model.encode(sentences)
# Query sentence
query = "A person is playing a musical instrument."
query_embedding = model.encode([query])
# Compute cosine similarity between query and all sentences
similarities = cosine_similarity(query_embedding, embeddings)[0]
# Find the index of the most similar sentence
most_similar_idx = similarities.argmax()
print(f"Query: {query}")
print(f"Most similar sentence: {sentences[most_similar_idx]}")
print(f"Similarity score: {similarities[most_similar_idx]:.4f}") output
Query: A person is playing a musical instrument.
Most similar sentence: A man is playing guitar.
Similarity score: 0.8235
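The same lookup also works without scikit-learn, since sentence-transformers ships its own similarity helpers. Here is a minimal sketch using util.cos_sim, plus util.semantic_search for ranked top-k hits, with the same model and a shortened sentence list:
from sentence_transformers import SentenceTransformer, util
# Load the same pretrained model as above
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [
"The cat sits outside.",
"A man is playing guitar.",
"The new movie is awesome.",
]
embeddings = model.encode(sentences)
query_embedding = model.encode(["A person is playing a musical instrument."])
# cos_sim returns a (queries x corpus) tensor of cosine similarities
scores = util.cos_sim(query_embedding, embeddings)[0]
best_idx = int(scores.argmax())
print(f"Most similar: {sentences[best_idx]} ({float(scores[best_idx]):.4f})")
# semantic_search returns ranked hits as dicts with 'corpus_id' and 'score'
hits = util.semantic_search(query_embedding, embeddings, top_k=2)[0]
for hit in hits:
    print(sentences[hit['corpus_id']], round(hit['score'], 4))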
Common variations
You can swap in other pretrained models from sentence-transformers, such as all-mpnet-base-v2 for higher accuracy or paraphrase-MiniLM-L3-v2 for faster inference. For large datasets, use a vector search library such as FAISS (pip install faiss-cpu) for efficient similarity search; it offers both exact and approximate nearest neighbor indexes.
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
# Load a different model
model = SentenceTransformer('all-mpnet-base-v2')
sentences = ["Sentence one.", "Sentence two.", "Sentence three."]
embeddings = model.encode(sentences, convert_to_numpy=True)
# Build FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension) # Inner product for cosine similarity
faiss.normalize_L2(embeddings) # Normalize embeddings
index.add(embeddings)
# Query embedding
query = "Example sentence."
query_embedding = model.encode([query], convert_to_numpy=True)
faiss.normalize_L2(query_embedding)
# Search for the top 2 most similar sentences.
# With L2-normalized vectors, the inner-product "distances" are cosine similarities.
distances, indices = index.search(query_embedding, k=2)
print(indices)
print(distances)
Output
[[0 1]]
[[0.95 0.87]]
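The raw arrays are hard to read on their own. A short follow-up loop, reusing distances, indices, and sentences from the block above, maps each hit back to its sentence:
# Each row of indices/distances corresponds to one query
for score, idx in zip(distances[0], indices[0]):
    print(f"{sentences[idx]} (cosine similarity: {score:.4f})")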
Troubleshooting
- If embeddings seem poor, try a larger or more domain-specific model.
- Ensure sentences are preprocessed consistently (e.g., lowercased if model expects it).
- If similarity scores from an inner-product index look wrong, verify the embeddings were L2-normalized before indexing; plain cosine similarity normalizes internally, so this mainly matters for FAISS-style inner-product search (see the snippet after this list).
- For large datasets, use FAISS or similar to avoid slow brute-force searches.
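As a quick sanity check for the normalization issue above, inspect the vector norms directly. A minimal sketch, assuming embeddings come from model.encode as in the earlier examples:
import numpy as np
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(["The cat sits outside."], convert_to_numpy=True)
# Unit-length vectors have an L2 norm of 1.0
norms = np.linalg.norm(embeddings, axis=1)
print(norms)
# Normalize manually if needed ...
embeddings = embeddings / norms[:, np.newaxis]
# ... or let the library normalize at encoding time
embeddings = model.encode(["The cat sits outside."], normalize_embeddings=True)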
Key Takeaways
- Use sentence-transformers to convert sentences into dense vector embeddings.
- Compute cosine similarity between embeddings to find semantically similar sentences.
- For large-scale similarity search, integrate FAISS for efficient nearest neighbor queries.