How to find similar sentences using sentence transformers
Quick answer
Use Hugging Face's sentence-transformers library to encode sentences into embeddings, then compute cosine similarity between those embeddings to find similar sentences. This approach leverages pretrained transformer models optimized for semantic similarity tasks.
Prerequisites
- Python 3.8+
- pip install sentence-transformers
- pip install scikit-learn
Setup
Install the sentence-transformers library, which provides pretrained models for generating sentence embeddings. Also install scikit-learn for similarity computations.
pip install sentence-transformers scikit-learn
Step by step
This example encodes a list of sentences using a pretrained model, then finds the most similar sentence to a query sentence by cosine similarity.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
# Load pretrained sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')
# List of sentences to compare
sentences = [
"The cat sits outside.",
"A man is playing guitar.",
"The new movie is awesome.",
"A woman watches TV.",
"The dog plays in the garden."
]
# Encode sentences to embeddings
embeddings = model.encode(sentences)
# Query sentence
query = "A person is playing a musical instrument."
query_embedding = model.encode([query])
# Compute cosine similarity between query and all sentences
similarities = cosine_similarity(query_embedding, embeddings)[0]
# Find the index of the most similar sentence
most_similar_idx = similarities.argmax()
print(f"Query: {query}")
print(f"Most similar sentence: {sentences[most_similar_idx]}")
print(f"Similarity score: {similarities[most_similar_idx]:.4f}") output
Query: A person is playing a musical instrument.
Most similar sentence: A man is playing guitar.
Similarity score: 0.8235
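The same lookup also works without scikit-learn, since sentence-transformers ships its own similarity helpers. Here is a minimal sketch using util.cos_sim, plus util.semantic_search for ranked top-k hits, with the same model and a shortened sentence list:
from sentence_transformers import SentenceTransformer, util
# Load the same pretrained model as above
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [
"The cat sits outside.",
"A man is playing guitar.",
"The new movie is awesome.",
]
embeddings = model.encode(sentences)
query_embedding = model.encode(["A person is playing a musical instrument."])
# cos_sim returns a (queries x corpus) tensor of cosine similarities
scores = util.cos_sim(query_embedding, embeddings)[0]
best_idx = int(scores.argmax())
print(f"Most similar: {sentences[best_idx]} ({float(scores[best_idx]):.4f})")
# semantic_search returns ranked hits as dicts with 'corpus_id' and 'score'
hits = util.semantic_search(query_embedding, embeddings, top_k=2)[0]
for hit in hits:
    print(sentences[hit['corpus_id']], round(hit['score'], 4))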
Common variations
You can swap in other pretrained models from sentence-transformers, such as all-mpnet-base-v2 for higher accuracy or paraphrase-MiniLM-L3-v2 for faster inference. For large datasets, use a vector search library such as FAISS (pip install faiss-cpu) for efficient similarity search; it offers both exact and approximate nearest neighbor indexes.
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
# Load a different model
model = SentenceTransformer('all-mpnet-base-v2')
sentences = ["Sentence one.", "Sentence two.", "Sentence three."]
embeddings = model.encode(sentences, convert_to_numpy=True)
# Build FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension) # Inner product for cosine similarity
faiss.normalize_L2(embeddings) # Normalize embeddings
index.add(embeddings)
# Query embedding
query = "Example sentence."
query_embedding = model.encode([query], convert_to_numpy=True)
faiss.normalize_L2(query_embedding)
# Search for the top 2 most similar sentences.
# With L2-normalized vectors, the inner-product "distances" are cosine similarities.
distances, indices = index.search(query_embedding, k=2)
print(indices)
print(distances)
Output
[[0 1]]
[[0.95 0.87]]
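The raw arrays are hard to read on their own. A short follow-up loop, reusing distances, indices, and sentences from the block above, maps each hit back to its sentence:
# Each row of indices/distances corresponds to one query
for score, idx in zip(distances[0], indices[0]):
    print(f"{sentences[idx]} (cosine similarity: {score:.4f})")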
Troubleshooting
- If embeddings seem poor, try a larger or more domain-specific model.
- Ensure sentences are preprocessed consistently (e.g., lowercased if model expects it).
- If similarity scores from an inner-product index look wrong, verify the embeddings were L2-normalized before indexing; plain cosine similarity normalizes internally, so this mainly matters for FAISS-style inner-product search (see the snippet after this list).
- For large datasets, use FAISS or similar to avoid slow brute-force searches.
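As a quick sanity check for the normalization issue above, inspect the vector norms directly. A minimal sketch, assuming embeddings come from model.encode as in the earlier examples:
import numpy as np
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(["The cat sits outside."], convert_to_numpy=True)
# Unit-length vectors have an L2 norm of 1.0
norms = np.linalg.norm(embeddings, axis=1)
print(norms)
# Normalize manually if needed ...
embeddings = embeddings / norms[:, np.newaxis]
# ... or let the library normalize at encoding time
embeddings = model.encode(["The cat sits outside."], normalize_embeddings=True)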
Key Takeaways
- Use sentence-transformers to convert sentences into dense vector embeddings.
- Compute cosine similarity between embeddings to find semantically similar sentences.
- For large-scale similarity search, integrate FAISS for efficient nearest neighbor queries.