AI for plagiarism detection

Quick answer

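Detect plagiarism in two stages: use an embedding model such as text-embedding-3-small to score the semantic similarity between texts, then ask an LLM such as gpt-4o-mini to confirm whether a high-scoring pair is a paraphrase. For large collections, store the embeddings in a vector index so copied or closely reworded content can be found efficiently.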

PREREQUISITES

Python 3.8+
OpenAI API key (usage is billed per token; check your account's limits)
pip install "openai>=1.0"

Setup

Install the openai Python package and set your API key as an environment variable for secure access.

bash

pip install "openai>=1.0"
export OPENAI_API_KEY="your-api-key"

output

Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x
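
Because the examples read the key from OPENAI_API_KEY, it can help to fail early with a readable message when the variable is missing. This is a minimal sketch; the helper name `require_api_key` is hypothetical and not part of the openai package.

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a readable error.

    Hypothetical helper for illustration; not part of the openai package.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the examples.")
    return key
```

One way to use it would be `client = OpenAI(api_key=require_api_key())`, so a missing key surfaces before any request is sent.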

Step by step

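This example checks whether an input text is plagiarized from a reference text in two steps. It first embeds both texts with text-embedding-3-small and computes their cosine similarity; only if the score reaches the threshold does it ask gpt-4o-mini whether the input paraphrases the reference. The 0.85 threshold is a starting point, so tune it on your own data.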

python

import math
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Reference text to compare against
reference_text = "Artificial intelligence is the simulation of human intelligence processes by machines."

# Text to check for plagiarism
input_text = "AI involves machines mimicking human intelligence processes."

# Step 1: Get embeddings for both texts
embedding_model = "text-embedding-3-small"

ref_embedding_resp = client.embeddings.create(model=embedding_model, input=reference_text)
input_embedding_resp = client.embeddings.create(model=embedding_model, input=input_text)

ref_embedding = ref_embedding_resp.data[0].embedding
input_embedding = input_embedding_resp.data[0].embedding

# Step 2: Compute cosine similarity

def cosine_similarity(vec1, vec2):
    dot = sum(a*b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a*a for a in vec1))
    norm2 = math.sqrt(sum(b*b for b in vec2))
    return dot / (norm1 * norm2)

similarity = cosine_similarity(ref_embedding, input_embedding)

# Step 3: Use GPT to detect paraphrasing if similarity is high
threshold = 0.85
if similarity >= threshold:
    prompt = f"Determine if the following text is a paraphrase of the reference.\nReference: {reference_text}\nInput: {input_text}\nAnswer with 'Yes' or 'No' and a brief explanation."
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    result = response.choices[0].message.content
else:
    result = "Texts are not similar enough to be considered plagiarized."

print(f"Cosine similarity: {similarity:.3f}")
print("Plagiarism check result:", result)

output (illustrative; the exact score and wording vary by run and model version)

Cosine similarity: 0.912
Plagiarism check result: Yes. The input text is a paraphrase of the reference, conveying the same meaning with different wording.
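
To build intuition for the score used in Step 2, the sketch below runs the same cosine formula on toy vectors. The three-dimensional vectors are made up for illustration (real embeddings have far more dimensions), and returning 0.0 for a zero vector is an assumption added here to avoid dividing by zero.

```python
import math


def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: treat a zero vector as "no similarity"
    return dot / (norm1 * norm2)


# Toy vectors, made up for illustration
same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])  # parallel vectors
orthogonal = cosine_similarity([1, 0, 0], [0, 1, 0])      # nothing in common
partial = cosine_similarity([1, 0, 0], [1, 1, 0])         # 45 degrees apart

print(f"same direction: {same_direction:.3f}")  # 1.000
print(f"orthogonal:     {orthogonal:.3f}")      # 0.000
print(f"partial:        {partial:.3f}")         # 0.707
```

The score depends only on direction, not length, which is why a short paraphrase and a longer original can still score close to 1.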

Common variations

Use async calls with asyncio and the AsyncOpenAI client for non-blocking requests; in openai>=1.0 the async client exposes the same create() methods, which you await.
Try different embedding models like text-embedding-3-large for higher accuracy.
Integrate vector databases like FAISS or Chroma for large-scale plagiarism detection across many documents.

python

import asyncio
import math
import os

from openai import AsyncOpenAI


def cosine_similarity(vec1, vec2):
    dot = sum(a*b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a*a for a in vec1))
    norm2 = math.sqrt(sum(b*b for b in vec2))
    return dot / (norm1 * norm2)


async def async_plagiarism_check():
    # openai>=1.0 has no acreate(); AsyncOpenAI exposes awaitable create() methods
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    reference_text = "Artificial intelligence is the simulation of human intelligence processes by machines."
    input_text = "AI involves machines mimicking human intelligence processes."

    # Request both embeddings concurrently
    embedding_model = "text-embedding-3-small"
    ref_embedding_resp, input_embedding_resp = await asyncio.gather(
        client.embeddings.create(model=embedding_model, input=reference_text),
        client.embeddings.create(model=embedding_model, input=input_text),
    )

    ref_embedding = ref_embedding_resp.data[0].embedding
    input_embedding = input_embedding_resp.data[0].embedding

    similarity = cosine_similarity(ref_embedding, input_embedding)

    # Ask the chat model only when the embeddings are already close
    if similarity >= 0.85:
        prompt = f"Determine if the following text is a paraphrase of the reference.\nReference: {reference_text}\nInput: {input_text}\nAnswer with 'Yes' or 'No' and a brief explanation."
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}]
        )
        result = response.choices[0].message.content
    else:
        result = "Texts are not similar enough to be considered plagiarized."

    print(f"Cosine similarity: {similarity:.3f}")
    print("Plagiarism check result:", result)

asyncio.run(async_plagiarism_check())

output (illustrative; the exact score and wording vary by run and model version)

Cosine similarity: 0.912
Plagiarism check result: Yes. The input text is a paraphrase of the reference, conveying the same meaning with different wording.
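
To check one submission against many references, the same comparison extends to a nearest-neighbor search. The sketch below is a brute-force stand-in for what a vector database such as FAISS or Chroma does with an index. The function name `top_matches`, the document ids, and the toy vectors are all hypothetical; in practice the vectors would come from the embeddings endpoint, which also accepts a list of strings as `input` so a corpus can be embedded in batches.

```python
import math


def cosine_similarity(vec1, vec2):
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2)


def top_matches(query_vec, corpus, k=3, threshold=0.85):
    """Return up to k (doc_id, score) pairs scoring at or above threshold, best first.

    corpus maps a document id to its embedding vector.
    Hypothetical helper: a linear scan, not an indexed search.
    """
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in corpus.items()]
    hits = [(doc_id, score) for doc_id, score in scored if score >= threshold]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits[:k]


# Toy 3-dimensional "embeddings", made up for illustration
corpus = {
    "essay-a": [0.9, 0.1, 0.0],
    "essay-b": [0.0, 1.0, 0.0],
    "essay-c": [0.7, 0.7, 0.1],
}
query = [1.0, 0.2, 0.0]

matches = top_matches(query, corpus, k=2, threshold=0.5)
for doc_id, score in matches:
    print(f"{doc_id}: {score:.3f}")
```

Only the documents returned here would need the slower LLM paraphrase check. A linear scan like this is fine for a few thousand documents; beyond that, an indexed store avoids comparing against every vector.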

Troubleshooting

If you get 401 Unauthorized, verify your OPENAI_API_KEY environment variable is set correctly.
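If known paraphrases score below the threshold, the 0.85 cutoff may be too strict for your texts; lower it, or try text-embedding-3-large.
For large document sets, comparing each submission against every reference gets slow; store the embeddings in a vector database and query for nearest neighbors instead.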
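
When the cutoff looks wrong, one option is to calibrate it on a small set of pairs you have labelled by hand rather than guessing. This is a rough sketch: `pick_threshold`, the candidate grid, and the scores are hypothetical, and it simply picks the candidate with the highest accuracy.

```python
def pick_threshold(scored_pairs, candidates=None):
    """Pick the cutoff that classifies the most labelled pairs correctly.

    scored_pairs: list of (similarity, is_plagiarized) tuples.
    Ties go to the lowest candidate, which favors catching more cases.
    """
    if candidates is None:
        candidates = [round(0.50 + 0.05 * i, 2) for i in range(10)]  # 0.50 .. 0.95
    best_cutoff, best_correct = None, -1
    for cutoff in candidates:
        # A pair is classified correctly when "score >= cutoff" matches its label
        correct = sum((score >= cutoff) == label for score, label in scored_pairs)
        if correct > best_correct:
            best_cutoff, best_correct = cutoff, correct
    return best_cutoff


# Made-up similarity scores with hand-assigned labels
labelled = [(0.91, True), (0.78, True), (0.74, True), (0.62, False), (0.41, False)]
print(pick_threshold(labelled))  # 0.65
```

With more labelled data you might weigh false accusations more heavily than misses, since plain accuracy treats both errors the same.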

Key Takeaways

Use embedding models to measure semantic similarity for plagiarism detection.
Combine embeddings with LLM prompts to detect paraphrasing effectively.
Async API calls let independent requests overlap, which raises throughput in large-scale or real-time systems.
Vector databases scale plagiarism detection across many documents.
Always secure your API key via environment variables.

Verified 2026-04 · gpt-4o-mini, text-embedding-3-small, text-embedding-3-large
