DuplicateDocumentError
haystack.document_stores.base.DuplicateDocumentError
Stack trace
Traceback (most recent call last):
  File "app.py", line 42, in <module>
    document_store.write_documents(docs)
  File "/usr/local/lib/python3.9/site-packages/haystack/document_stores/base.py", line 210, in write_documents
    raise DuplicateDocumentError("Duplicate document detected based on write policy")
haystack.document_stores.base.DuplicateDocumentError: Duplicate document detected based on write policy
Why it happens
Haystack document stores enforce a write policy to prevent duplicate documents based on document IDs or content hashes. When you try to write documents that already exist and the policy forbids duplicates, this error is raised. This protects data integrity but requires careful handling of document IDs and write policies.
Detection
Wrap document_store.write_documents calls in a try/except block and log each DuplicateDocumentError together with the IDs of the documents in the batch, so that duplicates can be traced instead of crashing the process.
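The logging described above can be sketched as follows. This is a minimal illustration, not Haystack's own code: DuplicateDocumentError is redefined locally as a stand-in so the example runs without Haystack installed, and AlwaysDuplicateStore and write_and_log are hypothetical names. In a real pipeline you would import the exception from Haystack and pass in your actual document store.

```python
import logging

logger = logging.getLogger("ingestion")


class DuplicateDocumentError(Exception):
    """Local stand-in (assumption) for Haystack's exception class."""


class AlwaysDuplicateStore:
    """Hypothetical test double whose writes are always rejected."""

    def write_documents(self, docs):
        raise DuplicateDocumentError(
            "Duplicate document detected based on write policy"
        )


def write_and_log(store, docs):
    """Write a batch; on a duplicate error, log the batch IDs and report failure."""
    try:
        store.write_documents(docs)
        return True
    except DuplicateDocumentError as exc:
        batch_ids = [doc.get("id") for doc in docs]
        logger.warning("Duplicate write rejected: %s (batch IDs: %s)", exc, batch_ids)
        return False


ok = write_and_log(AlwaysDuplicateStore(), [{"id": "doc1", "content": "text"}])
```

Returning a boolean instead of re-raising keeps the ingestion loop alive. Re-raise after logging if a rejected batch should stop the run.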
Causes & fixes
Cause: Documents are written with IDs that already exist in the document store while the write policy is set to 'fail'.
Fix: Change the write policy to 'overwrite' or 'skip' in the document store configuration, or ensure new documents have unique IDs.
Cause: The store identifies duplicates through content hashing, so documents with identical content are treated as the same document.
Fix: Modify documents to have unique content, assign explicit unique IDs, or disable content-based duplicate detection if appropriate (in Haystack 1.x, the Document id_hash_keys argument can include metadata in the hash).
Cause: Multiple parallel writes race and attempt to insert the same document.
Fix: Synchronize writes or add retry logic so that the same document is not written concurrently.
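For the parallel-write case, one possible form of write synchronization is a process-wide lock around write_documents. The sketch below is an assumption-laden illustration: SkipStore is a hypothetical in-memory double with check-then-insert behaviour, and synchronized_write is a made-up helper name. A lock like this only protects threads inside one process; writers in separate processes or on separate machines would need coordination in the backing database instead.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One lock shared by every writer thread in this process.
_write_lock = threading.Lock()


class SkipStore:
    """Hypothetical double: checks for existing IDs, then inserts the rest."""

    def __init__(self):
        self.rows = []

    def write_documents(self, docs):
        existing = {row["id"] for row in self.rows}
        self.rows.extend(doc for doc in docs if doc["id"] not in existing)


def synchronized_write(store, docs):
    """Allow only one thread at a time to run the check-then-insert sequence."""
    with _write_lock:
        store.write_documents(docs)


store = SkipStore()
doc = {"id": "doc1", "content": "text"}

# Eight workers try to write the same document; the pool waits for all of them on exit.
with ThreadPoolExecutor(max_workers=8) as pool:
    for _ in range(8):
        pool.submit(synchronized_write, store, [doc])
```

Because the check and the insert happen under the same lock, no two threads can both conclude that "doc1" is missing and insert it twice.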
Code: broken vs fixed
# Broken: with the 'fail' policy, writing an ID that is already in the store raises
from haystack.document_stores import FAISSDocumentStore
store = FAISSDocumentStore(duplicate_documents='fail')
docs = [{"id": "doc1", "content": "text"}]
store.write_documents(docs)
store.write_documents(docs)  # This line raises DuplicateDocumentError: "doc1" already exists

# Fixed: with the 'overwrite' policy, the existing document is replaced
from haystack.document_stores import FAISSDocumentStore
store = FAISSDocumentStore(duplicate_documents='overwrite')  # Changed write policy to overwrite
docs = [{"id": "doc1", "content": "text"}]
store.write_documents(docs)
store.write_documents(docs)  # Now overwrites the duplicate without error
print("Documents written successfully")
Workaround
Catch DuplicateDocumentError exceptions around write_documents calls, then filter out or rename duplicate documents before retrying the write operation.
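The catch, filter and retry workaround might look like the sketch below. It runs without Haystack: DuplicateDocumentError is a local stand-in, and FailPolicyStore, existing_ids and write_new_only are hypothetical names introduced for illustration. With a real Haystack 1.x store, a lookup such as get_documents_by_id could plausibly fill the role of existing_ids, but treat that as an assumption to verify against your version.

```python
class DuplicateDocumentError(Exception):
    """Local stand-in (assumption) for Haystack's exception class."""


class FailPolicyStore:
    """Hypothetical in-memory double that mimics a 'fail' write policy."""

    def __init__(self):
        self.rows = {}

    def existing_ids(self, ids):
        """Return the subset of ids that are already stored."""
        return [doc_id for doc_id in ids if doc_id in self.rows]

    def write_documents(self, docs):
        clashes = self.existing_ids([doc["id"] for doc in docs])
        if clashes:
            raise DuplicateDocumentError(f"already in store: {clashes}")
        self.rows.update({doc["id"]: doc for doc in docs})


def write_new_only(store, docs):
    """Try the whole batch; on a duplicate error, retry with unseen IDs only."""
    try:
        store.write_documents(docs)
        return docs
    except DuplicateDocumentError:
        taken = set(store.existing_ids([doc["id"] for doc in docs]))
        fresh = [doc for doc in docs if doc["id"] not in taken]
        if fresh:
            store.write_documents(fresh)
        return fresh


store = FailPolicyStore()
store.write_documents([{"id": "doc1", "content": "text"}])
written = write_new_only(
    store,
    [{"id": "doc1", "content": "text"}, {"id": "doc2", "content": "more"}],
)
```

The helper returns the documents that were actually written, so the caller can report which ones were dropped as duplicates.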
Prevention
Design your document ingestion pipeline to assign unique IDs, and choose a duplicate write policy ('overwrite' or 'skip') that matches how re-ingested documents should be handled, so that duplicate writes do not raise errors.
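One way to assign unique IDs at ingestion time is to derive them from the source path plus the content, and to drop repeats before the batch reaches the store. This is a sketch under assumptions: make_id and prepare_batch are hypothetical helpers, and SHA-256 over source and content is just one reasonable ID scheme, not something Haystack prescribes.

```python
import hashlib


def make_id(source, content):
    """Build a deterministic ID from where the text came from and what it says."""
    digest = hashlib.sha256(f"{source}\n{content}".encode("utf-8")).hexdigest()
    return digest[:32]


def prepare_batch(records):
    """Turn (source, content) pairs into document dicts, keeping the first copy of each ID."""
    docs = {}
    for source, content in records:
        doc_id = make_id(source, content)
        docs.setdefault(
            doc_id,
            {"id": doc_id, "content": content, "meta": {"source": source}},
        )
    return list(docs.values())


batch = prepare_batch(
    [("a.txt", "text"), ("b.txt", "text"), ("a.txt", "text")]
)
```

Identical text from two different files gets two distinct IDs, while re-ingesting the same file yields the same ID, which lets an 'overwrite' or 'skip' policy handle it predictably.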