
Deduplication

Deduplication identifies files with identical content and resolves the duplicates without deleting anything. Non-canonical copies are moved to a staging area with full provenance metadata. You decide what happens to them.

Duplicate detection uses BLAKE3 content hashes computed during inventory. Two files with the same BLAKE3 hash have identical content, regardless of filename, path, or modification time.

Files are grouped by hash. Any group with more than one member is a duplicate group. Each group receives a group_id and is written to the duplicates table in SQLite.

See Content Hash as Identity for why content hashing is the correct foundation for deduplication.


Within each duplicate group, one file is selected as the canonical copy. The rest are non-canonical. The retention strategy is configurable in fialr.toml:

Strategy      | Rule                                                  | Use case
shortest-path | File with the shortest absolute path is canonical     | Prefer files closer to the root
oldest-mtime  | File with the earliest modification time is canonical | Prefer the original version
newest-mtime  | File with the latest modification time is canonical   | Prefer the most recently saved copy
[deduplication]
retention_strategy = "shortest-path"

The canonical file stays in place. It is never moved or modified during deduplication.


Non-canonical copies are moved to a _dupes/ directory under the target root:

~/Documents/_dupes/
2024-03-15_invoice_acme-corp.pdf
2024-03-15_invoice_acme-corp_1.pdf
budget-2024.xlsx

Each moved file receives provenance metadata:

Storage | Attribute                | Value
XATTR   | com.fialr.original_path  | Path before dedup move
XATTR   | com.fialr.original_name  | Filename before dedup move
XATTR   | com.fialr.job_uuid       | Deduplication job UUID
SQLite  | operations table         | Full before/after audit record
SQLite  | duplicates table         | Group ID, canonical flag, all member paths
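The SQLite side of this provenance record might look like the following sketch. The table schemas here are illustrative assumptions, not fialr's actual schema:

```python
import sqlite3

# Illustrative schema only; fialr's real operations and duplicates
# tables may differ. Sketches the audit record described above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS operations (
    job_uuid TEXT, original_path TEXT, new_path TEXT
);
CREATE TABLE IF NOT EXISTS duplicates (
    group_id TEXT, path TEXT, is_canonical INTEGER
);
"""

def record_dedup_move(conn, job_uuid, group_id, original_path, new_path):
    """Write the before/after audit row and the group membership row."""
    conn.execute(
        "INSERT INTO operations VALUES (?, ?, ?)",
        (job_uuid, original_path, new_path),
    )
    conn.execute(
        "INSERT INTO duplicates VALUES (?, ?, 0)",  # 0 = non-canonical
        (group_id, original_path),
    )
    conn.commit()
```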

The move follows the same hash verification discipline as reorganization: hash before, move, hash after, compare.


_dupes/ is a review area. Files placed there are not deleted, not scheduled for deletion, and not automatically purged. They remain indefinitely until you decide what to do with them.

Options:

  • Review and delete — inspect the contents, confirm they are true duplicates, delete manually
  • Restore — move a file back to its original location (original path is preserved in XATTRs)
  • Archive — move to an external archive or cold storage
  • Leave in place — _dupes/ is a valid long-term holding area if you are not ready to decide

fialr does not delete files. This is a design decision, not a limitation.


Beyond exact duplicates, the deduplication module identifies near-duplicates: files that appear to be versions of the same document. Near-duplicate detection looks for:

  • Files with similar names but different hashes (e.g., report.docx, report-v2.docx, report-final.docx)
  • Files in the same directory with sequential naming patterns
  • Files with the same base name across different directories

Near-duplicates are flagged in the output but not automatically moved. They appear in the job report as version sequences for manual review.
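A name-based heuristic of the kind described above might look like this sketch. The suffix patterns are illustrative assumptions, not fialr's actual detection rules:

```python
import re
from collections import defaultdict

# Heuristic sketch: strip common version suffixes from the stem and
# group files sharing a base name. Patterns here are assumptions.
VERSION_SUFFIX = re.compile(r"[-_ ]?(?:v\d+|final|copy|\d+|\(\d+\))$", re.IGNORECASE)

def base_name(filename: str) -> str:
    """Reduce a filename to its version-suffix-free base name."""
    stem, dot, _ext = filename.rpartition(".")
    if not dot:
        stem = filename
    while True:
        stripped = VERSION_SUFFIX.sub("", stem)
        if stripped == stem or not stripped:
            return stem.lower()
        stem = stripped

def version_sequences(filenames):
    """Group filenames into probable version sequences (len > 1)."""
    groups = defaultdict(list)
    for name in filenames:
        groups[base_name(name)].append(name)
    return [sorted(g) for g in groups.values() if len(g) > 1]
```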


fialr deduplicate ~/Documents

Deduplication reads the manifest and hash data, groups duplicates, selects canonical copies, and moves non-canonical files to _dupes/:

jobs/2026-03-11_deduplicate_a1b2c3d4/
log.json
report.md
checkpoint.json

Terminal output:

2,847 files scanned.
Duplicate groups: 134
Files deduplicated: 312
Canonical copies retained: 134
Near-duplicate sequences: 28
Space recovered: 1.4 GB (moved to _dupes/)

Duplicate groups are also written to the duplicates table in SQLite for programmatic access.
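Assuming a duplicates table with (group_id, path, is_canonical) columns — an illustrative layout, since the real schema may differ — a review query might look like:

```python
import sqlite3

# Review query over an assumed duplicates table with columns
# (group_id, path, is_canonical); fialr's real schema may differ.
def members_for_review(conn):
    """List non-canonical members, grouped and ordered for review."""
    return conn.execute(
        "SELECT group_id, path FROM duplicates "
        "WHERE is_canonical = 0 ORDER BY group_id, path"
    ).fetchall()
```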


After deduplication, run enrichment to improve filename quality and add structured metadata using local AI inference. Or run validation to verify file integrity across the deduplicated corpus.

For the full command reference, see fialr deduplicate.