# Deduplication
Deduplication identifies files with identical content and resolves the duplicates without deleting anything. Non-canonical copies are moved to a staging area with full provenance metadata. You decide what happens to them.
## How duplicates are identified

Duplicate detection uses BLAKE3 content hashes computed during inventory. Two files with the same BLAKE3 hash have identical content, regardless of filename, path, or modification time.
Files are grouped by hash. Any group with more than one member is a duplicate group. Each group receives a `group_id` and is written to the `duplicates` table in SQLite.
See Content Hash as Identity for why content hashing is the correct foundation for deduplication.
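The grouping step can be sketched in a few lines of Python. This is a minimal illustration, not fialr's implementation: Python's standard library has no BLAKE3, so `hashlib.blake2b` stands in for it here, and `duplicate_groups` is a hypothetical helper.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def hash_file(path: Path) -> str:
    """Hash file content in chunks. blake2b stands in for BLAKE3,
    which is not in the Python standard library."""
    h = hashlib.blake2b()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def duplicate_groups(paths):
    """Group paths by content hash; keep only groups with more than
    one member (the duplicate groups)."""
    groups = defaultdict(list)
    for p in paths:
        groups[hash_file(p)].append(p)
    return {h: ps for h, ps in groups.items() if len(ps) > 1}
```

Filename, path, and modification time play no part in the grouping: only the content hash does.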
## Canonical selection

Within each duplicate group, one file is selected as the canonical copy. The rest are non-canonical. The retention strategy is configurable in `fialr.toml`:
| Strategy | Rule | Use case |
|---|---|---|
| `shortest-path` | File with the shortest absolute path is canonical | Prefer files closer to the root |
| `oldest-mtime` | File with the earliest modification time is canonical | Prefer the original version |
| `newest-mtime` | File with the latest modification time is canonical | Prefer the most recently saved copy |
```toml
[deduplication]
retention_strategy = "shortest-path"
```

The canonical file stays in place. It is never moved or modified during deduplication.
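The three strategies reduce to simple `min`/`max` selections over a duplicate group. A sketch, where `select_canonical` is an illustrative helper rather than fialr's code:

```python
import os


def select_canonical(paths, strategy="shortest-path"):
    """Pick the canonical member of a duplicate group according to
    the configured retention strategy."""
    if strategy == "shortest-path":
        # Shortest absolute path wins: closest to the root.
        return min(paths, key=lambda p: len(os.path.abspath(p)))
    if strategy == "oldest-mtime":
        # Earliest modification time wins: likely the original.
        return min(paths, key=os.path.getmtime)
    if strategy == "newest-mtime":
        # Latest modification time wins: most recently saved copy.
        return max(paths, key=os.path.getmtime)
    raise ValueError(f"unknown retention strategy: {strategy}")
```

All other members of the group become non-canonical and are handled as described below.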
## Non-canonical handling

Non-canonical copies are moved to a `_dupes/` directory under the target root:
```
~/Documents/_dupes/
  2024-03-15_invoice_acme-corp.pdf
  2024-03-15_invoice_acme-corp_1.pdf
  budget-2024.xlsx
```

Each moved file receives provenance metadata:
| Storage | Attribute | Value |
|---|---|---|
| XATTR | `com.fialr.original_path` | Path before dedup move |
| XATTR | `com.fialr.original_name` | Filename before dedup move |
| XATTR | `com.fialr.job_uuid` | Deduplication job UUID |
| SQLite | `operations` table | Full before/after audit record |
| SQLite | `duplicates` table | Group ID, canonical flag, all member paths |
The move follows the same hash verification discipline as reorganization: hash before, move, hash after, compare.
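That discipline can be sketched as follows. `verified_move` is a hypothetical helper illustrating the hash-before, move, hash-after, compare sequence; `hashlib.blake2b` again stands in for BLAKE3.

```python
import hashlib
import shutil
from pathlib import Path


def hash_file(path: Path) -> str:
    """Chunked content hash; blake2b stands in for BLAKE3 here."""
    h = hashlib.blake2b()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def verified_move(src: Path, dest: Path) -> None:
    """Hash before, move, hash after, compare. A mismatch means the
    content changed in transit and should abort the job."""
    before = hash_file(src)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dest))
    after = hash_file(dest)
    if before != after:
        raise RuntimeError(f"hash mismatch after move: {src} -> {dest}")
```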
## `_dupes/` is staging, not trash

`_dupes/` is a review area. Files placed there are not deleted, not scheduled for deletion, and not automatically purged. They remain indefinitely until you decide what to do with them.
Options:
- Review and delete — inspect the contents, confirm they are true duplicates, delete manually
- Restore — move a file back to its original location (original path is preserved in XATTRs)
- Archive — move to an external archive or cold storage
- Leave in place — `_dupes/` is a valid long-term holding area if you are not ready to decide
fialr does not delete files. This is a design decision, not a limitation.
## Near-duplicate detection

Beyond exact duplicates, the deduplication module identifies near-duplicates: files that appear to be versions of the same document. Near-duplicate detection looks for:
- Files with similar names but different hashes (e.g., `report.docx`, `report-v2.docx`, `report-final.docx`)
- Files in the same directory with sequential naming patterns
- Files with the same base name across different directories
Near-duplicates are flagged in the output but not automatically moved. They appear in the job report as version sequences for manual review.
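A name-based heuristic along these lines can surface version sequences. The suffix pattern below is an illustrative guess at common versioning conventions, not fialr's actual rules:

```python
import re
from collections import defaultdict
from pathlib import Path

# Common version-ish suffixes: -v2, -final, -draft, -copy, trailing digits.
VERSION_SUFFIX = re.compile(r"[-_ ]?(?:v\d+|final|draft|copy|\d+)$", re.IGNORECASE)


def version_key(path: Path) -> str:
    """Strip a version suffix from the stem: report-v2 -> report."""
    return VERSION_SUFFIX.sub("", path.stem).lower()


def version_sequences(paths):
    """Group files whose names differ only by a version suffix."""
    groups = defaultdict(list)
    for p in paths:
        groups[version_key(p)].append(p)
    return {k: ps for k, ps in groups.items() if len(ps) > 1}
```

Because the real files have different hashes, this stays a flag-for-review step: only a human can say whether `report-final.docx` supersedes `report.docx`.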
## Running deduplication

```
fialr deduplicate ~/Documents
```

Deduplication reads the manifest and hash data, groups duplicates, selects canonical copies, and moves non-canonical files to `_dupes/`:
```
jobs/2026-03-11_deduplicate_a1b2c3d4/
  log.json
  report.md
  checkpoint.json
```

Terminal output:
```
2,847 files scanned.
Duplicate groups: 134
Files deduplicated: 312
Canonical copies retained: 134
Near-duplicate sequences: 28
Space recovered: 1.4 GB (moved to _dupes/)
```

Duplicate groups are also written to the `duplicates` table in SQLite for programmatic access.
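Programmatic access can be a single query against that table. The column names below (`group_id`, `path`, `is_canonical`) are assumed for illustration; check the actual schema in your fialr database:

```python
import sqlite3


def load_duplicate_groups(db_path: str):
    """Return {group_id: (canonical_path, [non_canonical_paths])}.
    Column names are assumed; verify against the real schema."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT group_id, path, is_canonical FROM duplicates ORDER BY group_id"
        ).fetchall()
    finally:
        con.close()
    groups = {}
    for group_id, path, is_canonical in rows:
        canonical, members = groups.setdefault(group_id, (None, []))
        if is_canonical:
            groups[group_id] = (path, members)
        else:
            members.append(path)
    return groups
```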
## What comes next

After deduplication, run enrichment to improve filename quality and add structured metadata using local AI inference. Or run validation to verify file integrity across the deduplicated corpus.
For the full command reference, see `fialr deduplicate`.