
Deduplication

Deduplication identifies files with identical content and resolves the duplicates without deleting anything. Non-canonical copies are moved to a staging area with full provenance metadata. You decide what happens to them.

Duplicate detection uses BLAKE3 content hashes computed during inventory. Two files with the same BLAKE3 hash have identical content, regardless of filename, path, or modification time.

Files are grouped by hash. Any group with more than one member is a duplicate group. Each group receives a group_id and is written to the duplicates table in SQLite.

See Content Hash as Identity for why content hashing is the correct foundation for deduplication.


Within each duplicate group, one file is selected as the canonical copy. The rest are non-canonical. The retention strategy is configurable in fialr.toml:

Strategy      | Rule                                                  | Use case
shortest-path | File with the shortest absolute path is canonical     | Prefer files closer to the root
oldest-mtime  | File with the earliest modification time is canonical | Prefer the original version
newest-mtime  | File with the latest modification time is canonical   | Prefer the most recently saved copy
[deduplication]
retention_strategy = "shortest-path"

The canonical file stays in place. It is never moved or modified during deduplication.


Non-canonical copies are moved to a _dupes/ directory under the target root:

~/Documents/_dupes/
2024-03-15_invoice_acme-corp.pdf
2024-03-15_invoice_acme-corp_1.pdf
budget-2024.xlsx

Each moved file receives provenance metadata:

Storage | Attribute                | Value
XATTR   | com.fialr.original_path  | Path before dedup move
XATTR   | com.fialr.original_name  | Filename before dedup move
XATTR   | com.fialr.job_uuid       | Deduplication job UUID
SQLite  | operations table         | Full before/after audit record
SQLite  | duplicates table         | Group ID, canonical flag, all member paths
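The SQLite side of this provenance record might look like the following sketch. The table schemas here are illustrative assumptions, not fialr's actual schema:

```python
import sqlite3

# Illustrative schema only; fialr's real operations and duplicates
# tables may differ. Sketches the audit record described above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS operations (
    job_uuid TEXT, original_path TEXT, new_path TEXT
);
CREATE TABLE IF NOT EXISTS duplicates (
    group_id TEXT, path TEXT, is_canonical INTEGER
);
"""

def record_dedup_move(conn, job_uuid, group_id, original_path, new_path):
    """Write the before/after audit row and the group membership row."""
    conn.execute(
        "INSERT INTO operations VALUES (?, ?, ?)",
        (job_uuid, original_path, new_path),
    )
    conn.execute(
        "INSERT INTO duplicates VALUES (?, ?, 0)",  # 0 = non-canonical
        (group_id, original_path),
    )
    conn.commit()
```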

The move follows the same hash verification discipline as reorganization: hash before, move, hash after, compare.


_dupes/ is a review area. Files placed there are not deleted, not scheduled for deletion, and not automatically purged. They remain indefinitely until you decide what to do with them.

Options:

  • Review and delete — inspect the contents, confirm they are true duplicates, delete manually
  • Restore — move a file back to its original location (original path is preserved in XATTRs)
  • Archive — move to an external archive or cold storage
  • Leave in place — _dupes/ is a valid long-term holding area if you are not ready to decide

fialr does not delete files. This is a design decision, not a limitation.


Beyond exact duplicates, the deduplication module identifies near-duplicates: files that appear to be versions of the same document. Near-duplicate detection looks for:

  • Files with similar names but different hashes (e.g., report.docx, report-v2.docx, report-final.docx)
  • Files in the same directory with sequential naming patterns
  • Files with the same base name across different directories

Near-duplicates are flagged in the output but not automatically moved. They appear in the job report as version sequences for manual review.
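A name-based heuristic of the kind described above might look like this sketch. The suffix patterns are illustrative assumptions, not fialr's actual detection rules:

```python
import re
from collections import defaultdict

# Heuristic sketch: strip common version suffixes from the stem and
# group files sharing a base name. Patterns here are assumptions.
VERSION_SUFFIX = re.compile(r"[-_ ]?(?:v\d+|final|copy|\d+|\(\d+\))$", re.IGNORECASE)

def base_name(filename: str) -> str:
    """Reduce a filename to its version-suffix-free base name."""
    stem, dot, _ext = filename.rpartition(".")
    if not dot:
        stem = filename
    while True:
        stripped = VERSION_SUFFIX.sub("", stem)
        if stripped == stem or not stripped:
            return stem.lower()
        stem = stripped

def version_sequences(filenames):
    """Group filenames into probable version sequences (len > 1)."""
    groups = defaultdict(list)
    for name in filenames:
        groups[base_name(name)].append(name)
    return [sorted(g) for g in groups.values() if len(g) > 1]
```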


fialr deduplicate ~/Documents

Deduplication reads the manifest and hash data, groups duplicates, selects canonical copies, and moves non-canonical files to _dupes/:

jobs/2026-03-11_deduplicate_a1b2c3d4/
log.json
report.md
checkpoint.json

Terminal output:

2,847 files scanned.
Duplicate groups: 134
Files deduplicated: 312
Canonical copies retained: 134
Near-duplicate sequences: 28
Space recovered: 1.4 GB (moved to _dupes/)

Duplicate groups are also written to the duplicates table in SQLite for programmatic access.
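Assuming a duplicates table with (group_id, path, is_canonical) columns — an illustrative layout, since the real schema may differ — a review query might look like:

```python
import sqlite3

# Review query over an assumed duplicates table with columns
# (group_id, path, is_canonical); fialr's real schema may differ.
def members_for_review(conn):
    """List non-canonical members, grouped and ordered for review."""
    return conn.execute(
        "SELECT group_id, path FROM duplicates "
        "WHERE is_canonical = 0 ORDER BY group_id, path"
    ).fetchall()
```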


After deduplication, run enrichment to improve filename quality and add structured metadata using local AI inference. Or run validation to verify file integrity across the deduplicated corpus.

For the full command reference, see fialr deduplicate.