Image Similarity Detection
Perceptual hashing finds visually similar images -- duplicates, near-duplicates, and related images -- even when they differ in resolution, compression, or minor edits.
How It Works
Perceptual Hashing
Unlike cryptographic hashes (SHA-1, MD5), which produce completely different outputs for any change to the input, perceptual hashes produce similar outputs for visually similar images.
Two types of perceptual hashes are computed:
| Hash Type | Description |
|---|---|
| Average Hash (aHash) | Compares the brightness of each image block against the overall average brightness |
| Difference Hash (dHash) | Compares brightness gradients between adjacent pixels |
The difference hash (dHash) is used for similarity comparison because it tolerates small changes (crops, compression, resizing) better than average hash.
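To make the two hash types concrete, here is a minimal sketch of how aHash and dHash are commonly computed from a small grayscale grid. The grid sizes (8x8 and 9x8), the bit order, and the function names are assumptions for illustration; the actual implementation may differ in all of these details.

```python
# Sketch only: grid sizes and bit order are assumptions, not the project's code.

def average_hash(pixels):
    """aHash: one bit per cell, set when the cell is brighter than the mean.

    `pixels` is an 8x8 grid (list of rows) of grayscale values, assumed to be
    the image after downscaling.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def difference_hash(pixels):
    """dHash: one bit per horizontal neighbor pair, set when the left cell
    is brighter than the right one.

    `pixels` has 8 rows of 9 columns, giving 8 comparisons per row (64 bits).
    """
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


# A gradient that darkens from left to right, and a uniformly brightened copy.
gradient = [[255 - 28 * x for x in range(9)] for _ in range(8)]
brighter = [[min(255, p + 20) for p in row] for row in gradient]

print(hex(difference_hash(gradient)))                              # 0xffffffffffffffff
print(difference_hash(gradient) == difference_hash(brighter))      # True
```

Because dHash only records whether brightness rises or falls between neighbors, the uniformly brightened copy produces the same hash in this example.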
Hamming Distance
Similarity is measured by Hamming distance: the number of bits that differ between two hashes. A lower distance means the images are more similar:
| Hamming Distance | Interpretation |
|---|---|
| 0 | Identical images (perceptually) |
| 1-5 | Near-duplicates (same image, minor edits) |
| 6-10 | Similar images (same subject, different versions) |
| 11-15 | Loosely related (similar composition) |
| 16+ | Different images |
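As a rough illustration, the distance can be computed by XOR-ing two 64-bit hashes and counting the set bits. The function names below are hypothetical, and the bands simply restate the table above.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two 64-bit hashes."""
    # The mask keeps the result correct for negative (signed int64) values.
    return bin((a ^ b) & 0xFFFFFFFFFFFFFFFF).count("1")


def interpret(distance: int) -> str:
    """Map a distance to the interpretation bands from the table above."""
    if distance == 0:
        return "identical"
    if distance <= 5:
        return "near-duplicate"
    if distance <= 10:
        return "similar"
    if distance <= 15:
        return "loosely related"
    return "different"


a = 0b1011_0000
b = 0b1001_0001
d = hamming_distance(a, b)
print(d, interpret(d))  # 2 near-duplicate
```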
Background Hash Worker
A background worker automatically processes images and calculates their hashes.
What Gets Processed
The hash worker processes resources with these content types:
- image/jpeg
- image/png
- image/gif
- image/webp
Other file types are skipped.
Processing Flow
- Batch discovery - The worker finds images without hashes
- Hash calculation - Workers compute aHash and dHash for each image
- Cache update - New hashes are added to the in-memory cache
- Similarity detection - New hashes are compared against all cached hashes
- Persistence - Similar pairs are stored in the database
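The cache update and similarity detection steps above can be sketched as a simple loop. This is a simplified illustration, not the worker's actual code: the function and parameter names are hypothetical, the cache is a plain dictionary, and the returned list stands in for the pairs that would be persisted.

```python
def hamming_distance(a, b):
    return bin((a ^ b) & 0xFFFFFFFFFFFFFFFF).count("1")


def process_batch(new_hashes, cache, threshold=10):
    """Add each new hash to the cache and return the similar pairs found.

    new_hashes: {resource_id: dhash} for images hashed in this batch
    cache:      {resource_id: dhash} of known hashes (mutated in place)
    Returns (id_a, id_b, distance) tuples with distance <= threshold.
    """
    pairs = []
    for rid, h in new_hashes.items():
        cache[rid] = h  # cache update
        for other_id, other_hash in cache.items():
            if other_id == rid:
                continue  # never compare an image with itself
            d = hamming_distance(h, other_hash)
            if d <= threshold:
                pairs.append((other_id, rid, d))
    return pairs


cache = {1: 0b0000}
found = process_batch({2: 0b0011, 3: 0xFFFFFFFF}, cache)
print(found)  # [(1, 2, 2)]
```

Comparing every new hash against the whole cache is linear in the cache size per image, which is one reason the cache is kept in memory.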
Worker Configuration
Configure the hash worker using command-line flags or environment variables:
| Flag | Env Variable | Default | Description |
|---|---|---|---|
| -hash-worker-count | HASH_WORKER_COUNT | 4 | Concurrent workers |
| -hash-batch-size | HASH_BATCH_SIZE | 500 | Images per batch |
| -hash-poll-interval | HASH_POLL_INTERVAL | 1m | Time between batches |
| -hash-similarity-threshold | HASH_SIMILARITY_THRESHOLD | 10 | Max Hamming distance |
| -hash-worker-disabled | HASH_WORKER_DISABLED=1 | false | Disable entirely |
| -hash-cache-size | HASH_CACHE_SIZE | 100000 | Maximum entries in the LRU similarity cache |
Tuning Examples
High-performance setup (fast processing, stricter matching):
./mahresources \
-hash-worker-count=8 \
-hash-batch-size=1000 \
-hash-poll-interval=30s \
-hash-similarity-threshold=8 \
...
Resource-constrained setup (slower, gentler on resources):
./mahresources \
-hash-worker-count=1 \
-hash-batch-size=100 \
-hash-poll-interval=5m \
...
Disabled (no background processing):
./mahresources -hash-worker-disabled ...
Similarity Threshold Configuration
The -hash-similarity-threshold setting controls how similar images must be to be considered matches:
| Threshold | Effect |
|---|---|
| 5 | Strict - only near-identical images match |
| 10 (default) | Balanced - finds similar images with variations |
| 15 | Loose - includes more distant matches |
| 20+ | Very loose - may include false positives |
Choose based on your use case:
- Deduplication - Use a low threshold (5-8) to find true duplicates
- Related images - Use the default (10) for variations such as crops and resizes
- Broad discovery - Use a higher threshold (12-15) to find related content
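To see how the threshold changes what counts as a match, the sketch below filters a set of made-up distances at several thresholds. The candidate names and distances are invented for illustration and are not measurements from real images.

```python
# Hypothetical distances between one image and six candidates.
distances = {
    "candidate-a": 2,
    "candidate-b": 4,
    "candidate-c": 9,
    "candidate-d": 12,
    "candidate-e": 14,
    "candidate-f": 27,
}


def matches(distances, threshold):
    """Return the candidates whose distance is within the threshold."""
    return [name for name, d in distances.items() if d <= threshold]


for t in (5, 10, 15, 20):
    print(t, len(matches(distances, t)))
```

With these sample values, moving from 5 to 15 more than doubles the number of matches, which is the trade-off between precision and recall that the threshold controls.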
Viewing Similar Images
On any resource's detail page, if similar images exist, you will see a Similar Resources section showing:
- Thumbnails of all similar images
- Links to each similar resource
- A form to merge similar resources into one
Finding Images with Similarities
Use the resource search with the filter:
/resources?ShowWithSimilar=true
This shows only resources that have at least one similar image detected.
Merging Duplicates
When you find duplicates, you can merge them:
- Navigate to the resource you want to keep (the "winner")
- Find the Similar Resources section
- Click Merge Others To This
- Confirm the action
Merging:
- Keeps the winner resource with all its metadata
- Transfers all tags, notes, and group associations from merged resources
- Deletes the merged resources
- Preserves the winner's version history
Merging is permanent -- the merged resources are deleted. Verify that the winner resource is the one you intend to keep.
Cache Warming
At startup, the hash worker loads existing hashes into the LRU cache in pages of up to 50,000 entries. This pre-populates the cache so similarity detection is effective immediately without waiting for a full batch cycle.
The cache size is controlled by -hash-cache-size (default: 100,000 entries). If your collection exceeds this limit, the least recently used entries are evicted and may not participate in similarity comparisons until they cycle back through batch processing.
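The paged warm-up and LRU eviction described above can be sketched as follows. This is an illustrative model only: the class, the `warm` function, and the `load_page(offset, limit)` loader are hypothetical names, and the real worker may stop loading or order pages differently.

```python
from collections import OrderedDict


class HashCache:
    """Minimal LRU cache mapping resource IDs to hashes."""

    def __init__(self, max_entries=100_000):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def put(self, resource_id, dhash):
        self._entries[resource_id] = dhash
        self._entries.move_to_end(resource_id)      # mark as most recently used
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)        # evict the least recently used

    def __len__(self):
        return len(self._entries)

    def __contains__(self, resource_id):
        return resource_id in self._entries


def warm(cache, load_page, page_size=50_000):
    """Fill the cache page by page.

    `load_page(offset, limit)` is a hypothetical loader that returns a list
    of (resource_id, dhash) rows, shorter than `limit` on the last page.
    """
    offset = 0
    while True:
        rows = load_page(offset, page_size)
        for rid, h in rows:
            cache.put(rid, h)
        if len(rows) < page_size:
            break
        offset += len(rows)


# Seven stored hashes, a cache that holds five, pages of three rows.
stored = [(i, i) for i in range(7)]
cache = HashCache(max_entries=5)
warm(cache, lambda off, lim: stored[off:off + lim], page_size=3)
print(len(cache), 0 in cache, 6 in cache)  # 5 False True
```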
Failed Hash Handling
If hashing fails for a Resource (corrupt image, unsupported encoding), the worker stores an empty hash record. This prevents the Resource from being retried on every batch cycle.
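A minimal sketch of this idea, assuming hash records live in a dictionary and that an empty record is represented by `None`. The `compute_hash` stand-in and all names here are hypothetical; the actual storage and error handling are not shown in this document.

```python
def compute_hash(data):
    """Stand-in hasher that raises on undecodable input (hypothetical)."""
    if not data:
        raise ValueError("cannot decode image")
    return sum(data) & 0xFFFFFFFFFFFFFFFF  # placeholder, not a perceptual hash


def process_unhashed(resources, hash_records):
    """Hash every resource without a record; return the IDs attempted."""
    attempted = []
    for rid, data in resources.items():
        if rid in hash_records:       # already hashed, or already failed
            continue
        attempted.append(rid)
        try:
            hash_records[rid] = compute_hash(data)
        except ValueError:
            hash_records[rid] = None  # empty record: skip on later cycles
    return attempted


resources = {1: b"\x01\x02", 2: b""}  # resource 2 is "corrupt"
records = {}
print(process_unhashed(resources, records))  # [1, 2]
print(process_unhashed(resources, records))  # []
```

The second call attempts nothing because the failed resource already has a record, which is the behavior the empty hash record is meant to guarantee.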
Memory Considerations
The hash worker maintains an in-memory LRU cache of image hashes for fast similarity lookups. Memory usage depends on your image count and the -hash-cache-size setting:
| Image Count | Estimated Cache Size |
|---|---|
| 10,000 | ~0.2 MB |
| 100,000 | ~2.4 MB |
| 1,000,000 | ~24 MB |
| 10,000,000 | ~240 MB |
These estimates correspond to roughly 24 bytes per cached entry, so even a cache holding 10 million entries stays under 250 MB. The cache is loaded at startup and updated incrementally.
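The table values can be reproduced with simple arithmetic. The per-entry figure of 24 bytes is inferred from the table itself (2.4 MB for 100,000 entries) and is an assumption about the estimate, not a measured value.

```python
BYTES_PER_ENTRY = 24  # inferred from the table: 2.4 MB / 100,000 entries


def estimated_cache_mb(entries: int) -> float:
    """Estimate cache memory in megabytes (1 MB = 1,000,000 bytes)."""
    return entries * BYTES_PER_ENTRY / 1_000_000


for n in (10_000, 100_000, 1_000_000, 10_000_000):
    print(f"{n:>10,} entries -> ~{estimated_cache_mb(n):g} MB")
```

Since the cache never holds more than -hash-cache-size entries, passing that value to the function gives an upper bound for a given configuration.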
On-Upload Processing
When you upload a new image, it is queued for immediate hash processing. This means:
- Upload completes and resource is created
- Resource ID is added to the hash queue
- Worker processes it (usually within seconds)
- Similar images appear on the resource page
If the queue is full (1000 items), new uploads fall back to batch processing on the next poll interval.
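The queue-with-fallback behavior can be modeled with a bounded queue and a non-blocking put. The function name is hypothetical and this sketch only illustrates the pattern, not the worker's real queue.

```python
import queue

hash_queue = queue.Queue(maxsize=1000)  # capacity taken from the text above


def enqueue_for_hashing(q, resource_id):
    """Try to queue an upload for immediate hashing.

    Returns True if queued. Returns False if the queue is full, in which
    case the resource is left for the next batch cycle to discover.
    """
    try:
        q.put_nowait(resource_id)
        return True
    except queue.Full:
        return False


results = [enqueue_for_hashing(hash_queue, rid) for rid in range(1001)]
print(results.count(True), results.count(False))  # 1000 1
```

A non-blocking put keeps uploads fast: a full queue never delays the upload itself, it only delays when the hash is computed.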
Hash Migration
If you have images that were uploaded before hash calculation was available, the hash worker automatically processes them during its batch cycles. No manual intervention is required.
The worker also migrates hash format changes transparently. The current storage format uses int64 for efficient Hamming distance calculation. Legacy string-format hashes are still supported and are migrated automatically, so no action is required.
Troubleshooting
Similar images not appearing
- Check that the hash worker is running (not disabled)
- Wait for the next poll interval
- Check logs for processing errors
- Verify the image format is supported
Too many false positives
Lower the similarity threshold:
./mahresources -hash-similarity-threshold=6 ...
Missing obvious duplicates
Raise the similarity threshold:
./mahresources -hash-similarity-threshold=15 ...
High memory usage
If the hash cache is too large:
- Reduce the cache size: -hash-cache-size=50000
- Disable the worker if not needed: -hash-worker-disabled
- Add more system memory