Image Similarity Detection

Perceptual hashing finds visually similar images -- duplicates, near-duplicates, and related images -- even when they differ in resolution, compression, or minor edits.

How It Works

Perceptual Hashing

Unlike cryptographic hashes (SHA-1, MD5), which produce completely different outputs for any change to the input, perceptual hashes produce similar outputs for visually similar images.

Two types of perceptual hashes are computed:

| Hash Type | Description |
|---|---|
| Average Hash (aHash) | Compares the average brightness of image blocks |
| Difference Hash (dHash) | Compares brightness gradients between adjacent pixels |

The difference hash (dHash) is used for similarity comparison because it tolerates small changes (crops, compression, resizing) better than average hash.
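
To make the two hash types concrete, here is a minimal sketch of a common textbook formulation of each. It assumes the image has already been converted to grayscale and downscaled (9x8 for dHash, 8x8 for aHash); the grid sizes, bit order, and function names are assumptions for illustration and may not match what mahresources does internally.

```python
def dhash(gray):
    """Difference hash of a grayscale grid with 8 rows and 9 columns.

    Each bit records whether a pixel is brighter than its right-hand neighbour,
    giving 8 bits per row and 64 bits in total.
    """
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def ahash(gray):
    """Average hash of an 8x8 grayscale grid: one bit per pixel above the mean."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


# Illustrative input: a horizontal gradient that darkens from left to right.
gradient = [[255 - 20 * x for x in range(9)] for _ in range(8)]
# The same picture with every pixel made slightly brighter (clamped at 255).
brighter = [[min(255, p + 10) for p in row] for row in gradient]

print(f"{dhash(gradient):016x}")            # ffffffffffffffff
print(dhash(gradient) == dhash(brighter))   # True: the gradients are unchanged
```

Because dHash only looks at whether each pixel is brighter than its neighbour, a uniform brightness shift leaves the hash untouched in this example.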

Hamming Distance

Similarity is measured by Hamming distance: the number of bits that differ between two hashes. A lower distance means the images are more similar:

| Hamming Distance | Interpretation |
|---|---|
| 0 | Identical images (perceptually) |
| 1-5 | Near-duplicates (same image, minor edits) |
| 6-10 | Similar images (same subject, different versions) |
| 11-15 | Loosely related (similar composition) |
| 16+ | Different images |
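
The distance itself is cheap to compute: XOR the two hashes and count the set bits. The sketch below assumes 64-bit hashes held as Python integers; the function names and the label strings are illustrative, with the bands taken from the table above.

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF


def hamming_distance(a, b):
    """Number of differing bits between two 64-bit hashes.

    Masking to 64 bits keeps the count correct when hashes are stored as
    signed integers and therefore may be negative.
    """
    return bin((a ^ b) & MASK_64).count("1")


def interpret(distance):
    """Map a distance to the interpretation bands from the table."""
    if distance == 0:
        return "identical"
    if distance <= 5:
        return "near-duplicate"
    if distance <= 10:
        return "similar"
    if distance <= 15:
        return "loosely related"
    return "different"


d = hamming_distance(0b1011, 0b0010)
print(d, interpret(d))  # 2 near-duplicate
```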

Background Hash Worker

A background worker automatically processes images and calculates their hashes.

What Gets Processed

The hash worker processes resources with these content types:

  • image/jpeg
  • image/png
  • image/gif
  • image/webp

Other file types are skipped.

Processing Flow

  1. Batch discovery - The worker finds images without hashes
  2. Hash calculation - Workers compute aHash and dHash for each image
  3. Cache update - New hashes are added to the in-memory cache
  4. Similarity detection - New hashes are compared against all cached hashes
  5. Persistence - Similar pairs are stored in the database
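
The steps above can be sketched as a single function. This is a simplified, sequential illustration rather than the worker's actual code: `process_batch`, `compute_hash`, and the plain dictionary standing in for the cache are hypothetical names, and the real worker runs several workers concurrently.

```python
def hamming_distance(a, b):
    """Number of differing bits between two 64-bit hashes."""
    return bin((a ^ b) & 0xFFFFFFFFFFFFFFFF).count("1")


def process_batch(unhashed, compute_hash, cache, threshold=10):
    """Hash a batch of images and return the similar pairs that were found.

    unhashed     -- iterable of (resource_id, image) without hashes (step 1)
    compute_hash -- callable returning a 64-bit hash for an image (step 2)
    cache        -- dict mapping resource_id to hash, updated in place (step 3)
    """
    pairs = []
    for resource_id, image in unhashed:
        new_hash = compute_hash(image)
        cache[resource_id] = new_hash
        # Step 4: compare the new hash against every other cached hash.
        for other_id, other_hash in cache.items():
            if other_id == resource_id:
                continue
            distance = hamming_distance(new_hash, other_hash)
            if distance <= threshold:
                pairs.append((resource_id, other_id, distance))
    # Step 5: the caller would persist these pairs to the database.
    return pairs


cache = {1: 0b0000}
batch = [(2, 0b0001), (3, 0xFFFFFFFF)]
print(process_batch(batch, lambda image: image, cache))  # [(2, 1, 1)]
```

In the usage example the "images" are already integers, so the identity function stands in for hashing; resource 3 differs from both others by more than 10 bits and produces no pair.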

Worker Configuration

Configure the hash worker using command-line flags or environment variables:

| Flag | Env Variable | Default | Description |
|---|---|---|---|
| -hash-worker-count | HASH_WORKER_COUNT | 4 | Concurrent workers |
| -hash-batch-size | HASH_BATCH_SIZE | 500 | Images per batch |
| -hash-poll-interval | HASH_POLL_INTERVAL | 1m | Time between batches |
| -hash-similarity-threshold | HASH_SIMILARITY_THRESHOLD | 10 | Max Hamming distance |
| -hash-worker-disabled | HASH_WORKER_DISABLED=1 | false | Disable entirely |
| -hash-cache-size | HASH_CACHE_SIZE | 100000 | Maximum entries in the LRU similarity cache |

Tuning Examples

High-performance setup (faster processing, stricter matching):

./mahresources \
-hash-worker-count=8 \
-hash-batch-size=1000 \
-hash-poll-interval=30s \
-hash-similarity-threshold=8 \
...

Resource-constrained setup (slower, gentler on resources):

./mahresources \
-hash-worker-count=1 \
-hash-batch-size=100 \
-hash-poll-interval=5m \
...

Disabled (no background processing):

./mahresources -hash-worker-disabled ...

Similarity Threshold Configuration

The -hash-similarity-threshold setting controls how similar images must be to be considered matches:

| Threshold | Effect |
|---|---|
| 5 | Strict - only near-identical images match |
| 10 (default) | Balanced - finds similar images with variations |
| 15 | Loose - includes more distant matches |
| 20+ | Very loose - may include false positives |

Choose based on your use case:

  • Deduplication - Use a low threshold (5-8) to find true duplicates
  • Related images - Use the default (10) for variations such as crops and resizes
  • Broad discovery - Use a higher threshold (12-15) to find related content
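
As a rough illustration of how the threshold changes the result set, the sketch below filters a handful of made-up distances at three settings. The candidate names and distances are invented for the example, and treating the threshold as an inclusive maximum is an assumption based on the "Max Hamming distance" description.

```python
# Hypothetical distances between one image and four candidates.
distances = {
    "resized copy": 2,
    "cropped version": 9,
    "same scene, new angle": 14,
    "unrelated photo": 27,
}


def matches(candidates, threshold):
    """Names of candidates whose distance is at or below the threshold."""
    return [name for name, d in candidates.items() if d <= threshold]


for threshold in (5, 10, 15):
    print(threshold, matches(distances, threshold))
```

With these values, a threshold of 5 keeps only the resized copy, 10 adds the cropped version, and 15 also pulls in the loosely related image.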

Viewing Similar Images

On any resource's detail page, if similar images exist, you will see a Similar Resources section showing:

  • Thumbnails of all similar images
  • Links to each similar resource
  • A form to merge similar resources into one

Finding Images with Similarities

Use the resource search with the filter:

/resources?ShowWithSimilar=true

This shows only resources that have at least one similar image detected.

Merging Duplicates

When you find duplicates, you can merge them:

  1. Navigate to the resource you want to keep (the "winner")
  2. Find the Similar Resources section
  3. Click Merge Others To This
  4. Confirm the action

Merging:

  • Keeps the winner resource with all its metadata
  • Transfers all tags, notes, and group associations from merged resources
  • Deletes the merged resources
  • Preserves the winner's version history

Warning: Merging is permanent -- the merged resources are deleted. Verify that the winner resource is the one you intend to keep.

Cache Warming

At startup, the hash worker loads existing hashes into the LRU cache in pages of up to 50,000 entries. This pre-populates the cache so similarity detection is effective immediately without waiting for a full batch cycle.

The cache size is controlled by -hash-cache-size (default: 100,000 entries). If your collection exceeds this limit, older entries are evicted and may not participate in similarity comparisons until they cycle back through batch processing.
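
A minimal sketch of the idea, assuming a plain least-recently-used policy: the `HashCache` class and the `load_page` callable are hypothetical stand-ins for the real cache and database query, and only the page size and default capacity come from the description above.

```python
from collections import OrderedDict


class HashCache:
    """Bounded LRU map from resource ID to hash."""

    def __init__(self, max_entries=100_000):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def put(self, resource_id, image_hash):
        self._entries[resource_id] = image_hash
        self._entries.move_to_end(resource_id)
        # Evict the least recently used entries once over capacity.
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)

    def get(self, resource_id):
        if resource_id not in self._entries:
            return None
        self._entries.move_to_end(resource_id)
        return self._entries[resource_id]

    def warm(self, load_page, page_size=50_000):
        """Fill the cache page by page; load_page(offset, limit) returns
        a list of (resource_id, hash) rows and a short page ends the loop."""
        offset = 0
        while True:
            page = load_page(offset, page_size)
            for resource_id, image_hash in page:
                self.put(resource_id, image_hash)
            if len(page) < page_size:
                break
            offset += len(page)

    def __len__(self):
        return len(self._entries)


rows = [(i, i * 7) for i in range(5)]
cache = HashCache(max_entries=3)
cache.warm(lambda offset, limit: rows[offset:offset + limit], page_size=2)
print(len(cache), cache.get(0), cache.get(4))  # 3 None 28
```

The usage example loads five rows into a cache of three, so the two oldest entries are evicted, which mirrors what happens when a collection is larger than -hash-cache-size.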

Failed Hash Handling

If hashing fails for a Resource (corrupt image, unsupported encoding), the worker stores an empty hash record. This prevents the Resource from being retried on every batch cycle.
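
The pattern can be sketched as follows. Everything here is illustrative: the dictionary standing in for the hash table, the `None` values representing an "empty" record, and the function names are assumptions, not the actual schema.

```python
def hash_or_mark_failed(resource_id, image_bytes, compute_hashes, store):
    """Store real hashes on success, or an empty record on failure."""
    try:
        ahash, dhash = compute_hashes(image_bytes)
        store[resource_id] = {"ahash": ahash, "dhash": dhash}
    except Exception:
        # The empty record marks the resource as "already attempted".
        store[resource_id] = {"ahash": None, "dhash": None}


def needs_hashing(resource_id, store):
    """Batch discovery only picks up resources with no record at all."""
    return resource_id not in store


def broken_decoder(image_bytes):
    raise ValueError("corrupt image")


store = {}
hash_or_mark_failed(42, b"not an image", broken_decoder, store)
print(store[42], needs_hashing(42, store))
```

Because the failed resource now has a record, the next batch discovery pass skips it instead of failing on the same file again.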

Memory Considerations

The hash worker maintains an in-memory LRU cache of image hashes for fast similarity lookups. Memory usage depends on your image count and the -hash-cache-size setting:

| Image Count | Estimated Cache Size |
|---|---|
| 10,000 | ~0.2 MB |
| 100,000 | ~2.4 MB |
| 1,000,000 | ~24 MB |
| 10,000,000 | ~240 MB |

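At roughly 24 bytes per entry (the ratio implied by the table above), a cache sized for 10 million images stays under 250 MB. The cache never grows beyond -hash-cache-size, so with the default of 100,000 entries it tops out at about 2.4 MB regardless of collection size. The cache is loaded at startup and updated incrementally.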
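
For a quick estimate at other sizes, the table's figures work out to about 24 bytes per cached entry. The sketch below uses that inferred constant and decimal megabytes; both are assumptions derived from the table, not measured values.

```python
BYTES_PER_ENTRY = 24  # inferred from the table: ~2.4 MB for 100,000 entries


def estimated_cache_mb(image_count, cache_size=100_000):
    """Approximate cache memory in MB, capped by the configured cache size."""
    entries = min(image_count, cache_size)
    return entries * BYTES_PER_ENTRY / 1_000_000


print(estimated_cache_mb(1_000_000, cache_size=1_000_000))  # 24.0
print(estimated_cache_mb(10_000_000))                       # 2.4 (default cap)
```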

On-Upload Processing

When you upload a new image, it is queued for immediate hash processing. This means:

  1. Upload completes and resource is created
  2. Resource ID is added to the hash queue
  3. Worker processes it (usually within seconds)
  4. Similar images appear on the resource page

If the queue is full (1000 items), new uploads fall back to batch processing on the next poll interval.
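
The fallback behaviour resembles a bounded queue with a non-blocking put. This sketch uses Python's standard `queue` module to show the pattern; `enqueue_for_hashing` is a hypothetical name, and only the capacity of 1000 comes from the text above.

```python
import queue

# Bounded queue of resource IDs waiting for immediate hashing.
upload_queue = queue.Queue(maxsize=1000)


def enqueue_for_hashing(resource_id, q=upload_queue):
    """Try to queue a resource for immediate hashing.

    Returns False when the queue is full; the resource then simply has no
    hash yet and is picked up by the next batch cycle.
    """
    try:
        q.put_nowait(resource_id)
        return True
    except queue.Full:
        return False


small = queue.Queue(maxsize=2)
print([enqueue_for_hashing(i, small) for i in (1, 2, 3)])  # [True, True, False]
```

The upload itself never blocks on hashing: a full queue only delays when similar images show up for that resource.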

Hash Migration

If you have images that were uploaded before hash calculation was available, the hash worker automatically processes them during its batch cycles. No manual intervention is required.

The worker also handles hash format changes transparently. The current storage format uses int64 for efficient Hamming distance calculation; legacy string-format hashes are still supported and are migrated automatically.
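
As an illustration of what such a conversion can look like, the sketch below turns a 16-digit hexadecimal string into a signed 64-bit integer. The hex encoding of the legacy format is purely an assumption for the example; the actual legacy encoding is not described here, and `hex_to_int64` is a hypothetical helper.

```python
def hex_to_int64(hex_hash):
    """Convert a 16-digit hex string to a signed 64-bit integer.

    Values with the top bit set wrap to negative numbers (two's complement),
    which is how a 64-bit pattern fits into a signed int64 column.
    """
    value = int(hex_hash, 16)
    if value >= 1 << 63:
        value -= 1 << 64
    return value


print(hex_to_int64("000000000000000f"))  # 15
print(hex_to_int64("ffffffffffffffff"))  # -1
```

The bit pattern is preserved by the conversion, so Hamming distances computed on the integers match those of the original hashes.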

Troubleshooting

Similar images not appearing

  1. Check that the hash worker is running (not disabled)
  2. Wait for the next poll interval
  3. Check logs for processing errors
  4. Verify the image format is supported

Too many false positives

Lower the similarity threshold:

./mahresources -hash-similarity-threshold=6 ...

Missing obvious duplicates

Raise the similarity threshold:

./mahresources -hash-similarity-threshold=15 ...

High memory usage

If the hash cache is too large:

  1. Reduce the cache size: -hash-cache-size=50000
  2. Disable the worker if not needed: -hash-worker-disabled
  3. Add more system memory