22

I have a large database (16M rows) containing perceptual hashes of images.

I'd like to be able to search for rows by hamming distance in a reasonable timeframe.

Currently, as far as I understand the issue, I think the best option here would be a custom SP-GiST implementation of a BK-tree, but that seems like a lot of work, and I'm still fuzzy on the practical details of properly implementing a custom index. Calculating the hamming distance is tractable enough, though, and I do know C.

Basically, what is the appropriate approach here? I need to be able to query for matches within a certain edit distance of a hash. As I understand it, Levenshtein distance between strings of equal length is functionally hamming distance, so there is at least some existing support for what I want, though no clear way to create an index from it (remember, the value I'm querying for changes; I cannot pre-compute the distance from a fixed value, since that would only be useful for that one value).

The hashes are currently stored as a 64-char string containing the binary ASCII encoding of the hash (e.g. "10010101..."), but I can convert them to int64 easily enough. The real issue is I need to be able to query relatively fast.
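For reference, once the hashes are stored as int64, hamming distance is just the popcount of the XOR of the two values. A quick sketch (the hash values below are made up for illustration):

```python
def hamming_distance(a: int, b: int) -> int:
    # XOR leaves a 1 bit at every position where the hashes differ;
    # counting those bits gives the hamming distance.
    return bin(a ^ b).count("1")

def from_binary_string(s: str) -> int:
    # Convert the 64-char binary-ASCII representation ("10010101...") to an int.
    return int(s, 2)

h1 = from_binary_string("1001" * 16)  # hypothetical hash values
h2 = from_binary_string("1000" * 16)
print(hamming_distance(h1, h2))       # the two differ in 16 bit positions
```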

It seems like it could be possible to achieve something along the lines of what I want with pg_trgm, but I'm a bit unclear on how the trigram-matching mechanism works (in particular, what does the similarity metric it returns actually represent? It looks kind of like edit distance).
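On the pg_trgm sub-question: per the PostgreSQL docs, the similarity it returns is not edit distance; it is set-based, the number of trigrams the two strings share divided by the total number of distinct trigrams in both (after lowercasing and blank-padding). A rough Python re-implementation of the metric, as I understand it:

```python
def trigrams(s: str) -> set:
    # pg_trgm lowercases the string and pads it with two leading blanks
    # and one trailing blank before extracting 3-character windows.
    padded = "  " + s.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a: str, b: str) -> float:
    # Jaccard-style ratio: shared trigrams over all distinct trigrams.
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(similarity("word", "words"))  # 4 shared trigrams out of 7 distinct
```

This is why it behaves poorly for the hamming-distance use case: a single bit flip in the middle of a 64-char binary string changes three trigrams, but the metric has no notion of positions or substitutions.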

Insert performance is not critical (it's very computationally expensive to calculate the hashes for each row), so I primarily care about searching.

Fake Name
  • 1,336
  • 5
  • 16
  • 28
  • The smlar extension might have what you need: http://www.pgcon.org/2012/schedule/attachments/252_smlar-2012.pdf or pg_similarity: https://www.pgcon.org/2009/schedule/attachments/108_pg_similarity.pdf – Neil McGuigan Jul 23 '14 at 17:44
  • @NeilMcGuigan - Interesting! The first presentation there is actually from the people who maintain the SP-GiST and GIST systems in postgres. – Fake Name Jul 24 '14 at 07:48
  • The first link is for something fundamentally different, though: they're looking for set intersections, whereas I'm looking for hamming distance. I could finagle the phashes into a set, but it would be extremely messy, and require a lot of support code everywhere else. – Fake Name Jul 24 '14 at 08:11
  • FWIW, At this point, I've more or less concluded I need to implement my own indexing system. I'm looking into custom SP-GiST indices at the moment, but I have no idea what I'm doing. – Fake Name Sep 04 '14 at 23:04
  • 1
    @FakeName: When you say the hamming distance, I am assuming you mean the hamming distance of the hash strings, not the images? In other words, you are looking to ask: Find all the hash values which are X bit substitutions away from the input parameter – Thomas Kejser Oct 27 '14 at 22:43
  • @ThomasKejser - Correct. The image similarity is a function of the edit-distance between the hashes of the two images. The database contains just the perceptual hashes of the images, not the images themselves. – Fake Name Oct 28 '14 at 02:02
  • @FakeName I hope I understood your question correctly. You have an image => generate the hash => get all images that have a similar hash? If you want to have all with a distance of only one: This might be fast: http://norvig.com/spell-correct.html – Wikunia Mar 22 '15 at 10:02
  • @Wikunia That is correct, but I need to search within a distance of 2-8 pretty regularly, so just exploring the local permutation space is non-viable. – Fake Name Mar 22 '15 at 17:44
  • Do you only have the chars 1 and 0 in your hash? Or a lot of chars? And how fast should your algo be? If there are less than 16 million combinations it will be faster :) – Wikunia Mar 22 '15 at 17:48
  • @Wikunia - I'm currently using int64 for data-stores (I converted the storage data-type). I need to be able to search over a dataset of >10M hash values for an edit distance of 4 in less than one second, ideally. – Fake Name Mar 22 '15 at 17:49
  • Also, what do you mean by "16 million combinations"? – Fake Name Mar 22 '15 at 17:54
  • Okay, int64 with an edit distance of 4 means that there are around 600,000 possibilities for which bits to change from 0 to 1 or the other way around. So you can build these new int64 numbers and use the index of your database. It might be possible in less than a second. It should be at least faster than calculating the distance for all 16M hashes. – Wikunia Mar 22 '15 at 18:01
  • @Wikunia - That is true, but doing it intelligently with something like a BK-tree is faster than both of those options (and only results in ~50K distance calculations), so I implemented that. – Fake Name Mar 22 '15 at 18:04
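For context, the enumeration approach Wikunia describes can be sketched like this: generate every hash within the search radius and probe each one against the existing primary-key index. The candidate count is the sum of binomial coefficients C(64, k) for k up to the radius, which is why it balloons for larger distances:

```python
from itertools import combinations

def within_distance(value: int, max_dist: int, bits: int = 64):
    """Yield every integer whose hamming distance from `value` is <= max_dist.

    The number of candidates is sum(C(bits, k) for k <= max_dist):
    for bits=64, max_dist=4 that is 679,121 values, each of which
    would need an index probe.
    """
    for k in range(max_dist + 1):
        for positions in combinations(range(bits), k):
            v = value
            for p in positions:
                v ^= 1 << p  # flip bit p
            yield v
```

A BK-tree query over the same radius touches far fewer values (~50K distance calculations per the comment above), which is why that approach won out.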

2 Answers

15

MOAR ANSWERS!

Ok, I've finally taken the time to write a custom PostgreSQL indexing extension. I used the SP-GiST interface.

This was fairly challenging, mostly because Postgres is big.

Anyways, as usual, it's up on github here.

Performance-wise, it's currently ~2-3 times slower than the pure in-memory implementation in my other answer to this question, but it's so much more convenient to use that I'll happily eat the performance hit (realistically, it's ~50-150 ms/query, which is still pretty small).

Fake Name
13

Well, I spent a while looking at writing a custom postgres C extension, and wound up just writing a Cython database wrapper that maintains a BK-tree structure in memory.

Basically, it maintains an in-memory copy of the phash values from the database, and all updates to the database are replayed into the BK-tree.

It's all up on github here. It also has a LOT of unit-tests.

Querying across a dataset of 10 million hash values for items with a distance of 4 results in touching ~0.25%-0.5% of the values in the tree, and takes ~100 ms.

Fake Name
  • BK-Tree in memory with 16 million rows in memory? I was looking at something similar; however, with 1000 images and 2000 descriptors on each image, my in-memory size was huge. – Stewart Aug 31 '19 at 01:17
  • @Stewart - A lot of this depends on the size of your hash. In my case, the hash value output is a single 64-bit bitfield that I store as an int64. You seem to have a much larger phash data type. I'm also not sure how searches would work on a different datatype like that. Are they still a metric space? How do you calculate distance? – Fake Name Aug 31 '19 at 23:01
  • I'm using 32-bit descriptors with the FLANN matcher provided with opencv. To calculate distance I use hamming with a threshold based on Lowe's ratio. At this point I'm not sure if it's best to try and stick with in-memory FLANN, which provides a KD-tree structure, or to switch to a solution more similar to yours. Why did you end up rolling your own and not going for something like libflann? – Stewart Sep 01 '19 at 09:44
  • @Stewart - I didn't roll my own. I'm using super boring DFT-based hashing. – Fake Name Sep 02 '19 at 05:38