4

For practice, I'm writing what you could call a notebook app that stores users' notes in AES-encrypted form. For encryption, I use a password-based intermediate-key technique as described here. The exact encryption method doesn't really matter here; the important point is that only the client has the ability to decrypt the notes.

I want to implement a basic search algorithm that would let users search over their notes. Obviously, you can't just encrypt the search phrase and look it up in the database, so here's an idea:

When a user creates or updates a note, the client-side algorithm builds a list of its trigrams, filters out duplicates, hashes each one of them, and then passes them to the server, where they are stored alongside the encrypted note text in the database. When hashing trigrams, the user's personal salt has to be used.

When searching, the same process is applied to the search phrase, and the database tries to match notes by the given hashed trigrams.
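
Roughly, the client-side part would look something like this sketch (Python just for illustration; the function names and the choice of SHA-256 are made up, not a fixed design):

```python
import hashlib

def trigrams(text: str) -> set[str]:
    """Extract the set of unique trigrams from a normalized note."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}

def hashed_index(note: str, user_salt: bytes) -> set[str]:
    """Hash each trigram with the user's personal salt."""
    return {
        hashlib.sha256(user_salt + t.encode("utf-8")).hexdigest()
        for t in trigrams(note)
    }

# The client sends hashed_index(note, salt) to the server alongside the
# encrypted note; a search phrase is processed the same way on the client.
```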

So I have a couple of questions about this idea:

  • Would it be secure enough?
  • Would it decrease security if the trigram hashes were truncated to save some space? To handle collisions, the decrypted text could be checked on the client to verify that it actually contains matches.
  • What would be more efficient?
    • To store the trigrams of a note as a space-separated string in a separate column, and search them with a LIKE or REGEXP statement
    • Or to store them in a separate table with one row per trigram and a foreign key, and search over them with the = operator

Edit after some comments:

To prevent brute force, the encryption key (or even a hashed version of it) could be used as the salt for hashing trigrams. I suppose this can work because the key can only be known by the client. Is it a good way to deal with it, or are there drawbacks to this approach that I fail to see?

As I've been told that using the same string as both a salt and a key could be a bad idea, there's an alternative way: we generate this "trigram salt" when the user signs up, encrypt it with the encryption key, store it in the database, and then use its decrypted form as mentioned above.
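
Sketched out, that alternative would look roughly like this (using the Python cryptography package just for illustration; the names and sizes are made up):

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def create_trigram_salt(encryption_key: bytes) -> tuple[bytes, bytes]:
    """Generate a random trigram salt and encrypt it under the user's key.
    The returned (nonce, ciphertext) pair is what the server would store;
    only the client can decrypt the salt and use it for trigram hashing."""
    salt = os.urandom(32)   # the "trigram salt", generated at sign-up
    nonce = os.urandom(12)  # fresh AES-GCM nonce
    ciphertext = AESGCM(encryption_key).encrypt(nonce, salt, None)
    return nonce, ciphertext
```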

v_slav.b
  • I'm tempted to move this to Cryptography.SE - let's see how it does here first, though. – Rory Alsop Apr 28 '23 at 12:02
  • This question can be better answered at Crypto SE. I suggest moving the question there. – mentallurg Apr 28 '23 at 16:35
  • "What would be more efficient?" - an array column or the separate table would be cleaner, but whatever scheme you use, check that your dbms supports in index on it. – Bergi Apr 28 '23 at 21:50

4 Answers

5

Hashing 3-character strings is useless, because the hash can be trivially broken with brute force (i.e. by trying out all possible combinations). So your approach would effectively reveal the trigrams of the plaintext, giving an attacker the chance to systematically approximate the content.
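
To illustrate how trivial this brute force is, a quick sketch (assuming plain SHA-256; with a known per-user salt, the same table can simply be rebuilt for that user):

```python
import hashlib
from itertools import product
from string import ascii_lowercase

# Reverse-lookup table for all 26**3 = 17,576 lowercase trigrams.
# Building it takes a fraction of a second on any modern machine.
rainbow = {
    hashlib.sha256("".join(t).encode()).hexdigest(): "".join(t)
    for t in product(ascii_lowercase, repeat=3)
}

def crack(trigram_hash):
    """Recover the plaintext trigram from its hash, if it is lowercase ASCII."""
    return rainbow.get(trigram_hash)
```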

I see two options:

  • Use a hash-based message authentication code (HMAC) for the trigrams with the user's key as the HMAC key. This way, only the client can calculate HMACs from plaintext trigrams, not an attacker. When a new note is created, the client tokenizes the note, calculates the trigram HMACs and sends them to the server, so that they can be stored in the database. When a note should be searched, the client calculates the HMACs from the search string and passes them to the server for a full-text search in the database (see the sketch after this list). Note that while the HMACs don't reveal the trigrams themselves, this scheme still leaks the frequencies of the trigrams. An attacker could try to correlate the HMACs with common trigrams of the language used in the notes (e.g. English) and find out what the notes likely contain.
  • Tokenize the text and then encrypt the entire collection of tokens with the user's key. Of course, that means the full-text search has to be performed client-side after the user has provided the password.
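
A minimal sketch of the first option (Python for illustration, assuming HMAC-SHA-256; the function names are made up):

```python
import hashlib
import hmac

def trigrams(text: str) -> set[str]:
    """Tokenize a note into its set of unique trigrams."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}

def trigram_hmacs(text: str, user_key: bytes) -> set[str]:
    """Compute an HMAC-SHA-256 tag per trigram; only the holder of
    user_key can produce or verify these tags."""
    return {
        hmac.new(user_key, t.encode("utf-8"), hashlib.sha256).hexdigest()
        for t in trigrams(text)
    }

# Indexing: the client sends trigram_hmacs(note, key) to the server.
# Searching: the client sends trigram_hmacs(phrase, key); the server
# matches the tags against the stored index without seeing any trigram.
```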
Ja1024
  • What if I use the encryption key as the salt for hashing trigrams? If the salt is only known by the client, the attacker won't know the initial info, so there'd be no opportunity for a brute force, right? – v_slav.b Apr 28 '23 at 18:17
  • This would be a home-made HMAC. It's entirely unclear how secure your specific scheme for combining a key with a hash algorithm would be, and you could run into all kinds of unexpected cryptographic issues. The benefit of a standard HMAC algorithm (e.g. the one from RFC 2104) is that it has been designed specifically as a combination of a hash algorithm and a key, and it has been extensively checked both conceptually and practically. – Ja1024 Apr 28 '23 at 19:41
  • HMAC trigrams still leave an uncomfortably large attack surface, especially if the attacker can influence the user to create certain documents – user253751 Apr 28 '23 at 21:42
  • I'm not sure an attacker can make the user calculate HMACs in the scenario described by the OP, but what the trigram HMACs do potentially allow is frequency analysis. – Ja1024 Apr 28 '23 at 22:18
  • I think it is the way to go, thank you very much. I just didn't realize at first that the idea with the salt I proposed is almost like using an HMAC, as I simply didn't know how it works. I'm really sorry to bother you again, but could you please check the last paragraph of the post (edited) and tell me whether the other idea I've come up with is good or bad? – v_slav.b Apr 30 '23 at 15:30
  • Also, I have an idea on how to deal with frequency leaks. When getting the hash of a trigram, it might get truncated at a random offset, so there would be no one-to-one correspondence between trigrams and hashes. The offset distribution might be non-uniform, to leave no chance to establish a one-to-many connection either. It will slow down the search process though, as it will have to search over n times more trigrams, where n is the number of possible offsets. – v_slav.b Apr 30 '23 at 15:52
  • @v_slav.b: But you are going to represent the same trigram always with the same value (hash, whatever). Then truncation does not matter at all. The attacker will just compare the frequency of each hash, no matter how long the different hashes are. This means frequency analysis will still be possible, which in turn means the attacker does not need to know the password at all and can decrypt the trigrams based on frequency alone. – mentallurg Apr 30 '23 at 16:06
  • I might not have explained it clearly enough. The same trigram will not always map to the same value. Let me clarify. First, we get the hash H of a trigram T. Second, we truncate H starting at a random position P, and thus get a truncated hash R of constant length L. So every T corresponds to a set of possible R values. When searching, every possible R for every given T must be taken and searched over the database. P has to be a multiple of L to avoid overlapping. – v_slav.b Apr 30 '23 at 23:17
2

As others have mentioned, brute force is much too easy against this scheme.

An option that was not previously mentioned is symmetric searchable encryption. These schemes are also not perfect but likely better than hashed trigrams.

foreverska
2

Attacks on the described scheme

Besides the problem with the frequency of trigrams mentioned by Ja1024, the described scheme has two further problems: it is vulnerable to a known-plaintext attack and to a chosen-plaintext attack.

To make search possible, you have to encode the same trigram in the same way every time. Otherwise, the same phrase contained in different documents will not be found.

Known-plaintext attack: Without any further information, each encoded trigram can correspond to one of 17,576 (= 26·26·26, in case only the 26 lowercase English letters are used) plain trigrams. But if the attacker knows the plain text of one of the documents, then the attacker can reduce the number of possible pre-images for each trigram to perhaps 50 or 100 instead of 17,576.

Chosen-plaintext attack: The attacker creates a document and several copies with small modifications, e.g. only a single letter changed. Then the attacker causes the user to upload these documents to your server and compares the sets of trigrams. The sets of encoded trigrams on the server will be almost identical and will have only a few differences. The attacker will then know which encoded trigrams correspond to the modified part of the document, and thus the plain text for these encoded trigrams. When repeated, more and more trigrams can be restored.

If the chosen-plaintext attack can be performed sufficiently many times, then the plain text can be restored for every encoded trigram.

Even if the chosen-plaintext attack can be performed only a limited number of times, it can still significantly simplify frequency analysis. Frequency analysis deals with probabilities: often there are multiple candidates with very close frequencies, and the attacker has to try multiple candidates for the same encoded value, which leads to an exponentially growing number of combinations to test. The two attacks described above can cut this search space down substantially.
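
To illustrate the chosen-plaintext attack, a sketch with made-up plaintexts. It diffs the plain trigram sets for brevity; since the encoding is deterministic, the encoded sets on the server differ in exactly the corresponding positions:

```python
def trigrams(text: str) -> set[str]:
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}

doc_a = "please transfer the funds to account 1234"
doc_b = "please transfer the funds to account 1235"  # one character changed

# Diffing the two uploaded indexes isolates the trigrams overlapping the
# modified character; everything else cancels out.
changed = trigrams(doc_a) ^ trigrams(doc_b)  # symmetric difference
print(changed)  # {'234', '235'}: only trigrams covering the change differ
```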

It is not full-text search

Suppose you found a solution that prevents restoring the plain data. There is one more point to consider: the described scheme is not a full-text search and therefore cannot be as powerful. For instance, full-text search can take different forms of the same word into account and can find synonyms: if the user searches for a "car", "vehicle" and "truck" may also be found; if the user searches for "fruit", then "fruits", typos like "friut" and "friuts", and related words like "apples" will also be found. None of this will work in your case.

mentallurg
  • Thanks for your response! What if I use the random-offset trick described in my comment under Ja1024's answer? It was proposed as a way to make frequency analysis a little more difficult, and it seems to be a possible way to deal with the attacks you've described, as the attacker will get completely different sets of values every time they submit the same or a similar document. To make that work, any random offset must be a multiple of the length of the truncated version of the hash so there are no intersections. – v_slav.b Apr 30 '23 at 23:01
  • And yeah, the array of hashes has to be shuffled every time, of course. – v_slav.b May 01 '23 at 00:04
  • @v_slav.b: I don't understand how a random offset can help. The problem is that the same plaintext trigram will always be encoded into the same value. For instance, no matter how complex your encryption or hashing is, according to your algorithm "car" will be encoded as "81f7e62165d4..." and "can" as "95a7d9fce446...". The attacker does not know what "81f7e62165d4..." and "95a7d9fce446..." are. But it is known that the frequency of "can" in English is approximately 15 times higher than the frequency of "car"... – mentallurg May 01 '23 at 02:25
  • @v_slav.b: ... But when the attacker sees that "95a7d9fce446..." occurs 90 times and "81f7e62165d4..." occurs 6 times, then the attacker will expect with very high probability that "95a7d9fce446..." means "can" and "81f7e62165d4..." means "car", and with very low probability that it is vice versa. – mentallurg May 01 '23 at 02:27
  • @v_slav.b: If you change the algorithm so that the same trigram is encoded differently every time, then you will not be able to search. For instance, one time you encode "car" as "a1e6f5ecfce9...", the next time as "41fd88c1487e...", then as "254a7ce687f8...". Then you will not be able to search for "car", because it is represented by different values. If you always encode each trigram with the same value, then the scheme will be vulnerable to the attacks mentioned above: frequency analysis, known-plaintext attack, chosen-plaintext attack. – mentallurg May 01 '23 at 02:32
  • It will not encode trigrams completely differently every time. It will have a restricted set of possible outputs for each trigram, and when searching, it will search for every possible output, the number of which might be 4, 8, 16 or more, depending on the chosen length of the truncated value. My last comment under Ja1024's answer explains the idea quite well. – v_slav.b May 01 '23 at 04:05
  • @v_slav.b: Now I understand what you mean. But it is not so easy. For instance, if the frequencies of two trigrams stand in a ratio like 17 to 19, then to reflect it precisely you would need an inversely proportional number of elements in each set, i.e. 19 variants for the first trigram and 17 for the second. If you want to equalize the frequencies of 3 trigrams with frequencies like 13 to 17 to 19, you would need 17·19 = 323 variants for the first trigram, 13·19 = 247 for the second and 13·17 = 221 for the third... – mentallurg May 01 '23 at 05:39
  • @v_slav.b: ... The more trigrams you consider, the bigger the number of variants for each of them has to be. Thus you can equalize the frequencies, but you will have a huge number of variants for each trigram. This can make the search very slow. If you reduce the sets of variants, it will improve performance, but the frequencies will no longer be equalized and can again be analyzed, though it will be a bit harder, because the differences in frequencies will be smaller and the attacker will need more documents to analyze. – mentallurg May 01 '23 at 05:46
  • I might be wrong, but I thought that if there are, say, 16 variants per trigram, the difficulty of such an analysis would increase considerably, because the frequencies of different hashes of the same trigram would never be exactly the same, even on big sets of data. At the same time, as I've checked, there is a huge number of groups of English trigrams with very similar frequencies, so it would be quite a task to establish one-to-many relationships between trigrams and sets of hashes. And what if the distribution of outputs is made non-uniform? That becomes even harder, I suppose. – v_slav.b May 01 '23 at 06:11
  • @v_slav.b: You don't need to guess. Just calculate it. E.g. estimate the total amount of text that will be stored for a single user. Then estimate how many trigrams you will have, especially for very frequent ones like "car". Then estimate how often each encoded value will be used. Then you will see whether frequency analysis would be easy or not. – mentallurg May 01 '23 at 06:37
  • @v_slav.b: And don't focus on frequencies only. Other attacks remain possible. For instance, the chosen-plaintext attack: say the attacker makes the user upload a document that consists of the word "can" repeated 1000 times. Then if the encrypted result contains 16 values, each repeated 62-63 times, it would be clear that all these 16 values represent the trigram "can". I'm simplifying to show the idea... – mentallurg May 01 '23 at 06:49
1

If you really need server-side ngram search on encrypted data, then yes, this hashed-trigram approach is among the most practical options. However, it is easy to leak information this way.

A critical part of the security of this scheme is that the ngram hashes are "salted" per user, i.e. calculated using a keyed hash or HMAC function. Thus, two users with the same document but different keys will get different hashed ngram sets. Truncating the hashes does not reduce security in any way; it might even help by making hashes less distinguishable. However, it will reduce search performance a bit by making false-positive hits more likely.
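
For illustration, a small sketch of such per-user keyed tags with truncation (assuming HMAC-SHA-256; the 4-byte tag length is an arbitrary example):

```python
import hashlib
import hmac

def truncated_tag(ngram: str, user_key: bytes, length: int = 4) -> bytes:
    """Keyed ngram tag truncated to `length` bytes. Shorter tags save space
    and introduce deliberate collisions; the client filters the resulting
    false positives after decrypting the candidate notes."""
    tag = hmac.new(user_key, ngram.encode("utf-8"), hashlib.sha256).digest()
    return tag[:length]
```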

There is, however, the problem that the hashed ngrams disclose the frequency distribution of ngrams in the document. Some ngrams are very frequent in the English language, some less so. By comparing the hashed-ngram distribution in your database index with the expected distribution, it might be possible to probabilistically reconstruct some messages. This becomes even more important when the attacker can provide a document that you store with this encryption method: they could then correlate the known ngrams with the entries in your database index, and use that as their Rosetta stone for decoding parts of the document contents. Mentallurg's answer discusses these problems in more detail.

Such attacks can be made more difficult by adding a large number of random hashes to the index. This will reduce search performance (depending on how much you truncate the hashes), but it can increase privacy in a manner similar to k-anonymity models.

To reduce information leakage, using fixed-size Bloom filters might be preferable. A Bloom filter is a bit vector that represents a set of values. Adding a value to the set involves hashing the data with multiple different hash functions (or a keyed hash function with multiple keys), using the hashes as indices into the bit vector, and setting those bits. To query whether an entry exists in the set, you check whether the bits at these indices are set. The size of the bit vector, the number of hash functions, and the number of entries affect the false-positive rate. You can additionally set random bits, which will increase the false-positive rate. If all documents are indexed with a bit vector of the same size and have the same percentage of bits set, then it becomes more difficult for an attacker to extract useful information from them. This can be made even harder if the hash functions don't map entries to a single set of indices, but to multiple potential sets, and select one of them at random (at the cost of making querying more complicated, because you'd have to check all potential positions).
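
A minimal sketch of such a keyed Bloom filter (the sizes, the key-separation scheme, and the class name are all illustrative, not a vetted design):

```python
import hashlib
import hmac

class TrigramBloomFilter:
    """Fixed-size Bloom filter over keyed ngram hashes."""

    def __init__(self, user_key: bytes, size_bits: int = 4096, num_hashes: int = 4):
        self.key = user_key
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _indices(self, ngram: str):
        # Derive k bit positions from a keyed hash, one per "hash function",
        # by prefixing the message with the function index.
        for i in range(self.k):
            tag = hmac.new(self.key, bytes([i]) + ngram.encode("utf-8"),
                           hashlib.sha256).digest()
            yield int.from_bytes(tag[:4], "big") % self.size

    def add(self, ngram: str) -> None:
        for idx in self._indices(ngram):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def might_contain(self, ngram: str) -> bool:
        # False positives are possible; false negatives are not.
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indices(ngram))
```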

Information leakage can also occur during querying. In your context, a query consists of repeated questions of the form "in which documents does the ngram abc occur?". Under the keywords "private information retrieval", "oblivious transfer", or "private set intersection" you'll find lots of existing solutions, some of them cryptographically safe. A non-cryptographic mitigation would involve sending not just your actual queries, but also plausible alternative queries. When using truncated hashed identifiers, you can generate such decoy queries as random numbers. But since this leads to additional data transfer and to false positives, it will likely reduce search performance. It is also possible that this noise is not sufficient to mask the data distribution of the true queries – ngram hashes or Bloom filter indices that occur in multiple queries are more likely to be real.
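
A sketch of that decoy-query idea under the truncated-tag assumption above (the function name and the decoy count are made up):

```python
import os
import random

def padded_query(real_tags: list, decoys: int = 8, tag_len: int = 4) -> list:
    """Mix the real truncated tags with random decoy tags of the same
    length, then shuffle so the server cannot tell which are genuine."""
    query = real_tags + [os.urandom(tag_len) for _ in range(decoys)]
    random.SystemRandom().shuffle(query)
    return query
```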

amon
  • Thank you for your detailed response! As I've been thinking about it, I came up with another rather straightforward idea: what if I just use one salt per note and then use MySQL's SHA2 function in the SELECT query? Yes, it would have to calculate hashes for every trigram over and over again, as the salt differs from note to note, and I thought it would be very slow, but as I've just tested it on about 50,000 rows, it keeps showing 0.00s spent on the calculation. To prevent brute force, the trigrams need to be sent to the server in hashed form too, with a salt per user (HMAC). What do you think? – v_slav.b May 01 '23 at 12:12
  • It won't protect against what you've described in the last paragraph, though, but I will look into what you have advised. – v_slav.b May 01 '23 at 12:13
  • @v_slav.b You cannot use database features to calculate hashes here, since that would require access to the sensitive plaintext data. In your threat model, you want the DB to only handle encrypted data. You might have to think more carefully about which actor gets to know what. Having a separate salt/key per document also defeats the point of having a search index. At some point, it gets easier to treat the server as dumb encrypted block storage (literally just put/get/delete for arbitrary blobs), and do everything else client-side (creating and updating indices, searching the indices). – amon May 01 '23 at 12:35
  • But the server won't be receiving plaintext. As I said, the trigrams will arrive there in hashed form, via HMAC using the user's encryption key, which is only available to the client. So it's basically double hashing: first with the encryption key, second with the note's unique salt. And I don't understand why using a column index would be impossible, as the values stored in the database are compared as they are with the double-hashed trigrams from the request. – v_slav.b May 01 '23 at 13:26
  • @v_slav.b If the ngrams are already transmitted in hashed form, what value would it have to hash them again in the DB? In this kind of end-to-end-encrypted architectures, the database is assumed to be an attacker (if not now, then maybe compromised by an attacker in the future). Or do you only want to protect data at rest? Then this could make sense, but it would be a radically different threat model concerned e.g. about someone obtaining decommissioned hard drives. – amon May 01 '23 at 13:30
  • The whole point of the frequency analysis and the other techniques described in all the responses is that an attacker, in one way or another, can draw relationships between hashes and the initial trigrams using one or several documents, and then use those relationships to decode all the other documents of the user. But if a different salt is used for every new document (and, I suppose, for every new version of the same document), it would leave zero room for such attacks, as the same trigram will give different hashes for different documents. – v_slav.b May 01 '23 at 13:44
  • @v_slav.b Yes, and if we want to prevent the database (a potential attacker) from performing this analysis, then the keyed hash must be applied by the client. You can't give the database (a potential attacker) the unsalted data and hope that the database is able to protect it from itself (a potential attacker). – amon May 01 '23 at 14:09
  • Yes, just as I said, we hash the data on the client with the user's key and then hash it with the salt S, which is unique for this note. We send the hashed trigrams to the DB, and it stores them as well as S. When searching, we give the DB the trigrams of the search phrase hashed with the user's key, and it searches like this: SELECT * FROM notes INNER JOIN trigrams ON trigrams.noteId = notes.noteId WHERE trigrams.value = SHA2(CONCAT([given_value], notes.salt), 256). There should be a GROUP BY as well as WHERE ... IN instead of =, though; I just kept it short. What's wrong here? – v_slav.b May 01 '23 at 14:23
  • This query should certainly be optimized, because in the version I've just shown it ends up calculating hashes with the same salt over and over again when passing through trigrams of the same note. – v_slav.b May 01 '23 at 14:43
  • @v_slav.b Please think more carefully about your threat model: what are you trying to keep confidential from whom by using encryption or hashing? It seems you are using E2E encryption to keep information about note contents secret from the database server (which could be an adversary), yet in this comment you propose disclosing unsalted ngram hashes and salts to the database (which could be an adversary), which would let an adversary reconstruct the ngram distribution for documents and queries. This is considered unacceptable in typical E2EE threat models. – amon May 02 '23 at 20:05
  • Can you please clarify what you mean by the database being an adversary? Does it mean an attacker having the opportunity to intercept the transmitted query? A "man in the middle" kind of attack? – v_slav.b May 02 '23 at 20:43