Using an attribute index to find matching attributes of two layers faster?

Question

Similar to Indexing attribute field of shapefile in QGIS, I am wondering whether something like an attribute index exists for PyQGIS. The goal would be to iterate over two vector layers and find matching attribute values of a specified field in each layer. So it would work like a spatial index, just using attributes instead. So far I could only find that I can create an index using createAttributeIndex(), as stated here and here, but absolutely no further information about its usage, how it works, or any examples.

Basically the idea is to speed up code written like this:

from qgis.core import QgsProject  # only needed outside the QGIS Python console

vectorlayer_a = QgsProject.instance().mapLayersByName("layer_a")[0]
vectorlayer_b = QgsProject.instance().mapLayersByName("layer_b")[0]
for feat_a in vectorlayer_a.getFeatures():
    value_a = feat_a.attribute(1)
    # layer_b is iterated completely for every single feature of layer_a
    for feat_b in vectorlayer_b.getFeatures():
        value_b = feat_b.attribute(1)
        if value_a == value_b:
            print('Hurray, finally found (another) one. Can I find all of them faster with an attribute index?')
            # Do some stuff like...
            geom_a = feat_a.geometry()
            geom_b = feat_b.geometry()

Also, could attribute(1) have any datatype, or would such a thing only work with numerical values, if it exists at all?

You should look into using dictionaries, I think that's what you are looking for... I do things similar to this in Arc, but not sure I'm understanding exactly what you are trying to do see below question: https://gis.stackexchange.com/questions/375664/cursor-select-by-location-append-value-to-list/375668#375668 — bwp8nt, Oct 28 '20 at 23:15
I agree with @bwp8nt 100%, run times of feature enumerators inside feature enumerators (cursor in cursor is arcpy terminology) increase exponentially with the number of rows.. the enumerators are created by reading from disc or network with an initializer, indexed or not, therein lies a potential bottleneck. Iterating both vectorlayer_a and vectorlayer_b into 2 dicts of lists and comparing lists is much faster (read https://stackoverflow.com/questions/8023306/get-key-by-value-in-dictionary for some more helpful code) and lists are thread safe if you really want to try that can of worms. — Michael Stimson, Oct 29 '20 at 05:05
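To make the cost argument in these comments concrete, here is a minimal sketch in plain Python (no QGIS required). With n values in one list and m in the other, the loop-in-loop version performs n * m comparisons (strictly speaking quadratic rather than exponential growth), while the dictionary version needs one pass over each list. The function names and sample values are made up for illustration.

```python
from collections import defaultdict

def count_nested(values_a, values_b):
    """Count matching pairs and comparisons with a loop inside a loop."""
    comparisons = 0
    pairs = 0
    for a in values_a:
        for b in values_b:
            comparisons += 1
            if a == b:
                pairs += 1
    return pairs, comparisons

def count_with_dict(values_a, values_b):
    """Count matching pairs with one pass per list and a dict lookup."""
    positions_b = defaultdict(list)
    for idx, b in enumerate(values_b):
        positions_b[b].append(idx)  # group positions of list b by value
    lookups = 0
    pairs = 0
    for a in values_a:
        lookups += 1
        pairs += len(positions_b.get(a, []))  # .get() adds no new keys
    return pairs, lookups

# Hypothetical sample values standing in for attribute values
values_a = ["x", "y", "z", "y"]
values_b = ["y", "q", "x", "y", "y"]
print(count_nested(values_a, values_b))
print(count_with_dict(values_a, values_b))
```

Both functions find the same number of pairs; only the amount of work differs, and the gap widens as the layers grow.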

MrXsquared · Accepted Answer · 2020-11-02T17:15:36.377

Still, I don't know whether there is an attribute index for PyQGIS and, if so, how I could use it. But the comments from bwp8nt and Michael Stimson pointed me in the right direction: using dictionaries to optimize my code without one. With the help of this great answer on SO, I finally managed to achieve my desired optimization without using an attribute index (explanation as comments):

from qgis.core import QgsProject  # only needed outside the QGIS Python console

vectorlayer_a = QgsProject.instance().mapLayersByName("layer_a")[0]
vectorlayer_b = QgsProject.instance().mapLayersByName("layer_b")[0]
# Create a dictionary for each layer containing feature id and desired attribute:
# - the feature id is needed to access the desired features later on
# - the attribute is needed to find matches later on
# Loop through both layers only once!
dict_a = {}
dict_b = {}
for feat_a in vectorlayer_a.getFeatures():
    dict_a[feat_a.id()] = feat_a.attribute(1) # feature id is used as key and attribute of column 1 as value (can have any datatype and need not be unique)
for feat_b in vectorlayer_b.getFeatures():
    dict_b[feat_b.id()] = feat_b.attribute(1) # feature id is used as key and attribute of column 1 as value (can have any datatype and need not be unique)
# Avoid unnecessary loops through layer_b by using a dictionary for desired matches
# Source: https://stackoverflow.com/a/64597197/8947209 (don't forget to upvote!)
dic2 = {}
# Re-sort: make the values of dict_b the (now unique) keys and collect the keys of dict_b as lists of values
for i in dict_b.keys():
    elem = dict_b[i]
    if dic2.get(elem, None):
        dic2[elem].append(i)
    else:
        dic2[elem] = [i]
matches = {}
# Find the dict_a keys whose value occurs among the re-sorted keys
for i in dict_a.keys():
    elem = dict_a[i]
    x = dic2.get(elem, None)
    if x:
        matches[i] = x
#print(dic2)
#print(matches)
# Access the desired features from the matching dictionary by using feature ids
for featureid_layer_a, v in matches.items(): # key of matching dict = key of dict_a = feature id of layer_a
    for featureid_layer_b in v: # each list entry = key of dict_b = feature id of layer_b
        print('Hurray, found (another) pair really fast: ' + 'matching-dict-key|dict_a-key|layer_a-featureid = ' + str(featureid_layer_a) + ' | matching-dict-value|dict_b-key|layer_b-featureid = ' + str(featureid_layer_b))
        geom_a = vectorlayer_a.getFeature(featureid_layer_a).geometry() # accessing stuff by using featureid
        geom_b = vectorlayer_b.getFeature(featureid_layer_b).geometry() # accessing stuff by using featureid
        #print('geom_a: ' + str(geom_a))
        #print('geom_b: ' + str(geom_b))
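
For reference, the same matching logic can be written more compactly. This is only a sketch of the idea above, not a different method: it is plain Python, uses stand-in (feature id, value) pairs instead of real layers, and the helper name match_by_attribute is made up for illustration. It assumes the attribute values are hashable, since they are used as dictionary keys.

```python
from collections import defaultdict

def match_by_attribute(pairs_a, pairs_b):
    """Return {id_a: [id_b, ...]} for all ids whose values are equal.

    pairs_a / pairs_b: iterables of (feature_id, value) tuples.
    Values must be hashable so they can serve as dictionary keys.
    """
    # Group the ids of b by their value: value -> [id_b, ...]
    ids_by_value_b = defaultdict(list)
    for fid_b, value in pairs_b:
        ids_by_value_b[value].append(fid_b)
    # One dictionary lookup per entry of a instead of a full loop over b
    matches = {}
    for fid_a, value in pairs_a:
        if value in ids_by_value_b:
            matches[fid_a] = list(ids_by_value_b[value])
    return matches

# Stand-in data; in QGIS the pairs could come from something like
# ((f.id(), f.attribute(1)) for f in vectorlayer_a.getFeatures())
pairs_a = [(1, "oak"), (2, "elm"), (3, 42), (4, "oak")]
pairs_b = [(10, "oak"), (11, 42), (12, "oak"), (13, "ash")]
matches = match_by_attribute(pairs_a, pairs_b)
print(matches)
```

The sample data mixes strings and integers on purpose: the lookup does not depend on the values being numeric, only on them being usable as dictionary keys.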
