7

I have a shapefile attribute table in QGIS that I would like to calculate the cumulative sum. The thing is that it has over 1.5 million features.

Type of device: HDD.

The attribute table looks like this:

enter image description here

I have tried to calculate in a new field the cumulated sum for "Equitativo" like this:

 with_variable(
    'ID_FINAL',
    ID_FINAL,
    sum(Equitativo, filter:=ID_FINAL<=@ID_FINAL)
    )

The problem is that is taking way too long (it's running for 17 hours now). How could I make this faster way?

Taras
  • 32,823
  • 4
  • 66
  • 137
  • 1
    Full table scans take time. Please [Edit] the Question to specify the storage format of the data, the type of device on which it is stored, and the actual runtime in seconds. – Vince Jan 16 '22 at 12:20
  • I'm not sure what "type of device" means. – Matias Franke Jan 16 '22 at 12:26
  • HDD, SSD, USB thumb drive,... – Vince Jan 16 '22 at 12:36
  • 1
    If you need to perform many calculations like this you can use geopandas. The calculation is performed in less than a second. But it takes about 30 s to read the shapefile and the same to write – BERA Jan 16 '22 at 14:30
  • 1
    @BERA (geo)pandas is really so fast. but it took 76 seconds to read, 0.02 second to sum (insanely fast), 127 seconds to write to file in my computer. – Kadir Şahbaz Jan 16 '22 at 14:42
  • Yes, only worth it if there are multiple calculations to be done – BERA Jan 16 '22 at 15:11

2 Answers2

15

You can use the script below in QGIS Python Editor. It takes about 45 seconds (including saving to the source file) for 1.5 million features in SSD or HDD. Select the layer first.

import time
from qgis.gui import QgsRubbersBand
from qgis.gui import iface
from qgis.PyQt.QtCore import Qvariant

start time = time.time()

layer = iface.activeLayer()

Add CUM_SUM field if not exist

field_index = layer.fields().indexFromName("CUM_SUM") if field_index == -1: field = QgsField('CUM_SUM', QVariant.Double, len=20, prec=3) layer.dataProvider().addAttributes([field]) layer.updateFields()

Get field indices

id_index = layer.fields().indexFromName("ID_FINAL") eq_index = layer.fields().indexFromName("Equitativo") cum_sum_index = layer.fields().indexFromName("CUM_SUM")

cum_sum = 0 attribute_map = {} request = QgsFeatureRequest().addOrderBy('ID_FINAL').setFlags(QgsFeatureRequest.NoGeometry).setSubsetOfAttributes(['ID_FINAL', 'Equitativo'], layer.fields()) for feature in layer.getFeatures(request): cum_sum += feature["Equitativo"] attribute_map.update({feature.id(): {cum_sum_index: cum_sum}})

layer.dataProvider().changeAttributeValues(attribute_map)

print("%s seconds" % (time.time() - start_time))

enter image description here

Taras
  • 32,823
  • 4
  • 66
  • 137
Kadir Şahbaz
  • 76,800
  • 56
  • 247
  • 389
  • Can you explain why this Python skript is so much faster? – Babel Jan 16 '22 at 15:18
  • 2
    @Babel It is hard to me to explain such a thing in English. Nyall Dawson explains "How to speed up PyQGIS scripts". It can give you an idea why an expression is sometimes so slow. For example; using get_feature_by_id is one of the reasons. Behind the scene, QGIS needs to do a bunch of things to give you the matching feature. – Kadir Şahbaz Jan 16 '22 at 16:34
  • @Babel. Nyall says: Well, remember that every time you call getFeatures() QGIS needs to do a whole bunch of things before it can start giving you the matching features. In this case, the code is calling getFeatures() once for every feature* ID in the list. So if the list had 100 features, that means QGIS is having to create a connection to the data source, set up and prepare a query to match a single feature, wait for the provider to process that, and then finally parse the single feature result. That’s a lot of wasted processing!* – Kadir Şahbaz Jan 16 '22 at 16:37
  • An expression may include that kind of situations. @Babel – Kadir Şahbaz Jan 16 '22 at 16:39
  • That makes sense, thanks a lot for the explanation. – Babel Jan 16 '22 at 16:42
6

Your expression is not efficient because it calculates the same sum over and over again: for feature no. 100, it is adding 100 numbers (feature1, feature2, ... feature100), even though adding 99% of these values (from feature1 to feature99) was done before.

Better rely on the already calculated sum of the last feature: for feature 100, something like (pseudocode): sum (feature100) = sum (feature99) + value (feature100)

So try this expression:

attribute(get_feature_by_id(@layer, $id), 'Equitativo')
+
attribute(get_feature_by_id(@layer, $id-1), 'Equitativo')

Will definitely take more time than the Python script by @Kadir Şahbaz, so you might want to stick to this. This solution here is just to show how to do it using QGIS expressions only.

Taras
  • 32,823
  • 4
  • 66
  • 137
Babel
  • 71,072
  • 14
  • 78
  • 208
  • 2
    +1. A small issue here: We don't know if the table is ordered by ID_FINAL. Its order may not be matched with $id order. Is that possible to order by ID_FINAL using expression before running it? – Kadir Şahbaz Jan 16 '22 at 14:22
  • Indeed, this works if the order for cumulative sum is the same than $id. If not, this would complicate things and I would definitely not stick to using expressions for larger datasets as your Python skript is much more efficient. – Babel Jan 16 '22 at 14:49