3

I would like to split a large dataset into multiple smaller ones and store them. For example, I have a big point feature dataset and would like to split it into, let's say, 4 parts, and then store the 4 smaller datasets. Is there a tool in QGIS (or geopandas) to do this?

I know that you can create a grid, join the features to the grid cell ids, then select and export each part, etc., but a tool that splits the dataset just by defining the number of pieces would be much easier.

BERA
i.i.k.

3 Answers

3

Yes, there is: the SAGA tool 'Split features layer', which does exactly what I described above. You specify the number of pieces by defining a suitable grid (if you want 4 pieces, you create a grid with 4 cells, etc.). As output, it creates 4 parts (as shapefiles).
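For comparison, the same grid idea can be sketched in Python. This is not how the SAGA tool works internally, only a rough equivalent under stated assumptions: the layer contains point geometries only, the grid is laid over the bounding box of the points, and the names `grid_cell_ids`, `split_points_by_grid`, `cell_id` and the `cell_<id>.geojson` file pattern are made up for illustration.

```python
import os


def grid_cell_ids(xs, ys, nx, ny):
    """Return one cell id (0 .. nx*ny - 1) per point, using a regular
    nx-by-ny grid laid over the bounding box of the points."""
    xs, ys = list(xs), list(ys)
    if not xs:
        return []
    xmin, xmax = min(xs), max(xs)
    ymin, ymax = min(ys), max(ys)
    # Guard against a zero extent (all points share one coordinate).
    width = (xmax - xmin) or 1.0
    height = (ymax - ymin) or 1.0
    ids = []
    for x, y in zip(xs, ys):
        # Points on the right/upper edge are clamped into the last cell.
        col = min(int((x - xmin) / width * nx), nx - 1)
        row = min(int((y - ymin) / height * ny), ny - 1)
        ids.append(row * nx + col)
    return ids


def split_points_by_grid(in_path, out_folder, nx=2, ny=2):
    """Hypothetical wrapper: write one GeoJSON per non-empty grid cell."""
    import geopandas as gpd  # imported here so grid_cell_ids has no GIS dependency

    gdf = gpd.read_file(in_path)
    # Assumes point geometries; .x / .y are not defined for lines or polygons.
    gdf["cell_id"] = grid_cell_ids(gdf.geometry.x, gdf.geometry.y, nx, ny)
    os.makedirs(out_folder, exist_ok=True)
    for cell_id, part in gdf.groupby("cell_id"):
        part.to_file(os.path.join(out_folder, f"cell_{cell_id}.geojson"))
```

Note that a spatial grid gives parts of unequal size when the points are unevenly distributed, and empty cells produce no file at all.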


i.i.k.
  • In the end, this tool is fine for medium-sized data, but for a file of about 1 GB it already takes quite a while, and you cannot define the names of the output files. If you have a single file that needs to be split, this is okay, but it is not very useful for automation tasks (for example, when you want to split multiple files). – i.i.k. Nov 02 '23 at 14:11
3

You can create a column of repeated numbers and group by it:

import geopandas as gpd
import numpy as np
import os
df = gpd.read_file(r"C:\GIS\data\testdata\1000points.geojson")
n_subsets = 6 #Number of subsets to create

#Create a column of repeated numbers: 0,1,2,3,4,5,0,1,2,3,4,5, ...
df["subset_group"] = np.tile(range(n_subsets), df.shape[0])[:df.shape[0]]

out_folder = r"C:\GIS\data\testdata\outoutout"
for subsetnum, subsetframe in df.groupby("subset_group"):
    out_file = os.path.join(out_folder, f"output_{subsetnum}.geojson")
    print(subsetframe.shape[0])
    subsetframe.to_file(out_file)
# 167
# 167
# 167
# 167
# 166
# 166
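If several files need to be split, or the output names should be configurable, the same approach could be wrapped in a function. The following is only a sketch: `subset_sizes`, `split_file` and the `prefix` parameter are hypothetical names, and the group column is built with a plain modulo instead of `np.tile`, which should give the same 0,1,2,... cycle.

```python
import os


def subset_sizes(n_rows, n_subsets):
    """Sizes of the groups obtained by cycling 0..n_subsets-1 over n_rows rows."""
    base, extra = divmod(n_rows, n_subsets)
    # The first `extra` groups receive one additional row.
    return [base + 1 if i < extra else base for i in range(n_subsets)]


def split_file(in_path, out_folder, n_subsets, prefix="output"):
    """Hypothetical helper: split one file into n_subsets GeoJSON files
    named <prefix>_<group>.geojson and return the paths written."""
    import geopandas as gpd  # imported here so subset_sizes has no GIS dependency

    gdf = gpd.read_file(in_path)
    # Same idea as the repeated-number column above: 0,1,...,n-1,0,1,...
    gdf["subset_group"] = [i % n_subsets for i in range(len(gdf))]
    os.makedirs(out_folder, exist_ok=True)
    out_files = []
    for subsetnum, subsetframe in gdf.groupby("subset_group"):
        out_file = os.path.join(out_folder, f"{prefix}_{subsetnum}.geojson")
        subsetframe.to_file(out_file)
        out_files.append(out_file)
    return out_files
```

`subset_sizes(1000, 6)` gives the same counts as printed above (four groups of 167 and two of 166). To process many files, call `split_file` in a loop and pass a different `prefix` for each input.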


BERA
1

You can also split based on the id of the features:

  1. Use Field Calculator to create a new field quartils with this expression, which uses the quartile (q1 and q3) and median functions:

    case 
     when $id >= q3 ($id) then 1
     when $id >= median ($id) then 2
     when $id >= q1 ($id) then 3
     else 4
    end
    
  2. Run Split vector layer with the field quartils from step 1 as Unique ID field.
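For readers working outside QGIS, the classification in step 1 can be approximated in plain Python. This is a sketch, not a reimplementation of the QGIS functions: it assumes `statistics.quantiles` with `method="inclusive"` is an acceptable stand-in for `q1`, `median` and `q3` (QGIS may interpolate quartiles differently), and `quartile_classes` is a made-up name.

```python
import statistics


def quartile_classes(ids):
    """Assign each id a class 1..4, mirroring the CASE expression above:
    1 for ids >= q3, 2 for ids >= median, 3 for ids >= q1, otherwise 4."""
    ids = list(ids)
    # Needs at least two values; returns the three cut points q1, median, q3.
    q1, med, q3 = statistics.quantiles(ids, n=4, method="inclusive")
    classes = []
    for i in ids:
        if i >= q3:
            classes.append(1)
        elif i >= med:
            classes.append(2)
        elif i >= q1:
            classes.append(3)
        else:
            classes.append(4)
    return classes
```

The resulting list can be stored as a column and used as the grouping field, in the same way as the quartils field in step 2.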


Babel