I'm trying to map a CSV with over a million lat/lon points in the US, with no metadata. I've never tried to map this much data and want to know what steps I should take. I know it's too much data for QGIS, so I tried splitting it into 100,000-line CSVs, but that left me with a lot of files I didn't know what to do with. I'm thinking it might be best to aggregate the points by census block, but I don't know the best way to do that using JavaScript (I've never used Python) or command-line tools. Any suggestions?
3 Answers
I would recommend using PostgreSQL/PostGIS, since it is natively supported by QGIS, has some handy built-in functions to join with other spatial data (like census tracts), and the use of a database should limit memory issues with large datasets. My recommended set of steps is below.
To execute SQL commands you can use either PGAdmin or QGIS: the former will give you more informative errors on queries, while the latter will let you load the results of queries as layers on a map. To access the latter, go to Database > DB Manager > DB Manager and click on the second button (the SQL window).
1. Get started setting up PostGIS; there are also quick installers.
2. Create a table in your database using the CREATE TABLE command (example), in either the QGIS SQL window or PGAdmin's SQL window.
3. Import your CSV with either a COPY SQL command or PGAdmin's built-in import function, by right-clicking on your table in PGAdmin and clicking Import... (the latter can be delicate, so I'd recommend COPY for larger datasets).
4. Add a geometry column to your table by executing the following SQL in either PGAdmin or the QGIS SQL window:
   ALTER TABLE some_table ADD COLUMN geom geometry(Point, 4326);
5. Create the point geometries using something like:
   UPDATE yourtable SET geom = ST_SetSRID(ST_MakePoint(x, y), 4326);
6. Display a subset of the data by using a SELECT statement with something like LIMIT 50000.
7. Or join to census data with something like:
   SELECT c.gid, c.geom
   FROM census c
   INNER JOIN yourdata ON ST_Within(yourdata.geom, c.geom)
Function references:
ST_Within
ST_MakePoint
ST_SetSRID
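Since the stated goal is to aggregate the points by census block, the join in the last step can be extended with a GROUP BY to count points per polygon. A sketch, reusing the table and column names assumed above (census, yourdata, geom — adjust to your schema):

```sql
-- A spatial index makes the join far faster on a million points
CREATE INDEX yourdata_geom_idx ON yourdata USING GIST (geom);

-- One row per census polygon with the number of points falling inside it
-- (LEFT JOIN keeps polygons that contain no points, with a count of 0)
SELECT c.gid, c.geom, COUNT(yourdata.geom) AS point_count
FROM census c
LEFT JOIN yourdata ON ST_Within(yourdata.geom, c.geom)
GROUP BY c.gid, c.geom;
```

Loading that query's result through DB Manager gives you one aggregated polygon layer instead of a million points, which QGIS can render comfortably.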
- Thanks for the reply. I tried step 3 but I can't edit the table in QGIS. Is there another way to add a column without using a plugin? – user63623 Dec 09 '15 at 19:38
- @user63623 I updated my answer to include an extra step that should address your issue. – raphael Dec 09 '15 at 20:32
- Thanks, I think there's a step missing between 1 and 2. When you import a CSV layer into QGIS you need to save it as a shapefile to create a database, right? It freezes when I try that, though. I also tried using PGAdmin but I can't find the import function. – user63623 Dec 09 '15 at 22:14
- In PGAdmin you would first have to CREATE your table, using an SQL query like this: https://stackoverflow.com/questions/9826833/create-table-in-postgresql, with all your columns from the CSV properly defined. Then you can right-click on the table name > Import... to access the Import wizard, or use a COPY command: http://www.postgresql.org/docs/current/static/sql-copy.html. I'll update the answer. – raphael Dec 10 '15 at 02:47
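To make that comment concrete, a minimal sketch of the CREATE TABLE plus COPY route — the column names (lon, lat) and the file path are assumptions, so match them to your actual CSV:

```sql
-- Assumed two-column CSV of coordinates; adjust names/types to your file
CREATE TABLE yourdata (
    lon double precision,
    lat double precision
);

-- Server-side COPY; use psql's \copy instead if the CSV lives on your client machine
COPY yourdata (lon, lat) FROM '/path/to/points.csv' WITH (FORMAT csv, HEADER true);
```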
I recently worked with a dataset of 1.4 million points imported from a CSV. I made sure to delete any irrelevant fields in the CSV first. It worked fine for me, even though some processes took a while to execute. (QGIS 2.12, 64-bit Windows 7, 8 GB RAM)
An R candidate fwiw, in pseudocode:
library(rgdal) ## for spatial export; also attaches sp, which provides coordinates()
library(readr) ## for fast file read
x <- as.data.frame(read_csv("file.csv")) ## coordinates()<- wants a plain data.frame
names(x) ## inspect the as-yet-unknown columns
coordinates(x) <- c("x", "y") ## your coordinate names may be different
writeOGR(x, ".", layer = "filepoints", driver = "MapInfo File")
This will give you a filepoints.tab in the working directory you can read with QGIS. Or choose "ESRI Shapefile" to create a filepoints.shp, or whatever format it is you need. QGIS uses GDAL much like rgdal does, so there's a lot of overlap. There are analogous Python mechanisms.
Since you don't have any metadata, you could also set the coordinate reference system after assigning coordinates(x), with proj4string(x) <- CRS("+proj=something +etc"), but we can only guess at what it should be for your data.