
I am importing about 60 million records of CSV data (about 25 GB) into a PostgreSQL server using ogr2ogr.

After googling a million times, I found out that I can make a VRT file and upload my CSV with ogr2ogr. Here are the VRT file and the command I used to import the data:

<OGRVRTDataSource>
    <OGRVRTLayer name="combine">
        <SrcDataSource>D:\combine.csv</SrcDataSource>
        <GeometryType>wkbPoint</GeometryType>
        <LayerSRS>WGS84</LayerSRS>
        <GeometryField encoding="PointFromColumns" x="longitude" y="latitude"/>
    </OGRVRTLayer>
</OGRVRTDataSource>

ogr2ogr -overwrite -progress -gt 999999999 -skipfailures -f "PostgreSQL" PG:"host=000.000.000.000 port=0000 dbname=myDB user=me password=youtellme" -lco GEOMETRY_NAME=shape -t_srs EPSG:3857 D:\input.vrt -nln output    

I thought I could adjust the -gt option to speed up the import, yet it doesn't help much.

It has been 3 days since I started importing the data into the PostgreSQL server, and only 20 million records have been uploaded so far (meaning it will take almost a week to finish).

It seems like my ogr2ogr command (or the VRT) can only import 100–200 records per second, depending on the server status (see the screen capture below).

(screen capture of the import progress omitted)

  • If your code is working, then don't forget that there is also a [codereview.se] Stack Exchange. – PolyGeo Feb 27 '18 at 03:50
  • @PolyGeo sure thing. Thank you so much. I didn't know about that community. Should I ask there or here, though? – Pil Kwon Feb 27 '18 at 04:19
  • Whatever you do, you should not ask on both at the same time, because that would be cross-posting. If you don't see progress here then I think you should delete it and re-ask there. Our community seems generally willing to try and help with performance issues as long as they can be illustrated in a code snippet. – PolyGeo Feb 27 '18 at 05:15
  • What is your GDAL version? In https://gis.stackexchange.com/questions/109564/what-is-the-best-hack-for-importing-large-datasets-into-postgis/109604#109604, using --config PG_USE_COPY YES helped a lot, but since GDAL 2.0 that should be the default. Are you sure that you don't have trouble with your database connection or hardware? – user30184 Feb 27 '18 at 05:54
  • @user30184 my GDAL version is 2.2.3 and I don't see any hardware issues on my server :( – Pil Kwon Feb 27 '18 at 06:48
  • 2
    Why don't you just use the COPY command. It is by far the fastest way of loading CSV into Postgres and avoid all that hideous XML parsing, which is going to all loads of overhead. XML before breakfast, there should have had a health warning, I think I'd better go and look at some GeoJSON with my coffee :-) – John Powell Feb 27 '18 at 06:55
  • @JohnPowellakaBarça if I just copy the CSV to PostGIS, can I still 'geocode' my plain-text lat and lon? Does PostGIS do that automatically for me? – Pil Kwon Feb 27 '18 at 06:56
  • 4
    No, but after you have imported it, which will take hours istead of days you do: ALTER TABLE sometable ADD column geom geometry(POINT, 4326); UPDATE sometable SET geom = ST_SetSRID(ST_Makepoint(lon, lat), 4326) or something similar and then drop the text columns, if you see fit. I have used a similar approach hundreds of times and it will be vastly faster that the VRT approach, even if it involves more steps. – John Powell Feb 27 '18 at 07:04
  • To make sure there are no hardware problems, convert some test data, for example one million rows, into a shapefile and store that in PostGIS with ogr2ogr. The hypothesis is that you should be able to do that in less than two minutes. – user30184 Feb 27 '18 at 07:58
  • Good luck getting a 25 GB file into a shapefile, though. Postgres is optimised for CSV import, which is what the OP has. I have used the approach above for billions of rows. – John Powell Feb 27 '18 at 09:40
  • @Barça, I thought I wrote about making a sample of just one million rows for a somewhat reliable test to find the bottleneck. If the VRT route is really so much slower, there may be something to report to the GDAL developers. – user30184 Feb 27 '18 at 12:24
  • @JohnPowellakaBarça I cannot use COPY because the server is installed on another machine; the error message tells me I have to use psql. I am a relative newbie in this world, so I am googling how to import my CSVs into PostgreSQL using psql. It's another headache... – Pil Kwon Feb 27 '18 at 12:52
  • Actually, you can run COPY against a remote machine with local data, using the -c switch of psql and stdin/stdout (sketched after these comments); e.g., see: https://www.endpoint.com/blog/2013/11/21/copying-rows-between-postgresql. But, yes, you would have to use psql. – John Powell Feb 27 '18 at 12:59
  • I made a simple test with my laptop. A million points with only X and Y columns, converted via a VRT file with ogr2ogr, took 37 seconds; the "tuples in" rate was around 20000–60000. I strongly believe that your bottleneck is between your computer and the PostgreSQL server. – user30184 Feb 27 '18 at 13:43
  • @JohnPowellakaBarça psql -h myserverIP -d dbname -U username -c "\copy dataTable from STDIN" < D:\combine.csv... I am trying this but I don't know how to see the progress – Pil Kwon Feb 27 '18 at 14:40
  • @user30184 I am going to try again and see if I can somehow fix the bottleneck. Thanks for the help! If things get better I will answer my own question. – Pil Kwon Feb 27 '18 at 14:40
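
A minimal sketch of the post-import steps John Powell describes above, assuming the CSV has already been loaded into a table called sometable with numeric lon/lat columns (all names here are placeholders):

    -- Add a geometry column and populate it from the coordinate columns.
    ALTER TABLE sometable ADD COLUMN geom geometry(Point, 4326);
    UPDATE sometable SET geom = ST_SetSRID(ST_MakePoint(lon, lat), 4326);

    -- Optional: index the new geometry, then drop the raw coordinate columns.
    CREATE INDEX sometable_geom_idx ON sometable USING GIST (geom);
    -- ALTER TABLE sometable DROP COLUMN lon, DROP COLUMN lat;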
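And a sketch of the remote load John Powell mentions: psql's client-side \copy reads a local file and streams it over the connection, so it works even when the server runs on another machine (the connection details, table name, and CSV options here are assumptions):

    psql -h 000.000.000.000 -p 0000 -d myDB -U me -c "\copy sometable FROM 'D:\combine.csv' CSV HEADER"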

1 Answer


You don't have to use a .vrt any more. ogr2ogr has supported reading CSV files with geometry directly since version 2.1.

The ogr2ogr command:

ogr2ogr -f "PostgreSQL" PG:"host=000.000.000.000 port=0000 dbname=myDB user=me password=youtellme" -s_srs EPSG:4326 -t_srs EPSG:3857 -progress -nln output -lco OVERWRITE=YES  -lco GEOMETRY_NAME=shape -oo X_POSSIBLE_NAMES=longitude -oo Y_POSSIBLE_NAMES=latitude combine.csv

It might be quicker to do the geometry transformation in the database.
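
A sketch of what that could look like: load with -s_srs EPSG:4326 and without -t_srs, then reproject inside PostGIS (this reuses the output table and shape column names from the command above):

    -- Reproject the already-loaded EPSG:4326 points to EPSG:3857 in one statement.
    ALTER TABLE output
        ALTER COLUMN shape TYPE geometry(Point, 3857)
        USING ST_Transform(shape, 3857);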

HeikkiVesanto
  • Also, "Starting with GDAL 2.0, COPY is used by default when inserting from a table that has just been created." Obviously "from" really means "into". – user30184 Feb 27 '18 at 14:08
  • @user30184 so it is! Amended the answer. – HeikkiVesanto Feb 27 '18 at 14:11
  • HALLELUJAH!!! I didn't change anything in my server settings and it's loading 40K records a second. Thank you so much, and thanks also to user30184 and John! – Pil Kwon Feb 27 '18 at 14:56
  • Excellent, that is quite a speed-up. Still worth learning psql, though :-) – John Powell Feb 27 '18 at 15:02
  • Oh, I see a problem here. I used the -skipfailures option and it slows the import back down to 150 records/sec. Hmm, it's weird. But you cured my headache! Thanks! – Pil Kwon Feb 27 '18 at 15:11
  • 1
    Ah, I should have noticed that. Skipfailures is forcing the size of the transactions into 1. If writing into database fails it leads to rollback which affects the whole transaction. Therefore inserting rows one by one is the only possibility to skip only problematic rows. It should be mentioned in the ogr2ogr manual page. – user30184 Feb 27 '18 at 15:24
  • @user30184 is there any way I can use -skipfailures and keep a larger transaction size? – Pil Kwon Feb 27 '18 at 15:33
  • No, that's how databases support ACID (https://en.wikipedia.org/wiki/ACID). If a transaction must be rolled back, the whole transaction is lost, not only the faulty rows (see the illustration below). – user30184 Feb 27 '18 at 15:41
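
A minimal illustration of that rollback behaviour (table and values are hypothetical):

    BEGIN;
    INSERT INTO sometable (id) VALUES (1);  -- succeeds
    INSERT INTO sometable (id) VALUES (1);  -- fails, e.g. a duplicate key
    COMMIT;  -- reported as ROLLBACK: the aborted transaction stores neither row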