
I'm bulk loading a few thousand CSVs totaling ~260GB and ~4.5 billion rows into MySQL, using a mysqlimport call for each CSV executed from a bash script. The MySQL installation and the data itself are on an external drive, as my machine didn't have enough disk space to accommodate them. My first attempt seemed to stall after about 24 hours, so I created another table without the indexes to see if loading sped up (I figured having to update the indexes after loading each CSV was slowing things down).
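
For context, the load script is essentially a loop along the lines of the sketch below (database, table, and path names are placeholders rather than the real ones, and the actual flags differ):

    #!/bin/bash
    # Illustrative sketch only. mysqlimport derives the target table from each
    # file's basename, so every daily CSV is first linked to a name matching
    # the table before being imported.
    DB="mydb"                     # placeholder database name
    TABLE="history"               # placeholder table name
    CSV_DIR="/mnt/external/csv"   # placeholder path on the external drive

    for f in "$CSV_DIR"/*.csv; do
        ln -sf "$f" "/tmp/${TABLE}.csv"
        mysqlimport --local \
            --fields-terminated-by=',' \
            --lines-terminated-by='\n' \
            "$DB" "/tmp/${TABLE}.csv"
    done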

I'm now about 36 hours in and a little over 50% complete, but I'm stumped as to how I could speed up processing (or whether it's even possible). Is there a better way to load this much data into a MySQL table?

Edit: Processing seems to be slowing as it goes. Roughly 50% of the data loaded in the first 36 hours, but only an additional ~10% has loaded in the 24 hours since.

Edit 2: I've increased innodb_buffer_pool_size to 4x its default value (128MB -> 512MB) and loading appears to be humming along: a single file was taking 15+ minutes to load prior to tuning, and individual files are now loading within ~1 minute. There's been a corresponding slight increase in RAM usage (~4.5GB to ~5.6GB).
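
For reference, the change was applied dynamically, roughly as sketched below (the target value is just the one from this round of tuning, and online resizing assumes MySQL 5.7+):

    -- Check the current buffer pool size, then resize it without a restart
    -- (the server rounds the value to a multiple of innodb_buffer_pool_chunk_size).
    SELECT @@innodb_buffer_pool_size;

    SET GLOBAL innodb_buffer_pool_size = 512 * 1024 * 1024;  -- 512MB for this round

    -- Watch the online resize until it reports completion.
    SHOW STATUS LIKE 'Innodb_buffer_pool_resize_status';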

Chris
  • How often do you commit? – Gerard H. Pille Mar 08 '20 at 15:07
  • Just this once; it's a collection of historical data in this case. – Chris Mar 08 '20 at 20:16
  • increasing global innodb_buffer_pool_size will possibly help and can be done dynamically - don't exceed the amount of RAM available as OOM won't help you. innodb_flush_log_at_trx_commit=2 will remove some IO overhead at the sacrifice of durability. set back to 1 later. – danblack Mar 08 '20 at 22:06
  • You are trying to insert each CSV file in a single transaction; MySQL may choke on that. Cut the biggest files into reasonable chunks. – Gerard H. Pille Mar 08 '20 at 22:16
  • @GerardH.Pille, the data is already broken into daily intervals (~4000+ files). Are you suggesting breaking them down further still? FWIW, the first portion of the load went fairly quickly; it's getting slower with time (first 24h loaded 50% of the data, 24h since just another ~10%) – Chris Mar 08 '20 at 22:54
  • @danblack, thanks. I imagine this is something akin to MySQL's cache. Any recommendation on how large to make it? Only 4.4G of RAM appears to be in use (on a 16G system). – Chris Mar 08 '20 at 22:58
  • @danblack, wow. Just updated buffer size to 2x and set the flush log variable to 2...processing appears to have quickened markedly (attributable mostly to increasing buffer size). – Chris Mar 08 '20 at 23:32
  • @Chris What is the result displayed for SELECT @@innodb_change_buffer_max_size? If it is not 50, SET GLOBAL innodb_change_buffer_max_size=50 to ingest as much as possible and reduce the length of time required. Also, what is the result of SELECT @@innodb_flush_neighbors? If it is not 2, SET GLOBAL innodb_flush_neighbors=2 so that all rows in one extent are handled in one cycle. This should help reduce SHOW GLOBAL STATUS LIKE '%pages_dirty%' in the least amount of time. – Wilson Hauck Mar 08 '20 at 23:45
  • @WilsonHauck, thanks. Should I reset these variables to their defaults once this load is complete, or are these settings preferable in general? – Chris Mar 09 '20 at 02:55
  • @Chris, innodb_flush_neighbors=2 would be best long-term. Keep innodb_change_buffer_max_size=20 until you are loading significant data, then 50 for the duration plus a couple of hours to get the flushing done to your data tables. You will know it is done when the SHOW GLOBAL STATUS LIKE '%pages_dirty%'; result is less than 100. – Wilson Hauck Mar 09 '20 at 09:26
  • You should at least commit in between each file. – Gerard H. Pille Mar 09 '20 at 11:38
  • On a 16G machine, probably 10G of innodb buffer pool size would have been a good starting point. – danblack Mar 09 '20 at 23:37
  • @danblack, is there a formal relationship between buffer pool size and RAM used? As noted above, moving the pool size to 2x (i.e., 512MB) resulted in RAM usage that settled around 6G. Obviously, not all of that is MySQL. – Chris Mar 10 '20 at 04:11
  • The buffer pool is a one-off, 1:1 mapping in size (barring some edge cases). The appropriate size is determined by comparing Innodb_buffer_pool_reads vs Innodb_buffer_pool_read_requests. It's like cache efficiency: high is good, 99.99% is overkill. Loading data is an abnormality relative to "normal use". It still needs a buffer because of secondary indexes, foreign keys, and ensuring unique constraints are met, on top of just putting an entry in a secondary index. – danblack Mar 10 '20 at 04:22
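
Pulling the suggestions from the comments above into one place, a sketch of the settings for the duration of the load (values are the ones discussed in the comments, not measured optima; innodb_flush_log_at_trx_commit and innodb_change_buffer_max_size get reverted once loading and flushing finish):

    -- Bulk-load tuning sketch based on the comments above. Durability is reduced
    -- while innodb_flush_log_at_trx_commit=2 is in effect; set it back to 1 afterward.
    SET GLOBAL innodb_flush_log_at_trx_commit = 2;

    -- Let the change buffer absorb more secondary-index work (50 is the maximum).
    SET GLOBAL innodb_change_buffer_max_size = 50;

    -- Flush all dirty pages in the same extent together.
    SET GLOBAL innodb_flush_neighbors = 2;

    -- Watch dirty pages drain after the load; under ~100 means flushing has caught up.
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';

    -- Gauge whether the buffer pool is big enough: Innodb_buffer_pool_reads (misses)
    -- should be a small fraction of Innodb_buffer_pool_read_requests.
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';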

0 Answers