Which database should I look at for analyzing ~30gb, 700M records dataset?

Question

I'm looking at analyzing a kaggle dataset which is pretty normalized, specifically joining, grouping, slicing and dicing the 3 tables patients(2M) <-> diagnosis(700M) <-> doctors(2M).

I have 30GB RAM, 300GB SSD machine, so I'm able to analyze using a good old c program (I practically malloc and load everything to memory...), but I want to employ the power of ad-hoc SQL queries.

I tried different MySql, SQLLite and even mini-hadoop configurations, both on my machine and on google compute cloud and amazon EC (just to make sure that its not a config/env problem).

I can't find anything that will give me results in less than a couple of minutes to any ad-hoc query.. and if I didn't go thru the hassle of building indexes ad-hoc performance was atrocious (hours or so).

Am I missing something obvious? If its possible for my c program to fit everything in memory and do crude analysis, isn't there a database which loads everything to the buffer cache and gives lightning-fast performance for all the wonderful SQL analytics?

I know this might some a bit naive, but please go easy on me - I'm used to huge Teradata/Hadoop clusters doing heavylifting of many TB of data in very good timings, so I expected this to be very easy to handle...

Considering your related question (mysql: expected time for a simple group by on a single table, 2M records, single user?), your main problem is the design. Text column to store a few bytes status? It should be TINYINT. No indexes at all and you expect miracles? Wow, speechless. — ypercubeᵀᴹ, Jan 09 '16 at 19:20
If your queries consist of aggregate functions and group by's then I suggest you use some kind of cube tool instead — Nick.Mc, Jan 10 '16 at 05:46

Rick James · Answer 1 · 2016-01-10T05:22:15.500

(Too long for a comment...)

MySQL...

SHOW VARIABLES LIKE '%buffer%'; -- the main configuraton is the cache allocation. But it depends on which Engine you are using and how much RAM you have.

pretty normalized, specifically joining, grouping, slicing and dicing the 3 tables

Sorry, but we need to check the details. Please provide SHOW CREATE TABLE, some slow queries plus their EXPLAINs.

Are you CPU-bound? Then the queries and/or indexes need attending to.

Are you I/O-bound? Then probably indexes.

Edit 1

Glancing at the "schema", I see something naughty: diagnosis_code is NUMBER in one table and VARCHAR(30) in another. This makes it impossible to use an index when JOINing.

So, I repeat, let's see the MySQL version of CREATE TABLE and the queries.

Which database should I look at for analyzing ~30gb, 700M records dataset?

1 Answers1