I'm analyzing a Kaggle dataset that is fairly normalized, specifically joining, grouping, and slicing and dicing three tables: patients (2M rows) <-> diagnosis (700M rows) <-> doctors (2M rows).
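For reference, the schema is shaped roughly like this (column names are illustrative, not the exact Kaggle ones):

```sql
-- Rough shape of the three tables (illustrative column names):
CREATE TABLE patients  (patient_id INT PRIMARY KEY /* , demographics ... */);
CREATE TABLE doctors   (doctor_id  INT PRIMARY KEY /* , specialty ...    */);
CREATE TABLE diagnosis (
    diagnosis_id BIGINT PRIMARY KEY,
    patient_id   INT,   -- joins to patients
    doctor_id    INT,   -- joins to doctors
    status       TEXT   -- short status value
    /* , ... */
);
```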
I have a machine with 30 GB RAM and a 300 GB SSD, so I'm able to analyze it with a good old C program (I practically malloc and load everything into memory...), but I want to employ the power of ad-hoc SQL queries.
I tried different MySQL, SQLite, and even mini-Hadoop configurations, both on my machine and on Google Compute Engine and Amazon EC2 (just to make sure it's not a config/env problem).
I can't find anything that gives me results in less than a couple of minutes for any ad-hoc query, and if I didn't go through the hassle of building indexes, ad-hoc performance was atrocious (on the order of hours).
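When I did build indexes, it was along these lines (a sketch, using the illustrative column names from above):

```sql
-- Secondary indexes on the join keys of the big table:
CREATE INDEX idx_diagnosis_patient ON diagnosis (patient_id);
CREATE INDEX idx_diagnosis_doctor  ON diagnosis (doctor_id);
```

Maintaining a set like this for every new slice-and-dice question is the hassle I mean.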
Am I missing something obvious? If it's possible for my C program to fit everything in memory and do crude analysis, isn't there a database that loads everything into the buffer cache and gives lightning-fast performance for all the wonderful SQL analytics?
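What I had in mind is something along these lines, e.g. for MySQL/InnoDB (a sketch; assumes MySQL 5.7+, where the buffer pool can be resized at runtime):

```sql
-- Hand InnoDB ~24 GB of the 30 GB RAM so the working set stays cached.
-- On older versions this goes in my.cnf (innodb_buffer_pool_size) plus a restart.
SET GLOBAL innodb_buffer_pool_size = 24 * 1024 * 1024 * 1024;
```

But even with generous memory settings I couldn't get below the couple-of-minutes mark.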
I know this might sound a bit naive, but please go easy on me. I'm used to huge Teradata/Hadoop clusters doing the heavy lifting on many TB of data with very good response times, so I expected this to be very easy to handle...
A TEXT column to store a few-bytes status? It should be TINYINT. No indexes at all and you expect miracles? Wow, speechless. – ypercubeᵀᴹ Jan 09 '16 at 19:20
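In other words, the fix the comment points at would look something like this (the table name, and the assumption that the existing values are small integer codes, are guesses):

```sql
-- Swap the TEXT status for a 1-byte integer code; 700M rows shrink accordingly.
-- Assumes the current values are already small integers stored as text.
ALTER TABLE diagnosis MODIFY status TINYINT NOT NULL;
```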