I am writing an application that needs to store and analyze large amounts of electrical and temperature data.

Basically, I need to store hourly electricity usage measurements for tens of thousands of locations, covering the past several years and many years to come, and then analyze the data in a fairly simple manner.

The information that I need to store (for now) is Location ID, Timestamp (Date and Time), Temperature and Electricity Usage.

As for the amount of data that needs to be stored, this is an approximation, but something along these lines:
20 000+ locations, 720 records per month (hourly measurements, approximately 720 hours per month), 120 months (for 10 years back) and many years into the future. Simple calculations yield the following results:

20 000 locations x 720 records x 120 months (10 years back) = 1 728 000 000 records.

These are just the past records; new records will be imported monthly, so that's approximately 20 000 x 720 = 14 400 000 new records per month.

The total number of locations will steadily grow as well.
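To make the volume concrete, here is the same arithmetic as a quick sanity check (the figures are the approximations from above):

```python
locations = 20_000       # current number of locations (approximate)
per_month = 720          # ~720 hourly readings per month
months = 120             # 10 years of history

historical_records = locations * per_month * months
new_records_per_month = locations * per_month

print(historical_records)       # 1728000000
print(new_records_per_month)    # 14400000
```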

On all of that data, the following operations will need to be executed:

  1. Retrieve the data for a certain date AND time period: all records for a certain Location ID between the dates 01.01.2013 and 01.01.2017 and between 07:00 and 13:00.
  2. Simple mathematical operations for a certain date AND time range, e.g. MIN, MAX and AVG temperature and electricity usage for a certain Location ID for 5 years between 07:00 and 13:00.
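To illustrate, here is a minimal sketch of both operations in Python with an in-memory SQLite database. The table and column names are made up, the rows are hypothetical, and I've assumed the 01.01.2017 end date and the 13:00 boundary are inclusive; this is only meant to show the query shapes, not a production design:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE measurements (
        location_id INTEGER NOT NULL,
        ts          TEXT    NOT NULL,  -- ISO-8601 'YYYY-MM-DD HH:MM:SS'
        temperature REAL,
        usage_kwh   REAL,
        PRIMARY KEY (location_id, ts)
    )
""")

# A few hypothetical rows for location 42.
rows = [
    (42, "2013-06-01 08:00:00", 21.5, 1.2),
    (42, "2013-06-01 14:00:00", 25.0, 2.0),  # outside 07:00-13:00
    (42, "2016-12-31 12:00:00", -3.0, 4.8),
    (42, "2017-06-01 09:00:00", 22.0, 1.5),  # outside the date range
]
conn.executemany("INSERT INTO measurements VALUES (?, ?, ?, ?)", rows)

# Operation 1: all records for one location in a date range AND an
# hour-of-day range (hours 07..13 inclusive).
query1 = """
    SELECT * FROM measurements
    WHERE location_id = ?
      AND ts >= '2013-01-01' AND ts < '2017-01-02'
      AND strftime('%H', ts) BETWEEN '07' AND '13'
"""

# Operation 2: MIN/MAX/AVG over the same slice.
query2 = """
    SELECT MIN(temperature), MAX(temperature), AVG(usage_kwh)
    FROM measurements
    WHERE location_id = ?
      AND ts >= '2013-01-01' AND ts < '2017-01-02'
      AND strftime('%H', ts) BETWEEN '07' AND '13'
"""

print(conn.execute(query1, (42,)).fetchall())
print(conn.execute(query2, (42,)).fetchone())
```

Note that the hour filter uses an expression (`strftime`), which a plain index on `ts` cannot serve, which is part of what question 3 below is about.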

The data will be written monthly, but will be read constantly by hundreds of users (at least), so read speed is significantly more important.

I have no experience with NoSQL databases, but from what I've gathered, they are the best solution to use here. I've read about the most popular NoSQL databases, but since they are quite different and also allow for very different table architectures, I have not been able to decide which one is best.

My main choices were Cassandra and MongoDB, but since I have very limited knowledge and no real experience when it comes to large data and NoSQL, I am not very certain. I have also read that PostgreSQL handles such amounts of data well.

My questions are the following:

  1. Should I use a NoSQL database for such large amounts of data? If not, can I stick to MySQL?
  2. What database should I use?
  3. Should I keep the date and time in separate, indexed (if possible) columns so that the data can be retrieved and processed quickly for certain date and time ranges, or can this be done by keeping the timestamp in a single column?
  4. Is a time series data modeling approach appropriate here, and if not could you give me pointers for a good table design?
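To make question 3 concrete, here is a sketch of the separate-columns option, again with SQLite and made-up names: the date and hour are stored in their own columns, so a plain composite index can serve the location + date-range + hour-range filter without evaluating an expression per row. Whether this actually beats a single timestamp column is exactly what I'm asking:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE measurements_split (
        location_id  INTEGER NOT NULL,
        measure_date TEXT    NOT NULL,  -- 'YYYY-MM-DD'
        measure_hour INTEGER NOT NULL,  -- 0..23
        temperature  REAL,
        usage_kwh    REAL
    );
    -- Composite index matching the query pattern:
    -- equality on location, range on date, range on hour.
    CREATE INDEX ix_loc_date_hour
        ON measurements_split (location_id, measure_date, measure_hour);
""")

conn.execute(
    "INSERT INTO measurements_split VALUES (42, '2015-03-01', 9, 18.0, 1.1)"
)

rows = conn.execute("""
    SELECT * FROM measurements_split
    WHERE location_id = 42
      AND measure_date BETWEEN '2013-01-01' AND '2017-01-01'
      AND measure_hour BETWEEN 7 AND 13
""").fetchall()
print(rows)
```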

Thank you.

  • While not small, this is not particularly a LARGE amount of data for proper hardware. And I hate to tell you, but so far what you have there sounds like relational data. – TomTom Oct 17 '17 at 14:16
  • 1
    Since I read that MySQL (which I am currently using) is not the best choice for 1 000 000 000+ records and NoSQL is usually the solution, I posted my questions here, since I lack experience with NoSQL databases or working with billions of records of data. I also wrote that I may be wrong, thus asking for advice. Thanks. – Gecata Oct 17 '17 at 14:20
  • 6
    I've stored multi-TB tables with tens of billions of rows in MS SQL Server 2008-2014 by using a good key (epoch date), compression, partitioning, and ensuring my queries/indexes are partition aligned. I had to move to NoSQL (Hadoop) when I started getting petabytes of data to analyze and index differently. NoSQL should have other considerations and in this case, it doesn't seem to fit. – Ali Razeghi - AWS Oct 17 '17 at 16:42
  • 4
    @AliRazeghi Hadoop has nothing to do with SQL or NoSQL -- it's just a storage engine. There are plenty of SQL interfaces backed by Hadoop out there. – mustaccio Oct 17 '17 at 16:54
  • 3
    What are your constraints re:money to spend on software/licenses? – user3067860 Oct 17 '17 at 16:57
  • 1
    Hadoop was an example of a NoSQL solution that worked for us and many others when used with HBase or something like TSDB, in no way is it stated that it is the only one in my 1 quick example. The HDFS portion of hadoop is a storage engine. The MapReduce portion which is the second half is what gives it the capability to be much more than 'just store data as a file system'. I don't care what they use but it worked for us. Depending on your use and app, it could be considered NoSQL. – Ali Razeghi - AWS Oct 17 '17 at 17:03
  • A must-read: https://medium.com/@Pinterest_Engineering/sharding-pinterest-how-we-scaled-our-mysql-fleet-3f341e96ca6f . Especially the "How we sharded" part. – Miguel Oct 17 '17 at 20:38
  • 1
    When you have infinite money, then I would suggest to buy a SAP HANA appliance. It's great for aggregations on large datasets. But you likely haven't infinite money. – Philipp Oct 18 '17 at 11:10