I am writing an application that needs to store and analyze large amounts of electrical and temperature data.

Basically, I need to store hourly electricity usage measurements for tens of thousands of locations, covering the past several years and many years to come, and then analyze the data in a fairly simple manner.

The information that I need to store (for now) is Location ID, Timestamp (Date and Time), Temperature and Electricity Usage.

As for the amount of data that needs to be stored, this is an approximation, but something along these lines:
20 000+ locations, 720 records per month (hourly measurements, approximately 720 hours per month), 120 months (for 10 years back) and many years into the future. Simple calculations yield the following results:

20 000 locations x 720 records x 120 months (10 years back) = 1 728 000 000 records.

These are just the past records; new records will be imported monthly, so that's approximately 20 000 x 720 = 14 400 000 new records per month.

The total number of locations will steadily grow as well.
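To make the volume concrete, here is the same arithmetic as a quick sanity check (the figures are the approximations from above):

```python
locations = 20_000       # current number of locations (approximate)
per_month = 720          # ~720 hourly readings per month
months = 120             # 10 years of history

historical_records = locations * per_month * months
new_records_per_month = locations * per_month

print(historical_records)       # 1728000000
print(new_records_per_month)    # 14400000
```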

On all of that data, the following operations will need to be executed:

  1. Retrieve the data for a certain date AND time period: all records for a certain Location ID between the dates 01.01.2013 and 01.01.2017 and between 07:00 and 13:00.
  2. Simple mathematical operations for a certain date AND time range, e.g. MIN, MAX and AVG temperature and electricity usage for a certain Location ID for 5 years between 07:00 and 13:00.
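To illustrate, here is a minimal sketch of both operations in Python with an in-memory SQLite database. The table and column names are made up, the rows are hypothetical, and I've assumed the 01.01.2017 end date and the 13:00 boundary are inclusive; this is only meant to show the query shapes, not a production design:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE measurements (
        location_id INTEGER NOT NULL,
        ts          TEXT    NOT NULL,  -- ISO-8601 'YYYY-MM-DD HH:MM:SS'
        temperature REAL,
        usage_kwh   REAL,
        PRIMARY KEY (location_id, ts)
    )
""")

# A few hypothetical rows for location 42.
rows = [
    (42, "2013-06-01 08:00:00", 21.5, 1.2),
    (42, "2013-06-01 14:00:00", 25.0, 2.0),  # outside 07:00-13:00
    (42, "2016-12-31 12:00:00", -3.0, 4.8),
    (42, "2017-06-01 09:00:00", 22.0, 1.5),  # outside the date range
]
conn.executemany("INSERT INTO measurements VALUES (?, ?, ?, ?)", rows)

# Operation 1: all records for one location in a date range AND an
# hour-of-day range (hours 07..13 inclusive).
query1 = """
    SELECT * FROM measurements
    WHERE location_id = ?
      AND ts >= '2013-01-01' AND ts < '2017-01-02'
      AND strftime('%H', ts) BETWEEN '07' AND '13'
"""

# Operation 2: MIN/MAX/AVG over the same slice.
query2 = """
    SELECT MIN(temperature), MAX(temperature), AVG(usage_kwh)
    FROM measurements
    WHERE location_id = ?
      AND ts >= '2013-01-01' AND ts < '2017-01-02'
      AND strftime('%H', ts) BETWEEN '07' AND '13'
"""

print(conn.execute(query1, (42,)).fetchall())
print(conn.execute(query2, (42,)).fetchone())
```

Note that the hour filter uses an expression (`strftime`), which a plain index on `ts` cannot serve, which is part of what question 3 below is about.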

The data will be written monthly, but will be read constantly by hundreds of users (at least), so read speed is significantly more important.

I have no experience with NoSQL databases, but from what I've gathered, they are the best solution to use here. I've read about the most popular NoSQL databases, but since they are quite different and also allow for very different table architectures, I have not been able to decide which one is best.

My main choices were Cassandra and MongoDB, but since I have very limited knowledge and no real experience when it comes to large data and NoSQL, I am not very certain. I have also read that PostgreSQL handles such amounts of data well.

My questions are the following:

  1. Should I use a NoSQL database for such large amounts of data? If not, can I stick to MySQL?
  2. What database should I use?
  3. Should I keep the date and time in separate, indexed (if possible) columns so that the data can be retrieved and processed quickly for certain date and time ranges, or can this be done by keeping the timestamp in a single column?
  4. Is a time series data modeling approach appropriate here, and if not could you give me pointers for a good table design?
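To make question 3 concrete, here is a sketch of the separate-columns option, again with SQLite and made-up names: the date and hour are stored in their own columns, so a plain composite index can serve the location + date-range + hour-range filter without evaluating an expression per row. Whether this actually beats a single timestamp column is exactly what I'm asking:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE measurements_split (
        location_id  INTEGER NOT NULL,
        measure_date TEXT    NOT NULL,  -- 'YYYY-MM-DD'
        measure_hour INTEGER NOT NULL,  -- 0..23
        temperature  REAL,
        usage_kwh    REAL
    );
    -- Composite index matching the query pattern:
    -- equality on location, range on date, range on hour.
    CREATE INDEX ix_loc_date_hour
        ON measurements_split (location_id, measure_date, measure_hour);
""")

conn.execute(
    "INSERT INTO measurements_split VALUES (42, '2015-03-01', 9, 18.0, 1.1)"
)

rows = conn.execute("""
    SELECT * FROM measurements_split
    WHERE location_id = 42
      AND measure_date BETWEEN '2013-01-01' AND '2017-01-01'
      AND measure_hour BETWEEN 7 AND 13
""").fetchall()
print(rows)
```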

Thank you.

  • While not small, this is not particularly a LARGE amount of data for proper hardware. And I hate to tell you, but so far what you have there sounds like relational data. – TomTom Oct 17 '17 at 14:16
  • 1
    Since I read that MySQL (which I am currently using) is not the best choice for 1 000 000 000+ records and NoSQL is usually the solution, I posted my questions here, since I lack experience with NoSQL databases or working with billions of records of data. I also wrote that I may be wrong, thus asking for advice. Thanks. – Gecata Oct 17 '17 at 14:20
  • 6
    I've stored multi-TB tables with tens of billions of rows in MS SQL Server 2008-2014 by using a good key (epoch date), compression, partitioning, and ensuring my queries/indexes are partition aligned. I had to move to NoSQL (Hadoop) when I started getting petabytes of data to analyze and index differently. NoSQL should have other considerations and in this case, it doesn't seem to fit. – Ali Razeghi - AWS Oct 17 '17 at 16:42
  • 4
    @AliRazeghi Hadoop has nothing to do with SQL or NoSQL -- it's just a storage engine. There are plenty of SQL interfaces backed by Hadoop out there. – mustaccio Oct 17 '17 at 16:54
  • 3
    What are your constraints re:money to spend on software/licenses? – user3067860 Oct 17 '17 at 16:57
  • 1
    Hadoop was an example of a NoSQL solution that worked for us and many others when used with HBase or something like TSDB, in no way is it stated that it is the only one in my 1 quick example. The HDFS portion of hadoop is a storage engine. The MapReduce portion which is the second half is what gives it the capability to be much more than 'just store data as a file system'. I don't care what they use but it worked for us. Depending on your use and app, it could be considered NoSQL. – Ali Razeghi - AWS Oct 17 '17 at 17:03
  • A must-read: https://medium.com/@Pinterest_Engineering/sharding-pinterest-how-we-scaled-our-mysql-fleet-3f341e96ca6f . Especially the "How we sharded" part. – Miguel Oct 17 '17 at 20:38
  • 1
    When you have infinite money, then I would suggest to buy a SAP HANA appliance. It's great for aggregations on large datasets. But you likely haven't infinite money. – Philipp Oct 18 '17 at 11:10