
I have a query on the opening page of our application that takes about 8-10 seconds to load. I've tried everything I can to optimize it and I simply cannot make it go any faster (indexed appropriately, rewrote redundant code, rid it of the view-ception hell it lived in). It's a query on our largest table with a bunch of joins to other tables that have a many-to-one relationship to the main one (many rows in the smaller tables relate to one id in the largest), and it requires an unavoidable GROUP BY largeTable.id that makes it so slow.
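
For context, the shape of the query is roughly the following (table and column names here are invented for illustration; the real query joins several child tables):

    -- Hypothetical shape of the slow query
    SELECT  lt.id,
            lt.name,
            COUNT(c.id) AS child_count
    FROM    largeTable AS lt
    LEFT JOIN childTable AS c ON c.largeTable_id = lt.id
    GROUP BY lt.id, lt.name;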

Now I have to make this work faster one way or another. I think this may be an appropriate situation for a data warehouse (I hope that's the right term) type table. I would insert the results of the slow query into this warehouse table and then query the warehouse table directly. Then, any time something in my application changes a table that feeds the slow query, I would drop the rows in the warehouse table and recreate them. That refresh would take slightly longer than the initial query, but would only happen a few times a day; the rest of the day I'd be querying the much faster warehouse table directly.
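
Concretely, the plan would look something like this (a sketch only, reusing the made-up names from above):

    -- Cache table holding the pre-aggregated results of the slow query
    CREATE TABLE largeTable_summary (
        id           INT UNSIGNED NOT NULL PRIMARY KEY,
        name         VARCHAR(100) NOT NULL,
        child_count  INT UNSIGNED NOT NULL,
        refreshed_at TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP
    );

    -- Refresh: wipe and repopulate from the slow query (a few times a day)
    DELETE FROM largeTable_summary;
    INSERT INTO largeTable_summary (id, name, child_count)
    SELECT  lt.id, lt.name, COUNT(c.id)
    FROM    largeTable AS lt
    LEFT JOIN childTable AS c ON c.largeTable_id = lt.id
    GROUP BY lt.id, lt.name;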

Is this a reasonable solution? Are there any common pitfalls or things I need to watch out for with this approach, or a better way to do it? After a few days of trying to just optimize the query, I need some kind of workaround. I'm still relatively new to this and am open to suggestions.

bjk116
  • The term you're looking for is "materialized view" or "summary table". MySQL does not support materialized views, but they can be approximated by user-maintained tables; examples in this answer. – mustaccio May 02 '19 at 16:50
  • This question is overly broad and would require more of a position paper than an answer. This is more of an "I'm doing a project on x and am trying to do y but it's not working..." - i.e. questions need to be much more specific! – Vérace May 02 '19 at 17:39
  • Unfortunately, if I already had the knowledge required to make the question specific enough, I wouldn't have had to ask it in the first place! That's the issue with not knowing what you don't know. – bjk116 May 02 '19 at 17:43
  • There are many ways to tackle slow queries, but we need to see the query, SHOW CREATE TABLE, and EXPLAIN SELECT .... Meanwhile, for DW tips, see http://mysql.rjweb.org/doc.php/datawarehouse – Rick James May 03 '19 at 04:23
  • @RickJames I already used EXPLAIN SELECT to do my best: everything is a unique or non-unique key lookup, which runs fast enough; it's when I add the GROUP BY, which HAS to be there for the correct result, that it slows down. I'm thinking about modifying the underlying tables so a GROUP BY isn't necessary; I suspect there's something off with how we are storing data that makes the GROUP BY required. – bjk116 May 03 '19 at 13:18
  • @bjk116 - I would be glad to help, but I need more details; there are too many possibilities. – Rick James May 03 '19 at 13:23

1 Answer


Calculating and storing a value is a cache; that's a very common strategy. A data warehouse is a very different beast (usually more than one table!).

The main thing to be concerned about is timeliness. If your customers don't expect the results to be close to real-time, you're fine: running a ten-second query once an hour is nothing. Even if it bloats to ten times that length, it should still be OK.

If you absolutely must have current data, rather than "up to one hour old," you'll need to know when your source tables have changed. If you already have a DateLastModified column in each of the tables, then you can check whether there are values more recent than the DateRecorded you'll capture when you save this query's results. A stored procedure can query these values and refresh the cache if necessary. The problem is that the run time will be slow whenever any source data has changed, which I suspect will be most of the time. Also, you'll probably need indices on your DateLastModified fields to quickly determine whether the source data has changed.
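
A sketch of such a procedure, reusing the hypothetical names from the question (your real tables and columns will differ):

    DELIMITER //

    -- Rebuild the cache only if a source table changed since the last refresh
    CREATE PROCEDURE refresh_summary_if_stale()
    BEGIN
        DECLARE last_refresh DATETIME;
        DECLARE last_change  DATETIME;

        SELECT MAX(refreshed_at) INTO last_refresh FROM largeTable_summary;

        SELECT GREATEST(
                 (SELECT COALESCE(MAX(DateLastModified), '1970-01-01') FROM largeTable),
                 (SELECT COALESCE(MAX(DateLastModified), '1970-01-01') FROM childTable)
               ) INTO last_change;

        IF last_refresh IS NULL OR last_change > last_refresh THEN
            DELETE FROM largeTable_summary;
            INSERT INTO largeTable_summary (id, name, child_count)
            SELECT  lt.id, lt.name, COUNT(c.id)
            FROM    largeTable AS lt
            LEFT JOIN childTable AS c ON c.largeTable_id = lt.id
            GROUP BY lt.id, lt.name;
        END IF;
    END //

    DELIMITER ;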

Alternatively, you could scatter triggers hither and yon, so that any time a key table is written to, you mark the cache as dirty and refresh it later. This would be high-maintenance and fragile, but if you don't already have DateLastModified fields it might be easier.
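
If you go that route, the triggers themselves can stay tiny; a sketch using a hypothetical one-row flag table:

    -- One-row flag that marks the cache as stale
    CREATE TABLE summary_dirty (is_dirty TINYINT(1) NOT NULL);
    INSERT INTO summary_dirty VALUES (0);

    -- One trigger per source table per write path (INSERT/UPDATE/DELETE)
    CREATE TRIGGER childTable_ai AFTER INSERT ON childTable
    FOR EACH ROW UPDATE summary_dirty SET is_dirty = 1;

The refresh job then rebuilds the cache only when the flag is set, and resets it afterwards.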

If your customers are OK with this report being out of date by up to a few minutes, just run it on a schedule.
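
In MySQL that schedule can live in the database itself, via the event scheduler (assuming the procedure sketched above):

    -- Requires the event scheduler: SET GLOBAL event_scheduler = ON;
    CREATE EVENT refresh_summary_hourly
    ON SCHEDULE EVERY 1 HOUR
    DO CALL refresh_summary_if_stale();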

Jon of All Trades
  • Ok, that is reassuring that it's common, and good to know the right term. This is for a small application used by my company. I am using MySQL 5.6, and I may already have a way to do this by running SELECT SQL_CACHE etc. So this would be telling the MySQL server to store the results if possible and, as long as none of the underlying tables changed, to retrieve them straight from the cache, correct? – bjk116 May 02 '19 at 16:11
  • My assumption is that you'd run the query and store the results on a schedule, perhaps once an hour. If it's important to you that any change in the source system be reflected in the result set, that's a lot harder. The normal solution would be materialized views, but they don't exist in MySQL. – Jon of All Trades May 02 '19 at 16:22
  • Yeah, that's one of my pet peeves with MySQL. The SQL_CACHE hint seems to have improved performance dramatically (usage shown below). I think behind the scenes it does much the same thing, but automatically: it stores the results unless an underlying table has changed, in which case the query runs fresh, which, OK, has to happen; otherwise the speed is great. – bjk116 May 02 '19 at 16:27
  • @bjk116 - What version? The query cache has been removed in 8.0, Galera, and Group Replication. Aurora, on the other hand, has enhanced it. – Rick James May 03 '19 at 04:26
  • "Luckily" I am using 5.6, which has query caching as an option. – bjk116 May 03 '19 at 13:16