How to optimise this MySQL query? Millions of Rows
I have the following query:
SELECT
analytics.source AS referrer,
COUNT(analytics.id) AS frequency,
SUM(IF(transactions.status = 'COMPLETED', 1, 0)) AS sales
FROM analytics
LEFT JOIN transactions ON analytics.id = transactions.analytics
WHERE analytics.user_id = 52094
GROUP BY analytics.source
ORDER BY frequency DESC
LIMIT 10
The analytics table has 60M rows and the transactions table has 3M rows.
When I run an EXPLAIN on this query, I get:
+----+-------------+--------------+------+------------------------------------+-------------------+---------+------------------------+--------+----------------------------------------------+
| id | select_type | table        | type | possible_keys                      | key               | key_len | ref                    | rows   | Extra                                        |
+----+-------------+--------------+------+------------------------------------+-------------------+---------+------------------------+--------+----------------------------------------------+
|  1 | SIMPLE      | analytics    | ref  | analytics_user_id,analytics_source | analytics_user_id | 5       | const                  | 337662 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | transactions | ref  | tran_analytics                     | tran_analytics    | 5       | dijishop2.analytics.id | 1      | NULL                                         |
+----+-------------+--------------+------+------------------------------------+-------------------+---------+------------------------+--------+----------------------------------------------+
I can't figure out how to optimise this query as it's already very basic. It takes around 70 seconds to run.
Here are the indexes that exist:
+-------------+-------------+----------------------------+---------------+------------------+------------+--------------+-----------+---------+--------+-------------+----------+----------------+
| # Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------------+-------------+----------------------------+---------------+------------------+------------+--------------+-----------+---------+--------+-------------+----------+----------------+
| 'analytics' | '0' | 'PRIMARY' | '1' | 'id' | 'A' | '56934235' | NULL | NULL | '' | 'BTREE' | '' | '' |
| 'analytics' | '1' | 'analytics_user_id' | '1' | 'user_id' | 'A' | '130583' | NULL | NULL | 'YES' | 'BTREE' | '' | '' |
| 'analytics' | '1' | 'analytics_product_id' | '1' | 'product_id' | 'A' | '490812' | NULL | NULL | 'YES' | 'BTREE' | '' | '' |
| 'analytics' | '1' | 'analytics_affil_user_id' | '1' | 'affil_user_id' | 'A' | '55222' | NULL | NULL | 'YES' | 'BTREE' | '' | '' |
| 'analytics' | '1' | 'analytics_source' | '1' | 'source' | 'A' | '24604' | NULL | NULL | 'YES' | 'BTREE' | '' | '' |
| 'analytics' | '1' | 'analytics_country_name' | '1' | 'country_name' | 'A' | '39510' | NULL | NULL | 'YES' | 'BTREE' | '' | '' |
| 'analytics' | '1' | 'analytics_gordon' | '1' | 'id' | 'A' | '56934235' | NULL | NULL | '' | 'BTREE' | '' | '' |
| 'analytics' | '1' | 'analytics_gordon' | '2' | 'user_id' | 'A' | '56934235' | NULL | NULL | 'YES' | 'BTREE' | '' | '' |
| 'analytics' | '1' | 'analytics_gordon' | '3' | 'source' | 'A' | '56934235' | NULL | NULL | 'YES' | 'BTREE' | '' | '' |
+-------------+-------------+----------------------------+---------------+------------------+------------+--------------+-----------+---------+--------+-------------+----------+----------------+
+----------------+-------------+-------------------+---------------+-------------------+------------+--------------+-----------+---------+--------+-------------+----------+----------------+
| # Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+----------------+-------------+-------------------+---------------+-------------------+------------+--------------+-----------+---------+--------+-------------+----------+----------------+
| 'transactions' | '0' | 'PRIMARY' | '1' | 'id' | 'A' | '2436151' | NULL | NULL | '' | 'BTREE' | '' | '' |
| 'transactions' | '1' | 'tran_user_id' | '1' | 'user_id' | 'A' | '56654' | NULL | NULL | '' | 'BTREE' | '' | '' |
| 'transactions' | '1' | 'transaction_id' | '1' | 'transaction_id' | 'A' | '2436151' | '191' | NULL | 'YES' | 'BTREE' | '' | '' |
| 'transactions' | '1' | 'tran_analytics' | '1' | 'analytics' | 'A' | '2436151' | NULL | NULL | 'YES' | 'BTREE' | '' | '' |
| 'transactions' | '1' | 'tran_status' | '1' | 'status' | 'A' | '22' | NULL | NULL | 'YES' | 'BTREE' | '' | '' |
| 'transactions' | '1' | 'gordon_trans' | '1' | 'status' | 'A' | '22' | NULL | NULL | 'YES' | 'BTREE' | '' | '' |
| 'transactions' | '1' | 'gordon_trans' | '2' | 'analytics' | 'A' | '2436151' | NULL | NULL | 'YES' | 'BTREE' | '' | '' |
+----------------+-------------+-------------------+---------------+-------------------+------------+--------------+-----------+---------+--------+-------------+----------+----------------+
Here is the simplified schema for the two tables, before adding the extra indexes that were suggested (they didn't improve the situation).
CREATE TABLE `analytics` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`user_id` int(11) DEFAULT NULL,
`affil_user_id` int(11) DEFAULT NULL,
`product_id` int(11) DEFAULT NULL,
`medium` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`source` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`terms` varchar(1024) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`is_browser` tinyint(1) DEFAULT NULL,
`is_mobile` tinyint(1) DEFAULT NULL,
`is_robot` tinyint(1) DEFAULT NULL,
`browser` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`mobile` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`robot` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`platform` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`referrer` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`domain` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`ip` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`continent_code` varchar(10) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`country_name` varchar(100) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`city` varchar(100) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `analytics_user_id` (`user_id`),
KEY `analytics_product_id` (`product_id`),
KEY `analytics_affil_user_id` (`affil_user_id`)
) ENGINE=InnoDB AUTO_INCREMENT=64821325 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
CREATE TABLE `transactions` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`transaction_id` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`user_id` int(11) NOT NULL,
`pay_key` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`sender_email` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`amount` decimal(10,2) DEFAULT NULL,
`currency` varchar(10) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`status` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`analytics` int(11) DEFAULT NULL,
`ip_address` varchar(46) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`session_id` varchar(60) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`eu_vat_applied` int(1) DEFAULT '0',
PRIMARY KEY (`id`),
KEY `tran_user_id` (`user_id`),
KEY `transaction_id` (`transaction_id`(191)),
KEY `tran_analytics` (`analytics`),
KEY `tran_status` (`status`)
) ENGINE=InnoDB AUTO_INCREMENT=10019356 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
If the above can't be optimised any further, any implementation advice on summary tables would be great. We are using a LAMP stack on AWS. The above query is running on RDS (m1.large).
Apologies Bill, they are massive tables (lots of columns). Will get that after I try Gordon's suggestion.
– Abs
Jun 21 at 17:59
The reason I suggest using SHOW CREATE TABLE is that if anyone wants to try out your tables on a sandbox instance, they have to painstakingly recreate the table by guessing at your columns and indexes. It's possible to piece together something resembling your table from your SHOW INDEXES, but it takes too much work and I can't be sure it's correct. I am not going to spend time doing that. Good luck! – Bill Karwin, Jun 21 at 20:04
What happens to query performance if you omit the GROUP BY clause? (I know it doesn't yield the result you want; the point is to figure out whether GROUP BY ... LIMIT ... is soaking up a lot of time.) – O. Jones, Jun 24 at 11:35
Can you explain what you want a little better? COUNT(a.id) in a query doing a LEFT JOIN b is a little strange. It counts matching rows from b, and counts 1 for every row in a without a matching row in b. Is that what you want? To me it sounds like the kind of thing that's hard to explain to users. Perfection in that COUNT operation is critical, because you use it later for the GROUP BY ... LIMIT ... operation. – O. Jones, Jun 24 at 11:42
13 Answers
I would create the following indexes (b-tree indexes):
analytics(user_id, source, id)
transactions(analytics, status)
This is different from Gordon's suggestion.
The order of columns in the index is important.
You filter by a specific analytics.user_id, so this field has to be the first in the index. Then you group by analytics.source. To avoid sorting by source, this should be the next field of the index. You also reference analytics.id, so it is better to have this field as part of the index; put it last. Is MySQL capable of reading just the index and not touching the table? I don't know, but it is rather easy to test.
The index on transactions has to start with analytics, because it would be used in the JOIN. We also need status.
SELECT
analytics.source AS referrer,
COUNT(analytics.id) AS frequency,
SUM(IF(transactions.status = 'COMPLETED', 1, 0)) AS sales
FROM analytics
LEFT JOIN transactions ON analytics.id = transactions.analytics
WHERE analytics.user_id = 52094
GROUP BY analytics.source
ORDER BY frequency DESC
LIMIT 10
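A quick way to sanity-check these index shapes is on toy data. The sketch below uses SQLite as a stand-in for MySQL (its planner differs, so this only validates the query logic and the index definitions, not the 70-second timing; table contents are invented):

```python
import sqlite3

# In-memory stand-in for the two tables, with the proposed composite indexes.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE analytics (id INTEGER PRIMARY KEY, user_id INT, source TEXT);
CREATE TABLE transactions (id INTEGER PRIMARY KEY, analytics INT, status TEXT);
CREATE INDEX ix_analytics ON analytics (user_id, source, id);
CREATE INDEX ix_transactions ON transactions (analytics, status);
""")
con.executemany("INSERT INTO analytics VALUES (?, ?, ?)",
                [(1, 52094, 'google'), (2, 52094, 'google'),
                 (3, 52094, 'direct'), (4, 99, 'google')])
con.executemany("INSERT INTO transactions VALUES (?, ?, ?)",
                [(10, 1, 'COMPLETED'), (11, 2, 'PENDING'), (12, 3, 'COMPLETED')])

# Same shape as the original query (IF() rewritten as CASE for SQLite).
result = con.execute("""
    SELECT a.source AS referrer,
           COUNT(a.id) AS frequency,
           SUM(CASE WHEN t.status = 'COMPLETED' THEN 1 ELSE 0 END) AS sales
    FROM analytics a
    LEFT JOIN transactions t ON a.id = t.analytics
    WHERE a.user_id = 52094
    GROUP BY a.source
    ORDER BY frequency DESC
    LIMIT 10
""").fetchall()
print(result)  # [('google', 2, 1), ('direct', 1, 1)]
```

EXPLAIN QUERY PLAN in SQLite (or EXPLAIN in MySQL) can then confirm whether the new indexes are actually picked up.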
I would be interested to see what the performance gains are from this suggestion - this seems like a great suggestion.
– DavidCa1226
Jun 29 at 21:43
@DavidCa1226, yes, I'm curious too. Often appropriate indexes are the most powerful way to tune a query. Only after you confirm that the covering index doesn't help as expected do you start looking into the execution plan details and trying to adjust the query, so that the optimiser uses the index as expected.
– Vladimir Baranov
2 days ago
For this query:
SELECT a.source AS referrer,
COUNT(*) AS frequency,
SUM( t.status = 'COMPLETED' ) AS sales
FROM analytics a LEFT JOIN
transactions t
ON a.id = t.analytics
WHERE a.user_id = 52094
GROUP BY a.source
ORDER BY frequency DESC
LIMIT 10 ;
You want an index on analytics(user_id, id, source)
and transactions(analytics, status)
.
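In MySQL these would be composite indexes, added for example like this (the index names here are illustrative):

```sql
ALTER TABLE analytics    ADD INDEX ix_analytics_gordon (user_id, id, source);
ALTER TABLE transactions ADD INDEX ix_tran_gordon     (analytics, status);
```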
I should've mentioned I have indexes but not composite indexes, the above are composite indexes right? Running it now, might take some time to apply these to the large test tables.
– Abs
Jun 21 at 18:05
I added the indexes and this didn't make much of a difference unfortunately, it still takes 70 seconds to execute.
– Abs
Jun 21 at 18:16
@Abs . . . How many rows match the user id?
– Gordon Linoff
Jun 22 at 1:52
176200 rows match the user_id on the analytics table. – Abs, Jun 22 at 15:05
@Abs Make sure you added exactly the indexes that Gordon proposed. Assuming the indexes *_gordon that you listed in your question are the ones you tried, you seem to have added the columns in the wrong order - analytics(id, user_id, source) instead of analytics(user_id, id, source), and transactions(status, analytics) instead of transactions(analytics, status). – Solarflare, Jun 25 at 0:33
First some analysis...
SELECT a.source AS referrer,
COUNT(*) AS frequency, -- See question below
SUM(t.status = 'COMPLETED') AS sales
FROM analytics AS a
LEFT JOIN transactions AS t ON a.id = t.analytics
WHERE a.user_id = 52094
GROUP BY a.source
ORDER BY frequency DESC
LIMIT 10
If the mapping from a to t is one:many, then you need to consider whether the COUNT and SUM have the correct values, or inflated values. As the query stands, they are "inflated". The JOIN occurs before the aggregation, so you are counting the number of transactions and how many were completed. I'll assume that is desired.
Note: The usual pattern is COUNT(*); saying COUNT(x) implies checking x for being NULL. I suspect that check is not needed?
This index handles the WHERE and is "covering":
analytics: INDEX(user_id, source, id) -- user_id first
transactions: INDEX(analytics, status) -- in this order
The GROUP BY may or may not require a 'sort'. The ORDER BY, being different than the GROUP BY, definitely will need a sort. And the entire grouped set of rows will need to be sorted; there is no shortcut for the LIMIT.
Normally, summary tables are date-oriented. Perhaps keying by date and user_id would make sense? How many transactions per day does the average user have? If at least 10, then let's consider a summary table. Also, it is important not to be UPDATEing or DELETEing old records.
I would probably have
user_id ...,
source ...,
dy DATE ...,
status ...,
freq MEDIUMINT UNSIGNED NOT NULL,
status_ct MEDIUMINT UNSIGNED NOT NULL,
PRIMARY KEY(user_id, status, source, dy)
Then the query becomes
SELECT source AS referrer,
SUM(freq) AS frequency,
SUM(status_ct) AS completed_sales
FROM Summary
WHERE user_id = 52094
AND status = 'COMPLETED'
GROUP BY source
ORDER BY frequency DESC
LIMIT 10
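As a toy sketch of the summary-table idea (SQLite stands in for MySQL here; note this variant keeps status in the summary key but derives frequency from all statuses via SUM(freq), rather than filtering on status = 'COMPLETED', since that filter would also shrink the frequency count):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE analytics (id INTEGER PRIMARY KEY, user_id INT, source TEXT, dy TEXT);
CREATE TABLE transactions (id INTEGER PRIMARY KEY, analytics INT, status TEXT);
CREATE TABLE summary (
    user_id INT, source TEXT, dy TEXT, status TEXT,
    freq INT NOT NULL, status_ct INT NOT NULL,
    PRIMARY KEY (user_id, status, source, dy)
);
""")
con.executemany("INSERT INTO analytics VALUES (?, ?, ?, ?)",
                [(1, 52094, 'google', '2018-06-01'),
                 (2, 52094, 'google', '2018-06-01'),
                 (3, 52094, 'direct', '2018-06-02')])
con.executemany("INSERT INTO transactions VALUES (?, ?, ?)",
                [(10, 1, 'COMPLETED'), (11, 2, 'PENDING'), (12, 3, 'COMPLETED')])

# Periodic roll-up: one summary row per (user, source, day, status).
con.execute("""
    INSERT INTO summary
    SELECT a.user_id, a.source, a.dy, COALESCE(t.status, 'NONE'),
           COUNT(a.id),
           SUM(CASE WHEN t.status = 'COMPLETED' THEN 1 ELSE 0 END)
    FROM analytics a LEFT JOIN transactions t ON a.id = t.analytics
    GROUP BY a.user_id, a.source, a.dy, COALESCE(t.status, 'NONE')
""")

# The report now touches only the (much smaller) summary table: no JOIN.
report = con.execute("""
    SELECT source AS referrer, SUM(freq) AS frequency, SUM(status_ct) AS sales
    FROM summary
    WHERE user_id = 52094
    GROUP BY source
    ORDER BY frequency DESC
    LIMIT 10
""").fetchall()
print(report)  # [('google', 2, 1), ('direct', 1, 1)]
```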
The speed comes from many factors: the summary table is much smaller than the raw tables, and the report needs no JOIN. (It still needs the extra sort.)
Even without the summary table, there may be some speedups:
- Normalizing some of the long strings
- Revisiting KEY (transaction_id(191))
- Reconsidering utf8mb4_unicode_ci for columns that don't need it
Thank you for the useful analysis Rick! I will go through each and see which I can implement without worrying too much. – Abs, Jun 26 at 10:15
@Abs - and possibly experiment. Pretty easy to copy a table for playing with: CREATE TABLE copy LIKE live; INSERT INTO copy SELECT * FROM live; – Rick James, Jun 26 at 16:34
Try below and let me know if this helps.
SELECT
analytics.source AS referrer,
COUNT(analytics.id) AS frequency,
SUM(IF(transactions.status = 'COMPLETED', 1, 0)) AS sales
FROM (SELECT * FROM analytics where user_id = 52094) analytics
LEFT JOIN (SELECT analytics, status from transactions where analytics = 52094) transactions ON analytics.id = transactions.analytics
GROUP BY analytics.source
ORDER BY frequency DESC
LIMIT 10
Great! 7% performance increase! I am sorry I couldn't do more. I will update my original post, to include the "solution", otherwise a Summary table is not a bad idea, I had to do that several times.
– Vincent Rye
Jun 21 at 19:31
Done, changed the original post, feel free to upvote it.
– Vincent Rye
Jun 21 at 19:45
@VincentRye - Changing a JOIN into a subquery without any semantic change will not speed things up in MySQL. – Rick James, Jun 24 at 19:47
@Rick James - Apparently it did. The subquery has a where clause in it. It could also have been written without a subquery, adding an additional condition on the join itself. Either should work fine.
– Vincent Rye
Jun 25 at 16:51
@VincentRye, your query produces a completely different result than the original query. Why do you filter the transactions table with analytics = 52094? The original filter is on analytics.user_id. transactions.analytics is not related to analytics.user_id in any way, but you filter it with the same value. transactions.analytics is related to analytics.id, not analytics.user_id. – Vladimir Baranov, 23 hours ago
Could you try the approach below:
SELECT
analytics.source AS referrer,
COUNT(analytics.id) AS frequency,
SUM(sales) AS sales
FROM analytics
LEFT JOIN(
SELECT transactions.Analytics, (CASE WHEN transactions.status = 'COMPLETED' THEN 1 ELSE 0 END) AS sales
FROM analytics INNER JOIN transactions ON analytics.id = transactions.analytics
) Tra
ON analytics.id = Tra.analytics
WHERE analytics.user_id = 52094
GROUP BY analytics.source
ORDER BY frequency DESC
LIMIT 10
This query potentially joins millions of analytics records with transactions records and calculates the sum (including the status check) on millions of records.
If we could first apply the LIMIT 10 and then do the join and calculate the sum, we could speed up the query.
Unfortunately, we need the analytics.id for the join, which gets lost after applying the GROUP BY. But maybe analytics.source is selective enough to boost the query, if we could limit the analytics by source.
My idea is therefore to calculate the frequencies, limit by them, return the analytics.source and frequency in a subquery, and use this result to filter the analytics in the main query, which then does the rest of the joins and calculations on a hopefully much reduced number of records.
Minimal subquery (note: no join, no sum, returns 10 records):
SELECT
source,
COUNT(id) AS frequency
FROM analytics
WHERE user_id = 52094
GROUP BY source
ORDER BY frequency DESC
LIMIT 10
The full query, using the above query as subquery x:
SELECT
x.source AS referrer,
x.frequency,
SUM(IF(t.status = 'COMPLETED', 1, 0)) AS sales
FROM
(<subquery here>) x
INNER JOIN analytics a
ON x.source = a.source -- This reduces the number of records
LEFT JOIN transactions t
ON a.id = t.analytics
WHERE a.user_id = 52094 -- We could have several users per source
GROUP BY x.source, x.frequency
ORDER BY x.frequency DESC
If this does not yield the expected performance boost, this could be due to MySQL applying the joins in an unexpected order. As explained in "Is there a way to force MySQL execution order?", you could replace the join with STRAIGHT_JOIN in this case.
@Abs: Did my suggested changes speed up your query?
– Olivier Jacot-Descombes
7 hours ago
The only problem I find in your query is
GROUP BY analytics.source
ORDER BY frequency DESC
Because of this, the query is doing a filesort using a temporary table.
One way to avoid this is by creating another table, like:
CREATE TABLE `analytics_aggr` (
`source` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`frequency` int(10) DEFAULT NULL,
`sales` int(10) DEFAULT NULL,
KEY `sales` (`sales`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
Insert data into analytics_aggr using the query below:
insert into analytics_aggr SELECT
analytics.source AS referrer,
COUNT(analytics.id) AS frequency,
SUM(IF(transactions.status = 'COMPLETED', 1, 0)) AS sales
FROM analytics
LEFT JOIN transactions ON analytics.id = transactions.analytics
WHERE analytics.user_id = 52094
GROUP BY analytics.source
ORDER BY null
Now you can easily get your data using:
select * from analytics_aggr order by sales desc
Try this
SELECT
a.source AS referrer,
COUNT(a.id) AS frequency,
SUM(t.sales) AS sales
FROM (Select id, source From analytics Where user_id = 52094) a
LEFT JOIN (Select analytics, case when status = 'COMPLETED' Then 1 else 0 end as sales
From transactions) t ON a.id = t.analytics
GROUP BY a.source
ORDER BY frequency DESC
LIMIT 10
I'm proposing this because you said "they are massive tables" but this SQL uses very few columns. In that case, using an inline view with only the required columns is a good fit.
Note: memory will also play an important role here, so check the available memory before deciding on the inline view.
I would try to separate querying from the two tables. Since you need only the top 10 sources, I would get them first and then query the sales column from transactions:
SELECT source as referrer
,frequency
,(select count(*)
from transactions t
where t.analytics in (select distinct id
from analytics
where user_id = 52094
and source = by_frequency.source)
and status = 'completed'
) as sales
from (SELECT analytics.source
,count(*) as frequency
from analytics
where analytics.user_id = 52094
group by analytics.source
order by frequency desc
limit 10
) by_frequency
It may also be faster without the distinct.
I would try subquery:
SELECT a.source AS referrer,
COUNT(*) AS frequency,
SUM((SELECT COUNT(*) FROM transactions t
WHERE a.id = t.analytics AND t.status = 'COMPLETED')) AS sales
FROM analytics a
WHERE a.user_id = 52094
GROUP BY a.source
ORDER BY frequency DESC
LIMIT 10;
Plus indexes exactly as in @Gordon's answer: analytics(user_id, id, source) and transactions(analytics, status).
I'm not sure this query will run. You are using analytics.id in select, but you don't group by it.
– Alexandr Kapshuk
Jun 29 at 7:59
@AlexandrKapshuk Of course it will run: db-fiddle.com/f/qzk3NqwaYDHENvQpp4bNat/0. But if we want to be strict I should wrap the subquery with MIN. – Lukasz Szozda, Jun 29 at 9:50
In case there are several different id with the same source in the analytics table, the query won't sum over all id. I think it can be corrected by using sum instead of min. Didn't know you can do that in MySQL! – Alexandr Kapshuk, Jun 29 at 11:20
The select subquery will execute once for every analytics result row. This may seem fast for values of user_id with few results, but will be slow for values where there are many result rows. – saarp, 2 hours ago
I am assuming the predicate user_id = 52094 is for illustration purposes, and that in the application the selected user_id is a variable.
I also assume that the ACID property is not very important here.
(1) Therefore, I will maintain two replica tables with only the necessary fields (it is similar to the indices Vladimir had suggested above) using a utility table.
CREATE TABLE mv_anal (
`id` int(11) NOT NULL,
`user_id` int(11) DEFAULT NULL,
`source` varchar(45),
PRIMARY KEY (`id`)
);
CREATE TABLE mv_trans (
`id` int(11) NOT NULL,
`status` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`analytics` int(11) DEFAULT NULL,
PRIMARY KEY (`id`)
);
CREATE TABLE util (
last_updated_anal int (11) NOT NULL,
last_updated_trans int (11) NOT NULL
);
INSERT INTO util VALUES (0, 0);
The gain here is that we will be reading a relatively smaller projections of the original tables -- hopefully, OS level and DB level caches work and they aren't read from slower secondary storage but from faster RAM. This can be a very great gain.
Here is how I updated the two tables (the below is a transaction run by a cron) :
-- TRANSACTION STARTS --
INSERT INTO mv_trans
SELECT id, IF(status = 'COMPLETED', 1, 0) AS status, analytics
FROM transactions JOIN util
  ON util.last_updated_trans <= transactions.id;

UPDATE util, (SELECT MAX(id) AS m FROM mv_trans) sub
SET util.last_updated_trans = sub.m;
-- TRANSACTION COMMITS --
-- similar transaction for mv_anal.
(2) Now, I will tackle the selectivity to reduce sequential scan time. I will have to build a b-tree index on user_id, source and id (in this sequence) on mv_anal.
Note: the above can be achieved by just creating the index on the analytics table, but building such an index requires reading the big table with 60M rows. My method requires the index build to read only a very thin table. Thus, we can rebuild the b-tree more frequently (to counter the skew problem, as the table is append-only).
This is how I make sure high selectivity is achieved when querying, and how I counter the b-tree skew problem.
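The index from step (2) might look like this (the name is illustrative):

```sql
CREATE INDEX ix_mv_anal ON mv_anal (user_id, source, id);
```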
(3) In PostgreSQL, WITH subqueries are always materialized; I hope the same holds for MySQL. Therefore, as the last mile of optimization:
WITH sub_anal AS (
SELECT user_id, source AS referrer, COUNT (id) AS frequency
FROM mv_anal
WHERE user_id = 52094
GROUP BY user_id, source
ORDER BY COUNT (id) DESC
LIMIT 10
)
SELECT sa.referrer, sa.frequency, SUM (status) AS sales
FROM sub_anal AS sa
JOIN mv_anal anal
ON sa.referrer = anal.source AND sa.user_id = anal.user_id
JOIN mv_trans AS trans
ON anal.id = trans.analytics
Late to the party. I think you'll need to load one index into MySQL's cache. The NLJ is probably killing performance. Here's how I see it:
The Path
Your query is simple. It has two tables and the "path" is very clear: start at analytics, then reach transactions through the LEFT OUTER JOIN.
The Access
Once the path is clear, you need to decide if you want to use an Index Access or a Table Access. Both have pros and cons. However, you want to improve SELECT performance.
The Filtering
Again, you want high performance for the SELECT, so the filter on user_id belongs at the front of the index.
Row Aggregation
After filtering, the next step is to aggregate rows by GROUP BY analytics.source. This can be improved by placing the source column in the index, directly after user_id.
Optimal Indexes for Path, Access, Filtering, and Aggregation
Considering all the above, you should include all mentioned columns into indexes. The following indexes should improve the response time:
create index ix1_analytics on analytics (user_id, source, id);
create index ix2_transactions on transactions (analytics, status);
These indexes fulfill the "path", the "access", and the "filtering" strategies described above.
The Index Cache
Finally -- and this is critical -- load the secondary index into MySQL's memory cache. MySQL is performing a NLJ (Nested Loop Join) -- a 'ref' in MySQL lingo -- and needs to access the second table randomly nearly 200k times.
Unfortunately, I don't know for sure how to load the index into MySQL's cache. The use of FORCE may work, as in:
SELECT
analytics.source AS referrer,
COUNT(analytics.id) AS frequency,
SUM(IF(transactions.status = 'COMPLETED', 1, 0)) AS sales
FROM analytics
LEFT JOIN transactions FORCE index (ix2_transactions)
ON analytics.id = transactions.analytics
WHERE analytics.user_id = 52094
GROUP BY analytics.source
ORDER BY frequency DESC
LIMIT 10
Make sure you have enough cache space. Here's a short question/answer to figure out: How to figure out if mysql index fits entirely in memory
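As a starting point (assuming MySQL 5.6+ with persistent InnoDB statistics), index sizes can be estimated from mysql.innodb_index_stats, where stat_value for stat_name = 'size' is a page count; the database name below is a placeholder:

```sql
SELECT table_name, index_name,
       ROUND(stat_value * @@innodb_page_size / 1024 / 1024, 1) AS size_mb
FROM mysql.innodb_index_stats
WHERE stat_name = 'size'
  AND database_name = 'your_db'
  AND table_name IN ('analytics', 'transactions');
```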
Good luck! Oh, and post the results.
"LOAD INDEX INTO CACHE is used only for MyISAM tables" - dev.mysql.com/doc/refman/8.0/en/load-index.html – Paul Spiegel, 2 days ago
You're right, this worked for MyISAM. However, it's still critical to load the secondary index ix2_transactions into the cache. FORCE may help here, but I'm not sure. If it can't be loaded, I don't see how the speed can be improved. – The Impaler, 2 days ago
This question has definitely received a lot of attention, so I'm sure all obvious solutions have been tried. I did not see something that addresses the LEFT JOIN in the query, though.
I have noticed that LEFT JOIN statements usually force query planners into hash joins, which are fast for a small number of results but terribly slow for a large number of results. As noted in @Rick James' answer, since the join in the original query is on the identity field analytics.id, this will generate a large number of results. A hash join will yield terrible performance results. The suggestion below addresses this without any schema or processing changes.
Since the aggregation is by analytics.source, I would try a query that creates separate aggregations for frequency by source and sales by source, and defers the left join until after aggregation is complete. This should allow the indexes to be used best (typically this is a merge join for large data sets).
Here is my suggestion:
SELECT t1.source AS referrer, t1.frequency, t2.sales
FROM (
-- Frequency by source
SELECT a.source, COUNT(a.id) AS frequency
FROM analytics a
WHERE a.user_id=52094
GROUP BY a.source
) t1
LEFT JOIN (
-- Sales by source
SELECT a.source,
SUM(IF(t.status = 'COMPLETED', 1, 0)) AS sales
FROM analytics a
    JOIN transactions t
      ON a.id = t.analytics
    WHERE t.status = 'COMPLETED'
      AND a.user_id=52094
GROUP by a.source
) t2
ON t1.source = t2.source
ORDER BY frequency DESC
LIMIT 10
Hope this helps.
You have a high reputation score, so you're not a newbie. You should know by now that you should include the SHOW CREATE TABLE for each of the tables in your query, so we can see what data types, indexes, and constraints you have in your tables. Help us help you! – Bill Karwin, Jun 21 at 17:57