On the evilness that is SELECT COUNT(DISTINCT id)

Pascal Hakim
Pascal is the Online Director at IDG Australia. With his team, he makes sure that the servers, code and ads that you can see on this page are all doing what they are supposed to. So please let him know if things are broken.

I was debugging some SQL today when I noticed that I was seeing a query go through the PostgreSQL log a few more times than it should have been. We log all queries that take over 200ms so on the odd occasions some queries will show up. It's rare to see one show up more than a couple times, so this one just didn't feel right. As ignoring those hunches usually turns out for the worse, I had a look at the latest generated pgfouine output and I saw that it was showing up in there as well. Something was definitely wrong.

A short browse through the code and revision history revealed that it was a query that had been in the code for ages but that until recently had barely been used. We were now using it on section pages on PC World to find out the number of reviews that each section has.

I had a look at the ANALYSE statement and it looked like this:

 Aggregate (cost=2687.10..2687.11 rows=1 width=4) (actual time=1982.010..1982.011 rows=1 loops=1)    -> Nested Loop (cost=280.07..2665.14 rows=8784 width=4) (actual time=2.298..1366.103 rows=1628276 loops=1)          -> Nested Loop (cost=115.71..238.13 rows=1 width=4) (actual time=0.619..17.228 rows=173 loops=1)          .          .          .          -> Bitmap Heap Scan on arttax at (cost=164.37..2339.17 rows=8784 width=0) (actual time=1.599..5.607 rows=9412 loops=173)                Recheck Cond: (at.taxid = 36)                -> Bitmap Index Scan on arttax_taxid_id_idx (cost=0.00..162.17 rows=8784 width=0) (actual time=1.318..1.318 rows=9414 loops=173)                      Index Cond: (at.taxid = 36)

The query itself was returning a COUNT so the rows = 1 was expected. The result returned for the sample query was 173 so the third line was expected as well. What wasn't was the second line. rows = 1628276?? That was way too high. It looked like it was merging something in, then discarding it straight away.

Looking at the other half, we can see that the arttax table which is normally used to match taxonomies to articles isn't being used properly. While we were filtering it for reviews (the taxid = 36 part), we weren't filtering for articles. Conveniently enough, 9412 * 173 = 1628276, which explained where we were getting that result from earlier.

This is when I noticed something that should have grabbed my attention a lot earlier on. The query started with:

         SELECT count(DISTINCT (a.id)) AS total FROM...

Putting those two together a solution was quite obvious. The query was joining in a table that it wasn't matching against properly. The original developer who had written this query decided to make it work by adding a DISTINCT at the front, when they noticed they were getting duplicated results, and had therefore not noticed that one of the tables being joined out was missing some join criteria.

After fixing that up, EXPLAIN ANALYSE looked like this:

 Aggregate (cost=240.47..240.48 rows=1 width=4) (actual time=11.436..11.436 rows=1 loops=1)    -> Nested Loop (cost=115.71..240.46 rows=1 width=4) (actual time=0.635..11.381 rows=173 loops=1)          -> Nested Loop (cost=115.71..238.13 rows=1 width=12) (actual time=0.622..10.224 rows=173 loops=1)          .          .          .          -> Index Scan using arttax_taxid_id_idx on arttax at (cost=0.00..2.32 rows=1 width=4) (actual time=0.005..0.006 rows=1 loops=173)                Index Cond: ((at.taxid = 36) AND (at.id = am.aid))

Much better! The arttax index is now being used properly, and the row number stays at 173 which is what we expect.

End result... runtime has gone from 2113.361ms to 11.7ms. This query won't be showing up in our logs any more!

Comments

Post new comment

Users posting comments agree to the TechWorld comments policy.

Login or register to link comments to your user profile, or you may also post a comment without being logged in.

Twitter Feed