The view from the top of IT with TechWorld Editor Rohan Pearce
I was debugging some SQL today when I noticed that I was seeing a query go through the PostgreSQL log a few more times than it should have been. We log all queries that take over 200ms so on the odd occasions some queries will show up. It's rare to see one show up more than a couple times, so this one just didn't feel right. As ignoring those hunches usually turns out for the worse, I had a look at the latest generated pgfouine output and I saw that it was showing up in there as well. Something was definitely wrong.
A short browse through the code and revision history revealed that it was a query that had been in the code for ages but that until recently had barely been used. We were now using it on section pages on PC World to find out the number of reviews that each section has.
I had a look at the ANALYSE statement and it looked like this:
Aggregate (cost=2687.10..2687.11 rows=1 width=4) (actual time=1982.010..1982.011 rows=1 loops=1)
-> Nested Loop (cost=280.07..2665.14 rows=8784 width=4) (actual time=2.298..1366.103 rows=1628276 loops=1)
-> Nested Loop (cost=115.71..238.13 rows=1 width=4) (actual time=0.619..17.228 rows=173 loops=1)
The query itself was returning a COUNT so the rows = 1 was expected. The result returned for the sample query was 173 so the third line was expected as well. What wasn't was the second line. rows = 1628276?? That was way too high. It looked like it was merging something in, then discarding it straight away.
Looking at the other half, we can see that the arttax table which is normally used to match taxonomies to articles isn't being used properly. While we were filtering it for reviews (the taxid = 36 part), we weren't filtering for articles. Conveniently enough, 9412 * 173 = 1628276, which explained where we were getting that result from earlier.
This is when I noticed something that should have grabbed my attention a lot earlier on. The query started with:
SELECT count(DISTINCT (a.id)) AS total FROM...
Putting those two together a solution was quite obvious. The query was joining in a table that it wasn't matching against properly. The original developer who had written this query decided to make it work by adding a DISTINCT at the front, when they noticed they were getting duplicated results, and had therefore not noticed that one of the tables being joined out was missing some join criteria.
After fixing that up, EXPLAIN ANALYSE looked like this:
Aggregate (cost=240.47..240.48 rows=1 width=4) (actual time=11.436..11.436 rows=1 loops=1)
-> Nested Loop (cost=115.71..240.46 rows=1 width=4) (actual time=0.635..11.381 rows=173 loops=1)
-> Nested Loop (cost=115.71..238.13 rows=1 width=12) (actual time=0.622..10.224 rows=173 loops=1)
Much better! The arttax index is now being used properly, and the row number stays at 173 which is what we expect.
End result... runtime has gone from 2113.361ms to 11.7ms. This query won't be showing up in our logs any more!