You don't need SELECT COUNT(*)

2023-07-24

Sometimes counting rows in a large database may be slow and it can impact your application performance. In this post, I will show why SELECT COUNT(*) query can be expensive and an alternative to it when you don't really need the exact count number but an estimate.

Let's say we need to count articles in a given range of ids:

SELECT COUNT(*) FROM articles WHERE id BETWEEN 100 AND 500;

This query will use Index Only Scan and not Index Scan. It will be really fast because all the needed information is stored in the index itself and there is no need to reach the heap space. Check my article about Index Only Scan to learn more about this.

If we update those rows:

UPDATE articles SET brand = 'acme' WHERE id BETWEEN 100 AND 500;

and execute again the same query:

SELECT COUNT(*) FROM articles WHERE id BETWEEN 100 AND 500;

This time, query will still do a Index Only Scan but, actually, it will reach the heap space many times (you can see this using explain analyze). But, why?

Updates are not performed immediately because of transactional and concurrency control reasons. Instead, the visibility map tells the index scanner that something has changed in those rows; it might be updated or deleted. That's the reason the query needs to check, one by one, whether every row has been deleted or not to perform the actual count. That makes SELECT COUNT(*) query slow.

The more we increase the range of articles ids in the query, the more rows will be involved in the counting. At some point, planner will use Parallel Scan to make query faster. However, eventually, if we keep increasing the range of ids, the query will become slower and slower.

So, the question is: do we really need the exact count number? Maybe we don't.

There are features where we don't really need the exact number, but an estimate is more than enough. For instance, if we have to implement a social post Likes counter, maybe users don't need the exact number but a close one. We can sacrifice accuracy in order to improve performance by using explain command.

If we run this SQL statement:

explain (format json) SELECT COUNT(*) FROM articles WHERE id BETWEEN 100 AND 500;

Postgres will show you the query plan, in JSON format, without executing the actual query. Basically, Postgres is giving us an estimation of which steps are going to be executed to perform the query.

Among this JSON-formatted metadata, there is a field called Plan Rows that represents the estimate number of rows that will be returned. Postgres estimates this number internally based on different conditions and current index state.

In the case we don't really need the exact number, we can show Plan Rows's value to the user. It will be close enough to the real one and, definitely, the performance will be far better than executing the actual query as we did above.

You don't need SELECT COUNT(*)

More posts...