Chapter 7: Query Optimization

Getting the most performance out of a database often involves optimizing queries; however, before running EXPLAIN on the slow query, take a step back and ask yourself the questions in the sections that follow. Then look for work the query doesn’t need to do and remove it.

The advice in this chapter assumes the query is run from code and will be run many times. If it’s a one-off query, the work to optimize it may not be worth it.

Does the query need to be run at all?

This should be a no-brainer, but it happens more often than you would think. In my experience, the fix for some of the worst queries was simply to delete them. This is especially true in applications that use an ORM that hides SQL from the engineer; the ORM can run queries the engineer never intended.

In applications that have been around a while, with many different engineers working on them over the years, it’s common for code to get orphaned when features change. Not every obsolete line of code gets removed in a refactor, which leaves queries running when they no longer need to.

Does the query need to be run this often?

Can the data be preprocessed and cached? Preprocessing means saving the result set somewhere else. This could be a materialized view, another table such as an aggregate table, or a different system entirely, like Redis or the application’s memory. It doesn’t have to be the final query; it can be just a piece of it.

Think of preprocessing data as getting your data at build time instead of at run time. You should do the least amount of work possible at run time to fetch data as fast as possible. Enabling queries to do the least amount of work possible is a theme that will come up over and over again when talking about database performance.
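
As a minimal sketch of preprocessing, here is a hypothetical daily rollup stored as a Postgres materialized view (the orders table and its columns are made up for illustration):

-- Precompute a daily aggregate once, instead of on every request.
CREATE MATERIALIZED VIEW daily_order_totals AS
  SELECT inserted_at::date AS order_date,
         COUNT(*)          AS order_count,
         SUM(total)        AS revenue
  FROM orders
  GROUP BY inserted_at::date;

-- Refresh on whatever schedule fits; run-time queries then read the
-- small precomputed view instead of scanning orders.
REFRESH MATERIALIZED VIEW daily_order_totals;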

Remove unnecessary calculations at run time

Are there functions on the left side of an operator? For example, in the fragment WHERE DATE_TRUNC('day', inserted_at) >= '2020-07-01';, anything on the left side of the operator, in this case >=, has to be calculated for every row in the table. On tables with millions of rows, this adds up quickly.

For this particular case, there are two ways to solve it.

Calculated Column

You can store the result of a function in a separate column or table, which is known as a calculated column. Some databases have specific features to handle this. In PG12 and later, creating a calculated column is easy: add GENERATED ALWAYS AS (expression) STORED to a column in a CREATE TABLE.

CREATE TABLE users (
  id BIGSERIAL PRIMARY KEY,
  name VARCHAR(30),
  phone VARCHAR(20),
  email VARCHAR(200),
  -- The generation expression must be parenthesized and immutable:
  -- CONCAT is only stable, so use || with COALESCE instead.
  search_fields VARCHAR(255) GENERATED ALWAYS AS (
    COALESCE(name, '') || ' ' || COALESCE(phone, '') || ' ' || COALESCE(email, '')
  ) STORED
);
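
A quick check against the table above, with made-up values:

INSERT INTO users (name, phone, email)
VALUES ('Ada Lovelace', '555-0100', 'ada@example.com');

-- search_fields is populated automatically:
SELECT search_fields FROM users;
-- Ada Lovelace 555-0100 ada@example.com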

On older versions, this can be achieved with a trigger and a custom database function.

CREATE OR REPLACE FUNCTION generate_user_search_fields()
  RETURNS TRIGGER AS $$
  BEGIN
    IF
      (TG_OP = 'INSERT') OR
      (
        (TG_OP = 'UPDATE') AND
        (
          (OLD.search_fields IS NULL) OR
          -- IS DISTINCT FROM handles NULLs, which != does not.
          (OLD.name IS DISTINCT FROM NEW.name OR
           OLD.phone IS DISTINCT FROM NEW.phone OR
           OLD.email IS DISTINCT FROM NEW.email)
        )
      ) THEN
      NEW.search_fields = CONCAT(NEW.name, ' ', NEW.phone, ' ', NEW.email);
    END IF;
    RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

CREATE TRIGGER user_generate_search_fields
      BEFORE INSERT OR UPDATE ON users
      FOR EACH ROW EXECUTE PROCEDURE generate_user_search_fields();
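
Unlike a generated column, the trigger only fires on writes, so existing rows need a one-time backfill. A no-op update like this sketch will do it:

-- Fires the trigger for every existing row, populating search_fields.
UPDATE users SET name = name;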

Calculated columns are useful when the calculations show up regularly in different queries, especially if the results are also in the SELECT clause.

Using the example above, I’d create another column called inserted_on, typed as a date, and store the result of the DATE_TRUNC there.
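
Here’s a sketch of that column on PG12+, assuming inserted_at is a TIMESTAMP without time zone (the cast to date is only immutable, and therefore only allowed in a generated column, for that type):

ALTER TABLE my_table
  ADD COLUMN inserted_on DATE GENERATED ALWAYS AS (inserted_at::date) STORED;

-- Queries can then use a plain, index-friendly comparison:
-- WHERE inserted_on >= '2020-07-01'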

Expression Index

Some databases, such as Postgres, support indexes that don’t simply copy every row’s data. A partial index contains only the rows matching a WHERE clause, while an expression index stores calculated data: the result of a function rather than the raw column value. The example below is an expression index using the DATE_TRUNC function. As long as our query uses the exact same expression, the planner can use the index and speed up our query.

-- Requires the expression to be immutable for the column's type;
-- date_trunc is immutable for TIMESTAMP but not for TIMESTAMPTZ.
CREATE INDEX idx_inserted_on ON my_table (date_trunc('day', inserted_at));
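
For example, this query uses the identical expression, so the planner can use the index (the selected columns are illustrative):

SELECT id, inserted_at
FROM my_table
WHERE date_trunc('day', inserted_at) >= '2020-07-01';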

This is nice when you don’t want to maintain the values in another column, but keep in mind that if the expression also appears in the SELECT clause, the database may still have to calculate it for the rows returned. Since this is typically a small set of records, it shouldn’t add a significant amount of time to the query.

Expression indexes are less visible than calculated columns, meaning an engineer who isn’t paying attention may not know the index exists. During a refactor, the query may be changed so that it no longer matches the indexed expression, the index silently stops being used, and a replacement might never be created.

Reduce the amount of data being processed

Using hashes & maps in your code is much faster than iterating over an array or list. Same goes for indexes vs. table scans. Anything that has to be calculated at query time will not use an index. This includes functions on the left of an operator like we talked about in the previous section as well as common table expressions (CTEs) and subqueries.

Typically joining tables is faster than subqueries or CTEs because joins can use indexes; however, in a few cases it can be faster to run a subquery. Later in the chapter, we’ll talk about running EXPLAIN, which will give you some insight as to when to use one type of query over another.
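
As an illustration, here is a filter written as a subquery and its join equivalent, using hypothetical orders and customers tables. Modern planners often rewrite the first form into the second automatically; EXPLAIN will show which plan you actually got.

-- Subquery form:
SELECT o.id, o.total
FROM orders o
WHERE o.customer_id IN (SELECT c.id FROM customers c WHERE c.region = 'EU');

-- Join form, which lets the planner use indexes on the join keys:
SELECT o.id, o.total
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE c.region = 'EU';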

Reduce the amount of data being written

Every time a record is updated in a database that uses multi-version concurrency control (MVCC), such as Postgres, the entire row is rewritten even if only one column changed. If a table mixes columns that never change with columns that change often, consider moving the frequently changing columns into their own table.

One common example is a scheduled job that sets statuses on rows as it works them. It might set all rows to ‘incomplete’, then change them to ‘pending’ and ‘complete’ as it works through the list. Every time the status changes, the entire row is rewritten; in this example, the entire table is rewritten three times.

The database’s vacuum process may not be able to keep up, which will bloat the table, cause data fragmentation, and push the table’s pages out of any internal cache.

Instead, move the status and any other columns specific to that job into their own table. I call this Isolating Your Writes. It will allow your parent table to maintain its cache longer and reduce data fragmentation.
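
Here’s a sketch of that split with hypothetical names. The wide parent rows are written once; only the narrow status table churns:

-- Parent table: columns that rarely change.
CREATE TABLE reports (
  id BIGSERIAL PRIMARY KEY,
  name VARCHAR(100),
  payload TEXT
);

-- Frequently updated job state lives in its own narrow table.
CREATE TABLE report_job_status (
  report_id BIGINT PRIMARY KEY REFERENCES reports (id),
  status VARCHAR(20) NOT NULL DEFAULT 'incomplete',
  updated_at TIMESTAMP NOT NULL DEFAULT now()
);

-- The job now rewrites only the small status rows:
UPDATE report_job_status
SET status = 'pending', updated_at = now()
WHERE report_id = 42;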

Or better yet, use Redis or another in-memory database for worker queues.

Reduce the amount of data being downloaded

Don’t SELECT *. Instead, select only the columns you need. Your ORM wants to SELECT *, but don’t let it; any ORM worth its salt will let you specify which columns to select.
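
A minimal illustration using the users table from earlier:

-- Downloads every column, including ones the code never reads:
SELECT * FROM users WHERE id = 42;

-- Fetches only what's needed:
SELECT name, email FROM users WHERE id = 42;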