Indexes play a crucial role in improving performance in Postgres, but it’s not always easy to know when a query could benefit from an index. Here, we’ll explore:
- What indexes are
- Some scenarios where they can be useful
- Guidelines for determining which type of index to add
- How to identify when an index is missing
What is an index?
By default, Postgres reads data by scanning every row and filtering out unnecessary results. This approach works well for small datasets, but as tables grow, it becomes inefficient and slows down performance.
In Postgres, an index functions like the index at the back of a book. When searching for a specific topic in a book, you have two options:
- Read the entire book. This is similar to a sequential scan in Postgres, where every row is scanned until the desired result is found.
- Use the index to navigate to the relevant section in the book. In Postgres, this is known as an index scan, enabling faster retrieval of relevant rows by bypassing large portions of the table.
To illustrate this, let’s create a users table with 500k rows and observe how Postgres queries this table using two methods:
- EXPLAIN displays the query plan, showing how the query will be executed, along with estimated costs.
- EXPLAIN ANALYZE provides the same information and also runs the query, so the reported costs and timings are actual rather than estimated.
Here’s an example using EXPLAIN ANALYZE to filter the users table:
-- Input
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'someone@example.com';
-- Output
Seq Scan on users (cost=0.00..21500.00 rows=1 width=64) (actual time=0.025..215.000 rows=1 loops=1)
Filter: (email = 'someone@example.com')
Rows Removed by Filter: 499999
Planning Time: 0.065 ms
Execution Time: 215.150 ms
With a sequential scan, Postgres takes around 215ms to execute this query. Let’s add an index on the email column and see the impact:
-- Input
CREATE INDEX idx_users_email ON users(email);
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'someone@example.com';
-- Output
Index Scan using idx_users_email on users (cost=0.42..8.44 rows=1 width=64) (actual time=0.010..0.020 rows=1 loops=1)
Index Cond: (email = 'someone@example.com')
Planning Time: 0.062 ms
Execution Time: 0.050 ms
By using the index scan, the execution time decreased to 0.05ms – a significant improvement!
It’s worth noting that if the query involved fetching a user by ID, this issue might not have arisen because Postgres automatically creates an index for primary keys.
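For instance, assuming the users table was created with an id primary key (an assumption here, since the table definition isn’t shown), a lookup by id can already use the index Postgres created automatically:
-- Postgres created this index when the primary key was defined,
-- so a lookup by id can use an index scan without any extra work.
SELECT * FROM users WHERE id = 42;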
What are the disadvantages?
While indexes offer performance benefits, they come with drawbacks. They consume additional disk space and slightly slow down write operations (inserts, updates, deletes) as the index must be maintained. Therefore, indexes should only be added for queries where they are essential for improved performance.
So when do I need one?
Due to the overhead involved, indexes should be added only for frequently executed queries. But how can you determine if a query would benefit from an index?
(1) When the query filters out a significant portion of the rows it reads. This occurs when rows are filtered using a WHERE clause, or when a JOIN operation involves a small number of rows.
- For example, at incident.io, we often retrieve resources for a specific organization, with the relevant resources being a small subset of the entire table. Therefore, when adding a new table, we typically create an index on the organisation_id column, restricted to rows where archived_at is null:
CREATE INDEX idx_users_organisation_id ON users (organisation_id) WHERE archived_at IS NULL;
(2) When the query requires rows to be returned in a specific order. This occurs when an ORDER BY clause is used. If an index matches this order, Postgres can retrieve rows directly from the index, eliminating the need for a separate sorting step.
- For instance, if we frequently need to retrieve users for an organization in the order they registered, we could create an index for this purpose:
CREATE INDEX idx_users_organization_id_created_at ON users (organisation_id, created_at);
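A query like the following (the organisation_id value is a placeholder) could then read rows straight from the index in registration order:
-- The (organisation_id, created_at) index returns matching rows already
-- sorted by created_at, so Postgres can skip a separate sort step.
SELECT * FROM users
WHERE organisation_id = 'org_123'
ORDER BY created_at;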
(3) When uniqueness needs to be enforced. This can be accomplished by creating a unique index.
- For example, if a constraint dictates that each user must have a unique email address within an organization, we could establish this uniqueness with an index:
CREATE UNIQUE INDEX idx_users_unique_email ON users(organization_id, email);
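With that index in place, a duplicate email within the same organization is rejected at insert time. The values below are purely illustrative, and assume the table’s other columns are nullable or have defaults:
-- The second insert violates idx_users_unique_email and is rejected
-- by Postgres with a unique-constraint violation error.
INSERT INTO users (organization_id, email) VALUES ('org_123', 'someone@example.com');
INSERT INTO users (organization_id, email) VALUES ('org_123', 'someone@example.com'); -- fails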
What type of index should I add?
In most cases, a b-tree index is preferred as it is Postgres’ default and suitable for a wide range of scenarios. You can create one like this:
CREATE INDEX idx_users_email ON users(email);
Order matters
When including multiple fields in an index, ensure that the most frequently queried columns are placed first to maximize index reusability. For example, Postgres can use the index below to filter rows for a specific organization_id, even if no filtering is done on reported_at. However, this index cannot be reused if filtering is based only on reported_at. This is an important consideration!
CREATE INDEX idx_users_organization_id_reported_at ON users (organization_id, reported_at);
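To make the leading-column rule concrete, here’s a sketch of which filters can and can’t use that index (the literal values are placeholders):
-- Can use the index: filters on the leading column, organization_id.
SELECT * FROM users WHERE organization_id = 'org_123';

-- Can also use the index: both columns are covered, in order.
SELECT * FROM users WHERE organization_id = 'org_123' AND reported_at > '2024-01-01';

-- Cannot use this index efficiently: reported_at is not the leading column.
SELECT * FROM users WHERE reported_at > '2024-01-01';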
Don’t over-index
It’s advisable to prioritize indexes that cover multiple use cases, even if it means a query isn’t served exclusively by an index scan. To illustrate this point, consider a scenario where we frequently retrieve all users for an organization in the viewer state. We could create a dedicated index on organization_id and state, but it would largely duplicate an index on organization_id alone, adding extra write and storage overhead for a single query pattern.
CREATE INDEX idx_users_organization_id_state ON users (organization_id, state);
Instead, creating an index solely on organization_id is often more beneficial: Postgres can use it to find the organization’s rows and then filter that small subset by state, and the same index serves every other query that filters by organization. This approach enhances index reusability!
CREATE INDEX idx_users_organization_id ON users (organization_id);
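For the viewer query, Postgres can still lean on this single-column index and filter the remaining rows, along these lines (placeholder value for the organization id):
-- Postgres can use idx_users_organization_id to find the organization's rows,
-- then apply the state filter to that small subset.
SELECT * FROM users
WHERE organization_id = 'org_123'
  AND state = 'viewer';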
Use EXPLAIN and EXPLAIN ANALYZE to determine the appropriate index based on the shape of your data.
Note that if using your index isn’t the more efficient option, Postgres will simply ignore it when planning a query.
Exclude rows you’ll never need
You can also exclude rows you’ll never query from an index; in Postgres, this is known as a partial index. For example, at incident.io our production app only ever reads rows that haven’t been soft deleted, so most of our indexes are constrained to rows where `archived_at` is null.
A partial index like this is smaller, so it takes up less disk space and is faster to scan because there’s less data to sift through.
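One thing to keep in mind: a query can only use a partial index when its WHERE clause matches the index’s predicate, as in this sketch against the earlier idx_users_organisation_id index (placeholder value):
-- The WHERE clause matches the index predicate (archived_at IS NULL),
-- so the planner is able to use the partial index.
SELECT * FROM users
WHERE organisation_id = 'org_123'
  AND archived_at IS NULL;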
While b-tree indexes cover most scenarios, there are other types like GIN, GiST, hash, BRIN, and SP-GiST indexes to consider for more specific use cases.
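As a hypothetical sketch, if the users table had a jsonb column called preferences, a GIN index would be the usual choice for containment queries on it:
-- A GIN index supports jsonb containment queries (e.g. the @> operator),
-- which a b-tree index can't serve efficiently.
CREATE INDEX idx_users_preferences ON users USING GIN (preferences);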
How do I know when an index is missing?
It’s important to be proactive here: if you only add indexes once a table has grown large, queries will already have slowed down by the time you act. To spot index-related problems quickly, we added a section to our Postgres performance dashboard in Grafana.
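One way to surface candidates yourself (a sketch using Postgres’ built-in statistics views, not necessarily what our dashboard runs) is to look for tables that are read mostly via sequential scans:
-- Tables that are read mostly via sequential scans, and read many tuples
-- while doing so, are often good candidates for a new index.
SELECT relname, seq_scan, idx_scan, seq_tup_read
FROM pg_stat_user_tables
ORDER BY seq_tup_read DESC
LIMIT 10;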
The key takeaway is to debug your queries using `EXPLAIN` and `EXPLAIN ANALYZE`, identify when to add an index based on query patterns, choose the right type of index, and leverage tools like Query Insights in Google Cloud SQL or Grafana performance dashboards.
By carefully considering these factors, you can create a solid indexing strategy that boosts your database’s performance without overwhelming maintenance. Happy indexing!