Debugging deadlocks in Postgres



ERROR: deadlock detected (SQLSTATE 40P01)

What steps should you take when encountering this error?

Deadlocks are a common challenge in backend development, but with careful investigation and design they can be identified and resolved.

Let’s start with the basics.




A brief introduction




What is a lock?

Locks are utilized to manage concurrent access to data in tables, preventing multiple processes from accessing or modifying the same data simultaneously.

When one process holds a lock, any other process that requires the lock must wait until it is released, such as when the first process completes.

While locks can be explicitly introduced in your code and queries, most are created implicitly by the database itself (in our case: Postgres), ensuring the safe and predictable execution of your queries.
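For instance, here’s a minimal sketch in Go of where an implicit row-level lock appears and disappears. The incidents table and the surrounding code are hypothetical, purely for illustration:

```go
package example

import "database/sql"

// renameIncident shows where an implicit row-level lock is taken and
// released. The incidents table is a hypothetical example.
func renameIncident(db *sql.DB) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback()

	// The UPDATE implicitly locks the matched row; any other transaction
	// updating the same row now has to wait for this one.
	if _, err := tx.Exec(`UPDATE incidents SET name = 'renamed' WHERE id = 1`); err != nil {
		return err
	}

	// The lock is only released when the transaction ends, here at Commit
	// (or at the deferred Rollback on an error path).
	return tx.Commit()
}
```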


📖 A comprehensive guide to the various types of Postgres locks can be found here.




What is a deadlock?

A deadlock occurs when two or more processes are unable to progress because each is waiting for another to release a lock.

For example, if endpoint X initiates a transaction that includes:

  • Update Incident
  • Update Incident Custom Fields
  • Update Incident Summary

And endpoint Y performs similar but slightly different work:

  • Update Incident Summary
  • Update Incident

When both of these endpoints are triggered simultaneously, they have a high chance of encountering a deadlock:

X acquires a lock on the Incident while Y acquires a lock on the Incident Summary.

X then waits for a lock on the Incident Summary while Y waits for a lock on the Incident.

Thankfully, Postgres has mechanisms in place to detect deadlocks and will end the conflict early, returning the above error.
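Here’s a minimal sketch of that interleaving in Go, using database/sql with the lib/pq driver. The incidents and incident_summaries tables, the column names, and the connection string are all hypothetical stand-ins:

```go
package main

import (
	"database/sql"
	"log"
	"sync"

	_ "github.com/lib/pq" // registers the "postgres" driver
)

// updateInOrder runs one transaction that executes the given statements in
// order, holding each row lock it acquires until Commit.
func updateInOrder(db *sql.DB, stmts ...string) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback()

	for _, stmt := range stmts {
		// If the other transaction already holds a lock we need, we wait
		// here; if it is also waiting on us, Postgres aborts one of the
		// two transactions with SQLSTATE 40P01.
		if _, err := tx.Exec(stmt); err != nil {
			return err
		}
	}
	return tx.Commit()
}

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/incident?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	var wg sync.WaitGroup
	wg.Add(2)

	// Endpoint X: incident first, then summary.
	go func() {
		defer wg.Done()
		if err := updateInOrder(db,
			`UPDATE incidents SET name = 'from X' WHERE id = 1`,
			`UPDATE incident_summaries SET body = 'from X' WHERE incident_id = 1`,
		); err != nil {
			log.Println("X:", err)
		}
	}()

	// Endpoint Y: summary first, then incident - the opposite order.
	go func() {
		defer wg.Done()
		if err := updateInOrder(db,
			`UPDATE incident_summaries SET body = 'from Y' WHERE incident_id = 1`,
			`UPDATE incidents SET name = 'from Y' WHERE id = 1`,
		); err != nil {
			log.Println("Y:", err)
		}
	}()

	wg.Wait()
}
```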




How do I resolve a deadlock?

The primary solution to a deadlock is to redesign your transactions to operate in a consistent order.

Deadlocks are often not transient or load-induced errors but rather indications of poorly structured logic.

In the example provided, the deadlock can be avoided if the Incident Summary is updated only after updating the Incident.

If this is not feasible, there are alternative approaches available:

  • Introducing an explicit lock at the beginning of the transaction, for instance with FOR UPDATE. This could be a lock on the rows being manipulated, or possibly an advisory lock.
  • Reducing the scope of an explicit lock, for example with FOR NO KEY UPDATE.
  • Implementing retries with significant jitter to avoid repeated deadlocks (see the sketch below).

It’s important to note that all three of these approaches will reduce parallelism and increase the likelihood of lock timeout errors.
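As a rough illustration, here’s what an up-front explicit lock and a jittered retry might look like in Go with the lib/pq driver. The package, table, and function names are hypothetical, not our actual code:

```go
package pglock

import (
	"database/sql"
	"errors"
	"math/rand"
	"time"

	"github.com/lib/pq"
)

// lockIncident takes an explicit row lock on the incident at the very start
// of a transaction, so every code path serialises on the same lock in the
// same place. FOR NO KEY UPDATE is the narrower variant: it still blocks
// concurrent updates, but not inserts of rows that merely reference this one.
func lockIncident(tx *sql.Tx, incidentID int64) error {
	_, err := tx.Exec(
		`SELECT id FROM incidents WHERE id = $1 FOR NO KEY UPDATE`,
		incidentID,
	)
	return err
}

// retryOnDeadlock retries fn when Postgres reports a deadlock (SQLSTATE
// 40P01), sleeping with significant jitter so the competing transactions are
// unlikely to collide again on the next attempt.
func retryOnDeadlock(fn func() error) error {
	const maxAttempts = 3
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		var pqErr *pq.Error
		if !errors.As(err, &pqErr) || pqErr.Code != "40P01" {
			return err // not a deadlock: give up immediately
		}
		jitter := time.Duration(rand.Int63n(int64(500 * time.Millisecond)))
		time.Sleep(jitter * time.Duration(attempt))
	}
	return err
}
```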




Case study: The bulk upsert

Sometimes the root cause of a deadlock may not be as simple as a few out-of-order steps.

We encountered an intermittent deadlock in a process that synchronizes our Catalog product.

The synchronization runs concurrently from various parts of the application, keeping Catalog data up to date both on a periodic schedule and whenever state changes are detected.

While the impact on customers from this error was minimal – as the synchronization would be retried shortly and everything would return to normal – we wanted to address the issue and find a permanent solution to eliminate the pager alerts.




The debugging process

To understand the nature of the deadlock, we needed to identify the two conflicting processes.

The initial step was to refer to our Runbook for debugging lock timeouts in Postgres.

To diagnose the problem, we required two critical pieces of information:

  1. The timestamp of the deadlock error.
  2. A trace or request ID: an identifier that appears in the error log and in the other logs generated during the same process/request lifecycle.

Using the Google Cloud Console to access Postgres logs, we looked for the trace ID around the time of the error.




This search revealed the Postgres process ID (PID) of 1381687, the query it was executing, and the full query blocking it, all in one place. Success!




Postgres provided all the necessary information in a single log line, though typically logs are spread across multiple lines, requiring multiple searches:

  • $trace1 OR $pid1, to find the second PID
  • $pid2, to find the second trace ID
  • $trace2 OR $pid2, to view the complete sequence of events from the competing process.

For example:

“d1fe0daad95d03d1df71ccdc653b958b” OR “[1381687]”




After identifying the two problematic queries, we were surprised to discover that they were identical.

Further investigation using Google Trace Explorer revealed that both queries were invoked by the same code.







🤔 How can two identical transactions deadlock?

If two concurrent processes acquire the same locks in the same order or sequence, they should not deadlock, right? One should wait for the other to finish. At worst, you may encounter a lock timeout.

However, in this scenario, two apparently identical operations did result in a deadlock, specifically during a bulk upsert.

With a single insert query, we were inserting multiple rows of new data. If any row already existed and there was a conflict on a uniqueness constraint, we would update the existing row instead of throwing an error.

This is the classic “create or update” pattern, also known as an “upsert”: either create a new row, or update the one that’s already there.
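In code, the operation looked roughly like the sketch below: one multi-row INSERT with an ON CONFLICT ... DO UPDATE clause. The CatalogEntry type and the table and column names are hypothetical stand-ins for the real Catalog schema:

```go
package catalog

import (
	"database/sql"
	"fmt"
	"strings"
)

// CatalogEntry is a hypothetical row shape standing in for whatever the
// real sync writes.
type CatalogEntry struct {
	ExternalID string
	Name       string
}

// bulkUpsertEntries issues a single multi-row INSERT. When a row hits the
// uniqueness constraint on external_id, it is updated in place instead of
// raising an error.
func bulkUpsertEntries(tx *sql.Tx, entries []CatalogEntry) error {
	if len(entries) == 0 {
		return nil
	}

	values := make([]string, 0, len(entries))
	args := make([]interface{}, 0, 2*len(entries))
	for i, e := range entries {
		values = append(values, fmt.Sprintf("($%d, $%d)", 2*i+1, 2*i+2))
		args = append(args, e.ExternalID, e.Name)
	}

	query := fmt.Sprintf(`
		INSERT INTO catalog_entries (external_id, name)
		VALUES %s
		ON CONFLICT (external_id) DO UPDATE SET name = EXCLUDED.name`,
		strings.Join(values, ", "),
	)

	_, err := tx.Exec(query, args...)
	return err
}
```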

On closer inspection, our two procedures were not quite identical after all. The rows being inserted were in no particular order: they had been put into a map and then converted back into a slice (Go’s equivalent of an array), and Go map iteration order is not deterministic.

Crucially, Postgres acquires locks individually and sequentially during a bulk insert. Rather than locking all affected rows at once, it locks them one by one, in the order the rows appear in the statement.

So our seemingly identical upserts were in fact acquiring the same row locks in different orders, and two concurrent runs could each take a lock the other needed, blocking one another from making progress.

Fortunately, the number of rows involved was small enough that we could simply sort them before the upsert. That gave every concurrent sync a consistent lock order, eliminated the deadlock, and wrapped up our investigation.
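The fix, sketched with the same hypothetical names as the upsert example above, amounts to sorting by the conflict key before building the statement:

```go
package catalog

import (
	"database/sql"
	"sort"
)

// syncEntries sorts the rows by their conflict key before the bulk upsert,
// so every concurrent sync acquires its row locks in the same, deterministic
// order.
func syncEntries(tx *sql.Tx, entries []CatalogEntry) error {
	sort.Slice(entries, func(i, j int) bool {
		return entries[i].ExternalID < entries[j].ExternalID
	})
	return bulkUpsertEntries(tx, entries)
}
```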

So what did we learn? Deadlocks may sound intimidating, but they often come down to a simple problem. Lean on the Postgres logs to work out which processes are in conflict.

And don’t overlook seemingly straightforward operations, especially when they are hidden behind an ORM: the order in which they run matters, right down to the row level.

Next time you encounter a deadlock, dig into the details, use the tools at your disposal, and don’t underestimate the power of consistent ordering. 🙌
