How our data team handles incidents

Introduction

Hey there! Let’s talk about data incidents and how the Data team at incident.io handles them. In the past, data teams weren’t always involved in incident management, but as data plays a bigger role in business decisions and products, data-related incidents are becoming both more common and more important to get right.

Here at incident.io, the Data team is deeply embedded in various aspects of the business, helping both go-to-market and product teams make data-driven choices. We deal with data incidents regularly and rely on our own product to monitor, triage, and respond to them. Here’s how we’ve set things up.

Getting the data team on call

We recently introduced our On-call feature, which has changed how we manage incidents. Every week, one person from the Data team takes on the role of “Data Responder”. Their duties include leading data-related incidents, fixing dbt pipeline issues, and handling queries from the rest of the company.

We keep the Data Responder schedule in our product, where it rotates through team members during working hours. The schedule also syncs with Slack user groups, so anyone can tag the current on-call person without alerting the entire team.
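To make the weekly rotation concrete, here is a minimal sketch of how a "one responder per week" schedule can be computed. It is only an illustration of the idea, not how our product implements schedules: the roster names, the `responder_for` function, and the anchor date are all made up for this example.

```python
from datetime import date

# Hypothetical roster; the names are placeholders, not real team members.
ROSTER = ["alice", "bob", "carol"]

# Arbitrary anchor for the rotation. 1 January 2024 was a Monday,
# so every rotation week runs Monday to Sunday.
ROTATION_START = date(2024, 1, 1)


def responder_for(day: date, roster: list = ROSTER) -> str:
    """Return the Data Responder for the week that contains `day`."""
    # Whole weeks elapsed since the anchor Monday. Floor division keeps
    # the rotation consistent for dates before the anchor as well.
    weeks_elapsed = (day - ROTATION_START).days // 7
    return roster[weeks_elapsed % len(roster)]
```

Calling `responder_for(date.today())` gives the person to tag this week. Counting weeks from a fixed Monday, rather than using the calendar week number, keeps the order stable across year boundaries.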

How are alerts triggered?

We run our dbt pipeline hourly on CircleCI. When a model fails or a test errors, the pipeline sends an alert to the incident.io alerts API. The alert payload contains essential details such as a link to the failed build in CircleCI and the latest commit in the repository.
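As a rough sketch of what that CI step could look like, the snippet below reads dbt's `run_results.json` artifact, collects the nodes that failed, and assembles an alert body with the build link and commit. The payload field names and both function names are assumptions made for illustration, not the incident.io alert schema. `CIRCLE_BUILD_URL` and `CIRCLE_SHA1` are CircleCI's built-in environment variables. Sending the request is left out on purpose, since the endpoint and authentication depend on how your alert source is configured.

```python
from typing import Optional

# dbt statuses that should raise an alert:
# "error" for models that failed to build, "fail" for tests that failed.
ALERTING_STATUSES = {"error", "fail"}


def failed_nodes(run_results: dict) -> list:
    """Return the unique_id of every node whose status is error or fail."""
    return [
        result["unique_id"]
        for result in run_results.get("results", [])
        if result.get("status") in ALERTING_STATUSES
    ]


def build_alert_payload(run_results: dict, env: dict) -> Optional[dict]:
    """Build an alert body, or return None when the run was clean.

    The keys below are hypothetical; map them to whatever your
    alert source expects.
    """
    failures = failed_nodes(run_results)
    if not failures:
        return None
    return {
        "title": f"dbt run failed: {len(failures)} node(s)",
        "failed_nodes": failures,
        "build_url": env.get("CIRCLE_BUILD_URL"),  # link to the failed build
        "commit_sha": env.get("CIRCLE_SHA1"),      # latest commit in the repo
    }
```

In a CI job you would load `target/run_results.json` with `json.load`, pass `os.environ` as `env`, and POST the result only when it is not `None`.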

We can create custom attributes for alerts through the web app UI using expressions, allowing us to route errors to the appropriate team based on the impact.
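In our setup this routing is configured in the web app, not in code. Purely to illustrate the kind of logic such an expression encodes, here is a small Python sketch that maps a failed model's path to an owning team and a priority. The path prefixes, team names, priorities, and the `route_for` function are all invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    team: str
    priority: str


# Hypothetical routing table: dbt path prefix -> owning team and priority.
ROUTES = {
    "models/finance/": Route("finance-analytics", "high"),
    "models/product/": Route("product-analytics", "medium"),
}

# Anything that matches no rule falls back to the Data team.
DEFAULT_ROUTE = Route("data", "low")


def route_for(model_path: str) -> Route:
    """Pick a route for a failed model based on where it lives."""
    matches = [prefix for prefix in ROUTES if model_path.startswith(prefix)]
    if not matches:
        return DEFAULT_ROUTE
    # Longest matching prefix wins, so a specific rule overrides a general one.
    return ROUTES[max(matches, key=len)]
```

For example, `route_for("models/finance/revenue.sql")` returns the finance route, while a staging model with no matching rule falls back to the default.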

Responding to the incident

When an incident occurs, communication is key. Creating a dedicated space for updates and solutions, like a Slack channel, keeps everyone informed and focused. Our product helps here with features such as attaching GitHub PRs, generating AI incident summaries, and surfacing similar past incidents, all of which speed up resolution.

Closing the loop

After resolving an incident, it’s crucial to follow up. For minor incidents, assign tasks to prevent recurrences. For major incidents, consider a postmortem to reflect on the incident and improve processes. Conducting a meta-review can help identify patterns in data incidents and response setups.

Conclusion

Data incidents require attention, ownership, and clear communication. Having a solid data incident management process reduces stress when things go wrong. By implementing some of the strategies mentioned here, you can empower your data team to handle incidents effectively.
