Building On-call: Our observability strategy

At incident.io, we operate an on-call product. Ensuring high availability is crucial for our customers, which is why excellent observability (o11y) is a key tool for us.

Observability enhances the product experience in two ways:

1. Proactive: Constant monitoring of the system’s health allows us to detect issues early and resolve them before they impact customers.

2. Reactive: When unexpected incidents do occur, a robust o11y setup enables engineers to quickly identify and address the issue, reducing its severity and impact.

Crucially, simply having the data is not enough: it needs to be easily accessible and actually used.

Good UX is essential for an effective o11y setup, and building it requires investing in an understanding of your systems, your data, and what your engineers actually need from them.

It’s important to involve the entire engineering team in utilizing observability effectively, rather than relying on a few specialized engineers.

In this post, we’ll demonstrate how we implemented our o11y setup, using our on-call system as an example.

What does our system involve?

Our on-call system manages alerts from third-party tools, identifies who to notify based on alert attributes, and sends notifications to the appropriate individuals.

It follows a simple principle of “alerts in, notifications out” and is represented by the following diagram:


Having a clear understanding of your system is the essential first step.

Mapping the system out collaboratively builds a shared picture across the team, and segmenting it into distinct sub-systems makes it much easier to build clear dashboards later on.

Structure is key for great observability

The usability of your observability stack is what determines whether it succeeds.

Effective observability offers a consistent view of the system at all levels, enabling engineers to quickly identify and address issues.

Dashboards that give you a high-level overview before you dive into specific details are key to efficient incident response.

Key goals for building dashboards include:

  • Easy to understand at a glance: Simplifying information for quick comprehension.
  • Consistency: Ensuring uniformity in dashboard design for seamless navigation.
  • Easy transition between dashboards and logs: Facilitating smooth navigation between different data sources.

Creating a design system for dashboards with a consistent structure aids in meeting these goals.

Overview dashboard — Which part of my system is unhappy?

An overview dashboard serves as the initial point of reference for engineers when detecting issues, providing a snapshot of system status.

This dashboard serves as a guide to quickly direct you to the next step without providing all the detailed information on what is going wrong. It is recommended to make this dashboard easily accessible to your team, such as setting it as the Grafana homepage and bookmarking it in Slack channels. Training your team to use this dashboard regularly will help them be more prepared when issues arise.

This overview dashboard, used during a recent load test, easily highlights the areas under stress, such as our escalations system, prompting further investigation on the specific system dashboard.

The dashboard showcases the health of different components of the system, with red signals indicating any abnormalities, no matter how minor. It includes metrics for infrastructure health, queue health, and a simple health overview for each subsystem. Each breakdown on the overview links directly to the associated system dashboard for deeper analysis.

System dashboards

A system dashboard gives a detailed view of a specific subsystem, focused on a single job to be done. Ours are split into alerts, alert routing, escalations, and notifications, each answering a specific question about that part of the system’s health.

Drilling into the relevant system dashboard gives us a much more detailed picture of the unexpected load flagged on the overview: we can see which organization is generating it, filter our logs for examples, and use the log tables in the dashboard to dig into specific requests. That lets us track issues and their impact over time, spotting bottlenecks, rejected requests, and general system strain.

Event logs

The log behind those tables is what we call an event log: it is recorded every time we execute this task, always in the same format, containing all the pertinent information.

Event logs give an engineer a standardized log line to start their investigation from (including the outcome and the specific resource IDs to search for) before delving deeper into the journey of an individual request.

We log an event after the task is finished, documenting any specific outcomes, durations, and the IDs of resources involved (elements we cannot include in metrics due to cardinality constraints):

log.Info(ctx, "Handled alert", map[string]any{
    "event":    "alertroute_handle_alert",
    "duration": time.Since(startAt).Seconds(),

    "outcome":     outcome,
    "match_count": matchCount,

    "alert":               alert.ID,
    "alert_source_config": alert.AlertSourceConfigID,
    "alert_status":        alert.Status,
})

It is important that this log is always recorded, whatever the outcome (error, panic, and so on), so we use defer to emit the event log at the end of execution, guaranteeing it is captured no matter what happens.
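As a minimal sketch of that pattern (the HandleAlert function, its receiver, and the Alert type are illustrative placeholders; log.Info is the same helper used above):

func (r *AlertRouter) HandleAlert(ctx context.Context, alert Alert) (err error) {
    startAt := time.Now()
    outcome := "unknown"

    // Deferred so the event log is emitted on success, error, and panic alike.
    defer func() {
        log.Info(ctx, "Handled alert", map[string]any{
            "event":    "alertroute_handle_alert",
            "duration": time.Since(startAt).Seconds(),
            "outcome":  outcome,
            "alert":    alert.ID,
        })
    }()

    // ...do the actual routing work, updating outcome as it progresses...
    outcome = "matched"

    return nil
}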

Event logs should also include (and elaborate on) whatever is already tracked in your metrics. Metrics do not allow you to drill down to a specific unit of work, so it is important to supplement your counters and histograms with logs that share their labels but carry more detail about what actually occurred, enabling you to move from a metric-based graph to a specific example (the log) of what went wrong.
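For instance, a counter sharing its labels with the event log might look like the sketch below, assuming the standard Prometheus Go client; the metric name and promauto wiring are illustrative rather than our production setup:

var handleAlertOutcomesTotal = promauto.NewCounterVec(prometheus.CounterOpts{
    Name: "alertroute_handle_alert_outcomes_total",
    Help: "Alerts handled by alert routing, by outcome.",
}, []string{"outcome"})

// Increment the counter and emit the event log with the same label values, so
// a spike on a dashboard chart can be matched to concrete log lines.
handleAlertOutcomesTotal.With(prometheus.Labels{"outcome": outcome}).Inc()
log.Info(ctx, "Handled alert", map[string]any{
    "event":   "alertroute_handle_alert",
    "outcome": outcome,
    "alert":   alert.ID,
})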

Event logs (or high cardinality events) are a common pattern in modern observability. For more information on how this appears in code, check out a great post by Lisa on Anchor logs.

Exemplars

An exemplar is a reference to a specific trace, attached to an individual observation of a metric you are monitoring. They are a valuable tool for bridging the gap between metric-based charts on your dashboard and the actual requests behind them.

Exemplars allow you to easily select a request that contributed to your metric value and quickly navigate to the trace with just one click.

In code, using exemplars looks something like this: we attach an exemplar wherever we record the metric.

md.ObserveWithExemplar(
    ctx,
    handleAlertMatchDurationSeconds.With(prometheus.Labels{
        "outcome":             outcome,
        "outcome_incidents":   string(outcomeIncidents),
        "outcome_escalations": outcomeEscalations,
    }),
    time.Since(startAt).Seconds(),
)
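The md package is an internal helper, but a sketch of what it might do, assuming the standard Prometheus Go client and OpenTelemetry tracing (the function below is illustrative, not our exact implementation):

// ObserveWithExemplar records a histogram observation and, when the request is
// being traced, attaches the trace ID as an exemplar so dashboard charts can
// link straight through to the trace.
func ObserveWithExemplar(ctx context.Context, obs prometheus.Observer, value float64) {
    if sc := trace.SpanContextFromContext(ctx); sc.HasTraceID() {
        if eo, ok := obs.(prometheus.ExemplarObserver); ok {
            eo.ObserveWithExemplar(value, prometheus.Labels{"trace_id": sc.TraceID().String()})
            return
        }
    }
    obs.Observe(value)
}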

Tracing — What did my request spend its time doing?

Tracing provides us with the most detailed view of a single request, enabling close-up debugging. Analyzing a trace allows you to connect the dots between logs and visualize how a request utilized its time.

Traces are especially valuable for understanding slow requests as they reveal where and why a request stalled—whether due to time spent waiting for locks, high latency in 3rd party requests, or slow queries.

As the most granular level of investigation, there isn’t much we can do to enhance UX here beyond ensuring that all important aspects of the request are being tracked. However, there are a few tips that can make this process easier.

Since tracing is the tool you reach for when investigating slow requests, capturing very specific data about the likely-slow parts of a request makes traces far more useful. Two areas are worth instrumenting by default:

  • 3rd party interactions: Engineers should immediately recognize that a request was slow because 1 second was spent waiting for a response from an external API. We include tracing by default for any 3rd party API requests, using a single shared base client for all API requests where we apply standard logging and traces.
  • Database queries: For all queries, tracking the exact duration of each query helps you understand which part of a request is causing delays, and including the query itself in your tracing metadata lets you quickly pinpoint a slow component. Additionally, we track the time spent waiting for a database connection separately from the query itself. This differentiation helps you tell whether a query is slow because it was waiting on a lock or a heavily loaded database, versus inefficient query execution (see the sketch below).
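To make the database side concrete, here is a rough sketch of that separation using OpenTelemetry and a pgx connection pool; the helper, span names, and attribute choices are illustrative rather than our exact implementation:

// execWithTracing records connection acquisition and query execution as
// separate spans, and attaches the SQL statement as a span attribute.
func execWithTracing(ctx context.Context, pool *pgxpool.Pool, sql string, args ...any) error {
    tracer := otel.Tracer("database")

    // Waiting for a connection is tracked separately from running the query,
    // so a slow trace shows whether we were queuing for the pool or the query
    // itself was slow.
    acquireCtx, acquireSpan := tracer.Start(ctx, "db.acquire_connection")
    conn, err := pool.Acquire(acquireCtx)
    acquireSpan.End()
    if err != nil {
        return err
    }
    defer conn.Release()

    queryCtx, querySpan := tracer.Start(ctx, "db.query",
        trace.WithAttributes(attribute.String("db.statement", sql)))
    defer querySpan.End()

    _, err = conn.Exec(queryCtx, sql, args...)
    return err
}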

Exercise it

An observability setup is only valuable if it is actually exercised.

Are you wondering if your dashboard really works? The only way to find out is to put it to the test and push it to its limits.

An overview dashboard is useless if it doesn’t visibly light up when your system is in trouble. You need to actively use it, adjust it, and make sure it works when it matters most.

Game days can be a valuable tool for testing the effectiveness of your dashboards and getting your team on board with using them. Treat it like user testing by providing minimal guidance and observing what they find easy or difficult to navigate. Collaborating with your team to make necessary changes will help them feel more comfortable using the dashboards.

Conclusion

Viewing observability as a product that should provide a great user experience can significantly enhance how your team interacts with it. By following this approach, we have been able to quickly diagnose issues and empower our team to identify and resolve problems efficiently. Our clear structure and hierarchy make it easy to incorporate new charts when needed.

Achieving this level of effectiveness requires intentional investment in your setup and gaining support from your team on the importance of having clear and user-friendly dashboards.

This advice is applicable regardless of the tool you use. While we utilize Grafana for our observability, the key principles of clarity and consistency can be applied to any tool stack.
