Continually testing our product with smoke tests

Hey there!

So, let’s talk about the release of On-call. Our team knew that for this product to succeed, we needed to have a rock-solid system right from the start. Our customers expect nothing less, and internally, we wouldn’t settle for anything less than a reliable paging product.

In the past, our product Response was crucial for incident response after detection. But now, with On-call, we are the first line of notification for engineers when something goes wrong.

Our main selling point? We’ll always page you when things go haywire. That’s our promise, and it’s non-negotiable.

With this shift in focus, we made significant technical and organizational changes to prioritize reliability as our top technical goal. This allowed us to keep improving our product day after day without compromising on stability.

One key change was the introduction of Smoke Tests, which continuously test the core functionalities of our product. In this post, I’ll dive into how we use them and the insights we gained from building our own Smoke Test framework.

Our Smoke Test suite

Our Smoke Test suite runs tests across different scenarios in our product, covering everything from alert ingestion to making sure the right users are notified during escalations.

We run these tests every minute in both production and staging environments to exercise our infrastructure and integrations. If any test fails consistently, we receive alerts for proactive investigation.
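To make "fails consistently" concrete, here is a minimal sketch in Python of alerting only after several failed runs in a row, so a single blip doesn't page anyone. The `FailureTracker` name and the threshold of 3 are illustrative assumptions, not details of the framework described in this post.

```python
from dataclasses import dataclass


@dataclass
class FailureTracker:
    """Counts consecutive failures of one smoke test and decides when to alert."""

    threshold: int = 3  # assumed value: alert after 3 failed runs in a row
    consecutive_failures: int = 0

    def record(self, passed: bool) -> bool:
        """Record one run and return True if this run should raise an alert."""
        if passed:
            # Any passing run resets the streak.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold


# Six one-minute runs: two failures, a pass, then three failures in a row.
tracker = FailureTracker(threshold=3)
runs = [False, False, True, False, False, False]
results = [tracker.record(passed) for passed in runs]
```

Only the last run crosses the threshold. The earlier pair of failures is forgotten as soon as a run passes, which is the trade-off with this approach: it ignores flaky one-offs but takes a few minutes to notice a real outage.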

Additionally, we run these tests on every application change, including pull requests and deployments. These tests use containers on a single machine for quick feedback without hindering our deployment process.

We also test unusual edge cases to validate the robustness of our system. For instance, we check if our system can handle malformed JavaScript alerts.

By running these tests continuously and integrating them into our development process, we can confidently make changes to our system’s core components while ensuring that everything operates as expected.

Learnings

Start with a clean slate

It’s crucial to begin each test by clearing any remnants from previous runs to prevent issues with test consistency and reliability.

By ensuring a clean slate at the start of each test, we eliminate potential issues caused by leftover configurations and maintain a stable testing environment.
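Here is a rough sketch of that idea in Python. The point it tries to show is that cleanup happens at the start of a run rather than the end, because a run that crashed never reached its teardown. `FakeEnvironment`, the owner tags, and the resource names are all hypothetical stand-ins for whatever your tests create.

```python
class FakeEnvironment:
    """Hypothetical in-memory stand-in for resources a smoke test creates."""

    def __init__(self):
        self.resources = {}  # resource name -> owner tag

    def create(self, name, owner):
        self.resources[name] = owner

    def delete_owned_by(self, owner):
        """Delete every resource tagged with this owner; return their names."""
        stale = [name for name, tag in self.resources.items() if tag == owner]
        for name in stale:
            del self.resources[name]
        return stale


def run_smoke_test(env, test_name):
    # Clean up first: remove anything a previous run of this test left behind.
    removed = env.delete_owned_by(test_name)
    # Then build the fixtures this run needs, tagged so the next run can find them.
    env.create(f"{test_name}-schedule", owner=test_name)
    env.create(f"{test_name}-escalation-path", owner=test_name)
    return removed


env = FakeEnvironment()
env.create("escalation-test-leftover", owner="escalation-test")  # from a crashed run
env.create("customer-schedule", owner="customer")  # must never be touched
removed = run_smoke_test(env, "escalation-test")
```

Tagging resources by owner keeps the cleanup narrow: the test only deletes what it created, which matters when the same suite runs against production.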

Use your standard rails

Use the standard rails and abstractions in your codebase to read and write the database, rather than updating rows directly. Direct manipulation can produce states your application would never create itself, which leads to unexpected test results.

By sticking to established methods and abstractions, you ensure consistent and reliable testing while minimizing the risk of creating unusual database states.

Test your user’s assumptions

Consider user expectations and real-world scenarios when designing tests to ensure that your system behaves as users anticipate.

By incorporating time-based metrics and user-centric testing, you can validate not just the correctness of your system but also its real-world performance and user experience.
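A time-based check can be as simple as polling for the outcome the user cares about and failing if it takes too long. The sketch below is one way to do that in Python. The `wait_for` helper, the one-second budget, and the simulated 50 ms delivery delay are assumptions for illustration only.

```python
import time


def wait_for(condition, timeout_s, poll_interval_s=0.01):
    """Poll until condition() is true.

    Returns the elapsed seconds, or raises TimeoutError once timeout_s passes.
    """
    start = time.monotonic()
    while True:
        if condition():
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_s:
            raise TimeoutError(f"condition not met within {timeout_s}s")
        time.sleep(poll_interval_s)


# Simulated page that is "delivered" about 50 ms after the alert is sent.
sent_at = time.monotonic()


def page_delivered():
    return time.monotonic() - sent_at >= 0.05


# The test passes only if the page arrives inside the budget, and it
# records how long delivery took so the latency can be tracked over time.
latency = wait_for(page_delivered, timeout_s=1.0)
```

Returning the elapsed time, not just pass or fail, means the same test can feed a latency metric, so a slow drift shows up before it turns into a timeout.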

Prepare to unearth the unexpected

Continuously running tests can reveal hidden issues and assumptions in your system, prompting necessary adjustments before they impact users.

By proactively testing assumptions and scaling scenarios, you can address potential challenges before they become customer-facing problems.

Conclusions

Hope you found this overview of our Smoke Testing approach insightful! It’s a vital part of our commitment to reliability at incident.io.

Our Smoke Testing strategy is just one aspect of our comprehensive reliability efforts for On-call, encompassing a range of technical and cultural changes.

Launching a critical product like On-call required us to ensure its resilience and flexibility for future enhancements. We’re dedicated to delivering a platform that meets the highest standards of reliability and performance.
