A simple tool for debugging flakes

Introducing stress, a simple tool for debugging inconsistent errors. No one likes seeing a test fail in CI and then pressing Up + Enter every 3 minutes locally to catch a flake! [1]

Try it out! cargo install stress && stress --output --bail -- ls -a

Motivation

Flaky tests are frustratingly common, due to a variety of factors including:

  • Code skew, where the code changes but the tests are not updated
  • Poor test isolation
  • Transient errors as tests interact with multiple pieces of infrastructure, e.g. frontend + backend + database

Often the fix is deleting the test [2], editing the test code, or changing something about the environment where the test is running, such as increasing available memory or tweaking the config.

How does it work?

Pass in a command, say ls -a, and it will be run a specified number of times (default: 10). The exit codes encountered and the number of times each occurred are printed out, along with the command output if the --output flag is set.
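
For a feel of what that tally amounts to, here's a rough shell sketch of the same idea. It is only an approximation of the behavior described above, not how stress is implemented, and the function name tally_exit_codes is made up for illustration.

```shell
# Hypothetical helper: run a command N times and print how often
# each exit code occurred. Not part of stress itself.
# Usage: tally_exit_codes N command [args...]
tally_exit_codes() {
    runs=$1
    shift
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard the command's own output; only the exit code is recorded.
        code=0
        "$@" >/dev/null 2>&1 || code=$?
        echo "$code"
        i=$((i + 1))
    done | sort -n | uniq -c | awk '{ printf "exit code %s: %s runs\n", $2, $1 }'
}

# Ten runs, matching the default mentioned above.
tally_exit_codes 10 ls -a
```

Each run contributes one line with its exit code, and sort plus uniq -c collapse those lines into a count per code.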

If all that's needed is to see whether any runs fail, there's a --bail flag that stops the program on the first run that ends in failure (non-zero exit code).

Why not <insert-tool-here>?

Definitely! That's also a great option if it works for you! A while-loop in bash should be sufficient, as one of my coworkers wisely pointed out.
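
For reference, the kind of while-loop in question could look something like the sketch below. The function name run_until_failure and the run limit argument are assumptions for illustration; the loop stops at the first non-zero exit code, or gives up after the given number of runs.

```shell
# Hypothetical helper: re-run a command until it fails or a run
# limit is reached.
# Usage: run_until_failure MAX_RUNS command [args...]
run_until_failure() {
    max_runs=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max_runs" ]; do
        code=0
        "$@" || code=$?
        if [ "$code" -ne 0 ]; then
            # Stop at the first failing run and pass its exit code along.
            echo "run $attempt failed with exit code $code"
            return "$code"
        fi
        attempt=$((attempt + 1))
    done
    echo "all $max_runs runs passed"
}

run_until_failure 10 ls -a
```

The function returns the failing exit code, so it can be chained with && or checked via $? like any other command.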

So why do this, you might wonder? Just to scratch my own itch.

Next time I need to debug a test, I can focus on the problem instead of the tooling.

Thanks for reading

Leave your thoughts and feedback on GitHub.

If you're using stress let me know how I can improve it for you.

Footnotes

1: If you love manually trying to catch flakes...uh...🤯

2: This is obviously a joke! But it's not always the wrong approach. Test flakes, especially in CI, are a burden on your entire engineering team. They slow down continuous development and reduce trust in the test suite. Empowering engineers to turn off flaky tests forces engineering teams to confront a tragedy of the commons: failed CI runs.