Questions Every QA Director Must Ask Themselves #3: What to do about Flaky Tests?

Flaky tests are worse than useless.

If they were just useless, we could at least ignore them. What makes them worse? For one, you cannot rely on the automated test result, so your team usually has to re-run the test manually. So in the end, was it worth it? You took the time and effort to automate that test, but you didn't get the benefit of the automated run – and you didn't save any effort on manual testing.


But even worse, flaky tests undermine the credibility of all your automated tests. If one set of tests isn't reliable, why should your stakeholders (or even you) believe the other test results?

If the term “flaky tests” is new to you, it means an automated test that gives different results for the same configuration. The test might fail, then pass the next time you execute it – without changing anything.  

One of my worst experiences with flaky tests was when I was leading the Test Automation team for a financial application. My team was responsible for the automated lab, the tooling, and some of the common code used in all of the automated tests. The Feature teams owned the actual test cases. We had a suite of automated tests, approximately 1500 of them, that ran every night. And, guess what?  

Every night we had some failures. That was probably to be expected as the Feature teams were pushing a lot of changes. But, the bad news was that approximately 60% of the time, a test failure was not a bug in the product, but a flaw with the test or the testing infrastructure.  

For the next three months, I was in a 7:00 am stand-up meeting every day to review the overnight automated tests and decide what to do with the results. Not fun. Worse, my team was considered guilty of every failure until we proved our innocence (and when we were guilty, we had to re-run those tests manually).

Over the years, I've learned a few tips for tackling flaky tests and avoiding these issues.

1. Make sure your app is testable for automated tests.
2. Tackle technical debt in your test code.
3. Don’t do so much work in the UI (using your test scripts).
4. Empower Feature teams to run (and own) the tests.
5. Provide your stakeholders with consolidated results (manual + automated together)

1. App Testability

The technology used for developing your apps can be one of the most important factors in preventing or eliminating flaky tests, but it's often the most difficult factor for a QA director to affect. The technology, architecture, and design decisions that impact testability are often made well before your automated test program begins. However, all is not lost.

One source of flakiness, especially in UI-driven tests, is the locators the automation framework uses to find UI elements. Testers often have to rely on XPath-based locators, which can break as your UI changes. If every UI element instead carries a unique ID, a UI improvement is far less likely to break tests. Ask your developers to assign unique identifiers to all UI elements.

Another testability area that can cause flaky tests is setting up test data. In order to check functionality in your app, the automated test needs some data already set up in your system. Without a means to set up that data reliably, testers often use the automation framework to drive the UI to create it – which increases the odds that the test will fail during setup, before it ever reaches the feature being tested.

If this is your situation, ask your developers for help. Perhaps there is a "developer API" they use for internal testing, and that can be repurposed.
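The pattern looks something like the sketch below. The `DevApiClient` here is a hypothetical in-memory stand-in – in a real suite it would issue HTTP calls to an internal endpoint – but it shows the shape: seed the data through an API in one fast, reliable step, then let the UI test focus on the feature itself.

```python
class DevApiClient:
    """Stand-in for a developer API; a real client would call internal
    endpoints (e.g. POST /internal/accounts) over HTTP."""

    def __init__(self):
        self._accounts = {}

    def create_account(self, username, balance=0):
        account = {"username": username, "balance": balance}
        self._accounts[username] = account
        return account


def seed_transfer_scenario(api):
    """Seed the data a money-transfer test needs, without touching the UI."""
    payer = api.create_account("alice", balance=100)
    payee = api.create_account("bob", balance=0)
    return payer, payee


api = DevApiClient()
payer, payee = seed_transfer_scenario(api)
print(payer["balance"], payee["balance"])  # 100 0
```

Because setup no longer depends on the UI, a redesigned login page or a slow-rendering form can't fail the test before it even starts.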

2. Technical Debt in Test Code

Just like production code, test code is prone to suffer from technical debt – and often, this causes flaky tests later on. This type of debt is created by automation engineers taking shortcuts to get tests working quickly, then leaving those shortcuts in place while they move on to additional tests. Two of the most common shortcuts are hard-coded values (a tax rate embedded in an assertion, for example) and fixed delays ("sleep 5 seconds" before checking a result).

The solution to hard-coded values is generally to pull your important data from a source outside the test code, perhaps an Excel sheet – or to consolidate the values in a single place in the code, making them easy to update.
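As a minimal sketch of the idea (the field names and values are invented), the shared values live in one data document rather than being scattered through test scripts, so a tax-rate change is a one-line update:

```python
import json

# Shared test data kept in one place. Here it is an inline JSON string for
# the sake of a self-contained example; in practice it would be a JSON/CSV
# file or a spreadsheet export loaded at the start of the run.
TEST_DATA = json.loads("""
{
  "tax_rate": 0.08,
  "base_price": 100.00
}
""")

def expected_total(data):
    """Compute the expected checkout total from the shared test data."""
    return round(data["base_price"] * (1 + data["tax_rate"]), 2)

print(expected_total(TEST_DATA))  # 108.0
```

When the tax rate changes, you edit one field instead of hunting down every test that embedded `0.08` in an assertion.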

Instead of a fixed delay, use your framework's ability to wait for a condition to occur. In a search test, for example, we might wait for the app to say "x results have met your criteria", then proceed with the next step. If the results arrive in 1 second, your test runs faster this time; if they take 10 seconds, your test will still check the search results once they are ready.
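Most UI frameworks ship this as an "explicit wait" (Selenium's `WebDriverWait`, for instance). The generic mechanism can be sketched in a few lines of plain Python – the simulated results banner below is invented for the example:

```python
import time

def wait_until(condition, timeout=10.0, interval=0.25):
    """Poll `condition` until it returns a truthy value or `timeout` expires.

    Returns as soon as the condition is met, so the test proceeds after
    1 second on a fast day but still tolerates a 10-second wait on a slow
    one – unlike a fixed sleep, which is always too long or too short.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout:.1f}s")
        time.sleep(interval)

# Example: wait for a (simulated) search-results banner to appear.
state = {"banner": None}

def results_ready():
    state["banner"] = "3 results have met your criteria"  # arrives "later"
    return state["banner"]

print(wait_until(results_ready, timeout=5))
```

The timeout still caps how long a genuinely broken page can stall the run, so a real failure surfaces as a clear `TimeoutError` rather than a mystery.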

For these types of code-related technical debt, I usually ask the testers to fix the code when they encounter these issues. For instance, if the tax rate changes and causes several tests to fail, I ask the tester to fix the design, not simply update the hard-coded values. Alternatively, you could dedicate some time at each sprint for “code hygiene” tasks.  

3. Using the UI too much

I already mentioned one example of using the UI too much in the App Testability section. It's usually better to load the data into your system using an API or other means than to rely on UI scripts to pre-populate it. This practice reduces your test scripts' exposure to UI changes.

Another opportunity is to make sure you test at the right level of your technology stack. A classic stack has a UI that presents the user experience, a “back end” of business logic, and further back, a database or persistence layer.  

You should consider testing the business logic directly, instead of through the UI, and reserve the UI tests for the user experience and the end-to-end flow. Testing the business logic directly usually means a set of API tests, which have the added benefits of running faster, being less prone to false alarms, and making it easier to extend coverage across permutations.

For example, when testing an e-commerce site, the total amount will vary based on the number of items, discounts, shipping charges, taxes, and maybe other factors. I would look for an opportunity to create an API-driven test that covers all the permutations of these factors, instead of trying to automate every permutation through the UI. The API tests focus on making sure the "math" is right, while the UI test ensures everything is connected end to end and the user experience meets expectations.
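A permutation sweep of that kind might look like the sketch below. `compute_total` is a hypothetical stand-in for the pricing logic under test (in a real suite it would be an API call to the pricing endpoint), and the factor values are invented; the point is how cheaply an API-level loop covers combinations that would take hours to drive through the UI:

```python
import itertools

def compute_total(items, unit_price, discount, shipping, tax_rate):
    """Hypothetical pricing logic under test; in a real suite this would
    be a call to the pricing API, not a local function."""
    subtotal = items * unit_price * (1 - discount)
    return round((subtotal + shipping) * (1 + tax_rate), 2)

def expected(items, unit_price, discount, shipping, tax_rate):
    """Independent reference model the test checks against."""
    return round((items * unit_price * (1 - discount) + shipping)
                 * (1 + tax_rate), 2)

# 3 x 2 x 2 x 2 x 2 = 48 permutations, checked in milliseconds.
cases = itertools.product(
    [1, 3, 10],       # item counts
    [9.99, 250.00],   # unit prices
    [0.0, 0.10],      # discounts
    [0.0, 4.95],      # shipping charges
    [0.0, 0.08],      # tax rates
)
count = 0
for items, price, disc, ship, tax in cases:
    assert compute_total(items, price, disc, ship, tax) == \
           expected(items, price, disc, ship, tax)
    count += 1
print(f"{count} permutations checked")  # 48 permutations checked
```

A single UI test then walks one representative purchase end to end, confirming the screens, the wiring, and the experience – not the arithmetic.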

4. Empower Feature Teams

If the first time new feature code goes through the automated tests is when your Automation team runs them, you are asking for flaky tests. Feature teams are often innovating at the UI layer of your app – the very place your tests use for input. When the automated tests fail because of a change in the product UI, the test can get a bad reputation for being incorrect. A better approach is to empower the Feature teams to run (and own) the tests before merging to your trunk branch or handing off code for system test. This way, the Feature teams can adjust the tests to match the product changes.

5. Consolidated Results

Often, we get our test results from different places: manual tests are recorded in a test case management tool, while automation results might come from the Continuous Integration platform. This doesn't cause flaky tests, but it can create the perception of flakiness. If the same feature is covered by both manual and automated tests and the results differ, your stakeholders won't know which result to believe. Both results may actually be true, but that's hard to communicate with a status metric.

At Testlio, we’ve found that its better to give stakeholders a consolidated view of the test results, where we combine the automated and manual test results into a single source of truth. The human, in this case, turns the automated test result into real solid information.  

Recently, I attended a software testing conference where the audience was polled: more than 80% were currently automating some portion of their tests, but fewer than 10% had a consistently green dashboard. Flakiness is all around – I hope these tips help you tame it.