We all know that each of our tests should be independent and self-contained, and therefore able to run by itself or in parallel with other tests with deterministic results. In real life, however, we often encounter tests that sometimes pass and sometimes fail. There are many sources of non-deterministic test outcomes. In this post, I would like to focus specifically on tests that depend on global state which they don't sufficiently control.

This might be a global (process-wide) variable, the filesystem, etc. One test might not clean up this global state, and the next test might not fully establish its assumptions. If the tests are executed in sequence, in the natural order, neither the developer nor the CI ever notices a problem. However, a hidden dependency between tests has been introduced and will cause problems later.

The good:

  • dependent tests have a deterministic outcome; it's just that the ordering of tests is one of the inputs. That's why I think it's useful to clearly distinguish them from other non-deterministic/flaky tests.

The bad:

  • often, the incorrect code that causes the test to fail is seemingly unrelated to, and "far" from, the test itself.
  • the tests pass when executed in the usual order (as on CI)

I would like to share two simple techniques that help tackle dependent tests and that I haven't seen in other sources:

  • automatically run each test in isolation
  • execute all tests in reverse order

For demonstration, I have this test suite in test_deps.py:

import locale


def test1():
    assert locale.str(0.1) == '0.1'


def test2():
    locale.setlocale(locale.LC_ALL, 'cs_CZ')
    assert locale.str(0.2) == '0,2'


def test3():
    assert locale.str(0.3) == '0,3'

When I execute this test suite with pytest, everything seems to be fine.

Automatically run each test in isolation

There is a pytest plugin called pytest-forked which executes each test in a forked process and reports the results back to the parent runner process. This is a good start because it identifies tests that are clearly not standalone and don't work by themselves. I ran it on my sample test suite:

$ pip install pytest-forked
...
$ pytest --forked -v

and I got:

collected 3 items                                  

test_deps.py::test1 PASSED                  [ 33%]
test_deps.py::test2 PASSED                  [ 66%]
test_deps.py::test3 FAILED                   [100%]

===================== FAILURES =====================
______________________ test3 _______________________

    def test3():
>       assert locale.str(0.3) == '0,3'
E       AssertionError: assert '0.3' == '0,3'
E         - 0.3
E         + 0,3

test_deps.py:14: AssertionError
======== 1 failed, 2 passed in 0.09 seconds ========

By running pytest test_deps.py::test3 I can confirm that the test indeed doesn't pass when run by itself, which gives me a starting point for a fix.
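
The idea behind process isolation can be illustrated with the standard library alone. The following is a minimal sketch, not pytest-forked's actual implementation: run_forked is a hypothetical helper, the state dictionary merely stands in for process-wide state such as the locale, and the use of os.fork assumes a Unix-like system.

```python
import os

# Stand-in for process-wide state such as the current locale.
state = {"locale": "C"}


def run_forked(test):
    """Run test() in a forked child process; return True if it passed.

    Hypothetical helper for illustration only (Unix-like systems).
    """
    pid = os.fork()
    if pid == 0:
        # Child: works on its own copy of the parent's memory.
        try:
            test()
        except BaseException:
            os._exit(1)  # report failure through the exit code
        os._exit(0)
    # Parent: wait for the child and translate its exit status.
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


def sets_state():
    state["locale"] = "cs_CZ"


def relies_on_state():
    assert state["locale"] == "cs_CZ"


print(run_forked(sets_state))       # the change stays in the child
print(run_forked(relies_on_state))  # fails, much like test3 under --forked
print(state["locale"])              # the parent's state is untouched
```

Because every call starts from the parent's unmodified state, a test that silently relies on an earlier test's side effect fails as soon as it runs in its own process.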

Execute all tests in reverse order

Existing articles recommend executing the tests in random order to identify some of the dependencies. That is a valid suggestion; however, I think one special case of "random" order deserves priority treatment. Let's execute the tests in reverse order. That should disturb the greatest number of dependencies that were unconsciously created while developers and CI were executing the tests in the natural order.

I didn't find any pytest plugin that accomplishes this. However, it's very easy to customize the pytest run through conftest.py like this:

def pytest_collection_modifyitems(items): 
    items.reverse()

After placing this conftest.py into the test root directory, I executed:

$ pytest -v

with this result:

collected 3 items                                  

test_deps.py::test3 FAILED                   [ 33%]
test_deps.py::test2 PASSED                  [ 66%]
test_deps.py::test1 FAILED                  [100%]

===================== FAILURES =====================
______________________ test3 _______________________

    def test3():
>       assert locale.str(0.3) == '0,3'
E       AssertionError: assert '0.3' == '0,3'
E         - 0.3
E         + 0,3

test_deps.py:14: AssertionError
______________________ test1 ______________________

    def test1():
>       assert locale.str(0.1) == '0.1'
E       AssertionError: assert '0,1' == '0.1'
E         - 0,1
E         + 0.1

test_deps.py:5: AssertionError
======== 2 failed, 1 passed in 0.06 seconds ========

Two tests are failing, which is valuable information. We already know that test3 doesn't work by itself, so we leave it aside. As for test1 and test2, there seems to be an interaction between them. We can confirm this by executing pytest test_deps.py::test1 test_deps.py::test2 and pytest test_deps.py::test2 test_deps.py::test1 and comparing the results, which differ. With this information, we can work on a fix. In this case, we would probably choose to improve test2 so that it cleans up after itself properly. In a larger test suite, we might not be able to identify the code "corrupting" the global state, but at least we can harden the set-up of the failing test (test1) so that it is resilient in such a case.

Conclusion

In this article, I focused on non-deterministic test outcomes stemming from different ordering or selection of tests. While there is no silver bullet, these two tricks help expose tests that don't set up their environment well enough.

Other resources

pytest: Flaky tests
pytest-flakefinder