
tests/run-tests.py: Add automatic retry for known-flaky tests. #6

Draft
andrewleech wants to merge 3 commits into review/flaky-test-retry from flaky-test-retry

Conversation

andrewleech (Owner) commented Feb 23, 2026

Summary

The unix port CI workflow on master fails ~18% of the time due to a handful of intermittent test failures (mostly thread-related). This has normalised red CI to the point where contributors ignore it, masking real regressions.

I added a flaky-test allowlist and retry-until-pass mechanism to run-tests.py. When a listed test fails, it is re-run up to two more times and counted as a pass on its first successful attempt. The eight tests on the list account for essentially all spurious CI failures on master.

tools/ci.sh is updated to remove --exclude patterns for tests now handled by the retry logic, so they get actual coverage again. thread/stress_aes.py stays excluded on RISC-V QEMU where it runs close to the 200s timeout and retries would burn excessive CI time.

Retries are on by default; --no-retry disables them. The end-of-run summary reports tests that passed via retry separately from clean passes.
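The retry-until-pass behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `run_test` is a stand-in for the real test runner, and only the `FLAKY_TESTS` name and the retry count of 2 come from the PR.

```python
# Hypothetical sketch of retry-until-pass for an allowlist of flaky tests.
FLAKY_TESTS = {"thread/thread_gc1.py"}  # allowlisted flaky tests (illustrative)
FLAKY_TEST_MAX_RETRIES = 2


def run_with_retry(test_file, run_test):
    # Run once; only allowlisted tests that fail are retried.
    result = run_test(test_file)
    if result == "pass" or test_file not in FLAKY_TESTS:
        return result, 0  # clean pass, or a non-flaky failure: no retries
    for attempt in range(1, FLAKY_TEST_MAX_RETRIES + 1):
        if run_test(test_file) == "pass":
            return "pass", attempt  # passed via retry
    return "fail", FLAKY_TEST_MAX_RETRIES  # still failing after all retries
```

Returning the attempt count alongside the status is one way the end-of-run summary could report retried passes separately from clean passes.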

Testing

Not yet tested in CI — this draft PR is to exercise the review workflow and get CI results on the fork.

Trade-offs and Alternatives

Retry-until-pass can mask a test that has become intermittently broken. The allowlist is deliberately small and each entry documents a reason and optional platform restriction. An alternative would be to mark these tests as expected-fail, but that removes coverage entirely which is worse than the current situation.

The test result status strings ("pass", "fail", "skip") are bare string literals throughout run-tests.py — the retry code follows this existing convention. Worth considering a follow-up to consolidate these into constants for the whole file.
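The follow-up consolidation suggested here could look something like the sketch below. The constant names and the helper are hypothetical, not part of this PR.

```python
# Hypothetical module-level constants replacing the bare status literals
# currently scattered through run-tests.py.
RESULT_PASS = "pass"
RESULT_FAIL = "fail"
RESULT_SKIP = "skip"


def count_failures(results):
    # results is a list of (test_name, status) tuples
    return sum(1 for _, status in results if status == RESULT_FAIL)
```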

Generative AI

I used generative AI tools when creating this PR, but a human has checked the code and is responsible for the description above.

Add a FLAKY_TESTS allowlist and retry-until-pass logic to run-tests.py.
When a test on the list fails, it is re-run up to 2 more times and
passes on its first successful attempt. This reduces the ~18% per-run
false-failure rate on the unix port CI workflow to near 0% without
excluding the tests entirely.

The flaky list covers the 8 tests responsible for essentially all CI
failures on master: thread/thread_gc1.py (GC race), thread/stress_aes.py
(QEMU timeout), thread/stress_schedule.py, thread/stress_recurse.py,
thread/stress_heap.py, cmdline/repl_lock.py, cmdline/repl_cont.py,
and extmod/time_time_ns.py.

Retries are active by default; --no-retry disables them. The end-of-run
summary reports tests that passed via retry separately.

tools/ci.sh is updated to remove --exclude patterns for tests now
handled by the retry logic, so they get actual coverage again.
thread/stress_aes.py is kept excluded on RISC-V QEMU where it runs
close to the 200s timeout and retries would burn excessive CI time.

Signed-off-by: Andrew Leech <andrew.leech@planetinnovation.com.au>
github-actions bot commented Feb 23, 2026

Code size report:

Reference:  zephyr/mpconfigport: Remove duplicate builtins.open definition. [1ab9b66]
Comparison: tests/run-tests.py: Support multiple platforms in flaky test map. [merge of f10a479]
  mpy-cross:    +0 +0.000% 
   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:    +0 +0.000% standard
      stm32:    +0 +0.000% PYBV10
      esp32:    +0 +0.000% ESP32_GENERIC
     mimxrt:    +0 +0.000% TEENSY40
        rp2:    +0 +0.000% RPI_PICO_W
       samd:    +0 +0.000% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32:    +0 +0.000% VIRT_RV32

On forks where CODECOV_TOKEN is not set, the upload step fails and
marks the coverage job red even though all tests pass. Gate the
upload on the secret being available so it's cleanly skipped instead.

Signed-off-by: Andrew Leech <andrew.leech@planetinnovation.com.au>
Change the platform field in FLAKY_TESTS from a single string to a
tuple of sys.platform strings, allowing a test to be marked flaky on
more than one platform.

Signed-off-by: Andrew Leech <andrew.leech@planetinnovation.com.au>
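A minimal sketch of what a tuple-based platform check might look like. The entry structure and the `is_flaky` helper are illustrative assumptions, not the PR's exact code; only the idea of a tuple of sys.platform strings comes from the commit message.

```python
import sys

# Hypothetical FLAKY_TESTS map: test path -> (reason, tuple of sys.platform
# values the test is flaky on; an empty tuple means all platforms).
FLAKY_TESTS = {
    "thread/stress_aes.py": ("QEMU timeout", ("linux", "darwin")),
}


def is_flaky(test_file, platform=None):
    platform = platform or sys.platform
    entry = FLAKY_TESTS.get(test_file)
    if entry is None:
        return False
    _reason, platforms = entry
    # Empty tuple matches everywhere; otherwise require membership.
    return not platforms or platform in platforms
```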
    saved_testcase_count = testcase_count.value
    flaky_failures = []
    for i, r in enumerate(results):
        if r[1] == "fail":

andrewleech (Owner) commented: Is there a shared / original definition of "fail" that can be used, rather than hard-coding the string here?

    retried_pass = False
    for attempt in range(1, FLAKY_TEST_MAX_RETRIES + 1):
        print(
            "retry {} ({}/{}, {})".format(

andrewleech (Owner) commented: Is .format the standard way of formatting strings in this project?

    run_one_test(test_file)
    retry_entries = results[pre_len:]
    del results[pre_len:]  # clean up all entries appended during retry
    if len(retry_entries) == 1 and retry_entries[0][1] == "pass":

andrewleech (Owner) commented: As per the earlier comment, "pass" probably shouldn't be hard-coded as a string here. Also, is this check correct for all test types (diff/exp/unittest)?

"{} (retried {})".format(FLAKY_REASON_PREFIX, FLAKY_TEST_MAX_RETRIES),
)
print(
"FAIL {} (retried {}, {})".format(
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this FAIL compare to earlier fail strings?
