tests/run-tests.py: Add automatic retry for known-flaky tests. #6
Draft
andrewleech wants to merge 3 commits into review/flaky-test-retry from
Conversation
Add a FLAKY_TESTS allowlist and retry-until-pass logic to run-tests.py. When a test on the list fails, it is re-run up to 2 more times and counted as passing on its first successful attempt. This reduces the ~18% per-run false-failure rate on the unix port CI workflow to near 0% without excluding the tests entirely.

The flaky list covers the 8 tests responsible for essentially all CI failures on master: thread/thread_gc1.py (GC race), thread/stress_aes.py (QEMU timeout), thread/stress_schedule.py, thread/stress_recurse.py, thread/stress_heap.py, cmdline/repl_lock.py, cmdline/repl_cont.py, and extmod/time_time_ns.py.

Retries are active by default; --no-retry disables them. The end-of-run summary reports tests that passed via retry separately.

tools/ci.sh is updated to remove --exclude patterns for tests now handled by the retry logic, so they get actual coverage again. thread/stress_aes.py is kept excluded on RISC-V QEMU, where it runs close to the 200s timeout and retries would burn excessive CI time.

Signed-off-by: Andrew Leech <andrew.leech@planetinnovation.com.au>
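The allowlist-plus-retry shape described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: FLAKY_TESTS and FLAKY_TEST_MAX_RETRIES mirror names visible in the diff, while the dict layout and run_with_retry are hypothetical stand-ins.

```python
# Minimal sketch of the allowlist-plus-retry mechanism described above.
# FLAKY_TESTS / FLAKY_TEST_MAX_RETRIES mirror names from the diff; the
# dict layout and run_with_retry are hypothetical stand-ins.
FLAKY_TEST_MAX_RETRIES = 2

# test path -> reason it is considered flaky
FLAKY_TESTS = {
    "thread/thread_gc1.py": "GC race",
    "thread/stress_aes.py": "QEMU timeout",
}

def run_with_retry(test_file, run_one_test, retries_enabled=True):
    """Run a test; re-run a listed flaky test until it passes or
    FLAKY_TEST_MAX_RETRIES is exhausted. Returns (status, attempts_used)."""
    status = run_one_test(test_file)
    if status == "pass" or not retries_enabled or test_file not in FLAKY_TESTS:
        return status, 0
    for attempt in range(1, FLAKY_TEST_MAX_RETRIES + 1):
        if run_one_test(test_file) == "pass":
            # Passed via retry; the summary reports these separately.
            return "pass", attempt
    return "fail", FLAKY_TEST_MAX_RETRIES
```

Unlisted tests fail immediately, so the allowlist bounds how much masking the retries can do.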
Force-pushed from 29e879c to 91111d5
Code size report:
On forks where CODECOV_TOKEN is not set, the upload step fails and marks the coverage job red even though all tests pass. Gate the upload on the secret being available so it's cleanly skipped instead. Signed-off-by: Andrew Leech <andrew.leech@planetinnovation.com.au>
Change the platform field in FLAKY_TESTS from a single string to a tuple of sys.platform strings, allowing a test to be marked flaky on more than one platform. Signed-off-by: Andrew Leech <andrew.leech@planetinnovation.com.au>
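Under the assumption that each allowlist entry pairs a reason with the new tuple of sys.platform strings (or None for "flaky everywhere"), the check might look roughly like this; is_flaky is a hypothetical helper, not code from the PR.

```python
import sys

# Sketch of the multi-platform change above: each entry pairs a reason
# with a tuple of sys.platform values, or None meaning "flaky everywhere".
FLAKY_TESTS = {
    "thread/stress_schedule.py": ("scheduler race", ("linux", "darwin")),
    "thread/thread_gc1.py": ("GC race", None),
}

def is_flaky(test_file, platform=None):
    """Return True if test_file is allowlisted as flaky on this platform."""
    entry = FLAKY_TESTS.get(test_file)
    if entry is None:
        return False
    _reason, platforms = entry
    return platforms is None or (platform or sys.platform) in platforms
```

A tuple keeps single-platform entries cheap to write while allowing two or three platforms where needed.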
andrewleech
commented
Feb 23, 2026
saved_testcase_count = testcase_count.value
flaky_failures = []
for i, r in enumerate(results):
    if r[1] == "fail":
Is there a shared / original definition of "fail" that can be used rather than hard coded in string here?
andrewleech
commented
Feb 23, 2026
retried_pass = False
for attempt in range(1, FLAKY_TEST_MAX_RETRIES + 1):
    print(
        "retry {} ({}/{}, {})".format(
Is .format the standard way of outputting strings in this project?
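For comparison, the two common styles side by side; whether the project standardises on str.format or f-strings is exactly the open question here, and the values below are made up for illustration.

```python
# The hunk uses str.format; an f-string would produce the same text.
test, attempt, max_retries, reason = "thread/thread_gc1.py", 1, 2, "GC race"
via_format = "retry {} ({}/{}, {})".format(test, attempt, max_retries, reason)
via_fstring = f"retry {test} ({attempt}/{max_retries}, {reason})"
```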
andrewleech
commented
Feb 23, 2026
run_one_test(test_file)
retry_entries = results[pre_len:]
del results[pre_len:]  # clean up all entries appended during retry
if len(retry_entries) == 1 and retry_entries[0][1] == "pass":
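The capture-and-truncate pattern in this hunk can be shown in isolation. In this sketch run_one_test is a stand-in that appends one result to results, which is what the hunk's slicing appears to assume the real runner does.

```python
# Illustration of the capture-and-truncate pattern from the hunk above.
results = [("t1.py", "pass"), ("t2.py", "fail")]  # results so far

def run_one_test(name, outcome):
    # Stand-in: the real runner is assumed to append its result here.
    results.append((name, outcome))

pre_len = len(results)
run_one_test("t2.py", "pass")      # re-run the flaky test
retry_entries = results[pre_len:]  # everything appended during the retry
del results[pre_len:]              # keep the summary based on the first run
retried_pass = len(retry_entries) == 1 and retry_entries[0][1] == "pass"
```

The `len(retry_entries) == 1` guard is why the reviewer's question matters: a test type that appends more than one entry per run would never count as a retried pass.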
As per the earlier comment, "pass" probably shouldn't be hard-coded as a string here. Also, is this correct for all test types (diff/exp/unittest)?
andrewleech
commented
Feb 23, 2026
"{} (retried {})".format(FLAKY_REASON_PREFIX, FLAKY_TEST_MAX_RETRIES),
)
print(
    "FAIL {} (retried {}, {})".format(
How does this FAIL compare to earlier fail strings?
Summary
The unix port CI workflow on master fails ~18% of the time due to a handful of intermittent test failures (mostly thread-related). This has normalised red CI to the point where contributors ignore it, masking real regressions.
I added a flaky-test allowlist and retry-until-pass mechanism to run-tests.py. When a listed test fails, it's re-run up to 2 more times and passes on the first success. The 8 tests on the list account for essentially all spurious CI failures on master.

tools/ci.sh is updated to remove --exclude patterns for tests now handled by the retry logic, so they get actual coverage again. thread/stress_aes.py stays excluded on RISC-V QEMU, where it runs close to the 200s timeout and retries would burn excessive CI time.

Retries are on by default; --no-retry disables them. The end-of-run summary reports tests that passed via retry separately from clean passes.

Testing
Not yet tested in CI — this draft PR is to exercise the review workflow and get CI results on the fork.
Trade-offs and Alternatives
Retry-until-pass can mask a test that has become intermittently broken. The allowlist is deliberately small and each entry documents a reason and optional platform restriction. An alternative would be to mark these tests as expected-fail, but that removes coverage entirely which is worse than the current situation.
The test result status strings ("pass", "fail", "skip") are bare string literals throughout run-tests.py; the retry code follows this existing convention. Worth considering a follow-up to consolidate these into constants for the whole file.

Generative AI
I used generative AI tools when creating this PR, but a human has checked the code and is responsible for the description above.