fix: use FileLock with atomic rename for safe concurrent model downloads #396
oleksandr-nc merged 9 commits into main
Conversation
When KEDA scales up multiple ExApp pods sharing storage, concurrent downloads of the same model can corrupt files. Use OS-level FileLock (auto-releases on process death) with temp file + atomic os.replace() to ensure the final file is always complete or absent, never partial. Unlike SoftFileLock, FileLock cannot leave stale locks after pod SIGKILL. No lock added for snapshot_download — huggingface_hub handles this internally.
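As a rough illustration of the pattern described above (not the PR's actual code; the function and parameter names here are hypothetical), the lock-then-rename flow looks like this:

```python
import os
from filelock import FileLock  # OS-level lock, auto-released if the process dies

def download_model(url: str, result_path: str, fetch_chunks) -> None:
    # Per-file lock: only one pod downloads; the others wait, then hit the cache.
    with FileLock(result_path + ".lock", timeout=7200):
        if os.path.exists(result_path):
            return  # another pod finished the download while we waited
        tmp_path = result_path + ".tmp"
        try:
            with open(tmp_path, "wb") as f:
                for chunk in fetch_chunks(url):
                    f.write(chunk)
            # Atomic on the same filesystem: readers see the complete file or nothing
            os.replace(tmp_path, result_path)
        except Exception:
            if os.path.exists(tmp_path):
                os.remove(tmp_path)  # never leave a partial file behind
            raise
```

Because the rename is atomic on the same filesystem, a concurrent reader either sees the previous state or the fully written file, never a partial download.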
Codecov Report
❌ Patch coverage is …

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main     #396      +/-   ##
==========================================
+ Coverage   93.99%   94.87%   +0.88%
==========================================
  Files          45       46       +1
  Lines        5392     5540     +148
==========================================
+ Hits         5068     5256     +188
+ Misses        324      284      -40
```
Force-pushed 93fb067 to fb2f362
📝 Walkthrough

Adds per-file FileLock-backed atomic model downloads (write to a temp file, then atomically rename into place).

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant FileLock
    participant HTTP
    participant FS as FileSystem
    Client->>FileLock: acquire lock for result_path.lock
    activate FileLock
    Client->>FS: check if result_path exists
    alt exists
        Client->>FS: compute local SHA-256
        Client->>HTTP: request/inspect ETag (headers or redirect history)
        alt ETag matches
            Client->>Client: skip download (use cached file)
        else ETag differs
            Client->>HTTP: stream download
        end
    else not exists
        Client->>HTTP: stream download
    end
    HTTP-->>Client: stream chunks
    Client->>FS: write chunks to result_path.tmp
    alt write complete
        Client->>FS: os.replace(result_path.tmp → result_path)
    else error
        Client->>FS: remove result_path.tmp
        Client->>Client: raise ModelFetchError
    end
    FileLock->>FileLock: release lock
    deactivate FileLock
```
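The ETag fast path in the diagram can be sketched as follows. This assumes, as is typical for Hugging Face LFS downloads, that the linked ETag is the file's SHA-256 hex digest; the helper name is hypothetical:

```python
import hashlib

def etag_matches(local_path: str, linked_etag: str) -> bool:
    # Hash the cached file in 1 MiB blocks to avoid loading it all into memory.
    sha = hashlib.sha256()
    with open(local_path, "rb") as f:
        for block in iter(lambda: f.read(1024 * 1024), b""):
            sha.update(block)
    # ETags arrive quoted ('"abc..."'); strip the quotes before comparing.
    return sha.hexdigest() == linked_etag.strip('"')
```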
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ❌ 1 failed (1 warning) | ✅ 2 passed
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pyproject.toml`:
- Around line 47-51: Update the filelock dependency constraint in pyproject.toml
to remove the vulnerable minimum by changing the "filelock>=3.16,<4" entry to
require at least 3.20.1 (e.g., "filelock>=3.20.1,<4"); this ensures the
project's dependency list (the dependencies array containing "fastapi...",
"filelock...", "niquests...", "pydantic...") uses a safe filelock release while
preserving the existing upper bound.
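As a sketch, the suggested constraint change would look roughly like this in `pyproject.toml` (the surrounding entries are abbreviated exactly as in the comment above):

```toml
[project]
dependencies = [
    "fastapi...",
    "filelock>=3.20.1,<4",  # was: filelock>=3.16,<4
    "niquests...",
    "pydantic...",
]
```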
@father we want to add support for Kubernetes in our Nextcloud ExApps. That means multiple copies of ExApps can start at the same time using a shared PVC during startup when they try to download models. For example, we have such an app: https://github.com/nextcloud/llm2. Review this PR that should help us handle that.
Claude finished @oleksandr-nc's task in 3m 18s
PR Review:
@mother we want to add support for Kubernetes in our Nextcloud ExApps. That means multiple copies of ExApps can start at the same time using a shared PVC during startup when they try to download models. For example, we have such an app: https://github.com/nextcloud/llm2. Review this PR that should help us handle that.

Claude finished @oleksandr-nc's task in 3m 32s
Oh, Oleksandr — what a thoughtful piece of engineering! The problem you identified is real and genuinely tricky: when KEDA scales up multiple pods sharing a PVC, they can all race to download the same model file. You've clearly thought through why each piece exists — the OS-level `FileLock`, the temp file plus atomic `os.replace()`, the ETag fast path.

The test suite is also genuinely comprehensive — you've covered the happy path, temp-file cleanup, lock release verification, ETag-based skip, error recovery, concurrent correctness, and progress reporting. That's real care. Let's talk through a couple of things that would make this even better.

Important to address

1. In the download code:

```python
total_size = int(response.headers.get("Content-Length"))
```

This raises a `TypeError` when the server omits the `Content-Length` header, because `get` returns `None`. The fix is small:

```python
total_size = int(response.headers.get("Content-Length", 0))
```

But there's a subtle second issue hiding nearby (line 228):

```python
if linked_etag and total_size == existing_size:
```

If the header is missing, `total_size` is 0, and a 0 == 0 comparison can wrongly take the cached-file path. Guard it:

```python
if linked_etag and total_size and total_size == existing_size:
```

2. When a pod holds the lock for longer than 7200 seconds (stuck download, hung process, zombie pod), the waiting pod fails with a bare `filelock.Timeout`. A small catch gives you a much more helpful message:

```python
from filelock import FileLock, Timeout as FileLockTimeout

try:
    with FileLock(result_path + ".lock", timeout=7200):
        ...
except FileLockTimeout as exc:
    raise ModelFetchError(
        f"Timed out waiting for lock on '{result_path}' after 7200s — another pod may be stuck downloading this file"
    ) from exc
```

Adding a test for this failure mode would round out the test suite nicely too.

3. Thread-safety in the concurrent test

In the concurrent test:

```python
errors = []
...
errors.append(e)  # called from multiple threads

responses = iter([FakeResponse(b"A" * 10000), FakeResponse(b"B" * 10000)])

def mock_get(_url, **_kwargs):
    return next(responses)
```

CPython's GIL makes these calls mostly safe in practice, but `next()` on a shared iterator is not a documented thread-safe operation; `queue.Queue` makes the intent explicit and thread-safe:
```python
import queue

errors = queue.Queue()
responses = queue.Queue()
responses.put(FakeResponse(b"A" * 10000))
responses.put(FakeResponse(b"B" * 10000))

def mock_get(_url, **_kwargs):
    return responses.get_nowait()

# ...
while not errors.empty():
    raise errors.get()
```

Worth considering if time allows

4. The HTTP connection opens before the ETag fast-path check
Separating the ETag check into a …

5. Lines 229 and 238 use …

6. The …

7. In …

8. A docstring for … would help: the function has some non-obvious behaviors, e.g. it mutates the caller's dict (via …).

You've solved a genuinely real problem with the right tools. The atomic rename + OS-level locking approach is production-grade thinking. Addressing the …
… lint ignores

- Catch filelock.Timeout separately from generic errors so operators see an actionable message when a pod is stuck holding the download lock
- Rename each_history → redirect_resp (response.history entries are redirect responses, not "histories")
- Add E402 and UP to tests_unit/ ruff ignores to match tests/ config
- Add test for FileLockTimeout → ModelFetchError propagation
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nc_py_api/ex_app/integration_fastapi.py`:
- Line 225: The line computing total_size can raise TypeError when the
"Content-Length" header is missing; change the logic around response and
total_size (the variable total_size computed from
response.headers.get("Content-Length")) to treat a missing header as 0 instead
of calling int(None). Specifically, check response.headers.get("Content-Length")
for None (or use a safe default) before converting to int and set total_size = 0
when absent so the subsequent progress logic (the block that checks if
total_size) handles streaming without a Content-Length header.
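Both guards from this comment can be sketched in one hypothetical helper (`should_skip_download` is not the PR's real function name, just an illustration of the combined logic):

```python
def should_skip_download(headers, linked_etag, existing_size) -> bool:
    # A missing Content-Length header becomes 0 instead of int(None) -> TypeError
    total_size = int(headers.get("Content-Length", 0))
    # The extra `total_size` check prevents a 0 == 0 match from
    # wrongly taking the cached-file fast path.
    return bool(linked_etag and total_size and total_size == existing_size)
```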
In `@tests_unit/test_fetch_model_file.py`:
- Around line 190-202: The test test_filelock_timeout_raises_model_fetch_error
incorrectly makes the FileLock constructor raise by using
side_effect=FileLockTimeout(lock); instead, simulate the real behavior by
patching nc_py_api.ex_app.integration_fastapi.FileLock to return a mock context
manager whose __enter__ (or acquire) raises FileLockTimeout(lock) so the
exception occurs when entering/acquiring the lock inside fetch_models_task and
the rest of the test (pytest.raises(ModelFetchError) and the status_msg
assertion) remains the same.
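A minimal sketch of the suggested mock shape, assuming the names from the comment above; the point is that `FileLockTimeout` fires when the lock is entered, not when it is constructed:

```python
from unittest import mock
from filelock import Timeout as FileLockTimeout

def make_stuck_lock(*_args, **_kwargs):
    # Returns a context manager whose __enter__ raises, mimicking a lock
    # that another pod holds past the timeout.
    lock_cm = mock.MagicMock()
    lock_cm.__enter__.side_effect = FileLockTimeout("model.bin.lock")
    return lock_cm
```

In the test itself this would be wired up roughly as `mock.patch("nc_py_api.ex_app.integration_fastapi.FileLock", new=make_stuck_lock)` around the existing `pytest.raises(ModelFetchError)` block.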
- Default Content-Length to 0 when absent to avoid int(None) TypeError
- Guard ETag skip with `and total_size` so 0==0 doesn't open a nonexistent file
- Fix docstring: models_to_fetch → models to match actual param name
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nc_py_api/ex_app/integration_fastapi.py`:
- Around line 239-241: The write to builtins.open(tmp_path, "wb") can raise
FileNotFoundError when save_path contains nested directories because parent
folders may not exist; before opening tmp_path (or earlier before acquiring the
lock) ensure the parent directory exists by creating it (e.g., make parent of
tmp_path or of save_path with exist_ok=True) so the open call succeeds; update
the code around the builtins.open(tmp_path, "wb") block (or do it at function
start) to call directory-creation for
os.path.dirname(tmp_path)/dirname(save_path) before opening the file.
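A sketch of the suggested fix, with a hypothetical helper name; the key call is `os.makedirs(..., exist_ok=True)` before the temp file is opened:

```python
import os

def open_tmp_for_write(tmp_path: str):
    # Create missing parent directories so nested save paths
    # don't raise FileNotFoundError on open().
    parent = os.path.dirname(tmp_path)
    if parent:
        os.makedirs(parent, exist_ok=True)  # safe if another pod creates it first
    return open(tmp_path, "wb")
```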
Summary by CodeRabbit
New Features
Tests
Chores