Fix populate antijoin to use .proj() for correct pending key computation #1405
hummuscience wants to merge 1 commit into datajoint:master
The antijoin that computes pending keys (`key_source - self` in `_populate_direct`, `key_source - self._target` in `jobs.refresh`, and `todo - self` in `progress`) did not project the target table to its primary key before the subtraction. When the target table has secondary (non-PK) attributes, the antijoin fails to match on primary key alone and returns all keys instead of just the unpopulated ones.

This caused:

- `populate(reserve_jobs=False)`: all `key_source` entries were iterated instead of just pending ones (mitigated by the `if key in self:` check inside `_populate1`, but wasted time on large tables)
- `populate(reserve_jobs=True)`: `jobs.refresh()` inserted all keys into the jobs table as 'pending', not just truly pending ones. Workers then wasted their `max_calls` budget processing already-completed entries before reaching any real work.
- `progress()`: reported incorrect remaining counts in some cases

Fix: add `.proj()` to the target side of all three antijoins so the subtraction matches on primary key only, consistent with how DataJoint antijoins are meant to work.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Thanks for digging into this. […]

One question on the motivation: I traced through the code and with the current test fixture ([…]

On the CI failures, two issues to fix: […]
Happy to help get these sorted if you'd like; just let me know.
Summary
- `_populate_direct()`: use `self.proj()` in the antijoin that computes pending keys
- `jobs.refresh()`: use `self._target.proj()` when computing new keys for the jobs table
- `progress()` fallback path: use `self.proj()` in the remaining count

Problem
The antijoin that computes pending keys (`key_source - self`) does not project the target table to its primary key before the subtraction. When the target table has secondary (non-PK) attributes, the antijoin fails to match on primary key alone and returns all keys instead of just the unpopulated ones.

This causes:
- `populate(reserve_jobs=False)`: all `key_source` entries are iterated instead of just pending ones. Mitigated by the `if key in self:` check inside `_populate1()`, but wastes time on large tables.
- `populate(reserve_jobs=True)`: `jobs.refresh()` inserts all keys into the jobs table as `'pending'`, not just truly pending ones. Workers then waste their `max_calls` budget processing already-completed entries before reaching any real work, which effectively makes distributed populate non-functional for partially populated tables.
- `progress()`: reports incorrect remaining counts in the fallback (no common attributes) path.

Reproduction
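As a rough illustration of what the projected antijoin is meant to compute, the sketch below models tables as plain Python lists of dicts. It is not the PR's reproduction script and does not call DataJoint; the attribute names (`subject_id`, `session`, `result`) and the helpers `proj` and `pending_keys` are hypothetical stand-ins chosen for this example.

```python
# Toy model only: tables are lists of dicts. Nothing here is DataJoint code.
PRIMARY_KEY = ("subject_id", "session")  # hypothetical primary-key attributes

# Every key that should eventually be populated.
key_source = [
    {"subject_id": 1, "session": 1},
    {"subject_id": 1, "session": 2},
    {"subject_id": 2, "session": 1},
]

# The target carries a secondary (non-PK) attribute, "result".
target = [
    {"subject_id": 1, "session": 1, "result": 0.42},
]


def proj(rows, primary_key=PRIMARY_KEY):
    """Reduce each row to its primary-key values (stand-in for .proj())."""
    return {tuple(row[attr] for attr in primary_key) for row in rows}


def pending_keys(source, populated, primary_key=PRIMARY_KEY):
    """Antijoin on the primary key: source keys absent from the target."""
    done = proj(populated, primary_key)
    return [
        key for key in source
        if tuple(key[attr] for attr in primary_key) not in done
    ]


pending = pending_keys(key_source, target)
print(pending)
# [{'subject_id': 1, 'session': 2}, {'subject_id': 2, 'session': 1}]
```

Because the comparison uses only the primary-key values, the secondary attribute on the target plays no role in deciding which keys are pending.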
Fix
Add `.proj()` to the target side of all three antijoins so the subtraction matches on primary key only:

- `autopopulate.py:406`: `self._jobs_to_do(restrictions) - self` → `self._jobs_to_do(restrictions) - self.proj()`
- `autopopulate.py:704`: `todo - self` → `todo - self.proj()`
- `jobs.py:373`: `key_source - self._target` → `key_source - self._target.proj()`

Test plan
- `test_populate_antijoin_with_secondary_attrs`: verifies pending key count after partial populate (direct mode)
- `test_populate_distributed_antijoin`: verifies `jobs.refresh()` only creates entries for truly pending keys (distributed mode)

🤖 Generated with Claude Code
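The distributed-mode concern behind the second test can be illustrated with a small simulation. This is a sketch under stated assumptions, not DataJoint's job machinery: `run_worker` is a hypothetical function, keys are plain integers, and `max_calls` is modeled simply as "take at most this many entries from the queue", following the behavior described in the Problem section.

```python
# Toy simulation only: run_worker is a hypothetical stand-in for a worker
# calling populate(reserve_jobs=True, max_calls=...).

def run_worker(job_queue, populated, max_calls):
    """Take up to max_calls queued keys; return how many produced new work."""
    new_work = 0
    for key in job_queue[:max_calls]:
        if key in populated:
            continue  # budget spent on an already-completed entry
        populated.add(key)
        new_work += 1
    return new_work


all_keys = list(range(10))
already_populated = set(range(8))  # keys 0..7 are done, 8 and 9 are pending

# Queue filled with every key, as described for the unprojected antijoin.
wasted = run_worker(all_keys, set(already_populated), max_calls=5)

# Queue filled with truly pending keys only, as intended by the fix.
pending_only = [k for k in all_keys if k not in already_populated]
useful = run_worker(pending_only, set(already_populated), max_calls=5)

print(wasted, useful)  # 0 2
```

With the oversized queue, the whole budget of five calls is consumed by completed keys and no new work is done; with the pending-only queue, the same budget finishes both remaining keys.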