Fix populate antijoin to use .proj() for correct pending key computation by hummuscience · Pull Request #1405 · datajoint/datajoint-python

hummuscience · 2026-02-19T16:14:19Z

Summary

Fix _populate_direct() to use self.proj() in the antijoin that computes pending keys
Fix jobs.refresh() to use self._target.proj() when computing new keys for the jobs table
Fix progress() fallback path to use self.proj() in the remaining count

Problem

The antijoin that computes pending keys (key_source - self) does not project the target table to its primary key before the subtraction. When the target table has secondary (non-PK) attributes, the antijoin fails to match on primary key alone and returns all keys instead of just the unpopulated ones.

This causes:

populate(reserve_jobs=False): all key_source entries are iterated instead of just pending ones. Mitigated by if key in self: check inside _populate1(), but wastes time on large tables.
populate(reserve_jobs=True): jobs.refresh() inserts all keys into the jobs table as 'pending', not just truly pending ones. Workers then waste their max_calls budget processing already-completed entries before reaching any real work — effectively making distributed populate non-functional for partially-populated tables.
progress(): reports incorrect remaining counts in the fallback (no common attributes) path.

Reproduction

# Given a Computed/Imported table with secondary attributes:
MyTable.populate(max_calls=5)  # partially populate

# This returns ALL keys, not just unpopulated ones:
pending = MyTable.key_source - MyTable()
print(len(pending))  # == len(key_source), should be len(key_source) - 5

# But this works correctly:
pending = MyTable.key_source - MyTable().proj()
print(len(pending))  # == len(key_source) - 5  ✓

Fix

Add .proj() to the target side of all three antijoins so the subtraction matches on primary key only:

Location	Before	After
`autopopulate.py:406`	`self._jobs_to_do(restrictions) - self`	`self._jobs_to_do(restrictions) - self.proj()`
`autopopulate.py:704`	`todo - self`	`todo - self.proj()`
`jobs.py:373`	`key_source - self._target`	`key_source - self._target.proj()`

Test plan

Added test_populate_antijoin_with_secondary_attrs — verifies pending key count after partial populate (direct mode)
Added test_populate_distributed_antijoin — verifies jobs.refresh() only creates entries for truly pending keys (distributed mode)
Existing test suite passes

🤖 Generated with Claude Code

The antijoin that computes pending keys (`key_source - self` in `_populate_direct`, `key_source - self._target` in `jobs.refresh`, and `todo - self` in `progress`) did not project the target table to its primary key before the subtraction. When the target table has secondary (non-PK) attributes, the antijoin fails to match on primary key alone and returns all keys instead of just the unpopulated ones. This caused: - `populate(reserve_jobs=False)`: all key_source entries were iterated instead of just pending ones (mitigated by `if key in self:` check inside `_populate1`, but wasted time on large tables) - `populate(reserve_jobs=True)`: `jobs.refresh()` inserted all keys into the jobs table as 'pending', not just truly pending ones. Workers then wasted their `max_calls` budget processing already-completed entries before reaching any real work. - `progress()`: reported incorrect remaining counts in some cases Fix: add `.proj()` to the target side of all three antijoins so the subtraction matches on primary key only, consistent with how DataJoint antijoins are meant to work. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

dimitri-yatsenko · 2026-02-19T19:48:51Z

Thanks for digging into this — the .proj() fix on the antijoin is a sound defensive pattern and the right thing to do regardless.

One question on the motivation: I traced through the code and with the current test fixture (Experiment with Subject as parent), key_source.proj() yields {subject_id} and self (Experiment) shares only subject_id as a common attribute name. The antijoin already matches on PK only in this case, with or without .proj(). Could you share a concrete example (table definitions + reproduce steps) where key_source - self actually returns wrong results today? I'd like to understand the exact conditions — e.g., a custom key_source that retains secondary attributes, or a naming collision between key_source attributes and target secondary attributes.

On the CI failures — two issues to fix:

Test logic: Experiment.make() inserts fake_experiments_per_subject = 5 rows per key, so populate(max_calls=2) processes 2 subjects and produces 10 rows, not 2. The assertions need to account for this: assert len(experiment) == 2 * experiment.fake_experiments_per_subject.
Lint: ruff-format wants the final assert in test_populate_antijoin_with_secondary_attrs collapsed to a single line. Running pre-commit run --all-files locally will fix it.

Happy to help get these sorted if you'd like — just let me know.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix populate antijoin to use .proj() for correct pending key computation#1405

Fix populate antijoin to use .proj() for correct pending key computation#1405
hummuscience wants to merge 1 commit intodatajoint:masterfrom
hummuscience:fix/populate-antijoin-proj

hummuscience commented Feb 19, 2026

Uh oh!

dimitri-yatsenko commented Feb 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

Conversation

hummuscience commented Feb 19, 2026

Summary

Problem

Reproduction

Fix

Test plan

Uh oh!

dimitri-yatsenko commented Feb 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments