Skip to content

Fix populate antijoin to use .proj() for correct pending key computation#1405

Open
hummuscience wants to merge 1 commit intodatajoint:masterfrom
hummuscience:fix/populate-antijoin-proj
Open

Fix populate antijoin to use .proj() for correct pending key computation#1405
hummuscience wants to merge 1 commit intodatajoint:masterfrom
hummuscience:fix/populate-antijoin-proj

Conversation

@hummuscience
Copy link

Summary

  • Fix _populate_direct() to use self.proj() in the antijoin that computes pending keys
  • Fix jobs.refresh() to use self._target.proj() when computing new keys for the jobs table
  • Fix progress() fallback path to use self.proj() in the remaining count

Problem

The antijoin that computes pending keys (key_source - self) does not project the target table to its primary key before the subtraction. When the target table has secondary (non-PK) attributes, the antijoin fails to match on primary key alone and returns all keys instead of just the unpopulated ones.

This causes:

  • populate(reserve_jobs=False): all key_source entries are iterated instead of just pending ones. Mitigated by if key in self: check inside _populate1(), but wastes time on large tables.
  • populate(reserve_jobs=True): jobs.refresh() inserts all keys into the jobs table as 'pending', not just truly pending ones. Workers then waste their max_calls budget processing already-completed entries before reaching any real work — effectively making distributed populate non-functional for partially-populated tables.
  • progress(): reports incorrect remaining counts in the fallback (no common attributes) path.

Reproduction

# Given a Computed/Imported table with secondary attributes:
MyTable.populate(max_calls=5)  # partially populate

# This returns ALL keys, not just unpopulated ones:
pending = MyTable.key_source - MyTable()
print(len(pending))  # == len(key_source), should be len(key_source) - 5

# But this works correctly:
pending = MyTable.key_source - MyTable().proj()
print(len(pending))  # == len(key_source) - 5  ✓

Fix

Add .proj() to the target side of all three antijoins so the subtraction matches on primary key only:

Location Before After
autopopulate.py:406 self._jobs_to_do(restrictions) - self self._jobs_to_do(restrictions) - self.proj()
autopopulate.py:704 todo - self todo - self.proj()
jobs.py:373 key_source - self._target key_source - self._target.proj()

Test plan

  • Added test_populate_antijoin_with_secondary_attrs — verifies pending key count after partial populate (direct mode)
  • Added test_populate_distributed_antijoin — verifies jobs.refresh() only creates entries for truly pending keys (distributed mode)
  • Existing test suite passes

🤖 Generated with Claude Code

The antijoin that computes pending keys (`key_source - self` in
`_populate_direct`, `key_source - self._target` in `jobs.refresh`, and
`todo - self` in `progress`) did not project the target table to its
primary key before the subtraction. When the target table has secondary
(non-PK) attributes, the antijoin fails to match on primary key alone
and returns all keys instead of just the unpopulated ones.

This caused:
- `populate(reserve_jobs=False)`: all key_source entries were iterated
  instead of just pending ones (mitigated by `if key in self:` check
  inside `_populate1`, but wasted time on large tables)
- `populate(reserve_jobs=True)`: `jobs.refresh()` inserted all keys into
  the jobs table as 'pending', not just truly pending ones. Workers then
  wasted their `max_calls` budget processing already-completed entries
  before reaching any real work.
- `progress()`: reported incorrect remaining counts in some cases

Fix: add `.proj()` to the target side of all three antijoins so the
subtraction matches on primary key only, consistent with how DataJoint
antijoins are meant to work.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dimitri-yatsenko
Copy link
Member

Thanks for digging into this — the .proj() fix on the antijoin is a sound defensive pattern and the right thing to do regardless.

One question on the motivation: I traced through the code and with the current test fixture (Experiment with Subject as parent), key_source.proj() yields {subject_id} and self (Experiment) shares only subject_id as a common attribute name. The antijoin already matches on PK only in this case, with or without .proj(). Could you share a concrete example (table definitions + reproduce steps) where key_source - self actually returns wrong results today? I'd like to understand the exact conditions — e.g., a custom key_source that retains secondary attributes, or a naming collision between key_source attributes and target secondary attributes.

On the CI failures — two issues to fix:

  1. Test logic: Experiment.make() inserts fake_experiments_per_subject = 5 rows per key, so populate(max_calls=2) processes 2 subjects and produces 10 rows, not 2. The assertions need to account for this: assert len(experiment) == 2 * experiment.fake_experiments_per_subject.

  2. Lint: ruff-format wants the final assert in test_populate_antijoin_with_secondary_attrs collapsed to a single line. Running pre-commit run --all-files locally will fix it.

Happy to help get these sorted if you'd like — just let me know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments