@hamersaw hamersaw commented Jan 30, 2026

Adds a plan_splits function to the Scanner to provide a single solution for partitioning a Lance dataset for efficient distributed compute. This function (1) filters the dataset (using index lookups / deletion vectors), producing a mapping of fragment IDs to valid row ranges, and (2) bin-packs these fragment row ranges into "splits" that target a configurable partition size (in row count or bytes).
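
To make step (2) concrete, here is a minimal sketch of greedy packing by row count. It is an illustration under stated assumptions, not the code in this PR: `pack_splits`, the `Split` alias, and the choice to cut a range at a split boundary are all hypothetical, and byte-based sizing is not modeled.

```rust
use std::collections::BTreeMap;
use std::ops::Range;

/// One split: fragment ID -> row ranges to read from that fragment.
/// (Hypothetical alias, chosen to mirror the "fragment IDs to valid row ranges" mapping.)
type Split = BTreeMap<u32, Vec<Range<u64>>>;

/// Greedily pack per-fragment row ranges into splits holding at most
/// `target_rows` rows each. A range that does not fit in the remaining
/// budget is cut at the split boundary (an assumption of this sketch).
fn pack_splits(rows: &BTreeMap<u32, Vec<Range<u64>>>, target_rows: u64) -> Vec<Split> {
    assert!(target_rows > 0, "target_rows must be positive");
    let mut splits = Vec::new();
    let mut current = Split::new();
    let mut remaining = target_rows;

    for (&frag_id, ranges) in rows {
        for range in ranges {
            let mut start = range.start;
            while start < range.end {
                // Take as many rows as fit in the current split.
                let take = remaining.min(range.end - start);
                current.entry(frag_id).or_default().push(start..start + take);
                start += take;
                remaining -= take;
                if remaining == 0 {
                    // Split is full: emit it and start a fresh one.
                    splits.push(std::mem::take(&mut current));
                    remaining = target_rows;
                }
            }
        }
    }
    if !current.is_empty() {
        splits.push(current);
    }
    splits
}
```

With 18 valid rows across two fragments and a target of 6 rows, this yields three full splits, the second of which spans both fragments.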

Perhaps the two most important aspects to align on are:

(1) Using a Splits enum that returns different "split types". In the case of a filterable query, this returns a Vec<FilteredReadPlan>, where each plan can be fed to the new execute_filtered_read_plan function to read the rows (without incurring the traditional read overhead of index lookups, deletion vector application, etc.). In every other case (currently nearest / vector search), we return a list of Fragments (the existing Spark partition method). This is meant to be a sane default that will be improved upon in the future. For example, we may want to partition these search types based on index files rather than fragment-level boundaries.
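
As a rough picture of how a caller might dispatch on the two split types, the sketch below uses stand-in types. Everything here is an assumption for illustration: the variant names, the fields of the stand-in `FilteredReadPlan` and `Fragment` structs, and the `describe` helper are hypothetical and are not taken from the PR's code.

```rust
use std::collections::BTreeMap;
use std::ops::Range;

/// Stand-in for the plan type described above: fragment ID -> row ranges.
#[derive(Debug, Clone, PartialEq)]
struct FilteredReadPlan {
    rows: BTreeMap<u32, Vec<Range<u64>>>,
}

/// Stand-in for a dataset fragment; only the ID is modeled here.
#[derive(Debug, Clone, PartialEq)]
struct Fragment {
    id: u32,
}

/// Hypothetical shape of the enum; the variant names are assumptions.
#[derive(Debug, Clone, PartialEq)]
enum Splits {
    FilteredReadPlans(Vec<FilteredReadPlan>),
    Fragments(Vec<Fragment>),
}

/// Describe the work each split implies, dispatching on the split type.
fn describe(splits: &Splits) -> Vec<String> {
    match splits {
        // Filterable query: rows are already resolved, so a worker only reads them.
        Splits::FilteredReadPlans(plans) => plans
            .iter()
            .map(|p| {
                let rows: u64 = p.rows.values().flatten().map(|r| r.end - r.start).sum();
                format!("read {} pre-filtered rows from {} fragment(s)", rows, p.rows.len())
            })
            .collect(),
        // Fallback (e.g. vector search): each worker handles whole fragments.
        Splits::Fragments(frags) => frags
            .iter()
            .map(|f| format!("scan fragment {} in full", f.id))
            .collect(),
    }
}
```

The point of the enum is that a distributed engine can handle both cases with one `match`, and new split types can be added later as further variants.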

(2) Removing the current FilteredReadPlan and making the existing FilteredReadInternalPlan the default. The differentiating factor is that the latter stores row ranges (i.e. Range<u64>) and the former a bitmap of row indexes. IIUC the intuition is that a bitmap will be more efficient for network transfer, so we should use it in the user-facing API. IMO we use row ranges in our internal APIs, so the bitmap is ONLY useful if we are transferring "splits" over the network AND the bitmap representation is smaller. A simple calculation tells you which serialization is smaller (ex. # ranges * 2 * bytes per range index vs. # of rows / 8). So rather than forcing the bitmap conversion on this API, the serialization logic can quickly identify whether a bitmap or row ranges makes sense and perform that conversion inline. Additionally, this is something we can punt on for now and just work with row ranges until it becomes a problem.
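
The size comparison in (2) can be sketched as below. This is a back-of-the-envelope estimate only: `Encoding` and `cheaper_encoding` are hypothetical names, range endpoints are assumed to be serialized as 8-byte u64 values, and the bitmap is assumed to be dense (one bit per row), whereas a compressed bitmap format could be smaller.

```rust
use std::ops::Range;

/// Which serialized form is estimated to be smaller (hypothetical type).
#[derive(Debug, PartialEq)]
enum Encoding {
    Ranges,
    Bitmap,
}

/// Compare the two estimates from the description:
///   ranges: # ranges * 2 endpoints * bytes per endpoint
///   bitmap: # of rows / 8 (one bit per row, rounded up to whole bytes)
fn cheaper_encoding(ranges: &[Range<u64>], total_rows: u64) -> Encoding {
    let bytes_per_index = std::mem::size_of::<u64>() as u64;
    let range_bytes = ranges.len() as u64 * 2 * bytes_per_index;
    let bitmap_bytes = (total_rows + 7) / 8;
    if range_bytes <= bitmap_bytes {
        Encoding::Ranges
    } else {
        Encoding::Bitmap
    }
}
```

A single range covering a million rows costs 16 bytes as a range versus 125,000 bytes as a dense bitmap, while ten scattered single-row ranges in a 100-row fragment cost 160 bytes versus 13, so the choice flips with how fragmented the selection is.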

LuQQiu and others added 5 commits January 29, 2026 14:23
- Add FilteredReadPlan struct using RowAddrTreeMap for row selection
- Add get_or_create_plan API for lazy plan computation via OnceCell
- Support providing pre-computed plan to FilteredReadExec::try_new
- Centralize plan creation in get_or_create_plan_impl
- Make RowAddrSelection public in lance-core
- Add FilteredReadInternalPlan (private) using BTreeMap<u32, Vec<Range<u64>>>
  for efficient local execution without bitmap conversion
- Keep FilteredReadPlan (public) using RowAddrTreeMap for distributed execution
- Local path: plan_scan() → internal plan → ScopedFragmentRead (zero conversions)
- External API: get_or_create_plan() converts internal → external once
- with_plan() converts external → internal for distributed workers
- Add bitmap_to_ranges() utility in lance-core for efficient bitmap conversion
- Use BTreeMap for rows to maintain deterministic fragment order
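
The bitmap-to-range conversion mentioned in the commit notes above amounts to coalescing consecutive row offsets. The sketch below shows the idea on a plain sorted slice; it is not the lance-core `bitmap_to_ranges` utility, whose signature and input type are not shown here, and `sorted_rows_to_ranges` is a hypothetical name.

```rust
use std::ops::Range;

/// Coalesce strictly increasing row offsets into half-open ranges.
/// Assumes the input is sorted with no duplicates, as offsets read
/// out of a bitmap in order would be.
fn sorted_rows_to_ranges(rows: &[u64]) -> Vec<Range<u64>> {
    let mut ranges: Vec<Range<u64>> = Vec::new();
    for &row in rows {
        if let Some(last) = ranges.last_mut() {
            if last.end == row {
                // Row is adjacent to the current range: extend it.
                last.end = row + 1;
                continue;
            }
        }
        // Gap (or first row): start a new single-row range.
        ranges.push(row..row + 1);
    }
    ranges
}
```

For example, offsets 1, 2, 3, 7, 9, 10 coalesce into the ranges 1..4, 7..8 and 9..11.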

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>
@github-actions github-actions bot added the enhancement New feature or request label Jan 30, 2026
codecov bot commented Jan 30, 2026

Codecov Report

❌ Patch coverage is 84.92707% with 93 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/dataset/scanner.rs | 84.62% | 65 Missing and 28 partials ⚠️ |

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>
…c implementations in the future

@github-actions github-actions bot added the java label Feb 4, 2026
@hamersaw hamersaw marked this pull request as ready for review February 5, 2026 21:33
Labels

enhancement New feature or request java python