feat: add plan_splits function for distributed compute #5863

hamersaw · 2026-01-30T21:16:45Z

Adding a plan_splits function to the Scanner to provide a single solution for partitioning a Lance dataset for efficient distributed compute. This function (1) filters the dataset (using index lookups / deletion vectors), producing a mapping of fragment IDs to valid row ranges, and (2) bin packs these fragment row ranges into "splits" that target a configurable partition size (in row count or bytes).
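To make step (2) concrete, here is a minimal sketch of greedy bin packing by row count. The `Split` alias, the `bin_pack_rows` name, and the choice to cut a range when a split fills up are all assumptions for illustration; the PR's actual bin packing may differ (for example by targeting bytes, or by never cutting a range).

```rust
use std::collections::BTreeMap;
use std::ops::Range;

/// Hypothetical split type: fragment ID -> row ranges assigned to this split.
pub type Split = BTreeMap<u32, Vec<Range<u64>>>;

/// Greedily packs per-fragment row ranges into splits holding at most
/// `target_rows` rows each. A range that does not fit in the remaining
/// capacity is cut, and the remainder starts the next split.
pub fn bin_pack_rows(rows: &BTreeMap<u32, Vec<Range<u64>>>, target_rows: u64) -> Vec<Split> {
    assert!(target_rows > 0, "target_rows must be positive");
    let mut splits = Vec::new();
    let mut current = Split::new();
    let mut remaining = target_rows;

    // BTreeMap iteration keeps fragment order deterministic.
    for (&fragment_id, ranges) in rows {
        for range in ranges {
            let mut start = range.start;
            while start < range.end {
                // Take as many rows as fit in the current split.
                let take = remaining.min(range.end - start);
                current
                    .entry(fragment_id)
                    .or_default()
                    .push(start..start + take);
                start += take;
                remaining -= take;
                if remaining == 0 {
                    // Current split is full: emit it and start a new one.
                    splits.push(std::mem::take(&mut current));
                    remaining = target_rows;
                }
            }
        }
    }
    // Emit the final, partially filled split (if any).
    if !current.is_empty() {
        splits.push(current);
    }
    splits
}
```

With fragment 0 holding rows 0..10, fragment 1 holding rows 5..8 and 20..25, and a target of 8 rows, this sketch yields three splits: the first two are full and the last carries the two leftover rows.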

Perhaps the two most important aspects to align on are:

(1) Using a Splits enum that returns different "split types". For a filterable query this returns a Vec<FilteredReadPlan>, where each plan can be fed to the new execute_filtered_read_plan function to read the data without incurring the traditional read overhead (index lookup, deletion vector application, etc.). In every other case (currently nearest / vector search) we return a list of Fragments (the existing Spark partition method). This is meant to be a sane default that will be improved upon in the future. For example, we may want to partition these search types based on index files rather than fragment-level boundaries.

(2) Removing the current FilteredReadPlan and making the existing FilteredReadInternalPlan the default. The differentiating factor is that the latter stores row ranges (i.e. Range<u64>) and the former a bitmap of row indexes. IIUC the intuition is that a bitmap will be more efficient for network transfer, so we should use it in the user-facing API. IMO we use row ranges in our internal APIs, so the bitmap is ONLY useful if we are network transferring "splits" AND the bitmap representation is smaller. A simple calculation tells you which serialization will be smaller (e.g. # ranges * 2 * bytes per range index <> # of rows / 8). So rather than forcing the bitmap conversion on this API, we can quickly identify in the serialization logic whether it makes sense to use a bitmap or row ranges and perform that logic inline. Additionally, this is something we can punt on for now and just work with row ranges until it becomes a problem.
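The size comparison in (2) could look roughly like the sketch below. `Encoding` and `choose_encoding` are hypothetical names, and the bitmap estimate assumes a plain uncompressed bitmap with one bit per row; a compressed bitmap such as a roaring bitmap can be smaller, so treat this as an upper-bound heuristic rather than the PR's logic.

```rust
use std::ops::Range;

/// Hypothetical tag for the wire representation chosen at serialization time.
#[derive(Debug, PartialEq, Eq)]
pub enum Encoding {
    Ranges,
    Bitmap,
}

/// Compares the estimated serialized sizes from the description:
/// `# ranges * 2 * bytes per range index` versus `# of rows / 8`.
/// `num_rows` is assumed to be the number of rows the bitmap must span.
pub fn choose_encoding(ranges: &[Range<u64>], num_rows: u64) -> Encoding {
    // Each range stores a start and an end, both u64.
    let range_bytes = ranges.len() as u64 * 2 * std::mem::size_of::<u64>() as u64;
    // One bit per row, rounded up to whole bytes.
    let bitmap_bytes = num_rows.div_ceil(8);
    // On a tie prefer ranges, since the internal APIs already use them.
    if range_bytes <= bitmap_bytes {
        Encoding::Ranges
    } else {
        Encoding::Bitmap
    }
}
```

A single contiguous range over a million rows costs 16 bytes as a range versus 125,000 bytes as a dense bitmap, while 100 single-row ranges scattered over 800 rows cost 1,600 bytes as ranges versus 100 bytes as a bitmap.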

- Add FilteredReadPlan struct using RowAddrTreeMap for row selection
- Add get_or_create_plan API for lazy plan computation via OnceCell
- Support providing pre-computed plan to FilteredReadExec::try_new
- Centralize plan creation in get_or_create_plan_impl
- Make RowAddrSelection public in lance-core

- Add FilteredReadInternalPlan (private) using BTreeMap<u32, Vec<Range<u64>>> for efficient local execution without bitmap conversion
- Keep FilteredReadPlan (public) using RowAddrTreeMap for distributed execution
- Local path: plan_scan() → internal plan → ScopedFragmentRead (zero conversions)
- External API: get_or_create_plan() converts internal → external once
- with_plan() converts external → internal for distributed workers
- Add bitmap_to_ranges() utility in lance-core for efficient bitmap conversion
- Use BTreeMap for rows to maintain deterministic fragment order

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
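For readers unfamiliar with the bitmap → range conversion mentioned in the commit message, the idea can be sketched as follows. This is not the lance-core bitmap_to_ranges() utility: the function name is hypothetical, and it takes a strictly increasing slice of row offsets as a stand-in for iterating a bitmap's set bits.

```rust
use std::ops::Range;

/// Coalesces strictly increasing row offsets into half-open ranges.
/// Hypothetical stand-in for converting a bitmap's set bits to ranges.
pub fn sorted_rows_to_ranges(rows: &[u64]) -> Vec<Range<u64>> {
    let mut ranges: Vec<Range<u64>> = Vec::new();
    for &row in rows {
        match ranges.last_mut() {
            // The row directly follows the previous range: extend it.
            Some(last) if last.end == row => last.end = row + 1,
            // Otherwise there is a gap (or no range yet): start a new one.
            _ => ranges.push(row..row + 1),
        }
    }
    ranges
}
```

For the offsets 0, 1, 2, 5, 6, 9 this produces the ranges 0..3, 5..7 and 9..10.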


Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

codecov · 2026-01-30T22:38:26Z

Codecov Report

❌ Patch coverage is 84.92707% with 93 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance/src/dataset/scanner.rs	84.62%	65 Missing and 28 partials ⚠️



LuQQiu and others added 5 commits January 29, 2026 14:23

fix: remove redundant clone in test

05b9bf6

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

small fix

20dacb8

initial commit

884fe00

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

github-actions bot added the enhancement New feature or request label Jan 30, 2026

hamersaw mentioned this pull request Jan 30, 2026

feat: add scanner.plan_splits function #5792

Closed

hamersaw added 4 commits January 30, 2026 15:19

working for filterable scans

43dbf3d

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

Merge branch 'main' into feature/plan-splits

dc80d78

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

removed dead code

d660136

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

added rough python bindings for testing

2be3da2

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

github-actions bot added the python label Jan 30, 2026

hamersaw added 3 commits February 3, 2026 15:51

working e2e

b95f631

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

using an enum for Splits that allows us to add FTS and vector specific implementations in the future

e204900

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

adding java bindings

eb33d51

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

github-actions bot added the java label Feb 4, 2026

made java FilteredReadPlan serializable

0d6d6d1

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

hamersaw mentioned this pull request Feb 4, 2026

feat: using scanner.planSplits to prune fragments / rows and bin pack spark partitions lance-format/lance-spark#202

Draft

hamersaw added 5 commits February 5, 2026 13:50

adding unit tests

abf6764

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

added bin_pack unit tests

4902d16

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

docs updates

5dbdac0

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

hopefully python docs correct

860d3c8

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

Merge remote-tracking branch 'upstream/main' into feature/plan-splits

4003156

hamersaw marked this pull request as ready for review February 5, 2026 21:33
