Integrating in-mem, inline, beta search into GH DiskANN #782
Conversation
- Refactored recall utilities in diskann-benchmark
- Updated tokio utilities
- Added attribute and format parser improvements in label-filter
- Updated ground_truth utilities in diskann-tools
Pull request overview
This PR integrates label-filtered (“document”) insertion and inline beta filtered search into the DiskANN benchmark/tooling flow, enabling benchmarks that operate on { vector, attributes } documents and evaluate filtered queries.
Changes:
- Added `DocumentInsertStrategy` and supporting public types to insert/query `Document` objects (vector + attributes) through `DocumentProvider`.
- Extended inline beta filter search to handle predicate encoding failures and added a constructor for `InlineBetaStrategy`.
- Added a new benchmark input/backend (`document-index-build`) plus an example config for running document + filter benchmarks.
Reviewed changes
Copilot reviewed 22 out of 23 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| test_data/disk_index_search/data.256.label.jsonl | Updates LFS pointer for label test data used in filter benchmarks. |
| diskann-tools/src/utils/ground_truth.rs | Adds array-aware label matching/expansion and extensive tracing diagnostics for filter ground-truth generation. |
| diskann-tools/Cargo.toml | Adds serde_json dependency (and adjusts manifest metadata). |
| diskann-providers/src/model/graph/provider/async_/inmem/full_precision.rs | Adds Vec<T> query support for full-precision in-mem provider (for inline beta usage). |
| diskann-label-filter/src/lib.rs | Exposes the new document_insert_strategy module under encoded_attribute_provider. |
| diskann-label-filter/src/inline_beta_search/inline_beta_filter.rs | Adds InlineBetaStrategy::new and introduces is_valid_filter fast-path logic. |
| diskann-label-filter/src/inline_beta_search/encoded_document_accessor.rs | Adjusts filter encoding to be optional and threads is_valid_filter into the query computer. |
| diskann-label-filter/src/encoded_attribute_provider/roaring_attribute_store.rs | Makes RoaringAttributeStore public for cross-crate use. |
| diskann-label-filter/src/encoded_attribute_provider/encoded_filter_expr.rs | Changes encoded filter representation to Option, allowing “invalid filter” fallback behavior. |
| diskann-label-filter/src/encoded_attribute_provider/document_provider.rs | Allows vector types used in documents to be ?Sized. |
| diskann-label-filter/src/encoded_attribute_provider/document_insert_strategy.rs | New strategy wrapper enabling insertion/search over Document values. |
| diskann-label-filter/src/encoded_attribute_provider/ast_label_id_mapper.rs | Simplifies lookup error messaging and signature for attribute→id mapping. |
| diskann-label-filter/src/document.rs | Makes Document generic over ?Sized vectors. |
| diskann-benchmark/src/utils/tokio.rs | Adds a reusable multi-thread Tokio runtime builder. |
| diskann-benchmark/src/utils/recall.rs | Re-exports knn recall helper for benchmark use. |
| diskann-benchmark/src/inputs/mod.rs | Registers a new document_index input module. |
| diskann-benchmark/src/inputs/document_index.rs | New benchmark input schema for document-index build + filtered search runs. |
| diskann-benchmark/src/backend/mod.rs | Registers new document_index backend benchmarks. |
| diskann-benchmark/src/backend/index/result.rs | Extends search result reporting with query count and wall-clock summary columns. |
| diskann-benchmark/src/backend/document_index/mod.rs | New backend module entrypoint for document index benchmarks. |
| diskann-benchmark/src/backend/document_index/benchmark.rs | New end-to-end benchmark: build via DocumentInsertStrategy + filtered search via InlineBetaStrategy. |
| diskann-benchmark/example/document-filter.json | Adds example job configuration for document filter benchmark runs. |
| Cargo.lock | Adds serde_json to the lockfile dependencies. |
```rust
// We need to clone data for each task
let queries_arc = Arc::new(queries.clone());
```
run_search_parallel clones the entire queries matrix and predicate list (predicates.to_vec()) for each call, which can be very expensive for realistic benchmark sizes (extra memory + copy time) and is repeated for every rep/run config. Since queries is already owned in the caller, consider passing an Arc<Matrix<T>> / Arc<Vec<(usize, ASTExpr)>> into run_filtered_search and cloning the Arc instead of deep-cloning the data.
Suggested change:

```diff
-// We need to clone data for each task
-let queries_arc = Arc::new(queries.clone());
+// Wrap shared data in Arc so it can be cloned cheaply per task
+let queries_arc = Arc::new(queries);
```
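As a sketch of what that could look like end to end (the names and signatures below are stand-ins, not the PR's API: `Vec<Vec<f32>>` in place of `Matrix<T>`, `String` in place of `ASTExpr`), each task clones only the `Arc`, so no per-task deep copy of the queries or predicates is made:

```rust
use std::sync::Arc;
use std::thread;

// Hypothetical driver: the caller wraps its owned query/predicate data in Arcs once,
// and every task clones the Arc (a reference-count bump), not the underlying data.
fn run_filtered_search(
    queries: Arc<Vec<Vec<f32>>>,           // stand-in for Arc<Matrix<T>>
    predicates: Arc<Vec<(usize, String)>>, // stand-in for Arc<Vec<(usize, ASTExpr)>>
    num_tasks: usize,
) {
    let mut handles = Vec::new();
    for _ in 0..num_tasks {
        let queries = Arc::clone(&queries);
        let predicates = Arc::clone(&predicates);
        handles.push(thread::spawn(move || {
            // Each task only reads the shared data.
            queries.len() + predicates.len()
        }));
    }
    for h in handles {
        let _ = h.join();
    }
}
```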
```diff
@@ -5,14 +5,13 @@ version.workspace = true
 authors.workspace = true
 description.workspace = true
 documentation.workspace = true
```
diskann-tools no longer declares license.workspace = true while other workspace crates do. This inconsistency can break packaging/publishing or workspace metadata expectations. If this wasn’t intentional, re-add the workspace license field to match the rest of the workspace manifests.
Suggested change:

```diff
 documentation.workspace = true
+license.workspace = true
```
```rust
/// Expands a JSON object with array-valued fields into multiple objects with scalar values.
/// For example: {"country": ["AU", "NZ"], "year": 2007}
/// becomes: [{"country": "AU", "year": 2007}, {"country": "NZ", "year": 2007}]
///
/// If multiple fields have arrays, all combinations are generated.
fn expand_array_fields(value: &Value) -> Vec<Value> {
    match value {
        Value::Object(map) => {
            // Start with a single empty object
            let mut results: Vec<Map<String, Value>> = vec![Map::new()];

            for (key, val) in map.iter() {
                if let Value::Array(arr) = val {
                    // Expand: for each existing result, create copies for each array element
                    let mut new_results: Vec<Map<String, Value>> = Vec::new();
                    for existing in results.iter() {
                        for item in arr.iter() {
                            let mut new_map: Map<String, Value> = existing.clone();
                            new_map.insert(key.clone(), item.clone());
                            new_results.push(new_map);
                        }
                    }
                    // If array is empty, keep existing results without this key
                    if !arr.is_empty() {
                        results = new_results;
                    }
                } else {
                    // Non-array field: add to all existing results
                    for existing in results.iter_mut() {
                        existing.insert(key.clone(), val.clone());
                    }
                }
            }

            results.into_iter().map(Value::Object).collect()
```
expand_array_fields clones JSON objects/values and can generate a cartesian product when multiple fields are arrays. Inside read_labels_and_compute_bitmap this happens in a nested loop over (queries × base_labels), so worst-case time/memory can explode on realistic datasets. Consider avoiding materializing expanded variants (evaluate arrays lazily), or imposing a cap/short-circuit to prevent combinatorial blow-ups.
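A hedged sketch of the capping idea, reusing the serde_json types from the snippet above; the cap constant, function name, and `Option` return are illustrative additions, not part of the PR:

```rust
use serde_json::{Map, Value};

/// Illustrative cap on how many expanded variants to materialize (not in the PR).
const MAX_EXPANSIONS: usize = 1024;

/// Like `expand_array_fields`, but returns `None` instead of materializing a
/// combinatorial number of objects when several fields are arrays.
fn expand_array_fields_capped(value: &Value) -> Option<Vec<Value>> {
    let Value::Object(map) = value else {
        return Some(vec![value.clone()]);
    };
    let mut results: Vec<Map<String, Value>> = vec![Map::new()];
    for (key, val) in map.iter() {
        if let Value::Array(arr) = val {
            // Empty arrays contribute nothing, matching the original behavior.
            if arr.is_empty() {
                continue;
            }
            // Short-circuit before the cartesian product exceeds the cap.
            if results.len().saturating_mul(arr.len()) > MAX_EXPANSIONS {
                return None;
            }
            let mut new_results = Vec::with_capacity(results.len() * arr.len());
            for existing in results.iter() {
                for item in arr.iter() {
                    let mut new_map = existing.clone();
                    new_map.insert(key.clone(), item.clone());
                    new_results.push(new_map);
                }
            }
            results = new_results;
        } else {
            // Non-array field: add to all existing variants.
            for existing in results.iter_mut() {
                existing.insert(key.clone(), val.clone());
            }
        }
    }
    Some(results.into_iter().map(Value::Object).collect())
}
```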
```rust
//If predicate evaluation fails, we will return the score returned by the
//inner computer, as though no predicate was specified.
tracing::warn!(
    "Predicate evaluation failed in OnlineBetaComputer::evaluate_similarity()"
```
The warning message in the !is_valid_filter branch says “Predicate evaluation failed in OnlineBetaComputer::evaluate_similarity()”, but this branch is taken when encoding the filter expression failed (not when evaluation failed), and the type is InlineBetaComputer. The message should reflect the actual failure mode to avoid misleading diagnostics.
Suggested change:

```diff
-//If predicate evaluation fails, we will return the score returned by the
-//inner computer, as though no predicate was specified.
-tracing::warn!(
-    "Predicate evaluation failed in OnlineBetaComputer::evaluate_similarity()"
+//If the filter expression is invalid or failed to encode, we will return the score
+//returned by the inner computer, as though no predicate was specified.
+tracing::warn!(
+    "Filter expression was invalid or failed to encode in InlineBetaComputer::evaluate_similarity(); proceeding without filtering"
```
```rust
/// Create a generic multi-threaded runtime with `num_threads`.
pub(crate) fn runtime(num_threads: usize) -> anyhow::Result<tokio::runtime::Runtime> {
    Ok(tokio::runtime::Builder::new_multi_thread()
        .worker_threads(num_threads)
        .build()?)
}
```
This new runtime() builder doesn’t call enable_all() (or at least enable_time/enable_io). If any benchmark path (now or later) uses Tokio timers, networking, or async FS, the runtime will fail at runtime with missing driver errors. Consider enabling the required drivers explicitly to make this utility safe to reuse.
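A minimal sketch of the same builder with the drivers enabled, assuming the existing anyhow-based signature is kept:

```rust
/// Create a generic multi-threaded runtime with `num_threads`, with the time and
/// I/O drivers enabled so reused code can rely on timers, networking, and async FS.
pub(crate) fn runtime(num_threads: usize) -> anyhow::Result<tokio::runtime::Runtime> {
    Ok(tokio::runtime::Builder::new_multi_thread()
        .worker_threads(num_threads)
        .enable_all()
        .build()?)
}
```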
```rust
    search_l,
    recall: recall_metrics,
    qps,
    wall_clock_time: rep_latencies,
    mean_latency: mean,
    p90_latency: p90,
    p99_latency: p99,
    mean_cmps,
    mean_hops,
    per_query_details: Some(per_query_details),
})
```
per_query_details is always set to Some(...) even when it’s empty, which forces this field into the serialized output for every run. For larger query sets this can bloat benchmark output significantly when many queries have imperfect recall. Consider only setting this to Some when you actually have entries (or gating it behind a debug/verbose option).
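A small sketch of the gating the comment suggests, assuming the local `per_query_details` vector from the snippet above:

```rust
// Only include per-query details when there is something to report, so the
// serialized result stays small for runs where every query has perfect recall.
let per_query_details = if per_query_details.is_empty() {
    None
} else {
    Some(per_query_details)
};
```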
```rust
    .accept(&pe)?
{
    filtered_candidates.push(Neighbor::new(candidate.id, candidate.distance));
}
```
When is_valid_filter is false, post_process never pushes any candidates, so filtered_candidates stays empty and the search returns 0 results. If the intent is to fall back to unfiltered search when the predicate can’t be encoded, this should forward the original candidates (or treat it as match-all) instead of filtering everything out.
Suggested change:

```rust
        }
    }
} else {
    // If the filter is not valid/encodable, fall back to unfiltered candidates.
    filtered_candidates.push(Neighbor::new(candidate.id, candidate.distance));
```
```rust
//If predicate evaluation fails, we will return the score returned by the
//inner computer, as though no predicate was specified.
tracing::warn!(
    "Predicate evaluation failed in OnlineBetaComputer::evaluate_similarity()"
);
```
The !is_valid_filter branch logs a warning on every evaluate_similarity call. This function is called extremely frequently during search, so this will spam logs and can materially degrade performance. Consider logging once when building the query computer (already done) and making evaluate_similarity silently fall back to unfiltered scoring when the filter is invalid.
Suggested change:

```diff
-//If predicate evaluation fails, we will return the score returned by the
-//inner computer, as though no predicate was specified.
-tracing::warn!(
-    "Predicate evaluation failed in OnlineBetaComputer::evaluate_similarity()"
-);
+// If the filter is invalid, treat it as though no predicate was specified
+// and return the score from the inner computer without additional logging.
```
```diff
 pub fn new(ast_expr: &ASTExpr, attribute_map: Arc<RwLock<AttributeEncoder>>) -> Self {
     let mut mapper = ASTLabelIdMapper::new(attribute_map);
-    let ast_id_expr = ast_expr.accept(&mut mapper)?;
-    Ok(Self { ast_id_expr })
+    match ast_expr.accept(&mut mapper) {
+        Ok(ast_id_expr) => Self {
+            ast_id_expr: Some(ast_id_expr),
+        },
+        Err(_e) => Self { ast_id_expr: None },
+    }
```
EncodedFilterExpr::new now swallows all mapping errors and returns None without preserving/logging the underlying error. This makes it hard to distinguish “empty predicate” vs “unknown attribute/value” vs other failures, and can cause silent fallback to unfiltered behavior. Consider returning a Result again, or storing the error for callers to surface in a controlled way.
I'm going to echo this suggestion - are we sure that dropping errors is the right move?
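One possible shape for keeping the error around, as both comments suggest. This is a standalone sketch: `AstIdExpr` is a placeholder for the crate's encoded expression type, and the use of `anyhow::Error` is an assumption:

```rust
// Placeholder for the crate's encoded id-expression type.
struct AstIdExpr;

// Sketch: preserve the mapping failure instead of collapsing it to `None`, so callers
// can distinguish "no predicate" from "predicate failed to encode".
struct EncodedFilterExpr {
    ast_id_expr: Result<AstIdExpr, anyhow::Error>,
}

impl EncodedFilterExpr {
    fn from_encode_result(result: Result<AstIdExpr, anyhow::Error>) -> Self {
        Self { ast_id_expr: result }
    }

    /// Callers decide whether to surface the error or deliberately fall back.
    fn expr(&self) -> Result<&AstIdExpr, &anyhow::Error> {
        self.ast_id_expr.as_ref()
    }
}
```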
```rust
    i8_val as f64
} else {
    0.0
};
```
Switching on TypeId is not a good idea for a number of reasons:
- It's not easily extendable. If type-ids are littered throughout a codebase, they are difficult to keep track of and maintain.
- Failed matches are runtime errors instead of compile time errors. This loses the confidence of "if it compiles, it works".
- In the context of this particular problem, it's not clear to me that simply returning `0.0` is the correct behavior. This is silently succeeding with an incorrect value rather than notifying the caller that the operation is unsupported. While it is correct in the moment, since there is another check that bails earlier on an unsupported data type, it leaves the code in a brittle state. The earlier check can be extended to support `float16`; everything will still compile, but this branch will silently fail.
Could this be addressed by taking the result of ComputeMedoid and using a helper like the following?

```rust
fn find_closest<T>(x: MatrixView<'_, T>, y: &[T]) -> Option<usize>
where
    for<'a> SquaredL2: PureDistanceFunction<&'a [T], &'a [T], f32>,
{
    let mut min_dist = f32::INFINITY;
    let mut min_ind = x.nrows();
    for (i, row) in x.row_iter() {
        let dist = SquaredL2::evaluate(row, y);
        if dist < min_dist {
            min_dist = dist;
            min_ind = i;
        }
    }
    // No closest neighbor found.
    if min_ind == x.nrows() {
        None
    } else {
        Some(min_ind)
    }
}
```

```rust
_ => Err(anyhow::anyhow!(
    "Unsupported data type: {:?}. Supported types: float32, uint8, int8.",
    build.data_type
)),
```
This pattern is what `DispatchRule` is trying to avoid. The reason this is problematic is because:

- It does not provide any error when the benchmark input is first decoded and checked. This means that if you are running a benchmark that tries to use `float16`, it will not error until that benchmark is run. Building the logic into `DispatchRule` will catch the error (and provide an at-least marginally useful diagnostic).
- With the integration into `DispatchRule`, the supported types will be available when listing `benchmarks` for the executable, which helps with discoverability.
- Using a `register` chain really helps when commenting out backends to reduce compile times while iterating.

You can see a simple example dispatching on just DataType here. All that's really needed is to lift the type parameter T to DocumentIndexJob.
```rust
where
    T: bytemuck::Pod + Copy + Send + Sync + 'static + std::fmt::Debug,
    T: diskann::graph::SampleableForStart + diskann_utils::future::AsyncFriendly,
    T: diskann::utils::VectorRepr + diskann_utils::sampling::WithApproximateNorm,
```
`VectorRepr` implies `Pod`, `Copy`, `Send`, `Sync`, and `Debug`. This where clause can be simplified to

```rust
T: diskann::utils::VectorRepr
    + SampleableForStart
    + WithApproximateNorm
    + 'static
```

Looking over this - I am now wondering why SampleableForStart does not require WithApproximateNorm. But that's another matter.
```rust
    data_arc.clone(),
    attributes_arc.clone(),
    output.draw_target(),
)?;
```
The explicit build and search loops should no longer be needed. The diskann-benchmark-core traits now provide generic build and search routines that I strongly recommend using.

You can see an example implementation of the build and search traits, as well as their use in diskann-benchmark. You may be able to use the KNN trait directly.

Here's why you might want to use these:

- Less maintenance: the job of parallelizing, running, etc. is delegated to `diskann-benchmark-core`. You don't have to maintain a `ControlBlock` or any of that.
- Faster compile times: the routines in `diskann-benchmark-core` are fairly carefully designed to avoid excess monomorphizations, and this has a noticeable impact on compile time.
- More flexibility: the build algorithm gives you finer control over the parallelization strategy mostly for free.
```rust
pub(crate) l_build: usize,
pub(crate) alpha: f32,
#[serde(default = "default_num_threads")]
pub(crate) num_threads: usize,
```
Recommend always requiring `num_threads`. When running benchmarks, it's always better to be explicit, especially when it comes to the number of threads.
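A minimal sketch of the requested change, assuming the fields above live in a serde-deserialized params struct; the struct name here is hypothetical. Dropping the `#[serde(default = ...)]` attribute makes the field mandatory in the config:

```rust
use serde::Deserialize;

// Hypothetical struct name; the point is the absence of `#[serde(default)]`,
// so configs that omit num_threads fail to deserialize instead of silently
// falling back to a default thread count.
#[derive(Deserialize)]
pub(crate) struct DocumentIndexBuildParams {
    pub(crate) l_build: usize,
    pub(crate) alpha: f32,
    pub(crate) num_threads: usize,
}
```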
```rust
let is_valid_filter = id_query.encoded_filter_expr().is_some();
if !is_valid_filter {
    tracing::warn!(
        "Failed to convert {} into an id expr. This will now be an unfiltered search.",
```
What's the motivation for falling back to unfiltered instead of fully failing? Using an unfiltered search risks returning invalid neighbors, which could break a caller expecting filtered results.
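For reference, a hedged sketch of the fail-fast alternative this question implies, assuming the enclosing function returns `anyhow::Result`:

```rust
// Hypothetical: refuse to run the search when the predicate could not be encoded,
// rather than silently degrading to an unfiltered search.
if id_query.encoded_filter_expr().is_none() {
    anyhow::bail!("filter expression could not be encoded into an id expression");
}
```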
```rust
inner_computer: Inner,
beta_value: f32,
filter_expr: EncodedFilterExpr,
is_valid_filter: bool,
```
The whole fallback to unfiltered feels dubious to me. But if it's something we're convinced is correct, you can make the implementation more "correct-by-construction" by using `Option<EncodedFilterExpr>` instead of tacking on a bool. This ensures that if the filter expression is invalid, it's impossible to actually use.
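A sketch of the `Option`-based shape described here; `Inner` and `EncodedFilterExpr` are stand-in types mirroring the field names in the snippet above:

```rust
// Stand-in types for the crate's inner computer and encoded filter expression.
struct Inner;
struct EncodedFilterExpr;

struct InlineBetaComputer {
    inner_computer: Inner,
    beta_value: f32,
    // `None` means the predicate failed to encode. There is no separate bool to
    // keep in sync, and an invalid expression cannot be used by accident.
    filter_expr: Option<EncodedFilterExpr>,
}
```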
```rust
 * Licensed under the MIT license.
 */

use std::{collections::HashMap, fmt::Debug, future::Future};
```
What happened to the diff in this file?
```rust
/// Support for Vec<T> queries that delegates to the [T] impl via deref.
/// This allows InlineBetaStrategy to use Vec<T> queries with FullAccessor.
impl<T, Q, D, Ctx> BuildQueryComputer<Vec<T>> for FullAccessor<'_, T, Q, D, Ctx>
```
I don't think this kind of duplication is a rabbit hole we want to go down. I suspect it will be impossible to maintain.
Instead, you can likely use the Lower trait to transform the Vec<T> in the filtered search to a [T] for the inner provider.
```text
version https://git-lfs.github.com/spec/v1
oid sha256:7f8b6b99ca32173557689712d3fb5da30c5e4111130fd2accbccf32f5ce3e47e
size 17702
oid sha256:92576896b10780a2cd80a16030f8384610498b76453f57fadeacb854379e0acf
```
Why is an update to this file needed? (Just curious.)
Integrate filter search into benchmarks. This PR allows users to run filter benchmarks using the inline beta search algorithm. The PR also adds a "DocumentInsertStrategy" which allows users to insert "documents" (== {vector, attributes}) into the DiskANN index and query using vectors + predicates on attributes.