
bigquery-storage read session to_dataframe() loses 1152 records per stream #14900

@patricksurry

Description


Determine this is the right repository

  • I determined this is the correct repository in which to report this bug.

Summary of the issue

Context
This is related to Google Cloud Support ticket 65020767 where I was asked to file a report here as they consider it out of scope.

I am using the bigquery-storage read client to fetch multiple streams from a static BigQuery table. I noticed that client.read_rows(stream.name).to_dataframe() loses 1152 records per stream, whereas iterating client.read_rows(stream.name).rows() produces the correct result.

Expected Behavior:
See the example code below. A SQL count(*) on the table gives the correct total of 7,303,007 records.

I expected that my code with to_dataframe() would fetch the same number of records.
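For reference, the count can be cross-checked with the standard BigQuery client (a minimal sketch; the column alias n is just illustrative):

from google.cloud import bigquery

# Ground-truth row count via a plain SQL query.
bq = bigquery.Client(project="gcp-hopper-ds-research")
query = "SELECT COUNT(*) AS n FROM `gcp-hopper-ds-research.psurry.tmp_trips_10M`"
count = next(iter(bq.query(query).result())).n
print(count)  # 7,303,007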

Actual Behavior:

Instead I received the results below for different values of the max_stream_count argument to create_read_session. By default (no argument) it produced 40 streams. In each case the total row count fell short of the correct value by 1152 rows per stream, i.e. 9 * 128 * num_streams in total.

max_stream_count,  total rows,  per-stream counts
1,   7301855,  [7301855]           # missing 1152 rows = 1 * 9*128
2,   7300703,  [3467413, 3833290]  # missing 2304 rows = 2 * 9*128
8,   7293791,  [1094808, 911159, 911286, 911386, 910562, 912084, 912954, 729552]  # missing 9216 rows = 8 * 9*128
none (default, 40 streams),  7256927,  [181563, 180985, 181974, 181696, 181770, 181694, 181110, 181662, 181353, 181152, 181342, 180928, 182225, 181868, 180987, 180512, 181476, 181825, 181460, 181902, 181510, 180666, 181595, 180857, 181766, 180494, 181289, 181609, 181403, 182320, 181416, 181210, 182091, 181602, 181844, 180852, 181417, 182208, 181037, 180257]  # missing 46080 rows = 40 * 9*128
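As a sanity check on the arithmetic, each run is short by exactly 9 * 128 = 1152 rows per stream:

expected = 7303007  # SQL count(*) on the table
observed = {1: 7301855, 2: 7300703, 8: 7293791, 40: 7256927}  # num_streams -> total rows
for num_streams, total in observed.items():
    assert expected - total == num_streams * 9 * 128  # 1152 rows missing per stream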

API client name and version

google.cloud.bigquery_storage==2.34.0

Reproduction steps: code

from google.cloud.bigquery_storage import BigQueryReadClient, ReadSession, DataFormat

client = BigQueryReadClient()

project_id = "gcp-hopper-ds-research"
dataset_id = "psurry"
table_id = "tmp_trips_10M"

# Configure the read session
read_session = ReadSession(
    table=f'projects/{project_id}/datasets/{dataset_id}/tables/{table_id}',
    data_format=DataFormat.AVRO,
)

# Create the read session
session = client.create_read_session(
    parent=f"projects/{project_id}",
    read_session=read_session,
)
print(f"Creating {len(session.streams)} streams for reading.")

results = [
    len(client.read_rows(stream.name).to_dataframe())  # loses 9*128 = 1152 records per stream
    # sum(1 for _ in client.read_rows(stream.name).rows())  # gives the correct count
    for stream in session.streams
]
print(len(results), sum(results), results)
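A per-stream cross-check of the two counting paths (a sketch, reusing the session created above) shows the discrepancy directly:

# Compare to_dataframe() against iterating rows() for each stream.
for stream in session.streams:
    df_count = len(client.read_rows(stream.name).to_dataframe())
    row_count = sum(1 for _ in client.read_rows(stream.name).rows())
    print(stream.name, df_count, row_count, row_count - df_count)  # difference is 1152 per stream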

Reproduction steps: supporting files

No response

Reproduction steps: actual results

No response

Reproduction steps: expected results

file: output.txt

OS & version + platform

macOS 15.7.2

Python environment

3.12.11

Python dependencies

google.cloud.bigquery_storage==2.34.0
fastavro==1.12.1
pandas==2.3.3

Additional context

No response

Metadata

    Labels

    api: bigquerystorage (Issues related to the BigQuery Storage API)
    priority: p1 (Important issue which blocks shipping the next release. Will be fixed prior to next release.)
    type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)
