Description
Determine this is the right repository
- I determined this is the correct repository in which to report this bug.
Summary of the issue
Context
This is related to Google Cloud Support ticket 65020767, where I was asked to file a report here because they consider the issue out of their scope.
I am using the BigQuery Storage read client to fetch multiple streams from a static BigQuery table. I noticed that client.read_rows(stream.name).to_dataframe() loses 1152 records per stream, whereas client.read_rows(stream.name).rows() produces the correct result.
Expected Behavior:
See the example code below. A SQL count(*) on the table gives the correct total of 7,303,007 records.
I expected the code using to_dataframe() to fetch the same number of records.
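For reference, here is a minimal sketch (not part of the original reproduction) of how the expected total can be confirmed, assuming the separate google-cloud-bigquery client library is available:

# Sketch: confirm the expected row count with a SQL count(*) query.
# Assumes google-cloud-bigquery is installed and the caller can query the table.
from google.cloud import bigquery

bq_client = bigquery.Client(project="gcp-hopper-ds-research")
query = "SELECT COUNT(*) AS n FROM `gcp-hopper-ds-research.psurry.tmp_trips_10M`"
expected_total = next(iter(bq_client.query(query).result())).n
print(expected_total)  # 7,303,007 for the table in this report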
Actual Behavior:
Instead I received the results below for different values of the max_stream_count argument. By default (no argument) it produced 40 streams. In each case the total row count was 1152 * num_streams (i.e. 9*128 rows per stream) below the correct value.
max_stream_count, total rows [per-stream rows]
1, 7301855 [7301855]  # missing 1152 rows = 1 * 9*128
2, 7300703 [3467413, 3833290]  # missing 2304 rows = 2 * 9*128
8, 7293791 [1094808, 911159, 911286, 911386, 910562, 912084, 912954, 729552]  # missing 9216 rows = 8 * 9*128
none (40 by default), 7256927 [181563, 180985, 181974, 181696, 181770, 181694, 181110, 181662, 181353, 181152, 181342, 180928, 182225, 181868, 180987, 180512, 181476, 181825, 181460, 181902, 181510, 180666, 181595, 180857, 181766, 180494, 181289, 181609, 181403, 182320, 181416, 181210, 182091, 181602, 181844, 180852, 181417, 182208, 181037, 180257]  # missing 46080 rows = 40 * 9*128
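A self-contained diagnostic sketch that may help narrow this down (my assumption about where to look, not a confirmed root cause): compare each page's server-reported num_items against the length of the dataframe built from that page, for a single stream.

from google.cloud.bigquery_storage import BigQueryReadClient, DataFormat, ReadSession

# Diagnostic sketch: check whether any individual page loses rows in to_dataframe().
client = BigQueryReadClient()
table = "projects/gcp-hopper-ds-research/datasets/psurry/tables/tmp_trips_10M"
session = client.create_read_session(
    parent="projects/gcp-hopper-ds-research",
    read_session=ReadSession(table=table, data_format=DataFormat.AVRO),
    max_stream_count=1,
)
rows_iterable = client.read_rows(session.streams[0].name).rows()
for i, page in enumerate(rows_iterable.pages):
    df = page.to_dataframe()
    if page.num_items != len(df):
        print(f"page {i}: num_items={page.num_items}, to_dataframe rows={len(df)}")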
API client name and version
google.cloud.bigquery_storage==2.34.0
Reproduction steps: code
from google.cloud.bigquery_storage import BigQueryReadClient, ReadSession, DataFormat

client = BigQueryReadClient()

project_id = "gcp-hopper-ds-research"
dataset_id = "psurry"
table_id = "tmp_trips_10M"

# Configure the read session
read_session = ReadSession(
    table=f'projects/{project_id}/datasets/{dataset_id}/tables/{table_id}',
    data_format=DataFormat.AVRO,
)

# Create the read session
session = client.create_read_session(
    parent=f"projects/{project_id}",
    read_session=read_session,
)
print(f"Creating {len(session.streams)} streams for reading.")

results = [
    len(client.read_rows(stream.name).to_dataframe())  # loses 9*128 = 1152 records per stream
    # sum(1 for _ in client.read_rows(stream.name).rows())  # gives correct result
    for stream in session.streams
]
print(len(results), sum(results), results)
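As a possible workaround sketch (my own assumption, not a confirmed fix): build each dataframe from the rows() iterator, which produced the correct counts in my testing, instead of calling to_dataframe() on the stream. This reuses client and session from the code above.

import pandas as pd

# Workaround sketch: materialize the Avro records yielded by rows() and build the
# DataFrame with pandas directly, bypassing to_dataframe().
frames = [
    pd.DataFrame(list(client.read_rows(stream.name).rows()))
    for stream in session.streams
]
df = pd.concat(frames, ignore_index=True)
print(len(df))  # expected to match the SQL count(*) total, since rows() is correct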
Reproduction steps: supporting files
No response
Reproduction steps: actual results
No response
Reproduction steps: expected results
No response
OS & version + platform
macOS 15.7.2
Python environment
3.12.11
Python dependencies
google.cloud.bigquery_storage==2.34.0
fastavro==1.12.1
pandas==2.3.3
Additional context
No response