-
Notifications
You must be signed in to change notification settings - Fork 86
feat: implement DataWriter for Iceberg data files #552
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Implements DataWriter class for writing Iceberg data files as part of issue apache#441 (task 2). Implementation: - Factory method DataWriter::Make() for creating writer instances - Support for Parquet and Avro file formats via WriterFactoryRegistry - Complete DataFile metadata generation including partition info, column statistics, serialized bounds, and sort order ID - Proper lifecycle management with Initialize/Write/Close/Metadata - PIMPL idiom for ABI stability Tests: - 12 comprehensive unit tests covering creation, write/close lifecycle, metadata generation, error handling, and feature validation - All tests passing (12/12) Related to apache#441
8944a75 to
a201953
Compare
|
|
||
| ICEBERG_ASSIGN_OR_RAISE(writer_, | ||
| WriterFactoryRegistry::Open(options_.format, writer_options)); | ||
| return {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is odd that an empty structure is always returned. Also, since this is initialization why not doing in the ctor?
| if (closed_) { | ||
| return InvalidArgument("Writer already closed"); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could see a case for making close idempotent, is there any strong reason why we want to return this error instead of no op for example?
| return InvalidArgument("Writer already closed"); | ||
| } | ||
| ICEBERG_RETURN_UNEXPECTED(writer_->Close()); | ||
| closed_ = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this class address thread safety?
| TEST_F(DataWriterTest, CreateWithParquetFormat) { | ||
| DataWriterOptions options{ | ||
| .path = "test_data.parquet", | ||
| .schema = schema_, | ||
| .spec = partition_spec_, | ||
| .partition = PartitionValues{}, | ||
| .format = FileFormatType::kParquet, | ||
| .io = file_io_, | ||
| .properties = {{"write.parquet.compression-codec", "uncompressed"}}, | ||
| }; | ||
|
|
||
| auto writer_result = DataWriter::Make(options); | ||
| ASSERT_THAT(writer_result, IsOk()); | ||
| auto writer = std::move(writer_result.value()); | ||
| ASSERT_NE(writer, nullptr); | ||
| } | ||
|
|
||
| TEST_F(DataWriterTest, CreateWithAvroFormat) { | ||
| DataWriterOptions options{ | ||
| .path = "test_data.avro", | ||
| .schema = schema_, | ||
| .spec = partition_spec_, | ||
| .partition = PartitionValues{}, | ||
| .format = FileFormatType::kAvro, | ||
| .io = file_io_, | ||
| }; | ||
|
|
||
| auto writer_result = DataWriter::Make(options); | ||
| ASSERT_THAT(writer_result, IsOk()); | ||
| auto writer = std::move(writer_result.value()); | ||
| ASSERT_NE(writer, nullptr); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: The two tests are quite similar, it is probably possible to leverage a function to reduce duplication
| // Check length before close | ||
| auto length_result = writer->Length(); | ||
| ASSERT_THAT(length_result, IsOk()); | ||
| EXPECT_GT(length_result.value(), 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: check the size of the data passed to the write function?
| if (!writer_) { | ||
| return InvalidArgument("Writer not initialized"); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if (!writer_) { | |
| return InvalidArgument("Writer not initialized"); | |
| } | |
| ICEBERG_PRECHECK(writer_, "Writer not initialized"); |
nit, this should make the code shorter.
| } | ||
|
|
||
| Result<FileWriter::WriteResult> Metadata() { | ||
| if (!closed_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: use ICEBERG_CHECK here
| EXPECT_GT(length.value(), 0); | ||
| } | ||
|
|
||
| } // namespace |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: move this closing namespace curly before the first TEST_F?
Implements DataWriter class for writing Iceberg data files as part of issue #441 (task 2).
Implementation:
Related to #441