@shangxinli
Contributor

Implements DataWriter class for writing Iceberg data files as part of issue #441 (task 2).

Implementation:

  • Factory method DataWriter::Make() for creating writer instances
  • Support for Parquet and Avro file formats via WriterFactoryRegistry
  • Complete DataFile metadata generation including partition info, column statistics, serialized bounds, and sort order ID
  • Proper lifecycle management with Initialize/Write/Close/Metadata
  • PIMPL idiom for ABI stability
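
For readers unfamiliar with the pattern, the factory method plus PIMPL combination listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `SketchDataWriter`, its members, and the row counter are hypothetical names, and the real `DataWriter::Make()` takes a `DataWriterOptions` and returns a `Result`.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <utility>

// ---- What would live in the public header (hypothetical names) ----
class SketchDataWriter {
 public:
  // Factory method: the only way to obtain an instance.
  static std::unique_ptr<SketchDataWriter> Make(std::string path);
  ~SketchDataWriter();  // defined below, where Impl is a complete type

  void Write(std::int64_t num_rows);
  void Close();
  bool closed() const;
  std::int64_t rows_written() const;

 private:
  class Impl;  // opaque to callers, so its layout can change without an ABI break
  explicit SketchDataWriter(std::unique_ptr<Impl> impl);
  std::unique_ptr<Impl> impl_;
};

// ---- What would live in the .cc file ----
class SketchDataWriter::Impl {
 public:
  explicit Impl(std::string path) : path_(std::move(path)) {}
  std::string path_;
  std::int64_t rows_ = 0;
  bool closed_ = false;
};

SketchDataWriter::SketchDataWriter(std::unique_ptr<Impl> impl)
    : impl_(std::move(impl)) {}
SketchDataWriter::~SketchDataWriter() = default;

std::unique_ptr<SketchDataWriter> SketchDataWriter::Make(std::string path) {
  auto impl = std::make_unique<Impl>(std::move(path));
  return std::unique_ptr<SketchDataWriter>(new SketchDataWriter(std::move(impl)));
}

void SketchDataWriter::Write(std::int64_t num_rows) {
  // Writes after Close are ignored in this sketch.
  if (!impl_->closed_) impl_->rows_ += num_rows;
}
void SketchDataWriter::Close() { impl_->closed_ = true; }
bool SketchDataWriter::closed() const { return impl_->closed_; }
std::int64_t SketchDataWriter::rows_written() const { return impl_->rows_; }
```

The public class holds only a pointer to `Impl`, so fields can be added to `Impl` later without changing the size or layout that callers compile against.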

Tests:
- 12 comprehensive unit tests covering creation, write/close lifecycle,
  metadata generation, error handling, and feature validation
- All tests passing (12/12)

Related to apache#441
@shangxinli force-pushed the implement-data-file-writer branch from 8944a75 to a201953 on January 31, 2026 17:59

ICEBERG_ASSIGN_OR_RAISE(writer_,
                        WriterFactoryRegistry::Open(options_.format, writer_options));
return {};
Contributor

It is odd that an empty structure is always returned. Also, since this is initialization, why not do it in the ctor?
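
For context on the question above, one common reason to keep fallible setup out of the constructor is that a constructor cannot return an error value; it can only throw. The sketch below illustrates that trade-off under stated assumptions: `Status` here is a hypothetical stand-in for the project's status type (where `return {};` presumably denotes success with no payload), and `SketchWriter` is not the class in this PR.

```cpp
#include <memory>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for a status type; an empty `error` means success.
struct Status {
  std::optional<std::string> error;
  bool ok() const { return !error.has_value(); }
};

// Hypothetical writer whose setup step can fail.
class SketchWriter {
 public:
  // The factory runs the fallible step, so callers never observe a
  // half-initialized writer. On failure it returns nullptr and fills *status.
  static std::unique_ptr<SketchWriter> Make(const std::string& path,
                                            Status* status) {
    std::unique_ptr<SketchWriter> writer(new SketchWriter(path));
    *status = writer->Initialize();
    if (!status->ok()) return nullptr;
    return writer;
  }

  const std::string& path() const { return path_; }

 private:
  // The constructor only stores state and cannot fail.
  explicit SketchWriter(std::string path) : path_(std::move(path)) {}

  // Fallible setup returns a Status instead of throwing.
  Status Initialize() {
    if (path_.empty()) return Status{std::string("path must not be empty")};
    return Status{};  // success carries no payload, hence the empty value
  }

  std::string path_;
};
```

Whether this split is worth it here depends on whether the project avoids exceptions; if it does not, doing the work in the constructor would also be reasonable.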

Comment on lines +62 to +64
if (closed_) {
  return InvalidArgument("Writer already closed");
}
Contributor

I could see a case for making Close idempotent. Is there a strong reason to return this error instead of treating a repeated call as a no-op?
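
To make the idempotent alternative concrete, here is a minimal sketch. It assumes a hypothetical `Status` stand-in and a counter in place of the real underlying `writer_->Close()` call; none of these names come from the PR.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-in for a status type; an empty `error` means success.
struct Status {
  std::optional<std::string> error;
  bool ok() const { return !error.has_value(); }
};

// Hypothetical writer with an idempotent Close.
class SketchClosableWriter {
 public:
  Status Close() {
    // A repeated call is a no-op that reports success.
    if (closed_) return Status{};
    // Stand-in for closing the underlying file writer exactly once.
    ++underlying_close_calls_;
    // Mark closed only after the underlying close has succeeded, so a
    // failed close could still be retried.
    closed_ = true;
    return Status{};
  }

  bool closed() const { return closed_; }
  int underlying_close_calls() const { return underlying_close_calls_; }

 private:
  bool closed_ = false;
  int underlying_close_calls_ = 0;
};
```

The trade-off is that a no-op hides double-close bugs in callers, while the error surfaces them; which is preferable depends on how the writer is expected to be used.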

    return InvalidArgument("Writer already closed");
  }
  ICEBERG_RETURN_UNEXPECTED(writer_->Close());
  closed_ = true;
Contributor

Should this class address thread safety?
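
If thread safety were in scope, one possible shape is a mutex guarding the shared state, sketched below. This is only an illustration with hypothetical names (`SketchLockedWriter`, a row counter instead of real file output); documenting the class as single-threaded would be an equally valid answer.

```cpp
#include <cstdint>
#include <mutex>

// Hypothetical writer that serializes Write and Close with a mutex.
class SketchLockedWriter {
 public:
  // Returns false if the writer has already been closed.
  bool Write(std::int64_t num_rows) {
    std::lock_guard<std::mutex> lock(mu_);
    if (closed_) return false;
    rows_ += num_rows;
    return true;
  }

  void Close() {
    std::lock_guard<std::mutex> lock(mu_);
    closed_ = true;
  }

  std::int64_t rows() const {
    std::lock_guard<std::mutex> lock(mu_);
    return rows_;
  }

 private:
  mutable std::mutex mu_;  // mutable so const accessors can lock
  std::int64_t rows_ = 0;
  bool closed_ = false;
};
```

Checking `closed_` and updating `rows_` under the same lock is what prevents a Write from racing with a concurrent Close.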

Comment on lines +78 to +109
TEST_F(DataWriterTest, CreateWithParquetFormat) {
  DataWriterOptions options{
      .path = "test_data.parquet",
      .schema = schema_,
      .spec = partition_spec_,
      .partition = PartitionValues{},
      .format = FileFormatType::kParquet,
      .io = file_io_,
      .properties = {{"write.parquet.compression-codec", "uncompressed"}},
  };

  auto writer_result = DataWriter::Make(options);
  ASSERT_THAT(writer_result, IsOk());
  auto writer = std::move(writer_result.value());
  ASSERT_NE(writer, nullptr);
}

TEST_F(DataWriterTest, CreateWithAvroFormat) {
  DataWriterOptions options{
      .path = "test_data.avro",
      .schema = schema_,
      .spec = partition_spec_,
      .partition = PartitionValues{},
      .format = FileFormatType::kAvro,
      .io = file_io_,
  };

  auto writer_result = DataWriter::Make(options);
  ASSERT_THAT(writer_result, IsOk());
  auto writer = std::move(writer_result.value());
  ASSERT_NE(writer, nullptr);
}
Contributor

nit: The two tests are quite similar; a shared helper function could probably reduce the duplication.
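
One way to factor out the shared setup is a small helper that builds the options from the format, sketched below. The types here (`SketchFormat`, `SketchOptions`) are hypothetical stand-ins rather than the real `FileFormatType` and `DataWriterOptions`, and the helper ignores the schema, spec, IO, and per-format properties that the real tests also set.

```cpp
#include <string>

// Hypothetical stand-ins for the project's format enum and options struct.
enum class SketchFormat { kParquet, kAvro };

struct SketchOptions {
  std::string path;
  SketchFormat format;
};

// Builds options that differ only by format, so each test body shrinks to
// "make options, create writer, assert".
SketchOptions MakeOptionsFor(SketchFormat format) {
  const std::string extension =
      (format == SketchFormat::kParquet) ? "parquet" : "avro";
  return SketchOptions{"test_data." + extension, format};
}
```

With GoogleTest, a value-parameterized test (`TEST_P` instantiated over the two formats with `::testing::Values`) would be another way to get the same effect.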

// Check length before close
auto length_result = writer->Length();
ASSERT_THAT(length_result, IsOk());
EXPECT_GT(length_result.value(), 0);
Contributor

nit: check the size of the data passed to the write function?

Comment on lines +45 to +47
if (!writer_) {
  return InvalidArgument("Writer not initialized");
}
Collaborator

Suggested change
- if (!writer_) {
-   return InvalidArgument("Writer not initialized");
- }
+ ICEBERG_PRECHECK(writer_, "Writer not initialized");

nit: this should make the code shorter.
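
As a rough illustration of why a precondition macro shortens the check, the sketch below defines a hypothetical `SKETCH_PRECHECK` that returns early with an error when the condition is false. It is not the definition of `ICEBERG_PRECHECK`, and `Status` is a stand-in type; both are assumptions made for this example.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-in for a status type; an empty `error` means success.
struct Status {
  std::optional<std::string> error;
  bool ok() const { return !error.has_value(); }
};

// Hypothetical macro: returns an error Status from the enclosing function
// when `cond` is false. The do/while wrapper makes it behave like a single
// statement after an unbraced `if`.
#define SKETCH_PRECHECK(cond, msg)                    \
  do {                                                \
    if (!(cond)) return Status{std::string(msg)};     \
  } while (false)

// The three-line if/return block collapses into one line.
Status WriteSketch(const int* writer) {
  SKETCH_PRECHECK(writer != nullptr, "Writer not initialized");
  return Status{};
}
```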

}

Result<FileWriter::WriteResult> Metadata() {
  if (!closed_) {
Collaborator

nit: use ICEBERG_CHECK here

EXPECT_GT(length.value(), 0);
}

} // namespace
Collaborator

nit: move this closing namespace curly before the first TEST_F?
