Go Parquet: FixedSizeList Bug Reads Values As NULL
This article examines a bug in the Parquet support of the Apache Arrow Go library. When a FixedSizeList<float32> array is written to a Parquet file with pqarrow.FileWriter, the in-memory record is correct before the write, but reading the file back with pqarrow.FileReader.ReadTable yields NULL values. Standard List<float32> arrays do not show this behavior, which points to a discrepancy in how fixed-size lists are processed. The sections below cover the reproduction, the likely cause, and the environment in which the bug occurs.
The Curious Case of NULL Values in FixedSizeList
When working with data serialization and storage, accuracy and reliability are paramount. The Parquet file format, known for its efficient data compression and columnar storage, is a popular choice for many applications. Similarly, the Apache Arrow project provides a standardized, in-memory columnar data format, facilitating high-performance data processing. The pqarrow package in Go acts as a bridge between these two powerful technologies.
However, a peculiar issue arises when FixedSizeList types are used within this ecosystem. When a FixedSizeList<float32> array is written to a Parquet file with pqarrow.FileWriter, the in-memory record is correct immediately before the write. Yet when the data is read back with pqarrow.FileReader.ReadTable, the values come back as NULL. This behavior is not observed with standard List<float32> arrays, which suggests a problem specific to how FixedSizeList types are handled during the Parquet round trip.
This discrepancy can lead to significant data integrity issues, as the retrieved data does not accurately reflect the originally stored information. Understanding the root cause of this bug is crucial for ensuring the reliable use of Arrow Go with Parquet for applications involving fixed-size lists. In the following sections, we will dissect a minimal reproduction case, examine the expected versus actual behavior, and explore the potential underlying causes at the code level.
Reproducing the Bug: A Step-by-Step Guide
To effectively understand and address a bug, it's essential to have a clear and reproducible test case. The following Go code snippet provides a minimal reproduction of the FixedSizeList Parquet issue in Arrow Go. This code demonstrates how writing a FixedSizeList<float32>[8] array with specific values (1 to 8) results in NULL values when read back.
// fixedsize_list_parquet_repro.go
//
// Minimal reproduction for FixedSizeList + Parquet issue in Arrow Go.
// Writes a FixedSizeList<float32>[8] with values [1..8] and reads it back
// via pqarrow. On v14.0.2 the values are read as nulls, while the in-memory
// record before writing is correct.
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	"github.com/apache/arrow/go/v14/arrow"
	"github.com/apache/arrow/go/v14/arrow/array"
	"github.com/apache/arrow/go/v14/arrow/memory"
	"github.com/apache/arrow/go/v14/parquet"
	"github.com/apache/arrow/go/v14/parquet/file"
	"github.com/apache/arrow/go/v14/parquet/pqarrow"
)

func main() {
	const dim = 8
	expected := []float32{1, 2, 3, 4, 5, 6, 7, 8}
	out := filepath.Join(os.TempDir(), "fixedsize_bug.parquet")
	fmt.Println("Parquet file:", out)

	// Schema: FixedSizeList<float32>[8]
	schema := arrow.NewSchema(
		[]arrow.Field{
			{
				Name: "embedding",
				Type: arrow.FixedSizeListOf(int32(dim), arrow.PrimitiveTypes.Float32),
			},
		},
		nil,
	)
	pool := memory.NewGoAllocator()

	// --- Write ---
	f, err := os.Create(out)
	if err != nil {
		panic(err)
	}
	props := parquet.NewWriterProperties()
	awProps := pqarrow.NewArrowWriterProperties()
	pw, err := pqarrow.NewFileWriter(schema, f, props, awProps)
	if err != nil {
		panic(err)
	}
	b := array.NewRecordBuilder(pool, schema)
	defer b.Release()
	flb := b.Field(0).(*array.FixedSizeListBuilder)
	vb := flb.ValueBuilder().(*array.Float32Builder)

	// Single FixedSizeList value [1..8]
	flb.Append(true)
	for _, v := range expected {
		vb.Append(v)
	}
	rec := b.NewRecord()
	defer rec.Release()
	fmt.Println("In-memory record before write:")
	fmt.Println(rec)
	if err := pw.Write(rec); err != nil {
		panic(err)
	}
	// Ensure Parquet footer and metadata are fully written
	if err := pw.Close(); err != nil {
		panic(err)
	}

	// --- Read back via pqarrow ---
	rf, err := os.Open(out)
	if err != nil {
		panic(err)
	}
	defer rf.Close()
	pr, err := file.NewParquetReader(rf)
	if err != nil {
		panic(err)
	}
	defer pr.Close()
	fr, err := pqarrow.NewFileReader(pr, pqarrow.ArrowReadProperties{}, pool)
	if err != nil {
		panic(err)
	}
	tbl, err := fr.ReadTable(context.Background())
	if err != nil {
		panic(err)
	}
	defer tbl.Release()
	fmt.Println("\nExpected values:", expected)
	fmt.Println("Table read back:")
	fmt.Println(tbl)
}
To reproduce the bug, save the code as fixedsize_list_parquet_repro.go, ensure you have the necessary dependencies installed (github.com/apache/arrow/go/v14), and run the program using go run ./fixedsize_list_parquet_repro.go. The output will demonstrate the discrepancy between the in-memory record (correct values) and the table read back from the Parquet file (NULL values).
This code snippet first defines a schema with a single field: a FixedSizeList<float32> of dimension 8. It then creates a Parquet file, writes a record containing the values [1..8] to the file, and subsequently reads the data back. The output clearly shows that the in-memory record before writing contains the expected values, but the table read back from the Parquet file contains NULL values in the embedding column. This reproduction provides a concrete example of the bug in action, allowing developers to investigate and address the issue effectively.
Example Output
When running the reproduction code (e.g., on v14.0.2), you'll observe the following output:
go run ./fixedsize_list_parquet_repro.go
Parquet file: /var/folders/95/j3gr9h157fq0djs38znqgkg80000gn/T/fixedsize_bug.parquet
In-memory record before write:
record:
schema:
fields: 1
- embedding: type=fixed_size_list<item: float32, nullable>[8]
rows: 1
col[0][embedding]: [[1 2 3 4 5 6 7 8]]
Expected values: [1 2 3 4 5 6 7 8]
Table read back:
schema:
fields: 1
- embedding: type=list<list: float32, nullable>
metadata: ["PARQUET:field_id": "-1"]
embedding: [[[(null) (null) (null) (null) (null) (null) (null) (null)]]]
This output illustrates the bug: the in-memory record before writing contains the correct values ([1 2 3 4 5 6 7 8]), while the table read back from the Parquet file shows eight NULL values for embedding. The column also comes back typed as list<list: float32, nullable> rather than fixed_size_list<item: float32, nullable>[8]. The list length survives the round trip, but the element values do not.
Expected vs. Actual Behavior: A Stark Contrast
The core of any bug report lies in clearly defining the expected behavior versus the actual behavior observed. In this case, the expectation is straightforward: if we write a FixedSizeList<float32>[8] array containing the values [1, 2, 3, 4, 5, 6, 7, 8] to a Parquet file, we should be able to read the same values back without any data loss or corruption.
Expected Behavior
The embedding values should be read back from the Parquet file as [1 2 3 4 5 6 7 8], perfectly matching the in-memory FixedSizeList<float32>[8] before the Parquet write operation. This expectation aligns with the fundamental principle of data serialization and deserialization, where the original data should be preserved throughout the process.
Actual Behavior
Instead of the expected outcome, the embedding values are read back as a list of 8 NULL values when using pqarrow.FileReader.ReadTable. This is a significant deviation from the expected behavior and indicates a flaw in the handling of FixedSizeList types. Because the in-memory record is correct before writing, the values must be lost somewhere in the Parquet write/read round trip rather than during array construction.
This stark contrast between the expected and actual behavior underscores the severity of the bug. The unexpected NULL values can lead to incorrect analysis, corrupted data pipelines, and ultimately, unreliable results. Identifying the root cause and implementing a fix is crucial for ensuring the integrity of data stored in Parquet files using Arrow Go.
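The expectation can be written down as a small round-trip check. The sketch below is independent of Arrow: it models a read-back element as a value plus a validity flag, and the `elem` type and `diffRoundTrip` helper are hypothetical names introduced for illustration, not part of pqarrow. It reports every position where the read-back data is NULL or differs from what was written.

```go
package main

import "fmt"

// elem is a hypothetical stand-in for one list element as seen by a reader:
// a value plus a flag saying whether the reader considered it non-null.
type elem struct {
	Value float32
	Valid bool
}

// diffRoundTrip compares the values that were written with the elements that
// were read back and returns one message per mismatch.
func diffRoundTrip(written []float32, readBack []elem) []string {
	var problems []string
	if len(written) != len(readBack) {
		problems = append(problems,
			fmt.Sprintf("length: wrote %d, read %d", len(written), len(readBack)))
	}
	n := len(written)
	if len(readBack) < n {
		n = len(readBack)
	}
	for i := 0; i < n; i++ {
		switch {
		case !readBack[i].Valid:
			problems = append(problems,
				fmt.Sprintf("index %d: wrote %v, read NULL", i, written[i]))
		case readBack[i].Value != written[i]:
			problems = append(problems,
				fmt.Sprintf("index %d: wrote %v, read %v", i, written[i], readBack[i].Value))
		}
	}
	return problems
}

func main() {
	written := []float32{1, 2, 3, 4, 5, 6, 7, 8}

	// What the article reports for v14.0.2: eight elements, all NULL.
	buggy := make([]elem, len(written))

	// What a correct round trip would produce.
	good := make([]elem, len(written))
	for i, v := range written {
		good[i] = elem{Value: v, Valid: true}
	}

	fmt.Println("buggy mismatches:", len(diffRoundTrip(written, buggy)))
	fmt.Println("good mismatches: ", len(diffRoundTrip(written, good)))
}
```

In a real test the `readBack` slice would be filled from the table returned by ReadTable. How that extraction is done is left out here on purpose, since it depends on the column type the reader returns.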
Diving Deep: Likely Root Cause Analysis
To effectively address a bug, it's essential to delve into the codebase and identify the potential root cause. Based on the observed behavior and a code-level analysis, a likely cause for the FixedSizeList Parquet issue can be pinpointed within the parquet/pqarrow/path_builder.go file in the Arrow Go library (specifically, version v14.0.2).
The investigation suggests that the issue stems from how the pathBuilder.Visit function handles FIXED_SIZE_LIST types compared to LIST types. In essence, the FIXED_SIZE_LIST case within pathBuilder.Visit fails to update the p.nullableInParent flag before visiting the child values. This is in contrast to the LIST case, where this flag is correctly set.
The Role of nullableInParent
The nullableInParent flag plays a crucial role in determining the definition level for values within the Parquet file. The addTerminalInfo function, responsible for setting metadata, increments p.info.maxDefLevel when p.nullableInParent is true. This increment in the definition level is critical for correctly interpreting present values during the read operation.
For LIST types, the p.nullableInParent flag is set appropriately, ensuring that present values are assigned a higher definition level. However, for FIXED_SIZE_LIST types, this flag remains false due to the missing update in the pathBuilder.Visit function. Consequently, present values within the FixedSizeList are encoded and decoded with a lower definition level, leading them to be misinterpreted as NULL values during the read process.
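To make the consequence concrete, here is a toy model of how a reader turns definition levels into values and NULLs. It is a simplification and not pqarrow's decoder: `decodeLeaf` and `countNulls` are hypothetical helpers, repetition levels are ignored, and the only rule modeled is that a leaf slot holds a value when its definition level equals the column's maximum definition level. The level numbers (maximum 2, written 1) are illustrative assumptions.

```go
package main

import "fmt"

// decodeLeaf is a simplified, hypothetical model of leaf decoding: a slot
// holds a value only when its definition level equals maxDefLevel. Anything
// lower is reported as NULL (nil). Repetition levels are ignored here.
func decodeLeaf(defLevels []int16, maxDefLevel int16, values []float32) []*float32 {
	out := make([]*float32, len(defLevels))
	next := 0
	for i, d := range defLevels {
		if d == maxDefLevel && next < len(values) {
			v := values[next]
			out[i] = &v
			next++
		}
	}
	return out
}

// countNulls reports how many decoded slots are NULL.
func countNulls(slots []*float32) int {
	n := 0
	for _, s := range slots {
		if s == nil {
			n++
		}
	}
	return n
}

func main() {
	values := []float32{1, 2, 3, 4, 5, 6, 7, 8}
	const readerMax = 2 // illustrative: one level for the list, one for the nullable element

	// Levels as a writer that skipped the nullable-element increment would emit.
	low := []int16{1, 1, 1, 1, 1, 1, 1, 1}
	// Levels as a writer that counted the nullable element would emit.
	full := []int16{2, 2, 2, 2, 2, 2, 2, 2}

	fmt.Println("nulls with low levels: ", countNulls(decodeLeaf(low, readerMax, values)))
	fmt.Println("nulls with full levels:", countNulls(decodeLeaf(full, readerMax, values)))
}
```

In this model the first call yields eight NULLs and the second none, which matches the shape of the reported symptom: the right number of elements, none of them present.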
A Potential Fix
A minimal fix to this issue appears to be setting p.nullableInParent = true within the FIXED_SIZE_LIST branch of the pathBuilder.Visit function, before the call to Visit(larr.ListValues()). This adjustment would mirror the handling of LIST types and ensure that present values within FixedSizeList arrays are correctly interpreted during the read operation.
This code-level analysis provides a strong indication of the bug's origin and a potential solution. However, thorough testing and validation are essential to confirm the fix and ensure that it doesn't introduce any unintended side effects.
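The effect of the proposed change can be sketched with a toy version of the visitor. This is not the code from path_builder.go: the `builder` type and its `visitList`, `visitFixedSizeList`, and `addTerminal` methods are hypothetical stand-ins that keep only the two pieces of state the analysis refers to (maxDefLevel and nullableInParent). The `applyFix` parameter exists purely to compare both behaviors, and the model always treats the element as nullable, as in the reproduction's schema.

```go
package main

import "fmt"

// builder is a hypothetical, stripped-down stand-in for the path builder
// state discussed above. It is not the real pqarrow type.
type builder struct {
	maxDefLevel      int16
	nullableInParent bool
}

// addTerminal models the leaf step: a nullable leaf adds one definition level.
func (b *builder) addTerminal() int16 {
	if b.nullableInParent {
		b.maxDefLevel++
	}
	return b.maxDefLevel
}

// visitList models the LIST branch: one level for the list itself, then the
// flag is set before descending into the child values.
func (b *builder) visitList() int16 {
	b.maxDefLevel++
	b.nullableInParent = true
	return b.addTerminal()
}

// visitFixedSizeList models the FIXED_SIZE_LIST branch. With applyFix false
// the flag is left untouched, which is the behavior the analysis describes.
func (b *builder) visitFixedSizeList(applyFix bool) int16 {
	b.maxDefLevel++
	if applyFix {
		b.nullableInParent = true
	}
	return b.addTerminal()
}

func main() {
	fmt.Println("LIST:                   ", (&builder{}).visitList())
	fmt.Println("FIXED_SIZE_LIST (as is):", (&builder{}).visitFixedSizeList(false))
	fmt.Println("FIXED_SIZE_LIST (fixed):", (&builder{}).visitFixedSizeList(true))
}
```

Under these assumptions the unfixed fixed-size branch ends one definition level short of the LIST branch, and setting the flag closes the gap. Whether the real fix should set the flag unconditionally or derive it from the element field's nullability is a question for the library maintainers.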
Environment Details: Setting the Stage
Understanding the environment in which a bug occurs is crucial for effective debugging and resolution. The FixedSizeList Parquet issue in Arrow Go has been observed under the following conditions:
- Arrow Go: v14.0.2
- Go: 1.21+ (reproduced with go1.24 toolchain)
- OS: macOS (ARM64)
- Reader used: pqarrow.FileReader.ReadTable (the behavior is also visible when inspecting the Parquet file with DuckDB)
These details give developers the context needed to reproduce and fix the bug. The Arrow Go version (v14.0.2) matters most, because the root-cause analysis refers to the code of that version. The Go version (1.21+) and operating system (macOS ARM64) describe the general environment, and pqarrow.FileReader.ReadTable is the read path through which the NULL values were observed.
The mention of DuckDB is also significant. DuckDB is an in-process SQL OLAP database management system that can query Parquet files directly. Because DuckDB reads the file with its own Parquet implementation and shows the same issue, the problem most likely lies in the contents of the Parquet file as written, rather than being specific to pqarrow.FileReader.ReadTable.
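When confirming or filing a report like this, the environment details listed above can be collected from inside the program instead of by hand. The following sketch uses only the Go standard library. The `envInfo` type and `collectEnv` function are names made up for this example, and matching dependencies on the substring "apache/arrow" is an assumption about how the Arrow module path appears in the build info.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
	"strings"
)

// envInfo holds the details worth attaching to a bug report.
type envInfo struct {
	GoVersion string
	OS        string
	Arch      string
	ArrowMods []string // Arrow module versions found in the build info, if any
}

// collectEnv gathers toolchain and platform details from the running binary.
// Arrow versions appear only when the binary was built with module support
// and actually depends on an Arrow Go module.
func collectEnv() envInfo {
	info := envInfo{
		GoVersion: runtime.Version(),
		OS:        runtime.GOOS,
		Arch:      runtime.GOARCH,
	}
	if bi, ok := debug.ReadBuildInfo(); ok {
		for _, dep := range bi.Deps {
			if strings.Contains(dep.Path, "apache/arrow") {
				info.ArrowMods = append(info.ArrowMods, dep.Path+" "+dep.Version)
			}
		}
	}
	return info
}

func main() {
	info := collectEnv()
	fmt.Println("Go:  ", info.GoVersion)
	fmt.Println("OS:  ", info.OS)
	fmt.Println("Arch:", info.Arch)
	for _, m := range info.ArrowMods {
		fmt.Println("Arrow module:", m)
	}
}
```

Calling `collectEnv` at the top of the reproduction program would print the Arrow module version next to the round-trip output, so a reader can see at a glance which release produced the NULL values.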
Conclusion: Towards a Resolution
The FixedSizeList Parquet issue in Arrow Go presents a significant challenge for developers working with fixed-size list data structures within the Parquet ecosystem. The bug, which manifests as NULL values when reading back data written as FixedSizeList<float32> arrays, can lead to data corruption and unreliable analysis.
Through a detailed reproduction case, analysis of expected versus actual behavior, and code-level investigation, a likely root cause has been identified within the parquet/pqarrow/path_builder.go file. The missing update to the p.nullableInParent flag in the FIXED_SIZE_LIST case appears to be the culprit, leading to incorrect definition levels and subsequent misinterpretation of values as NULL.
A potential fix involves setting p.nullableInParent = true within the FIXED_SIZE_LIST branch of the pathBuilder.Visit function, mirroring the handling of LIST types. However, thorough testing and validation are crucial to confirm the fix and ensure its robustness.
The environment details, including the specific Arrow Go version, Go version, operating system, and reader used, provide valuable context for developers working on the resolution. The fact that the issue is also observable with DuckDB further reinforces the need for a comprehensive fix that addresses the underlying data interpretation problem.
By understanding the intricacies of this bug and its potential causes, the community can work collaboratively towards a solution that ensures the reliable use of FixedSizeList types within the Arrow Go and Parquet ecosystem. This will ultimately contribute to the integrity and accuracy of data-driven applications relying on these technologies.
For further information on Apache Arrow and Parquet, consider exploring the official Apache Arrow website.