parquet-go/v2


parquet-go is a pure Go implementation for reading and writing files in the Apache Parquet format.

  • Read and write nested and flat Parquet files
  • Simple to use
  • High performance
  • Comprehensive encoding support
  • New logical types, including geospatial types

Installation

go get github.com/hangxie/parquet-go/v2

What's New in v2

This repository was forked from https://github.com/xitongsys/parquet-go and merges in https://github.com/xitongsys/parquet-go-source. v2 introduces significant improvements and new features:

Major Improvements

  1. Better Error Handling: Most functions now return errors instead of using panic/recover style code
  2. Performance Enhancements:
    • Optimized SkipRows() for faster data skipping
    • Optimized schema reading performance by eliminating redundant tree traversals
    • Reduced lock contention using sync.Map in critical paths
    • Improved memory usage efficiency
  3. Enhanced Type Support: Proper interpretation of logical types and converted types
  4. Apache Parquet Format 2.12.0: Updated to the latest parquet format specification

New Features Since v1

Encoding Support

  • BYTE_STREAM_SPLIT: Full support for INT32/INT64/FIXED_LEN_BYTE_ARRAY types
  • BIT_PACKED: Read support for BIT_PACKED encoding
  • Data Page V2: Complete support for Data Page V2 format
  • Proper validation of encoding/type compatibility at schema stage

New Logical Types

  • FLOAT16: Half-precision floating point numbers stored as FIXED[2], decoded to float32
  • INTEGER: Enhanced integer types with proper bitWidth and signedness mapping
    • 8-bit → int8/uint8
    • 16-bit → int16/uint16
    • 32-bit → int32/uint32
    • 64-bit → int64/uint64
  • UUID: 16-byte values automatically converted to canonical UUID strings
  • GEOMETRY: Planar geospatial coordinates with optional CRS
  • GEOGRAPHY: Spherical geospatial coordinates with optional CRS and edge interpolation algorithm
  • VARIANT: Dynamic type support (limited tooling compatibility)
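For illustration, the FLOAT16 behavior described above (a 2-byte FIXED value decoded to float32) can be sketched in pure Go. This mirrors the documented behavior but is not the library's own code:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// float16ToFloat32 decodes an IEEE 754 half-precision value stored as a
// 2-byte little-endian FIXED_LEN_BYTE_ARRAY into a float32, the same
// widening a FLOAT16 column undergoes on read. Illustrative sketch only.
func float16ToFloat32(b [2]byte) float32 {
	h := binary.LittleEndian.Uint16(b[:])
	sign := uint32(h>>15) << 31
	exp := uint32(h>>10) & 0x1f
	frac := uint32(h) & 0x3ff
	var bits uint32
	switch {
	case exp == 0 && frac == 0: // signed zero
		bits = sign
	case exp == 0: // subnormal half: renormalize into float32
		e := uint32(127 - 15 + 1)
		for frac&0x400 == 0 {
			frac <<= 1
			e--
		}
		bits = sign | e<<23 | (frac&0x3ff)<<13
	case exp == 0x1f: // Inf / NaN
		bits = sign | 0xff<<23 | frac<<13
	default: // normal number: rebias exponent 15 -> 127
		bits = sign | (exp+127-15)<<23 | frac<<13
	}
	return math.Float32frombits(bits)
}

func main() {
	fmt.Println(float16ToFloat32([2]byte{0x00, 0x3c})) // half 0x3C00 is 1.0
}
```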

Geospatial Support (GeoParquet)

Comprehensive support for geospatial data with configurable JSON output modes:

  1. Hex Mode: WKB data as hexadecimal strings
  2. Base64 Mode: WKB data as base64-encoded strings
  3. GeoJSON Mode: RFC 7946 compliant GeoJSON output (default for GEOGRAPHY)
  4. Hybrid Mode: Both GeoJSON and raw WKB together

Features:

  • Configurable coordinate precision
  • Optional CRS reprojection to CRS84
  • Support for Point, LineString, and Polygon geometries
  • Proper handling of CRS and algorithm metadata

See geoparquet.md for detailed documentation.

API Enhancements

  • Reset(): Reset reader to beginning of file
  • ReadStopWithError(): Handle errors during ReadStop operations
  • SkipRowsByIndexWithError(): Handle errors during SkipRowsByIndex operations
  • Clone(): Clone ParquetFileReader interface for concurrent access
  • Page manipulation functions for advanced use cases

Better JSON/BSON Support

  • Proper BSON data decoding in JSON output
  • Improved DATE type output in ISO 8601 format
  • TIME values output as human-readable strings
  • INTERVAL type with proper millisecond precision

Additional Data Sources

  • HTTP reader support for reading parquet files over HTTP
  • Enhanced S3v2 support with versioned object access
  • Improved Azure Blob storage support
  • GoCloud CDK integration for generic blob storage

Bug Fixes Since v1

  • Fixed race conditions in:
    • source/http
    • writer/writer.go (flush operations)
    • lz4_raw compression
  • Fixed panic issues:
    • Handling corrupted parquet files
    • Old-style LIST format compatibility
    • Zero-value unmarshal operations
    • Empty files with zero records
    • Out of bound index errors
  • Fixed encoding issues:
    • Hardcoded encoding bug in column chunks
    • PLAIN_DICTIONARY encoding compatibility
    • Proper encoding validation
  • Fixed data handling:
    • Empty slice handling in decimal comparison
    • Negative decimal values between (-1, 1)
    • Optional scalar field handling
    • Default root name assumptions
  • Fixed metadata:
    • Format version in footer
    • created_by field format
    • Statistics for INTERVAL and geospatial data
  • Fixed GeoJSON output format for multi-geometries

Breaking Changes from v1

Please refer to v1 README.md for v1 documentation. Key breaking changes:

  1. Many functions now return errors instead of panicking
  2. Separated reader and writer interfaces for ParquetFile sources
  3. Updated module path to github.com/hangxie/parquet-go/v2

Quick Start

Writing Parquet Files

package main

import (
    "log"

    "github.com/hangxie/parquet-go/v2/source"
    "github.com/hangxie/parquet-go/v2/writer"
)

type Student struct {
    Name   string  `parquet:"name=name, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN_DICTIONARY"`
    Age    int32   `parquet:"name=age, type=INT32"`
    ID     int64   `parquet:"name=id, type=INT64"`
    Weight float32 `parquet:"name=weight, type=FLOAT"`
    Sex    bool    `parquet:"name=sex, type=BOOLEAN"`
}

func main() {
    fw, err := source.NewLocalFileWriter("output.parquet")
    if err != nil {
        log.Fatal("Can't create file", err)
    }
    defer fw.Close()

    pw, err := writer.NewParquetWriter(fw, new(Student), 4)
    if err != nil {
        log.Fatal("Can't create parquet writer", err)
    }

    num := 10
    for i := 0; i < num; i++ {
        stu := Student{
            Name:   "StudentName",
            Age:    int32(20 + i%5),
            ID:     int64(i),
            Weight: float32(50.0 + float32(i)*0.1),
            Sex:    i%2 == 0,
        }
        if err = pw.Write(stu); err != nil {
            log.Fatal("Write error", err)
        }
    }

    if err = pw.WriteStop(); err != nil {
        log.Fatal("WriteStop error", err)
    }
}

Reading Parquet Files

package main

import (
    "log"

    "github.com/hangxie/parquet-go/v2/reader"
    "github.com/hangxie/parquet-go/v2/source"
)

type Student struct {
    Name   string  `parquet:"name=name, type=BYTE_ARRAY, convertedtype=UTF8"`
    Age    int32   `parquet:"name=age, type=INT32"`
    ID     int64   `parquet:"name=id, type=INT64"`
    Weight float32 `parquet:"name=weight, type=FLOAT"`
    Sex    bool    `parquet:"name=sex, type=BOOLEAN"`
}

func main() {
    fr, err := source.NewLocalFileReader("output.parquet")
    if err != nil {
        log.Fatal("Can't open file", err)
    }
    defer fr.Close()

    pr, err := reader.NewParquetReader(fr, new(Student), 4)
    if err != nil {
        log.Fatal("Can't create parquet reader", err)
    }
    defer pr.ReadStop()

    num := int(pr.GetNumRows())
    students := make([]Student, num)
    if err = pr.Read(&students); err != nil {
        log.Fatal("Read error", err)
    }

    for _, stu := range students {
        log.Printf("%+v\n", stu)
    }
}

Type System

Primitive Types

| Primitive Type | Go Type |
|---|---|
| BOOLEAN | bool |
| INT32 | int32 |
| INT64 | int64 |
| INT96 (deprecated) | string |
| FLOAT | float32 |
| DOUBLE | float64 |
| BYTE_ARRAY | string |
| FIXED_LEN_BYTE_ARRAY | string |

Logical Types

| Logical Type | Primitive Type | Go Type |
|---|---|---|
| UTF8 | BYTE_ARRAY | string |
| INT_8 | INT32 | int32 |
| INT_16 | INT32 | int32 |
| INT_32 | INT32 | int32 |
| INT_64 | INT64 | int64 |
| UINT_8 | INT32 | int32 |
| UINT_16 | INT32 | int32 |
| UINT_32 | INT32 | int32 |
| UINT_64 | INT64 | int64 |
| DATE | INT32 | int32 |
| TIME_MILLIS | INT32 | int32 |
| TIME_MICROS | INT64 | int64 |
| TIMESTAMP_MILLIS | INT64 | int64 |
| TIMESTAMP_MICROS | INT64 | int64 |
| INTERVAL | FIXED_LEN_BYTE_ARRAY | string |
| DECIMAL | INT32, INT64, FIXED_LEN_BYTE_ARRAY, BYTE_ARRAY | int32, int64, string, string |
| UUID | FIXED_LEN_BYTE_ARRAY | string |
| FLOAT16 | FIXED_LEN_BYTE_ARRAY | string |
| GEOMETRY | BYTE_ARRAY | string |
| GEOGRAPHY | BYTE_ARRAY | string |
| JSON | BYTE_ARRAY | string |
| BSON | BYTE_ARRAY | string |
| LIST | - | slice |
| MAP | - | map |

Type Notes

  • Type aliases are supported (e.g., type MyString string), but the base type must follow the table
  • Use converter.go for type conversion utilities

Encoding Support

Supported Encodings

| Encoding | Types | Read | Write |
|---|---|---|---|
| PLAIN | All types | ✓ | ✓ |
| PLAIN_DICTIONARY | All types | ✓ | ✓ |
| RLE_DICTIONARY | All types | ✓ | ✓ |
| DELTA_BINARY_PACKED | Integer types | ✓ | ✓ |
| DELTA_BYTE_ARRAY | BYTE_ARRAY, UTF8 | ✓ | ✓ |
| DELTA_LENGTH_BYTE_ARRAY | BYTE_ARRAY, UTF8 | ✓ | ✓ |
| BYTE_STREAM_SPLIT | INT32, INT64, FIXED_LEN_BYTE_ARRAY | ✓ | ✓ |
| BIT_PACKED | Boolean, Integer | ✓ | |

Encoding Notes

  • For maximum compatibility, use PLAIN and PLAIN_DICTIONARY encodings
  • Avoid PLAIN_DICTIONARY for high-cardinality fields to prevent excessive memory usage
  • Use omitstats=true tag to skip statistics for large array fields

Compression Support

| Compression | Supported |
|---|---|
| UNCOMPRESSED | ✓ |
| SNAPPY | ✓ |
| GZIP | ✓ |
| LZO | ✗ |
| BROTLI | ✗ |
| LZ4 | ✓ |
| LZ4_RAW | ✓ |
| ZSTD | ✓ |

Repetition Types

| Repetition Type | Go Declaration | Description |
|---|---|---|
| REQUIRED | `V1 int32` with tag `parquet:"name=v1, type=INT32"` | Standard required field |
| OPTIONAL | `V1 *int32` with tag `parquet:"name=v1, type=INT32"` | Use a pointer for optional fields |
| REPEATED | `V1 []int32` with tag `parquet:"name=v1, type=INT32, repetitiontype=REPEATED"` | Use a slice with the repetitiontype tag |

Repetition Notes

  • LIST and REPEATED are different in the parquet format - prefer LIST
  • Standard and non-standard LIST/MAP formats are both supported
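The three declarations in the table above translate directly into a struct; a minimal sketch:

```go
package main

import "fmt"

// Row mirrors the repetition-type table: a plain value is REQUIRED,
// a pointer is OPTIONAL, and a slice with repetitiontype=REPEATED is REPEATED.
type Row struct {
	Req int32   `parquet:"name=req, type=INT32"`
	Opt *int32  `parquet:"name=opt, type=INT32"`
	Rep []int32 `parquet:"name=rep, type=INT32, repetitiontype=REPEATED"`
}

func main() {
	v := int32(7)
	r := Row{Req: 1, Opt: &v, Rep: []int32{1, 2, 3}}
	fmt.Println(r.Req, *r.Opt, len(r.Rep)) // → 1 7 3

	var missing Row // Opt defaults to nil, i.e. a null optional value
	fmt.Println(missing.Opt == nil)
}
```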

Schema Definition

Four methods to define schema:

1. Go Struct Tags

type Student struct {
    Name   string  `parquet:"name=name, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN_DICTIONARY"`
    Age    int32   `parquet:"name=age, type=INT32, encoding=PLAIN"`
    ID     int64   `parquet:"name=id, type=INT64"`
    Weight float32 `parquet:"name=weight, type=FLOAT"`
    Sex    bool    `parquet:"name=sex, type=BOOLEAN"`
}

2. JSON Schema

jsonSchema := `{
  "Tag": "name=parquet_go_root, repetitiontype=REQUIRED",
  "Fields": [
    {"Tag": "name=name, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"},
    {"Tag": "name=age, type=INT32, repetitiontype=REQUIRED"}
  ]
}`

3. CSV Metadata

md := []string{
    "name=Name, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN_DICTIONARY",
    "name=Age, type=INT32",
}

4. Arrow Schema

schema := arrow.NewSchema(
    []arrow.Field{
        {Name: "int64", Type: arrow.PrimitiveTypes.Int64},
        {Name: "float64", Type: arrow.PrimitiveTypes.Float64},
    },
    nil,
)

Schema Notes

  • All struct fields must be exported (start with uppercase letter)
  • InName (Go field name) and ExName (Parquet field name) are distinct
  • Avoid field names differing only by first letter case
  • PARGO_PREFIX_ is reserved - don't use as field prefix
  • Use \x01 as delimiter to support . in field names

Writers

Four writer types are available:

  1. ParquetWriter: Write Go structs - example
  2. JSONWriter: Convert JSON to Parquet - example
  3. CSVWriter: Write CSV-like data - example
  4. ArrowWriter: Write using Arrow schemas - example

Readers

Two reader types:

  1. ParquetReader: Read into Go structs - example
  2. ColumnReader: Read raw column data with repetition/definition levels - example

Reader Notes

  • For large files, read in chunks to avoid OOM
  • Configure RowGroupSize and PageSize in writer:
pw.RowGroupSize = common.DefaultRowGroupSize // default 128M
pw.PageSize = common.DefaultPageSize         // default 8K
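Reading in chunks amounts to looping the reader over fixed-size slices; the batch arithmetic can be sketched as a pure helper (illustrative, not a library function):

```go
package main

import "fmt"

// batchSizes splits a total row count into read batches of at most size
// rows, the pattern suggested above for avoiding OOM: allocate a slice of
// each returned size and call pr.Read on it in turn.
func batchSizes(total, size int) []int {
	var out []int
	for total > 0 {
		n := size
		if total < n {
			n = total
		}
		out = append(out, n)
		total -= n
	}
	return out
}

func main() {
	fmt.Println(batchSizes(2500, 1000)) // → [1000 1000 500]
}
```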

ParquetFile Interface

All file sources must implement:

type ParquetFile interface {
    io.Seeker
    io.Reader
    io.Writer
    io.Closer
    Open(name string) (ParquetFile, error)
    Create(name string) (ParquetFile, error)
}

Supported Sources

  • Local filesystem
  • HDFS
  • S3 (AWS SDK v1 and v2)
  • Google Cloud Storage
  • Azure Blob Storage
  • HTTP (read-only)
  • Memory buffer
  • GoCloud CDK (generic blob storage)
  • OpenStack Swift

See source/README.md for details.

Concurrency

Optimize performance with parallel marshaling/unmarshaling:

func NewParquetReader(pFile source.ParquetFileReader, obj interface{}, np int64) (*ParquetReader, error)
func NewParquetWriter(pFile source.ParquetFileWriter, obj interface{}, np int64) (*ParquetWriter, error)
func NewJSONWriter(jsonSchema string, pfile source.ParquetFileWriter, np int64) (*JSONWriter, error)
func NewCSVWriter(md []string, pfile source.ParquetFileWriter, np int64) (*CSVWriter, error)

Set the np parameter to control the number of goroutines used for parallel marshaling/unmarshaling.

Examples

Build examples with the example build tag:

go build -tags example ./example/local_flat        # Basic flat structure
go build -tags example ./example/local_nested      # Nested structures
go build -tags example ./example/json_write        # JSON to Parquet
go build -tags example ./example/csv_write         # CSV to Parquet
go build -tags example ./example/new_logical       # FLOAT16 + INTEGER
go build -tags example ./example/geospatial        # GEOMETRY + GEOGRAPHY
go build -tags example ./example/all_types         # Comprehensive sample

| Example | Description |
|---|---|
| local_flat.go | Write/read a flat Parquet file |
| local_nested.go | Write/read nested structures |
| read_partial.go | Read partial fields |
| read_partial2.go | Read sub-structs |
| read_without_schema_predefined.go | Read without a predefined schema |
| json_schema.go | Define a schema with JSON |
| json_write.go | Convert JSON to Parquet |
| convert_to_json.go | Convert Parquet to JSON |
| csv_write.go | CSV writer |
| column_read.go | Read raw column data |
| type.go | Type examples |
| type_alias.go | Type alias examples |
| new_logical.go | New logical types |
| geospatial.go | Geospatial types |
| all_types.go | All supported types |

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

License

Apache License 2.0
