@AbinayaJayaprakasam (Contributor) commented Nov 25, 2025

Convert TIMESTAMP(NANOS,*) to LongType regardless of nanosAsLong config to allow reading Parquet files with nanosecond precision timestamps.

What changes were proposed in this pull request?

Simplified the TIMESTAMP(NANOS) handling in ParquetSchemaConverter to always convert to LongType, removing the nanosAsLong condition check that caused TIMESTAMP(NANOS,false) files to be unreadable.

Why are the changes needed?

SPARK-40819 added spark.sql.legacy.parquet.nanosAsLong as a workaround for TIMESTAMP(NANOS,true), but:

  • Only worked for TIMESTAMP(NANOS,true), not for TIMESTAMP(NANOS,false)
  • Required users to know about an obscure internal config flag
  • Still required manual casting from Long to Timestamp

This fix makes all NANOS timestamps readable by default. Since Spark cannot fully support nanosecond precision in its type system, converting to LongType preserves precision while allowing files to be read.
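To illustrate the precision argument, here is a minimal pure-Python sketch (the value is hypothetical): Spark's TimestampType holds microseconds, so the last three digits of a nanosecond value cannot survive a timestamp conversion, while a 64-bit long keeps them exactly.

nanos = 1_672_567_200_000_000_123   # hypothetical value with 123 ns of sub-microsecond detail
micros = nanos // 1_000             # the most a microsecond-precision timestamp could hold
print(nanos - micros * 1_000)       # -> 123: detail preserved only by the 64-bit long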

Does this PR introduce any user-facing change?

Yes - Parquet files with TIMESTAMP(NANOS,*) are now readable by default without configuration. Values are read as LongType (nanoseconds since epoch). Users can convert to timestamp if needed: (col('nanos') / 1e9).cast('timestamp')

How was this patch tested?

  • Updated ParquetSchemaSuite test expectations (lines 1112-1121)
  • All 110 tests in ParquetSchemaSuite pass
  • Manually tested with TIMESTAMP(NANOS,false) Parquet file generated via PyArrow

Was this patch authored or co-authored using generative AI tooling?

No

Problem Demonstration

Step 1: Generate test Parquet file with TIMESTAMP(NANOS,false)

import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime

data = [
    pa.array([1, 2, 3, 4, 5], type=pa.int32()),
    pa.array([
        datetime(2023, 1, 1, 10, 0, 0),
        datetime(2023, 1, 1, 10, 0, 1),
        datetime(2023, 1, 1, 10, 0, 2),
        datetime(2023, 1, 1, 10, 0, 3),
        datetime(2023, 1, 1, 10, 0, 4)
    ], type=pa.timestamp('ns'))  # TIMESTAMP(NANOS,false)
]

schema = pa.schema([
    ('id', pa.int32()),
    ('timestamp_nanos', pa.timestamp('ns'))
])

table = pa.table(data, schema=schema)
pq.write_table(table, 'test_nanos_timestamp.parquet')

Step 2: Verify Parquet schema

$ parquet-tools schema test_nanos_timestamp.parquet

message schema {
  optional int32 id;
  optional int64 timestamp_nanos (TIMESTAMP(NANOS,false));
}
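For those without parquet-tools, the same check can be done in PyArrow (a sketch, using the file from Step 1): printing the ParquetFile schema shows each column's physical type along with the TIMESTAMP(NANOS, ...) logical annotation.

import pyarrow.parquet as pq

print(pq.ParquetFile('test_nanos_timestamp.parquet').schema)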

Step 3: Before fix - Unreadable

spark.read.parquet("test_nanos_timestamp.parquet").show()

org.apache.spark.sql.AnalysisException: [PARQUET_TYPE_ILLEGAL] Illegal Parquet type: INT64 (TIMESTAMP(NANOS,false)). SQLSTATE: 42846
  at org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:2081)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:238)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertInternal$4(ParquetSchemaConverter.scala:321)
  at scala.Option.getOrElse(Option.scala:201)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:258)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertField(ParquetSchemaConverter.scala:207)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertInternal$4$adapted(ParquetSchemaConverter.scala:168)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertInternal$4$adapted(ParquetSchemaConverter.scala:132)
  ...

Step 4: After fix - Readable

df = spark.read.parquet("test_nanos_timestamp.parquet")
df.printSchema()
root
 |-- id: integer (nullable = true)
 |-- timestamp_nanos: long (nullable = true)

df.show()

25/11/26 01:03:28 INFO CodeGenerator: Code generated in 5.300541 ms
+---+-------------------+
| id|    timestamp_nanos|
+---+-------------------+
|  1|1672567200000000000|
|  2|1672567201000000000|
|  3|1672567202000000000|
|  4|1672567203000000000|
|  5|1672567204000000000|
+---+-------------------+

SUCCESS: File read without error!
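A sketch of converting the raw long to a proper timestamp, following the cast suggested above. This variant uses the SQL function timestamp_micros (Spark 3.1+, to my knowledge) with integer division (div), so sub-microsecond digits are truncated rather than routed through a lossy double; 'timestamp_nanos' is the demo column.

from pyspark.sql import functions as F

# div is Spark SQL integer division; the trailing nanosecond digits are dropped.
df.withColumn('ts', F.expr('timestamp_micros(timestamp_nanos div 1000)')) \
    .show(truncate=False)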

Test coverage

Updated the existing test in ParquetSchemaSuite: changed the expectation from "error" to "success with LongType".

Full test suite:

$ ./build/sbt "sql/testOnly *ParquetSchemaSuite"
[info] Total number of tests run: 110
[info] Tests: succeeded 110, failed 0
[info] All tests passed.
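A quick end-to-end assertion mirroring the manual test (a sketch, assuming the file from Step 1 is on the path): with the fix, no config is needed and the nanosecond column surfaces as a long.

df = spark.read.parquet('test_nanos_timestamp.parquet')
assert dict(df.dtypes)['timestamp_nanos'] == 'bigint'  # LongType maps to 'bigint' in df.dtypes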

Behavior Matrix

| Scenario                   | Before        | After         | Breaking?  |
|----------------------------|---------------|---------------|------------|
| NANOS + nanosAsLong=true   | LongType      | LongType      | No         |
| NANOS + nanosAsLong=false  | ERROR         | LongType      | No (fix!)  |
| MICROS/MILLIS timestamps   | TimestampType | TimestampType | No         |
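A hedged regression check for the last matrix row (the filename test_micros.parquet is assumed): a MICROS file written with PyArrow should still come back as Spark's TimestampType.

import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime

# tz-aware microseconds -> TIMESTAMP(MICROS,true) in the Parquet footer
tbl = pa.table({'ts': pa.array([datetime(2023, 1, 1)], type=pa.timestamp('us', tz='UTC'))})
pq.write_table(tbl, 'test_micros.parquet')
spark.read.parquet('test_micros.parquet').printSchema()  # ts: timestamp (unchanged behavior)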
