@AbinayaJayaprakasam (Contributor) commented Nov 25, 2025

Convert TIMESTAMP(NANOS,*) to LongType regardless of nanosAsLong config to allow reading Parquet files with nanosecond precision timestamps.

What changes were proposed in this pull request?

Simplified the TIMESTAMP(NANOS) handling in ParquetSchemaConverter to always convert to LongType, removing the nanosAsLong condition check that caused TIMESTAMP(NANOS,false) files to be unreadable.

Why are the changes needed?

SPARK-40819 added spark.sql.legacy.parquet.nanosAsLong as a workaround for TIMESTAMP(NANOS,true), but:

  • Only worked for TIMESTAMP(NANOS,true), not for TIMESTAMP(NANOS,false)
  • Required users to know about an obscure internal config flag
  • Still required manual casting from Long to Timestamp

This fix makes all NANOS timestamps readable by default. Since Spark cannot fully support nanosecond precision in its type system, converting to LongType preserves precision while allowing files to be read.
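To illustrate the precision argument, here is a minimal pure-Python sketch (the value is hypothetical): Spark's TimestampType holds microseconds, so the last three digits of a nanosecond value cannot survive a timestamp conversion, while a 64-bit long keeps them exactly.

nanos = 1_672_567_200_000_000_123   # hypothetical value with 123 ns of sub-microsecond detail
micros = nanos // 1_000             # the most a microsecond-precision timestamp could hold
print(nanos - micros * 1_000)       # -> 123: detail preserved only by the 64-bit long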

Does this PR introduce any user-facing change?

Yes - Parquet files with TIMESTAMP(NANOS,*) are now readable by default without configuration. Values are read as LongType (nanoseconds since epoch). Users can convert to timestamp if needed: (col('nanos') / 1e9).cast('timestamp')

How was this patch tested?

  • Updated ParquetSchemaSuite test expectations (lines 1112-1121)
  • All 110 tests in ParquetSchemaSuite pass
  • Manually tested with TIMESTAMP(NANOS,false) Parquet file generated via PyArrow

Was this patch authored or co-authored using generative AI tooling?

No

Problem Demonstration

Step 1: Generate test Parquet file with TIMESTAMP(NANOS,false)

import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime

data = [
    pa.array([1, 2, 3, 4, 5], type=pa.int32()),
    pa.array([
        datetime(2023, 1, 1, 10, 0, 0),
        datetime(2023, 1, 1, 10, 0, 1),
        datetime(2023, 1, 1, 10, 0, 2),
        datetime(2023, 1, 1, 10, 0, 3),
        datetime(2023, 1, 1, 10, 0, 4)
    ], type=pa.timestamp('ns'))  # TIMESTAMP(NANOS,false)
]

schema = pa.schema([
    ('id', pa.int32()),
    ('timestamp_nanos', pa.timestamp('ns'))
])

table = pa.table(data, schema=schema)
pq.write_table(table, 'test_nanos_timestamp.parquet')

Step 2: Verify Parquet schema

$ parquet-tools schema test_nanos_timestamp.parquet

message schema {
  optional int32 id;
  optional int64 timestamp_nanos (TIMESTAMP(NANOS,false));
}
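For those without parquet-tools, the same check can be done in PyArrow (a sketch, using the file from Step 1): printing the ParquetFile schema shows each column's physical type along with the TIMESTAMP(NANOS, ...) logical annotation.

import pyarrow.parquet as pq

print(pq.ParquetFile('test_nanos_timestamp.parquet').schema)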

Step 3: Before fix - Unreadable

spark.read.parquet("test_nanos_timestamp.parquet").show()

org.apache.spark.sql.AnalysisException: [PARQUET_TYPE_ILLEGAL] Illegal Parquet type: INT64 (TIMESTAMP(NANOS,false)). SQLSTATE: 42846
  at org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:2081)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:238)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertInternal$4(ParquetSchemaConverter.scala:321)
  at scala.Option.getOrElse(Option.scala:201)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:258)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertField(ParquetSchemaConverter.scala:207)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertInternal$4$adapted(ParquetSchemaConverter.scala:168)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertInternal$4$adapted(ParquetSchemaConverter.scala:132)
  ...

Step 4: After fix - Readable

df = spark.read.parquet("test_nanos_timestamp.parquet")
df.printSchema()
root
 |-- id: integer (nullable = true)
 |-- timestamp_nanos: long (nullable = true)

df.show()

25/11/26 01:03:28 INFO CodeGenerator: Code generated in 5.300541 ms
+---+-------------------+
| id|    timestamp_nanos|
+---+-------------------+
|  1|1672567200000000000|
|  2|1672567201000000000|
|  3|1672567202000000000|
|  4|1672567203000000000|
|  5|1672567204000000000|
+---+-------------------+

SUCCESS: File read without error!
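A sketch of converting the raw long to a proper timestamp, following the cast suggested above. This variant uses the SQL function timestamp_micros (Spark 3.1+, to my knowledge) with integer division (div), so sub-microsecond digits are truncated rather than routed through a lossy double; 'timestamp_nanos' is the demo column.

from pyspark.sql import functions as F

# div is Spark SQL integer division; the trailing nanosecond digits are dropped.
df.withColumn('ts', F.expr('timestamp_micros(timestamp_nanos div 1000)')) \
    .show(truncate=False)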

Test coverage

Updated the existing test in ParquetSchemaSuite: changed the expectation from "error" to "success with LongType".

Full test suite:

$ ./build/sbt "sql/testOnly *ParquetSchemaSuite"
[info] Total number of tests run: 110
[info] Tests: succeeded 110, failed 0
[info] All tests passed.
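A quick end-to-end assertion mirroring the manual test (a sketch, assuming the file from Step 1 is on the path): with the fix, no config is needed and the nanosecond column surfaces as a long.

df = spark.read.parquet('test_nanos_timestamp.parquet')
assert dict(df.dtypes)['timestamp_nanos'] == 'bigint'  # LongType maps to 'bigint' in df.dtypes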

Behavior Matrix

| Scenario                   | Before        | After         | Breaking?  |
|----------------------------|---------------|---------------|------------|
| NANOS + nanosAsLong=true   | LongType      | LongType      | No         |
| NANOS + nanosAsLong=false  | ERROR         | LongType      | No (fix!)  |
| MICROS/MILLIS timestamps   | TimestampType | TimestampType | No         |
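A hedged regression check for the last matrix row (the filename test_micros.parquet is assumed): a MICROS file written with PyArrow should still come back as Spark's TimestampType.

import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime

# tz-aware microseconds -> TIMESTAMP(MICROS,true) in the Parquet footer
tbl = pa.table({'ts': pa.array([datetime(2023, 1, 1)], type=pa.timestamp('us', tz='UTC'))})
pq.write_table(tbl, 'test_micros.parquet')
spark.read.parquet('test_micros.parquet').printSchema()  # ts: timestamp (unchanged behavior)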
