[SPARK-54446][SQL][ML][CONNECT] FPGrowth supports local filesystem with Arrow file format #53232

zhengruifeng · 2025-11-26T12:27:27Z

What changes were proposed in this pull request?

FPGrowth supports local filesystem

Why are the changes needed?

To make FPGrowth work with the local filesystem.

Does this PR introduce any user-facing change?

Yes, FPGrowth will work when local saving mode is on.

How was this patch tested?

updated tests

Was this patch authored or co-authored using generative AI tooling?

no

zhengruifeng · 2025-11-26T12:43:15Z

This PR is another attempt to save ML models that contain DataFrames to the driver's local filesystem.
TBH, I am not very familiar with the Arrow file reader / writer.

zhengruifeng · 2025-11-26T12:52:53Z

sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowFileReadWrite.scala

+    fileWriter.start()
+    while (batchBytesIter.hasNext) {
+      val batchBytes = batchBytesIter.next()
+      val batch = ArrowConverters.loadBatch(batchBytes, allocator)


The batch (an ArrowRecordBatch) doesn't extend Serializable, so the PR still uses Array[Byte] as the underlying data.

holdenk · 2025-11-27T01:37:09Z

mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala

+  def saveDataFrame(path: String, df: DataFrame): Unit = {
+    if (localSavingModeState.get()) {
+      val filePath = Paths.get(path)
+      Files.createDirectories(filePath.getParent)
+
+      df match {
+        case d: org.apache.spark.sql.classic.DataFrame =>
+          ArrowFileReadWrite.save(d, path)
+        case _ => throw new UnsupportedOperationException("Unsupported dataframe type")
+      }
+    } else {
+      df.write.parquet(path)
+    }
+  }
+
+  def loadDataFrame(path: String, spark: SparkSession): DataFrame = {
+    if (localSavingModeState.get()) {
+      spark match {
+        case s: org.apache.spark.sql.classic.SparkSession =>
+          ArrowFileReadWrite.load(s, path)
+        case _ => throw new UnsupportedOperationException("Unsupported session type")
+      }
+    } else {
+      spark.read.parquet(path)
+    }
+  }


So if localSavingModeState is set to true, this will write out an Arrow file, which is not a stable format. It does look like localSavingModeState is only set to true in internal methods in Scala. Looking at the PySpark docstrings, I see we tell people to use this API, so I remain -0.9.

Hi @holdenk, as @WeichenXu123 explained in #53150 (comment), this is a temporary runtime file on the Spark Connect server side, and it is cleaned up when the session closes.
So I think we don't have to use a stable format here.

localSavingModeState is also used only internally (only Spark driver code can set the flag). Where does the docstring mention it? We should remove it from the docs and mark localSavingModeState as a private field.

Hmm, even if it is just a temporary session file, is there any reason to use the Arrow file format rather than Parquet?

We can read/write Parquet with Arrow, but it requires a new dependency:

<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-arrow</artifactId>
</dependency>

Otherwise, I am not sure whether we have utils to read/write Parquet.

In the end we need the in-memory data to be in Arrow format, so using an Arrow file is more efficient.
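The dispatch in saveDataFrame / loadDataFrame above hinges on a flag that only driver-side code flips. Below is a minimal sketch of that pattern, assuming the flag is a thread-local Boolean (the .get() call in the diff is consistent with that but does not confirm it). Only the name localSavingModeState comes from the diff; the withLocalSavingMode helper, the chosenFormat method, and the returned format labels are hypothetical and are not Spark APIs.

```scala
object LocalSavingModeSketch {
  // Assumption: a per-thread flag that defaults to false (distributed/Parquet path).
  private val localSavingModeState = new ThreadLocal[Boolean]() {
    override def initialValue: Boolean = false
  }

  // Hypothetical helper: run `body` with the flag on, then restore the old value
  // even if `body` throws, so the flag never leaks to later work on this thread.
  def withLocalSavingMode[T](body: => T): T = {
    val previous = localSavingModeState.get()
    localSavingModeState.set(true)
    try body
    finally localSavingModeState.set(previous)
  }

  // Mirrors the if/else in saveDataFrame: local mode picks the Arrow file path,
  // otherwise the regular Parquet writer is used. Labels are illustrative only.
  def chosenFormat(): String =
    if (localSavingModeState.get()) "arrow-file" else "parquet"
}
```

Because the flag is restored in a finally block, callers outside the helper keep seeing the default Parquet path, which matches the claim in the thread that only internal driver code can turn local saving mode on.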

viirya

I wonder why we are now choosing the Arrow file format instead of Parquet?
Given the batch -> bytes -> batch -> bytes round trip (when writing to a file), it doesn't look efficient.

viirya · 2025-11-27T01:49:48Z

sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowFileReadWrite.scala

+    val rdd = df.toArrowBatchRdd(maxRecordsPerBatch, "UTC", true, false)
+    val arrowSchema = ArrowUtils.toArrowSchema(df.schema, "UTC", true, false)
+    val writer = new SparkArrowFileWriter(arrowSchema, path)
+    writer.write(rdd.toLocalIterator)


Instead, can we call toLocalIterator on the original DataFrame's RDD and write the rows to Arrow batches locally? Then we wouldn't need the redundant bytes.

If we use the bytes, we can make the most of the ArrowConverters utils.

WeichenXu123

LGTM

cloud-fan · 2025-11-27T03:12:48Z

> Due to the process of batch -> bytes -> batch -> bytes (when writing to file)

Can we have a shared util to produce RDD of arrow batches? Then we can either turn it to RDD of bytes, or write it to local files.

HyukjinKwon · 2025-11-27T04:11:31Z

> Can we have a shared util to produce RDD of arrow batches? Then we can either turn it to RDD of bytes, or write it to local files.

This actually already reuses a lot of existing utils in ArrowConverters.scala. We have the same logic in Python, but SparkArrowFileWriter is new on the JVM.

Basically, toArrowBatchRdd is the util you meant for batch -> bytes.

The code below is for bytes -> batch -> write:

val writer = new SparkArrowFileWriter(arrowSchema, path)
writer.write(rdd.toLocalIterator)
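The streaming shape of writer.write(rdd.toLocalIterator) can be illustrated without Arrow: pull one serialized batch at a time from an iterator and append it to a local file, so that only one batch is held in memory. The sketch below uses simple length-prefixed framing as a stand-in. It is not the Arrow IPC file format and not the PR's SparkArrowFileWriter; FramedBatchFile and its methods are hypothetical names.

```scala
import java.io.{DataInputStream, DataOutputStream, EOFException}
import java.nio.file.{Files, Path}
import scala.collection.mutable.ArrayBuffer

object FramedBatchFile {
  // Write each batch as <4-byte length><payload>, consuming the iterator lazily.
  def write(path: Path, batches: Iterator[Array[Byte]]): Unit = {
    val out = new DataOutputStream(Files.newOutputStream(path))
    try {
      while (batches.hasNext) {
        val bytes = batches.next()
        out.writeInt(bytes.length)
        out.write(bytes)
      }
    } finally out.close()
  }

  // Read batches back until the length prefix can no longer be read.
  def read(path: Path): Seq[Array[Byte]] = {
    val in = new DataInputStream(Files.newInputStream(path))
    val result = ArrayBuffer.empty[Array[Byte]]
    try {
      var more = true
      while (more) {
        // A clean end of file shows up as EOFException on the length prefix.
        val len = try in.readInt() catch { case _: EOFException => -1 }
        if (len < 0) more = false
        else {
          val buf = new Array[Byte](len)
          in.readFully(buf) // throws EOFException if the payload is truncated
          result += buf
        }
      }
    } finally in.close()
    result.toSeq
  }
}
```

A real Arrow file additionally carries the schema and a footer with block offsets, which is why the PR deserializes each Array[Byte] back into a record batch before handing it to the file writer instead of appending the bytes directly.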

zhengruifeng added 6 commits November 26, 2025 15:08

fix

d93588d

apply arrow

9474f69

nit

init

07f67e6

test

536b403

test

4dcf366

test

e09ece3

github-actions bot added SQL ML MLLIB labels Nov 26, 2025

nit

76bc0a8

zhengruifeng mentioned this pull request Nov 26, 2025

[SPARK-54446][ML] FPGrowth supports local filesystem #53150

Draft

zhengruifeng requested review from HyukjinKwon, WeichenXu123, cloud-fan, holdenk and viirya and removed request for WeichenXu123 November 26, 2025 12:40

HyukjinKwon approved these changes Nov 26, 2025

viirya reviewed Nov 26, 2025

holdenk requested changes Nov 27, 2025

viirya reviewed Nov 27, 2025

cloud-fan reviewed Nov 27, 2025

cloud-fan reviewed Nov 27, 2025

cloud-fan reviewed Nov 27, 2025

WeichenXu123 approved these changes Nov 27, 2025

zhengruifeng changed the title ~~[SPARK-54446][ML] FPGrowth supports local filesystem with Arrow file format~~ [SPARK-54446][ML][CONNECT] FPGrowth supports local filesystem with Arrow file format Nov 27, 2025

viirya reviewed Nov 27, 2025

address comments

c7c2db5

zhengruifeng changed the title ~~[SPARK-54446][ML][CONNECT] FPGrowth supports local filesystem with Arrow file format~~ [SPARK-54446][SQL][ML][CONNECT] FPGrowth supports local filesystem with Arrow file format Nov 28, 2025
