
[SPARK-52012][CORE][SQL] Restore IDE Index with type annotations #50798

Open
wants to merge 7 commits into master

Conversation

@yaooqinn (Member) commented May 6, 2025

What changes were proposed in this pull request?

Restore IDE Index with type annotations
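In short, the change adds explicit result types to declarations whose inferred types IntelliJ otherwise appears to resolve against the connect-shims stubs. A minimal sketch, assuming a SparkSession named spark in scope (drawn from examples discussed later in this thread):

    // Before: the type is inferred, and IDEA may index the symbol against the
    // identically named stub in connect-shims, leaving later usages unresolved.
    val sc = spark.sparkContext

    // After: the explicit annotation pins the declared type, so the IDE no
    // longer has to rely on type inference across the shim boundary.
    val sc: SparkContext = spark.sparkContext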

Why are the changes needed?

Restore IDE indexing for code navigation to improve the Spark contribution experience.

Does this PR introduce any user-facing change?

No; this is a dev-only change.

How was this patch tested?

GA (GitHub Actions)

Was this patch authored or co-authored using generative AI tooling?

no

@yaooqinn yaooqinn changed the title [WIP][SPARK-52012][CORE][SQL] Restore IDE Index with type annotations [SPARK-52012][CORE][SQL] Restore IDE Index with type annotations May 6, 2025
@LuciferYang (Contributor)

@yaooqinn Thank you for the effort you've invested in resolving this issue; I've been struggling with this problem for quite a while as well. After this PR, is the issue fully resolved, or is it only partially mitigated?

Also cc @hvanhovell: do you have a more effective solution to address this issue? The core obstacle stems from IntelliJ's inability, during code indexing, to properly distinguish between class/object definitions with identical names in the connect-shims module and in other modules of the project. This deficiency has already caused several disruptions for developers using the IDE:

  1. Code completion failures due to IntelliJ's indexing anomalies
  2. Debugging workflow interruptions triggered by compilation errors when running test cases in IntelliJ
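A rough sketch of the name collision described above (editorial illustration; the actual definitions in connect-shims and core are more involved):

    // In the connect-shims module: a stub that sql-api and connect-client-jvm
    // compile against (sketch only; the real shims.scala differs).
    package org.apache.spark
    class SparkContext private ()

    // In the core module: the real implementation with the same fully
    // qualified name, which IntelliJ also indexes.
    package org.apache.spark
    class SparkContext(config: SparkConf) { /* ... */ }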

@yaooqinn (Member Author) commented May 6, 2025

@LuciferYang

Technically, the approach covers all cases. However, I haven't applied it everywhere yet because of the huge number of lines of code involved.

@LuciferYang (Contributor)

Local verification results are as follows:

  1. When importing the project into IntelliJ IDEA via the Maven approach, this PR performs favorably. The code "red highlighting" issue that previously occurred has been significantly alleviated. Additionally, I was able to execute test cases directly within the IDE, such as running BitmapExpressionsQuerySuite. The results are as follows:

[screenshot]

  2. When importing the project into IntelliJ IDEA via the sbt approach, the effect of this PR appears less pronounced. Code related to RDD, SparkContext, and other components still exhibits "red highlighting", and executing test cases within the IDE continues to throw errors. An example is shown below:

[screenshot]

It is uncertain whether the above conclusions hold true in other developers' IDE environments. To ensure robustness, it would be prudent to involve additional contributors in verifying the actual outcomes.

  lazy val fileConstantMetadataColumns: Seq[AttributeReference] = output.collect {
    // Collect metadata columns to be handled outside of the scan by appending constant columns.
    case FileSourceConstantMetadataAttribute(attr) => attr
  }

override def vectorTypes: Option[Seq[String]] =
protected def sessionState: SessionState = relation.sparkSession.sessionState
Contributor

While this approach may resolve the immediate issue, it could incur long-term development costs:
Validating whether newly added code triggers "red highlighting" in the IDE for every PR submission would be impractical. This might lead to a proliferation of similar follow-up tasks or recurring issues.

Member Author

We can add a style check later.

@LuciferYang (Contributor) left a comment

If the long-term maintenance cost remains low, I'm +1 on this

@@ -44,8 +45,9 @@ private[sql] class UDTFRegistration private[sql] (tableFunctionRegistry: TableFu
| udfDeterministic: ${udtf.udfDeterministic}
""".stripMargin)

val sessionState: SessionState = SparkSession.getActiveSession.get.sessionState
Contributor

What's the actual change here? The type annotation?

Member Author

Yes. Honestly, I have no clue why explicit type annotations work for the IDE; I just tried them and saw that they helped.

@dongjoon-hyun (Member) left a comment

Sorry, but I'm a little reluctant to bring SparkContext back into the code (even internal code), @yaooqinn.

    val sc: SparkContext = spark.sparkContext
    val broadcastedConf = sc.broadcast(new SerializableConfiguration(hadoopConf))

Do you think you can propose this without SparkContext-related changes?

@yaooqinn (Member Author) commented May 7, 2025

Hi @dongjoon-hyun, without the SparkContext-related changes, symbols related to broadcast, defaultParallelism, defaultMinPartitions, addJar, addFiles, addArchive, listJars, and so on will still be unresolved in IDEA.

Do you have any suggestions to make it work without them?

@dongjoon-hyun (Member)

Technically, I don't have a suggestion to make it work.

However, I can see that this PR could have an undesirable side effect, not only now but going forward, because it seems to unintentionally advertise SparkContext again by exposing it more and more to new developers in the Spark code base. Given this PR's approach, I'm worried that new developers could add more SparkContext instances to the Spark code repository via the following pattern in order to keep the type annotations. What is your opinion about that?

  val sc: SparkContext = spark.sparkContext

@dongjoon-hyun (Member)

BTW, I'll leave the final decision to you and the other reviewers, @yaooqinn .
So, feel free to proceed if you think this is needed~

@cloud-fan (Contributor)

I think we need to first understand why explicit type annotation works. The current approach looks a bit hacky to me.

@yaooqinn (Member Author) commented May 7, 2025

@dongjoon-hyun, I will make the following change to address the bulk of the SparkContext call sites, and restore the rest to their state before this PR:

--- a/core/src/main/scala/org/apache/spark/util/SerializableConfiguration.scala
+++ b/core/src/main/scala/org/apache/spark/util/SerializableConfiguration.scala
@@ -20,7 +20,9 @@ import java.io.{ObjectInputStream, ObjectOutputStream}

 import org.apache.hadoop.conf.Configuration

+import org.apache.spark.SparkContext
 import org.apache.spark.annotation.{DeveloperApi, Unstable}
+import org.apache.spark.broadcast.Broadcast

 /**
  * Hadoop configuration but serializable. Use `value` to access the Hadoop configuration.
@@ -39,3 +41,9 @@ class SerializableConfiguration(@transient var value: Configuration) extends Ser
     value.readFields(in)
   }
 }
+
+private[spark] object SerializableConfiguration {
+  def broadcast(sc: SparkContext, conf: Configuration): Broadcast[SerializableConfiguration] = {
+    sc.broadcast(new SerializableConfiguration(conf))
+  }
+}
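If this helper is adopted, a call site like the one quoted earlier could presumably change as follows (a hypothetical call site, not taken from the PR):

    // Before: a SparkContext-typed local is needed to keep the type annotation.
    val sc: SparkContext = spark.sparkContext
    val broadcastedConf = sc.broadcast(new SerializableConfiguration(hadoopConf))

    // After: the SparkContext usage is confined to the helper in core.
    val broadcastedConf = SerializableConfiguration.broadcast(spark.sparkContext, hadoopConf)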

@yaooqinn (Member Author) commented May 7, 2025

The current approach looks a bit hacky to me.

Hi @cloud-fan, I don't think the approach here is hacky, because type annotation for declarations is a common language feature.

The hacky part might belong to IDEA itself, which struggles to understand the implicit definitions from the shims approach we applied in Spark.

@github-actions github-actions bot added the CORE label May 7, 2025
@@ -20,7 +20,9 @@ import java.io.{ObjectInputStream, ObjectOutputStream}

import org.apache.hadoop.conf.Configuration

import org.apache.spark.SparkContext
@yaooqinn (Member Author) commented May 7, 2025

@dongjoon-hyun,

Except for changes in the hive-thriftserver module, this is now the only place that imports SparkContext, and it's in Spark core.

Member

Thank you, @yaooqinn .

@@ -53,16 +53,14 @@ case class ParquetScanBuilder(
override protected val supportsNestedSchemaPruning: Boolean = true

override def pushDataFilters(dataFilters: Array[Filter]): Array[Filter] = {
val sqlConf = sparkSession.sessionState.conf
if (sqlConf.parquetFilterPushDown) {
Member

The original code looks okay to me. Could you confirm that this part also had broken type annotations before this PR? Or is this simply revised to reuse conf instead of sqlConf at line 56?

@dongjoon-hyun (Member) left a comment

This PR has multiple types of improvements.

Although I understand these were all changed to serve the same goal (the IDE type-annotation issue), if you don't mind, I'd like to propose splitting this PR into more narrowly scoped ones, to be safer. For example:

  1. A PR to change Class Loader usage.
withClassLoader
- // Always use the latest class loader provided by executionHive's state.
- val executionHiveClassLoader = session.sharedState.jarClassLoader
- Thread.currentThread().setContextClassLoader(executionHiveClassLoader)
 protected def withClassLoader(f: => Unit): Unit = {
    val executionHiveClassLoader = session.sharedState.asInstanceOf[SharedState].jarClassLoader
    Thread.currentThread().setContextClassLoader(executionHiveClassLoader)
    f
  }
  2. A PR to add new methods and use them
final protected def sessionState: SessionState = session.sessionState
final protected def sparkContext: SparkContext = session.sparkContext
private def sparkSessionState: SparkSessionState = SparkSQLEnv.sparkSession.sessionState
  3. A PR for switching from creating new variables to reusing existing ones.

  4. A PR for the explicit type casts with asInstanceOf.

WDYT?

@@ -52,7 +52,7 @@ private[sql] object AvroUtils extends Logging {
spark: SparkSession,
options: Map[String, String],
files: Seq[FileStatus]): Option[StructType] = {
val conf = spark.sessionState.newHadoopConfWithOptions(options)
val conf = spark.sessionState.asInstanceOf[SessionState].newHadoopConfWithOptions(options)
Contributor

Changes like this are not very reasonable; they're only needed to work around the current IntelliJ limitation of not recognizing the dependency exclusion.

I'm a bit hesitant to merge this PR. If we need a temporary fix for IntelliJ support, shall we use Maven profiles, with the default profile not including connect-shims?

Contributor

This can be tested experimentally in IntelliJ by manually ignoring the connect-shims module: doing so causes the sql-api module to be marked with compilation errors. For instance, the definitions of JavaRDD and RDD cannot be found when working with classes like org.apache.spark.sql.DataFrameReader.
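For context, a simplified sketch of the kind of signatures involved (DataFrameReaderSketch is a hypothetical name used only for illustration; the real org.apache.spark.sql.DataFrameReader is more elaborate):

    import org.apache.spark.api.java.JavaRDD
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.DataFrame

    // Overloads like these mention RDD/JavaRDD in their signatures, so sql-api
    // needs some definition of those types on its compile classpath - normally
    // the connect-shims stubs. Ignoring that module breaks sql-api compilation.
    trait DataFrameReaderSketch {
      def json(jsonRDD: JavaRDD[String]): DataFrame
      def json(jsonRDD: RDD[String]): DataFrame
    }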

@LuciferYang (Contributor) commented May 7, 2025

@yaooqinn @cloud-fan @dongjoon-hyun

I've noticed that the connect-client-jvm module still relies on the core module during runtime.

For example, the SparkConf object passed to the org.apache.spark.sql.connect.SparkSession.Builder#config API is actually defined in the core module. If this dependency is intentional, could the sql-api module also include core as a compile-time dependency? This approach might allow us to reduce the number of stub classes defined in shims.scala, such as SparkContext, SparkConf, JavaRDD, and RDD. Consequently, this could potentially minimize the changes required in this PR.

I’m currently testing this idea in #50815

@dongjoon-hyun (Member)

Thank you, @LuciferYang .

@yaooqinn (Member Author) commented May 8, 2025

Thank you, @dongjoon-hyun and @cloud-fan. I would like to put this pull request on hold and wait for @LuciferYang's #50815.

@LuciferYang (Contributor)

Currently, we have not shaded the connect-shims into connect-client-jvm. As a result, the usage of classes such as SparkContext and SparkConf falls into two distinct scenarios:

  1. When using Ammonite spark-shell, connect-client-jvm relies on the class definitions from the spark-core and spark-sql jars.
  2. When connect-client-jvm is included as a dependency in a third-party project, it will transitively pull in connect-shims. In this case, there is no longer a need to include spark-core and spark-sql.

Therefore, although the tests for #50815 may pass, I still recommend maintaining the current definitions in shims.scala. Otherwise, in Scenario 2, we would end up transitively introducing spark-core into the third-party project, which seems counter to the original design intent.

@yaooqinn (Member Author) commented May 8, 2025

Thank you for the input, @LuciferYang.
