
[SPARK-52012][CORE][SQL] Restore IDE Index with type annotations #50798

Open
wants to merge 7 commits into master

Conversation

@yaooqinn (Member) commented May 6, 2025

What changes were proposed in this pull request?

Restore IDE Index with type annotations
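In short, the change adds explicit result types to declarations whose inferred types IntelliJ otherwise appears to resolve against the connect-shims stubs. A minimal sketch, assuming a SparkSession named spark in scope (drawn from examples discussed later in this thread):

    // Before: the type is inferred, and IDEA may index the symbol against the
    // identically named stub in connect-shims, leaving later usages unresolved.
    val sc = spark.sparkContext

    // After: the explicit annotation pins the declared type, so the IDE no
    // longer has to rely on type inference across the shim boundary.
    val sc: SparkContext = spark.sparkContext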

Why are the changes needed?

Restore IDE indexing for code navigation to improve the Spark contribution experience.

Does this PR introduce any user-facing change?

No; this is a dev-only change.

How was this patch tested?

GA (GitHub Actions)

Was this patch authored or co-authored using generative AI tooling?

no

@yaooqinn yaooqinn changed the title [WIP][SPARK-52012][CORE][SQL] Restore IDE Index with type annotations [SPARK-52012][CORE][SQL] Restore IDE Index with type annotations May 6, 2025
@LuciferYang (Contributor)

@yaooqinn Thank you for the effort you've invested in resolving this issue; I've been struggling with this problem for quite a while as well. After this PR, is the issue fully resolved, or is it only partially mitigated?

Also cc @hvanhovell: do you have a more effective solution to address this issue? The core obstacle stems from IntelliJ's inability, during code indexing, to properly distinguish between class/object definitions with identical names in the connect-shims module and in other modules of the project. This deficiency has already caused several disruptions for developers using the IDE:

  1. Code completion failures due to IntelliJ's indexing anomalies
  2. Debugging workflow interruptions triggered by compilation errors when running test cases in IntelliJ
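A rough sketch of the name collision described above (editorial illustration; the actual definitions in connect-shims and core are more involved):

    // In the connect-shims module: a stub that sql-api and connect-client-jvm
    // compile against (sketch only; the real shims.scala differs).
    package org.apache.spark
    class SparkContext private ()

    // In the core module: the real implementation with the same fully
    // qualified name, which IntelliJ also indexes.
    package org.apache.spark
    class SparkContext(config: SparkConf) { /* ... */ }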

@yaooqinn (Member Author) commented May 6, 2025

@LuciferYang

Technically, the approach covers all cases. However, I haven't applied it everywhere yet because of the huge number of lines of code involved.

@LuciferYang (Contributor)

Local verification results are as follows:

  1. When importing the project into IntelliJ IDEA via the Maven approach, this PR performs favorably. The code "red highlighting" issue that previously occurred has been significantly alleviated. Additionally, I was able to execute test cases directly within the IDE, such as running BitmapExpressionsQuerySuite. The results are as follows:

[screenshot]

  2. When importing the project into IntelliJ IDEA via the sbt approach, the effect of this PR appears less pronounced. Code related to RDD, SparkContext, and other components still exhibits "red highlighting", and executing test cases within the IDE continues to throw errors. An example is shown below:

[screenshot]

It is uncertain whether the above conclusions hold true in other developers' IDE environments. To ensure robustness, it would be prudent to involve additional contributors in verifying the actual outcomes.

  lazy val fileConstantMetadataColumns: Seq[AttributeReference] = output.collect {
    // Collect metadata columns to be handled outside of the scan by appending constant columns.
    case FileSourceConstantMetadataAttribute(attr) => attr
  }

override def vectorTypes: Option[Seq[String]] =
protected def sessionState: SessionState = relation.sparkSession.sessionState
Contributor

While this approach may resolve the immediate issue, it could incur long-term development costs:
Validating whether newly added code triggers "red highlighting" in the IDE for every PR submission would be impractical. This might lead to a proliferation of similar follow-up tasks or recurring issues.

Member Author

We can add a style check later.

@LuciferYang (Contributor) left a comment

If the long-term maintenance cost remains low, I'm +1 on this

@@ -44,8 +45,9 @@ private[sql] class UDTFRegistration private[sql] (tableFunctionRegistry: TableFu
| udfDeterministic: ${udtf.udfDeterministic}
""".stripMargin)

val sessionState: SessionState = SparkSession.getActiveSession.get.sessionState
Contributor

What's the actual change here? The type annotation?

Member Author

Yes. Honestly, I have no clue why explicit type annotations work for the IDE; I just tried them and saw that they helped.

@dongjoon-hyun (Member) left a comment

Sorry, but I'm a little reluctant to bring SparkContext back into the code (even internal code), @yaooqinn.

    val sc: SparkContext = spark.sparkContext
    val broadcastedConf = sc.broadcast(new SerializableConfiguration(hadoopConf))

Do you think you can propose this without SparkContext-related changes?

@yaooqinn (Member Author) commented May 7, 2025

Hi @dongjoon-hyun, without the SparkContext-related changes, symbols related to broadcast, defaultParallelism, defaultMinPartitions, addJar, addFiles, addArchive, listJars, and so on will still be unresolved in IDEA.

Do you have any suggestions to make it work without them?

@dongjoon-hyun (Member)

Technically, I don't have a suggestion to make it work.

However, I can see that this PR could have an undesirable side effect, not only now but going forward, because it seems to unintentionally advertise SparkContext again by exposing it more and more to new developers in the Spark code base. Given this PR's approach, I'm worried that new developers could add more SparkContext instances to the Spark code repository via the following pattern in order to keep the type annotations. What is your opinion about that?

  val sc: SparkContext = spark.sparkContext

@dongjoon-hyun (Member)

BTW, I'll leave the final decision to you and the other reviewers, @yaooqinn .
So, feel free to proceed if you think this is needed~

@cloud-fan (Contributor)

I think we need to first understand why explicit type annotation works. The current approach looks a bit hacky to me.

@yaooqinn (Member Author) commented May 7, 2025

@dongjoon-hyun, I will make the following change to address the bulk of the SparkContext call sites, and restore the rest to their state before this PR:

--- a/core/src/main/scala/org/apache/spark/util/SerializableConfiguration.scala
+++ b/core/src/main/scala/org/apache/spark/util/SerializableConfiguration.scala
@@ -20,7 +20,9 @@ import java.io.{ObjectInputStream, ObjectOutputStream}

 import org.apache.hadoop.conf.Configuration

+import org.apache.spark.SparkContext
 import org.apache.spark.annotation.{DeveloperApi, Unstable}
+import org.apache.spark.broadcast.Broadcast

 /**
  * Hadoop configuration but serializable. Use `value` to access the Hadoop configuration.
@@ -39,3 +41,9 @@ class SerializableConfiguration(@transient var value: Configuration) extends Ser
     value.readFields(in)
   }
 }
+
+private[spark] object SerializableConfiguration {
+  def broadcast(sc: SparkContext, conf: Configuration): Broadcast[SerializableConfiguration] = {
+    sc.broadcast(new SerializableConfiguration(conf))
+  }
+}
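If this helper is adopted, a call site like the one quoted earlier could presumably change as follows (a hypothetical call site, not taken from the PR):

    // Before: a SparkContext-typed local is needed to keep the type annotation.
    val sc: SparkContext = spark.sparkContext
    val broadcastedConf = sc.broadcast(new SerializableConfiguration(hadoopConf))

    // After: the SparkContext usage is confined to the helper in core.
    val broadcastedConf = SerializableConfiguration.broadcast(spark.sparkContext, hadoopConf)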

@yaooqinn (Member Author) commented May 7, 2025

The current approach looks a bit hacky to me.

Hi @cloud-fan, I don't think the approach here is hacky, because type annotation for declarations is a common language feature.

The hacky part might belong to IDEA itself, which struggles to understand the implicit definitions from the shims approach we applied in Spark.

@github-actions github-actions bot added the CORE label May 7, 2025
@@ -20,7 +20,9 @@ import java.io.{ObjectInputStream, ObjectOutputStream}

import org.apache.hadoop.conf.Configuration

import org.apache.spark.SparkContext
@yaooqinn (Member Author) commented May 7, 2025

@dongjoon-hyun,

Except for changes in the hive-thriftserver module, this is now the only place that imports SparkContext, and it's in Spark core.

Member

Thank you, @yaooqinn .

@@ -53,16 +53,14 @@ case class ParquetScanBuilder(
override protected val supportsNestedSchemaPruning: Boolean = true

override def pushDataFilters(dataFilters: Array[Filter]): Array[Filter] = {
val sqlConf = sparkSession.sessionState.conf
if (sqlConf.parquetFilterPushDown) {
Member

The original code looks okay to me. Could you confirm that this part also had broken type annotations before this PR? Or is this simply revised to reuse conf instead of sqlConf at line 56?

@dongjoon-hyun (Member) left a comment

This PR has multiple types of improvements.

Although I understand these were all changed to serve the same goal (the IDE type-annotation issue), if you don't mind, I'd like to propose splitting this PR into more narrowly scoped ones, to be safer. For example:

  1. A PR to change Class Loader usage.
withClassLoader
- // Always use the latest class loader provided by executionHive's state.
- val executionHiveClassLoader = session.sharedState.jarClassLoader
- Thread.currentThread().setContextClassLoader(executionHiveClassLoader)
 protected def withClassLoader(f: => Unit): Unit = {
    val executionHiveClassLoader = session.sharedState.asInstanceOf[SharedState].jarClassLoader
    Thread.currentThread().setContextClassLoader(executionHiveClassLoader)
    f
  }
  2. A PR to add new methods and use them
final protected def sessionState: SessionState = session.sessionState
final protected def sparkContext: SparkContext = session.sparkContext
private def sparkSessionState: SparkSessionState = SparkSQLEnv.sparkSession.sessionState
  3. A PR for switching from creating new variables to reusing existing ones.

  4. A PR for the explicit type casts with asInstanceOf.

WDYT?

@@ -52,7 +52,7 @@ private[sql] object AvroUtils extends Logging {
spark: SparkSession,
options: Map[String, String],
files: Seq[FileStatus]): Option[StructType] = {
val conf = spark.sessionState.newHadoopConfWithOptions(options)
val conf = spark.sessionState.asInstanceOf[SessionState].newHadoopConfWithOptions(options)
Contributor

Changes like this are not very reasonable; they're only needed to work around the current IntelliJ limitation of not recognizing the dependency exclusion.

I'm a bit hesitant to merge this PR. If we need a temporary fix for IntelliJ support, shall we use Maven profiles, with the default profile not including connect-shims?

Contributor

This can be tested experimentally in IntelliJ by manually ignoring the connect-shims module: doing so causes the sql-api module to be marked with compilation errors. For instance, the definitions of JavaRDD and RDD cannot be found when working with classes like org.apache.spark.sql.DataFrameReader.
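For context, a simplified sketch of the kind of signatures involved (DataFrameReaderSketch is a hypothetical name used only for illustration; the real org.apache.spark.sql.DataFrameReader is more elaborate):

    import org.apache.spark.api.java.JavaRDD
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.DataFrame

    // Overloads like these mention RDD/JavaRDD in their signatures, so sql-api
    // needs some definition of those types on its compile classpath - normally
    // the connect-shims stubs. Ignoring that module breaks sql-api compilation.
    trait DataFrameReaderSketch {
      def json(jsonRDD: JavaRDD[String]): DataFrame
      def json(jsonRDD: RDD[String]): DataFrame
    }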

@LuciferYang (Contributor) commented May 7, 2025

@yaooqinn @cloud-fan @dongjoon-hyun

I've noticed that the connect-client-jvm module still relies on the core module during runtime.

For example, the SparkConf object passed to the org.apache.spark.sql.connect.SparkSession.Builder#config API is actually defined in the core module. If this dependency is intentional, could the sql-api module also include core as a compile-time dependency? This approach might allow us to reduce the number of stub classes defined in shims.scala, such as SparkContext, SparkConf, JavaRDD, and RDD. Consequently, this could potentially minimize the changes required in this PR.

I’m currently testing this idea in #50815

@dongjoon-hyun (Member)

Thank you, @LuciferYang .

@yaooqinn (Member Author) commented May 8, 2025

Thank you, @dongjoon-hyun and @cloud-fan. I would like to put this pull request on hold and wait for @LuciferYang's #50815.

@LuciferYang (Contributor)

Currently, we have not shaded the connect-shims into connect-client-jvm. As a result, the usage of classes such as SparkContext and SparkConf falls into two distinct scenarios:

  1. When using Ammonite spark-shell, connect-client-jvm relies on the class definitions from the spark-core and spark-sql jars.
  2. When connect-client-jvm is included as a dependency in a third-party project, it will transitively pull in connect-shims. In this case, there is no longer a need to include spark-core and spark-sql.

Therefore, although the tests for #50815 may pass, I still recommend maintaining the current definitions in shims.scala. Otherwise, in Scenario 2, we would end up transitively introducing spark-core into the third-party project, which seems counter to the original design intent.

@yaooqinn (Member Author) commented May 8, 2025

Thank you for the input, @LuciferYang.
