
[SPARK-52104][CONNECT][SCALA] Validate column name eagerly in Spark Connect Scala Client #50873


Status: Open. Wants to merge 3 commits into base: master.

Conversation

xi-db
Contributor

@xi-db xi-db commented May 13, 2025

What changes were proposed in this pull request?

Currently, calling DataFrame.col(colName) with a non-existent column in the Scala Spark Connect client does not raise an error. In contrast, both PySpark (in either Spark Connect or Spark Classic) and Scala in Spark Classic do raise an exception in such cases. This leads to inconsistent behavior between Spark Connect and Spark Classic in Scala.

PySpark on Spark Classic:

> spark.range(10)['nonexistent']
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `nonexistent` cannot be resolved. Did you mean one of the following? [`id`]. SQLSTATE: 42703

PySpark on Spark Connect:

> spark.range(10)['nonexistent']
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `nonexistent` cannot be resolved. Did you mean one of the following? [`id`]. SQLSTATE: 42703

Scala on Spark Classic:

> spark.range(10).col("nonexistent")
AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `nonexistent` cannot be resolved. Did you mean one of the following? [`id`]. SQLSTATE: 42703

Scala on Spark Connect:

> spark.range(10).col("nonexistent")
res5: org.apache.spark.sql.Column = unresolved_attribute {
  unparsed_identifier: "nonexistent"
  plan_id: 2
}

No exception is thrown; an unresolved column is returned as-is.

This PR implements eager validation of column names in the DataFrame.col(colName) method of the Scala client, making its behavior consistent with both Spark Classic and PySpark. The implementation is based on PySpark's __getitem__ and verify_col_name methods.
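The idea behind that kind of eager check can be sketched in plain Python: look the requested name up in the DataFrame's schema and, if it is missing, raise an error that suggests close matches. This is a hypothetical illustration only (the helper name, error class, and similarity heuristic are assumptions), not the actual PySpark or Scala client code, which handles nested fields, quoting, and Spark's own error classes.

```python
# Hypothetical sketch of eager column-name validation, loosely modeled on
# PySpark's verify_col_name behavior. Not the actual client implementation.
import difflib


class UnresolvedColumnError(Exception):
    """Stand-in for Spark's AnalysisException (UNRESOLVED_COLUMN)."""


def verify_col_name(name: str, fields: list[str]) -> None:
    """Raise if `name` is not among `fields`, suggesting close matches."""
    if name in fields:
        return
    # Suggest similar column names, mimicking WITH_SUGGESTION messages.
    suggestions = difflib.get_close_matches(name, fields, n=1)
    hint = f" Did you mean one of the following? {suggestions}." if suggestions else ""
    raise UnresolvedColumnError(
        f"[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column with name `{name}` "
        f"cannot be resolved.{hint}"
    )
```

With a schema of ["id"], verify_col_name("id", ...) returns silently while verify_col_name("nonexistent", ...) raises, which is the behavior difference this PR addresses.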

Now, it will throw an error in Scala client on Spark Connect:

> spark.range(10).col("nonexistent")
AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `nonexistent` cannot be resolved. Did you mean one of the following? [`id`]. SQLSTATE: 42703

Why are the changes needed?

This PR ensures consistent behavior between Spark Connect and Spark Classic in the scenario described above.

Does this PR introduce any user-facing change?

Yes, referencing a non-existent column in the Scala client will now throw an error instead of silently returning an unresolved column.

How was this patch tested?

New test case.

Was this patch authored or co-authored using generative AI tooling?

No.

@xi-db
Contributor Author

xi-db commented May 13, 2025

Hi @zhengruifeng, I'm fixing the behaviour difference when referencing a non-existent column in the Spark Connect Scala client, based on PySpark's __getitem__ and verify_col_name methods. Could you please review this PR?

@zhengruifeng
Contributor

you may need to resolve the test failures, otherwise LGTM

@hvanhovell
Contributor

@xi-db the Connect API is supposed to be lazy. That we did this in Python is a mistake. Concretely, I can see two problems with this:

  • It can create quite a few more extra RPCs.
  • It is misleading. By the time you submit something for execution, the underlying data might have changed, so you would see the failure at execution time anyway. Eager validation works for classic because analysis is eager there and the Dataset is bound at definition time instead of execution time.

@xi-db
Contributor Author

xi-db commented Jun 10, 2025

@hvanhovell Thank you, makes sense. In this case, we can close this PR. Do we want to remove the column name validation from __getitem__ on PySpark, so df['non_existing_col'] won't trigger an analysis call? cc @zhengruifeng

3 participants