Skip to content

Conversation

@linhongliu-db
Copy link
Contributor

What changes were proposed in this pull request?

This PR adds the complete serialization/deserialization infrastructure for parsing metric view YAML definitions:

  • Add Jackson YAML dependencies to pom.xml
  • Implement canonical model for metric views:
    • Column, Expression (Dimension/Measure), MetricView, Source
    • YAMLVersion validation and exception types
  • Implement version-specific serde (v0.1):
    • YAML deserializer/serializer
    • Base classes for extensibility
  • Add JSON utilities for metadata serialization

Why are the changes needed?

SPIP: Metrics & semantic modeling in Spark

Does this PR introduce any user-facing change?

No

How was this patch tested?

build/sbt "catalyst/testOnly org.apache.spark.sql.metricview.serde.MetricViewFactorySuite"

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code
Co-Authored-By: Claude [email protected]

… for metric views

This commit adds the complete serialization/deserialization infrastructure
for parsing metric view YAML definitions:

- Add Jackson YAML dependencies to pom.xml
- Implement canonical model for metric views:
  - Column, Expression (Dimension/Measure), MetricView, Source
  - YAMLVersion validation and exception types
- Implement version-specific serde (v0.1):
  - YAML deserializer/serializer
  - Base classes for extensibility
- Add JSON utilities for metadata serialization
pom.xml Outdated
<dependency>
<groupId>com.fasterxml.jackson.dataformat</groupId>
<artifactId>jackson-dataformat-yaml</artifactId>
<version>${fasterxml.jackson.version}</version>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all jackson deps are managed by jackson-bom now, you don't need to declare it again here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


object JsonUtils {
// Singleton ObjectMapper that can be used across the project
private lazy val mapper: ObjectMapper = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the comments, should we try to use this singleton as much as possible in the subsequent development?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'm surprised that there is no such util in the Spark repo. For now, I plan to use it for all the metric view development, but not sure how much other code needs this.

// Only parse the "version" field and ignore all others
@JsonIgnoreProperties(ignoreUnknown = true)
case class YAMLVersion(version: String) extends Validatable {
def validYAMLVersions: Set[String] = Set("0.1")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def validYAMLVersions: Set[String] = Set("0.1")
private def validYAMLVersions: Set[String] = Set("0.1")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


val mapper = new ObjectMapper(yamlFactory)
// Exclude null values from serialized output
.setSerializationInclusion(JsonInclude.Include.NON_NULL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.setSerializationInclusion(JsonInclude.Include.NON_NULL)
. setDefaultPropertyInclusion(JsonInclude.Include.NON_NULL)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// Exclude null values from serialized output
.setSerializationInclusion(JsonInclude.Include.NON_NULL)
// Exclude empty collections/strings from serialized output
.setSerializationInclusion(JsonInclude.Include.NON_EMPTY)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.setSerializationInclusion(JsonInclude.Include.NON_EMPTY)
. setDefaultPropertyInclusion(JsonInclude.Include.NON_EMPTY)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

) extends ColumnBase

object ColumnV01 {
def fromCanonical(canonical: Column[_ <: Expression], ordinal: Int): ColumnV01 = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is ordinal mandatory?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, removed

Try(CatalystSqlParser.parseQuery(sourceText)) match {
case Success(_) => SQLSource(sourceText)
case Failure(queryEx) =>
throw queryEx
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should queryEx be wrapped into MetricViewValidationException?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@linhongliu-db
Copy link
Contributor Author

cc @cloud-fan to review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants