Skip to content

Conversation

@linghengqian
Copy link
Member

@linghengqian linghengqian commented Jan 6, 2026

Fixes #32793 .

Changes proposed in this pull request:

  • Add nullability checks to the Agent's OpenTelemetry plugin.

Before committing this PR, I'm sure that I have checked the following options:

  • My code follows the code of conduct of this project.
  • I have self-reviewed the commit code.
  • I have (or in comment I request) added corresponding labels for the pull request.
  • I have passed maven check locally : ./mvnw clean install -B -T1C -Dmaven.javadoc.skip -Dmaven.jacoco.skip -e.
  • I have made corresponding changes to the documentation.
  • I have added corresponding unit tests for my changes.
  • I have updated the Release Notes of the current development version. For more details, see Update Release Note

@linghengqian linghengqian added type: bug feature: agent shardingsphere agent labels Jan 6, 2026
@linghengqian linghengqian marked this pull request as ready for review January 7, 2026 12:02
@linghengqian linghengqian requested a review from Copilot January 7, 2026 12:03
@linghengqian linghengqian added this to the 5.5.3 milestone Jan 7, 2026
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request enhances the robustness of the Agent's OpenTelemetry plugin by adding null safety checks for span operations and aligns the Agent's Dockerfile with the Proxy's JDK version.

  • Added null checks for parentSpan before calling setParent() to prevent NPEs when no parent span exists
  • Added null checks for span objects before calling lifecycle methods (setStatus(), end(), recordException())
  • Updated Agent's Dockerfile from JDK 23 to JDK 25 to match the Proxy's configuration

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
distribution/agent/Dockerfile Updates base image from eclipse-temurin:23-jdk to eclipse-temurin:25-jdk to align with Proxy's Dockerfile
agent/plugins/tracing/type/opentelemetry/src/main/java/org/apache/shardingsphere/agent/plugin/tracing/opentelemetry/advice/OpenTelemetrySQLParserEngineAdvice.java Adds null checks for parentSpan before setting parent context and for span before lifecycle operations
agent/plugins/tracing/type/opentelemetry/src/main/java/org/apache/shardingsphere/agent/plugin/tracing/opentelemetry/advice/OpenTelemetryJDBCExecutorCallbackAdvice.java Adds null checks for parentSpan before setting parent context and for span before lifecycle operations
agent/plugins/tracing/type/opentelemetry/src/test/java/org/apache/shardingsphere/agent/plugin/tracing/opentelemetry/advice/OpenTelemetrySQLParserEngineAdviceTest.java Adds test case for null parent span scenario and refactors test assertions to support both valid and invalid parent span IDs

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Member

@terrymanu terrymanu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. Direction is right, but a few changes are still needed before merging:

  • Good parts: The null guards in both advices address the NPE root cause reported in #32793. Adding the no-parent-span test for OpenTelemetrySQLParserEngineAdvice is the right move.
  • Issues to fix:
    1. Missing symmetric coverage for the JDBC path: please add a no-parent-span/null-attachment test for OpenTelemetryJDBCExecutorCallbackAdvice to match the SQL parser coverage and prevent regressions.
    2. The base image bump in distribution/agent/Dockerfile from eclipse-temurin:23-jdk to 25-jdk is unrelated to the NPE fix and introduces unvalidated compatibility risk (ByteBuddy/OTel/agent on non-LTS). Please revert it in this PR; if you want the upgrade, raise a separate PR with validation evidence.
    3. Scope vs. issue #32793: Elastic APM 9.2.3 connectivity/backpressure noted in the issue comments is not covered here. If the intent is NPE-only, please clarify that the remaining APM/metrics problem will be handled separately; otherwise, add the fix and validation.
  • Newly introduced risk: When parentSpan is absent, spans become orphaned (invalid parent). If that is intended, document it (or add assertions). If not, consider using the current context or emitting a warning.

Please address the above, and we can re-review quickly.

Copy link
Member Author

@linghengqian linghengqian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Missing symmetric coverage for the JDBC path: please add a no-parent-span/null-attachment test for OpenTelemetryJDBCExecutorCallbackAdvice to match the SQL parser coverage and prevent regressions.
  • Solve.

The base image bump in distribution/agent/Dockerfile from eclipse-temurin:23-jdk to 25-jdk is unrelated to the NPE fix and introduces unvalidated compatibility risk (ByteBuddy/OTel/agent on non-LTS). Please revert it in this PR; if you want the upgrade, raise a separate PR with validation evidence.

  • It has been rolled back.

Scope vs. issue https://github.com/apache/shardingsphere/issues/32793 : Elastic APM 9.2.3 connectivity/backpressure noted in the issue comments is not covered here. If the intent is NPE-only, please clarify that the remaining APM/metrics problem will be handled separately; otherwise, add the fix and validation.

  • This is essentially beyond the scope of the original issue and remains unresolved, as there aren't even any trace-related error logs. Either the ShardingSphere JDBC is using an outdated version of the Otel SDK; or the ShardingSphere agent's Dockerfile uses JDK 23, which is incompatible with older versions of the Otel SDK; or the Elastic Agent doesn't allow sending trace-only data, while Jaeger does; or the Otel SDK has a bug. Overall, I would say this definitely deserves a new issue. Because I can't figure out how to write integration tests for this overly complex reproduction process.

  • ShardingSphere agent lacks graceful shutdown functionality when using the Otel SDK; forcibly shutting down the main application will cause the agent to throw an exception. This has gone beyond the scope of the original issue.

Newly introduced risk: When parentSpan is absent, spans become orphaned (invalid parent). If that is intended, document it (or add assertions). If not, consider using the current context or emitting a warning.

  • When there is no parent span, the Otel SDK should start a root span (because the parent span ID is invalid) instead of attaching an arbitrary context; this is consistent with Otel's behavior when a parent is missing, see the explanation at open-telemetry/opentelemetry-specification#3781 . The SLF4J Java API cannot be used here because the logs should be sent from the main application, not the agent.

@RaigorJiang
Copy link
Contributor

@linghengqian Version 5.5.3 is scheduled for release soon. Should we adjust the milestone for this issue to 5.5.4?

@linghengqian linghengqian modified the milestones: 5.5.3, 5.5.4 Jan 9, 2026
@linghengqian
Copy link
Member Author

@linghengqian Version 5.5.3 is scheduled for release soon. Should we adjust the milestone for this issue to 5.5.4?

  • There's no problem with this, as it doesn't affect the previous unit tests.

@terrymanu terrymanu merged commit bb2c480 into apache:master Jan 9, 2026
145 checks passed
@linghengqian linghengqian deleted the fix-agent branch January 9, 2026 04:31
@terrymanu terrymanu modified the milestones: 5.5.4, 5.5.3 Jan 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature: agent shardingsphere agent type: bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Using the OpenTelemerty plugin with Agent will throw some undocumented exceptions

3 participants