
Lobe Pruning optimization in MaterialX proper #2680

@ppenenko

Introduction

This proposal is to port the Lobe Pruning optimization from external renderers to MaterialX core, enabling runtime GPU performance improvements for all MaterialX-based renderers.

Motivation

The MaterialX community has identified shader optimization as a key priority - see the call for action in issue #2480.

Complex physically-based shading models like OpenPBR and Standard Surface generate shader code with numerous BSDF lobes (subsurface, transmission, coat, etc.). When material parameters disable certain lobes, for example by setting subsurface_weight to 0, the generated shader still contains all the evaluation code for those unused features. This leads to:

  • Shader code bloat which complicates navigation and debugging
  • Pushing texture and uniform limits on platforms like WebGPU
  • Wasted GPU cycles evaluating unused BSDFs and their input networks
  • Increased register pressure reducing GPU occupancy, which severely degrades performance

The above problems are especially acute on mobile platforms and on integrated GPUs.

Lobe Pruning addresses these problems by analyzing the shader graph before code generation and eliminating branches that contribute nothing to the final result. When lobe weight parameters are compile-time constants (known at shader generation time, analogous to constexpr in C++) with values of 0 or 1, entire subgraphs can be safely removed.

Expected Performance Impact

An implementation of the Lobe Pruning algorithm in OpenUSD PR #3525 demonstrates runtime performance improvements of up to 4x. We should expect similar results for the MaterialX implementation.

Prior Work

There are currently two open-source implementations of the Lobe Pruning optimization by @JGamache-autodesk, which have been battle-tested and demonstrate significant performance benefits.

Core Algorithm

Both implementations share the same optimization algorithm with three pruning rules:

  1. Mix nodes: If the mix factor is 0 or 1, forward one input and delete the node
  2. Multiply nodes: If either input is a constant 0, replace the node with a constant 0 and delete it; the subgraph feeding its other input becomes dead code
  3. BSDF nodes: If a weight parameter is 0, substitute with a no-op "dark" BSDF and delete the original node
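To make the rules concrete, here is a minimal sketch of the three rewrites on a toy dictionary-based node graph. The representation and all names (`mix`/`fg`/`bg`, `weight`, `dark_bsdf`) are illustrative simplifications, not the MaterialX C++ API:

```python
# Toy sketch of the three Lobe Pruning rules on a dictionary-based node
# graph. All names (mix/fg/bg, "weight", dark_bsdf) are illustrative
# simplifications, not the MaterialX C++ API.

DARK_BSDF = {"op": "dark_bsdf", "inputs": {}}  # no-op stand-in for a zero-weight BSDF

def const(v):
    """A compile-time constant input, analogous to a constexpr value."""
    return {"op": "constant", "value": v, "inputs": {}}

def const_val(inputs, name):
    inp = inputs.get(name)
    return inp["value"] if inp is not None and inp["op"] == "constant" else None

def prune(node):
    """Bottom-up rewrite applying the three pruning rules."""
    if node["op"] == "constant":
        return node
    inputs = {k: prune(v) for k, v in node["inputs"].items()}
    node = {**node, "inputs": inputs}
    # Rule 1: mix with a constant factor of 0 or 1 forwards one input.
    if node["op"] == "mix":
        if const_val(inputs, "mix") == 0.0:
            return inputs["bg"]
        if const_val(inputs, "mix") == 1.0:
            return inputs["fg"]
    # Rule 2: multiply by a constant 0 collapses to a constant 0.
    if node["op"] == "multiply":
        if const_val(inputs, "in1") == 0.0 or const_val(inputs, "in2") == 0.0:
            return const(0.0)
    # Rule 3: a BSDF with a constant weight of 0 becomes a "dark" no-op BSDF.
    if node["op"].endswith("_bsdf") and const_val(inputs, "weight") == 0.0:
        return DARK_BSDF
    return node

# A lobe mixed in with a compile-time weight of 0: the whole fg branch is dropped.
graph = {"op": "mix",
         "inputs": {"mix": const(0.0),
                    "fg": {"op": "subsurface_bsdf", "inputs": {"weight": const(1.0)}},
                    "bg": {"op": "dielectric_bsdf", "inputs": {}}}}
print(prune(graph)["op"])  # dielectric_bsdf
```

In the real implementations, the same pass also sweeps away the now-disconnected upstream network (texture fetches, math nodes) feeding the removed lobe.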

Both implementations target standard PBR BSDF nodes from the MaterialX specification (e.g., burley_diffuse_bsdf, conductor_bsdf, subsurface_bsdf, dielectric_bsdf, sheen_bsdf) and generate special "dark" BSDF nodes as no-op replacements for zero-weight BSDFs (e.g., dark_base_bsdf, dark_layer_bsdf).

When enabled, the optimizer traverses the shader graph and eliminates entire subgraphs.

Example: For a Standard Surface material with subsurface_weight = 0.0, the optimizer will:

  1. Eliminate the entire subsurface BSDF evaluation
  2. Remove all upstream texture fetches and math nodes feeding into subsurface
  3. Simplify the final mix operations

Maya USD Implementation

Maya USD has a production implementation of Lobe Pruning for a non-Storm Hydra render delegate drawing via Maya's Viewport 2.0, introduced in November 2024.

Key characteristics:

  • Uses MaterialX APIs (no USD dependencies in the core logic)
  • Operates at the MaterialX document/NodeGraph level
  • Tightly integrated with topology-neutral graph generation for efficient shader caching

Hydra Storm Implementation

Autodesk has developed and submitted Lobe Pruning to Hydra Storm in OpenUSD PR #3525 (under review since February 2025).

In contrast to Maya USD, this implementation operates on Hydra material networks instead of MaterialX node graphs, in order to properly integrate with USD's material workflows. These Hydra networks are created from MaterialX documents and are eventually converted to MaterialX documents for code generation.

Proposal: ShaderGraph-Level Integration

Motivation for the ShaderGraph Target

MaterialX Issue #2566 proposes a fundamental architectural change to shader generation. The Visitor Pattern API will allow MaterialX to generate shaders directly from non-MaterialX data sources such as Hydra material networks, without converting them to a MaterialX document first.

This has critical implications for Lobe Pruning:

  • Forward compatibility: If we implement Lobe Pruning at the document/NodeGraph level, it will become obsolete when Visitor Pattern changes land
  • Universal applicability: Optimizations at the ShaderGraph level benefit all input sources (MaterialX documents, USD networks, custom formats)
  • Correct architectural layer: ShaderGraph is the runtime representation used during code generation — the natural place for generation-time optimizations

Integration Point: ShaderGraph::optimize()

MaterialX already has optimization infrastructure in ShaderGraph::optimize() - see, e.g., PR #2499.

Relationship with Existing Optimizations

Premultiplied Add

Core Algorithm

The idea of this optimization is to transform mix(a, b, weight) operations into the add(a * (1 - weight), b * weight) pattern.
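The equivalence, and the evaluation-skipping it enables, can be illustrated with plain Python thunks standing in for lobe evaluations (a sketch of the idea, not shader code):

```python
# Sketch of the Premultiplied Add rewrite using Python thunks as stand-ins
# for lobe evaluation; real GPUs skip the zero-scaled term in hardware when
# `weight` is a uniform, which is modeled here by not calling the thunk.

def mix(a, b, weight):
    """Original form: both lobes are always evaluated."""
    return a() * (1.0 - weight) + b() * weight

def premultiplied_add(a, b, weight):
    """Rewritten form add(a * (1 - weight), b * weight)."""
    total = 0.0
    for factor, lobe_fn in ((1.0 - weight, a), (weight, b)):
        if factor != 0.0:  # the skip a GPU can perform for uniform weights
            total += factor * lobe_fn()
    return total

calls = {"a": 0, "b": 0}

def lobe(name, value):
    def evaluate():
        calls[name] += 1  # count how often each lobe is actually evaluated
        return value
    return evaluate

a, b = lobe("a", 0.25), lobe("b", 0.75)
assert mix(a, b, 0.0) == premultiplied_add(a, b, 0.0) == 0.25
print(calls)  # mix evaluated both lobes; premultiplied_add skipped b
```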

This enables a run-time GPU hardware optimization: when weight has a uniform value, modern GPUs are able to skip evaluating the zero-weighted term.

It's important to note that this transformation is counter-productive for path tracing targets (OSL, MDL) which benefit from early elimination of terms with zero contributions.

Manual Implementation (NodeGraph Level)

MaterialX PRs #2459, #2483, and #2493 manually apply the Premultiplied Add optimization to the OpenPBR, Standard Surface and glTF PBR NodeGraph definitions, respectively. The mix and add operations are implemented by separate nodes in the node graph.

This work achieved significant performance improvements for real-time rendering and motivated further exploration of performance optimizations. Later, the optimization was restricted to real-time shading languages only by specializing the node graphs for them, in order to avoid impacting path tracing targets.

Limitations:

  • Manual and static: Each case has to be optimized separately and manually.
  • Target-specific: Requires specializing node graphs per target.
  • Library node graphs only: Doesn't help users optimize custom node graphs.
  • Community consensus: optimizing at the node graph level is regarded as a premature optimization.

Automated Implementation (ShaderGraph Level)

MaterialX PR #2499 (currently under review) implements the Premultiplied Add transformation programmatically during shader generation.

Advantages over the manual approach:

  • Automatic: No case-by-case manual editing required
  • Can be controlled by GenOptions to enable/disable per target
  • Works at the ShaderGraph level, benefiting all possible input sources (document, Visitor Pattern, etc.)
  • Precedent for ShaderGraph-level optimizations

Relationship to Lobe Pruning

Lobe Pruning is expected to provide additional performance gains over Premultiplied Add thanks to:

  • reduced register pressure;
  • eliminating unused uniform and texture bindings;
  • reduced shader code complexity and, therefore, reduced individual shader compilation times.

The available performance measurements for the two optimizations seem to support these expectations.

At the same time, Lobe Pruning requires the weight parameters to be compile-time constants of 0 or 1, which is not always possible. Also, applying Lobe Pruning increases the number of shader permutations to compile, which may be an undesirable tradeoff for some applications.

Therefore, the two optimizations should be complementary and controlled by separate codegen options. They should coexist in ShaderGraph::optimize() and share the code for detecting optimization opportunities.
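As a rough illustration of how the two could coexist behind separate options while sharing a detection pass, here is a hypothetical sketch; the option names, graph representation, and detection condition are invented for illustration and are not existing MaterialX GenOptions fields or APIs:

```python
# Hypothetical sketch of two independent codegen options sharing one
# detection pass. The option names, graph representation, and detection
# condition are invented; they are not existing MaterialX APIs.
from dataclasses import dataclass

@dataclass
class GenOptions:
    prune_lobes: bool = False        # Lobe Pruning on/off
    premultiplied_add: bool = False  # Premultiplied Add on/off

def find_candidate_mixes(graph):
    """Shared detection of candidate mix nodes (simplified to a flag here)."""
    return [n for n in graph if n["op"] == "mix" and n["weight_is_constant"]]

def optimize(graph, opts):
    applied = []
    for node in find_candidate_mixes(graph):
        if opts.prune_lobes and node["weight"] in (0.0, 1.0):
            applied.append(("prune", node["name"]))       # drop a whole branch
        elif opts.premultiplied_add:
            applied.append(("premul_add", node["name"]))  # rewrite mix -> add
    return applied

graph = [{"op": "mix", "name": "coat_mix", "weight_is_constant": True, "weight": 0.0},
         {"op": "mix", "name": "sheen_mix", "weight_is_constant": True, "weight": 0.3}]
print(optimize(graph, GenOptions(prune_lobes=True, premultiplied_add=True)))
```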

Topology Caching

The Topology Caching algorithm optimizes MaterialX codegen and shader compilation performance by determining the equivalence of MaterialX materials with respect to codegen. The underlying performance problem was first documented in OpenUSD Issue #2330 (March 2023). It was observed that Storm was generating separate shaders for functionally equivalent MaterialX networks, leading to unnecessary codegen and compilation overhead. This performance issue is applicable not just to Storm but to any renderer using MaterialX codegen.

Maya USD PR #3445 (merged in November 2023) is the first known open-source implementation of the Topology Caching optimization. It analyzes materials via MaterialX APIs and caches the generated shaders in Maya's own shader cache.

Later, OpenUSD PR #3073 (merged June 2024) introduced a similar optimization to Storm. In contrast to Maya USD, the algorithm analyzes Hydra material networks and relies on Hydra's existing instance registry mechanism for caching. Topology Caching improves cache hit rates for that data structure by using topology hashes as cache keys.

Core Algorithm

This optimization anonymizes shader graphs to enable shader reuse across different materials with equivalent topologies. In particular, the algorithm:

  • determines for each node whether it is topological, i.e., whether changes to its inputs affect the code generated for its implementation;
  • normalizes the emitted default uniform values which are otherwise emitted with the current values of the respective MaterialX inputs;
  • anonymizes the node names which otherwise affect the emitted code.
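A toy sketch of the anonymization idea in Python; the network representation and the choice of which inputs count as topological are assumptions for illustration, not the Maya USD or Storm implementation:

```python
# Toy sketch of topology hashing: node names are anonymized in traversal
# order and non-topological input values are dropped, so materials that
# differ only in uniform values produce the same key.
import hashlib
import json

TOPOLOGICAL_INPUTS = {("image", "filename")}  # inputs assumed to affect generated code

def topology_key(nodes, output):
    anon, desc = {}, []

    def visit(name):
        if name in anon:
            return anon[name]
        node = nodes[name]
        inputs = {}
        for in_name, (kind, value) in sorted(node["inputs"].items()):
            if kind == "node":
                inputs[in_name] = visit(value)   # connections are topological
            elif (node["op"], in_name) in TOPOLOGICAL_INPUTS:
                inputs[in_name] = value          # value changes the generated code
            # else: uniform value, normalized away
        anon[name] = f"n{len(anon)}"             # anonymized, order-based name
        desc.append((anon[name], node["op"], inputs))
        return anon[name]

    visit(output)
    return hashlib.sha256(json.dumps(desc, sort_keys=True).encode()).hexdigest()

red = {"color": {"op": "constant", "inputs": {"value": ("value", [1, 0, 0])}},
       "srf": {"op": "standard_surface", "inputs": {"base_color": ("node", "color")}}}
blue = {"tint": {"op": "constant", "inputs": {"value": ("value", [0, 0, 1])}},
        "srf": {"op": "standard_surface", "inputs": {"base_color": ("node", "tint")}}}

print(topology_key(red, "srf") == topology_key(blue, "srf"))  # True: one shader serves both
```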

Similar to Lobe Pruning, the current integrations of Topology Caching are implemented at the node graph level, which is problematic from two perspectives:

  1. The topological property of individual nodes is determined by their concrete implementations at the shader graph level. Because of this, magic strings were necessary in the node graph-level implementations. It was discussed in OpenUSD PR #3073 that shader generators should ideally expose this information via a new API.
  2. With the Visitor Pattern introduction, these node graph-level optimizations will eventually become obsolete.

Therefore, just like with Lobe Pruning, Topology Caching should eventually be ported to the shader graph level.

It's also important to note that Topology Caching has more integration points with the specific host application than Lobe Pruning. First, as we've seen above, the shader cache is typically implemented by the host application, outside of MaterialX, and Topology Caching needs to integrate with it.

At the same time, Topology Caching has implications for material editing workflows. When a material is edited, the host application needs to determine whether the change affects the generated source code or only results in a uniform value or texture binding change. As an example of this logic, Maya USD's implementation generates a watch list mapping material attributes to their topological classification for efficient invalidation tracking. This is another integration point with the host application.
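A minimal sketch of what such a watch list could look like; the attribute names and classifications are hypothetical, not Maya USD's actual data or API:

```python
# Hypothetical sketch of a watch list mapping material attributes to their
# topological classification; attribute names and classifications are
# invented for illustration, not Maya USD's actual data or API.

WATCH_LIST = {
    "base_color":        False,  # value-only: patch the uniform in place
    "subsurface_weight": True,   # drives a pruning decision: regenerate the shader
    "coat_weight":       True,   # likewise topological under Lobe Pruning
}

def on_attribute_changed(attr):
    """Decide between a cheap uniform update and a full shader regeneration."""
    if WATCH_LIST.get(attr, True):  # unknown attributes: regenerate to be safe
        return "regenerate_shader"
    return "update_uniform"

print(on_attribute_changed("base_color"))         # update_uniform
print(on_attribute_changed("subsurface_weight"))  # regenerate_shader
```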

Relationship with Lobe Pruning

Lobe Pruning is integrated with Topology Caching in both existing implementations, with Topology Caching predating Lobe Pruning. The Maya USD integration is tighter, with both optimizations applied in the same graph traversal for efficiency. However, the two can be decoupled, controlled by separate settings and ported separately.

In terms of performance effects:

  • Topology Caching reduces codegen and compilation time by reducing the number of generated shaders for the given number of materials.
  • Lobe Pruning:
    • Reduces runtime GPU cost by simplifying the shaders.
    • Codegen and compilation impact is nuanced: Creates more shader permutations (increasing the number of work items), but each individual shader is simplified (making each work item cheaper). Permutations can be compiled in parallel, and cache hits eliminate recompilation, so having a good cache implementation becomes even more important.

In the interest of separation of concerns and making MaterialX contributions more atomic, the scope of this proposal is limited to Lobe Pruning.

Risks / Considerations

Optimizations based on the Lobe Pruning algorithm have already been deployed in multiple production systems, which serve as a proof of concept.

Since the proposed changes are only supposed to implement a performance optimization and not modify the rendering behavior, existing MaterialX and USD test workflows can be used to guard against correctness regressions.

Finally, the optimization should remain optional, controlled with a run-time switch, for the sake of performance and correctness comparisons and to provide a fallback in the case of regressions.

Performance verification

The results of this optimization should be thoroughly verified with the following mechanisms:

  • Instrumentation in MaterialX, likely requiring a new tracing mechanism similar to that of OpenUSD
  • Shader code size: target language source line count, SPIR-V binary size
  • Compilation time: Measured with the proposed MaterialX instrumentation
  • Offline analysis with GPU vendor tools for the following metrics:
    • Register pressure
    • Resource usage: Uniform and texture sampler counts
  • Runtime performance: FPS measurements in representative scenes
