## Context
TVM-FFI currently provides two mechanisms for attaching behavior to types:
| Mechanism | Storage | Lookup | Inheritance | Designed For |
|---|---|---|---|---|
| Member methods | Row-major (per-type dict) | Name-based hash | Yes (base -> derived) | Per-type methods (`sum`, `append`, ...) |
| `TypeAttrColumn` | Column-major (`std::vector` indexed by `type_index`) | O(1) pointer offset (single array dereference) | No (per-type opt-in) | Builtin dunder dispatch (`__any_hash__`, `__ffi_repr__`, ...) |
`TypeAttrColumn` is currently a closed set — only a handful of builtin
attributes are registered as dense-array columns. The implicit rule is that type
attrs should be "rare and restrained by tvm-ffi package."

This document proposes making type attrs an open set backed by hash maps,
while preserving the dense-array fast paths for the few builtins that
genuinely need them — keeping the `std::vector`-indexed O(1) dispatch that
tvm-ffi's hot loops rely on.
## Motivation: IR Trait Dispatch
Consider a compiler IR built on tvm-ffi. Every IR node can have traits —
static properties that optimization passes query:
- `SideEffectFree`: Can this node be eliminated if its result is unused?
- `Commutative`: Can operands be reordered?
- `HasCanonicalizer`: Does this node type provide a canonicalization rule?
- `Fusible`: Can this node be fused into a neighboring kernel?
These traits are queried across all node types during compilation:
```cpp
void DCE(const Object* node) {
  if (HasTrait(node, "side_effect_free")) {
    // safe to remove if unused
  }
}
```

Under the current closed-set policy, an IR framework cannot define new type
attrs — it must fall back to member method lookup or reinvent its own dispatch
table.
## Problem with Dense Arrays at Scale

The current `TypeAttrColumn` uses a dense `std::vector` indexed by `type_index`.
This works well for a small fixed set of builtins where nearly every type
could have the attribute. But it does not scale to an open set:
- With 500 registered types and 100 user-defined trait columns, each dense
  array allocates 500 entries (8 bytes each) = 4 KB, for a total of 400 KB of
  mostly-null pointers across all columns.
- As the type registry grows (downstream packages registering types), every
  existing column must be resized.
- The sparsity is wasteful: if `Commutative` only applies to 10 out of 500 op
  types, 98% of the column is null.
Dense arrays are the right choice for a small, fixed set of hot-path builtins.
They are the wrong choice for an open, extensible set of user-defined traits.
## Proposed Design: Two-Tier Type Attr Dispatch

### Tier 1: General type attrs via `unordered_map`

The default storage for type attrs is a hash map keyed by
`(attr_name, type_index)`. Any framework can register arbitrary type attrs
without pre-declaring columns:
```cpp
// IR framework registers traits (no EnsureTypeAttrColumn needed)
TypeAttrSet(AddOp::type_index, "ir_trait.commutative", true);
TypeAttrSet(AddOp::type_index, "ir_trait.side_effect_free", true);

// Query: hash map lookup (~20-50 cycles)
AnyView val = TypeAttrGet("ir_trait.commutative", node->type_index());
```

**Performance:** Hash map lookup is ~20-50 cycles. For IR trait queries that
run millions of times during compilation, this adds on the order of tens of
milliseconds per 1M queries — negligible compared to the IR transformations
themselves. This is fast enough for the vast majority of cross-type dispatch
use cases.
**Memory:** Only entries that are actually set are stored. No pre-allocation,
no sparsity waste.

**Open set:** Any downstream framework can define new attrs with any string
key. No allowlist, no pre-registration.
### Tier 2: Dense-array fast paths for hot builtins

For the few builtins that sit on genuinely critical paths — where every
object must be dispatched and the overhead compounds — the system retains
the existing `std::vector`-indexed dense-array columns. These provide a
single pointer-arithmetic dereference (`column.data[type_index]`) with no
hashing, no branching, and no cache misses beyond the one data load — the
same cost model as a C function-pointer slot.
This matters in practice because these builtins run in tight inner loops:
| Builtin | Why it's hot | Access pattern | Specialization |
|---|---|---|---|
| `__any_hash__` | Called on every key in every Map insert/lookup | Tight loop over map entries; hash dispatch is on the critical path | Dense array column (`std::vector` O(1)) |
| `__structural_equal__` | Called on every node pair in IR comparison | Recursive DFS over entire IR trees; called millions of times per pass | Dense array column (`std::vector` O(1)) |
| `__ffi_repr__` | Called once per object in repr | Single traversal, not perf-critical | General tier is fine |
| `__ffi_shallow_copy__` | Called once per object in copy | Single invocation per clone | General tier is fine |
The criterion for tier-2 specialization is empirical: does hash-map overhead
show up in profiles? Most dunder methods (repr, copy, serialization) are
called infrequently enough that the general tier suffices. Only those on
inner-loop critical paths (hash, equality) need the dense-array `std::vector`
fast path.

This is analogous to how CPython works: most dunder methods go through
`tp_dict` (a hash map), but the most critical ones (`tp_hash`,
`tp_richcompare`, `tp_iter`) have dedicated function pointer slots in the
type struct — precisely because they sit on hot loops where even a hash-map
probe would be too costly.
## Implementation Sketch

```cpp
class TypeAttrRegistry {
 public:
  // Tier 1: general hash-map storage
  void Set(int32_t type_index, const std::string& attr_name, Any value) {
    general_[attr_name][type_index] = std::move(value);
  }

  AnyView Get(const std::string& attr_name, int32_t type_index) const {
    auto col_it = general_.find(attr_name);
    if (col_it == general_.end()) return AnyView();
    auto val_it = col_it->second.find(type_index);
    if (val_it == col_it->second.end()) return AnyView();
    return AnyView(val_it->second);
  }

  // Tier 2: dense-array fast path (opt-in per attr)
  // Existing TypeAttrColumn mechanism, unchanged.
  // Used only for __any_hash__, __structural_equal__, etc.

 private:
  // attr_name -> (type_index -> value)
  std::unordered_map<std::string, std::unordered_map<int32_t, Any>> general_;
};
```

For tier-2, the existing `TVMFFITypeAttrColumn` dense-array mechanism is
retained as-is. The two tiers coexist:
- Hot-path builtins are registered in both the dense column (for fast C++
  dispatch) and the general map (for uniform Python-side access via
  `getattr`).
- User-defined attrs go only in the general map.
## Optimization: Per-Column Hash Maps

If even the two-level hash lookup (`attr_name` then `type_index`) is too
slow for certain user-defined attrs, a framework can cache the inner map:

```cpp
// Cache the inner map once at startup
static auto* commutative_map = &TypeAttrRegistry::Get("ir_trait.commutative");

// Fast query: single hash lookup by type_index
bool is_commutative = commutative_map->count(node->type_index());
```

This gives ~20-cycle lookup (a single `unordered_map` probe on a small integer
key) without requiring a dense array allocation for every registered type.
## What This Achieves

### For tvm-ffi core

- **Hot-path builtins stay fast:** `__any_hash__` and `__structural_equal__`
  keep their `std::vector`-indexed O(1) dense-array lookup — a single pointer
  dereference with no hashing or branching. Zero performance regression on
  the inner loops that tvm-ffi cares about most.
- **No sparsity waste:** User-defined attrs use hash maps, not dense arrays.
- **No ABI bloat:** The general tier is a pure C++ implementation detail.
  Dense-array columns remain the ABI for builtins.
- **Clear boundary:** Tier-2 specialization is driven by profiling data, not
  an arbitrary allowlist.
### For downstream frameworks

- **Open set:** Any framework can define new type attrs with any string key.
  No gatekeeping.
- **Uniform API:** `TypeAttrGet(name, type_index)` works for both builtins and
  user-defined attrs. Python-side `getattr` works for everything.
- **No confusion:** Users don't need to know about the two tiers. They define
  attrs, and the system routes them to the right storage.
- **IR traits work:** `SideEffectFree`, `Commutative`, etc. can be defined as
  type attrs and queried efficiently across all node types.
## Python API

```python
@ffi.py_class
class MyIRNode(ffi.Object):
    # Trait: registered as general type attr
    ir_trait = SideEffectFree

    # Repr: registered as type attr (tier-1 general map is fine for repr)
    def __ffi_repr__(self, fn_print):
        return f"MyIRNode(...)"

    # Normal method: registered as member method (row store)
    def simplify(self):
        ...
```

The `@ffi.py_class` decorator routes:

- `__ffi_*__` methods -> type attrs (general tier; tier-2 for known hot
  builtins)
- Class-level trait assignments -> type attrs (general tier)
- Everything else -> member methods
## Comparison with CPython

CPython uses the same two-tier approach:

| Aspect | CPython | Proposed TVM-FFI |
|---|---|---|
| General methods | `tp_dict` (hash map) | General type attr map |
| Hot slots | `tp_hash`, `tp_richcompare`, `tp_iter`, ... | Dense-array columns for `__any_hash__`, ... |
| Criterion | Fixed set of ~40 type slots | Empirically profiled hot builtins |
| Open/closed | Open (any method in `tp_dict`) | Open (any attr in general map) |
## Summary

The current `TypeAttrColumn` system conflates two concerns:

1. **Open extensibility:** Downstream frameworks need to define new type attrs.
2. **Hot-path performance:** A few builtins need sub-10-cycle dispatch via
   direct `std::vector` indexing.

Dense arrays solve (2) but block (1). Hash maps solve (1) but are too slow for
(2). The solution is both:

- **General tier (`unordered_map`):** Open set, no sparsity, ~20-50 cycle
  lookup. Good enough for IR traits, repr, copy, serialization, and the vast
  majority of cross-type dispatch.
- **Specialized tier (dense `std::vector` array):** Closed set of profiled hot
  builtins. Single pointer-offset lookup (~4 cycles). Reserved for
  `__any_hash__`, `__structural_equal__`, and any future builtin that shows up
  in profiles.
This gives tvm-ffi's performance guarantee where it matters — the `std::vector`
O(1) dispatch that hot loops like Map operations and structural comparison
depend on — and open extensibility everywhere else.