Summary
The JNI specification uses Modified UTF-8 (MUTF-8), not standard UTF-8, for class names, method names, and field names. Our codebase consistently treats these names as plain ASCII — and in practice, they always are. This issue documents the full analysis for awareness and to inform future work (e.g. #10795 trimmable type maps).
Practical impact: zero. A search across dotnet/android, dotnet/java-interop, dotnet/maui, and dotnet/runtime found no bug reports related to MUTF-8 encoding. The ASCII-only assumption has held across 6+ years and millions of apps.
Background: Standard UTF-8 vs Modified UTF-8
The JNI spec describes two differences from standard UTF-8:
| Situation | Standard UTF-8 | Modified UTF-8 |
|---|---|---|
| NUL character | 0x00 (1 byte) | 0xC0 0x80 (2 bytes) |
| Supplementary (non-BMP) characters (U+10000+) | 0xF0... (4 bytes) | Two 3-byte sequences 0xED 0xA... + 0xED 0xB... (surrogate pair, 6 bytes) |
For class names the NUL case is irrelevant. The surrogate pair / non-BMP case is the only theoretical risk: if a class name contained emoji or a CJK Extension B+ character, the bytes in MUTF-8 would be a 6-byte surrogate pair, which Encoding.UTF8 would decode as six replacement characters (U+FFFD).
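The two rows of the table can be reproduced with a small encoder. The following is a minimal C++ sketch, not code from dotnet/android or java-interop; the function names encode_utf8, encode_mutf8, and append_bmp are hypothetical and exist only for illustration.

```cpp
#include <cstdint>
#include <string>

// Append a code point below U+10000 using the ordinary 1-, 2-, or 3-byte forms,
// which are shared by standard UTF-8 and Modified UTF-8.
static void append_bmp(std::string &out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Standard UTF-8: a supplementary code point is one 4-byte sequence.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp >= 0x10000) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        append_bmp(out, cp);
    }
    return out;
}

// Modified UTF-8: NUL becomes C0 80, and a supplementary code point is split
// into a UTF-16 surrogate pair whose halves are each written as 3 bytes.
std::string encode_mutf8(std::uint32_t cp) {
    std::string out;
    if (cp == 0) {
        out += "\xC0\x80";
    } else if (cp >= 0x10000) {
        const std::uint32_t v = cp - 0x10000;
        append_bmp(out, 0xD800 + (v >> 10));    // high surrogate
        append_bmp(out, 0xDC00 + (v & 0x3FF));  // low surrogate
    } else {
        append_bmp(out, cp);
    }
    return out;
}
```

For U+1F600 the first function yields F0 9F 98 80 and the second yields ED A0 BD ED B8 80; for any code point from U+0001 to U+FFFF both return the same bytes.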
JNI encoding by API — with citations
| JNI API | Encoding | Source |
|---|---|---|
| NewString / GetStringChars / GetStringRegion | UTF-16 (jchar*) | JNI spec - String Operations |
| NewStringUTF / GetStringUTFChars / GetStringUTFRegion | Modified UTF-8 (char*) | JNI spec - String Operations |
| FindClass name arg | Modified UTF-8 | JNI spec - FindClass |
| GetMethodID / GetFieldID name+sig args | Modified UTF-8 | JNI spec - GetMethodID |
| RegisterNatives name+sig fields | Modified UTF-8 | JNI spec - RegisterNatives |
| Java String heap representation | UTF-16 (compact 8-bit for ASCII since Android 8) | Android JNI Tips |
Key quote from Android JNI Tips:
"The Java programming language uses UTF-16. For convenience, JNI also provides methods that work with Modified UTF-8... Data passed to NewStringUTF must be in Modified UTF-8 format. ... CheckJNI — enabled by default for emulators — scans strings and aborts the VM if it receives invalid input."
What's safe: normal string marshalling
JniEnvironment.Strings in dotnet/java-interop uses UTF-16 JNI APIs exclusively (NewString/GetStringChars) for all normal Java-to-C# string marshalling — method arguments, return values, field reads/writes. This is completely immune to MUTF-8 issues.
Encoding inconsistencies in the codebase
These are not bugs in practice (all real-world class names are ASCII), but are worth documenting for awareness.
1. TypeManager.GetClassName — MUTF-8 decoded as Latin-1
The native function get_java_class_name_for_TypeManager calls GetStringUTFChars (returns MUTF-8), strdups the result, replaces . with /, and returns a char*. The native code is aware this is MUTF-8 — the local variable is even named mutf8:
const char *mutf8 = env->GetStringUTFChars(name, nullptr);
char *ret = strdup(mutf8);
// ... replace '.' with '/' ...
return ret;

The managed caller decodes the returned bytes with Marshal.PtrToStringAnsi, which interprets them as Latin-1 (ISO-8859-1):
IntPtr ptr = RuntimeNativeMethods.monodroid_TypeManager_get_java_class_name(class_ptr);
return Marshal.PtrToStringAnsi(ptr);

For ASCII, Latin-1/UTF-8/MUTF-8 are byte-identical, so this works. For non-ASCII it would produce mojibake, but this path is only used for fallback type lookup/error logging.
History: the original 2016 implementation (initial import) used the UTF-16 JNI path and handled all Unicode correctly:
return JNIEnv.GetString(
JNIEnv.CallObjectMethod(class_ptr, JNIEnv.mid_Class_getName),
JniHandleOwnership.TransferLocalRef).Replace(".", "/");

This was replaced in PR #3729 (Oct 2019, "JNIEnv.Initialize optimization") to save ~30ms on startup by moving the work to native code. PtrToStringAnsi was the natural P/Invoke idiom for decoding a returned char*; encoding was not discussed in the PR.
2. FindClass(string) in java-interop — standard UTF-8 sent to a MUTF-8 API
JniEnvironment.Types.TryRawFindClass uses Marshal.StringToCoTaskMemUTF8 to encode the class name before passing it to FindClass. This produces standard UTF-8, which differs from MUTF-8 only for non-BMP characters.
The ReadOnlySpan<byte> overload (FindClass(ReadOnlySpan<byte>) using u8 literals) bypasses this entirely and is the preferred path.
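If the string overload ever needed to be made strictly correct, the standard UTF-8 bytes could be re-encoded before the FindClass call. The sketch below shows the idea in C++ under the assumption that the input is already well-formed UTF-8; utf8_to_mutf8 and append_surrogate are hypothetical names, not java-interop APIs.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append one UTF-16 surrogate code unit as a 3-byte sequence. This form is
// valid only in Modified UTF-8; standard UTF-8 forbids encoded surrogates.
static void append_surrogate(std::string &out, std::uint32_t unit) {
    out += static_cast<char>(0xE0 | (unit >> 12));
    out += static_cast<char>(0x80 | ((unit >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (unit & 0x3F));
}

// Re-encode standard UTF-8 as Modified UTF-8.
// Assumption: the input is well-formed UTF-8; malformed input is not detected.
std::string utf8_to_mutf8(const std::string &utf8) {
    std::string out;
    out.reserve(utf8.size());
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char b0 = static_cast<unsigned char>(utf8[i]);
        if (b0 == 0x00) {
            // Embedded NUL becomes the 2-byte form C0 80.
            out += "\xC0\x80";
            i += 1;
        } else if (b0 >= 0xF0 && i + 3 < utf8.size()) {
            // 4-byte sequence: decode the code point, then emit a surrogate pair.
            const std::uint32_t cp =
                ((b0 & 0x07u) << 18) |
                ((static_cast<unsigned char>(utf8[i + 1]) & 0x3Fu) << 12) |
                ((static_cast<unsigned char>(utf8[i + 2]) & 0x3Fu) << 6) |
                (static_cast<unsigned char>(utf8[i + 3]) & 0x3Fu);
            const std::uint32_t v = cp - 0x10000;
            append_surrogate(out, 0xD800 + (v >> 10));
            append_surrogate(out, 0xDC00 + (v & 0x3FF));
            i += 4;
        } else {
            // Bytes of 1-, 2-, and 3-byte sequences are identical in both encodings.
            out += utf8[i];
            i += 1;
        }
    }
    return out;
}
```

ASCII and BMP names pass through unchanged, so the only cost for real-world class names is the scan itself.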
3. ConstantPool.cs — already correct
Xamarin.Android.Tools.Bytecode/ConstantPool.cs in dotnet/java-interop already implements a correct MUTF-8 fixup pass before calling Encoding.UTF8.GetString, handling both 0xC0 0x80 NUL and surrogate-pair supplementary characters. This is the reference implementation if a proper MUTF-8 decoder is ever needed elsewhere.
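For comparison, a direct MUTF-8 decoder is short, because each 3-byte surrogate half decodes to exactly one UTF-16 code unit. The following C++ sketch is an independent illustration and not a transcription of ConstantPool.cs (which, as described above, applies a fixup pass and then calls Encoding.UTF8.GetString); decode_mutf8 is a hypothetical name.

```cpp
#include <cstddef>
#include <string>

// Decode Modified UTF-8 bytes into UTF-16 code units.
// Returns false on a truncated sequence, a bad continuation byte, a raw 0x00,
// or a 4-byte lead byte. Overlong 2- and 3-byte forms other than C0 80 are not
// rejected in this sketch.
bool decode_mutf8(const std::string &in, std::u16string &out) {
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        if (b0 != 0 && b0 < 0x80) {
            // 1 byte: U+0001..U+007F
            out += static_cast<char16_t>(b0);
            i += 1;
        } else if ((b0 & 0xE0) == 0xC0) {
            // 2 bytes: U+0080..U+07FF, plus C0 80 for NUL
            if (i + 1 >= in.size()) return false;
            const unsigned char b1 = static_cast<unsigned char>(in[i + 1]);
            if ((b1 & 0xC0) != 0x80) return false;
            out += static_cast<char16_t>(((b0 & 0x1F) << 6) | (b1 & 0x3F));
            i += 2;
        } else if ((b0 & 0xF0) == 0xE0) {
            // 3 bytes: U+0800..U+FFFF, including each half of a surrogate pair
            if (i + 2 >= in.size()) return false;
            const unsigned char b1 = static_cast<unsigned char>(in[i + 1]);
            const unsigned char b2 = static_cast<unsigned char>(in[i + 2]);
            if ((b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80) return false;
            out += static_cast<char16_t>(((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F));
            i += 3;
        } else {
            return false;
        }
    }
    return true;
}
```

Decoding ED A0 BD ED B8 80 with this function produces the two code units 0xD83D 0xDE00, which is the UTF-16 form of U+1F600.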
Risk summary
| Path | Risk | Notes |
|---|---|---|
| Normal string marshalling (JniEnvironment.Strings) | None | Uses UTF-16 JNI APIs |
| Typemap keys from [Register("...")] attributes | None | Compile-time ASCII C# string literals |
| FindClass(string) via Marshal.StringToCoTaskMemUTF8 | Theoretical | Differs from MUTF-8 only for non-BMP class names |
| TypeManager.GetClassName via PtrToStringAnsi | Theoretical | Latin-1 decode of MUTF-8; fallback/error path only |
| ConstantPool.cs bytecode parser | None | Already implements correct MUTF-8 fixup |
Real-world precedent: Android 12 MUTF-8 enforcement
Android 12 (API 31) added strict MUTF-8 validation to NewStringUTF. Invalid input causes a hard SIGABRT:
JNI DETECTED ERROR IN APPLICATION: input is not valid Modified UTF-8
This was triggered in the wild by facebook/react-native#34363 / facebook/flipper#3175, where an app name with diacritics (Romanian characters) was passed to NewStringUTF after incorrect percent-encoding produced invalid MUTF-8 byte sequences. 53+ GitHub issues across different projects match this error pattern.
This is not directly applicable to dotnet/android (we don't call NewStringUTF with user-provided strings), but illustrates that MUTF-8 issues can be latent for years and surface only when Android tightens enforcement.
Scenarios that could theoretically trigger issues
- Non-ASCII BMP class names (e.g. CJK com/example/MyClass, accented Latin, Cyrillic): work fine today. Verified against a real JVM (OpenJDK 21): MUTF-8 and standard UTF-8 encode every BMP character other than NUL (U+0001–U+FFFF) identically, and Encoding.UTF8 decodes them correctly. This covers nearly all scripts in modern use, including all ~27,000 common CJK ideographs and the Latin, Cyrillic, Greek, and Arabic scripts.
- Non-BMP class names (U+10000+: emoji, rare CJK extensions, historic scripts): would break. Verified against a real JVM: GetStringUTFChars returns 6-byte MUTF-8 surrogate pairs (e.g. ED-A0-BD-ED-B8-80 for 😀), which Encoding.UTF8 decodes as 6 replacement characters (�). Essentially non-existent in real class names.
- ProGuard/R8 with Unicode obfuscation dictionaries: advanced obfuscators like dProtect can rename classes to arbitrary Unicode strings. If a bound AAR uses such obfuscation, it could produce non-ASCII JNI names. BMP obfuscation would work fine; non-BMP would break.
Conclusion
The ASCII-only assumption is deeply embedded and has been validated by years of production use with zero bug reports. Future work touching type name lookup paths (e.g. #10795) should simply maintain this same assumption and document it. No fix is needed at this time.
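One lightweight way to document the assumption in native code is an explicit check that could back a debug assertion or a log message. This is only a sketch of that idea; is_ascii_jni_name is a hypothetical helper, not an existing function in dotnet/android.

```cpp
#include <cstddef>

// Returns true when every byte is in 0x01..0x7F. In that range ASCII, Latin-1,
// standard UTF-8, and Modified UTF-8 all use the same single byte, so any of
// the decoders discussed in this issue gives the same result.
bool is_ascii_jni_name(const char *name, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) {
        const unsigned char c = static_cast<unsigned char>(name[i]);
        if (c == 0 || c >= 0x80) {
            return false;
        }
    }
    return true;
}
```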
Open questions and follow-up
Connection to the trimmable type map (#10795)
The trimmable type map (NativeHashtable) stores JNI class names as UTF-16 characters in a native blob. At runtime, the type map lookup API accepts a string key.
Important: the trimmable type map path goes through java-interop's JniRuntime.JniTypeManager, which resolves class names via GetJniTypeNameFromClass. This calls Class.getName() and decodes the result using GetStringChars (UTF-16, not MUTF-8) into new string(char*, 0, len). So MUTF-8 is not involved in the trimmable type map lookup path at all — the class name arrives as a proper .NET string via the UTF-16 JNI API.
The MUTF-8 / GetStringUTFChars path only exists in the legacy TypeManager.GetClassName native helper (see the encoding inconsistencies section above).
The current flow for the trimmable type map is:
jclass -> Class.getName() via JNI
-> GetStringChars (UTF-16 jchar*)
-> new string(char*, 0, len) // heap allocation
-> .Replace('.', '/')
-> GetTypesForSimpleReference(string)
-> NativeHashtable lookup
The idea in #10795 is that the lookup table could also accept ReadOnlySpan<char> instead of just string. Since the class name is already available as UTF-16 chars from GetStringChars, we could copy them into a stackalloc buffer (with the . -> / replacement) instead of creating a heap-allocated string. For ASCII inputs (which is all real-world cases), this is a trivial and fast operation.
Even more aggressively, since GetStringChars can return a direct pointer to the JVM's internal character data (it may instead return a copy, as reported through its isCopy out-parameter), it may be possible to perform the lookup directly against that pointer as a ReadOnlySpan<char>, though the . to / replacement and JNI critical section constraints would need to be considered.
The TypeManager.GetClassName history provides additional confidence in the ASCII-only assumption: it has used PtrToStringAnsi (Latin-1, equivalent to ASCII widening) since 2019 with zero issues.
The benchmark data posted on #10795 shows a span-based lookup path is ~30% faster with zero heap allocation compared to the string-allocating path.
| Strategy | Key type | Source | Allocation | Notes |
|---|---|---|---|---|
| Current | string | GetStringChars -> new string(char*) | 56-112 B/lookup | Heap-allocated string |
| Span from JNI | ReadOnlySpan<char> | GetStringChars -> stackalloc copy (with .->/ fixup) | 0 B | ~30% faster; requires TryGetValue(ROS<char>) on the hashtable |
Staying on UTF-16 end-to-end
This approach has a significant advantage beyond performance: it sidesteps the MUTF-8 question entirely. Since GetStringChars returns UTF-16 and the native blob stores UTF-16, the entire lookup stays in UTF-16 from start to finish. No encoding conversion, no ASCII assumption needed for correctness, no fallback path for non-ASCII names. It's correct for all Unicode inputs by construction.
The only transformation needed is the . to / replacement (package separator to JNI separator; the $ for nested classes is already present in Class.getName() output and left untouched). This can be done during the stackalloc copy in a single pass — trivially vectorizable (compare against '.', blend with '/').
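The single-pass copy can be sketched as follows. The managed implementation would use stackalloc and ReadOnlySpan<char>; this C++ version uses a caller-provided buffer to show the same loop, and copy_as_jni_name is a hypothetical name. It assumes the caller falls back to a heap-allocating path when the name does not fit.

```cpp
#include <cstddef>

// Copy len UTF-16 code units from src into dst, rewriting '.' to '/'.
// Returns the number of code units written, or 0 if dst is too small.
// Surrogate code units can never compare equal to '.', so non-BMP
// characters pass through untouched.
std::size_t copy_as_jni_name(const char16_t *src, std::size_t len,
                             char16_t *dst, std::size_t capacity) {
    if (len > capacity) {
        return 0;
    }
    for (std::size_t i = 0; i < len; ++i) {
        dst[i] = (src[i] == u'.') ? u'/' : src[i];
    }
    return len;
}
```

Given android.view.View$OnClickListener, the buffer ends up holding android/view/View$OnClickListener, with the $ left as it was.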
This makes the MUTF-8 analysis in this issue nicely self-contained: the MUTF-8 encoding concern is real but only affects the legacy TypeManager.GetClassName native helper path. The trimmable type map can avoid it entirely by staying on UTF-16.
Verified against a real JVM
All of the above has been tested against a desktop OpenJDK 21 JVM using java-interop's JreRuntime. Key results:
- GetStringChars (UTF-16) round-trips all characters correctly: ASCII, CJK, emoji
- GetStringUTFChars (MUTF-8) returns 6-byte surrogate pairs for non-BMP characters (e.g. [ED-A0-BD-ED-B8-80] for U+1F600), which Encoding.UTF8.GetString() decodes as ������
- The zero-allocation lookup (GetStringChars → stackalloc copy with . → / → ReadOnlySpan<char> lookup) works end-to-end with the JVM and produces 0 bytes of managed allocation across 1000 lookups
Test code is in the Utf16LookupTest experiment project.
Related
- [TrimmableTypeMap] Performance considerations #10795 — JNI type name lookup performance (trimmable type maps)
- ConstantPool.cs in dotnet/java-interop: correct MUTF-8 fixup implementation
- JNI spec: Modified UTF-8
- Android JNI Tips: UTF-8 and UTF-16 Strings