Add support for non-native-endian UTF-16 and UTF-32 #763

@Rot127

Description

It would be great if the endianness of the input buffer could be selected for each match() call.
For our use case we can have strings which come in little and big endian encoding and we must support both.
Normalizing everything to UTF-8 before matching naturally costs a lot of runtime.

Having this built into PCRE2 would be a blessing.

I am aware that the docs say:

UTF-16 and UTF-32 strings can indicate their endianness by special code knows as a byte-order mark (BOM).
The PCRE2 functions do not handle this, expecting strings to be in host byte order.

But would it be a possible extension? Or is it simply unrealistic because it would be too complicated to implement?
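
For reference, the workaround the documentation implies (bringing the buffer into host byte order before matching) could look roughly like the sketch below. This is only an illustration, not PCRE2 code: the names `utf16_order`, `swap16`, `utf16_detect_order`, `utf16_swap_in_place`, and `utf16_to_host_order` are all hypothetical helpers, and the sketch assumes the buffer is writable and that a BOM, if present, is the first code unit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: byte order of a UTF-16 buffer relative to the host. */
typedef enum {
    UTF16_ORDER_HOST,     /* BOM reads as U+FEFF: already host order */
    UTF16_ORDER_SWAPPED,  /* BOM reads as 0xFFFE: opposite of host order */
    UTF16_ORDER_UNKNOWN   /* no BOM at the start of the buffer */
} utf16_order;

/* Swap the two bytes of a single 16-bit code unit. */
uint16_t swap16(uint16_t u)
{
    return (uint16_t)((u << 8) | (u >> 8));
}

/* Inspect the first code unit as the host reads it. A BOM (U+FEFF) stored in
 * the opposite byte order shows up as 0xFFFE. */
utf16_order utf16_detect_order(const uint16_t *buf, size_t len)
{
    if (len == 0)
        return UTF16_ORDER_UNKNOWN;
    assert(buf != NULL);
    if (buf[0] == 0xFEFF)
        return UTF16_ORDER_HOST;
    if (buf[0] == 0xFFFE)
        return UTF16_ORDER_SWAPPED;
    return UTF16_ORDER_UNKNOWN;
}

/* Byte-swap every code unit of the buffer in place. */
void utf16_swap_in_place(uint16_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = swap16(buf[i]);
}

/* Bring the buffer into host order if its BOM says it is swapped.
 * Returns the number of leading code units (0 or 1) occupied by the BOM,
 * so the caller can skip it before matching. Buffers without a BOM are
 * left untouched; their byte order has to be known from somewhere else. */
size_t utf16_to_host_order(uint16_t *buf, size_t len)
{
    utf16_order order = utf16_detect_order(buf, len);
    if (order == UTF16_ORDER_SWAPPED)
        utf16_swap_in_place(buf, len);
    return order == UTF16_ORDER_UNKNOWN ? 0 : 1;
}
```

After this step, `buf + skip` with length `len - skip` (where `skip` is the return value) would be what gets handed to the 16-bit matching function. It still touches every code unit once per subject, which is exactly the overhead a built-in option could avoid.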

Labels: enhancement
