charset: update deal with invalid utf8 characters #21422

Open
wants to merge 3 commits into
base: master
14 changes: 14 additions & 0 deletions character-set-and-collation.md
@@ -448,6 +448,20 @@ To disable this error reporting, use `set @@tidb_skip_utf8_check=1;` to skip the
>
> If the character check is skipped, TiDB might fail to detect illegal UTF-8 characters written by the application, cause decoding errors when `ANALYZE` is executed, and introduce other unknown encoding issues. If your application cannot guarantee the validity of the written string, it is not recommended to skip the character check.

In certain SQL statements, comparisons might involve invalid UTF-8 characters. For example:

```sql
SELECT * FROM `t` WHERE `id` > 'a" + string([]byte{0xff}) + "a';
```

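In the preceding statement, the string literal is written in Go notation: it consists of `a`, the single byte `0xff`, and `a`. The byte `0xff` can never appear in valid UTF-8. When handling such invalid bytes, TiDB behaves differently depending on the collation <span class="version-mark">New in v9.0.0</span>: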

* Non-binary collations (such as `utf8mb4_general_ci`): TiDB truncates the string at the invalid byte. The truncated part is excluded from the comparison.

* `gbk_bin` and `gb18030_bin` collations: TiDB replaces invalid bytes with the character `?` and continues with the comparison.

* Other binary collations (such as `utf8_bin`): TiDB treats invalid bytes as ordinary bytes and compares them based on their original binary values.
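
The three behaviors above can be illustrated with a minimal Go sketch. This is not TiDB's implementation: the helper names `truncateAtInvalid` and `replaceInvalid` are hypothetical, and the sketch checks UTF-8 validity only, which is a simplification for the `gbk_bin` and `gb18030_bin` case. It shows only what each strategy does to the comparison operand from the example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// truncateAtInvalid (hypothetical helper) keeps only the prefix that
// precedes the first invalid UTF-8 byte, mimicking the described
// behavior for non-binary collations.
func truncateAtInvalid(s string) string {
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		if r == utf8.RuneError && size == 1 {
			return s[:i] // stop at the first invalid byte
		}
		i += size
	}
	return s
}

// replaceInvalid (hypothetical helper) substitutes '?' for every invalid
// byte, mimicking the described behavior for gbk_bin and gb18030_bin.
func replaceInvalid(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		if r == utf8.RuneError && size == 1 {
			b.WriteByte('?')
		} else {
			b.WriteString(s[i : i+size])
		}
		i += size
	}
	return b.String()
}

func main() {
	// The operand from the SQL example: 'a', the byte 0xff, 'a'.
	s := "a" + string([]byte{0xff}) + "a"

	fmt.Printf("%q\n", truncateAtInvalid(s)) // "a"
	fmt.Printf("%q\n", replaceInvalid(s))    // "a?a"
	// Other binary collations: the bytes are compared unchanged.
	fmt.Printf("% x\n", s) // 61 ff 61
}
```

Under these assumptions, the same literal is compared as `a`, as `a?a`, or as the raw bytes `61 ff 61`, so the rows that a query returns can differ between collations.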

## Collation support framework

<CustomContent platform="tidb">