Buffer added to avoid split characters #122

Ovler-Young · 2023-06-12T18:31:49Z

inspired by ggml-org/whisper.cpp#399 (comment)

The issue stems from the possibility that the token text may not be a valid UTF-8 string. With OpenAI's tiktoken tokenizer, a single Chinese character in UTF-8 encoding can be split across multiple tokens, which leads to the problem: in that case printf("%s", text) outputs a scrambled or unintelligible string.
To resolve the issue I use the ICU library to check whether the token text is a valid UTF-8 string. If it is, it is printed as usual; if not, the token text is appended to a temporary char buffer instead. This buffer is not printed until its bytes form a valid UTF-8 string.

In this repo, the problem is very similar. Instead of using the ICU library, which might only be available on Linux, I found a way to do the check in pure C++, so there is no need to modify the makefile.
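
For readers who want to see the idea in code, here is a minimal sketch of the buffering approach in standard C++. It is not the code from this PR: the names utf8_expected_len, utf8_complete_prefix and Utf8Buffer are hypothetical, and the check only detects a multi-byte character that is cut off at the end of the buffer. It does not do full UTF-8 validation (overlong encodings, surrogates and so on).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes a UTF-8 sequence should have,
// judged by its lead byte. Returns 0 for a continuation or invalid byte.
static std::size_t utf8_expected_len(unsigned char lead) {
    if (lead < 0x80)           return 1;  // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx (most CJK characters)
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Hypothetical helper: length of the longest prefix of `s` that does not
// end in the middle of a multi-byte character.
static std::size_t utf8_complete_prefix(const std::string & s) {
    std::size_t i = s.size();
    std::size_t back = 0;
    // Step back over at most 3 trailing continuation bytes (10xxxxxx).
    while (i > 0 && back < 3 &&
           (static_cast<unsigned char>(s[i - 1]) & 0xC0) == 0x80) {
        --i;
        ++back;
    }
    if (i == 0) return s.size();  // no lead byte found: nothing to wait for
    const std::size_t need = utf8_expected_len(static_cast<unsigned char>(s[i - 1]));
    const std::size_t have = back + 1;
    if (need > have) return i - 1;  // last character is incomplete: hold it back
    return s.size();                // complete (or invalid): flush everything
}

// Hypothetical wrapper: accumulates token bytes and releases only the part
// that ends on a character boundary.
struct Utf8Buffer {
    std::string pending;

    // Append raw token bytes; return the text that is safe to print now.
    std::string push(const std::string & token) {
        pending += token;
        const std::size_t n = utf8_complete_prefix(pending);
        std::string out = pending.substr(0, n);
        pending.erase(0, n);
        return out;
    }
};
```

In a generation loop, each token's text would go through push() and only the returned string would be passed to printf("%s", ...). Invalid bytes are flushed rather than held, so a malformed token cannot stall the output indefinitely.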

Some related issues: #109 #37

buffer added to avoid splitted chatacter

5570289
