fix(tts): normalize markdown before speech synthesis #5
Add parser-backed markdown normalization before TTS, sync locale schemas required by translation validation, and remove the stale Rust test workflow mock step so the PR remains green.
45ef2b6 to debbf51
Code Review
This pull request introduces a text normalization module for the TTS system, utilizing the pulldown-cmark library to convert Markdown content into speech-friendly text. Key changes include the integration of this normalization step into the TTSManager and the addition of logic to handle various Markdown elements like headings, lists, and links while omitting complex code blocks. Feedback focuses on improving the robustness of HTML entity decoding and simplifying whitespace normalization logic.
fn normalize_inline_whitespace(text: &str) -> String {
    let mut normalized = String::new();
    let mut last_was_space = false;

    for ch in text.chars() {
        if ch.is_whitespace() {
            if !last_was_space {
                normalized.push(' ');
                last_was_space = true;
            }
        } else {
            normalized.push(ch);
            last_was_space = false;
        }
    }

    normalized.trim().to_string()
}
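The review summary mentions simplifying the whitespace normalization logic. One possible simplification, sketched here on the assumption that the behavior should stay the same, relies on `str::split_whitespace`, which splits on the same Unicode whitespace that `char::is_whitespace` matches and never yields empty leading or trailing pieces. The function name `normalize_inline_whitespace_simplified` is hypothetical and does not appear in the PR.

```rust
/// Collapse every run of whitespace into a single space and trim both ends.
/// Hypothetical alternative to the loop-based version shown in the diff.
fn normalize_inline_whitespace_simplified(text: &str) -> String {
    // split_whitespace yields only the non-whitespace pieces, so joining
    // them with one space both collapses runs and trims the ends.
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

This trades the manual `last_was_space` flag for an intermediate `Vec`, which is likely acceptable for the short text selections a TTS request carries.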
out.push_str(match entity.as_str() {
    "amp" => "&",
    "lt" => "<",
    "gt" => ">",
    "quot" => "\"",
    "apos" | "#39" => "'",
    "nbsp" => " ",
    _ => "",
});
The current HTML entity decoding is limited, only handling a few named entities and one specific numeric entity (#39). This can be made more robust by handling all numeric entities (both decimal and hexadecimal), which are common in HTML. This would improve the accuracy of the text normalization for a wider range of inputs.
let decoded = if let Some(code) = entity.strip_prefix('#') {
    let (radix, code_str) = if let Some(hex_code) = code.strip_prefix('x') {
        (16, hex_code)
    } else {
        (10, code)
    };
    u32::from_str_radix(code_str, radix)
        .ok()
        .and_then(std::char::from_u32)
        .map(|c| c.to_string())
        .unwrap_or_default()
} else {
    match entity.as_str() {
        "amp" => "&".to_string(),
        "lt" => "<".to_string(),
        "gt" => ">".to_string(),
        "quot" => "\"".to_string(),
        "apos" => "'".to_string(),
        "nbsp" => " ".to_string(),
        _ => String::new(),
    }
};
out.push_str(&decoded);
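To try the suggestion in isolation, here is a minimal sketch that wraps it in a standalone function. The name `decode_entity` and its `&str` parameter are assumptions; in the PR the logic sits inline and writes into `out`. The sketch also accepts an uppercase `X` hex prefix, which the suggestion as written does not.

```rust
/// Decode the text found between `&` and `;` of an HTML entity.
/// Unknown or invalid entities decode to an empty string, as in the suggestion.
/// `decode_entity` is a hypothetical helper name, not part of the PR.
fn decode_entity(entity: &str) -> String {
    if let Some(code) = entity.strip_prefix('#') {
        // Numeric reference: `#x..` or `#X..` is hexadecimal, otherwise decimal.
        let (radix, digits) = match code.strip_prefix(|c: char| c == 'x' || c == 'X') {
            Some(hex) => (16, hex),
            None => (10, code),
        };
        u32::from_str_radix(digits, radix)
            .ok()
            // Rejects values that are not valid Unicode scalar values,
            // such as surrogates or anything above U+10FFFF.
            .and_then(char::from_u32)
            .map(|c| c.to_string())
            .unwrap_or_default()
    } else {
        // Named entities: the same small set the original match handled.
        match entity {
            "amp" => "&",
            "lt" => "<",
            "gt" => ">",
            "quot" => "\"",
            "apos" => "'",
            "nbsp" => " ",
            _ => "",
        }
        .to_string()
    }
}
```

With this shape, `#39` and `#x27` both decode to an apostrophe through the numeric path, so the explicit `"#39"` arm from the original match is no longer needed.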
Summary
Problem
Markdown selections were being sent directly to TTS, so users heard source syntax like hash hash, asterisk asterisk, and raw link/code markers instead of natural prose.
Implementation
pulldown-cmark for deterministic offline parsing
Code example omitted.
Validation
cargo test text_normalization -- --nocapture
cargo check