Fix identifier (un)escaping#47
Conversation
| * | ||
| * @return string The token value. | ||
| */ | ||
| public function get_value(): string { |
There was a problem hiding this comment.
While I like that using get_value() is lazy and can generally nicely work for any token type where we need to interpret or normalize any values, I'm wondering how to solve the NO_BACKSLASH_ESCAPES SQL mode.
It's a very simple IF, but in the token instance, we just know nothing about SQL modes 🤔 The tokenizer knows it, so it could pass in a flag, or use a different token instance, but that makes it a bit less elegant.
There was a problem hiding this comment.
Could it be a constructor argument? The mode is already determined when the token is created. If that was a boolean flag baked into the Token instance, we could still keep the get_value() method argument-less.
| * > of pattern-matching contexts, they evaluate to the strings \% and | ||
| * > \_, not to % and _. | ||
| */ | ||
| '\%' => '\\\\%', |
There was a problem hiding this comment.
What do you think about using '\\%' instead of \% just to make sure PHP won't surprise us by treating the backslash as an escape sequence? Or, alternatively, $backslash . '%' or '\x5C%'?
There was a problem hiding this comment.
I thought maybe we're good since PHP docs says this:
To specify a literal single quote, escape it with a backslash (
\). To specify a literal backslash, double it (\\). All other instances of backslash will be treated as a literal backslash
But this gives me trust issues:
// This backslash is not treated as a literal backslash:
php > var_dump("\x5C");
string(1) "\"
// Neither is this:
php > var_dump("\n");
string(1) "
"
There was a problem hiding this comment.
Ah no, that was just me using the wrong quotes. We are good indeed:
php > var_dump('\x5C');
string(4) "\x5C"
php > var_dump('\n');
string(2) "\n"
There was a problem hiding this comment.
We're good here, but it's a good point to be a bit more explicit. I don't like the maps like '\n' => "\n", where at a quick sight, the left and right sides look the same.
I followed the $baskslash idea in 931c82b, and I think it looks better now.
| /* | ||
| * Apply the replacements. | ||
| * | ||
| * It is important to use "strtr()" and not "str_replace()", because |
There was a problem hiding this comment.
Such a brilliant find ❤️
| * A backslash with any other character represents the character itself. | ||
| * That is, \x evaluates to x, \\ evaluates to \, and \🙂 evaluates to 🙂. | ||
| */ | ||
| $value = preg_replace( '/\\\\(.)/u', '$1', $value ); |
There was a problem hiding this comment.
Now I feel nostalgic. The first time I've read how a quadruple backslash evaluates to a single backslash in preg_* functions was about 20 years ago in a PHP4 book. I'm old. 👴 Can we either document this or express this in a different way? Perhaps $escaped_backslash = preg_quote("\x5C"); and preg_replace( '/'. $escaped_backslash .'(.)/u', '$1', $value );? Maybe that's an overkill. Feel free to make any call here, I just want to make sure this gets brought up.
There was a problem hiding this comment.
Good point! I did that together with the other improvements in 931c82b.
adamziel
left a comment
There was a problem hiding this comment.
Great work, thank you Jan! I'm approving provisionally – I'd still like to see a rigorous test case that directly targets the quote_mysql_utf8_string_literal method. I know it's private. Perhaps it's generic and useful enough to be exposed publicly? And if not, there are ways to test private methods, too (although protected ones are easier).
|
I left a nitpick about a comment, but the substance of the PR looks great. Thank you for additional tests! |
The translate_string_literal logic in the SQLite driver that handles "unescaping" (interpreting) of MySQL escape sequences is only applied to a
textStringLiteralAST node. It turns out, the same logic is needed intranslate_pure_identifier, or, in other words, also forWP_MySQL_Lexer::DOUBLE_QUOTED_TEXTandWP_MySQL_Lexer::BACK_TICK_QUOTED_IDtokens.Therefore, it's probably better to move this logic to the tokenization phase and make token instances return the correct unquoted token values.
Surfaced in #42.