-
-
Notifications
You must be signed in to change notification settings - Fork 34
Replace vscode-oniguruma with oniguruma-to-es #55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This replaces the dependency on the oniguruma wasm build. The code is a modified version of https://github.com/shikijs/shiki/blob/main/packages/engine-javascript. The upstream has an option `forgiving`, but we don’t. We can add it or remove it. We may also be able to increase the scope in which the cache is used. It can be global or per starry-night instance.
| import {toRegExp} from 'oniguruma-to-es' | ||
|
|
||
| /** @type {boolean} */ | ||
| const forgiving = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an option in shiki. We can make it an option, or make it true or false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the reason 100% coverage isn’t met and CI fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the tradeoffs?
I’d imagine the wasm currently is not forgiving, so changing that is also a break. Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ohh, this is more like “silent”/“swallow”? Then yes, I do think it is the default currently to swallow regex errors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no no jsdoc type needed too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is about swalling regex syntax errors in the grammars. I just tested the bahaviour on the main branch. The original behaviour is the same as having forgiving = false here.
TBH I believe grammar regexes should just be syntactically valid. IMO we should just throw instead of being invalid.
The jsdoc type is needed to make this a boolean instead of true. This prevents TypeScript from detecting and removing dead code. I will remove the variable altogether. I just need it while we decide on being forgiving or not.
| @@ -1,4 +1,4 @@ | |||
| export type {GetOnigurumaUrl, Grammar, Options} from './lib/types.js' | |||
| export type {Grammar} from './lib/types.js' | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the removal of these options should be considered a breaking change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, breaking. I think this is a major regardless, as this is supposed to work 99.99% correctly, but thus not the exact same
| @@ -1,4 +1,4 @@ | |||
| export type {GetOnigurumaUrl, Grammar, Options} from './lib/types.js' | |||
| export type {Grammar} from './lib/types.js' | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, breaking. I think this is a major regardless, as this is supposed to work 99.99% correctly, but thus not the exact same
| return grammar | ||
| }, | ||
| onigLib: createOniguruma(options) | ||
| onigLib: import('./onig-lib.js') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why dynamic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
onigLib must be a promise that resolves to this module. It’s a bit silly IMO. We can also import it statically and pass Promise.resolve(onigLib). This was just easy to write while the API isn’t final yet.
If we decide to take options, this will change anyway.
| /** @type {boolean} */ | ||
| const forgiving = true | ||
|
|
||
| const MAX = 2 ** 32 - 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i prefer lowercase, and a comment describing this magic value is a good idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. The number is taken from https://github.com/shikijs/shiki/blob/v3.15.0/packages/engine-javascript/src/scanner.ts#L7. I figured out it looked suspiciously like a power of 2 minus 1.
I don’t know why it has this value. Perhaps @antfu knows.
| import {toRegExp} from 'oniguruma-to-es' | ||
|
|
||
| /** @type {boolean} */ | ||
| const forgiving = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the tradeoffs?
I’d imagine the wasm currently is not forgiving, so changing that is also a break. Why?
| return pattern | ||
| } | ||
|
|
||
| // Cache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
useless comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I preserved comments from where I got the code. I do agree this comment is useless. I can remove it.
| /** @type {Array<[index: number, match: RegExpExecArray]>} */ | ||
| const pending = [] | ||
|
|
||
| for (const [index, regexp] of regexps.entries()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i prefer calling these expression(s)
| if (match.index === minIndex) { | ||
| return toResult(index, match) | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these 2 loops can be made into one: iterate over each and find the object whose index is the lowest. When done, return that found object?
| } | ||
|
|
||
| /** | ||
| * Do nothing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use .s to end such whole sentences in comments (generally, if the first word is also capitalized, and it could be multiple sentences, I use dots)?
| end: indice[1], | ||
| length: indice[1] - indice[0] | ||
| } | ||
| }) || [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- can you sort the fields of these 3 objects
- seems weird to use this max value, is it the last index into some source string? Could that value be passed in as a parameter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Sort how?
- Re max value: see review comment on its definition.
- Re passing in as a parameter: You mean to always call as
toResult(…, …, max)? I don’t really see the point TBH. - The original code also has an
offset, but it’s always 0.
| * @param {RegExpExecArray} match | ||
| * The actual match | ||
| * @returns {IOnigMatch} | ||
| * The oniguruma match. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems useless to me to prefix every sentence with The; I’d do: Index in source value; Regex match; Oniguruma match or so?
| import {toRegExp} from 'oniguruma-to-es' | ||
|
|
||
| /** @type {boolean} */ | ||
| const forgiving = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is about swalling regex syntax errors in the grammars. I just tested the bahaviour on the main branch. The original behaviour is the same as having forgiving = false here.
TBH I believe grammar regexes should just be syntactically valid. IMO we should just throw instead of being invalid.
The jsdoc type is needed to make this a boolean instead of true. This prevents TypeScript from detecting and removing dead code. I will remove the variable altogether. I just need it while we decide on being forgiving or not.
| /** @type {boolean} */ | ||
| const forgiving = true | ||
|
|
||
| const MAX = 2 ** 32 - 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. The number is taken from https://github.com/shikijs/shiki/blob/v3.15.0/packages/engine-javascript/src/scanner.ts#L7. I figured out it looked suspiciously like a power of 2 minus 1.
I don’t know why it has this value. Perhaps @antfu knows.
| return pattern | ||
| } | ||
|
|
||
| // Cache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I preserved comments from where I got the code. I do agree this comment is useless. I can remove it.
| // change in meaning since TM grammars search line by line | ||
| singleline: true | ||
| } | ||
| }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no strong opinions on this. We can make them configurable via options.
| }) | ||
| cache.set(pattern, regex) | ||
| return regex | ||
| } catch (error_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Me too, but XO insisted I call this error_.
If we decide to not be forgiving, there‘s no point in caching errors IMO. In that case I will remove the entire try/catch.
| end: indice[1], | ||
| length: indice[1] - indice[0] | ||
| } | ||
| }) || [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Sort how?
- Re max value: see review comment on its definition.
- Re passing in as a parameter: You mean to always call as
toResult(…, …, max)? I don’t really see the point TBH. - The original code also has an
offset, but it’s always 0.
| } | ||
|
|
||
| if (forgiving) { | ||
| return null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also prefer undefined, but the API expects null.
| return grammar | ||
| }, | ||
| onigLib: createOniguruma(options) | ||
| onigLib: import('./onig-lib.js') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
onigLib must be a promise that resolves to this module. It’s a bit silly IMO. We can also import it statically and pass Promise.resolve(onigLib). This was just easy to write while the API isn’t final yet.
If we decide to take options, this will change anyway.
| */ | ||
| export function createOnigScanner(sources) { | ||
| /** @type {Map<string, RegExp | Error>} */ | ||
| const cache = new Map() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At what level do we want to have this cache?
This replaces the dependency on the oniguruma wasm build. The code is a modified version of
https://github.com/shikijs/shiki/blob/main/packages/engine-javascript. The upstream has an option
forgiving, but we don’t. We can add it or remove it. We may also be able to increase the scope in which the cache is used. It can be global or per starry-night instance.The reason I tried this, was to decrease memory usage in a Next.js build. It didn’t work. We can still discuss whether or not this is an improvement.
Closes #50