Skip to content

Conversation

@remcohaszing
Copy link

This replaces the dependency on the oniguruma wasm build. The code is a modified version of
https://github.com/shikijs/shiki/blob/main/packages/engine-javascript. The upstream has an option forgiving, but we don’t. We can add it or remove it. We may also be able to increase the scope in which the cache is used. It can be global or per starry-night instance.

The reason I tried this, was to decrease memory usage in a Next.js build. It didn’t work. We can still discuss whether or not this is an improvement.

Closes #50

This replaces the dependency on the oniguruma wasm build.
The code is a modified version of
https://github.com/shikijs/shiki/blob/main/packages/engine-javascript.
The upstream has an option `forgiving`, but we don’t. We can add it or
remove it. We may also be able to increase the scope in which the cache
is used. It can be global or per starry-night instance.
import {toRegExp} from 'oniguruma-to-es'

/** @type {boolean} */
const forgiving = true
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an option in shiki. We can make it an option, or make it true or false.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the reason 100% coverage isn’t met and CI fails.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the tradeoffs?
I’d imagine the wasm currently is not forgiving, so changing that is also a break. Why?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ohh, this is more like “silent”/“swallow”? Then yes, I do think it is the default currently to swallow regex errors?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no no jsdoc type needed too

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is about swalling regex syntax errors in the grammars. I just tested the bahaviour on the main branch. The original behaviour is the same as having forgiving = false here.

TBH I believe grammar regexes should just be syntactically valid. IMO we should just throw instead of being invalid.

The jsdoc type is needed to make this a boolean instead of true. This prevents TypeScript from detecting and removing dead code. I will remove the variable altogether. I just need it while we decide on being forgiving or not.

@@ -1,4 +1,4 @@
export type {GetOnigurumaUrl, Grammar, Options} from './lib/types.js'
export type {Grammar} from './lib/types.js'
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the removal of these options should be considered a breaking change.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, breaking. I think this is a major regardless, as this is supposed to work 99.99% correctly, but thus not the exact same

@@ -1,4 +1,4 @@
export type {GetOnigurumaUrl, Grammar, Options} from './lib/types.js'
export type {Grammar} from './lib/types.js'
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, breaking. I think this is a major regardless, as this is supposed to work 99.99% correctly, but thus not the exact same

return grammar
},
onigLib: createOniguruma(options)
onigLib: import('./onig-lib.js')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why dynamic?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

onigLib must be a promise that resolves to this module. It’s a bit silly IMO. We can also import it statically and pass Promise.resolve(onigLib). This was just easy to write while the API isn’t final yet.

If we decide to take options, this will change anyway.

/** @type {boolean} */
const forgiving = true

const MAX = 2 ** 32 - 1
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i prefer lowercase, and a comment describing this magic value is a good idea

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. The number is taken from https://github.com/shikijs/shiki/blob/v3.15.0/packages/engine-javascript/src/scanner.ts#L7. I figured out it looked suspiciously like a power of 2 minus 1.

I don’t know why it has this value. Perhaps @antfu knows.

import {toRegExp} from 'oniguruma-to-es'

/** @type {boolean} */
const forgiving = true
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the tradeoffs?
I’d imagine the wasm currently is not forgiving, so changing that is also a break. Why?

return pattern
}

// Cache
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

useless comment?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I preserved comments from where I got the code. I do agree this comment is useless. I can remove it.

/** @type {Array<[index: number, match: RegExpExecArray]>} */
const pending = []

for (const [index, regexp] of regexps.entries()) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i prefer calling these expression(s)

if (match.index === minIndex) {
return toResult(index, match)
}
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these 2 loops can be made into one: iterate over each and find the object whose index is the lowest. When done, return that found object?

}

/**
* Do nothing
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use .s to end such whole sentences in comments (generally, if the first word is also capitalized, and it could be multiple sentences, I use dots)?

end: indice[1],
length: indice[1] - indice[0]
}
}) || []
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • can you sort the fields of these 3 objects
  • seems weird to use this max value, is it the last index into some source string? Could that value be passed in as a parameter?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Sort how?
  • Re max value: see review comment on its definition.
  • Re passing in as a parameter: You mean to always call as toResult(…, …, max)? I don’t really see the point TBH.
  • The original code also has an offset, but it’s always 0.

* @param {RegExpExecArray} match
* The actual match
* @returns {IOnigMatch}
* The oniguruma match.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems useless to me to prefix every sentence with The; I’d do: Index in source value; Regex match; Oniguruma match or so?

import {toRegExp} from 'oniguruma-to-es'

/** @type {boolean} */
const forgiving = true
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is about swalling regex syntax errors in the grammars. I just tested the bahaviour on the main branch. The original behaviour is the same as having forgiving = false here.

TBH I believe grammar regexes should just be syntactically valid. IMO we should just throw instead of being invalid.

The jsdoc type is needed to make this a boolean instead of true. This prevents TypeScript from detecting and removing dead code. I will remove the variable altogether. I just need it while we decide on being forgiving or not.

/** @type {boolean} */
const forgiving = true

const MAX = 2 ** 32 - 1
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. The number is taken from https://github.com/shikijs/shiki/blob/v3.15.0/packages/engine-javascript/src/scanner.ts#L7. I figured out it looked suspiciously like a power of 2 minus 1.

I don’t know why it has this value. Perhaps @antfu knows.

return pattern
}

// Cache
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I preserved comments from where I got the code. I do agree this comment is useless. I can remove it.

// change in meaning since TM grammars search line by line
singleline: true
}
})
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no strong opinions on this. We can make them configurable via options.

})
cache.set(pattern, regex)
return regex
} catch (error_) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Me too, but XO insisted I call this error_.

If we decide to not be forgiving, there‘s no point in caching errors IMO. In that case I will remove the entire try/catch.

end: indice[1],
length: indice[1] - indice[0]
}
}) || []
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Sort how?
  • Re max value: see review comment on its definition.
  • Re passing in as a parameter: You mean to always call as toResult(…, …, max)? I don’t really see the point TBH.
  • The original code also has an offset, but it’s always 0.

}

if (forgiving) {
return null
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also prefer undefined, but the API expects null.

return grammar
},
onigLib: createOniguruma(options)
onigLib: import('./onig-lib.js')
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

onigLib must be a promise that resolves to this module. It’s a bit silly IMO. We can also import it statically and pass Promise.resolve(onigLib). This was just easy to write while the API isn’t final yet.

If we decide to take options, this will change anyway.

*/
export function createOnigScanner(sources) {
/** @type {Map<string, RegExp | Error>} */
const cache = new Map()
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At what level do we want to have this cache?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Use oniguruma-to-es for WASM-free mode

2 participants