Add support for some F# syntax features #166
OK, after the next commit all open issues should also be fixed. A few known "issues" still remain after this, but I don't consider them important:
Test in
I took a random Fantomas-linted project that compiles:
d1ac90c to 5d5adc0
- open type declarations
- FSI directives (#time, #I, #help, #quit)
- XML doc comments (/// as distinct xml_doc node)
- Type test pattern in atomic patterns (:? Type)
- Quotation splicing (%%) prefix operator
- fixed expressions
- Range expressions in computation expressions
- Preprocessor boolean conditions (&&, ||, !, parens, true/false)
- Extern/P/Invoke declarations
- SRTP trait call expressions (^a : (static member ...) )
- Triple-quoted string interpolation ($""" {expr} """)
- Operator precedence for && and || (split into 3 levels)
- module rec (recursive modules)
- and! in computation expressions
- struct tuple type annotations (struct (int * int))
- Add optional type_argument_constraints to _function_or_value_defn_body for
SRTP 'when' constraints on return type annotations
- Add optional 'struct' to anon_record_type for struct anonymous record types
- Add optional 'then' clause to additional_constr_defn for secondary
constructor initialization expressions
- Expand fsharp_signature parser: named_module, namespace (global/rec),
module_defn, type_definition, exception_definition, import_decl,
module_abbrev, compiler_directive_decl, preproc_if support
- Fix indentation bug in type extension with ($) identifier test
- Add quotation expression support (<@ @>, <@@ @@>) with external tokens
- Add multi-dollar triple-quoted string interpolation ($$"""...""", $$$"""...""")
- Add module ... = begin...end with begin as external token
- Add exception named fields (of field1: type * field2: type)
- Add multiline type provider support via _multiline_generic_type
- Add signature parser named parameters (curried_spec)
- Fix scanner serialize/deserialize bugs (clamped count, bounds check, off-by-one)
- Update highlights, injections, indents queries
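The serialize/deserialize fixes listed above (clamped count, bounds check) can be illustrated with a minimal sketch. This is not the code from common/scanner.h: `IndentStack`, `stack_serialize`, `stack_deserialize` and `MAX_INDENTS` are hypothetical names, and the explicit `capacity` parameter is an assumption standing in for tree-sitter's fixed serialization buffer size.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_INDENTS 255 /* hypothetical limit, chosen so the count fits in one byte */

typedef struct {
  uint16_t indents[MAX_INDENTS];
  uint8_t count;
} IndentStack;

/* Write the stack into `buffer` without exceeding `capacity` bytes.
 * Layout: one count byte followed by `count` 16-bit indent columns. */
static unsigned stack_serialize(const IndentStack *s, char *buffer, unsigned capacity) {
  assert(s && buffer);
  if (capacity < 1) return 0;
  unsigned max_entries = (capacity - 1) / (unsigned)sizeof(uint16_t);
  unsigned count = s->count < max_entries ? s->count : max_entries; /* clamp */
  buffer[0] = (char)count;
  memcpy(buffer + 1, s->indents, count * sizeof(uint16_t));
  return 1 + count * (unsigned)sizeof(uint16_t);
}

/* Rebuild the stack from `length` bytes, trusting neither the stored count
 * nor the length on its own. */
static void stack_deserialize(IndentStack *s, const char *buffer, unsigned length) {
  assert(s && buffer);
  s->count = 0;
  if (length < 1) return; /* empty buffer: start from a fresh state */
  unsigned count = (uint8_t)buffer[0];
  unsigned available = (length - 1) / (unsigned)sizeof(uint16_t);
  if (count > available) count = available;     /* bounds check against length */
  if (count > MAX_INDENTS) count = MAX_INDENTS; /* bounds check against storage */
  memcpy(s->indents, buffer + 1, count * sizeof(uint16_t));
  s->count = (uint8_t)count;
}
```

The point of the sketch is that both directions clamp: the writer never emits more entries than fit, and the reader never copies more entries than the buffer actually holds.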
Scanner changes:
- Add FORMAT_TRIPLE_QUOTE_CONTENT external token that stops at
unescaped { for interpolation support
Grammar changes:
- fsharp/grammar.js: new rules (trait_call_expression, extern_binding,
extern_param, and_bang, struct_type, _preproc_expression, xml_doc),
extended existing rules (import_decl, prefixed_expression, module_defn,
named_module, infix_expression, format_triple_quoted_string)
- fsharp_signature/grammar.js: added conflict for operator precedence
- common/scanner.h: FORMAT_TRIPLE_QUOTE_CONTENT with {{ escape handling
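As a rough illustration of the FORMAT_TRIPLE_QUOTE_CONTENT behaviour described above (stop at an unescaped `{`, treat `{{` as an escape), here is a minimal sketch over a plain C string. `format_content_length` is a hypothetical helper, not the function in common/scanner.h, and it assumes the single-`$` form of the string; the real scanner works on the tree-sitter lexer rather than a string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns how many bytes of `s` belong to the literal
 * content of a $""" ... """ string, stopping at the first unescaped '{'
 * (start of an interpolation) or at the closing triple quote. */
static size_t format_content_length(const char *s) {
  assert(s);
  size_t i = 0;
  while (s[i] != '\0') {
    if (s[i] == '{') {
      if (s[i + 1] == '{') { /* "{{" is an escaped brace: consume both */
        i += 2;
        continue;
      }
      break; /* unescaped '{': the interpolation starts here */
    }
    if (strncmp(s + i, "\"\"\"", 3) == 0) break; /* closing triple quote */
    i++;
  }
  return i;
}
```

For example, on the input `abc {x}` the helper consumes `abc ` and stops at the brace, leaving the interpolated expression to the grammar.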
…atterns (ionide#134, ionide#149)
- Fix infinite loop during error recovery by returning false from scanner when ERROR_SENTINEL is set, preventing zero-length DEDENT loop
- Fix multiline record patterns by adding indent/dedent alternative in record_pattern grammar rule so scanner-emitted INDENT tokens between fields on different lines are handled correctly
- Add test case for multiline record patterns in match expressions
…essions
The application_expression highlight query previously used a wildcard (_) @function.call that captured the entire first child node. For generic constructor calls like ResizeArray<string>(), this meant the typed_expression spanning 'ResizeArray<string>' was tagged as function.call, causing the '<' at column 19 to incorrectly receive the function.call highlight instead of a bracket highlight.
Changes:
- Replace the single broad application_expression query with four specific patterns that target only the identifier within long_identifier_or_op, dot_expression, and their typed_expression variants
- Add typed_expression '>' @punctuation.bracket to highlight the closing angle bracket consistently with generic_type (the opening '<' uses the _tyapp_open external token which is anonymous and unmatchable in queries)
- Update test expectations: remove assertions for '<' (unmatchable) and change '>' from operator to punctuation.bracket
5d5adc0 to 6af9a0f
Nsidorenco left a comment:
Really nice that you're picking this up! I fixed the workflow so the CI now tests the parser against the FSharp.Core test suite again. That should give a pretty good indication of the state of the parser.
// During error recovery, all valid_symbols are true and tree-sitter
// restores scanner state before each attempt. Emitting zero-length
// tokens (DEDENT/PREPROC_END) here causes infinite loops: the parser
// can't use the token, recovers, restores state (undoing the pop),
// and the scanner emits the same token again forever.
// Return false to let tree-sitter's built-in error recovery skip
// the problematic character and move on.
return false;
There was a problem hiding this comment.
If you do not return DEDENT/PREPROC_END tokens during error recovery, you get a much worse parse tree while typing, since with those tokens the parser will in many cases be able to identify a partial parse tree.
Effectively, if you use tree-sitter for syntax highlighting and write something like
match x with
it will fail to highlight anything, since it lacks the DEDENT token needed to identify this as a partially correct match statement.
There was a problem hiding this comment.
We need to change this before we can merge.
Just because we're in the error recovery case does not mean we can give up in the external scanner.
If we can identify that an INDENT or similar token is valid, we should emit it. Likewise, if a DEDENT token is valid or we have reached EOF, we should emit that. The tree-sitter error recovery mechanism cannot emit external scanner tokens, so we need to emit those if they can help the error recovery.
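A minimal sketch of the approach requested here, under stated assumptions: the token enum, the `Scanner` struct and both function names are hypothetical rather than taken from common/scanner.h, and `ERROR_SENTINEL` is assumed to be an external token that no grammar rule uses, so it is only marked valid during error recovery. The sketch shows the indentation check still being applied when every symbol is marked valid; it does not by itself resolve the zero-length token loop described in the code comment above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical token list; ERROR_SENTINEL is assumed to be unused by the grammar. */
enum TokenType { INDENT, DEDENT, NEWLINE, ERROR_SENTINEL, TOKEN_COUNT };

typedef struct {
  uint16_t indents[32]; /* stack of open indentation columns */
  uint8_t depth;        /* number of entries currently on the stack */
} Scanner;

/* True when tree-sitter has marked every external token as valid. */
static bool in_error_recovery(const bool *valid_symbols) {
  return valid_symbols[ERROR_SENTINEL];
}

/* Emit a DEDENT only when the indent stack justifies it, regardless of
 * whether we are in error recovery. An empty stack never yields a DEDENT. */
static bool try_emit_dedent(Scanner *s, const bool *valid_symbols,
                            uint16_t column, bool at_eof) {
  assert(s && valid_symbols);
  if (!valid_symbols[DEDENT]) return false;
  if (s->depth == 0) return false; /* nothing left to close */
  bool closes_block = at_eof || column < s->indents[s->depth - 1];
  if (!closes_block) return false;
  s->depth--; /* pop the block that this DEDENT closes */
  return true;
}
```

With this shape, an incomplete `match x with` at the end of the input can still receive its DEDENT at EOF, while a position that closes no block produces no token.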
There was a problem hiding this comment.
This should be fixed now.
There was a problem hiding this comment.
We should generally be wary of the size of the parser. It went from ~30 MB to ~50 MB here, and 30 MB was already rather large. An increase in parser size generally comes from increased ambiguity within the grammar, and this is probably one of those cases where mimicking the language spec won't necessarily lead to a performant tree-sitter parser.
There was a problem hiding this comment.
Hmm, is the correct approach to try to keep it small, or to first get general F# parsing working and then make it more efficient? The parser is auto-generated, so I guess there are no easy wins with C function pointers (like higher-order functions in F#) or other tricks to make it small, but the grammar should be structured a certain way instead?
There was a problem hiding this comment.
I can see a potential issue with iterative development: when we have an auto-generated parser.c in source control and the end result changes by several megabytes per commit, the git repo will grow rapidly.
There was a problem hiding this comment.
I had a word with the tree-sitter maintainers, and they basically said that a) parser.c doesn't belong in source control, and b) don't worry about the parser.c size, since it is intentionally kept uncompressed and large; worry more about the binary size.
There was a problem hiding this comment.
> I had a word with the tree-sitter maintainers, and they basically said that a) parser.c doesn't belong in source control
That seems a bit contradictory to what the actual state of the tree-sitter ecosystem looks like (tree-sitter/tree-sitter#5269). If we were to remove the parser.c we would AFAIK break support for downstream consumers like nvim-treesitter, which depends on the parser.c to build the parser.
> b) don't worry about the parser.c size, since it is intentionally kept uncompressed and large; worry more about the binary size.
Sure, but you still have to download that uncompressed file before you can use the parser. And a large parser.c will nevertheless also result in a larger binary.
I'm fine with us moderately increasing the parser size while working on a more complete grammar, but in my experience the way to reduce parser size is to structure the grammar differently, so the more we increase the parser size, the more work we have to redo to reduce it again. This guide gives a pretty good indication of where the large size comes from.
There was a problem hiding this comment.
I tried many different tricks to shrink parser.c while still supporting all the features in this branch, and only shaved off about a megabyte, which doesn't really help when it's already 50 MB+. If we start to accept compromises, like "treat all numeric types as equal (int32 = int64)", then the size gets smaller, but at the cost of quality. What would actually cut around 40% of the size is a change to the parser file structure on the tree-sitter side, like tree-sitter/tree-sitter#5488, but that's not a quick win.
There was a problem hiding this comment.
I see that nvim-treesitter made it a requirement that downstream users have the tree-sitter cli installed, so we could remove the parser.c from the repo, which I think is worthwhile.
There are definitely some structural changes we could make to the grammar, which would bring us further from the language spec but might make it a better tree-sitter parser. I'm not sure about int32 = int64, but it might be something. The grammar already does things like not differentiating between expressions and expressions inside a computation expression, since that leads to a blowup in parser size for the very small gain of not being able to write let! in a normal expression block.
So if you find any construct where you want to merge them with the trade-off of a loss of accuracy wrt. the language spec I think it is worthwhile to experiment with.
There was a problem hiding this comment.
I pushed to my fork if you want to check, but as I said, these are not massive wins, and the question is the possible drawbacks. It seems the large size comes from symbol_count × state_count, so I tried to reduce those.
I tried removing _module_expression (one commit after this branch):
https://github.com/Thorium/tree-sitter-fsharp/tree/remove-module-expression
And then I tried a few other things (4 commits to this branch):
https://github.com/Thorium/tree-sitter-fsharp/tree/misc-testings
But these were experiments; they seemed to work, but the wins were not big enough to warrant PRs.
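The symbol_count × state_count observation can be made concrete with a small back-of-the-envelope helper. This is only a sketch under an assumption: it models the parse table as fully dense with fixed-width entries, whereas the generated tables are only partly dense, so it indicates how the size scales rather than predicting the actual parser.c size. `dense_table_bytes` and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Size in bytes of a fully dense state-by-symbol table with fixed-width
 * entries. An assumption for illustration, not the real table layout. */
static size_t dense_table_bytes(size_t state_count, size_t symbol_count,
                                size_t entry_bytes) {
  assert(entry_bytes > 0);
  return state_count * symbol_count * entry_bytes;
}
```

Under this model the size is linear in each factor, so halving either the number of symbols or the number of states halves the table, which is why merging rules or removing intermediate nodes is the lever being tried in the branches above.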
Before this branch, sample files with any error:
With this branch, sample files with any error:
That still sounds like a lot, but remember, we are not measuring parsed lines:
020ac3a to 4760499
I tried to continue, but it just got worse. I need a faster (parallel) way to evaluate results, and a better understanding of the parser.c growth, before I can continue. I think this PR is now "ready".
- Scanner: add * to is_infix_op_start() so multiplication on continuation lines is recognized as an infix operator
- Grammar: add 3-part from..step..to alternative to _slice_range_special for step range expressions (e.g. [0..2..10])
- Scanner: emit DEDENT/NEWLINE before returning false in the MULTI_DOLLAR_TRIPLE_QUOTE_START handler, fixing interpolated strings on dedented lines being absorbed into previous let bindings
Require 'with' keyword for standalone type_extension rule, preventing bodyless type definitions (e.g. [<Measure>] type Cent) from being parsed as type extensions that greedily consume following declarations. Record and union type definitions retain support for members both with and without the 'with' keyword via type_extension_elements.
Add _argument_type and _curried_return_type type subsets to correctly parse member signatures. Before this fix, `string -> string * string` in a member signature would incorrectly parse `*` as part of a tuple argument type rather than a tuple return type.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Port of PR ionide#171 - adds srtp_call_expression rule for parsing SRTP trait member invocations like (^T : (member Method : ...) arg). Uses restricted _srtp_type_argument matching only ^-prefixed type params to avoid conflicts with char literals.
…WLINE scanner token
Port of PR ionide#172 - adds type_declaration rule for bodyless type definitions (e.g. [<Measure>] type Dollars). Uses a new TYPE_DECL_NEWLINE external scanner token that fires at newline/EOF when the next non-blank line is not more indented, disambiguating bare declarations from types with bodies.
Note: measure_op_type was omitted as the existing measure/measure_quotient rules already handle A/B division in type contexts.
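The lookahead rule described for TYPE_DECL_NEWLINE can be sketched over a plain C string. `is_bare_type_decl` is a hypothetical helper, not the scanner code: it assumes `rest` points just past the last token of the declaration line, that indentation is compared against the column of the declaration itself (`decl_column`), and that each space or tab counts as one column. Comments between the lines are ignored here for brevity.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: true when the declaration ends at a newline or EOF
 * and the next non-blank line is not indented deeper than `decl_column`. */
static bool is_bare_type_decl(const char *rest, unsigned decl_column) {
  assert(rest);
  const char *p = rest;
  while (*p == ' ' || *p == '\t' || *p == '\r') p++; /* trailing whitespace */
  if (*p == '\0') return true;  /* EOF right after the declaration */
  if (*p != '\n') return false; /* more tokens on the same line */
  p++;
  for (;;) {
    unsigned column = 0;
    while (*p == ' ' || *p == '\t' || *p == '\r') {
      if (*p != '\r') column++;
      p++;
    }
    if (*p == '\0') return true; /* only blank lines until EOF */
    if (*p == '\n') {            /* blank line: keep looking */
      p++;
      continue;
    }
    return column <= decl_column; /* a deeper line means the type has a body */
  }
}
```

For a line such as `[<Measure>] type Dollars` followed by another top-level declaration the helper returns true, while a following line indented deeper than the declaration returns false, so the type is treated as having a body.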
Scanner fixes (common/scanner.h):
- Bug 1: 'begin' keyword check no longer corrupts state for identifier 'b'
- Bug 2: '@' operator on continuation line no longer produces zero-width ERROR
- Bug 3: Trailing semicolon in array comprehension no longer breaks parsing
Grammar fixes (fsharp/grammar.js):
- Bug 4: '..' range operator now works in multi-line [| |] and [ ] arrays/lists by adding optional(_newline) before _comp_or_range_expression and slice_ranges
- Bug 5: [<assembly:...>] attributes followed by bare expressions like () now parse correctly via new _attribute_expression rule in _module_elem
All 422 tests pass with no regressions. Parser.c size unchanged (~60MB).
- Regenerated fsharp_signature parser (inherits from fsharp/grammar.js which was modified but signature parser was not regenerated)
- Removed unnecessary conflict entry [preproc_if, preproc_if_in_expression] eliminating the tree-sitter generate warning
We seem to be on 1112/5317 now, known issues:
@Nsidorenco can we get this merged? This would already fix a lot of existing issues.
Yes @Thorium. Looks great, thank you for working on this!
Thanks. Is it possible to get a 0.1 version bump released so I can test this more easily with other tools?
Sure, a new version has been released.
Add support for the following F# syntax features (all tests passing):