Error recovery using lexer/parser separation #80


Draft · wants to merge 21 commits into base: 0.6.x

Conversation


@ProphetLamb commented Nov 14, 2022

Currently, compilation is unable to produce more than one error, even if the error is recoverable.
This PR implements a diagnostic pipeline that allows the definition of arbitrary diagnostics and logging during parsing.

Error recovery using a single-stage parser is not really feasible, so we split the parser and lexer.
The lexer only allocates memory when unescaping string literals; otherwise it produces a stream of slices.
I am currently looking into whether error recovery is possible in PEG without producing a bad grammar.
For now, all solutions I could find are ugly, because PEG doesn't allow us to access the underlying stream. So I am probably going to port the Backus-Naur-like PEG syntax to a parser combinator (nom).
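To make the lexer-with-recovery idea concrete, here is a minimal sketch (not code from this PR) of a nom-based tokenizer that reports an unexpected character and skips past it instead of aborting, so later errors can still surface. The Token enum and the plain-string diagnostics are placeholders for illustration.

use nom::{branch::alt, bytes::complete::tag, character::complete::{alpha1, digit1}, combinator::map, IResult};

#[derive(Debug)]
enum Token<'a> {
    Ident(&'a str),
    Int(&'a str),
    Plus,
}

// Lex the whole input; unknown characters are reported and skipped so
// lexing continues and can produce further diagnostics.
fn tokenize(mut input: &str) -> (Vec<Token<'_>>, Vec<String>) {
    let mut tokens = Vec::new();
    let mut diagnostics = Vec::new();
    loop {
        input = input.trim_start();
        if input.is_empty() {
            break;
        }
        let result: IResult<&str, Token> = alt((
            map(alpha1, Token::Ident),
            map(digit1, Token::Int),
            map(tag("+"), |_| Token::Plus),
        ))(input);
        match result {
            Ok((rest, tok)) => {
                tokens.push(tok);
                input = rest;
            }
            Err(_) => {
                // Recovery point: record a diagnostic and skip one character
                // instead of bubbling the failure up.
                let bad = input.chars().next().unwrap();
                diagnostics.push(format!("unexpected character `{bad}`"));
                input = &input[bad.len_utf8()..];
            }
        }
    }
    (tokens, diagnostics)
}

The key part is the Err arm: because the lexer owns the loop over the raw stream, it can decide where to resume, which is exactly what a single-stage PEG grammar makes awkward.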

How to report diagnostics

The diag_report! macro accepts either a Span or a Range<Span> to report an error. Diagnostic messages must be predefined as constants.

fn str_char_invalid(is: Span) -> IResult<Option<char>> {
    // Consume a backslash followed by any character.
    let (i, c) = preceded(char('\\'), anychar)(is.clone())?;
    // Report the invalid escape over the consumed span, then recover by
    // yielding no character instead of failing the parse.
    diag_report!((&is..&i), ERR_INVALID_ESCAPE, c);
    Ok((i, None))
}

diag produces error messages similar to Rust's:

error[ELN0001]: Invalid integer `02439853427592345284395923845320459823457324953247640359104519845834634538`, number too large to fit in target type
--> test.reds:1:0 to l1:74
| 02439853427592345284395923845320459823457324953247640359104519845834634538

TODO

  • Is PEG error recovery possible? If yes, implement it; if not, port to nom
  • Actually integrate the ported AST into the compiler pipeline
  • Don't use a boxed AST; instead use a pooled AST via context references (needed for syntax receivers for additional diagnostics and quick fixes)

@jac3km4
Owner

jac3km4 commented Nov 14, 2022

Hi, this looks like a nice addition!
Some suggestions/ideas:

  • I haven't used nom much, so I don't know how difficult it'd be, but if we're introducing a new parsing lib, I'd rather have it replace peg entirely than mix the two, to avoid bloat.
  • I'd avoid changing the AST to prevent this change from spilling all over the codebase. I think it should focus on the parser only and remain compatible with the old AST. If a new AST is really crucial, you can add the new parser in a new workspace crate and provide a conversion into the existing AST, so that a potential switch can be made in steps (and maybe the old AST can be abandoned one day if it works out).

Your change coincided with me switching the main branch to one with preliminary work for the next minor release, which is mostly not tested yet, apologies for that!

@ProphetLamb
Author

ProphetLamb commented Nov 15, 2022

@jac3km4 The changes to the AST mostly amount to replacing boxed slices with vectors created via bump allocation.
Bump allocation creates a lifetime-bound buffer that you can build a vector in at roughly the speed of a stack allocation, because it effectively is one. Growing a vec that is not on top of the bump stack moves it to the heap, without clearing the reserved space.
This is the best case for allocating a parse tree that is as flat as possible.

71a7580#diff-d951465f240179b5ea43bf25dbc628cb985688c0574305fa4fc71a3eae81a00dR8-R19

You can see this illustrated in the example rule linked above. Note that these are allocator-aware vectors from bumpalo::collections::Vec and have the same lifetime as the parse tree.

Alternatively, we can use the allocator API for std::vec::Vec instead; the bumpalo vec is just a reexport of a specific version of the std lib vec known to work.
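For illustration, here is a minimal sketch of what bump-allocated nodes look like with bumpalo (requires its collections feature); the Expr type is hypothetical, not the redscript AST.

use bumpalo::collections::Vec as BumpVec;
use bumpalo::Bump;

// Hypothetical AST node borrowing everything from the arena lifetime 'ast.
#[derive(Debug)]
enum Expr<'ast> {
    Ident(&'ast str),
    Call(&'ast Expr<'ast>, BumpVec<'ast, Expr<'ast>>),
}

fn main() {
    // One arena per parse; dropping it frees the whole tree at once.
    let arena = Bump::new();

    // The argument list grows inside the bump buffer; while it sits on top
    // of the bump "stack", growing it is effectively a pointer bump.
    let mut args = BumpVec::new_in(&arena);
    args.push(Expr::Ident(arena.alloc_str("x")));
    args.push(Expr::Ident(arena.alloc_str("y")));

    let callee = arena.alloc(Expr::Ident(arena.alloc_str("f")));
    let call = Expr::Call(callee, args);
    println!("{call:?}");
}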

@jac3km4
Owner

jac3km4 commented Nov 15, 2022

@ProphetLamb I've experimented with a pooled approach early on in the project and found there was no discernible difference in the CPU profile; it's far from being a bottleneck, and it was more complicated, so I dropped the idea.

@ProphetLamb
Author

ProphetLamb commented Nov 15, 2022

> @ProphetLamb I've experimented with a pooled approach early on in the project and found there was no discernible difference in the CPU profile; it's far from being a bottleneck, and it was more complicated, so I dropped the idea.

Luckily, bumpalo does all the work for us, and Box<[T<'ast>]> should be entirely replaceable with Vec<'ast, T<'ast>>.
Using a vec also allows syntax analyzers to modify the AST post-parse for features like metaprogramming, instead of hardcoded ModuleExists annotations, because we can allocate into 'ast at any time.

@jac3km4
Owner

jac3km4 commented Nov 15, 2022

@ProphetLamb I think bumpalo is the one I've tried, but I can't remember at the moment. I'd be open to this kind of change if it's shown to make a difference in performance, but that's not what I've seen.

@ProphetLamb
Author

Interesting. I transitioned the scripting language used for business logic at the company I work at to pooled allocation, and we got around 30% lower compilation times. But that one was written in C++; maybe Rust is just better at memory management overall.

@jac3km4
Owner

jac3km4 commented Nov 15, 2022

At the moment the compiler is pretty heavily IO-bound because it reads and writes fairly big cache files. These files contain bytecode for ~500k LoC written by the original game devs. I think there are a number of optimizations that could be applied there to significantly cut the compilation time.
Maybe then this would be more easily visible in the profile.
I'd be open to bringing in bump allocation, but I think it needs to be based on some observable benefit, because it does complicate the code a little bit in many places.

@ProphetLamb
Author

That totally makes sense. If it does complicate things, it is better to use the native approach, though this is not the case with nom parsers, since they allow accessing the underlying stream context.

@jac3km4
Owner

jac3km4 commented Nov 15, 2022

I think bump allocation in the code that parses the binaries could be worth a shot, and maybe zero-copy parsing of strings, which I avoided initially because the data loaded from these files gets mutated and mixed in with data produced by the compiler, but there are ways around that, like using Cow.
But IMO all of this is out of scope for the change you're working on and should be considered as a separate thing.
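As a rough sketch of the Cow idea (a toy escape scheme, not the actual cache-file format): borrow the input when nothing needs rewriting, and allocate only when the data has to change.

use std::borrow::Cow;

// Zero-copy where possible: borrow the input when it contains no escapes,
// allocate a new String only when the literal has to be rewritten.
fn unescape(raw: &str) -> Cow<'_, str> {
    if !raw.contains('\\') {
        return Cow::Borrowed(raw);
    }
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            Some('n') => out.push('\n'),
            Some('t') => out.push('\t'),
            Some(other) => out.push(other),
            None => {} // trailing backslash: dropped in this toy example
        }
    }
    Cow::Owned(out)
}

fn main() {
    assert!(matches!(unescape("plain"), Cow::Borrowed(_)));
    assert_eq!(unescape(r"a\nb"), "a\nb");
}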

@ProphetLamb
Author

That is a good point, maybe for another PR. Should I create an issue for that?

@jac3km4
Owner

jac3km4 commented Nov 15, 2022

sure!
