
Parsergen provides a Lexer class which assists in splitting an input string into Tokens. It allows you to define a set of rules that describe how the different tokens are produced. These rules can be regular expressions using the C++ ECMAScript syntax, or plain strings.

Each rule may also have a modifier function, which is run on the token after it has been matched. Modifiers can perform additional actions, such as signalling that the token should be ignored, or inspecting the regex result that matched the token.

Lexer

Subclassing the Lexer class allows you to define and match tokens effectively.

#include "parsergen/lexer.hpp"
using namespace Parsergen;

class MyLexer : public Lexer {
public:
    MyLexer(){
        rules = {
            token_match("NUMBER", "[0-9]+"),
            token_match("ID", "[a-zA-Z]+"),
            token_match_fast("SPACE", " ")
        };
    }
};

The function token_match matches based on a regular expression, whereas token_match_fast matches a constant string. Use token_match_fast for tokens that don't require a regular expression to match them, and you will see a reasonable speed improvement when lexing.

When lexing encounters a character or sequence of characters that cannot be matched by any of your rules, it throws a LexError.

See how to use the example Lexer below:

#include <iostream>
#include <memory>

int main(){
    std::unique_ptr<Lexer> my_lexer = std::make_unique<MyLexer>(); // a unique_ptr is required for handing the Lexer to a TokenStream later
    my_lexer->setText("123 abc 7");
    my_lexer->Lex(); // will throw a LexError if it fails
    // my_lexer->tokens holds a vector of the resulting tokens
    std::cout << my_lexer->tokens[0].type; // NUMBER
    std::cout << my_lexer->tokens[1].type; // SPACE
    std::cout << my_lexer->tokens[2].type; // ID
    std::cout << my_lexer->tokens[3].type; // SPACE
    std::cout << my_lexer->tokens[4].type; // NUMBER
}
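
If the input contains something that none of the rules can match, Lex() throws a LexError, so you will usually want to wrap the call in a try/catch. Below is a minimal sketch; it assumes LexError can be caught like a standard exception and that it exposes a message via what(), which is an assumption about its interface.

#include <iostream>
#include <memory>

int main(){
    auto my_lexer = std::make_unique<MyLexer>();
    my_lexer->setText("123 $ abc"); // '$' is not matched by any rule in MyLexer
    try {
        my_lexer->Lex();
    }
    catch (const LexError &e) {
        // assumption: LexError exposes its message via what(), like std::exception
        std::cout << "lexing failed: " << e.what();
        return 1;
    }
    return 0;
}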

Modifiers

Both types of token match accept a modifier function, which is executed after the token has been matched (a sketch that uses the regex match result is shown after the examples below).

void TokenModifierFunc(Token &tok, utils::svmatch sv); // Expected modifier function for token_match
void TokenModifierFunc(Token &tok); // Expected modifier function for token_match_fast

// these may be passed as lambdas
token_match_fast("SPACE", " ", [this](Token &tok){
    // signal that the token should not be included in the result list
    // (the input that matched this token is still consumed)
    throw NoToken();
})

// or by referring to ordinary functions (or member functions, if you need access to the Lexer's state)
void ignore(Token &tok){
    throw NoToken();
}

token_match_fast("SPACE", " ", ignore)

Note: you should call newline() whenever you match a newline so that the Lexer handles lines properly.
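
For example, here is a sketch of a NEWLINE rule whose modifier keeps the line count accurate and drops the newline token itself. The rule name and pattern are illustrative; token_match_fast, newline() and NoToken are as shown above.

token_match_fast("NEWLINE", "\n", [this](Token &tok){
    newline();       // keep the Lexer's line information accurate
    throw NoToken(); // the newline itself does not need to appear in the token list
})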

Token Stream

The TokenStream class provides the interface between a Lexer and a Parser, so it is important to know how to construct one.

#include <memory>

int main(){
    // continue from the main function above
    auto token_stream = std::make_unique<TokenStream>(std::move(my_lexer));
    // the lexer has been "moved" out of the my_lexer unique_ptr, so that variable can no longer be used to interact with it
    // this token stream can then be used to create a Parser instance
}