Lexing
Note: lexing is also known as tokenizing or lexical analysis. Also, scanning means basically the same thing as lexing (https://github.com/kach/nearley/issues/265).
Lexing
DFAs, NFAs, and regular expressions deal with the problem of recognition. However, for a compiler, we want to ask a different question:

Can a word $w$ be split into tokens $w_1, w_2, \ldots, w_n$ such that every $w_i$ is in the language $L$ of valid tokens?

Usually, this language is some regular language defined by a DFA.

This is possible if and only if $w \in L^*$.
The job of a lexer (scanner, tokenizer) is to partition a program (a string of characters) into substrings that are valid tokens (words in the language of valid tokens).
Tokens, Kind, and Lexeme
In practice, tokens are annotated with a kind, with each accepting state in the DFA generating a particular kind of token (e.g. `42` -> number, `x` -> identifier, `def` -> keyword, `+` -> plus).

A token is a pair of a kind and a lexeme, which is the actual string of characters that was scanned to make up the token.
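As a sketch, a token can be modelled as a kind paired with its lexeme (the kind names below are illustrative, not from the notes):

```scala
// Illustrative token kinds; a real lexer has one per accepting DFA state.
sealed trait Kind
case object NumberKind     extends Kind
case object IdentifierKind extends Kind
case object KeywordKind    extends Kind
case object PlusKind       extends Kind

// A token pairs a kind with the lexeme that was scanned.
final case class Token(kind: Kind, lexeme: String)

// "42 + x" would lex to:
val example = List(Token(NumberKind, "42"), Token(PlusKind, "+"), Token(IdentifierKind, "x"))
```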
The Scanning Problem
- An input is a word $w$ and a language $L$ of valid tokens.
- The output is a sequence of non-empty words $w_1, w_2, \ldots, w_n$ such that:
  - If we concatenate all the words together, we get the word back: $w = w_1 w_2 \cdots w_n$
  - Each of the words is in the language of valid tokens: $w_i \in L$ for every $i$
The scanning problem does not always have a solution. For example, suppose the valid tokens include the number `0` but not `x` or `0x`. Then the word `0x` has no solution: the only way to split it is into `0` and `x`, and `x` is not a valid token.
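The two conditions can be checked directly for any candidate split; a minimal sketch, with an illustrative token set:

```scala
// A split is a valid scan of w iff all pieces are non-empty,
// they concatenate back to w, and each piece is a valid token.
def isValidScan(tokens: Set[String], w: String, split: List[String]): Boolean =
  split.forall(_.nonEmpty) &&
    split.mkString == w &&
    split.forall(tokens.contains)

val L = Set("0", "1", "42") // illustrative: some number tokens, no "x"

val good = isValidScan(L, "042", List("0", "42")) // true
val bad  = isValidScan(L, "0x", List("0", "x"))   // false: "x" is not a token
```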
Solution
Deciding when the problem has a solution is easy: the scanning problem corresponds exactly to the definition of the Kleene star operator of regular expressions. If we have a regular expression $R$ describing the language of valid tokens $L$, then a word $w$ can be scanned if and only if $w$ matches $R^*$, i.e. $w \in L^*$.
Deciding on a specific solution is less easy, and a solution is not necessarily unique.
- When $w = \texttt{aaaaa}$ and $L = \{\texttt{aa}, \texttt{aaa}\}$, two solutions are `aa|aaa` and `aaa|aa`.
- When $L$ is the language of natural numbers written in decimal, `42` can be scanned as either `42` or `4|2`.
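Whether a word can be scanned at all can therefore be checked with an ordinary regex engine by wrapping the token expression in a Kleene star, as a quick sketch:

```scala
// R = aa|aaa describes the valid tokens; R* decides scannability.
val star = "(?:aa|aaa)*".r

val five = star.matches("aaaaa") // true: e.g. aa|aaa
val four = star.matches("aaaa")  // true: aa|aa
val one  = star.matches("a")     // false: no split into valid tokens
```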
Maximal Munch Scanning Problem
When we write 42, we mean 42, not 4|2. We can add a third condition to the scanning problem:
- An input is a word $w$ and a language $L$ of valid tokens.
- The output is a sequence of non-empty words $w_1, w_2, \ldots, w_n$ such that:
  - If we concatenate all the words together, we get the word back: $w = w_1 w_2 \cdots w_n$
  - Each of the words is in the language of valid tokens: $w_i \in L$ for every $i$
  - Each $w_i$ is the longest prefix of the remainder of the input $w_i w_{i+1} \cdots w_n$ that is in $L$
Maximal munch requires that the scanner, at each step, greedily find the longest possible token before continuing with the rest of the input. This way, the output is unique.
One disadvantage is that there may be solutions to the scanning problem that the maximal munch scanning problem rules out. For example, with $L = \{\texttt{aa}, \texttt{aaa}\}$ and input `aaaa`, maximal munch first takes `aaa` and then gets stuck on the remaining `a`, even though `aa|aa` is a valid scan.
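The greedy behaviour can be sketched over a finite token set (the names here are illustrative); note that maximal munch never reconsiders a shorter token once the longest prefix is chosen:

```scala
// The longest token that is a prefix of the input, if any.
def longestPrefix(tokens: Set[String], input: String): Option[String] =
  tokens.filter(t => input.startsWith(t)).maxByOption(_.length)

// Greedy maximal munch: repeatedly take the longest valid prefix.
def maximalMunch(tokens: Set[String], input: String): Option[List[String]] =
  if (input.isEmpty) Some(Nil)
  else longestPrefix(tokens, input) match {
    case Some(t) => maximalMunch(tokens, input.drop(t.length)).map(t :: _)
    case None    => None // stuck: no token is a prefix of the remainder
  }

val L = Set("aa", "aaa")
val ok    = maximalMunch(L, "aaaaa") // Some(List("aaa", "aa"))
val stuck = maximalMunch(L, "aaaa")  // None, even though aa|aa is a valid scan
```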
Another example: in C++, a maximal munch scanner lexes the consecutive `>` characters that close nested templates as a single right-shift operator `>>`. C++11 added a special rule to allow this:
```cpp
std::vector<std::vector<int>> my_mat_11;  // Incorrect in C++03, correct in C++11.
std::vector<std::vector<int> > my_mat_03; // Correct in either C++03 or C++11.
```
Implementation
If the language $L$ of valid tokens is given by a DFA, we can implement maximal munch scanning as follows:
- Run the DFA for $L$ on the remaining input until the DFA gets stuck or the end of input is reached.
- If the DFA is in a non-accepting state, backtrack the DFA and the input to the last-visited accepting state.
- If no accepting state was visited, then there is no solution; report an error.
- Output a token corresponding to the accepting state that the DFA is in.
- Reset the DFA to the start state and repeat until all input is consumed.
We can keep track of the last-visited accepting state to easily backtrack:
```scala
def scanOne(input: List[Char],             // remaining input
            state: State,                  // current DFA state
            backtrack: (List[Char], State) // last-visited accepting state
           ): (List[Char], State)
```
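A fuller sketch of the algorithm, assuming a DFA represented by a start state, a set of accepting states, and a transition map (the representation and names are assumptions, not from the notes):

```scala
type State = Int

// A DFA over characters; missing entries in delta mean "stuck".
final case class DFA(start: State, accepting: Set[State], delta: Map[(State, Char), State])

// Run the DFA until stuck or end of input, remembering the last-visited
// accepting state, then backtrack to it (maximal munch for one token).
def scanOne(dfa: DFA, input: List[Char], state: State,
            backtrack: Option[(List[Char], State)]): Option[(List[Char], State)] = {
  val bt = if (dfa.accepting(state)) Some((input, state)) else backtrack
  input match {
    case c :: rest =>
      dfa.delta.get((state, c)) match {
        case Some(next) => scanOne(dfa, rest, next, bt)
        case None       => bt // stuck: backtrack to last accepting state
      }
    case Nil => bt // end of input
  }
}

// Repeatedly scan one token until all input is consumed.
def scan(dfa: DFA, input: List[Char]): Option[List[String]] =
  if (input.isEmpty) Some(Nil)
  else scanOne(dfa, input, dfa.start, None) match {
    case Some((rest, _)) if rest.length < input.length =>
      val lexeme = input.take(input.length - rest.length).mkString
      scan(dfa, rest).map(lexeme :: _)
    case _ => None // no accepting state visited (or empty token): error
  }

// DFA accepting the tokens "aa" (state 2) and "aaa" (state 3).
val dfa = DFA(0, Set(2, 3), Map((0, 'a') -> 1, (1, 'a') -> 2, (2, 'a') -> 3))
val ok    = scan(dfa, "aaaaa".toList) // Some(List("aaa", "aa"))
val stuck = scan(dfa, "aaaa".toList)  // None: "aaa" is taken first, then stuck on "a"
```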