Lexing

#folder

Note: lexing is also known as tokenizing or lexical analysis. Also, scanning means basically the same thing as lexing (https://github.com/kach/nearley/issues/265).

Lexing

DFAs, NFAs, and regular expressions deal with the problem of recognition. However, for a compiler we want to ask a different question:

Can a word $w$ be split into tokens such that every token is in a given language $L$?

Usually, this language is some regular language defined by a DFA.

This is possible if $w = w_1 w_2 \cdots w_n$, where each $w_i \in L$.

Info

The job of a lexer (scanner, tokenizer) is to partition a program (a string of characters) into substrings that are valid tokens (words in the language $L$ of valid tokens).

Tokens, Kind, and Lexeme

In practice, tokens are annotated with a kind, with each accepting state in the DFA generating a particular kind of token (e.g. 42 -> number, x -> identifier, def -> keyword, + -> plus).

Definition

A token is a pair of a kind and a lexeme, which is the actual string of characters that were scanned to make up the token.
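As a concrete sketch (in Python rather than the notes' Scala; the class and kind names are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    kind: str    # e.g. "number", "identifier", "keyword", "plus"
    lexeme: str  # the exact characters that were scanned

# The kinds from the examples above, as kind/lexeme pairs:
tokens = [Token("number", "42"), Token("identifier", "x"),
          Token("keyword", "def"), Token("plus", "+")]
```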

The Scanning Problem

Definition

  • An input is a word $w$ and a language $L$ of valid tokens
  • The output is a sequence of non-empty words $w_1, w_2, \dots, w_n$ such that:
    1. If we concat all the words together, we get the word back: $w = w_1 w_2 \cdots w_n$
    2. Each of the words is in the language of valid tokens: $w_i \in L$ for all $i$

The scanning problem does not always have a solution.
For example, there is no solution when the word is 0x and $L$ is the language of decimal and hexadecimal numbers (specified by the example in DFAs). This is because 0x itself is not in $L$, and if we split it into the two tokens 0 and x, we find that x is not in $L$.
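The scanning problem can be phrased directly as a search. A brute-force Python sketch (the membership test for $L$ is passed in as a predicate; the function names are illustrative), using the decimal/hexadecimal example:

```python
import re

def scan_exists(word, in_language):
    """Return True iff `word` can be split into non-empty substrings
    that are all in the language (brute-force search)."""
    if word == "":
        return True
    # Try every non-empty prefix that is a valid token.
    return any(in_language(word[:i]) and scan_exists(word[i:], in_language)
               for i in range(1, len(word) + 1))

# L = decimal numbers or hexadecimal numbers such as 0x2A
def in_L(s):
    return re.fullmatch(r"[0-9]+|0x[0-9a-fA-F]+", s) is not None

print(scan_exists("0x2A", in_L))  # True: "0x2A" is itself a token
print(scan_exists("0x", in_L))    # False: "0x" and "0"|"x" both fail
```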

Solution

Deciding whether the problem has a solution:

Definition

If we have a regular expression $R$ specifying $L$, then the scanning problem has a solution if and only if $w$ is in the language of a different regular expression: $R^*$.

The scanning problem corresponds exactly to the definition of the Kleene star operator of regular expressions: $w \in L(R^*)$ exactly when we can partition $w$ into subwords that are all in the language of $R$.
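This can be checked with an ordinary regex engine: build $R^*$ from $R$ and test whole-string membership. A Python sketch, assuming the decimal/hexadecimal language from the earlier example:

```python
import re

R = r"[0-9]+|0x[0-9a-fA-F]+"   # regular expression for L
R_star = rf"(?:{R})*"          # regular expression for L*

def has_scan(word):
    """The scanning problem for `word` has a solution iff word is in L(R*)."""
    return re.fullmatch(R_star, word) is not None

print(has_scan("420x2A"))  # True: "42" followed by "0x2A"
print(has_scan("0x"))      # False: no split of "0x" into tokens of L
```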

Deciding on a specific solution is less easy, and a solution is not necessarily unique.

Example

  • When $w$ = aaaaa and $L$ = {aa, aaa}, two solutions are aa|aaa and aaa|aa
  • When $L$ is the language of natural numbers written in decimal, 42 can be scanned as either 42 or 4|2
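Non-uniqueness is easy to see by enumerating every scan. A Python sketch, assuming L = {aa, aaa} as in the aa|aaa example:

```python
def all_scans(word, tokens):
    """Enumerate every way of splitting `word` into words from `tokens`."""
    if word == "":
        yield []
        return
    for t in tokens:
        if word.startswith(t):
            for rest in all_scans(word[len(t):], tokens):
                yield [t] + rest

print(list(all_scans("aaaaa", ["aa", "aaa"])))
# [['aa', 'aaa'], ['aaa', 'aa']] -- the solution is not unique
```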

Maximal Munch Scanning Problem

When we write 42, we mean 42, not 4|2. We can add a third condition to the scanning problem:

Definition

  • An input is a word $w$ and a language $L$ of valid tokens
  • The output is a sequence of non-empty words $w_1, w_2, \dots, w_n$ such that:
    1. If we concat all the words together, we get the word back: $w = w_1 w_2 \cdots w_n$
    2. Each of the words is in the language of valid tokens: $w_i \in L$ for all $i$
    3. Each $w_i$ is the longest prefix of the remainder of the input that is in $L$

Maximal munch requires that the scanner, at each step, greedily find the longest possible token before continuing with the rest of the input. This way, the output is unique.

One disadvantage is that there may be solutions to the scanning problem that are not solutions to the maximal munch scanning problem.

Example

$w$ = aaaa with $L$ = {aa, aaa} does not have a maximal munch scanning solution, because the first token would be aaa, which leaves just a, which is not in the language (even though aa|aa is a valid scan).
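A greedy maximal munch scanner makes this concrete: on an input where a scan exists, the greedy choice can still get stuck. A Python sketch over a finite token set (the function name is illustrative):

```python
def maximal_munch(word, tokens):
    """Greedy scan: repeatedly take the longest prefix that is a token.
    Returns the token list, or None if the scan gets stuck."""
    out = []
    while word:
        munch = max((t for t in tokens if word.startswith(t)),
                    key=len, default=None)
        if munch is None:
            return None  # no prefix of the remainder is a token
        out.append(munch)
        word = word[len(munch):]
    return out

L = {"aa", "aaa"}
print(maximal_munch("aaaaa", L))  # ['aaa', 'aa'] -- the unique greedy scan
print(maximal_munch("aaaa", L))   # None: greedy takes 'aaa', leaving 'a'
```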

Another example: in C++ (before C++11), consecutive >s closing a nested template would be interpreted as a right-shift operator:

std::vector<std::vector<int>> my_mat_11; // Incorrect in C++03, correct in C++11.
std::vector<std::vector<int> > my_mat_03; // Correct in either C++03 or C++11.

Implementation

Steps

If a language is defined by a DFA:

  1. Run the DFA for $L$ on the remaining input until the DFA gets stuck or the end of the input is reached
  2. If the DFA is in a non-accepting state, backtrack the DFA and the input to the last-visited accepting state.
    • If no accepting state is visited, then there is no solution, error
  3. Output a token corresponding to the accepting state that the DFA is in
  4. Reset the DFA to the start state and repeat

We can keep track of the last-visited accepting state to easily backtrack:

def scanOne(input: List[Char],       // remaining input to scan
    state: State,                    // current DFA state
    backtrack: (List[Char], State)   // input and state at the last accepting state visited
): (List[Char], State)               // remaining input and the accepting state for this token
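The steps above can be sketched in Python (not the notes' Scala; the DFA encoding as a transition dictionary and the example DFA are assumptions of this sketch): run the DFA, remember the last accepting configuration, and backtrack to it when stuck:

```python
import string

def scan(word, start, accepting, delta):
    """Maximal munch scan of `word` with a DFA given by a start state,
    a set of accepting states, and a transition dict
    delta[(state, char)] -> state (a missing key means the DFA is stuck).
    Returns a list of (accepting_state, lexeme) tokens."""
    tokens = []
    i = 0
    while i < len(word):
        state, j = start, i
        backtrack = None  # (end position, state) at the last accepting state
        while j < len(word) and (state, word[j]) in delta:
            state = delta[(state, word[j])]
            j += 1
            if state in accepting:
                backtrack = (j, state)
        if backtrack is None:
            raise ValueError(f"no token at position {i}: {word[i:]!r}")
        end, acc = backtrack
        tokens.append((acc, word[i:end]))  # emit token, reset DFA, repeat
        i = end
    return tokens

# Example DFA for the decimal/hexadecimal language:
# accepts [0-9]+ and 0x[0-9a-fA-F]+
delta = {}
for d in string.digits:
    delta[("start", d)] = "zero" if d == "0" else "num"
    delta[("num", d)] = "num"
    delta[("zero", d)] = "num"
delta[("zero", "x")] = "0x"
for h in string.hexdigits:
    delta[("0x", h)] = "hex"
    delta[("hex", h)] = "hex"

print(scan("0x2A", "start", {"num", "zero", "hex"}, delta))
# [('hex', '0x2A')] -- the DFA passes through the non-accepting "0x"
# state and would backtrack to the token "0" if no hex digit followed
```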