CYK Parsing

CYK is short for the Cocke–Younger–Kasami algorithm.

Daniel Younger is a professor Emeritus in CO at UW.

Typically, the algorithm is presented in the form of dynamic programming. The grammar would have been transformed into what is known as "Chomsky normal form", which simplifies the DP.

Let be the start symbol of the input context-free grammar . The parser needs to decide whether derives .

In general, we ask if , and we start with .

Implementation

Given production rule , we want to find . Note that and may be the empty word .

Pseudocode:

def parse (α, x) = // Does α =>^* x?
    if (α is empty) // Case 1
        if (x is empty)
            true
        else
            false
    else if (α = aβ) // Case 2: α starts with a terminal, then has other stuff, A =>* x ?
        if (x = az && parse(β, z))
            true
        else
            false
    else if (α = A) // Case 3: α is a single non-terminal, A => γ =>* x ?
        for each rule A -> γ in p // Try to change γ into x
            if (parse(γ, x))
                return true
        false
    else // case 4: α = Aβ (starts with non-terminal, then has other stuff),
         // if A =>^* x1, β =>^* x2, x = x1 x2
        for each way to split x=x1 x2
            if (parse(A, x1) and parse(β, x2))
                return true
        false
Cases

Explanation

Case 1: is the empty sequence. The empty sequence derives only itself: .

Now we know is non-empty. Then it must start with some symbol (terminal or non-terminal), followed by the rest of the sequence (which could be empty).

Case 2: starts with a terminal , followed by the rest of the sequence . Since is a terminal symbol, it cannot change to anything else. So in order for to derive , needs to start with . Let be the rest of , after the initial , to . Now derives if and only if derives , which we can decide using a recursive call to parse.
To clear up some confusion: does not necessarily equal . comes from (start), while comes from (target).

We are left with the case that starts with a non-terminal symbol , followed by the rest of the sequence (which could be empty).

Case 3: starts with a non-terminal symbol , followed by the empty string. We recursively call parse on all production rules that accept as an input and see if one succeeds.

Case 4: starts with a non-terminal symbol , followed by , a non-empty sequence of terminal and non-terminal symbols.
In this case, , and we can split into two (possibly empty) substrings and , such that derives and derives (i.e decide if and ).
Recursively call parse on all possibly ways to split the substrings.

Intuition

This might feel counterintuitive, but what we are doing is going from the grammar, and trying every possibility with the goal of re-creating the input tokens in the form of a parse tree.

Memoization

Problem: If we try parse(expr, ID + ID), we recurse infinitely
Solution: remember where you've been, don't go back again (memoization)

Example

Example trace:

parse(expr, ID + ID) // Case 3
    // Try the production expr -> ID
    parse(ID, ID + ID) // Case 2
        parse(ε, + ID) // Case 1
            false
        false

    // Try the production expr -> expr op expr
    parse(expr op expr, ID + ID) // Case 4
        // Try splitting ID + ID into ε and ID + ID
        parse(expr, ε) // Case 3
            // Try the production expr - ID
            parse(ID, ε) // Case 2
                false
            // Try the production expr -> expr op expr
            parse(expr op expr, ε) // Case 4
                // Try splitting ε into ε and ε
                parse(expr, ε) // case 3
                    // WE HAVE SEEN THIS BEFORE, and this will be recursive
                    false // By memoization
                false
            false
        // Try splitting ID + ID into ID and + ID
        parse(expr, ID) // Case 3
            // Try the production extr -> ID
            parse(ID, ID) // Case 2
                parse(ε, ε) // case 1
                    true
                true
            true
        true
    true

Implementation

We want a table mapping the arguments to parse to the results that parse has returned for them in the past.
The keys are the pairs of the values of and .
For the values, the table can be in one of three states:

  1. parse has not yet been called with the given arguments and
  2. parse has been called with the given arguments and returned true
  3. parse has been called with the given arguments and returned false

Using as part of the key is inefficient, since can be very long (length of entire input). Since , we can store two integers, which represent the start and length of in (alternatively, you could have the start and end of ).

We don't have to worry about the length of since is either a terminal or non-terminal symbol, which are relatively short.

To implement memoization, we need to make the following modifications to the algorithm:

  1. At each place where the algorithm returns a value, add that value to the memoization map
  2. At the beginning of parse, check whether the memoization map already has the values and . If it is, just return the value from the map and skip the rest of the computation
  3. At the beginning of parse, after we have determined that the memoization map not yet defined and before we begin computation, add the value false for the current values and . This ensures that if we encoundter the same question again, we return false.
recur(α, from, length) <-> parse(α, w[from ... from + length - 1])
val memo = Map((Seq[Symbol], Int, Int, Boolean))

// At beginning of recur, if memo(...) is defined, return it
// Set memo(...) to false to avoid cycle
// At every return from recur, set memo(...) to return value

Space Complexity

See Big O

Info

For the table, since start and length are integers in the range 0 to the length of the input , they each have possibilities, so the complexity of the algorithm is bounded by .

In the worst case, not including the recursive calls, we can take up to time because case 4 iterates the length of . Since each execution of the main body adds an entry to the table, the number of executions is bounded by the number of table entries.

There are entries in the memoization table and each execution of the main body of parse takes time, so the overall running time of the algorithm is , where is the length of , the input string.

Constructing a Parse Tree

But we don't just want true or false, we want TREES.

Info

From each symbol we want a tree with that symbol as the root. The concatenation of the leaves of all trees returned should be the sequence of of terminal symbols.

Cases

Replace all "true" with a tree:

  1. Empty sequence, return empty sequence of trees
  2. is some terminal, and then is the rest of the sequence.
    The sequence starts with tree showing , then parse trees showing .
  3. consists of one symbol non-terminal
    The recursive call to parse returns a sequence of trees with symbols as theri roots.
  4. is some non-terminal and is the rest
    Two recursive calls: one returns one tree whose root is and whose leaves are the symbol , the other returns the a sequence of trees whose roots are the symbols of and whose leaves, concatenated together, form . These trees are returned as one sequence of trees.

In Scala for example, the return type will be Option[Seq[Tree]], and we can return Some(tree) or None.