CYK Parsing
CYK is short for the Cocke–Younger–Kasami algorithm.
Daniel Younger is a professor Emeritus in CO at UW.
Typically, the algorithm is presented in the form of dynamic programming. The grammar would have been transformed into what is known as "Chomsky normal form", which simplifies the DP.
Let
In general, we ask if
Given production rule
Pseudocode:
def parse (α, x) = // Does α =>^* x?
if (α is empty) // Case 1
if (x is empty)
true
else
false
else if (α = aβ) // Case 2: α starts with a terminal, then has other stuff, A =>* x ?
if (x = az && parse(β, z))
true
else
false
else if (α = A) // Case 3: α is a single non-terminal, A => γ =>* x ?
for each rule A -> γ in p // Try to change γ into x
if (parse(γ, x))
return true
false
else // case 4: α = Aβ (starts with non-terminal, then has other stuff),
// if A =>^* x1, β =>^* x2, x = x1 x2
for each way to split x=x1 x2
if (parse(A, x1) and parse(β, x2))
return true
false
Explanation
Case 1:
Now we know
Case 2: parse
.
To clear up some confusion:
We are left with the case that
Case 3: parse
on all production rules that accept
Case 4:
In this case,
Recursively call parse
on all possibly ways to split the substrings.
This might feel counterintuitive, but what we are doing is going from the grammar, and trying every possibility with the goal of re-creating the input tokens in the form of a parse tree.
Memoization
Problem: If we try parse(expr, ID + ID)
, we recurse infinitely
Solution: remember where you've been, don't go back again (memoization)
Example trace:
parse(expr, ID + ID) // Case 3
// Try the production expr -> ID
parse(ID, ID + ID) // Case 2
parse(ε, + ID) // Case 1
false
false
// Try the production expr -> expr op expr
parse(expr op expr, ID + ID) // Case 4
// Try splitting ID + ID into ε and ID + ID
parse(expr, ε) // Case 3
// Try the production expr - ID
parse(ID, ε) // Case 2
false
// Try the production expr -> expr op expr
parse(expr op expr, ε) // Case 4
// Try splitting ε into ε and ε
parse(expr, ε) // case 3
// WE HAVE SEEN THIS BEFORE, and this will be recursive
false // By memoization
false
false
// Try splitting ID + ID into ID and + ID
parse(expr, ID) // Case 3
// Try the production extr -> ID
parse(ID, ID) // Case 2
parse(ε, ε) // case 1
true
true
true
true
true
We want a table mapping the arguments to parse
to the results that parse
has returned for them in the past.
The keys are the pairs of the values of
For the values, the table can be in one of three states:
parse
has not yet been called with the given argumentsand parse
has been called with the given arguments and returnedtrue
parse
has been called with the given arguments and returnedfalse
Using
We don't have to worry about the length of
To implement memoization, we need to make the following modifications to the algorithm:
- At each place where the algorithm returns a value, add that value to the memoization map
- At the beginning of
parse
, check whether the memoization map already has the valuesand . If it is, just return the value from the map and skip the rest of the computation - At the beginning of
parse
, after we have determined that the memoization map not yet defined and before we begin computation, add the valuefalse
for the current valuesand . This ensures that if we encoundter the same question again, we return false
.
recur(α, from, length) <-> parse(α, w[from ... from + length - 1])
val memo = Map((Seq[Symbol], Int, Int, Boolean))
// At beginning of recur, if memo(...) is defined, return it
// Set memo(...) to false to avoid cycle
// At every return from recur, set memo(...) to return value
Space Complexity
See Big O
For the table, since start
and length
are integers in the range 0
to the length of the input
In the worst case, not including the recursive calls, we can take up to
There are parse
takes
Constructing a Parse Tree
But we don't just want true or false, we want TREES.
From each symbol we want a tree with that symbol as the root. The concatenation of the leaves of all trees returned should be the sequence of
Replace all "true" with a tree:
- Empty sequence, return empty sequence of trees
is some terminal, and then is the rest of the sequence.
The sequence starts with tree showing, then parse trees showing . consists of one symbol non-terminal
The recursive call to parse returns a sequence of trees with symbolsas theri roots. is some non-terminal and is the rest
Two recursive calls: one returns one tree whose root isand whose leaves are the symbol , the other returns the a sequence of trees whose roots are the symbols of and whose leaves, concatenated together, form . These trees are returned as one sequence of trees.
In Scala for example, the return type will be Option[Seq[Tree]]
, and we can return Some(tree)
or None
.