Context-Free Grammars
Consider the expression a + b * c - d
, we would have a DFA:
but if we wrote (a + b) * (c - d)
, our DFA would reject it. We could make this DFA:
But this DFA would reject ((a + b) * (c - d))
. The problem is, we can have an infinite number of parentheses, so we cannot define this language with a definite finite automaton. Instead, we need to use a context-free language.
Production Rules
A production rule is a rewrite rule specifying a symbol substitution that can be recursively performed to generate new symbol sequences
Example production rules for a CFG are:
Explanation:
- An expression is either an identifier, or an expression followed by an operator and anothe expression, or an expression in parentheses
- An operator is one of the four listed arithmetic operators
Parse Trees
In a CFG, we can construct parse trees for words:
- Leaves are symbols in the word
- Internal nodes are a production rule in the grammar
- Label: left side of a rule
- Children: right side of the rule
For the expression (a + b) * c
, we have the parse tree:
/** A sample grammar. */
val grammar = parseGrammar(
"""
S BOF A EOF
A x
A A x
""")
/** A parse tree of the string `BOF x EOF` generated by `grammar`. */
val oneX: Tree =
new Tree(Token("S"), Seq(
new Tree(Token("BOF")),
new Tree(Token("A"), Seq(
new Tree(Token("x")),
)),
new Tree(Token("EOF")),
))
/** A parse tree of the string `BOF x x EOF` generated by `grammar`. */
lazy val twoXes: Tree =
new Tree(Token("S"), Seq(
new Tree(Token("BOF")),
new Tree(Token("A"), Seq(
new Tree(Token("A"), Seq(
new Tree(Token("x")),
)),
new Tree(Token("x")),
)),
new Tree(Token("EOF")),
))
/** A parse tree of the string `BOF x x x EOF` generated by `grammar`. */
lazy val threeXes: Tree =
new Tree(Token("S"), Seq(
new Tree(Token("BOF")),
new Tree(Token("A"), Seq(
new Tree(Token("A"), Seq(
new Tree(Token("A"), Seq(
new Tree(Token("x")),
)),
new Tree(Token("x")),
)),
new Tree(Token("x")),
)),
new Tree(Token("EOF")),
))
Ambiguity
An expression for a grammar might not have a unique tree.
A CFG is ambiguous if it generates multiple parse trees for some input string
We want to design our grammar without ambiguity (in the example, we would specify that our grammar follows BEDMAS)
These production rules are ambiguous:
where as these production rules that generate the same language are unambiguous:
Context Free Grammars Formally
For the grammar
we would have:
is the set of: expr -> ID
expr -> expr op expr
expr -> (expr)
op -> +
op -> -
op -> *
op -> /
and we would write:
The language is called context-free because it is not context-sensitive. In a context-sensitive language, for the production rules, you can have sequences on the left side of the production rules rather than single symbols.
Conventions
that is:
- Variables
will refer to non-terminal symbols in - Variables
will refer to terminal symbols in - Variables
will refer to sequences of terminal symbols from (i.e ) - Variables
will refer to terminal or non-terminal symbols in - Variables
will refer to sequences of terminal or non-terminal symbols from (i.e )
Direct Derivation
Whenever
NB: do not mix this up with an implication. To disambiguate, we will use a shorter arrow (
Derivation
This is just an extreme formalization running production rules on expr
to get something else
Generation or Specification
Context-Free Language
A language is context-free is there exists a CFG that specified/generates it
Undecidability
It is undecidable (there does not exist an algorithm to decide), given two CFGs
For a regular language, it is decidable whether two DFAs, NFAs, or regular expressions specify the same language.
It is undecidable whether a given context-free grammar is ambiguous. It is possible to prove that a specific grammar is unambiguous, but there is no algorithm that can determine this for any arbitrary grammar.