Context-‐free grammars
-
Upload
khangminh22 -
Category
Documents
-
view
1 -
download
0
Transcript of Context-‐free grammars
Lexical analysis (scanning)
Syntac6c analysis (parsing)
Seman6c analysis
Intermediate code genera6on
Op6miza6on
Machine code genera6on
source code
machine code
tokens
AST APributed AST
intermediate code
intermediate code
Compiler phases and program representa6ons
Analysis Synthesis
2
tokens
AST
A closer look at the parser
Pure parsing
AST building
concrete parse tree (implicit)
Parser
3
Scanner text
regular expressions
context-‐free grammar
abstract grammar
Defined by:
The concrete parse tree is never constructed. The parser builds the AST directly using seman<c ac<ons.
This lecture
Regular Expressions vs Context-‐Free Grammars
4
An RE can have itera<on
A CFG can also have recursion (it is possible to derive a symbol, e.g., Stmt, from itself)
Example RE: ID = [a-z][a-z0-9]* �
Example CFG: Stmt -> WhileStmt | AssignStmt | …�WhileStmt -> WHILE LPAR Exp RPAR Stmt �…�
Elements of a Context-‐Free Grammar
5
Nonterminal symbols
Terminal symbols (tokens)
Produc<on rules: N -> s1 s2 … sn �
where sk is a symbol (terminal or nonterminal)
Start symbol (one of the nonterminals, usually the first one)
Example CFG: Stmt -> WhileStmt | AssignStmt | …�WhileStmt -> WHILE LPAR Exp RPAR Stmt �…�
Exercise Construct a grammar covering the following program:
6
CFG: Stmt -> WhileStmt | AssignStmt | CompoundStmt �WhileStmt -> "while" "(" Exp ")" Stmt �AssignStmt -> ID "=" Exp �CompoundStmt ->�Exp ->�LessEq ->�Add ->�
Example program: while (k <= n) {sum = sum + k; k = k+1;} �
Usually, simple tokens are wriPen directly as text strings
Solu6on Construct a grammar covering the following program:
7
CFG: Stmt -> WhileStmt | AssignStmt | CompoundStmt �WhileStmt -> "while" "(" Exp ")" Stmt �AssignStmt -> ID "=" Exp �CompoundStmt -> "{" (Stmt ";")* "}"�Exp -> LessEq | Add | ID �LessEq -> Exp "<=" Exp �Add -> Exp "+" Exp �
Example program: while (k <= n) {sum = sum + k; k = k+1;} �
Real world example: The Java Language Specifica6on
8
See http://docs.oracle.com/javase/specs/jls Take a look at Chapter 2 about the Java grammar defini6ons, and look at some
parts of the specifica6on.
Compila6onUnit: PackageDeclara6onopt ImportDeclara6onsopt TypeDeclara6onsopt
ImportDeclara6ons: ImportDeclara6on ImportDeclara6ons ImportDeclara6on
TypeDeclara6ons: TypeDeclara6on TypeDeclara6ons TypeDeclara6on
PackageDeclara6on: Annota6onsopt package PackageName ;
…
Parsing Use the grammar to derive a tree for a program:
9 sum = sum + k �
Stmt Example program: sum = sum + k �
Start symbol
Parse tree Use the grammar to derive a parse tree for a program:
10 sum = sum + k �
Stmt
AssignStmt
Exp
Add
Exp Exp
Example program: sum = sum + k �
Nonterminals are inner nodes
Start symbol
Terminals are leafs
Corresponding abstract syntax tree (will be discussed in Lecture 5)
11 sum = sum + k �
AssignStmt
Add
IdExp IdExp
Example program: sum = sum + k �
IdExp
Forms of CFGs:
12
EBNF: Stmt -> AssignStmt | CompoundStmt �AssignStmt -> ID "=" Exp �CompoundStmt -> "{" (Stmt ";")* "}"�Exp -> Add | ID �Add -> Exp "+" Exp �
Canonical form: Stmt -> ID "=" Exp �Stmt -> "{" Stmts "}"�Stmts -> ε Stmts -> Stmt ";" Stmts�Exp -> Exp "+" Exp �Exp -> ID �
Extended Backus-‐Naur Form: • Compact, easy to read and write • BNF has alterna6ves • EBNF also has repe66on, op6onals, parentheses
• Common nota6on for prac6cal use • EBNF used in JavaCC • BNF oben used in LR-‐parser generators
Canonical form: • Core formalism for CFGs • Replaces repe66on with recursion • Replaces alterna6ves, op6onals, parentheses with several produc6on rules
• Useful for proving proper6es and explaining algorithms
Formal defini6on of CFGs (canonical form)
13
A context-free grammar G = (N, T, P, S), where �N – the set of nonterminal symbols�T – the set of terminal symbols�P – the set of production rules, each with the form� X -> Y1 Y2 … Yn � where X ∈ N, and Yk ∈ N ∪ T �S – the start symbol (one of the nonterminals). I.e., S ∈ N �
So, the left-hand side X of a rule is a nonterminal.�
And the right-hand side Y1 Y2 … Yn is a sequence of nonterminals and terminals.�
If the rhs for a production is empty, i.e., n = 0, we write � X -> ε�
A grammar G defines a language L(G)
14
A context-free grammar G = (N, T, P, S), where �N – the set of nonterminal symbols�T – the set of terminal symbols�P – the set of production rules, each with the form� X -> Y1 Y2 … Yn � where X ∈ N, and Yk ∈ N ∪ T �S – the start symbol (one of the nonterminals). I.e., S ∈ N �
G defines a language L(G) over the alphabet T�
T* is the set of all possible sequences of T symbols.�
L(G) is the subset of T* that can be derived from the start symbol S, by following the production rules P.�
Exercise
15
G = (N, T, P, S) �
P = {� Stmt -> ID "=" Exp , � Stmt -> "{" Stmts "}" , � Stmts -> ε , Stmts -> Stmt ";" Stmts , � Exp -> Exp "+" Exp , � Exp -> ID �} �
N = { } �
T = { } �
S = �
L(G) = {�
} �
Solu6on
16
G = (N, T, P, S) �
P = {� Stmt -> ID "=" Exp , � Stmt -> "{" Stmts "}" , � Stmts -> ε , Stmts -> Stmt ";" Stmts , � Exp -> Exp "+" Exp , � Exp -> ID �} �
N = {Stmt, Exp, Stmts} �
T = {ID, "=", "{", "}", ";", "+"} �
S = Stmt �
L(G) = {� ID "=" ID, � ID "=" ID "+" ID, � ID "=" ID "+" ID "+" ID, � …�
"{" "}", � "{" ID "=" ID ";" "}", � "{" ID "=" ID ";" "{" "}" ";" "}", � …�
} �
The sequences in L(G) are usually called sentences or strings
Deriva6on step
17
If we have a sequence of terminals and nonterminals, e.g., �
X a Y Y b �
we can replace one of the nonterminals, applying a production rule. This is called a derivation step. �(Swedish: Härledningssteg)�
Suppose there is a production �
Y -> X a�
and we apply it for the first Y in the sequence. We write the derivation step as follows: �
X a Y Y b => X a X a Y b �
Deriva6on
18
A derivation, is simply a sequence of derivation steps, e.g.: �
γ1 => γ2 => … => γn (n ≥ 0) �
where each γi is a sequence of terminals and nonterminals�
If there is a derivation from γ1 to γn, we can write this as �
γ1 =>* γn �
So this means it is possible to get from the sequence γ1 to the sequence γn by following the production rules.�
Defini6on of the language L(G)
19
Recall that: �
G = (N, T, P, S) �
T* is the set of all possible sequences of T symbols.�
L(G) is the subset of T* that can be derived from the � start symbol S, by following the production rules P.�
Using the concept of derivations, we can formally define L(G) as follows: �
L(G) = { w ∈ T* | S =>* w } �
Exercise: Prove that a sentence belongs to a language
20
Prove that �
INT + INT * INT�
Proof (by showing all the derivation steps from the start symbol Exp):�
Exp �=>�
belongs to the language of the following grammar: �
p1:#Exp -> Exp "+" Exp �p2:#Exp -> Exp "*" Exp �p3:#Exp -> INT�
Solu6on: Prove that a sentence belongs to a language
21
Prove that �
INT + INT * INT�
Proof (by showing all the derivation steps from the start symbol Exp):�
Exp �=> Exp "+" Exp #(p1)�=> INT "+" Exp #(p3)�=> INT "+" Exp "*" Exp #(p2)�=> INT "+" INT "*" Exp #(p3)�=> INT "+" INT "*" INT #(p3)�
belongs to the language of the following grammar: �
p1:#Exp -> Exp "+" Exp �p2:#Exp -> Exp "*" Exp �p3:#Exp -> INT�
Lebmost and rightmost deriva6ons
22
In a leftmost derivation, the leftmost nonterminal is replaced in each derivation step, e.g.,: �
Exp =>�Exp "+" Exp =>�INT "+" Exp =>�INT "+" Exp "*" Exp =>�INT "+" INT "*" Exp =>�INT "+" INT "*" INT�
LL parsing algoritms use leftmost derivation.�LR parsing algorithms use rightmost derivation.�Will be discussed in later lectures.�
In a rightmost derivation, the rightmost nonterminal is replaced in each derivation step, e.g.,: �
Exp =>�Exp "+" Exp =>�Exp "+" Exp "*" Exp =>�Exp "+" Exp "*" INT =>�Exp "+" INT "*" INT =>�INT "+" INT "*" INT�
A deriva6on corresponds to building a parse tree
23
Grammar: � Exp -> Exp "+" Exp � Exp -> Exp "*" Exp � Exp -> INT�
Example derivation: �
Exp =>�Exp "+" Exp =>�INT "+" Exp =>�INT "+" Exp "*" Exp =>�INT "+" INT "*" Exp =>�INT "+" INT "*" INT�
Exercise: build the parse tree�(also called derivation tree).�
A deriva6on corresponds to building a parse tree
24
Grammar: � Exp -> Exp "+" Exp � Exp -> Exp "*" Exp � Exp -> INT�
Example derivation: �
Exp =>�Exp "+" Exp =>�INT "+" Exp =>�INT "+" Exp "*" Exp =>�INT "+" INT "*" Exp =>�INT "+" INT "*" INT�
Parse tree: �
Exp
Exp Exp
Exp Exp
"+"
INT "*"
INT INT
Exercise: Can we do another deriva6on of the same sentence,
that gives a different parse tree?
25
Grammar: � Exp -> Exp "+" Exp � Exp -> Exp "*" Exp � Exp -> INT�
Parse tree: �
Another derivation: �
Exp =>�
Solu6on: Can we do another deriva6on of the same sentence,
that gives a different parse tree?
26
Grammar: � Exp -> Exp "+" Exp � Exp -> Exp "*" Exp � Exp -> INT�
Parse tree: �
Exp
Exp "*"
INT
Exp
Exp Exp "+"
INT INT
Another derivation: �
Exp =>�Exp "*" Exp =>�Exp "+" Exp "*" Exp =>�INT "+" Exp "*" Exp =>�INT "+" INT "*" Exp =>�INT "+" INT "*" INT�
Which parse tree would we prefer?
Ambiguous context-‐free grammars
27
A CFG is ambiguous if a sentence in the language can be derived by two (or more) different parse trees.�
A CFG is unambiguous if each sentence in the language can be derived by only one syntax tree.�
(Swedish: tvetydig, otvetydig)�
How can we know if a CFG is ambiguous?
28
If we find an example of an ambiguity, we know the grammar is ambiguous.�
There are algorithms for deciding if a CFG belongs to certain subsets of CFGs, e.g. LL, LR, etc. (See later lectures.) These grammars are unambiguous.�
But in the general case, the problem is undecidable: it is not possible to construct a general algorithm that decides ambiguity for any CFG.�
Strategies for elimina6ng ambigui6es
29
We should try to eliminate ambiguities, without changing the language.�
First, decide which parse tree is the desired one.�
Create an equivalent unambiguous grammar from which only the desired parse trees can be derived.�
Or, use additional priority and associativity rules to instruct the parser to derive the "right" parse tree. (Works for some ambiguities.)�
Equivalent grammars
30
Two grammars, G1 and G2, are equivalent if they generate the same language.�
I.e., each sentence in one of the grammars can be derived also in the other grammar: �
L(G1) = L(G2)�
Example ambiguity: Priority (also called precedence)
31
Exp -> Exp "+" Exp � Exp -> Exp "*" Exp � Exp -> INT�
Exp
Exp Exp
Exp Exp
"+"
INT "*"
INT INT
Exp
Exp "*"
INT
Exp
Exp Exp "+"
INT INT
Two parse trees for INT "+" INT "*" INT�
prio("*") > prio("+")�(according to tradition)
prio("+") > prio("*")�(would be unexpected and confusing)
Example ambiguity: Associa<vity
32
Exp -> Exp "+" Exp � Exp -> Exp "-" Exp � Exp -> Exp "**" Exp � Exp -> INT�
Exp
Exp Exp
Exp Exp
"**"
INT "**"
INT INT
Exp
Exp "+"
INT
Exp
Exp Exp "-"
INT INT
For operators with the same priority, �how do several in a sequence associate?�
Right-associative�(usual for the power operator)
Left-associative�(usual for most operators)
Example ambiguity: Non-‐associa<vity
33
Exp -> Exp "<" Exp � Exp -> INT�
Exp
Exp "<"
INT
Exp
Exp Exp "<"
INT INT
For some operators, it does not make sense to have�several in a sequence at all. They are non-associative.�
We would like to forbid both trees.�I.e., rule out the sentence from the langauge.
Exp
Exp Exp
Exp Exp
"<"
INT "<"
INT INT
Disambigua6ng expression grammars
34
How can we change the grammar so that only the desired trees can be derived?�
Idea: Restrict certain subtrees by introducing new nonterminals.�
Priority: Introduce a new nonterminal for each priority level: Term, Factor, Primary, ...�
Left associativity: Restrict the right operand so it only can contain expressions of higher priority�
Right associativity: ...�
Exercise
35
Ambiguous grammar: �
Expr -> Expr "+" Expr�Expr -> Expr "*" Expr�Expr -> INT�Expr -> "(" Expr ")"�
Equivalent unambiguous grammar: �
Solu6on
36
Ambiguous grammar: �
Expr -> Expr "+" Expr�Expr -> Expr "*" Expr�Expr -> INT�Expr -> "(" Expr ")"�
Equivalent unambiguous grammar: �
Expr -> Expr "+" Term�Expr -> Term�Term -> Term "*" Factor�Term -> Factor�Factor -> INT�Factor -> "(" Expr ")"�
Here, we introduce a new nonterminal, Term, that is more restricted than Expr. That is, from Term, we can not derive any new addi6ons.
We use Term as the right operand in the addi6on produc6on, to make sure no new addi6ons will appear to the right. This give leb-‐associa6vity.
We use Term, and the even more restricted nonterminal Factor, in the mul6plica6on produc6on, to make sure no addi6ons can appear there, without using parentheses. This gives mul6plica6on higher priority than addi6on.
Wri6ng the grammar in different nota6ons
37
Canonical form: �
Expr -> Expr "+" Term�Expr -> Term�Term -> Term "*" Factor�Term -> Factor�Factor -> INT�Factor -> "(" Expr ")"�
Equivalent BNF (Backus-Naur Form): �
Equivalent EBNF (Extended BNF): �
Use alterna<ves instead of several produc6ons per nonterminal.
Use repe<<on instead of recursion, where possible.
Wri6ng the grammar in different nota6ons
38
Canonical form: �
Expr -> Expr "+" Term�Expr -> Term�Term -> Term "*" Factor�Term -> Factor�Factor -> INT�Factor -> "(" Expr ")"�
Equivalent BNF (Backus-Naur Form): �
Expr -> Expr "+" Term | Term�Term -> Term "*" Factor | Factor�Factor -> INT | "(" Expr ")" �
Equivalent EBNF (Extended BNF): �
Expr -> Term ("+" Term)* �Term -> Factor ("*" Factor)* �Factor -> INT | "(" Expr ")" �
Use alterna<ves instead of several produc6ons per nonterminal.
Use repe<<on instead of recursion, where possible.
Transla6ng EBNF to Canonical form
39
Top level repetition �X -> γ1 γ2* γ3 �
EBNF Equivalent canonical form
Top level alternative�X -> γ1 | γ2 �
Top level parentheses�X -> γ1 (...) γ2 �
Where γk is a sequence of terminals and nonterminals
Transla6ng EBNF to Canonical form
40
Top level repetition �X -> γ1 γ2* γ3 �
X -> γ1 N γ2�N -> ... �
EBNF Equivalent canonical form
X -> γ1 N γ3�N -> γ2 N �N -> ε�
Top level alternative�X -> γ1 | γ2 �
X -> γ1 �X -> γ2 �
Top level parentheses�X -> γ1 (...) γ2 �
Exercise: Translate from EBNF to Canonical form
41
EBNF: �
Expr -> Term ("+" Term)* �
Equivalent Canonical Form�
Solu6on: Translate from EBNF to Canonical form
42
EBNF: �
Expr -> Term ("+" Term)* �
Equivalent Canonical Form�
Expr -> Term N �N -> "+" Term N �N -> ε�
Can we show that these are equivalent?
43
EBNF: �
Expr -> Term ("+" Term)* �
Equivalent Canonical Form�
Expr -> Term N �N -> "+" Term N �N -> ε�
Alternative Equivalent Canonical Form�
Expr -> Expr "+" Term�Expr -> Term�
trivial
non-‐trivial
Example proof
44
1. We start with this:�Expr -> Term ("+" Term)* �
2. We can move the repetition: �Expr -> (Term "+")* Term�
3. Introduce a nonterminal N: �Expr -> N Term �N -> (Term "+")* �
5. Replace N Term by Expr in second production:�
Expr -> N Term �N -> Expr "+" �N -> ε�
6. Eliminate N: �
Expr -> Expr "+" Term �Expr -> Term�
4. Eliminate the repetition: �Expr -> N Term �N -> N Term "+" �N -> ε�
Equivalence of grammars
45
Given two context-free grammars, G1 and G2.�Are they equivalent?�
I.e., is L(G1) = L(G2)?�
Undecidable problem: �a general algorithm cannot be constructed.�
We need to rely on our ingenuity to find out.�(In the general case.)�
46
RE� CFG�Alphabet � characters� terminal symbols �
(tokens)�Language� strings �
(char sequences)�sentences�
(token sequences)�Used for� tokens� parse trees�
Power� iteration � recursion �Recognizer� DFA � NFA with stack �
Regular Expressions vs Context-‐Free Grammars
47
Grammar� Rule patterns� Type�regular� X -> aY or X -> a or X -> ε� 3�
context free� X -> γ � 2�context sensitive� α X β -> α γ β� 1 �
arbitrary� γ -> δ 0�
The Chomsky grammar hierarchy
Type(3) ⊂ Type (2) ⊂ Type(1) ⊂ Type(0)
Regular grammars have the same power as regular expressions (tail recursion = itera6on).
Type 2 and 3 are of prac6cal use in compiler construc6on. Type 0 and 1 are only of theore6cal interest.
Summary ques6ons
48
• Define a small example language with a CFG • What is a nonterminal symbol? A terminal symbol? A produc6on? A start symbol? A parse tree? • What is a leb-‐hand side of a produc6on? A right-‐hand side? • Given a grammar G, what is meant by the language L(G)? • What is a deriva6on step? A deriva6on? A lebmost deriva6on? • How does a deriva6on correspond to a parse tree? • When is a grammar ambiguous? Unambiguous? • How can ambigui6es for expressions be removed? • When are two grammars equivalent? • When should we use canonical form, and when EBNF? • Translate an EBNF grammar to canonical form. • Explain why context-‐free grammars are more powerful than regular expressions. • What is the Chomsky hierarchy?