Context-‐free grammars

49
EDA180: Compiler Construc6on Contextfree grammars Görel Hedin Revised: 20130128

Transcript of Context-‐free grammars

EDA180:  Compiler  Construc6on  Context-­‐free  grammars  

Görel  Hedin  Revised:  2013-­‐01-­‐28  

Lexical  analysis  (scanning)  

Syntac6c  analysis  (parsing)  

Seman6c  analysis  

Intermediate  code  genera6on  

Op6miza6on  

Machine  code  genera6on  

source  code  

machine  code  

tokens  

AST   APributed  AST  

intermediate  code  

intermediate  code  

Compiler  phases  and  program  representa6ons  

Analysis   Synthesis  

2  

tokens  

AST  

A  closer  look  at  the  parser  

Pure  parsing  

AST  building  

concrete  parse  tree      (implicit)  

Parser  

3  

Scanner  text  

regular  expressions  

context-­‐free  grammar  

abstract  grammar  

Defined  by:  

The  concrete  parse  tree  is  never  constructed.  The  parser  builds  the  AST  directly  using  seman<c  ac<ons.    

This  lecture  

Regular  Expressions  vs  Context-­‐Free  Grammars  

4  

An  RE  can  have  itera<on  

A  CFG  can  also  have  recursion  (it  is  possible  to  derive  a  symbol,  e.g.,  Stmt,  from  itself)  

Example  RE:  ID = [a-z][a-z0-9]* �

Example  CFG:  Stmt -> WhileStmt | AssignStmt | …�WhileStmt -> WHILE LPAR Exp RPAR Stmt �…�

Elements  of  a  Context-­‐Free  Grammar  

5  

Nonterminal  symbols  

Terminal  symbols  (tokens)  

Produc<on  rules:  N -> s1 s2 … sn �

where  sk  is  a  symbol  (terminal  or  nonterminal)  

Start  symbol  (one  of  the  nonterminals,  usually  the  first  one)  

Example  CFG:  Stmt -> WhileStmt | AssignStmt | …�WhileStmt -> WHILE LPAR Exp RPAR Stmt �…�

Exercise  Construct  a  grammar  covering  the  following  program:  

6  

CFG:  Stmt -> WhileStmt | AssignStmt | CompoundStmt �WhileStmt -> "while" "(" Exp ")" Stmt �AssignStmt -> ID "=" Exp �CompoundStmt ->�Exp ->�LessEq ->�Add ->�

Example  program:  while (k <= n) {sum = sum + k; k = k+1;} �

Usually,  simple  tokens  are  wriPen  directly  as  text  strings  

Solu6on  Construct  a  grammar  covering  the  following  program:  

7  

CFG:  Stmt -> WhileStmt | AssignStmt | CompoundStmt �WhileStmt -> "while" "(" Exp ")" Stmt �AssignStmt -> ID "=" Exp �CompoundStmt -> "{" (Stmt ";")* "}"�Exp -> LessEq | Add | ID �LessEq -> Exp "<=" Exp �Add -> Exp "+" Exp �

Example  program:  while (k <= n) {sum = sum + k; k = k+1;} �

Real  world  example:  The  Java  Language  Specifica6on  

8  

See  http://docs.oracle.com/javase/specs/jls  Take  a  look  at  Chapter  2  about  the  Java  grammar  defini6ons,  and  look  at  some  

parts  of  the  specifica6on.  

Compila6onUnit:          PackageDeclara6onopt  ImportDeclara6onsopt  TypeDeclara6onsopt  

ImportDeclara6ons:          ImportDeclara6on          ImportDeclara6ons  ImportDeclara6on  

TypeDeclara6ons:          TypeDeclara6on          TypeDeclara6ons  TypeDeclara6on  

PackageDeclara6on:          Annota6onsopt  package  PackageName  ;  

…  

Parsing  Use  the  grammar  to  derive  a  tree  for  a  program:  

9  sum = sum + k �

Stmt  Example  program:  sum = sum + k �

Start  symbol  

Parse  tree  Use  the  grammar  to  derive  a  parse  tree  for  a  program:  

10  sum = sum + k �

Stmt  

AssignStmt  

Exp  

Add  

Exp   Exp  

Example  program:  sum = sum + k �

Nonterminals  are  inner  nodes  

Start  symbol  

Terminals  are  leafs  

Corresponding  abstract  syntax  tree  (will  be  discussed  in  Lecture  5)    

11  sum = sum + k �

AssignStmt  

Add  

IdExp   IdExp  

Example  program:  sum = sum + k �

IdExp  

Forms  of  CFGs:  

12  

EBNF:  Stmt -> AssignStmt | CompoundStmt �AssignStmt -> ID "=" Exp �CompoundStmt -> "{" (Stmt ";")* "}"�Exp -> Add | ID �Add -> Exp "+" Exp �

Canonical  form:  Stmt -> ID "=" Exp �Stmt -> "{" Stmts "}"�Stmts -> ε Stmts -> Stmt ";" Stmts�Exp -> Exp "+" Exp �Exp -> ID �

Extended  Backus-­‐Naur  Form:  •  Compact,  easy  to  read  and  write  •  BNF  has  alterna6ves  •  EBNF  also  has  repe66on,  op6onals,  parentheses  

•  Common  nota6on  for  prac6cal  use  •  EBNF  used  in  JavaCC  •  BNF  oben  used  in  LR-­‐parser  generators  

Canonical  form:  •  Core  formalism  for  CFGs  •  Replaces  repe66on  with  recursion  •  Replaces  alterna6ves,  op6onals,  parentheses  with  several  produc6on  rules  

•  Useful  for  proving  proper6es  and  explaining  algorithms  

Formal  defini6on  of  CFGs  (canonical  form)  

13  

A context-free grammar G = (N, T, P, S), where �N – the set of nonterminal symbols�T – the set of terminal symbols�P – the set of production rules, each with the form� X -> Y1 Y2 … Yn � where X ∈ N, and Yk ∈ N ∪ T �S – the start symbol (one of the nonterminals). I.e., S ∈ N �

So, the left-hand side X of a rule is a nonterminal.�

And the right-hand side Y1 Y2 … Yn is a sequence of nonterminals and terminals.�

If the rhs for a production is empty, i.e., n = 0, we write � X -> ε�

A  grammar  G  defines  a  language  L(G)  

14  

A context-free grammar G = (N, T, P, S), where �N – the set of nonterminal symbols�T – the set of terminal symbols�P – the set of production rules, each with the form� X -> Y1 Y2 … Yn � where X ∈ N, and Yk ∈ N ∪ T �S – the start symbol (one of the nonterminals). I.e., S ∈ N �

G defines a language L(G) over the alphabet T�

T* is the set of all possible sequences of T symbols.�

L(G) is the subset of T* that can be derived from the start symbol S, by following the production rules P.�

Exercise  

15  

G = (N, T, P, S) �

P = {� Stmt -> ID "=" Exp , � Stmt -> "{" Stmts "}" , � Stmts -> ε , Stmts -> Stmt ";" Stmts , � Exp -> Exp "+" Exp , � Exp -> ID �} �

N = { } �

T = { } �

S = �

L(G) = {�

} �

Solu6on  

16  

G = (N, T, P, S) �

P = {� Stmt -> ID "=" Exp , � Stmt -> "{" Stmts "}" , � Stmts -> ε , Stmts -> Stmt ";" Stmts , � Exp -> Exp "+" Exp , � Exp -> ID �} �

N = {Stmt, Exp, Stmts} �

T = {ID, "=", "{", "}", ";", "+"} �

S = Stmt �

L(G) = {� ID "=" ID, � ID "=" ID "+" ID, � ID "=" ID "+" ID "+" ID, � …�

"{" "}", � "{" ID "=" ID ";" "}", � "{" ID "=" ID ";" "{" "}" ";" "}", � …�

} �

The  sequences  in  L(G)  are  usually  called  sentences  or  strings  

Deriva6on  step  

17  

If we have a sequence of terminals and nonterminals, e.g., �

X a Y Y b �

we can replace one of the nonterminals, applying a production rule. This is called a derivation step. �(Swedish: Härledningssteg)�

Suppose there is a production �

Y -> X a�

and we apply it for the first Y in the sequence. We write the derivation step as follows: �

X a Y Y b => X a X a Y b �

Deriva6on  

18  

A derivation, is simply a sequence of derivation steps, e.g.: �

γ1 => γ2 => … => γn (n ≥ 0) �

where each γi is a sequence of terminals and nonterminals�

If there is a derivation from γ1 to γn, we can write this as �

γ1 =>* γn �

So this means it is possible to get from the sequence γ1 to the sequence γn by following the production rules.�

Defini6on  of  the  language  L(G)  

19  

Recall that: �

G = (N, T, P, S) �

T* is the set of all possible sequences of T symbols.�

L(G) is the subset of T* that can be derived from the � start symbol S, by following the production rules P.�

Using the concept of derivations, we can formally define L(G) as follows: �

L(G) = { w ∈ T* | S =>* w } �

Exercise:  Prove  that  a  sentence  belongs  to  a  language  

20  

Prove that �

INT + INT * INT�

Proof (by showing all the derivation steps from the start symbol Exp):�

Exp �=>�

belongs to the language of the following grammar: �

p1:#Exp -> Exp "+" Exp �p2:#Exp -> Exp "*" Exp �p3:#Exp -> INT�

Solu6on:  Prove  that  a  sentence  belongs  to  a  language  

21  

Prove that �

INT + INT * INT�

Proof (by showing all the derivation steps from the start symbol Exp):�

Exp �=> Exp "+" Exp #(p1)�=> INT "+" Exp #(p3)�=> INT "+" Exp "*" Exp #(p2)�=> INT "+" INT "*" Exp #(p3)�=> INT "+" INT "*" INT #(p3)�

belongs to the language of the following grammar: �

p1:#Exp -> Exp "+" Exp �p2:#Exp -> Exp "*" Exp �p3:#Exp -> INT�

Lebmost  and  rightmost  deriva6ons  

22  

In a leftmost derivation, the leftmost nonterminal is replaced in each derivation step, e.g.,: �

Exp =>�Exp "+" Exp =>�INT "+" Exp =>�INT "+" Exp "*" Exp =>�INT "+" INT "*" Exp =>�INT "+" INT "*" INT�

LL parsing algoritms use leftmost derivation.�LR parsing algorithms use rightmost derivation.�Will be discussed in later lectures.�

In a rightmost derivation, the rightmost nonterminal is replaced in each derivation step, e.g.,: �

Exp =>�Exp "+" Exp =>�Exp "+" Exp "*" Exp =>�Exp "+" Exp "*" INT =>�Exp "+" INT "*" INT =>�INT "+" INT "*" INT�

A  deriva6on  corresponds  to  building  a  parse  tree  

23  

Grammar: � Exp -> Exp "+" Exp � Exp -> Exp "*" Exp � Exp -> INT�

Example derivation: �

Exp =>�Exp "+" Exp =>�INT "+" Exp =>�INT "+" Exp "*" Exp =>�INT "+" INT "*" Exp =>�INT "+" INT "*" INT�

Exercise: build the parse tree�(also called derivation tree).�

A  deriva6on  corresponds  to  building  a  parse  tree  

24  

Grammar: � Exp -> Exp "+" Exp � Exp -> Exp "*" Exp � Exp -> INT�

Example derivation: �

Exp =>�Exp "+" Exp =>�INT "+" Exp =>�INT "+" Exp "*" Exp =>�INT "+" INT "*" Exp =>�INT "+" INT "*" INT�

Parse tree: �

Exp  

Exp   Exp  

Exp   Exp  

"+"  

INT  "*"  

INT   INT  

Exercise:  Can  we  do  another  deriva6on  of  the  same  sentence,  

that  gives  a  different  parse  tree?  

25  

Grammar: � Exp -> Exp "+" Exp � Exp -> Exp "*" Exp � Exp -> INT�

Parse tree: �

Another derivation: �

Exp =>�

Solu6on:  Can  we  do  another  deriva6on  of  the  same  sentence,  

that  gives  a  different  parse  tree?  

26  

Grammar: � Exp -> Exp "+" Exp � Exp -> Exp "*" Exp � Exp -> INT�

Parse tree: �

Exp  

Exp  "*"  

INT  

Exp  

Exp   Exp  "+"  

INT   INT  

Another derivation: �

Exp =>�Exp "*" Exp =>�Exp "+" Exp "*" Exp =>�INT "+" Exp "*" Exp =>�INT "+" INT "*" Exp =>�INT "+" INT "*" INT�

Which  parse  tree  would  we  prefer?  

Ambiguous  context-­‐free  grammars  

27  

A CFG is ambiguous if a sentence in the language can be derived by two (or more) different parse trees.�

A CFG is unambiguous if each sentence in the language can be derived by only one syntax tree.�

(Swedish: tvetydig, otvetydig)�

How  can  we  know  if  a  CFG  is  ambiguous?  

28  

If we find an example of an ambiguity, we know the grammar is ambiguous.�

There are algorithms for deciding if a CFG belongs to certain subsets of CFGs, e.g. LL, LR, etc. (See later lectures.) These grammars are unambiguous.�

But in the general case, the problem is undecidable: it is not possible to construct a general algorithm that decides ambiguity for any CFG.�

Strategies  for  elimina6ng  ambigui6es  

29  

We should try to eliminate ambiguities, without changing the language.�

First, decide which parse tree is the desired one.�

Create an equivalent unambiguous grammar from which only the desired parse trees can be derived.�

Or, use additional priority and associativity rules to instruct the parser to derive the "right" parse tree. (Works for some ambiguities.)�

Equivalent  grammars  

30  

Two grammars, G1 and G2, are equivalent if they generate the same language.�

I.e., each sentence in one of the grammars can be derived also in the other grammar: �

L(G1) = L(G2)�

Example  ambiguity:  Priority  (also  called  precedence)  

31  

Exp -> Exp "+" Exp � Exp -> Exp "*" Exp � Exp -> INT�

Exp  

Exp   Exp  

Exp   Exp  

"+"  

INT  "*"  

INT   INT  

Exp  

Exp  "*"  

INT  

Exp  

Exp   Exp  "+"  

INT   INT  

Two parse trees for INT "+" INT "*" INT�

prio("*") > prio("+")�(according to tradition)  

prio("+") > prio("*")�(would be unexpected and confusing)  

Example  ambiguity:  Associa<vity  

32  

Exp -> Exp "+" Exp � Exp -> Exp "-" Exp � Exp -> Exp "**" Exp � Exp -> INT�

Exp  

Exp   Exp  

Exp   Exp  

"**"  

INT  "**"  

INT   INT  

Exp  

Exp  "+"  

INT  

Exp  

Exp   Exp  "-"  

INT   INT  

For operators with the same priority, �how do several in a sequence associate?�

Right-associative�(usual for the power operator)  

Left-associative�(usual for most operators)  

Example  ambiguity:  Non-­‐associa<vity  

33  

Exp -> Exp "<" Exp � Exp -> INT�

Exp  

Exp  "<"  

INT  

Exp  

Exp   Exp  "<"  

INT   INT  

For some operators, it does not make sense to have�several in a sequence at all. They are non-associative.�

We would like to forbid both trees.�I.e., rule out the sentence from the langauge.  

Exp  

Exp   Exp  

Exp   Exp  

"<"  

INT  "<"  

INT   INT  

Disambigua6ng  expression  grammars  

34  

How can we change the grammar so that only the desired trees can be derived?�

Idea: Restrict certain subtrees by introducing new nonterminals.�

Priority: Introduce a new nonterminal for each priority level: Term, Factor, Primary, ...�

Left associativity: Restrict the right operand so it only can contain expressions of higher priority�

Right associativity: ...�

Exercise  

35  

Ambiguous grammar: �

Expr -> Expr "+" Expr�Expr -> Expr "*" Expr�Expr -> INT�Expr -> "(" Expr ")"�

Equivalent unambiguous grammar: �

Solu6on  

36  

Ambiguous grammar: �

Expr -> Expr "+" Expr�Expr -> Expr "*" Expr�Expr -> INT�Expr -> "(" Expr ")"�

Equivalent unambiguous grammar: �

Expr -> Expr "+" Term�Expr -> Term�Term -> Term "*" Factor�Term -> Factor�Factor -> INT�Factor -> "(" Expr ")"�

Here,  we  introduce  a  new  nonterminal,  Term,  that  is  more  restricted  than  Expr.  That  is,  from  Term,  we  can  not  derive  any  new  addi6ons.  

We  use  Term  as  the  right  operand  in  the  addi6on  produc6on,  to  make  sure  no  new  addi6ons  will  appear  to  the  right.  This  give  leb-­‐associa6vity.  

We  use  Term,  and  the  even  more  restricted  nonterminal  Factor,  in  the  mul6plica6on  produc6on,  to  make  sure  no  addi6ons  can  appear  there,  without  using  parentheses.  This  gives  mul6plica6on  higher  priority  than  addi6on.    

Wri6ng  the  grammar  in  different  nota6ons  

37  

Canonical form: �

Expr -> Expr "+" Term�Expr -> Term�Term -> Term "*" Factor�Term -> Factor�Factor -> INT�Factor -> "(" Expr ")"�

Equivalent BNF (Backus-Naur Form): �

Equivalent EBNF (Extended BNF): �

Use  alterna<ves  instead  of  several  produc6ons  per  nonterminal.  

Use  repe<<on  instead  of  recursion,  where  possible.  

Wri6ng  the  grammar  in  different  nota6ons  

38  

Canonical form: �

Expr -> Expr "+" Term�Expr -> Term�Term -> Term "*" Factor�Term -> Factor�Factor -> INT�Factor -> "(" Expr ")"�

Equivalent BNF (Backus-Naur Form): �

Expr -> Expr "+" Term | Term�Term -> Term "*" Factor | Factor�Factor -> INT | "(" Expr ")" �

Equivalent EBNF (Extended BNF): �

Expr -> Term ("+" Term)* �Term -> Factor ("*" Factor)* �Factor -> INT | "(" Expr ")" �

Use  alterna<ves  instead  of  several  produc6ons  per  nonterminal.  

Use  repe<<on  instead  of  recursion,  where  possible.  

Transla6ng  EBNF  to  Canonical  form  

39  

Top level repetition �X -> γ1 γ2* γ3 �

EBNF   Equivalent  canonical  form  

Top level alternative�X -> γ1 | γ2 �

Top level parentheses�X -> γ1 (...) γ2 �

Where  γk  is  a  sequence  of  terminals  and  nonterminals  

Transla6ng  EBNF  to  Canonical  form  

40  

Top level repetition �X -> γ1 γ2* γ3 �

X -> γ1 N γ2�N -> ... �

EBNF   Equivalent  canonical  form  

X -> γ1 N γ3�N -> γ2 N �N -> ε�

Top level alternative�X -> γ1 | γ2 �

X -> γ1 �X -> γ2 �

Top level parentheses�X -> γ1 (...) γ2 �

Exercise:  Translate  from  EBNF  to  Canonical  form  

41  

EBNF: �

Expr -> Term ("+" Term)* �

Equivalent Canonical Form�

Solu6on:  Translate  from  EBNF  to  Canonical  form  

42  

EBNF: �

Expr -> Term ("+" Term)* �

Equivalent Canonical Form�

Expr -> Term N �N -> "+" Term N �N -> ε�

Can  we  show  that  these  are  equivalent?  

43  

EBNF: �

Expr -> Term ("+" Term)* �

Equivalent Canonical Form�

Expr -> Term N �N -> "+" Term N �N -> ε�

Alternative Equivalent Canonical Form�

Expr -> Expr "+" Term�Expr -> Term�

trivial  

non-­‐trivial  

Example  proof  

44  

1. We start with this:�Expr -> Term ("+" Term)* �

2. We can move the repetition: �Expr -> (Term "+")* Term�

3. Introduce a nonterminal N: �Expr -> N Term �N -> (Term "+")* �

5. Replace N Term by Expr in second production:�

Expr -> N Term �N -> Expr "+" �N -> ε�

6. Eliminate N: �

Expr -> Expr "+" Term �Expr -> Term�

4. Eliminate the repetition: �Expr -> N Term �N -> N Term "+" �N -> ε�

Equivalence  of  grammars  

45  

Given two context-free grammars, G1 and G2.�Are they equivalent?�

I.e., is L(G1) = L(G2)?�

Undecidable problem: �a general algorithm cannot be constructed.�

We need to rely on our ingenuity to find out.�(In the general case.)�

46  

RE� CFG�Alphabet � characters� terminal symbols �

(tokens)�Language� strings �

(char sequences)�sentences�

(token sequences)�Used for� tokens� parse trees�

Power� iteration � recursion �Recognizer� DFA � NFA with stack �

Regular  Expressions  vs  Context-­‐Free  Grammars  

47  

Grammar� Rule patterns� Type�regular� X -> aY or X -> a or X -> ε� 3�

context free� X -> γ � 2�context sensitive� α X β -> α γ β� 1 �

arbitrary� γ -> δ 0�

The  Chomsky  grammar  hierarchy  

Type(3)  ⊂  Type  (2)  ⊂  Type(1)  ⊂  Type(0)  

Regular  grammars  have  the  same  power  as  regular  expressions  (tail  recursion  =  itera6on).  

Type  2  and  3  are  of  prac6cal  use  in  compiler  construc6on.  Type  0  and  1  are  only  of  theore6cal  interest.  

Summary  ques6ons  

48  

•  Define  a  small  example  language  with  a  CFG  • What  is  a  nonterminal  symbol?  A  terminal  symbol?  A  produc6on?  A  start  symbol?  A  parse  tree?  • What  is  a  leb-­‐hand  side  of  a  produc6on?  A  right-­‐hand  side?  •  Given  a  grammar  G,  what  is  meant  by  the  language  L(G)?  • What  is  a  deriva6on  step?  A  deriva6on?  A  lebmost  deriva6on?  •  How  does  a  deriva6on  correspond  to  a  parse  tree?  • When  is  a  grammar  ambiguous?  Unambiguous?  •  How  can  ambigui6es  for  expressions  be  removed?  • When  are  two  grammars  equivalent?  • When  should  we  use  canonical  form,  and  when  EBNF?  •  Translate  an  EBNF  grammar  to  canonical  form.  •  Explain  why  context-­‐free  grammars  are  more  powerful  than  regular  expressions.  • What  is  the  Chomsky  hierarchy?  

Readings  

49  

• F3:  Context  free  grammars,  deriva6ons,  ambiguity,  EBNF  Appel,  chapter  3-­‐3.1.  Appel,  relevant  exercises:  3.1-­‐3.3  Try  solve  the  problems  in  Seminar  2.  

• F4:  Predic6ve  parsing.  Recursive  descent.  LL  grammars  and  parsing.  Leb  recursion  and  factoriza6on.  Appel,  chapter  3.2