Context-‐free grammars

EDA180: Compiler Construc6on Context-‐free grammars

Görel Hedin Revised: 2013-‐01-‐28

Lexical analysis (scanning)

Syntac6c analysis (parsing)

Seman6c analysis

Intermediate code genera6on

Op6miza6on

Machine code genera6on

source code

machine code

tokens

AST APributed AST

intermediate code

intermediate code

Compiler phases and program representa6ons

Analysis Synthesis

2

tokens

AST

A closer look at the parser

Pure parsing

AST building

concrete parse tree (implicit)

Parser

3

Scanner text

regular expressions

context-‐free grammar

abstract grammar

Defined by:

The concrete parse tree is never constructed. The parser builds the AST directly using seman<c ac<ons.

This lecture

Regular Expressions vs Context-‐Free Grammars

4

An RE can have itera<on

A CFG can also have recursion (it is possible to derive a symbol, e.g., Stmt, from itself)

Example RE: ID = [a-z][a-z0-9]* �

Example CFG: Stmt -> WhileStmt | AssignStmt | …�WhileStmt -> WHILE LPAR Exp RPAR Stmt �…�

Elements of a Context-‐Free Grammar

5

Nonterminal symbols

Terminal symbols (tokens)

Produc<on rules: N -> s1 s2 … sn �

where sk is a symbol (terminal or nonterminal)

Start symbol (one of the nonterminals, usually the first one)

Example CFG: Stmt -> WhileStmt | AssignStmt | …�WhileStmt -> WHILE LPAR Exp RPAR Stmt �…�

Exercise Construct a grammar covering the following program:

6

CFG: Stmt -> WhileStmt | AssignStmt | CompoundStmt �WhileStmt -> "while" "(" Exp ")" Stmt �AssignStmt -> ID "=" Exp �CompoundStmt ->�Exp ->�LessEq ->�Add ->�

Example program: while (k <= n) {sum = sum + k; k = k+1;} �

Usually, simple tokens are wriPen directly as text strings

Solu6on Construct a grammar covering the following program:

7

CFG: Stmt -> WhileStmt | AssignStmt | CompoundStmt �WhileStmt -> "while" "(" Exp ")" Stmt �AssignStmt -> ID "=" Exp �CompoundStmt -> "{" (Stmt ";")* "}"�Exp -> LessEq | Add | ID �LessEq -> Exp "<=" Exp �Add -> Exp "+" Exp �

Example program: while (k <= n) {sum = sum + k; k = k+1;} �

Real world example: The Java Language Specifica6on

8

See http://docs.oracle.com/javase/specs/jls Take a look at Chapter 2 about the Java grammar defini6ons, and look at some

parts of the specifica6on.

Compila6onUnit: PackageDeclara6onopt ImportDeclara6onsopt TypeDeclara6onsopt

ImportDeclara6ons: ImportDeclara6on ImportDeclara6ons ImportDeclara6on

TypeDeclara6ons: TypeDeclara6on TypeDeclara6ons TypeDeclara6on

PackageDeclara6on: Annota6onsopt package PackageName ;

…

Parsing Use the grammar to derive a tree for a program:

9 sum = sum + k �

Stmt Example program: sum = sum + k �

Start symbol

Parse tree Use the grammar to derive a parse tree for a program:

10 sum = sum + k �

Stmt

AssignStmt

Exp

Add

Exp Exp

Example program: sum = sum + k �

Nonterminals are inner nodes

Start symbol

Terminals are leafs

Corresponding abstract syntax tree (will be discussed in Lecture 5)

11 sum = sum + k �

AssignStmt

Add

IdExp IdExp

Example program: sum = sum + k �

IdExp

Forms of CFGs:

12

EBNF: Stmt -> AssignStmt | CompoundStmt �AssignStmt -> ID "=" Exp �CompoundStmt -> "{" (Stmt ";")* "}"�Exp -> Add | ID �Add -> Exp "+" Exp �

Canonical form: Stmt -> ID "=" Exp �Stmt -> "{" Stmts "}"�Stmts -> ε Stmts -> Stmt ";" Stmts�Exp -> Exp "+" Exp �Exp -> ID �

Extended Backus-‐Naur Form: •  Compact, easy to read and write •  BNF has alterna6ves •  EBNF also has repe66on, op6onals, parentheses

•  Common nota6on for prac6cal use •  EBNF used in JavaCC •  BNF oben used in LR-‐parser generators

Canonical form: •  Core formalism for CFGs •  Replaces repe66on with recursion •  Replaces alterna6ves, op6onals, parentheses with several produc6on rules

•  Useful for proving proper6es and explaining algorithms

Formal defini6on of CFGs (canonical form)

13

A context-free grammar G = (N, T, P, S), where �N – the set of nonterminal symbols�T – the set of terminal symbols�P – the set of production rules, each with the form� X -> Y1 Y2 … Yn � where X ∈ N, and Yk ∈ N ∪ T �S – the start symbol (one of the nonterminals). I.e., S ∈ N �

So, the left-hand side X of a rule is a nonterminal.�

And the right-hand side Y1 Y2 … Yn is a sequence of nonterminals and terminals.�

If the rhs for a production is empty, i.e., n = 0, we write � X -> ε�

A grammar G defines a language L(G)

14

A context-free grammar G = (N, T, P, S), where �N – the set of nonterminal symbols�T – the set of terminal symbols�P – the set of production rules, each with the form� X -> Y1 Y2 … Yn � where X ∈ N, and Yk ∈ N ∪ T �S – the start symbol (one of the nonterminals). I.e., S ∈ N �

G defines a language L(G) over the alphabet T�

T* is the set of all possible sequences of T symbols.�

L(G) is the subset of T* that can be derived from the start symbol S, by following the production rules P.�

Exercise

15

G = (N, T, P, S) �

P = {� Stmt -> ID "=" Exp , � Stmt -> "{" Stmts "}" , � Stmts -> ε , Stmts -> Stmt ";" Stmts , � Exp -> Exp "+" Exp , � Exp -> ID �} �

N = { } �

T = { } �

S = �

L(G) = {�

} �

Solu6on

16

G = (N, T, P, S) �

P = {� Stmt -> ID "=" Exp , � Stmt -> "{" Stmts "}" , � Stmts -> ε , Stmts -> Stmt ";" Stmts , � Exp -> Exp "+" Exp , � Exp -> ID �} �

N = {Stmt, Exp, Stmts} �

T = {ID, "=", "{", "}", ";", "+"} �

S = Stmt �

L(G) = {� ID "=" ID, � ID "=" ID "+" ID, � ID "=" ID "+" ID "+" ID, � …�

"{" "}", � "{" ID "=" ID ";" "}", � "{" ID "=" ID ";" "{" "}" ";" "}", � …�

} �

The sequences in L(G) are usually called sentences or strings

Deriva6on step

17

If we have a sequence of terminals and nonterminals, e.g., �

X a Y Y b �

we can replace one of the nonterminals, applying a production rule. This is called a derivation step. �(Swedish: Härledningssteg)�

Suppose there is a production �

Y -> X a�

and we apply it for the first Y in the sequence. We write the derivation step as follows: �

X a Y Y b => X a X a Y b �

Deriva6on

18

A derivation, is simply a sequence of derivation steps, e.g.: �

γ1 => γ2 => … => γn (n ≥ 0) �

where each γi is a sequence of terminals and nonterminals�

If there is a derivation from γ1 to γn, we can write this as �

γ1 =>* γn �

So this means it is possible to get from the sequence γ1 to the sequence γn by following the production rules.�

Defini6on of the language L(G)

19

Recall that: �

G = (N, T, P, S) �

T* is the set of all possible sequences of T symbols.�

L(G) is the subset of T* that can be derived from the � start symbol S, by following the production rules P.�

Using the concept of derivations, we can formally define L(G) as follows: �

L(G) = { w ∈ T* | S =>* w } �

Exercise: Prove that a sentence belongs to a language

20

Prove that �

INT + INT * INT�

Proof (by showing all the derivation steps from the start symbol Exp):�

Exp �=>�

belongs to the language of the following grammar: �

p1:#Exp -> Exp "+" Exp �p2:#Exp -> Exp "*" Exp �p3:#Exp -> INT�

Solu6on: Prove that a sentence belongs to a language

21

Prove that �

INT + INT * INT�

Proof (by showing all the derivation steps from the start symbol Exp):�

Exp �=> Exp "+" Exp #(p1)�=> INT "+" Exp #(p3)�=> INT "+" Exp "*" Exp #(p2)�=> INT "+" INT "*" Exp #(p3)�=> INT "+" INT "*" INT #(p3)�

belongs to the language of the following grammar: �

p1:#Exp -> Exp "+" Exp �p2:#Exp -> Exp "*" Exp �p3:#Exp -> INT�

Lebmost and rightmost deriva6ons

22

In a leftmost derivation, the leftmost nonterminal is replaced in each derivation step, e.g.,: �

Exp =>�Exp "+" Exp =>�INT "+" Exp =>�INT "+" Exp "*" Exp =>�INT "+" INT "*" Exp =>�INT "+" INT "*" INT�

LL parsing algoritms use leftmost derivation.�LR parsing algorithms use rightmost derivation.�Will be discussed in later lectures.�

In a rightmost derivation, the rightmost nonterminal is replaced in each derivation step, e.g.,: �

Exp =>�Exp "+" Exp =>�Exp "+" Exp "*" Exp =>�Exp "+" Exp "*" INT =>�Exp "+" INT "*" INT =>�INT "+" INT "*" INT�

A deriva6on corresponds to building a parse tree

23

Grammar: � Exp -> Exp "+" Exp � Exp -> Exp "*" Exp � Exp -> INT�

Example derivation: �


Exercise: build the parse tree�(also called derivation tree).�

A deriva6on corresponds to building a parse tree

24


Example derivation: �


Parse tree: �

Exp

Exp Exp

Exp Exp

"+"

INT "*"

INT INT

Exercise: Can we do another deriva6on of the same sentence,

that gives a different parse tree?

25


Parse tree: �

Another derivation: �

Exp =>�

Solu6on: Can we do another deriva6on of the same sentence,

that gives a different parse tree?

26


Parse tree: �

Exp

Exp "*"

INT

Exp

Exp Exp "+"

INT INT

Another derivation: �

Exp =>�Exp "*" Exp =>�Exp "+" Exp "*" Exp =>�INT "+" Exp "*" Exp =>�INT "+" INT "*" Exp =>�INT "+" INT "*" INT�

Which parse tree would we prefer?

Ambiguous context-‐free grammars

27

A CFG is ambiguous if a sentence in the language can be derived by two (or more) different parse trees.�

A CFG is unambiguous if each sentence in the language can be derived by only one syntax tree.�

(Swedish: tvetydig, otvetydig)�

How can we know if a CFG is ambiguous?

28

If we find an example of an ambiguity, we know the grammar is ambiguous.�

There are algorithms for deciding if a CFG belongs to certain subsets of CFGs, e.g. LL, LR, etc. (See later lectures.) These grammars are unambiguous.�

But in the general case, the problem is undecidable: it is not possible to construct a general algorithm that decides ambiguity for any CFG.�

Strategies for elimina6ng ambigui6es

29

We should try to eliminate ambiguities, without changing the language.�

First, decide which parse tree is the desired one.�

Create an equivalent unambiguous grammar from which only the desired parse trees can be derived.�

Or, use additional priority and associativity rules to instruct the parser to derive the "right" parse tree. (Works for some ambiguities.)�

Equivalent grammars

30

Two grammars, G1 and G2, are equivalent if they generate the same language.�

I.e., each sentence in one of the grammars can be derived also in the other grammar: �

L(G1) = L(G2)�

Example ambiguity: Priority (also called precedence)

31

Exp -> Exp "+" Exp � Exp -> Exp "*" Exp � Exp -> INT�

Exp

Exp Exp

Exp Exp

"+"

INT "*"

INT INT

Exp

Exp "*"

INT

Exp

Exp Exp "+"

INT INT

Two parse trees for INT "+" INT "*" INT�

prio("*") > prio("+")�(according to tradition)

prio("+") > prio("*")�(would be unexpected and confusing)

Example ambiguity: Associa<vity

32

Exp -> Exp "+" Exp � Exp -> Exp "-" Exp � Exp -> Exp "**" Exp � Exp -> INT�

Exp

Exp Exp

Exp Exp

"**"

INT "**"

INT INT

Exp

Exp "+"

INT

Exp

Exp Exp "-"

INT INT

For operators with the same priority, �how do several in a sequence associate?�

Right-associative�(usual for the power operator)

Left-associative�(usual for most operators)

Example ambiguity: Non-‐associa<vity

33

Exp -> Exp "<" Exp � Exp -> INT�

Exp

Exp "<"

INT

Exp

Exp Exp "<"

INT INT

For some operators, it does not make sense to have�several in a sequence at all. They are non-associative.�

We would like to forbid both trees.�I.e., rule out the sentence from the langauge.

Exp

Exp Exp

Exp Exp

"<"

INT "<"

INT INT

Disambigua6ng expression grammars

34

How can we change the grammar so that only the desired trees can be derived?�

Idea: Restrict certain subtrees by introducing new nonterminals.�

Priority: Introduce a new nonterminal for each priority level: Term, Factor, Primary, ...�

Left associativity: Restrict the right operand so it only can contain expressions of higher priority�

Right associativity: ...�

Exercise

35

Ambiguous grammar: �

Expr -> Expr "+" Expr�Expr -> Expr "*" Expr�Expr -> INT�Expr -> "(" Expr ")"�

Equivalent unambiguous grammar: �

Solu6on

36

Ambiguous grammar: �

Expr -> Expr "+" Expr�Expr -> Expr "*" Expr�Expr -> INT�Expr -> "(" Expr ")"�

Equivalent unambiguous grammar: �

Expr -> Expr "+" Term�Expr -> Term�Term -> Term "*" Factor�Term -> Factor�Factor -> INT�Factor -> "(" Expr ")"�

Here, we introduce a new nonterminal, Term, that is more restricted than Expr. That is, from Term, we can not derive any new addi6ons.

We use Term as the right operand in the addi6on produc6on, to make sure no new addi6ons will appear to the right. This give leb-‐associa6vity.

We use Term, and the even more restricted nonterminal Factor, in the mul6plica6on produc6on, to make sure no addi6ons can appear there, without using parentheses. This gives mul6plica6on higher priority than addi6on.

Wri6ng the grammar in different nota6ons

37

Canonical form: �


Equivalent BNF (Backus-Naur Form): �

Equivalent EBNF (Extended BNF): �

Use alterna<ves instead of several produc6ons per nonterminal.

Use repe<<on instead of recursion, where possible.

Wri6ng the grammar in different nota6ons

38

Canonical form: �


Equivalent BNF (Backus-Naur Form): �

Expr -> Expr "+" Term | Term�Term -> Term "*" Factor | Factor�Factor -> INT | "(" Expr ")" �

Equivalent EBNF (Extended BNF): �

Expr -> Term ("+" Term)* �Term -> Factor ("*" Factor)* �Factor -> INT | "(" Expr ")" �

Use alterna<ves instead of several produc6ons per nonterminal.

Use repe<<on instead of recursion, where possible.

Transla6ng EBNF to Canonical form

39

Top level repetition �X -> γ1 γ2* γ3 �

EBNF Equivalent canonical form

Top level alternative�X -> γ1 | γ2 �

Top level parentheses�X -> γ1 (...) γ2 �

Where γk is a sequence of terminals and nonterminals

Transla6ng EBNF to Canonical form

40

Top level repetition �X -> γ1 γ2* γ3 �

X -> γ1 N γ2�N -> ... �

EBNF Equivalent canonical form

X -> γ1 N γ3�N -> γ2 N �N -> ε�

Top level alternative�X -> γ1 | γ2 �

X -> γ1 �X -> γ2 �

Top level parentheses�X -> γ1 (...) γ2 �

Exercise: Translate from EBNF to Canonical form

41

EBNF: �

Expr -> Term ("+" Term)* �

Equivalent Canonical Form�

Solu6on: Translate from EBNF to Canonical form

42

EBNF: �



Expr -> Term N �N -> "+" Term N �N -> ε�

Can we show that these are equivalent?

43

EBNF: �



Expr -> Term N �N -> "+" Term N �N -> ε�

Alternative Equivalent Canonical Form�

Expr -> Expr "+" Term�Expr -> Term�

trivial

non-‐trivial

Example proof

44

1. We start with this:�Expr -> Term ("+" Term)* �

2. We can move the repetition: �Expr -> (Term "+")* Term�

3. Introduce a nonterminal N: �Expr -> N Term �N -> (Term "+")* �

5. Replace N Term by Expr in second production:�

Expr -> N Term �N -> Expr "+" �N -> ε�

6. Eliminate N: �

Expr -> Expr "+" Term �Expr -> Term�

4. Eliminate the repetition: �Expr -> N Term �N -> N Term "+" �N -> ε�

Equivalence of grammars

45

Given two context-free grammars, G1 and G2.�Are they equivalent?�

I.e., is L(G1) = L(G2)?�

Undecidable problem: �a general algorithm cannot be constructed.�

We need to rely on our ingenuity to find out.�(In the general case.)�

46

RE� CFG�Alphabet � characters� terminal symbols �

(tokens)�Language� strings �

(char sequences)�sentences�

(token sequences)�Used for� tokens� parse trees�

Power� iteration � recursion �Recognizer� DFA � NFA with stack �

Regular Expressions vs Context-‐Free Grammars

47

Grammar� Rule patterns� Type�regular� X -> aY or X -> a or X -> ε� 3�

context free� X -> γ � 2�context sensitive� α X β -> α γ β� 1 �

arbitrary� γ -> δ 0�

The Chomsky grammar hierarchy

Type(3) ⊂ Type (2) ⊂ Type(1) ⊂ Type(0)

Regular grammars have the same power as regular expressions (tail recursion = itera6on).

Type 2 and 3 are of prac6cal use in compiler construc6on. Type 0 and 1 are only of theore6cal interest.

Summary ques6ons

48

•  Define a small example language with a CFG • What is a nonterminal symbol? A terminal symbol? A produc6on? A start symbol? A parse tree? • What is a leb-‐hand side of a produc6on? A right-‐hand side? •  Given a grammar G, what is meant by the language L(G)? • What is a deriva6on step? A deriva6on? A lebmost deriva6on? •  How does a deriva6on correspond to a parse tree? • When is a grammar ambiguous? Unambiguous? •  How can ambigui6es for expressions be removed? • When are two grammars equivalent? • When should we use canonical form, and when EBNF? •  Translate an EBNF grammar to canonical form. •  Explain why context-‐free grammars are more powerful than regular expressions. • What is the Chomsky hierarchy?

Readings

49

• F3: Context free grammars, deriva6ons, ambiguity, EBNF Appel, chapter 3-‐3.1. Appel, relevant exercises: 3.1-‐3.3 Try solve the problems in Seminar 2.

• F4: Predic6ve parsing. Recursive descent. LL grammars and parsing. Leb recursion and factoriza6on. Appel, chapter 3.2

Context-‐free grammars

Documents

Transcript of Context-‐free grammars