Algorithms for Compiler Design
by O.G. Kakde
ISBN: 1584501006
Charles River Media © 2002 (334 pages)
This text teaches the fundamental algorithms that underlie modern compilers, and focuses on the "front-end" of compiler design: lexical analysis, parsing, and syntax.
Table of Contents
Algorithms for Compiler Design
Preface
Chapter 1 - Introduction
Chapter 2 - Finite Automata and Regular Expressions
Chapter 3 - Context-Free Grammar and Syntax Analysis
Chapter 4 - Top-Down Parsing
Chapter 5 - Bottom-up Parsing
Chapter 6 - Syntax-Directed Definitions and Translations
Chapter 7 - Symbol Table Management
Chapter 8 - Storage Management
Chapter 9 - Error Handling
Chapter 10 - Code Optimization
Chapter 11 - Code Generation
Chapter 12 - Exercises
Index
List of Figures
List of Tables
List of Examples
Back Cover
- A compiler translates a high-level language program into a functionally equivalent low-level language program that can be
- understood and executed by the computer. Crucial to any computer system, effective compiler design is also one of the most
- complex areas of system development. Before any code for a modern compiler is even written, many programmers have
- difficulty with the high-level algorithms that will be necessary for the compiler to function. Written with this in mind, Algorithms
- for Compiler Design teaches the fundamental algorithms that underlie modern compilers. The book focuses on the “front-end”
- of compiler design: lexical analysis, parsing, and syntax. Blending theory with practical examples throughout, the book
- presents these difficult topics clearly and thoroughly. The final chapters on code generation and optimization complete a
- solid foundation for learning the broader requirements of an entire compiler design.
- FEATURES
- Focuses on the “front-end” of compiler design—lexical analysis, parsing, and syntax—topics basic to any
- introduction to compiler design
- Covers storage management, error handling, and recovery
- Introduces important “back-end” programming concepts, including code generation and optimization
- Algorithms for Compiler Design
- O.G. Kakde
- CHARLES RIVER MEDIA, INC.
- Copyright © 2002, 2003 Laxmi Publications, LTD.
- O.G. Kakde. Algorithms for Compiler Design
- 1-58450-100-6
- No part of this publication may be reproduced in any way, stored in a retrieval system of any type, or transmitted by
- any means or media, electronic or mechanical, including, but not limited to, photocopy, recording, or scanning, without
- prior permission in writing from the publisher.
- Publisher: David Pallai
- Production: Laxmi Publications, LTD.
- Cover Design: The Printed Image
- CHARLES RIVER MEDIA, INC.
- 20 Downer Avenue, Suite 3
- Hingham, Massachusetts 02043
- 781-740-0400
- 781-740-8816 (FAX)
- info@charlesriver.com
- http://www.charlesriver.com
- Original Copyright 2002, 2003 by Laxmi Publications, LTD.
- O.G. Kakde. Algorithms for Compiler Design.
- Original ISBN: 81-7008-100-6
- All brand names and product names mentioned in this book are trademarks or service marks of their respective
- companies. Any omission or misuse (of any kind) of service marks or trademarks should not be regarded as intent to
- infringe on the property of others. The publisher recognizes and respects all marks used by companies,
- manufacturers, and developers as a means to distinguish their products.
- 02 7 6 5 4 3 2 First Edition
- CHARLES RIVER MEDIA titles are available for site license or bulk purchase by institutions, user groups,
- corporations, etc. For additional information, please contact the Special Sales Department at 781-740-0400.
- Acknowledgments
- The author wishes to thank all of the colleagues in the Department of Electronics and Computer Science Engineering
- at Visvesvaraya Regional College of Engineering Nagpur, whose constant encouragement and timely help have
- resulted in the completion of this book. Special thanks go to Dr. C. S. Moghe, with whom the author had long technical
- discussions, which found their place in this book. Thanks are due to the institution for providing all of the infrastructural
- facilities and tools for a timely completion of this book. The author would particularly like to acknowledge Mr. P. S.
- Deshpande and Mr. A. S. Mokhade for their invaluable help and support from time to time. Finally, the author wishes
- to thank all of his students.
- Preface
- This book on algorithms for compiler design covers the various aspects of designing a language translator in depth.
- The book is intended to be a basic reading material in compiler design.
- Enough examples and algorithms have been used to effectively explain various tools of compiler design. The first
- chapter gives a brief introduction of the compiler and is thus important for the rest of the book.
Other issues, like context-free grammars, parsing techniques, syntax-directed definitions, symbol tables, code optimization, and more, are explained in the various chapters of the book.
- The final chapter has some exercises for the readers for practice.
- Chapter 1: Introduction
- 1.1 WHAT IS A COMPILER?
- A compiler is a program that translates a high-level language program into a functionally equivalent low-level
- language program. So, a compiler is basically a translator whose source language (i.e., language to be translated) is
- the high-level language, and the target language is a low-level language; that is, a compiler is used to implement a
- high-level language on a computer.
- 1.2 WHAT IS A CROSS-COMPILER?
- A cross-compiler is a compiler that runs on one machine and produces object code for another machine. The
- cross-compiler is used to implement the compiler, which is characterized by three languages:
1. The source language,
2. The object language, and
3. The language in which it is written.
- If a compiler has been implemented in its own language, then this arrangement is called a "bootstrap" arrangement.
- The implementation of a compiler in its own language can be done as follows.
- Implementing a Bootstrap Compiler
Suppose we have a new language, L, that we want to make available on machines A and B. As a first step, we can write a small compiler SCAA (where the notation XCYZ denotes a compiler whose source language is X, whose object language is Y, and which is itself written in Z): a compiler that translates an S subset of L into object code for machine A, written in a language available on A. We then write a compiler LCAS for the full language L, generating object code for machine A, but written in the S subset of L. LCAS will not be able to execute unless and until it is translated by SCAA; therefore, LCAS is given as an input to SCAA, as shown below, producing LCAA: a compiler for L that will run on machine A and self-generate code for machine A.

Now, if we want to produce another compiler to run on and produce code for machine B, the compiler can be written, itself, in L and made available on machine B by using the following steps:
- 1.3 COMPILATION
- Compilation refers to the compiler's process of translating a high-level language program into a low-level language
- program. This process is very complex; hence, from the logical as well as an implementation point of view, it is
- customary to partition the compilation process into several phases, which are nothing more than logically cohesive
- operations that input one representation of a source program and output another representation.
- A typical compilation, broken down into phases, is shown in Figure 1.1.
- Figure 1.1: Compilation process phases.
- The initial process phases analyze the source program. The lexical analysis phase reads the characters in the source
- program and groups them into streams of tokens; each token represents a logically cohesive sequence of characters,
such as identifiers, operators, and keywords. The character sequence that forms a token is called a "lexeme." Certain tokens are augmented by a lexical value; that is, when an identifier like xyz is found, the lexical analyzer not only returns id, but it also enters the lexeme xyz into the symbol table if it does not already exist there. It returns a pointer to this symbol-table entry as the lexical value associated with this occurrence of the token id. Therefore, the internal representation of a statement like X := Y + Z after the lexical analysis will be id1 := id2 + id3. The subscripts 1, 2, and 3 are used for convenience; the actual token is id. The syntax analysis phase imposes a hierarchical structure on the token string, as shown in Figure 1.2.
- Figure 1.2: Syntax analysis imposes a structure hierarchy on the token string.
- Intermediate Code Generation
- Some compilers generate an explicit intermediate code representation of the source program. The intermediate code
- can have a variety of forms. For example, a three-address code (TAC) representation for the tree shown in Figure 1.2
- will be:
where T1 and T2 are compiler-generated temporaries.
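As an illustrative sketch (using a hypothetical statement, X := Y + Z * W, rather than the book's exact example), a TAC translation that needs two compiler-generated temporaries would be:

    T1 := Z * W
    T2 := Y + T1
    X := T2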
- Code Optimization
- In the optimization phase, the compiler performs various transformations in order to improve the intermediate code.
- These transformations will result in faster-running machine code.
- Code Generation
- The final phase in the compilation process is the generation of target code. This process involves selecting memory
- locations for each variable used by the program. Then, each intermediate instruction is translated into a sequence of
- machine instructions that performs the same task.
- Compiler Phase Organization
This is the logical organization of a compiler. It reveals that certain phases of the compiler are heavily dependent on the
- source language and are independent of the code requirements of the target machine. All such phases, when grouped
- together, constitute the front end of the compiler; whereas those phases that are dependent on the target machine
- constitute the back end of the compiler. Grouping the compilation phases in the front and back ends facilitates the
- re-targeting of the code; implementation of the same source language on different machines can be done by rewriting
- only the back end.
- Note Different languages can also be implemented on the same machine by rewriting the front end and using the
- same back end. But to do this, all of the front ends are required to produce the same intermediate code; and this
- is difficult, because the front end depends on the source language, and different languages are designed with
- different viewpoints. Therefore, it becomes difficult to write the front ends for different languages by using a
- common intermediate code.
- Having relatively few passes is desirable from the point of view of reducing the compilation time. To reduce the
- number of passes, it is required to group several phases in one pass. For some of the phases, being grouped into one
- pass is not a major problem. For example, the lexical analyzer and syntax analyzer can easily be grouped into one
- pass, because the interface between them is a single token; that is, the processing required by the token is
independent of other tokens. Therefore, these phases can be easily grouped together, with the lexical analyzer working as a subroutine of the syntax analyzer, which is in charge of the entire analysis activity.
- Conversely, grouping some of the phases into one pass is not that easy. Grouping intermediate and object
- code-generation phases is difficult, because it is often very hard to perform object code generation until a sufficient
number of intermediate code statements have been generated. Here, the interface between the two is not based on only one intermediate instruction: certain languages permit the use of a variable before it is declared. Similarly, many
- languages also permit forward jumps. Therefore, it is not possible to generate object code for a construct until
- sufficient intermediate code statements have been generated. To overcome this problem and enable the merging of
intermediate and object code generation into one pass, the technique called "back-patching" is used; the object code is generated by leaving "statement holes," which will be filled later when the information becomes available.
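As a minimal sketch of back-patching (the instruction format here is hypothetical, not the book's), a forward jump can be emitted with an unfilled target that is patched once the target address becomes known:

    # Back-patching sketch: emit a jump with an unknown target, remember the
    # "statement hole," and fill it in when the target address is known.
    code = []
    code.append(["goto", None])      # forward jump: target not yet known (a statement hole)
    hole = len(code) - 1             # remember where the hole is
    code.append(["assign", "x", 1])  # ...intervening intermediate code...
    target = len(code)               # the jump target's address becomes known here
    code[hole][1] = target           # back-patch the hole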
- 1.3.1 Lexical Analysis Phase
- In the lexical analysis phase, the compiler scans the characters of the source program, one character at a time.
- Whenever it gets a sufficient number of characters to constitute a token of the specified language, it outputs that
- token. In order to perform this task, the lexical analyzer must know the keywords, identifiers, operators, delimiters, and
- punctuation symbols of the language to be implemented. So, when it scans the source program, it will be able to
return a suitable token whenever it encounters a token lexeme. (Lexeme refers to the sequence of characters in the source program that is matched by the language's character patterns that specify identifiers, operators, keywords, delimiters, punctuation symbols, and so forth.) Therefore, the lexical analyzer design must:

1. Specify the tokens of the language, and
2. Suitably recognize the tokens.
- We cannot specify the language tokens by enumerating each and every identifier, operator, keyword, delimiter, and
- punctuation symbol; our specification would end up spanning several pages—and perhaps never end, especially for
- those languages that do not limit the number of characters that an identifier can have. Therefore, token specification
- should be generated by specifying the rules that govern the way that the language's alphabet symbols can be
- combined, so that the result of the combination will be a token of that language's identifiers, operators, and keywords.
- This requires the use of suitable language-specific notation.
- Regular Expression Notation
Regular expression notation can be used for the specification of tokens because tokens constitute a regular set. It is compact and precise, and for every regular expression there exists a deterministic finite automata (DFA) that accepts the language it specifies. The DFA is used to recognize the language specified by the regular expression notation, making the automatic construction of a recognizer of tokens possible. Therefore, the study of regular expression notation and finite automata becomes necessary. Some definitions of the various terms used are described below.
- 1.4 REGULAR EXPRESSION NOTATION/FINITE AUTOMATA DEFINITIONS
- String
A string is a finite sequence of symbols. We use a letter, such as w, to denote a string. If w is a string, then the length of the string is denoted as |w|, a count of the number of symbols of w. For example, if w = xyz, then |w| = 3. If |w| = 0, then the string is called an "empty" string, and we use ∈ to denote the empty string.
- Prefix
- A string's prefix is the string formed by taking any number of leading symbols of string. For example, if w = abc, then ∈ ,
- a, ab, and abc are the prefixes of w. Any prefix of a string other than the string itself is called a "proper" prefix of the
- string.
- Suffix
- A string's suffix is formed by taking any number of trailing symbols of a string. For example, if w = abc, then ∈ , c, bc,
- and abc are the suffixes of the w. Similar to prefixes, any suffix of a string other than the string itself is called a "proper"
- suffix of the string.
- Concatenation
If w1 and w2 are two strings, then the concatenation of w1 and w2 is denoted as w1.w2: simply, the string obtained by writing w1 followed by w2 without any space in between (i.e., a juxtaposition of w1 and w2). For example, if w1 = xyz and w2 = abc, then w1.w2 = xyzabc. If w is a string, then w.∈ = w and ∈.w = w. Therefore, we conclude that ∈ (the empty string) is the identity for concatenation.
- Alphabet
- An alphabet is a finite set of symbols denoted by the symbol Σ .
- Language
- A language is a set of strings formed by using the symbols belonging to some previously chosen alphabet. For
- example, if Σ = { 0, 1 }, then one of the languages that can be defined over this Σ will be L = { ∈ , 0, 00, 000, 1, 11, 111,
- … }.
- Set
A set is a collection of objects. It can be denoted by the following methods:

1. We can enumerate the members by placing them within curly brackets ({ }). For example, the set A is defined by: A = { 0, 1, 2 }.
2. We can use predicate notation, in which the set is denoted as A = { x | P(x) }. This means that A is the set of all those elements x for which the predicate P(x) is true. For example, the set of all integers divisible by three will be denoted as: A = { x | x is an integer and x mod 3 = 0 }.
Set Operations

Union: If A and B are two sets, then the union of A and B is denoted as: A ∪ B = { x | x is in A or x is in B }.

Intersection: If A and B are two sets, then the intersection of A and B is denoted as: A ∩ B = { x | x is in A and x is in B }.

Set difference: If A and B are two sets, then the difference of A and B is denoted as: A − B = { x | x is in A but not in B }.

Cartesian product: If A and B are two sets, then the Cartesian product of A and B is denoted as: A × B = { (a, b) | a is in A and b is in B }.

Power set: If A is a set, then the power set of A is denoted as: 2^A = { P | P is a subset of A } (i.e., the set of all possible subsets of A). For example, if A = { 1, 2 }, then 2^A = { { }, {1}, {2}, {1, 2} }.

Concatenation: If A and B are two sets, then the concatenation of A and B is denoted as: AB = { ab | a is in A and b is in B }. For example, if A = { 0, 1 } and B = { 1, 2 }, then AB = { 01, 02, 11, 12 }.

Closure: If A is a set, then the closure of A is denoted as: A* = A^0 ∪ A^1 ∪ A^2 ∪ …, where A^i is the ith power of set A, defined as A^i = A.A.A … (i times). Here A^0 = { ∈ } (the set of all possible combinations of members of A of length 0), A^1 = A (all combinations of length 1), and A^2 = A.A (all combinations of length 2).

Therefore, A* is the set of all possible combinations of the members of A. For example, if Σ = { 0, 1 }, then Σ* will be the set of all possible combinations of zeros and ones, which is one of the languages defined over Σ.
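These operations translate directly into code. A brief Python sketch (illustrative only, not from the book) of concatenation and a length-bounded approximation of closure:

    # Set concatenation AB and the members of A* up to a given length.
    A = {"0", "1"}

    def concat(X, Y):
        return {x + y for x in X for y in Y}

    def closure_up_to(X, k):
        result, power = {""}, {""}     # A^0 = { empty string }
        for _ in range(k):
            power = concat(power, X)   # A^(i+1) = A^i . A
            result |= power
        return result

    print(concat({"0", "1"}, {"1", "2"}))  # {'01', '02', '11', '12'}
    print(sorted(closure_up_to(A, 2)))     # ['', '0', '00', '01', '1', '10', '11']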
- 1.5 RELATIONS
- Let A and B be the two sets; then the relationship R between A and B is nothing more than a set of ordered pairs (a, b)
- such that a is in A and b is in B, and a is related to b by relation R. That is:
- R = { (a, b) | a is in A and b is in B, and a is related to b by R }
For example, if A = { 0, 1 } and B = { 1, 2 }, then we can define the relation "less than," denoted by <, as follows: < = { (0, 1), (0, 2), (1, 2) }. A pair (1, 1) will not belong to the < relation, because one is not less than one. Therefore, we conclude that a relation R between sets A and B is a subset of A × B.
- If a pair (a, b) is in R, then aRb is true; otherwise, aRb is false.
A is called the "domain" of the relation, and B is called the "range" of the relation. If the domain of a relation R is a set A, and the range is also the set A, then R is called a relation on set A rather than a relation between sets A and B. For example, if A = { 0, 1, 2 }, then the < relation defined on A will result in: < = { (0, 1), (0, 2), (1, 2) }.
- 1.5.1 Properties of the Relation
Let R be some relation defined on a set A. Then:

1. R is said to be reflexive if aRa is true for every a in A; that is, if every element of A is related to itself by relation R, then R is called a reflexive relation.
2. If every aRb implies bRa (i.e., when a is related to b by R, b is also related to a by the same relation R), then the relation R is a symmetric relation.
3. If every aRb and bRc implies aRc, then the relation R is said to be transitive; that is, when a is related to b by R, and b is related to c by R, then a is also related to c by relation R.

If R is reflexive and transitive, as well as symmetric, then R is an equivalence relation.
- Property Closure of a Relation
Let R be a relation defined on a set A, and let P be a set of properties. The property closure of the relation R, denoted as the P-closure, is the smallest relation R′ that has the properties mentioned in P. It is obtained by adding every pair (a, b) in R to R′, and then adding those pairs of the members of A that will make the relation have the properties in P. If P contains only the transitivity property, then the P-closure is called the transitive closure of the relation, and we denote the transitive closure of relation R by R+; whereas when P contains the transitive as well as the reflexive property, the P-closure is called the reflexive-transitive closure of the relation R, and we denote it by R*. R+ can be obtained from R by repeatedly adding a pair (a, c) whenever the pairs (a, b) and (b, c) are already present, until no new pairs can be added; R* is then R+ together with the pairs (a, a) for every a in A.
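A small Python sketch of this computation (illustrative, not the book's code), iterating until no new pairs can be added:

    # Transitive closure R+ of a relation given as a set of ordered pairs.
    def transitive_closure(pairs):
        closure = set(pairs)
        while True:
            new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
            if new <= closure:
                return closure
            closure |= new

    print(sorted(transitive_closure({(0, 1), (1, 2)})))  # [(0, 1), (0, 2), (1, 2)]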
- Chapter 2: Finite Automata and Regular Expressions
- 2.1 FINITE AUTOMATA
A finite automata consists of a finite number of states and a finite number of transitions, and these transitions are defined on certain, specific symbols called input symbols. One of the states of the finite automata is identified as the initial state: the state in which the automata always starts. Similarly, certain states are identified as final states. Therefore, a finite automata is specified using five things:

1. The states of the finite automata;
2. The input symbols on which transitions are made;
3. The transitions specifying from which state, on which input symbol, where the transition goes;
4. The initial state; and
5. The set of final states.

Therefore, formally, a finite automata is a five-tuple:

M = (Q, Σ, δ, q0, F)
- where:
- Q is a set of states of the finite automata,
- Σ is a set of input symbols, and
- δ specifies the transitions in the automata.
- If from a state p there exists a transition going to state q on an input symbol a, then we write δ (p, a) = q. Hence, δ is a
- function whose domain is a set of ordered pairs, (p, a), where p is a state and a is an input symbol, and the range is a
- set of states.
Therefore, we conclude that δ defines a mapping whose domain will be a set of ordered pairs of the form (p, a) and whose range will be a set of states; that is, δ defines a mapping from Q × Σ to Q. Here, q0 is the initial state, and F is a set of final states of the automata. For example:
- where
- A directed graph exists that can be associated with finite automata. This
- graph is called a "transition diagram of finite automata." To associate a graph with finite automata, the vertices of the
- graph correspond to the states of the automata, and the edges in the transition diagram are determined as follows.
If δ(p, a) = q, then we put an edge from the vertex corresponding to state p to the vertex corresponding to state q, labeled by a. To indicate the initial state, we place an arrow with its head pointing to the vertex corresponding to the initial state, and we label that arrow "start." We then double-circle the vertices that correspond to the final states of the automata. Therefore, the transition diagram for the described finite automata will resemble Figure 2.1.
- Figure 2.1: Transition diagram for finite automata δ (p, a) = q.
- A tabular representation can also be used to specify the finite automata. A table whose number of rows is equal to the
- number of states, and whose number of columns equals the number of input symbols, is used to specify the transitions
- in the automata. The first row specifies the transitions from the initial state; the rows specifying the transitions from the
- final states are marked as *. For example, the automata above can be specified as follows:
- A finite automata can be used to accept some particular set of strings. If x is a string made of symbols belonging to Σ
- of the finite automata, then x is accepted by the finite automata if a path corresponding to x in a finite automata starts
- in an initial state and ends in one of the final states of the automata; that is, there must exist a sequence of moves for x
in the finite automata that takes the transitions from the initial state to one of the final states of the automata. Since x is a member of Σ*, we define a new transition function, δ1, which defines a mapping from Q × Σ* to Q; and if δ1(q0, x) is a member of F, then x is accepted by the finite automata. If x is written as wa, where a is the last symbol of x and w is the string of the remaining symbols of x, then:

δ1(q, ∈) = q
δ1(q, wa) = δ(δ1(q, w), a)
- For example:
- where
- Let x be 010. To find out if x is accepted by the automata or not, we proceed as follows:
δ1(q0, 0) = δ(q0, 0) = q1
Therefore, δ1(q0, 01) = δ(δ1(q0, 0), 1) = δ(q1, 1) = q0
Therefore, δ1(q0, 010) = δ(δ1(q0, 01), 0) = δ(q0, 0) = q1
Since q1 is a member of F, x = 010 is accepted by the automata.
If x = 0101, then δ1(q0, 0101) = δ(δ1(q0, 010), 1) = δ(q1, 1) = q0
Since q0 is not a member of F, x is not accepted by the above automata.
Therefore, if M is the finite automata, then the language accepted by the finite automata is denoted as: L(M) = { x | δ1(q0, x) is a member of F }.
In the finite automata discussed above, since δ defines a mapping from Q × Σ to Q, there exists exactly one transition from a state on an input symbol; therefore, this finite automata is considered a deterministic finite automata (DFA). Hence, we define the DFA as the finite automata:

M = (Q, Σ, δ, q0, F), such that there exists exactly one transition from a state on an input symbol.
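The tabular form of δ maps directly onto a dictionary, and the extended function δ1 becomes a simple loop. The following Python sketch is illustrative: the transition table is an assumption, reconstructed to be consistent with the worked example above (a DFA accepting strings that end in 0):

    # DFA simulation sketch: delta as a table, delta1 as repeated lookup.
    delta = {
        ("q0", "0"): "q1", ("q0", "1"): "q0",
        ("q1", "0"): "q1", ("q1", "1"): "q0",
    }
    start, final = "q0", {"q1"}

    def accepts(x):
        q = start              # delta1(q0, epsilon) = q0
        for a in x:
            q = delta[(q, a)]  # delta1(q0, wa) = delta(delta1(q0, w), a)
        return q in final

    print(accepts("010"))   # True: the moves end in q1, a final state
    print(accepts("0101"))  # False: the moves end in q0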
2.2 NON-DETERMINISTIC FINITE AUTOMATA

If the basic finite automata model is modified in such a way that, from a state on an input symbol, zero, one, or more transitions are permitted, then the corresponding finite automata is called a "non-deterministic finite automata" (NFA). Therefore, an NFA is a finite automata in which there may exist more than one path corresponding to x in Σ* (because zero, one, or more transitions are permitted from a state on an input symbol); whereas in a DFA, there exists exactly one path corresponding to x in Σ*. Hence, an NFA is nothing more than a finite automata:

M = (Q, Σ, δ, q0, F)

in which δ defines a mapping from Q × Σ to 2^Q (to take care of zero, one, or more transitions). For example, consider the finite automata shown below:
- where:
- The transition diagram of this automata is:
- Figure 2.2: Transition diagram for finite automata that handles several transitions.
- 2.2.1 Acceptance of Strings by Non-deterministic Finite Automata
Since an NFA is a finite automata in which there may exist more than one path corresponding to x in Σ*, we may be required to test multiple paths corresponding to x in order to decide whether or not x is accepted by the NFA; for the NFA to accept x, at least one path corresponding to x is required in the NFA. This path should start in the initial state and end in one of the final states. Whereas in a DFA, since there exists exactly one path corresponding to x in Σ*, it is enough to test whether that path starts in the initial state and ends in one of the final states in order to decide whether x is accepted by the DFA or not.
- Therefore, if x is a string made of symbols in Σ of the NFA (i.e., x is in Σ *), then x is accepted by the NFA if at least one
- path exists that corresponds to x in the NFA, which starts in an initial state and ends in one of the final states of the
NFA. Since x is a member of Σ*, and there may exist zero, one, or more transitions from a state on an input symbol, we define a new transition function, δ1, which defines a mapping from 2^Q × Σ* to 2^Q; and if δ1({q0}, x) = P, where P is a set containing at least one member of F, then x is accepted by the NFA. If x is written as wa, where a is the last symbol of x, and w is a string made of the remaining symbols of x, then:

δ1({q0}, ∈) = {q0}
δ1({q0}, wa) = δ(δ1({q0}, w), a), where δ(P, a) is the union of δ(q, a) for every q in P
- For example, consider the finite automata shown below:
- where:
If x = 0111, then to find out whether or not x is accepted by the NFA, we proceed as follows:

Since δ1({q0}, 0111) = {q1, q2, q3}, which contains q3, a member of F of the NFA, x = 0111 is accepted by the NFA.

Therefore, if M is an NFA, then the language accepted by the NFA is defined as:

L(M) = { x | δ1({q0}, x) = P, where P contains at least one member of F }.
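In code, the multiple paths are tracked by keeping a set of current states rather than by backtracking over individual paths. A Python sketch follows (the transition table below is a hypothetical NFA accepting strings that contain 11, since the book's example table is not reproduced here):

    # NFA acceptance sketch: track the set of states reachable so far.
    delta = {
        ("q0", "0"): {"q0"}, ("q0", "1"): {"q0", "q1"},
        ("q1", "1"): {"q2"},
        ("q2", "0"): {"q2"}, ("q2", "1"): {"q2"},
    }
    start, final = "q0", {"q2"}

    def accepts(x):
        states = {start}
        for a in x:
            states = {q for p in states for q in delta.get((p, a), set())}
        return bool(states & final)

    print(accepts("0110"))  # True: at least one path reaches q2
    print(accepts("1010"))  # False: no path reaches a final state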
- 2.3 TRANSFORMING NFA TO DFA
For every non-deterministic finite automata, there exists an equivalent deterministic finite automata. The equivalence between the two is defined in terms of language acceptance. Since an NFA is nothing more than a finite automata in which zero, one, or more transitions on an input symbol are permitted, we can always construct a finite automata that will simulate all the moves of the NFA on a particular input symbol in parallel. We then get a finite automata in which there will be exactly one transition on an input symbol; hence, it will be a DFA equivalent to the NFA.
Since the DFA equivalent of the NFA simulates (parallels) the moves of the NFA, every state of the DFA will be a combination of one or more states of the NFA. Hence, every state of the DFA will be represented by some subset of the set of states of the NFA; and therefore, the transformation from NFA to DFA is normally called the "subset construction." Therefore, if a given NFA has n states, then the equivalent DFA can have up to 2^n states, with the initial state corresponding to the subset {q0}. The transformation from NFA to DFA involves finding all possible subsets of the set of states of the NFA, considering each subset to be a state of the DFA, and then finding the transitions from it on every input symbol. But all of the states of a DFA obtained in this way might not be reachable from the initial state; and if a state is not reachable from the initial state on any possible input sequence, then such a state does not play a role in deciding what language is accepted by the DFA. (Such states are those states of the DFA that have outgoing transitions on the input symbols, but either no incoming transitions, or only incoming transitions from other unreachable states.) Hence, the amount of work involved in transforming an NFA to a DFA can be reduced if we attempt to generate only the reachable states of the DFA. This can be done by proceeding as follows:
Let M = (Q, Σ, δ, q0, F) be an NFA to be transformed into a DFA.
Let Q1 be the set of states of the equivalent DFA.

begin
    Q1old = Φ
    Q1new = { {q0} }
    while (Q1old ≠ Q1new)
    {
        Temp = Q1new − Q1old
        Q1old = Q1new
        for every subset P in Temp do
            for every a in Σ do
                if the transition from P on a goes to a new subset S of Q
                (the transition from P on a is obtained by finding the
                transitions from every member of P on a in the given NFA,
                and then taking the union of all such transitions)
                then
                    Q1new = Q1new ∪ { S }
    }
    Q1 = Q1new
end
A subset P in Q1 will be a final state of the DFA if P contains at least one member of F of the NFA. For example, consider the following finite automata:

where:

The DFA equivalent of this NFA can be obtained as follows:

                     0              1
{q0}                 {q1}           Φ
{q1}                 {q1}           {q1, q2}
{q1, q2}             {q1}           {q1, q2, q3}
*{q1, q2, q3}        {q1, q3}       {q1, q2, q3}
*{q1, q3}            {q1, q3}       {q1, q2, q3}
Φ                    Φ              Φ

The transition diagram associated with this DFA is shown in Figure 2.3.
Figure 2.3: Transition diagram for M = ({q0, q1, q2, q3}, {0, 1}, δ, q0, {q3}).
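The following Python sketch implements the algorithm above and generates only the reachable DFA states. The NFA transition table is an assumption, reconstructed from the DFA table just shown; frozensets serve as the DFA state names:

    # Subset construction sketch (NFA reconstructed from the example above).
    delta = {
        ("q0", "0"): {"q1"},
        ("q1", "0"): {"q1"}, ("q1", "1"): {"q1", "q2"},
        ("q2", "1"): {"q3"},
        ("q3", "0"): {"q3"}, ("q3", "1"): {"q3"},
    }
    sigma, start, final = ("0", "1"), "q0", {"q3"}

    q1new, dfa = {frozenset({start})}, {}
    q1old = set()
    while q1old != q1new:
        temp, q1old = q1new - q1old, set(q1new)
        for P in temp:
            for a in sigma:
                # transition from P on a: union of the member transitions in the NFA
                S = frozenset(q for p in P for q in delta.get((p, a), set()))
                dfa[(P, a)] = S
                q1new.add(S)

    finals = {P for P in q1new if P & final}
    print(len(q1new), "reachable DFA states")  # 6, matching the table above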
- 2.4 THE NFA WITH ∈ -MOVES
- If a finite automata is modified to permit transitions without input symbols, along with zero, one, or more transitions on
- the input symbols, then we get an NFA with ‘ ∈ -moves,’ because the transitions made without symbols are called
- " ∈ -transitions."
- Consider the NFA shown in Figure 2.4.
- Figure 2.4: Finite automata with ∈ -moves.
- This is an NFA with ∈ -moves because it is possible to transition from state q 0 to q 1 without consuming any of the
- input symbols. Similarly, we can also transition from state q 1 to q 2 without consuming any input symbols. Since it is a
finite automata, an NFA with ∈-moves will also be denoted as a five-tuple:

M = (Q, Σ, δ, q0, F)

where Q, Σ, q0, and F have the usual meanings, and δ defines a mapping from Q × (Σ ∪ {∈}) to 2^Q (to take care of the ∈-transitions as well as the non-∈-transitions).
- Acceptance of a String by the NFA with ∈-Moves
A string x in Σ* will be accepted by the NFA with ∈-moves if at least one path corresponding to x exists that starts in an initial state and ends in one of the final states. But since this path may be formed by ∈-transitions as well as non-∈-transitions, to find out whether x is accepted or not by the NFA with ∈-moves, we must define a function, ∈-closure(q), where q is a state of the automata.

The function ∈-closure(q) is defined as follows:

∈-closure(q) = the set of all those states of the automata that can be reached from q on a path labeled by ∈.
- For example, in the NFA with ∈ -moves given above:
- ∈ -closure(q 0 ) = { q 0 , q 1 , q 2 }
- ∈ -closure(q 1 ) = { q 1 , q 2 }
- ∈ -closure(q 2 ) = { q 2 }
The function ∈-closure(q) will never be an empty set, because q is always reachable from itself without dependence on any input symbol; that is, on a path labeled by ∈, q will always exist in ∈-closure(q).

If P is a set of states, then the ∈-closure function can be extended to find ∈-closure(P) as follows:

∈-closure(P) = the union of ∈-closure(q) for every q in P
- 2.4.1 Algorithm for Finding ∈ -Closure(q)
Let T be the set that will comprise ∈-closure(q). We begin by adding q to T, and then initialize the stack by pushing q onto the stack:

while (stack not empty) do
{
    p = pop(stack)
    R = δ(p, ∈)
    for every member of R do
        if it is not present in T then
        {
            add that member to T
            push that member of R onto the stack
        }
}
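A direct Python transcription of this stack-based algorithm (an illustrative sketch; the key (p, None) is used here to look up δ(p, ∈)):

    # Epsilon-closure sketch: search along epsilon-transitions only.
    def epsilon_closure(q, delta):
        T, stack = {q}, [q]
        while stack:
            p = stack.pop()
            for r in delta.get((p, None), set()):  # None stands for an epsilon move
                if r not in T:
                    T.add(r)
                    stack.append(r)
        return T

    delta = {("q0", None): {"q1"}, ("q1", None): {"q2"}}  # the epsilon-moves of Figure 2.4
    print(epsilon_closure("q0", delta))  # {'q0', 'q1', 'q2'}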
Since x is a member of Σ*, and there may exist zero, one, or more transitions from a state on an input symbol, we define a new transition function, δ1, which defines a mapping from 2^Q × Σ* to 2^Q. If x is written as wa, where a is the last symbol of x and w is a string made of the remaining symbols of x, then:

δ1(P, ∈) = ∈-closure(P)
δ1(P, wa) = ∈-closure(δ(δ1(P, w), a))

since δ1 defines a mapping from 2^Q × Σ* to 2^Q. The string x is accepted if δ1({q0}, x) = P, such that P contains at least one member of F.

For example, in the NFA with ∈-moves given above, if x = 01, then to find out whether x is accepted by the automata or not, we proceed as follows:

δ1({q0}, 01) = ∈-closure(δ(δ1({q0}, 0), 1)) = ∈-closure({q1}) = {q1, q2}

Since q2 is a final state, x = 01 is accepted by the automata.
Equivalence of NFA with ∈-Moves to NFA Without ∈-Moves

For every NFA with ∈-moves, there exists an equivalent NFA without ∈-moves that accepts the same language. To obtain an equivalent NFA without ∈-moves, given an NFA with ∈-moves, what is required is an elimination of the ∈-transitions from the given automata. But simply eliminating the ∈-transitions from a given NFA with ∈-moves will change the language accepted by the automata. Hence, for every ∈-transition to be eliminated, we have to add some non-∈-transitions as substitutes in order to maintain the language accepted by the automata. Therefore, transforming an NFA with ∈-moves to an NFA without ∈-moves involves finding the non-∈-transitions that must be added to the automata for every ∈-transition to be eliminated.
- Consider the NFA with ∈ -moves shown in Figure 2.5.
- Figure 2.5: Transitioning from an ∈ -move NFA to a non- ∈ -move NFA.
- There are ∈ -transitions from state q 0 to q 1 and from state q 1 to q 2 . To eliminate these ∈ -transitions, we must add a
- transition on 0 from q 0 to q 1 , as well as from state q 0 to q 2 . Similarly, a transition must be added on 1 from q 0 to q 1 , as
- well as from state q 0 to q 2 , because the presence of these ∈ -transitions in a given automata makes it possible to
- reach from q 0 to q 1 on consuming only 0, and it is possible to reach from q 0 to q 2 on consuming only 0. Similarly, it is
- possible to reach from q 0 to q 1 on consuming only 1, and it is possible to reach from q 0 to q 2 on consuming only 1. It is
- also possible to reach from q 1 to q 2 on consuming 0 as well as 1; and therefore, a transition from q 1 to q 2 on 0 and 1 is
also required to be added. Since ∈ is also accepted by the given NFA with ∈-moves, to accept ∈, the initial state of the NFA without ∈-moves is required to be marked as one of the final states. Therefore, by adding these non-∈-transitions, and by making the initial state one of the final states, we get the automata shown in Figure 2.6.
- Figure 2.6: Making the initial state of the NFA one of the final states.
Therefore, when transforming an NFA with ∈-moves into an NFA without ∈-moves, only the transitions are required to be changed; the states are not required to be changed. But if the given NFA with ∈-moves accepts ∈ (i.e., if ∈-closure(q0) contains a member of F), then q0 is also required to be marked as one of the final states if it is not already a member of F. Hence:

If M = (Q, Σ, δ, q0, F) is an NFA with ∈-moves, then its equivalent NFA without ∈-moves will be M1 = (Q, Σ, δ1, q0, F1), where:

δ1(q, a) = ∈-closure(δ(∈-closure(q), a))

and

F1 = F ∪ {q0} if ∈-closure(q0) contains a member of F
F1 = F otherwise
For example, consider the following NFA with ∈-moves:

where δ is:

δ          0         1          ∈
q0         {q0}      φ          {q1}
q1         φ         {q1}       {q2}
q2         φ         {q2}       φ

Its equivalent NFA without ∈-moves will be:

where δ1 is:

δ1         0                  1
q0         {q0, q1, q2}       {q1, q2}
q1         φ                  {q1, q2}
q2         φ                  {q2}
- Since there exists a DFA for every NFA without ∈ -moves, and for every NFA with ∈ -moves there exists an equivalent
- NFA without ∈ -moves, we conclude that for every NFA with ∈ -moves there exists a DFA.
- 2.5 THE NFA WITH ∈ -MOVES TO THE DFA
There always exists a DFA equivalent to an NFA with ∈-moves, and it can be obtained as follows. The initial state of the DFA is ∈-closure(q0); the transition from a DFA state P on an input symbol a is ∈-closure(δ(P, a)), obtained by finding the transitions from every member of P on a and taking the ∈-closure of their union. If this transition generates a new subset of Q, then it is added to Q1, and transitions from it are found in turn; we continue in this way until we cannot add any new states to Q1. After this, we identify those states of the DFA whose subset representations contain at least one member of F; these states constitute F1. (If ∈-closure(q0) contains a member of F, the initial DFA state ∈-closure(q0) is itself one of these final states.)
Consider the following NFA with ∈-moves:

where δ is:

δ          0         1          ∈
q0         {q0}      φ          {q1}
q1         φ         {q1}       {q2}
q2         φ         {q2}       φ

A DFA equivalent to this will be:

where δ1 is:

δ1                   0                  1
{q0, q1, q2}         {q0, q1, q2}       {q1, q2}
{q1, q2}             φ                  {q1, q2}
φ                    φ                  φ

If we identify the subsets {q0, q1, q2}, {q1, q2}, and φ as A, B, and C, respectively, then the automata will be:

where δ1 is:

δ1         0      1
A          A      B
B          C      B
C          C      C
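Putting the pieces together, the whole conversion can be sketched in Python (illustrative; it combines the ∈-closure routine of Section 2.4.1 with the subset construction, using the example table above):

    # Epsilon-NFA to DFA sketch for the example above (None marks an epsilon move).
    delta = {
        ("q0", "0"): {"q0"}, ("q0", None): {"q1"},
        ("q1", "1"): {"q1"}, ("q1", None): {"q2"},
        ("q2", "1"): {"q2"},
    }
    sigma = ("0", "1")

    def eclosure(P):
        T, stack = set(P), list(P)
        while stack:
            p = stack.pop()
            for r in delta.get((p, None), set()):
                if r not in T:
                    T.add(r)
                    stack.append(r)
        return frozenset(T)

    start = eclosure({"q0"})  # {q0, q1, q2}: state A
    states, table, work = {start}, {}, [start]
    while work:
        P = work.pop()
        for a in sigma:
            S = eclosure({q for p in P for q in delta.get((p, a), set())})
            table[(P, a)] = S
            if S not in states:
                states.add(S)
                work.append(S)

    print(len(states))  # 3 DFA states: A = {q0, q1, q2}, B = {q1, q2}, C = phi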
- EXAMPLE 2.1
- Obtain a DFA equivalent to the NFA shown in Figure 2.7.
- Figure 2.7: Example 2.1 NFA.
A DFA equivalent to the NFA in Figure 2.7 will be:

                    0                1
{q0}                {q0, q1}         {q0}
{q0, q1}            {q0, q1}         {q0, q2}
{q0, q2}            {q0, q1}         {q0, q3}
{q0, q2, q3}*       {q0, q1, q3}     {q0, q3}
{q0, q1, q3}*       {q0, q3}         {q0, q2, q3}
{q0, q3}*           {q0, q1, q3}     {q0, q3}
where {q0} corresponds to the initial state of the automata, and the states marked with * are final states. If we rename the states as follows:

{q0}              A
{q0, q1}          B
{q0, q2}          C
{q0, q2, q3}      D
{q0, q1, q3}      E
{q0, q3}          F

then the transition table will be:

       0      1
A      B      A
B      B      C
C      B      F
D*     E      F
E*     F      D
F*     E      F
- EXAMPLE 2.2
- Obtain a DFA equivalent to the NFA illustrated in Figure 2.8.
- Figure 2.8: Example 2.2 DFA equivalent to an NFA.
A DFA equivalent to the NFA shown in Figure 2.8 will be:

                    0                1
{q0}                {q0}             {q0, q1}
{q0, q1}            {q0, q2}         {q0, q1}
{q0, q2}            {q0}             {q0, q1, q3}
{q0, q1, q3}*       {q0, q2, q3}     {q0, q1, q3}
{q0, q2, q3}*       {q0, q3}         {q0, q1, q3}
{q0, q3}*           {q0, q3}         {q0, q1, q3}
where {q0} corresponds to the initial state of the automata, and the states marked with * are final states. If we rename the states as follows:

{q0}              A
{q0, q1}          B
{q0, q2}          C
{q0, q2, q3}      D
{q0, q1, q3}      E
{q0, q3}          F

then the transition table will be:

       0      1
A      A      B
B      C      B
C      A      E
D*     F      E
E*     D      E
F*     F      E
- 2.6 MINIMIZATION/OPTIMIZATION OF A DFA
- Minimization/optimization of a deterministic finite automata refers to detecting those states of a DFA whose presence
- or absence in a DFA does not affect the language accepted by the automata. Hence, these states can be eliminated
- from the automata without affecting the language accepted by the automata. Such states are:
- Unreachable States: Unreachable states of a DFA are not reachable from the initial state of DFA on any
- possible input sequence.
Dead States: A dead state is a nonfinal state of a DFA whose transitions on every input symbol terminate on itself. For example, q is a dead state if q is in Q − F, and δ(q, a) = q for every a in Σ.
- Nondistinguishable States: Nondistinguishable states are those states of a DFA for which there exist no
- distinguishing strings; hence, they cannot be distinguished from one another.
Therefore, optimization entails:

1. Detection of unreachable states and eliminating them from the DFA;
2. Identification of nondistinguishable states, and merging them together; and
3. Detecting dead states and eliminating them from the DFA.
- 2.6.1 Algorithm to Detect Unreachable States
Input: M = (Q, Σ, δ, q0, F)
Output: set U (the set of unreachable states)

{Let R be the set of reachable states of the DFA. We keep two copies, Rold and Rnew, so that we can iterate while detecting the unreachable states.}

begin
    Rold = φ
    Rnew = {q0}
    while (Rold ≠ Rnew) do
    begin
        temp1 = Rnew − Rold
        Rold = Rnew
        temp2 = φ
        for every a in Σ do
            temp2 = temp2 ∪ δ(temp1, a)
        Rnew = Rnew ∪ temp2
    end
    U = Q − Rnew
end
- If p and q are the two states of a DFA, then p and q are said to be ‘distinguishable’ states if a distinguishing string w
- exists that distinguishes p and q.
- A string w is a distinguishing string for states p and q if transitions from p on w go to a nonfinal state, whereas
- transitions from q on w go to a final state, or vice versa.
- Therefore, to find nondistinguishable states of a DFA, we must find out whether some distinguishing string w, which
- distinguishes the states, exists. If no such string exists, then the states are nondistinguishable and can be merged
- together.
The technique that we use to find the nondistinguishable states is the method of successive partitioning. We start with two groups/partitions: one contains all the nonfinal states, and the other contains all the final states (every final state is distinguishable from every nonfinal state). We then find the transitions from the members of each partition on every input symbol. If, on a particular input symbol a, we find that the transitions from some of the members of a partition go into one group, whereas the transitions from the other members go into a different group, then we conclude that the members whose transitions go into one group are distinguishable from the members whose transitions go into the other. Therefore, we divide the partition in two, and we continue this partitioning until we get partitions that cannot be partitioned further. This happens either when a partition contains only one state, or when a partition contains more than one state but the states are not distinguishable from one another. When we are left with such a partition, we merge all of the states of the partition into a single state. For example (see the sketch after Figure 2.10), consider the transition diagram in Figure 2.9.
- Figure 2.9: Partitioning down to a single state.
- Initially, we have two groups, as shown below:
- Since
- Partitioning of Group I is not possible, because the transitions from all the members of Group I go only to Group I. But
- since
- state F is distinguishable from the rest of the members of Group I. Hence, we divide Group I into two groups: one
- containing A, B, C, E, and the other containing F, as shown below:
- Since
- partitioning of Group I is not possible, because the transitions from all the members of Group I go only to Group I. But
- since
- states A and E are distinguishable from states B and C. Hence, we further divide Group I into two groups: one
- containing A and E, and the other containing B and C, as shown below:
- Since
- state A is distinguishable from state E. Hence, we divide Group I into two groups: one containing A and the other
- containing E, as shown below:
- Since
- partitioning of Group III is not possible, because the transitions from all the members of Group III on a go to group III
- only. Similarly,
- partitioning of Group III is not possible, because the transitions from all the members of Group III on b also only go to
- Group III.
- Hence, B and C are nondistinguishable states; therefore, we merge B and C to form a single state, B 1 , as shown in
- Figure 2.10.
- Figure 2.10: Merging nondistinguishable states B and C into a single state B 1 .
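A compact Python sketch of successive partitioning (illustrative; the DFA below is hypothetical, since the transition table behind Figure 2.9 is not reproduced in the text, but it likewise shows two states, B and C, collapsing into one):

    # Successive partitioning sketch: split groups until no group can be split.
    sigma = ("a", "b")
    delta = {("A", "a"): "B", ("A", "b"): "C",
             ("B", "a"): "B", ("B", "b"): "D",
             ("C", "a"): "B", ("C", "b"): "D",  # B and C behave identically
             ("D", "a"): "D", ("D", "b"): "D"}
    states, finals = {"A", "B", "C", "D"}, {"D"}

    groups = [finals, states - finals]  # initial partition: final vs. nonfinal
    changed = True
    while changed:
        changed = False
        for g in list(groups):
            for a in sigma:
                # split g by the group into which each member's a-transition goes
                targets = {}
                for q in sorted(g):
                    i = next(n for n, h in enumerate(groups) if delta[(q, a)] in h)
                    targets.setdefault(i, set()).add(q)
                if len(targets) > 1:
                    groups.remove(g)
                    groups.extend(targets.values())
                    changed = True
                    break
            if changed:
                break
    print(groups)  # [{'D'}, {'A'}, {'B', 'C'}]: B and C merge into a single state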
- 2.6.2 Algorithm for Detection of Dead States
Input: M = (Q, Σ, δ, q0, F)
Output: set X (the set of dead states)

{
    X = φ
    for every q in (Q − F) do
    {
        flag = true
        for every a in Σ do
            if (δ(q, a) ≠ q) then
            {
                flag = false
                break
            }
        if flag = true then
            X = X ∪ {q}
    }
}
- 2.7 EXAMPLES OF FINITE AUTOMATA CONSTRUCTION
- EXAMPLE 2.3
- Construct a finite automata accepting the set of all strings of zeros and ones, with at most one pair of consecutive
- zeros and at most one pair of consecutive ones.
- A transition diagram of the finite automata accepting the set of all strings of zeros and ones, with at most one pair of
- consecutive zeros and at most one pair of consecutive ones is shown in Figure 2.11.
- Figure 2.11: Transition diagram for Example 2.3 finite automata.
- EXAMPLE 2.4
- Construct a finite automata that will accept strings of zeros and ones that contain even numbers of zeros and odd
- numbers of ones.
- A transition diagram of the finite automata that accepts the set of all strings of zeros and ones that contains even
- numbers of zeros and odd numbers of ones is shown in Figure 2.12.
- Figure 2.12: Finite automata containing even number of zeros and odd number of ones.
- EXAMPLE 2.5
- Construct a finite automata that will accept a string of zeros and ones that contains an odd number of zeros and an
- even number of ones.
- A transition diagram of finite automata accepting the set of all strings of zeros and ones that contains an odd number
- of zeros and an even number of ones is shown in Figure 2.13.
- Figure 2.13: Finite automata containing odd number of zeros and even number of ones.
- EXAMPLE 2.6
- Construct the finite automata for accepting strings of zeros and ones that contain equal numbers of zeros and ones,
- and no prefix of the string should contain two more zeros than ones or two more ones than zeros.
- A transition diagram of the finite automata that will accept the set of all strings of zeros and ones, contain equal
- numbers of zeros and ones, and contain no string prefixes of two more zeros than ones or two more ones than zeros
- is shown in Figure 2.14.
- Figure 2.14: Example 2.6 finite automata considers the set prefix.
- EXAMPLE 2.7
- Construct a finite automata for accepting all possible strings of zeros and ones that do not contain 101 as a substring.
- Figure 2.15 shows a transition diagram of the finite automata that accepts the strings containing 101 as a substring.
- Figure 2.15: Finite automata accepts strings containing the substring 101.
A DFA equivalent to this NFA will be:

                    0               1
{A}                 {A}             {A, B}
{A, B}              {A, C}          {A, B}
{A, C}              {A}             {A, B, D}
{A, B, D}*          {A, C, D}       {A, B, D}
{A, C, D}*          {A, D}          {A, B, D}
{A, D}*             {A, D}          {A, B, D}
Let us identify the states of this DFA using the names given below:

{A}              q0
{A, B}           q1
{A, C}           q2
{A, B, D}        q3
{A, C, D}        q4
{A, D}           q5

The transition diagram of this automata is shown in Figure 2.16.
Figure 2.16: DFA with the states renamed q0 through q5.
- The complement of the automata in Figure 2.16 is shown in Figure 2.17.
- Figure 2.17: Complement to Figure 2.16 automata.
After minimization, we get the DFA shown in Figure 2.18, because states q3, q4, and q5 are nondistinguishable. Hence, they get combined, and this combination becomes a dead state that can be eliminated.
- Figure 2.18: DFA after minimization.
- EXAMPLE 2.8
- Construct a finite automata that will accept those strings of decimal digits that are divisible by three (see Figure 2.19).
- Figure 2.19: Finite automata that accepts string decimals that are divisible by three.
- EXAMPLE 2.9
- Construct a finite automata that accepts all possible strings of zeros and ones that do not contain 011 as a substring.
Figure 2.20 shows a transition diagram of the automata that accepts the strings containing 011 as a substring.
Figure 2.20: Finite automata accepting strings containing 011.
A DFA equivalent to this NFA will be:

                    0               1
{A}                 {A, B}          {A}
{A, B}              {A, B}          {A, C}
{A, C}              {A, B}          {A, D}
{A, D}*             {A, B, D}       {A, D}
{A, B, D}*          {A, B, D}       {A, C, D}
{A, C, D}*          {A, B, D}       {A, D}
Let us identify the states of this DFA using the names given below:

{A}              q0
{A, B}           q1
{A, C}           q2
{A, D}           q3
{A, B, D}        q4
{A, C, D}        q5

The transition diagram of this automata is shown in Figure 2.21.
Figure 2.21: Finite automata with the states renamed q0 through q5.
The complement of the automata shown in Figure 2.21 is illustrated in Figure 2.22.
- Figure 2.22: Complement to Figure 2.21 automata.
- After minimization, we get the DFA shown in Figure 2.23, because the states q 3 , q 4 , and q 5 are nondistinguishable
- states. Hence, they get combined, and this combination becomes a dead state that can be eliminated.
- Figure 2.23: Minimization of nondistinguishable states of Figure 2.22.
- EXAMPLE 2.10
- Construct a finite automata that will accept those strings of a binary number that are divisible by three.
- The transition diagram of this automata is shown in Figure 2.24.
- Figure 2.24: Automata that accepts binary strings that are divisible by three.
- 2.8 REGULAR SETS AND REGULAR EXPRESSIONS
- 2.8.1 Regular Sets
- A regular set is a set of strings for which there exists some finite automata that accepts that set. That is, if R is a
- regular set, then R = L(M) for some finite automata M. Similarly, if M is a finite automata, then L(M) is always a regular
- set.
- 2.8.2 Regular Expression
- A regular expression is a notation to specify a regular set. Hence, for every regular expression, there exists a finite
- automata that accepts the language specified by the regular expression. Similarly, for every finite automata M, there
- exists a regular expression notation specifying L(M). Regular expressions and the regular sets they specify are shown
- in the following table.
Regular expression          Regular set
φ                           { }
∈                           { ∈ }
a, for every a in Σ         { a }
r1 + r2 (or r1 | r2)        R1 ∪ R2 (where R1 and R2 are the regular sets corresponding to r1 and r2); accepted by an automata built from N1 and N2, where N1 is a finite automata accepting R1 and N2 is a finite automata accepting R2
r1.r2                       R1.R2 (where R1 and R2 are the regular sets corresponding to r1 and r2); accepted by the series connection of N1 and N2
r*                          R* (where R is the regular set corresponding to r); accepted by an automata built from N, where N is a finite automata accepting R
Hence, we have only three regular-expression operators: | (or +) to denote union, the dot (.) for concatenation, and * for closure. The precedence of the operators, in decreasing order, is *, followed by ., followed by |. For example, consider the following regular expression:

a.(a + b)*.b.b

To construct a finite automata for this regular expression, we proceed as follows: the basic regular expressions involved are a and b, and we start with the automata for a and the automata for b. Since brackets are evaluated first, we initially construct the automata for a + b, using the automata for a and the automata for b, as shown in Figure 2.25.
- Figure 2.25: Transition diagram for (a + b).
- Since closure is required next, we construct the automata for (a + b)*, using the automata for a + b, as shown in
- Figure 2.26.
- Figure 2.26: Transition diagram for (a + b)*.
- The next step is concatenation. We construct the automata for a. (a + b)* using the automata for (a + b)* and a, as
- shown in Figure 2.27.
- Figure 2.27: Transition diagram for a. (a + b)*.
- Next we construct the automata for a.(a + b)*.b, as shown in Figure 2.28.
- Figure 2.28: Automata for a.(a + b)* .b.
- Finally, we construct the automata for a.(a + b)*.b.b (Figure 2.29).
- Figure 2.29: Automata for a.(a + b)*.b.b.
- This is an NFA with ∈ -moves, but an algorithm exists to transform the NFA to a DFA. So, we can obtain a DFA from
- this NFA.
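This stepwise construction (commonly known as Thompson's construction) can be sketched in Python as follows; None marks an ∈-move, and each fragment is a triple of a start state, a final state, and a set of transitions:

    # Regex-to-NFA construction sketch for a.(a + b)*.b.b.
    import itertools
    _ids = itertools.count()

    def literal(a):  # automata for a single symbol a
        s, f = next(_ids), next(_ids)
        return (s, f, {(s, a, f)})

    def union(n1, n2):  # automata for r1 + r2
        (s1, f1, t1), (s2, f2, t2) = n1, n2
        s, f = next(_ids), next(_ids)
        return (s, f, t1 | t2 | {(s, None, s1), (s, None, s2),
                                 (f1, None, f), (f2, None, f)})

    def concat(n1, n2):  # automata for r1.r2
        (s1, f1, t1), (s2, f2, t2) = n1, n2
        return (s1, f2, t1 | t2 | {(f1, None, s2)})

    def star(n):  # automata for r*
        (s1, f1, t1) = n
        s, f = next(_ids), next(_ids)
        return (s, f, t1 | {(s, None, s1), (f1, None, f),
                            (s, None, f), (f1, None, s1)})

    nfa = concat(concat(concat(literal("a"),
                               star(union(literal("a"), literal("b")))),
                        literal("b")), literal("b"))
    print(len(nfa[2]), "transitions in the resulting NFA with epsilon-moves")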
2.9 OBTAINING THE REGULAR EXPRESSION FROM THE FINITE AUTOMATA

Given a finite automata, to obtain a regular expression that specifies the regular set accepted by the given finite automata, the following steps are necessary:

1. Associate suitable variables (e.g., A, B, C, etc.) with the states of the finite automata.
2. Form a set of equations using the following rules:
   a. If there exists a transition from the state associated with variable A to the state associated with variable B on an input symbol a, then add the equation A = aB.
   b. If the state associated with variable A is a final state, add A = ∈ to the set of equations.
   c. If we have the two equations A = aB and A = bC, then they can be combined as A = aB | bC.
3. Solve these equations to get the value of the variable associated with the starting state of the automata. In order to solve these equations, it is necessary to bring each equation into the form:

S = aS | b

where S is a variable, and a and b are expressions that do not contain S. The solution to this equation is S = a*b. (Here, the concatenation operator between a* and b is not explicitly shown.) For example, consider the finite automata whose transition diagram is shown in Figure 2.30.
- Figure 2.30: Deriving the regular expression for a regular set.
- We use the names of the states of the automata as the variable names associated with the states.
- The set of equations obtained by the application of the rules are:
- To solve these equations, we do the substitution of (II) and (III) in (I), to obtain:
Therefore, the value of the variable S comes out to be:
- Therefore, the regular expression specifying the regular set accepted by the given finite automata is
- 2.10 LEXICAL ANALYZER DESIGN
Since the function of the lexical analyzer is to scan the source program and produce a stream of tokens as output, the issues involved in the design of a lexical analyzer are:

1. Identifying the tokens of the language for which the lexical analyzer is to be built, and specifying these tokens by using a suitable notation; and
2. Constructing a suitable recognizer for these tokens.
Therefore, the first thing that is required is to identify what the keywords are, what the operators are, and what the delimiters are. These are the tokens of the language. After identifying the tokens of the language, we must use suitable notation to specify these tokens. This notation should be compact, precise, and easy to understand. Regular expressions can be used to specify a set of strings, and a set of strings that can be specified by using regular-expression notation is called a "regular set." The tokens of a programming language constitute a regular set. Hence, this regular set can be specified by using regular-expression notation. Therefore, we write regular expressions for things like operators, keywords, and identifiers. For example, the regular expressions specifying the subset of tokens of a typical programming language are as follows:
operators = + | - | * | / | mod | div
keywords = if | while | do | then
letter = a | b | c | d | … | z | A | B | C | … | Z
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
identifier = letter (letter | digit)*
- The advantage of using regular-expression notation for specifying tokens is that when regular expressions are used,
- the recognizer for the tokens ends up being a DFA. Therefore, the next step is the construction of a DFA from the
- regular expression that specifies the tokens of the language. But the DFA is only a flow-chart (graphical) representation
- of the lexical analyzer. Therefore, after constructing the DFA, the next step is to write a program in a suitable
- programming language that will simulate the DFA. This program acts as a token recognizer or lexical analyzer. Therefore,
- we find that by using regular expressions for specifying the tokens, designing a lexical analyzer becomes a simple
- mechanical process that involves transforming regular expressions into finite automata and generating the program for
- simulating the finite automata.
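- To make the simulation step concrete, here is a minimal C sketch (ours, not the book's) of a hand-written function
- that simulates the two-state DFA for the pattern identifier = letter (letter | digit)* given above; the name
- match_identifier is our own choice:
- #include <ctype.h>
- #include <stdio.h>
- /* Returns 1 if s is an identifier, 0 otherwise.  State 0 is the start
-    state; state 1 is the (only) accepting state. */
- int match_identifier(const char *s)
- {
-     int state = 0;
-     for (; *s != '\0'; s++) {
-         switch (state) {
-         case 0:                          /* a letter must come first */
-             if (isalpha((unsigned char)*s)) state = 1;
-             else return 0;
-             break;
-         case 1:                          /* letters or digits may follow */
-             if (isalnum((unsigned char)*s)) state = 1;
-             else return 0;
-             break;
-         }
-     }
-     return state == 1;                   /* accept only if we ended in state 1 */
- }
- int main(void)
- {
-     printf("%d\n", match_identifier("count1"));   /* prints 1 */
-     printf("%d\n", match_identifier("1count"));   /* prints 0 */
-     return 0;
- }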
- Therefore, it is possible to automate the procedure of obtaining the lexical analyzer from the regular expressions
- specifying the tokens, and this is precisely what the tool LEX does. LEX is a compiler-writing tool that facilitates
- writing the lexical analyzer, and hence a compiler. It takes as input regular expressions that specify the tokens to
- be recognized, and it generates as output a C program that acts as a lexical analyzer for the tokens specified by the
- inputted regular expressions.
- 2.10.1 Format of the Input or Source File of LEX
- The LEX source file contains two things:
- 1. Auxiliary definitions, having the format: name regular-expression. The purpose of the auxiliary definitions is to
- give suitable names to the larger regular expressions. LEX makes use of the auxiliary definitions to replace the names
- used in the patterns with the corresponding regular expressions.
- 2. The translation rules, having the format: pattern {action}.
- The ‘pattern’ specification is a regular expression that specifies the tokens, and ‘{action}’ is a program fragment written
- in C to specify the action to be taken by the lexical analyzer generated by LEX when it encounters a string matching
- the pattern. Normally, the action taken by the lexical analyzer is to return a pair to the parser or syntax analyzer. The
- first member of the pair is a token, and the second member is the value or attribute of the token. For example, if the
- token is an identifier, then the value of the token is a pointer to the symbol-table record that contains the
- corresponding name of the identifier. Hence, the action taken by the lexical analyzer is to install the name in the
- symbol table and return the token as an id, and to set the value of the token as a pointer to the symbol table record
- where the name is installed. Consider the following sample source program:
- letter [a-zA-Z]
- digit [0-9]
- %%
- begin { return ("BEGIN"); }
- end { return ("END"); }
- if { return ("IF"); }
- {letter}({letter}|{digit})* { install( ); return ("identifier"); }
- "<" { return ("LT"); }
- "<=" { return ("LE"); }
- %%
- definition of install( )
- In the above specification, we find that the keyword ‘begin’ can be matched against two patterns: one specifying the
- keyword and the other specifying identifiers. In this case, the match is resolved in favor of whichever pattern comes
- first in the physical order of the specification. Hence, ‘begin’ will be recognized as a keyword and not as an identifier.
- Therefore, patterns that specify keywords of the language are required to be listed before the pattern specifying
- identifiers; otherwise, every keyword will get recognized as an identifier. A lexical analyzer generated by LEX always
- tries to recognize the longest prefix of the input as a token. Hence, if <= is read, it will be recognized as the token
- "LE," not "LT."
- 2.11 PROPERTIES OF REGULAR SETS
- Since the union of two regular sets is always a regular set, regular sets are closed under the union operation. Similarly,
- regular sets are closed under the concatenation and closure operations, because the concatenation of two regular sets is
- also a regular set, and the closure of a regular set is also a regular set.
- Regular sets are also closed under the complement operation, because if L(M) is a language accepted by a finite
- automata M, then the complement of L(M) is Σ* − L(M). If M is deterministic, with a transition defined for every state
- on every input symbol, and we make all final states of M nonfinal and all nonfinal states of M final, then the resulting
- automata accepts Σ* − L(M); hence, we conclude that the complement of L(M) is also a regular set. For example, consider
- the transition diagram in Figure 2.31.
- Figure 2.31: Transition diagram.
- The transition diagram of the complement to the automata shown in Figure 2.31 is shown in Figure 2.32.
- Figure 2.32: Complement to transition diagram in Figure 2.31.
- Since the regular sets are closed under the complement as well as union operations, they are closed under the
- intersection operation also, because intersection can be expressed in terms of union and complement operations, as
- shown below:
- L1 ∩ L2 = (L1′ ∪ L2′)′
- where L1′ denotes the complement of L1.
- An automata accepting L1 ∩ L2 is required to simulate, on an input string x, both the moves of an automata that
- accepts L1 and the moves of an automata that accepts L2. Hence, every state of the automata that accepts L1 ∩ L2 will
- be an ordered pair [p, q], where p is a state of the automata accepting L1 and q is a state of the automata accepting L2.
- Therefore, if M1 = (Q1, Σ, δ1, q1, F1) is an automata accepting L1, and if M2 = (Q2, Σ, δ2, q2, F2) is an automata
- accepting L2, then the automata accepting L1 ∩ L2 will be: M = (Q1 × Q2, Σ, δ, [q1, q2], F1 × F2), where
- δ([p, q], a) = [δ1(p, a), δ2(q, a)]. But all the members of Q1 × Q2 may not necessarily represent reachable states of M.
- Hence, to reduce the amount of work, we start with the pair [q1, q2] and find the transitions on every member of Σ
- from [q1, q2]. If some transition goes to a new pair, then we generate only that pair, because it represents a
- reachable state of M. We next consider the newly generated pairs to find the transitions from them. We continue this
- until no new pairs can be generated.
- Let M1 = (Q1, Σ, δ1, q1, F1) be an automata accepting L1, and let M2 = (Q2, Σ, δ2, q2, F2) be an automata accepting L2.
- M = (Q, Σ, δ, q0, F) will be an automata accepting L1 ∩ L2, where q0 = [q1, q2] and F = F1 × F2. The set Q of reachable
- states is computed as follows:
- begin
- Qold = Φ
- Qnew = { [q1, q2] }
- while (Qold ≠ Qnew)
- {
- Temp = Qnew − Qold
- Qold = Qnew
- for every pair [p, q] in Temp do
- for every a in Σ do
- Qnew = Qnew ∪ { δ([p, q], a) }
- }
- Q = Qnew
- end
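- The following is a minimal C sketch (ours, not the book's) of this reachable-pair computation for two small
- hypothetical DFAs over the alphabet { a, b }; the transition tables d1 and d2, and the encoding of a pair [p, q] as the
- integer p * N2 + q, are our own choices:
- #include <stdio.h>
- #define N1 2                 /* number of states of M1 */
- #define N2 2                 /* number of states of M2 */
- #define NSYM 2               /* input symbols: 0 = 'a', 1 = 'b' */
- int d1[N1][NSYM] = { {0, 1}, {1, 0} };   /* delta of a hypothetical M1 */
- int d2[N2][NSYM] = { {1, 0}, {0, 1} };   /* delta of a hypothetical M2 */
- int main(void)
- {
-     int seen[N1 * N2] = {0};
-     int work[N1 * N2], top = 0;
-     work[top++] = 0;         /* start from the pair [q1, q2] = [0, 0] */
-     seen[0] = 1;
-     while (top > 0) {
-         int pq = work[--top];
-         int p = pq / N2, q = pq % N2;
-         for (int a = 0; a < NSYM; a++) {
-             int next = d1[p][a] * N2 + d2[q][a];
-             printf("delta([%d,%d], %c) = [%d,%d]\n",
-                    p, q, a ? 'b' : 'a', d1[p][a], d2[q][a]);
-             if (!seen[next]) {           /* generate only new, reachable pairs */
-                 seen[next] = 1;
-                 work[top++] = next;
-             }
-         }
-     }
-     return 0;
- }
- Only pairs that are actually generated are ever processed, which is exactly the saving described above.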
- Consider the automatas and their transition diagrams shown in Figure 2.33 and Figure 2.34.
- Figure 2.33: Transition diagram of automata M 1 .
- Figure 2.34: Transition diagram of automata M 2 .
- The transition table for the automata accepting L(M 1 ) ∩ L(M 2 ) is:
- δ a b
- [1, 1] [1, 1] [2, 4]
- [2, 4] [3, 3] [4, 2]
- [3, 3] [2, 2] [1, 1]
- [4, 2] [1, 1] [2, 4]
- [2, 2] [3, 1] [4, 4]
- [3, 1] [2, 1] [1, 4]
- [4, 4] [1, 3] [2, 2]
- [2, 1] [3, 1] [4, 4]
- [1, 4]* [1, 3] [2, 2]
- [1, 3] [1, 2] [2, 1]
- [1, 2]* [1, 1] [2, 4]
- We associate the names with states of the automata obtained, as shown below:
- [1, 1] A
- [2, 4] B
- [3, 3] C
- [4, 2] D
- [2, 2] E
- [3, 1] F
- [4, 4] G
- [2, 1] H
- [1, 4] I
- [1, 3] J
- [1, 2] K
- The transition table of the automata using the names associated above is:
- δ a b
- A A B
- B C D
- C E A
- D A B
- E F G
- F H I
- G J E
- H F G
- I* J E
- J K H
- K* A B
- 2.12 EQUIVALENCE OF TWO AUTOMATAS
- Automatas M1 and M2 are said to be equivalent if they accept the same language; that is, L(M1) = L(M2). It is possible
- to test whether the automatas M1 and M2 accept the same language, and hence whether or not they are equivalent. One
- method of doing this is to minimize both M1 and M2; if the minimal state automatas obtained from M1 and M2 are
- identical, then M1 is equivalent to M2.
- Another method to test whether or not M1 is equivalent to M2 is to find out if:
- (L(M1) ∩ L(M2)′) ∪ (L(M2) ∩ L(M1)′) = Φ
- For this, complement M2, and construct an automata that accepts the intersection of the language accepted by M1 and
- the complement of L(M2); the product construction of Section 2.11 can be used for this. If this automata accepts an
- empty set, then it means that there is no string acceptable to M1 that is not acceptable to M2. Similarly, construct an
- automata that accepts the intersection of the language accepted by M2 and the complement of L(M1). If this automata
- also accepts an empty set, then it means that there is no string acceptable to M2 that is not acceptable to M1. Hence,
- the language accepted by M1 is the same as the language accepted by M2.
- Chapter 3: Context-Free Grammar and Syntax Analysis
- 3.1 SYNTAX ANALYSIS
- In the syntax-analysis phase, a compiler verifies whether or not the tokens generated by the lexical analyzer are
- grouped according to the syntactic rules of the language. If the tokens in a string are grouped according to the
- language's rules of syntax, then the string of tokens generated by the lexical analyzer is accepted as a valid construct
- of the language; otherwise, an error handler is called. Hence, two issues are involved when designing the
- syntax-analysis phase of a compilation process:
- 1. All valid constructs of a programming language must be specified; by using these specifications, a valid program is
- formed. That is, we form a specification of what tokens the lexical analyzer will return, and we specify in what manner
- these tokens are to be grouped so that the result of the grouping will be a valid construct of the language.
- 2. A suitable recognizer must be designed to recognize whether a string of tokens generated by the lexical analyzer is
- a valid construct or not.
- Therefore, suitable notation must be used to specify the constructs of a language. The notation for the construct
- specifications should be compact, precise, and easy to understand. The syntax-structure specification for a
- programming language (i.e., the valid constructs of the language) uses context-free grammar (CFG), because for
- certain classes of grammar, we can automatically construct an efficient parser that determines whether a source
- program is syntactically correct. Hence, CFG notation is a required topic of study.
- 3.2 CONTEXT-FREE GRAMMAR
- CFG notation specifies a context-free language that consists of terminals, nonterminals, a start symbol, and
- productions. The terminals are nothing more than tokens of the language, used to form the language constructs.
- Nonterminals are the variables that denote a set of strings. For example, S and E are nonterminals that denote
- statement strings and expression strings, respectively, in a typical programming language. The nonterminals define
- the sets of strings that are used to define the language generated by the grammar.
- They also impose a hierarchical structure on the language, which is useful for both syntax analysis and translation.
- Grammar productions specify the manner in which the terminals and string sets, defined by the nonterminals, can be
- combined to form a set of strings defined by a particular nonterminal. For example, consider the production S → aSb.
- This production specifies that the set of strings defined by the nonterminal S are obtained by concatenating terminal a
- with any string belonging to the set of strings defined by nonterminal S, and then with terminal b. Each production
- consists of a nonterminal on the left-hand side, and a string of terminals and nonterminals on the right-hand side. The
- left-hand side of a production is separated from the right-hand side using the " → " symbol, which is used to identify a
- relation on a set (V ∪ T)*.
- Therefore, a context-free grammar is a four-tuple denoted as:
- G = (V, T, P, S)
- where:
- 1. V is a finite set of symbols called nonterminals or variables,
- 2. T is a finite set of symbols called terminals,
- 3. P is a set of productions, and
- 4. S is a member of V, called the start symbol.
- For example:
- 3.2.1 Derivation
- Derivation refers to replacing an instance of a nonterminal in a given string by the right-hand side of a production
- rule whose left-hand side is that nonterminal. Derivation produces a new string from a given string; therefore,
- derivation can be applied repeatedly to obtain a new string from a given string. If the string obtained as a result
- of the derivation contains only terminal symbols, then no further derivations are possible. For example, consider the
- grammar G = ({S}, {a, b}, P, S), where P contains the following productions:
- S → aSa
- S → bSb
- S → ∈
- It is possible to replace the nonterminal S by the string aSa. Therefore, we obtain aSa from S by deriving S to aSa.
- It is then possible to replace S in aSa by ∈ to obtain the string aa, which cannot be derived further.
- If α1 and α2 are two strings, and if α2 can be obtained from α1, then we say that α1 is related to α2 by the "derives
- to" relation, which is denoted by "→". Hence, we write α1 → α2, which translates to: α1 derives to α2. The symbol →
- denotes a derives-to relation that relates two strings α1 and α2 such that α2 is a direct derivative of α1 (i.e., α2
- can be obtained from α1 by a derivation of only one step). The symbol →+ will denote the transitive closure of the
- derives-to relation; if we have two strings α1 and α2 such that α2 can be obtained from α1 by derivation, but α2 may
- not be a direct derivative of α1, then we write α1 →+ α2, which translates to: α1 derives to α2 through one or more
- derivations.
- Similarly, →* denotes the reflexive transitive closure of the derives-to relation; if we have two strings α1 and α2
- such that α1 derives to α2 in zero, one, or more derivations, then we write α1 →* α2. For example, in the grammar
- above, we find that S → aSa → abSba → abba. Therefore, we can write S →+ abba.
- The language defined by a CFG is nothing but the set of strings of terminals that can be generated from the start
- symbol S as a result of derivations using productions of the grammar. Hence, it is defined as the set of those strings
- of terminals that are derivable from the grammar's start symbol. Therefore, if G = (V, T, P, S) is a grammar, then the
- language generated by the grammar is denoted as L(G) and defined as:
- L(G) = { w | w is in T* and S →* w }
- The above grammar can generate the strings ∈, aa, bb, abba, ..., but not aba.
- 3.2.2 Standard Notation
- 1. The capital letters toward the start of the alphabet are used to denote nonterminals (e.g., A, B, C, etc.).
- 2. Lowercase letters toward the start of the alphabet are used to denote terminals (e.g., a, b, c, etc.).
- 3. S is used to denote the start symbol.
- 4. Lowercase letters toward the end of the alphabet (e.g., u, v, w, etc.) are used to denote strings of terminals.
- 5. The symbols α, β, γ, and so forth are used to denote strings of terminals as well as strings of nonterminals.
- 6. The capital letters toward the end of the alphabet (e.g., X, Y, and Z) are used to denote grammar symbols, which may
- be terminals or nonterminals.
- The benefit of using these notations is that it is not required to explicitly specify all four grammar components. A
- grammar can be specified by only giving the list of productions; and from this list, we can easily get information about
- the terminals, nonterminals, and start symbol of the grammar.
- 3.2.3 Derivation Tree or Parse Tree
- When deriving a string w from S, if every derivation is considered to be a step in the tree construction, then we get the
- graphical display of the derivation of string w as a tree. This is called a "derivation tree" or a "parse tree" of string w.
- Therefore, a derivation tree or parse tree is the display of the derivations as a tree. Note that a tree is a derivation tree
- if it satisfies the following requirements:
- 1. All the leaf nodes of the tree are labeled by the terminals of the grammar.
- 2. The root node of the tree is labeled by the start symbol of the grammar.
- 3. The interior nodes are labeled by the nonterminals.
- 4. If an interior node has a label A, and it has n descendents with labels X1, X2, ..., Xn from left to right, then the
- production rule A → X1X2...Xn must exist in the grammar.
- For example, consider a grammar whose list of productions is:
- The tree shown in Figure 3.1 is a derivation tree for a string id + id * id.
- Figure 3.1: Derivation tree for the string id + id * id.
- Given a parse (derivation) tree, a string whose derivation is represented by the given tree is one obtained by
- concatenating the labels of the leaf nodes of the parse tree in a left-to-right order.
- Consider the parse tree shown in Figure 3.2. A string whose derivation is represented by this parse tree is abba.
- Figure 3.2: Parse tree resulting from leaf-node concatenation.
- Since a parse tree displays derivations as a tree, given a grammar G = (V, T, P, S), for every w in T* that is derivable
- from S, there exists a parse tree displaying the derivation of w as a tree. Therefore, we can define the language
- generated by the grammar as:
- L(G) = { w | w is in T*, and there exists a parse tree for w in G }
- For some w in L(G), there may exist more than one parse tree. That means that more than one way may exist to
- derive w from S, using the productions of the grammar. For example, consider a grammar having the productions
- listed below:
- We find that for a string id + id* id, there exists more than one parse tree, as shown in Figure 3.3.
- Figure 3.3: Multiple parse trees.
- If more than one parse tree exists for some w in L(G), then G is said to be an "ambiguous" grammar. Therefore, the
- grammar having the productions E → E + E | E * E | id is an ambiguous grammar, because there exists more than one
- parse tree for the string id + id * id in L(G) of this grammar.
- Consider a grammar having the following productions:
- This grammar is also an ambiguous grammar, because more than one parse tree exists for a string abab in L(G), as
- shown in Figure 3.4.
- Figure 3.4: Ambiguous grammar parse trees.
- The parse tree construction process is such that the order in which the nonterminals are considered for replacement
- does not matter. That is, given a string w, the parse tree for that string (if it exists) can be constructed by considering
- the nonterminals for derivation in any order. The two specific orders of derivation, which are important from the point of
- view of parsing, are:
- 1. Left-most order of derivation
- 2. Right-most order of derivation
- The left-most order of derivation is that order of derivation in which a left-most nonterminal is considered first for
- derivation at every stage in the derivation process. For example, one of the left-most orders of derivation for a string id
- + id * id is:
- In a right-most order of derivation, the right-most nonterminal is considered first. For example, one of the right-most
- orders of derivation for id + id* id is:
- The parse tree generated by using the left-most order of derivation of id + id*id and the parse tree generated by using
- the right-most order of derivation of id + id*id are the same; hence, these orders are equivalent. A parse tree
- generated using these orders is shown in Figure 3.5.
- Figure 3.5: Parse tree generated by using both the right- and left-most derivation orders.
- Another left-most order of derivation of id + id* id is given below:
- And here is another right-most order of derivation of id + id*id:
- The parse tree generated by using the left-most order of derivation of id + id* id and the parse tree generated using the
- right-most order of derivation of id + id* id are the same. Hence, these orders are equivalent. A parse tree generated
- using these orders is shown in Figure 3.6.
- Figure 3.6: Parse tree generated from both the left- and right-most orders of derivation.
- Therefore, we conclude that for every left-most order of derivation of a string w, there exists an equivalent right-most
- order of derivation of w, generating the same parse tree.
- Note If a grammar G is unambiguous, then for every w in L(G), there exists exactly one parse tree. Hence, there exists
- exactly one left-most order of derivation and (equivalently) one right-most order of derivation for every w in L(G).
- But if grammar G is ambiguous, then for some w in L(G), there exists more than one parse tree. Therefore, there
- is more than one left-most order of derivation; and equivalently, there is more than one right-most order of
- derivation.
- 3.2.4 Reduction of Grammar
- Reduction of a grammar refers to the identification of those grammar symbols (called "useless grammar symbols"),
- and hence those productions, that do not play any role in the derivation of any w in L(G), and which we eliminate from
- the grammar. This has no effect on the language generated by the grammar. For example, a grammar symbol X is
- useful if and only if:
- 1. It derives to a string of terminals, and
- 2. It is used in the derivation of at least one w in L(G).
- Thus, X is useful if and only if:
- 1. X →* w, where w is in T*, and
- 2. S →* αXβ →* w, where w is in L(G).
- Therefore, reduction of a given grammar G involves:
- 1. Identification of those grammar symbols that are not capable of deriving to a w in T*, and eliminating them from the
- grammar; and
- 2. Identification of those grammar symbols that are not used in any derivation, and eliminating them from the grammar.
- When identifying the grammar symbols that do not derive a w in T*, only nonterminals need be tested, because every
- terminal member of T will also be in T*; by default, they satisfy the first condition. A simple, iterative algorithm can
- be used to identify those nonterminals that do not derive to a w in T*: we start with those productions that are of the
- form A → w, that is, those productions whose right side is a w in T*. We mark every nonterminal A on the left side of
- such a production as capable of deriving to a w in T*, and then we consider every production of the form A → X1X2...Xn,
- where A is not yet marked. If every Xi (for 1 <= i <= n) is either a terminal or a nonterminal that is already marked,
- then we mark A (the nonterminal on the left side of the production).
- We repeat this process until no new nonterminals can be marked. The nonterminals that are not marked are those not
- deriving to a w in T*. After identifying the nonterminals that do not derive to a w in T*, we eliminate all productions
- containing these nonterminals in order to obtain a grammar that does not contain them. The algorithm for identifying as
- well as eliminating the nonterminals that do not derive to a w in T* is given below:
- Input: G = (V, T, P, S)
- Output: G1 = (V1, T, P1, S)
- { where V1 is the set of nonterminals deriving to w in T*; we maintain V1old and V1new to continue
- iterations; and P1 is the set of productions that do not contain nonterminals that do not derive to w in T* }
- Let U be the set of nonterminals that are not capable of deriving to w in T*. Then,
- begin
- V1old = φ
- V1new = φ
- for every production of the form A → w do
- V1new = V1new ∪ { A }
- while (V1old ≠ V1new) do
- begin
- temp = V − V1new
- V1old = V1new
- for every A in temp do
- for every A-production of the form A → X1X2...Xn in P do
- if each Xi is either in T or in V1old then
- begin
- V1new = V1new ∪ { A }
- break;
- end
- end
- V1 = V1new
- U = V − V1
- for every production in P do
- if it does not contain a member of U then
- add the production to P1
- end
- If S is itself a useless nonterminal, then the reduced grammar is a ‘null’ grammar.
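- This marking loop can be written compactly in C. The sketch below (ours, not the book's) runs it on the grammar of
- Example 3.1 further below; since the text shows only that both B-productions contain B on their right sides, the
- B-productions here are assumed to be B → aB | bB for illustration:
- #include <stdio.h>
- #include <ctype.h>
- /* Productions: upper-case letters are nonterminals, lower-case letters are terminals. */
- static const char *lhs[] = { "S", "A", "A", "B", "B", "C" };
- static const char *rhs[] = { "AC", "bASC", "a", "aB", "bB", "ad" };
- #define NP 6
- int main(void)
- {
-     int marked[26] = {0};    /* marked[X - 'A'] == 1 if X derives a terminal string */
-     int changed = 1;
-     while (changed) {        /* iterate until no new nonterminal can be marked */
-         changed = 0;
-         for (int i = 0; i < NP; i++) {
-             if (marked[lhs[i][0] - 'A']) continue;
-             int ok = 1;
-             for (const char *p = rhs[i]; *p; p++)
-                 if (isupper((unsigned char)*p) && !marked[*p - 'A'])
-                     ok = 0;  /* an unmarked nonterminal blocks this production */
-             if (ok) { marked[lhs[i][0] - 'A'] = 1; changed = 1; }
-         }
-     }
-     for (char c = 'A'; c <= 'Z'; c++)
-         if (marked[c - 'A']) printf("%c derives a terminal string\n", c);
-     return 0;
- }
- Running it marks S, A, and C but never B, matching the discussion in Example 3.1.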
- When identifying the grammar symbols that are not used in the derivation of any w in L(G), terminals as well as
- nonterminals must be tested. A simple, iterative algorithm can be used to identify those grammar symbols that are not
- used in the derivation of any w in L(G): we start with S-productions and mark every grammar symbol X on the right
- side of every S-production. We then consider every production of the form A → X 1 X 2 … X n , where A is an
- already-marked nonterminal; and we mark every X on the right side of these productions. We repeat this process until
- no new nonterminals can be marked. We do not mark any terminals or nonterminals not used in the derivation of any
- w in L(G). After identifying the terminals and nonterminals not used in the derivation of any w in L(G), we eliminate all
- productions containing them; thus, we obtain a grammar that does not contain any useless symbols; hence, a reduced
- grammar.
- The algorithm for identifying as well as eliminating grammar symbols that are not used in the derivation of any w in
- L(G) is given below:
- Input: G1 = (V1, T, P1, S)
- { the grammar obtained after elimination of the nonterminals not deriving to w in T* }
- Output: G2 = (V2, T2, P2, S)
- { where V2 is the set of nonterminals used in the derivation of some w in L(G), T2 is the set of terminals
- used in the derivation of some w in L(G), and P2 is the set of productions containing only the members of
- V2 and T2. We maintain V2old and V2new to continue iterations }
- begin
- T2 = φ
- V2old = φ
- P2 = φ
- V2new = { S }
- while (V2old ≠ V2new) do
- begin
- temp = V2new − V2old
- V2old = V2new
- for every A in temp do
- for every A-production of the form A → X1X2...Xn in P1 do
- for each Xi (1 <= i <= n) do
- begin
- if (Xi is in V1) then
- V2new = V2new ∪ { Xi }
- if (Xi is in T) then
- T2 = T2 ∪ { Xi }
- end
- end
- V2 = V2new
- temp1 = V1 − V2
- temp2 = T − T2
- for every production in P1 do
- add the production to P2 if it does not contain a member of temp1 or temp2
- G2 = (V2, T2, P2, S)
- end
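- In the same style, here is a C sketch (ours, not the book's) of the reachability marking, run on the reduced grammar
- of Example 3.1 (S → AC, A → bASC | a, C → ad):
- #include <stdio.h>
- #include <ctype.h>
- static const char *lhs[] = { "S", "A", "A", "C" };
- static const char *rhs[] = { "AC", "bASC", "a", "ad" };
- #define NP 4
- int main(void)
- {
-     int reach[128] = {0};
-     reach['S'] = 1;          /* the start symbol is always reachable */
-     int changed = 1;
-     while (changed) {
-         changed = 0;
-         for (int i = 0; i < NP; i++) {
-             if (!reach[(int)lhs[i][0]]) continue;  /* expand only reached nonterminals */
-             for (const char *p = rhs[i]; *p; p++)
-                 if (!reach[(int)*p]) { reach[(int)*p] = 1; changed = 1; }
-         }
-     }
-     for (int c = 'A'; c < 128; c++)
-         if (reach[c] && isalpha(c)) printf("%c is used in some derivation\n", c);
-     return 0;
- }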
- EXAMPLE 3.1
- Find the reduced grammar equivalent to CFG
- where P contains
- Since the productions A → a and C → ad exist in the form A → w, the nonterminals A and C are derivable to w in T*.
- The production S → AC also exists, the right side of which contains the nonterminals A and C, which are derivable to
- w in T*. Hence, S is also derivable to w in T*. But since the right side of both of the B-productions contains B, the
- nonterminal B is not derivable to w in T*.
- Hence, B can be eliminated from the grammar, and the following grammar is obtained:
- where P 1 contains
- Since the right side of the S-production of this grammar contains the nonterminals A and C, both A and C will be used
- in the derivation of some w in L(G). Similarly, the right side of the A-production contains bASC and a; hence, the
- terminals a and b will be used. The right side of the C-production contains ad, so the terminal d will also be useful.
- Therefore, every terminal as well as every nonterminal in G1 is useful. So the reduced grammar is:
- where P 1 contains
- 3.2.5 Useless Grammar Symbols
- A grammar symbol is a useless grammar symbol if it does not satisfy either of the following conditions:
- 1. X →* w, where w is in T*, and
- 2. S →* αXβ →* w, where w is in L(G).
- That is, a grammar symbol X is useless if it does not derive to terminal strings. And even if it does derive to a string of
- terminals, X is a useless grammar symbol if it does not occur in a derivation sequence of any w in L(G). For example,
- consider the following grammar:
- First, we find those nonterminals that do not derive to a string of terminals, so that they can be separated out. The
- nonterminals A and X directly derive to strings of terminals, because the productions A → q and X → ad exist in the
- grammar. There also exists a production S → bX, where b is a terminal and X is a nonterminal that is already known to
- derive to a string of terminals. Therefore, S also derives to a string of terminals, and the nonterminals that are
- capable of deriving to a string of terminals are: S, A, and X. B ends up being a useless nonterminal; therefore, the
- productions containing B can be eliminated from the given grammar to obtain the grammar given below:
- We next find in the grammar obtained those terminals and nonterminals that occur in the derivation sequence of some
- w in L(G). Since every derivation sequence starts with S, S will always occur in the derivation sequence of every w in
- L(G). We then consider those productions whose left-hand side is S, such as S → bX. Since the right side of this
- production contains a terminal b and a nonterminal X, we conclude that the terminal b will occur in the derivation
- sequence, and the nonterminal X will also occur in the derivation sequence. Therefore, we next consider those
- productions whose left-hand side is the nonterminal X. The only such production is X → ad. Since the right side of
- this production contains the terminals a and d, these terminals will occur in the derivation sequence. But since no
- new nonterminal is
- found, we conclude that the nonterminals S and X, and the terminals a, b, and d are the grammar symbols that can
- occur in the derivation sequence. Therefore, we conclude that the nonterminal A will be a useless nonterminal, even
- though it derives to the string of terminals. So we eliminate the productions containing A to obtain a reduced grammar,
- given below:
- EXAMPLE 3.2
- Consider the following grammar, and obtain an equivalent grammar containing no useless grammar symbols.
- Since A → xyz and Z → z are the productions of the form A → w, where w is in T *, nonterminals A and Z are capable
- of deriving to w in T *. There are two X-productions: X → Xz and X → xYx. The right side of these productions contain
- nonterminals X and Y, respectively. Similarly, there are two Y-productions: Y → yYy and Y → XZ. The right side of
- these productions contain nonterminals Y and X, respectively. Hence, both X and Y are not capable of deriving to w in
- T *. Therefore, by eliminating the productions containing X and Y, we get:
- Since A is the start symbol, it will always be used in the derivation of every w in L(G). And since A → xyz is a
- production in the grammar, the terminals x, y, and z will also be used in the derivation. But the nonterminal Z does
- not occur on the right side of the A-production, so Z will not be used in the derivation of any w in L(G). Hence, by
- eliminating the productions containing nonterminal Z, we get:
- which is a grammar containing no useless grammar symbols.
- EXAMPLE 3.3
- Find the reduced grammar that is equivalent to the CFG given below:
- Since C → ad is the production of the form A → w, where w is in T *, nonterminal C is capable of deriving to w in T *.
- The production S → aC contains a terminal a on the right side as well as a nonterminal C that is known to be capable
- of deriving to w in T *.
- Hence, nonterminal S is also capable of deriving to w in T *. The right side of the production A → bSCa contains the
- nonterminals S and C, which are known to be capable of deriving to w in T *. Hence, nonterminal A is also capable of
- deriving to w in T *. There are two B-productions: B → aSB and B → bBC. The right side of these productions contain
- the nonterminals S, B, and C; and even though S and C are known to be capable of deriving to w in T *, nonterminal B
- is not. Hence, by eliminating the productions containing B, we get:
- Since S is the start symbol, it will always be used in the derivation of every w in L(G). And since S → aC is a
- production in the grammar, the terminal a as well as the nonterminal C will also be used in the derivation. Since the
- nonterminal C occurs on the right side of the S-production, and C → ad is a production, the terminal d will be used
- along with the terminal a in the derivation. The nonterminal A, though, occurs nowhere on the right side of either
- the S-production or the C-production; so it will not be used in the derivation of any w in L(G). Hence, by eliminating
- the productions containing nonterminal A, we get:
- which is a reduced grammar equivalent to the given grammar and contains no useless grammar symbols.
- EXAMPLE 3.4
- Find the useless symbols in the following grammar, and modify the grammar so that it has no useless symbols.
- Since S → 0 and B → 1 are productions of the form A → w, where w is in T *, the nonterminals S and B are capable of
- deriving to w in T *. The production A → AB contains the nonterminals A and B on the right side; and even though B is
- known to be capable of deriving to w in T *, nonterminal A is not capable of deriving to w in T *. Therefore, by
- eliminating the productions containing A, we get:
- Since S is the start symbol, it will always be used in the derivation of any w in L(G). And because S → 0 is a
- production in the grammar, the terminal 0 will also be used in the derivation. But since the nonterminal B does not
- occur anywhere on the right side of the S-production, it will not be used in the derivation of any w in L(G). Hence,
- by eliminating the productions containing nonterminal B, we get:
- which is a grammar equivalent to the given grammar and contains no useless grammar symbols.
- EXAMPLE 3.5
- Find the useless symbols in the following grammar, and modify the grammar to obtain one that has no useless
- symbols.
- Since A → a and C → b are productions of the form A → w, where w is in T *, the nonterminals A and C are capable of
- deriving to w in T *. The right side of the production S → CA contains nonterminals C and A, both of which are known
- to be derivable to w in T *.
- Hence, S is also capable of deriving to w in T *. There are two B-productions, B → BC and B → AB. The right side of
- these productions contain the nonterminals A, B, and C. Even though A and C are known to be capable of deriving to
- w in T *, nonterminal B is not capable of deriving to w in T *. Therefore, by eliminating the productions containing B, we
- get:
- Since S is a start symbol, it will always be used in the derivation of every w in L(G). And since S → CA is a production
- in the grammar, nonterminals C and A will both be used in the derivation. For the productions A → a and C → b, the
- terminals a and b will also be used in the derivation. Hence, every grammar symbol in the above grammar is useful.
- Therefore, a grammar equivalent to the given grammar that contains no useless grammar symbols is:
- 3.2.6 ∈ -Productions and Nullable Nonterminals
- A production of the form A → ∈ is called an "∈-production". If A is a nonterminal, and if A →* ∈ (i.e., if A derives
- to an empty string in zero, one, or more derivations), then A is called a "nullable nonterminal".
- Algorithm for Identifying Nullable Nonterminals
- Input: G = (V, T, P, S)
- Output: Set N (i.e., the set of nullable nonterminals)
- { we maintain Nold and Nnew to continue iterations }
- begin
- Nold = φ
- Nnew = φ
- for every production of the form A → ∈ do
- Nnew = Nnew ∪ { A }
- while (Nold ≠ Nnew) do
- begin
- temp = V − Nnew
- Nold = Nnew
- for every A in temp do
- for every A-production of the form A → X1X2...Xn in P do
- if each Xi is in Nold then
- Nnew = Nnew ∪ { A }
- end
- N = Nnew
- end
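- The same fixpoint style gives a direct implementation. In the following C sketch (ours, not the book's), the character
- '@' stands for ∈ on a right-hand side, and the grammar is a hypothetical one, S → AB, A → aA | @, B → bB | @, chosen
- only for illustration:
- #include <stdio.h>
- #include <ctype.h>
- static const char *lhs[] = { "S", "A", "A", "B", "B" };
- static const char *rhs[] = { "AB", "aA", "@", "bB", "@" };
- #define NP 5
- int main(void)
- {
-     int nullable[26] = {0};
-     int changed = 1;
-     while (changed) {
-         changed = 0;
-         for (int i = 0; i < NP; i++) {
-             if (nullable[lhs[i][0] - 'A']) continue;
-             int allnull = 1;
-             for (const char *p = rhs[i]; *p; p++)
-                 if (*p != '@' && !(isupper((unsigned char)*p) && nullable[*p - 'A']))
-                     allnull = 0;   /* a terminal or non-nullable nonterminal blocks it */
-             if (allnull) { nullable[lhs[i][0] - 'A'] = 1; changed = 1; }
-         }
-     }
-     for (char c = 'A'; c <= 'Z'; c++)
-         if (nullable[c - 'A']) printf("%c is nullable\n", c);
-     return 0;
- }
- Here A and B are marked in the first pass (they have ∈-productions), and S is marked in the second pass because every
- symbol of its right side AB is already nullable.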
- EXAMPLE 3.6
- Consider the following grammar and identify the nullable nonterminals.
- By applying the above algorithm, the results after each iteration are shown below:
- Initially:
- After the first execution of the for loop:
- After the first iteration of the while loop:
- After the second iteration of the while loop:
- After the third iteration of the while loop:
- Therefore, N = { S, A, B, C }; and hence, all the nonterminals of the grammar are nullable.
- 3.2.7 Eliminating ∈ -Productions
- Given a grammar G that contains ∈-productions, if L(G) does not contain ∈, then it is possible to eliminate all
- ∈-productions from the given grammar G; whereas if L(G) contains ∈, then eliminating all ∈-productions from G gives a
- grammar G1 in which L(G1) = L(G) − { ∈ }. To eliminate the ∈-productions from a grammar, we use the following technique.
- If A → ∈ is an ∈-production to be eliminated, then we look for all those productions in the grammar whose right side
- contains A, and we add a copy of each such production with the occurrence of A erased. (If a right side contains more
- than one occurrence of A, a copy must be added for every combination of erased and retained occurrences.) Thus, we
- obtain the non-∈-productions to be added to the grammar so that the language generated remains the same. For example,
- consider the following grammar:
- To eliminate A → ∈ from the above grammar, we erase A from the right side of the production S → aA and obtain a
- non-∈-production, S → a, which is added to the grammar in order to keep the language generated by the grammar the
- same. Therefore, the ∈-free grammar equivalent to the given grammar is:
- EXAMPLE 3.7
- Consider the following grammar, and eliminate all the ∈ -productions from the grammar without changing the language
- generated by the grammar.
- To eliminate A → ∈ from this grammar, the non- ∈ -productions to be added are obtained as follows: the list of the
- productions containing A on the right-hand side is:
- Replace each occurrence of A in each of these productions in order to obtain the non- ∈ -productions to be added to
- the grammar. The list of these productions is:
- Add these productions to the grammar, and eliminate A → ∈ from the grammar. This gives us the following grammar:
- To eliminate B → ∈ from the grammar, the non- ∈ -productions to be added are obtained as follows. The productions
- containing B on the right-hand side are:
- Replace each occurrence of B in these productions in order to obtain the non- ∈ -productions to be added to the
- grammar. The list of these productions is:
- Add these productions to the grammar, and eliminate B → ∈ from the grammar in order to obtain the following:
- EXAMPLE 3.8
- Consider the following grammar and eliminate all the ∈ -productions without changing the language generated by the
- grammar.
- To eliminate A → ∈ from the grammar, the non-∈-productions to be added are obtained as follows: the list of
- productions containing A on the right is:
- Replace each occurrence of A in these productions to obtain the non-∈-productions to be added to the grammar. These
- are:
- Add these productions to the grammar, and eliminate A → ∈ from the grammar to obtain the following:
- 3.2.8 Eliminating Unit Productions
- A production of the form A → B, where A and B are both nonterminals, is called a "unit production". Unit productions in
- the grammar increase the cost of derivations. The following algorithm can be used to eliminate unit productions from
- the grammar:
- while there exists a unit production A → B in the grammar do
- {
- select a unit production A → B such that there exists at least one nonunit production B → α
- for every nonunit production B → α do
- add the production A → α to the grammar
- eliminate A → B from the grammar
- }
- EXAMPLE 3.9
- Given the grammar shown below, eliminate all the unit productions from the grammar.
- The given grammar contains the productions:
- which are the unit productions. To eliminate these productions from the given grammar, we first select the unit
- production B → C. But since no nonunit C-productions exist in the grammar, we then select C → D. But since no
- nonunit D-productions exist in the grammar, we next select D → E. There does exist a nonunit E-production: E → a.
- Hence, we add D → a to the grammar and eliminate D → E. But since B → C and C → D are still there, we once again
- select unit production B → C. Since no nonunit C-production exists in the grammar, we select C → D. Now there exists
- a nonunit production D → a in the grammar. Hence, we add C → a to the grammar and eliminate C → D. But since B
- → C is still there in the grammar, we once again select unit production B → C. Now there exists a nonunit production C
- → a in the grammar, so we add B → a to the grammar and eliminate B → C. Now no unit productions exist in the
- grammar. Therefore, the grammar that we get that does not contain unit productions is:
- But we see that the grammar symbols C, D, and E become useless as a result of the elimination of unit productions,
- because they will not be used in the derivation of any w in L(G). Hence, we can eliminate them from the grammar to
- obtain:
- Therefore, we conclude that to obtain the grammar in the most simplified form, we have to eliminate unit productions
- first. We then eliminate the useless grammar symbols.
- 3.2.9 Eliminating Left Recursion
- If a grammar contains a pair of productions of the form A → Aα | β, then the grammar is a "left-recursive grammar". If
- a left-recursive grammar is used for specification of the language, then a top-down parser for that language may enter
- into an infinite loop during the parsing process on some erroneous input. This is because a top-down parser attempts to
- obtain the left-most derivation of the input string w; hence, the parser may see the same nonterminal A every time as
- the left-most nonterminal, and every time it may do the derivation using A → Aα. Therefore, for top-down parsing, a
- nonleft-recursive grammar should be used. Left-recursion can be eliminated from the grammar by replacing A → Aα | β
- with the productions A → βB and B → αB | ∈, where B is a new nonterminal. In general, if a grammar contains the
- productions:
- A → Aα1 | Aα2 | ... | Aαm | β1 | β2 | ... | βn
- then the left-recursion can be eliminated by adding the following productions in place of the ones above:
- A → β1B | β2B | ... | βnB
- B → α1B | α2B | ... | αmB | ∈
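- As a concrete illustration (using the classic expression grammar, which is not one of the book's examples below): in
- the pair E → E + T | T we have α = + T and β = T, so the pair is replaced by E → TE1 and E1 → + TE1 | ∈, where E1 is a
- new nonterminal. Both versions generate T, T + T, T + T + T, and so on, but the second one is no longer left-recursive.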
- EXAMPLE 3.10
- Consider the following grammar:
- The grammar is left-recursive because it contains a pair of productions, B → Bb | c. To eliminate the left-recursion from
- the grammar, replace this pair of productions with the following productions:
- Therefore, the grammar that we get after the elimination of left-recursion is:
- EXAMPLE 3.11
- Consider the following grammar:
- The grammar is left-recursive because it contains the productions A → Ad | Ae | aB | aC. To eliminate the left-recursion
- from the grammar, replace these productions by the following productions:
- Therefore, the resulting grammar after the elimination of left-recursion is:
- EXAMPLE 3.12
- Consider the following grammar:
- The grammar is left-recursive because it contains the productions L → L, S | S. To eliminate the left-recursion from the
- grammar, replace these productions by the following productions:
- Therefore, after the elimination of left-recursion, we get:
- 3.3 REGULAR GRAMMAR
- Regular grammar is a context-free grammar in which every production is restricted to one of the following forms:
- 1. A → aB, or
- 2. A → w,
- where A and B are nonterminals, a is a terminal symbol, and w is in T*.
- The ∈-productions are permitted as a special case, when L(G) contains ∈. This grammar is called a "regular grammar,"
- because if the format of every production in a CFG is restricted to A → aB or A → a, then the grammar can specify only
- regular sets. Hence, a finite automata exists that accepts L(G) if G is a regular grammar. Given a regular grammar G,
- a finite automata accepting L(G) can be obtained as follows:
- 1. The number of states of the automata will be equal to the number of nonterminals of the grammar plus one; that is,
- there will be a state corresponding to every nonterminal of the grammar, and one more state will be added, which will
- be the final state of the automata. The state corresponding to the start symbol of the grammar will be the initial
- state of the automata. If L(G) contains ∈, then make the start state also a final state.
- 2. The transitions in the automata can be obtained as follows:
- for every production A → aB, add a transition δ(A, a) = B
- for every production of the form A → a, add a transition δ(A, a) = f, where f is the final state
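- For instance (a hypothetical grammar, chosen only to illustrate the construction): for the grammar S → aS | bA, A → a,
- the automata has the states S, A, and a new final state f, with the transitions δ(S, a) = S, δ(S, b) = A, and
- δ(A, a) = f; S is the initial state.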
- EXAMPLE 3.13
- Consider the regular grammar shown below and the transition diagram of the automata, shown in Figure 3.7, that
- accepts the language generated by the grammar.
- Figure 3.7: Transition diagram for automata that accepts the regular grammar of Example 3.13.
- This is a non-deterministic automata. Its deterministic equivalent can be obtained as follows:
- δ 0 1
- { S } { A, C } { B, C }
- *{ A, C } { S } { B, C }
- *{ B, C } { A } { S }
- { A } { S } { B, C }
- The transition diagram of the automata is shown in Figure 3.8.
- Figure 3.8: Deterministic equivalent of the non-deterministic automata shown in Figure 3.7.
- Consider the following grammar:
- The transition diagram of the finite automata that accepts the language generated by the above grammar is shown in
- Figure 3.9.
- Figure 3.9: Non-deterministic automata.
- This is a non-deterministic automata. Its deterministic equivalent can be obtained as follows, and the transition
- diagram is shown in Figure 3.10.
- Figure 3.10: Transition diagram for deterministic automata equivalent shown in Figure 3.9.
- Given a finite automata M, a regular grammar G that generates L(M) can be obtained as follows:
- 1. Associate suitable variables like A, B, C, etc., with the states of the automata. The labels of the states can also
- be used as variable names.
- 2. Obtain the productions of the grammar as follows: if δ(A, a) = B, then add a production A → aB to the list of
- productions of the grammar. If B is a final state, then add either A → a or B → ∈ to the grammar's list of productions.
- 3. The variable associated with the initial state of the automata is the start symbol of the grammar.
- For example consider the automata shown in Figure 3.11.
- Figure 3.11: Regular-grammar automata.
- The regular grammar that generates the language accepted by the automata shown in Figure 3.11 will have the
- following productions:
- or
- where A is the start symbol. Both grammars are the same, but the first one contains ∈-productions, whereas the second
- is ∈-free.
- EXAMPLE 3.14
- Find out whether the following grammars generate the same language.
- G 1 :
- G 2 :
- Since the grammars G1 and G2 are regular grammars, L(G1) = L(G2) if the minimal state automata accepting L(G1) and the
- minimal state automata accepting L(G2) are identical. The transition diagram of the automata accepting L(G1) is shown
- in Figure 3.12.
- Figure 3.12: Transition diagram of automata that accepts L(G 1 ).
- The automata is deterministic. Hence, to minimize it, we proceed as follows. Since state D is an unreachable state, we
- eliminate it first. After eliminating state D, we get the transition diagram shown in Figure 3.13.
- Figure 3.13: Transition diagram of automata after removal of state D.
- We then identify the nondistinguishable states of the automata shown in Figure 3.13, as follows. Initially, we have two
- groups:
- Since
- state B is distinguishable from the rest of the members of Group I. Hence, we divide Group I into two groups: one
- containing A, and the other containing E and C, as shown below:
- Since
- partitioning of Group II is not possible, because the transitions from all the members of Group II only go to Group II.
- Similarly:
- Partitioning of Group II is not possible, because the transitions from all the members of Group II only go to Group I. And
- since:
- partitioning of Group III is not possible, because the transitions from all the members of Group III only go to Group I.
- Similarly:
- Partitioning of Group III is not possible, because the transitions from all the members of Group III only go to Group III.
- Hence, states E and C are nondistinguishable states. States B and F are also nondistinguishable states. Therefore, if
- we merge E and C to form a state E 1 , and we merge B and F to form B 1 , we get the automata shown in Figure 3.14.
- Figure 3.14: Transition diagram for the automata that results from merged states.
- Since no dead states exist in the automata shown in Figure 3.14, it is a minimal state automata that accepts L(G 1 ).
- The transition diagram of the non-deterministic automata that accepts L(G 2 ) is shown in Figure 3.15.
- Figure 3.15: Non-deterministic automata that accepts L(G 2 ).
- Its equivalent deterministic automata is as follows, and the transition diagram is shown in Figure 3.16.
- δ 0 1
- { X } { Y, F } { Z }
- *{ Y, F } { X } { Y, F }
- { Z } { Z } { X }
- Figure 3.16: Transition diagram of the equivalent deterministic automata for Figure 3.15.
- This automata does not contain unreachable states, nondistinguishable states, or dead states. Hence, it is a minimal
- state automata accepting L(G2); and since it is identical to the minimal state automata accepting L(G1), we have
- L(G1) = L(G2); therefore, G1 and G2 generate the same language.
- Obtaining a Regular Expression from the Regular Grammar
- Given a regular grammar G, a regular expression that specifies L(G) can be directly obtained as follows:
- Replace the " → " symbols in the grammar's productions with "=" symbols to get a set of equations. 1.
- Solve the set of equations obtained above to obtain the value of the variable S, where S is the
- start symbol of the grammar. The result is the regular expression specifying L(G).
- 2.
- For example consider the following regular grammar:
- Replacing the " → " symbol in the productions of the grammar with the "=" symbol, we get the
- following set of equations:
- 3.
- From equation (III) we get:
- because equation (III) is of the form A = aA | b, where a and b are the expressions that do not contain variable A, and
- the solution of this is A = a*b. Similarly, from equation (II) we get:
- Substituting the values of A in (I) gives:
- Hence, the required regular expression is:
- 3.4 RIGHT LINEAR AND LEFT LINEAR GRAMMAR
- 3.4.1 Right Linear Grammar
- Right linear grammar is a context-free grammar in which every production is restricted to one of the following forms:
- 1. A → wB, or
- 2. A → w,
- where A and B are nonterminals, and w is in T*.
- Since w is in T*, w can also be a single terminal; hence, every regular grammar, by default, satisfies this requirement
- of a right linear grammar. Therefore, every regular grammar is a right linear grammar. Similarly, when | w | > 1, a
- production containing w on the right side can be split into more than one production, each containing only one terminal
- and at most one nonterminal on the right side, by using additional nonterminals; because w can be written as ay, where
- a is the first terminal symbol of w and y is the string made of the remaining symbols of w, a production A → wB can be
- split into the productions A → aB1 and B1 → yB without affecting the language generated by the grammar. The production
- B1 → yB can be further split in a similar manner, and this can continue until | y | becomes one. A production A → w can
- also be split into the productions A → aB1 and B1 → y without affecting the language generated by the grammar. The
- production B1 → y can be further split in a similar manner, and this can continue until | y | becomes one, bringing the
- productions into the form required by the regular grammar. Therefore, we conclude that every right linear grammar can
- be rewritten in such a manner that every production of the grammar will satisfy the requirement of the regular grammar.
- For example, consider the following grammar:
- The grammar is a right linear grammar; the production S → aaB can be split into the productions S → aC and C → aB
- without affecting what is derived from S. Similarly, the production S → ab can be split into the productions S → aD and
- D → a. The production B → bb can also be split into the productions B → bE and E → b. Therefore, the above
- grammar can be rewritten as:
- which is a regular grammar.
- 3.4.2 Left Linear Grammar
- Left linear grammar is a context-free grammar in which every production is restricted to one of the following forms:
- 1. A → Bw, or
- 2. A → w,
- where A and B are nonterminals, and w is in T*.
- For every left linear grammar, there exists an equivalent right linear grammar that generates the same language, and
- vice versa. Hence, we conclude that every linear grammar (left or right) is a regular grammar. Given a right linear
- grammar, an equivalent left linear grammar can be obtained as follows:
- 1. Obtain a regular expression for the language generated by the given grammar.
- 2. Reverse the regular expression obtained in step 1.
- 3. Obtain the regular, right linear grammar for the regular expression obtained in step 2.
- 4. Reverse the right side of every production of the grammar obtained in step 3. The resulting grammar will be an
- equivalent left linear grammar.
- For example consider the right linear grammar given below:
- The regular expression for the above grammar is obtained as follows. Replace the → by = in the above productions
- to obtain the equations:
- Solving equation (II) gives:
- By substituting the value of B in (I), we get:
- Therefore, the required regular expression is:
- And the reverse regular expression is:
- The finite automata accepting the language specified by the above regular expression is shown in Figure 3.17.
- Figure 3.17: Finite automata accepting the right linear grammar for a regular expression.
- Therefore, the right linear grammar that generates the language accepted by the automata in Figure 3.17 is:
- Since C is not useful, eliminating C gives:
- which can be further simplified by replacing D in B → 1D, using D → 0 to give:
- Reversing the right side of the productions yields:
- which is the equivalent left linear grammar. So, given a left linear grammar, an equivalent right linear grammar can be
- obtained as follows:
- 1. Reverse the right side of every production of the given grammar.
- 2. Obtain a regular expression for the language generated by the grammar obtained in step 1.
- 3. Reverse the regular expression obtained in step 2.
- 4. Obtain the regular, right linear grammar for the regular expression obtained in step 3.
- The resulting grammar will be an equivalent right linear grammar. For example, consider the following left linear
- grammar:
- Reversing the right side of the productions gives us:
- The regular expression that specifies the language generated by the above grammar can be obtained as follows.
- Replace the → symbols with "=" symbols in the productions of the above grammar to get the following set of
- equations:
- From equation (II), we get:
- Substituting this value in (I) gives us:
- Therefore,
- and the regular expression is:
- The reversed regular expression is:
- The finite automata that accepts the language specified by the reversed regular expression is shown in Figure 3.18.
- Figure 3.18: Transition diagram for a finite automata specified by a reversed regular expression.
- Therefore, the regular grammar that generates the language accepted by the automata shown in Figure 3.18 is:
- which can be reduced to:
- which is the required right linear grammar.
- EXAMPLE 3.15
- Consider the following grammar to obtain an equivalent left linear grammar.
- The regular expression for the above grammar is obtained as follows. Replace the → by = in the above productions
- to obtain the equations:
- By substituting (III) in (II), we get:
- Therefore, A = (a | gg)A | g; hence, A = (a | gg)*g. By substituting this value in (I), we get:
- And the regular expression is:
- Therefore, the reversed regular expression is:
- But since (a | gg)* is the same as (gg | a)*, the reversed regular expression is the same. Hence, the regular, right
- linear grammar that generates the language specified by the reversed regular expression is the given grammar itself.
- Therefore, an equivalent left linear grammar can be obtained by reversing the right side of the productions of the given
- grammar:
- Chapter 4: Top-Down Parsing
- INTRODUCTION
- A syntax analyzer or parser is a program that performs syntax analysis. A parser obtains a string of tokens from the
- lexical analyzer and verifies whether or not the string is a valid construct of the source language; that is, whether
- or not it can be generated by the grammar for the source language. For this, the parser either attempts to derive the
- string of tokens w from the start symbol S, or it attempts to reduce w to the start symbol of the grammar by tracing
- the derivations of w in reverse. An attempt to derive w from the grammar's start symbol S is equivalent to an attempt
- to construct a top-down parse tree; that is, it starts from the root node and proceeds toward the leaves. Similarly,
- an attempt to reduce w to the grammar's start symbol S is equivalent to an attempt to construct a bottom-up parse tree;
- that is, it starts with w and traces the derivations in reverse, obtaining the root S.
- 4.1 TOP-DOWN PARSING
- Top-down parsing attempts to find the left-most derivations for an input string w, which is equivalent to constructing a
- parse tree for the input string w that starts from the root and creates the nodes of the parse tree in a predefined order.
- The reason that top-down parsing seeks the left-most derivations for an input string w and not the right-most
- derivations is that the input string w is scanned by the parser from left to right, one symbol/token at a time, and the
- left-most derivations generate the leaves of the parse tree in left-to-right order, which matches the input scan order.
- Since top-down parsing attempts to find the left-most derivations for an input string w, a top-down parser may require
- backtracking (i.e., repeated scanning of the input); because in the attempt to obtain the left-most derivation of the input
- string w, a parser may encounter a situation in which a nonterminal A is required to be derived next, and there are
- multiple A-productions, such as A → α1 | α2 | … | αn. In such a situation, deciding which A-production to use for the
- derivation of A is a problem. Therefore, the parser will select one of the A-productions to derive A; and if this derivation
- finally leads to the derivation of w, then the parser announces the successful completion of parsing. Otherwise, the
- parser resets the input pointer to where it was when the nonterminal A was derived, and it tries another A-production.
- The parser will continue this until it either announces the successful completion of the parsing or reports failure after
- trying all of the alternatives. For example, consider the top-down parser for the following grammar:
- Let the input string be w = acb. The parser initially creates a tree consisting of a single node, labeled S, and the input
- pointer points to a, the first symbol of input string w. The parser then uses the S-production S → aAb to expand the
- tree as shown in Figure 4.1.
- Figure 4.1: Parser uses the S-production to expand the parse tree.
- The left-most leaf, labeled a, matches the first input symbol of w. Hence, the parser will now advance the input pointer
- to c, the second symbol of string w, and consider the next leaf labeled A. It will then expand A, using the first
- alternative for A in order to obtain the tree shown in Figure 4.2.
- Figure 4.2: Parser uses the first alternative for A in order to expand the tree.
- The parser now has the match for the second input symbol. So, it advances the pointer to b, the third symbol of w,
- and compares it to the label of the next leaf, d. Since d does not match b, the parser reports failure and goes back (backtracks)
- to A, as shown in Figure 4.3. The parser will also reset the input pointer to the second input symbol (the position it
- had when the parser encountered A), and it will try the second alternative for A in order to obtain the tree. If the leaf c
- matches the second symbol, and if the next leaf b matches the third symbol of w, then the parser will halt and
- announce the successful completion of parsing.
- Figure 4.3: If the parser fails to match a leaf, the point of failure, d, reroutes (backtracks) the pointer to
- alternative paths from A.
- 4.2 IMPLEMENTATION
- A top-down parser can be implemented by writing a set of recursive procedures to process the input. One procedure
- will take care of the left-most derivations for each nonterminal while processing the input. Each procedure should also
- provide for the storing of the input pointer in some local variable so that it can be reset properly when the parser
- backtracks. This implementation, called a "recursive descent parser," is a top-down parser for the above-described
- grammar that can be implemented by writing the following set of procedures:
- S( )
- {
-     if (input == 'a')
-     {
-         advance( );
-         if (A( ) != error)
-         {
-             if (input == 'b')
-             {
-                 advance( );
-                 if (input == endmarker)
-                     return(success);
-                 else
-                     return(error);
-             }
-             else
-                 return(error);
-         }
-         else
-             return(error);
-     }
-     else
-         return(error);
- }
- A( )
- {
-     if (input == 'c')
-     {
-         advance( );
-         if (input == 'd')      /* try the longer alternative A → cd first */
-             advance( );
-         return(success);       /* either A → cd or A → c has matched */
-     }
-     else
-         return(error);
- }
- main( )
- {
-     Append the endmarker to the string w to be parsed;
-     Set the input pointer to the left-most token of w;
-     if (S( ) != error)
-         printf("Successful completion of the parsing");
-     else
-         printf("Failure");
- }
- where advance( ) is a routine that, when called, advances the input pointer to the next symbol of the input string w.
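- The same parser can be written as a self-contained C program in which the saving and resetting of the input pointer
- is made explicit, as the discussion above requires; the names buffer, pos, and the '$' endmarker are illustrative
- assumptions, not part of the text's pseudocode.
- #include <stdio.h>
- static const char *buffer;   /* input string w followed by the endmarker '$' */
- static int pos;              /* index of the current input symbol            */
- static int A(void)
- {
-     int save = pos;              /* store the input pointer in a local variable */
-     if (buffer[pos] == 'c') {    /* try the alternative A -> cd first           */
-         pos++;
-         if (buffer[pos] == 'd') { pos++; return 1; }
-     }
-     pos = save;                  /* backtrack and try the alternative A -> c    */
-     if (buffer[pos] == 'c') { pos++; return 1; }
-     pos = save;                  /* reset the pointer on failure                */
-     return 0;
- }
- static int S(void)
- {
-     if (buffer[pos] == 'a') {
-         pos++;
-         if (A() && buffer[pos] == 'b') {
-             pos++;
-             return buffer[pos] == '$';   /* the endmarker must follow */
-         }
-     }
-     return 0;
- }
- int main(void)
- {
-     const char *tests[] = { "acb$", "acdb$", "ab$" };
-     for (int i = 0; i < 3; i++) {
-         buffer = tests[i];
-         pos = 0;
-         printf("%s : %s\n", tests[i], S() ? "success" : "failure");
-     }
-     return 0;
- }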
- Caution In a backtracking parser, the order in which alternatives are tried affects the language accepted by the parser.
- For example, in the above parser, if a production A → c is tried before A → cd, then the parser will fail to accept
- the string w = acdb, because it first expands S, as shown in Figure 4.4.
- Figure 4.4: The parser first expands S and fails to accept w = acdb.
- The first input symbol matches the left-most leaf; and therefore, the parser will advance the pointer to c and consider
- the nonterminal A for expansion in order to obtain the tree shown in Figure 4.5.
- Figure 4.5: The parser advances to c and considers nonterminal A for expansion.
- The second input symbol also matches. Therefore, the parser will advance the pointer to d, the third input symbol,
- and consider the next leaf, labeled b in Figure 4.5. It finds that there is no match; and therefore, it will backtrack to S
- (as shown in Figure 4.5 by the thick arrow). But since there is no alternative to S that can be tried, the parser will return
- failure. Because the point of mismatch is the descendent of a node labeled by S, the parser will backtrack to S. It
- cannot backtrack to A. Therefore, the parser will not accept the string acdb. However, if the parser tries the alternative
- A → cd first and A → c second, then the parser is capable of accepting the string acdb as well as acb, because, for the
- string w = acb, when the parser encounters a mismatch, it is at a node labeled by d, which is a descendent of a node
- labeled by A. Hence, it will backtrack to A and try A → c, and end up with the parse tree for acb. Hence, we conclude that
- the order in which alternatives are tried in a backtracking parser affects the language accepted by the compiler or
- parser.
- EXAMPLE 4.1
- Consider a grammar S → aa | aSa. If a top-down backtracking parser for this grammar tries S → aSa before S → aa,
- show that the parser succeeds on strings of two and four occurrences of a, but not on a string of six occurrences of a.
- In the case of two occurrences of a, the parser will first expand S, as shown in Figure 4.6.
- Figure 4.6: The parser first expands S.
- The first input symbol matches the left-most leaf. Therefore, the parser will advance the pointer to a second a and
- consider the nonterminal S for expansion in order to obtain the tree shown in Figure 4.7.
- Figure 4.7: The parser advances the pointer to a second occurrence of a.
- The second input symbol also matches. Therefore, the parser will consider the next leaf labeled S and expand it, as
- shown in Figure 4.8.
- Figure 4.8: The parser expands the next leaf labeled S.
- The parser now finds that there is no match. Therefore, it will backtrack to S, as shown by the thick arrow in Figure
- 4.9. The parser then continues matching and backtracking, as shown in Figures 4.10 through 4.15, until it arrives at the
- required parse tree, shown in Figure 4.16.
- Figure 4.9: The parser finds no match, so it backtracks.
- Figure 4.10: The parser tries an alternate aa.
- Figure 4.11: There is no further alternate of S that can be tried, so the parser will backtrack one more step.
- Figure 4.12: The parser again finds a mismatch; hence, it backtracks.
- Figure 4.13: The parser tries an alternate aa.
- Figure 4.14: Since no alternate of S remains to be tried, the parser backtracks one more step.
- Figure 4.15: The parser tries an alternate aa.
- Figure 4.16: The parser arrives at the required parse tree.
- Now, consider a string of four occurrences of a. The parser will first expand S, as shown in Figure 4.17.
- Figure 4.17: The parser first expands S.
- The first input symbol matches the left-most leaf. Therefore, the parser will advance the pointer to a second a and
- consider the nonterminal S for expansion, obtaining the tree shown in Figure 4.18.
- Figure 4.18: The parser advances the pointer to a second occurrence of a.
- The second input symbol also matches. Therefore, the parser will consider the next leaf labeled by S and expand it, as
- shown in Figure 4.19.
- Figure 4.19: The parser considers the next leaf labeled by S.
- The third input symbol also matches. So, the parser moves on to the next leaf labeled by S and expands it, as shown
- in Figure 4.20.
- Figure 4.20: The parser matches the third input symbol and moves on to the next leaf labeled by S.
- The fourth input symbol also matches. Therefore, the next leaf labeled by S is considered. The parser expands it, as
- shown in Figure 4.21.
- Figure 4.21: The parser considers the fourth occurrence of the input symbol a.
- Now it finds that there is no match. Therefore, it will backtrack to S (Figure 4.22) and continue backtracking, as shown
- in Figures 4.23 through 4.30, until the parser finally arrives at the successful generation of a parse tree for aaaa in
- Figure 4.31.
- Figure 4.22: The parser finds no match, so it backtracks.
- Figure 4.23: The parser tries an alternate aa.
- Figure 4.24: No alternate of S can be tried, so the parser will backtrack one more step.
- Figure 4.25: Again finding a mismatch, the parser backtracks.
- Figure 4.26: The parser then tries an alternate.
- Figure 4.27: No alternate of S remains to be tried, so the parser will backtrack one more step.
- Figure 4.28: The parser again finds a mismatch; therefore, it backtracks.
- Figure 4.29: The parser tries an alternate aa.
- Figure 4.30: The parser then tries an alternate aa.
- Figure 4.31: The parser successfully generates the parse tree for aaaa.
- Now consider a string of six occurrences of a. The parser will first expand S, as shown in Figure 4.32.
- Figure 4.32: The parser expands S.
- The first input symbol matches the left-most leaf. Therefore, the parser will advance the pointer to the second a and
- consider the nonterminal S for expansion. The tree shown in Figure 4.33 is obtained.
- Figure 4.33: The parser matches the first symbol, advances to the second occurrence of a, and considers S for
- expansion.
- The second input symbol also matches. Therefore, the parser will consider the next leaf labeled S and expand it, as
- shown in Figure 4.34.
- Figure 4.34: The parser finds a match for the second occurrence of a and expands S.
- The third input symbol also matches, as do the fourth through sixth symbols. In each case, the parser will consider the
- next leaf labeled S and expand it, as shown in Figures 4.35 through 4.38.
- Figure 4.35: The parser matches the third input symbol, considers the next leaf, and expands S.
- Figure 4.36: The parser matches the fourth input symbol, considers the next leaf, and expands S.
- Figure 4.37: A match is found for the fifth input symbol, so the parser considers the next leaf, and expands S.
- Figure 4.38: The sixth input symbol also matches. So the next leaf is considered, and S is expanded.
- Now the parser finds that there is no match. Therefore, it will backtrack to S, as shown by the thick arrow in Figure
- 4.39.
- Figure 4.39: No match is found, so the parser backtracks to S.
- Since there is no alternate of S that can be tried, the parser will backtrack one more step, as shown in Figure 4.40.
- This procedure continues (Figures 4.41 through 4.47), until the parser tries the sixth alternate aa (Figure 4.48) and
- fails to find a match.
- Figure 4.40: The parser backtracks one more step.
- Figure 4.41: The parser tries the alternate aa.
- Figure 4.42: Again, a mismatch is found. So, the parser backtracks.
- Figure 4.43: No alternate of S remains, so the parser will backtrack one more step.
- Figure 4.44: The parser tries an alternate aa.
- Figure 4.45: Again, a mismatch is found. The parser backtracks.
- Figure 4.46: The parser then tries an alternate aa.
- Figure 4.47: A mismatch is found, and the parser backtracks.
- Figure 4.48: The parser tries for the alternate aa, fails to find a match, and cannot generate the parse tree for six
- occurrences of a.
- 4.3 THE PREDICTIVE TOP-DOWN PARSER
- A backtracking parser is a non-deterministic recognizer of the language generated by the grammar. The backtracking
- problems in the top-down parser can be solved; that is, a top-down parser can function as a deterministic recognizer if
- it is capable of predicting or detecting which alternatives are right choices for the expansion of nonterminals (that
- derive to more than one alternative) during the parsing of input string w. By carefully writing a grammar, eliminating
- left recursion, and left-factoring the result, we may obtain a grammar for which a top-down parser can
- predict the right alternative for the expansion of a nonterminal during the parsing process; and
- hence, it need not backtrack.
- If A → α1 | α2 | … | αn are the A-productions in the grammar, then a top-down parser can decide whether a nonterminal A is
- to be expanded or not; and if it is to be expanded, the parser decides which A-production should be used. It looks at
- the next input symbol and finds out which of the αi derives a string that starts with the terminal symbol coming next
- in the input. If none of the αi derives a string starting with the next terminal symbol, the parser reports failure; otherwise,
- it carries out the derivation of A using the production A → αi , where αi derives a string whose first terminal symbol is
- the symbol coming next in the input. Therefore, we conclude that if the set of first-terminal symbols of the strings
- derivable from αi is computed for each αi , and this set is made available to the parser, then the parser can predict the
- right choice for the expansion of nonterminal A. This information can be easily computed using the productions of the
- grammar. We define a function FIRST(α), where α is in (V ∪ T)*, as follows:
- FIRST(α) = the set of those terminals with which the strings derivable from α start
- If α = XYZ, then FIRST(α) is computed as follows:
- FIRST(α) = FIRST(XYZ) = { X } if X is a terminal.
- Otherwise,
- FIRST(α) = FIRST(XYZ) = FIRST(X) if X does not derive an empty string; that is, if
- FIRST(X) does not contain ∈.
- If FIRST(X) contains ∈, then
- FIRST(α) = FIRST(XYZ) = (FIRST(X) − { ∈ }) ∪ FIRST(YZ)
- FIRST(YZ) is computed in an identical manner:
- FIRST(YZ) = { Y } if Y is a terminal.
- Otherwise,
- FIRST(YZ) = FIRST(Y) if Y does not derive an empty string (i.e., if FIRST(Y) does not contain ∈). If FIRST(Y)
- contains ∈, then
- FIRST(YZ) = (FIRST(Y) − { ∈ }) ∪ FIRST(Z)
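- These rules lend themselves to a direct implementation. The following is a minimal C sketch of computing the FIRST
- set of a string of symbols from the FIRST sets of the individual symbols; representing a set as a string of terminal
- characters, with '@' standing in for ∈, is an illustrative assumption.
- #include <stdio.h>
- #include <string.h>
- #define EPS '@'   /* stands for the empty-string symbol */
- /* add terminal c to set s if it is not already present */
- static void add(char *s, char c)
- {
-     if (!strchr(s, c)) { size_t n = strlen(s); s[n] = c; s[n + 1] = '\0'; }
- }
- /* first[i] is the FIRST set of the i-th symbol of the string X1 X2 ... Xn;
-    the FIRST set of the whole string is written into out */
- static void first_of_string(const char *first[], int n, char *out)
- {
-     out[0] = '\0';
-     for (int i = 0; i < n; i++) {
-         for (const char *p = first[i]; *p; p++)   /* copy FIRST(Xi) - {eps} */
-             if (*p != EPS) add(out, *p);
-         if (!strchr(first[i], EPS)) return;       /* Xi cannot derive eps   */
-     }
-     add(out, EPS);   /* every symbol was nullable, so eps is included */
- }
- int main(void)
- {
-     /* FIRST(B) = { g, eps } and FIRST(C) = { h, eps }: compute FIRST(BC) */
-     const char *first[] = { "g@", "h@" };
-     char out[32];
-     first_of_string(first, 2, out);
-     printf("FIRST(BC) = %s\n", out);   /* prints g, h and the eps marker */
-     return 0;
- }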
- For example, consider the grammar:
- FIRST(S) = FIRST(ACB) ∪ FIRST(CbB) ∪ FIRST(Ba)    (I)
- FIRST(A) = FIRST(da) ∪ FIRST(BC)    (II)
- FIRST(B) = FIRST(g) ∪ FIRST(∈) = { g, ∈ }
- FIRST(C) = FIRST(h) ∪ FIRST(∈) = { h, ∈ }
- Therefore:
- FIRST(BC) = (FIRST(B) − { ∈ }) ∪ FIRST(C) = { g, h, ∈ }
- Substituting in (II) we get:
- FIRST(A) = { d } ∪ { g, h, ∈ } = { d, g, h, ∈ }
- FIRST(ACB) = (FIRST(A) − { ∈ }) ∪ FIRST(CB)    (III)
- FIRST(CB) = (FIRST(C) − { ∈ }) ∪ FIRST(B) = { g, h, ∈ }
- Therefore, substituting in (III) we get:
- FIRST(ACB) = { d, g, h } ∪ { g, h, ∈ } = { d, g, h, ∈ }
- Similarly,
- FIRST(CbB) = (FIRST(C) − { ∈ }) ∪ FIRST(bB) = { h } ∪ { b } = { b, h }
- Similarly,
- FIRST(Ba) = (FIRST(B) − { ∈ }) ∪ FIRST(a) = { g } ∪ { a } = { a, g }
- Therefore, substituting in (I), we get:
- FIRST(S) = { d, g, h, ∈ } ∪ { b, h } ∪ { a, g } = { a, b, d, g, h, ∈ }
- EXAMPLE 4.2
- Consider the following grammar:
- FIRST(aAb)= { a }
- FIRST(cd)= { c }, and
- FIRST(ef)= { e }
- Hence, while deriving S, the parser looks at the next input symbol. And if it happens to be the terminal a, then the
- parser derives S using S → aAb. Otherwise, the parser reports an error. Similarly, when expanding A, the parser looks
- at the next input symbol; if it happens to be the terminal c, then the parser derives A using A → cd. If the next terminal
- input symbol happens to be e, then the parser derives A using A → ef. Otherwise, an error is reported.
- Therefore, we conclude that if the FIRST of the right-hand side of the production S → aAb is computed, we can decide when the
- parser should do the derivation using the production S → aAb. Similarly, if the FIRST sets of the right-hand sides of the productions A
- → cd and A → ef are computed, then we can decide when the derivation is to be done using A → cd and A → ef,
- respectively. These decisions can be encoded in the form of a table, as shown in Table 4.1, and can be made available
- to the parser for the correct selection of productions for derivations during parsing.
- Table 4.1: Production Selections for Parsing Derivations
-        a          b    c         d    e         f    $
- S      S → aAb
- A                      A → cd         A → ef
- The number of rows of the table is equal to the number of nonterminals, whereas the number of columns is equal to
- the number of terminals, including the end marker. The parser uses the nonterminal to be derived as the row index
- of the table, and the next input symbol as the column index, when it decides which production is to be used for the
- derivation. Here, the production S → aAb is added in the table at [S, a], because FIRST(aAb) contains the terminal a.
- Hence, S must be derived using S → aAb if and only if the terminal symbol coming next in the input is a. Similarly, the
- production A → cd is added at [A, c], because FIRST(cd) contains c. Hence, A must be derived using A → cd if and only
- if the terminal symbol coming next in the input is c. Finally, A must be derived using A → ef if and only if the terminal
- symbol coming next in the input is e. Hence, the production A → ef is added at [A, e]. Therefore, we conclude that the
- table can be constructed as follows:
- for every production A → α do
- for every a in FIRST( α ) do
- TABLE[A, a] = A → α
- Using the above method, every production of the grammar gets added to the table at the proper place when the
- grammar is ∈-free. But when the grammar is not ∈-free, the ∈-productions will not get added to the table. If there is an
- ∈-production A → ∈ in the grammar, then deciding when A is to be derived to ∈ is not possible using the FIRST of the
- production's right-hand side. Some additional information is required to decide where the production A → ∈ is to be added to
- the table.
- Tip The derivation by A → ∈ is the right choice when the parser is on the verge of expanding the nonterminal A and the
- next input symbol happens to be a terminal that can occur immediately following A in some string occurring on the
- right side of a production. This will lead to the expansion of A to ∈, and the next leaf in the parse tree will be
- considered, which is labeled by the symbol immediately following A and, therefore, may match the next input
- symbol.
- Therefore, we conclude that the production A → ∈ is to be added in the table at [A, b] for every terminal b that immediately
- follows A in any of the right-hand strings of the grammar's productions. To compute the set of all such terminals, we make use
- of the function FOLLOW(A), where A is a nonterminal, as defined below:
- FOLLOW(A) = the set of terminals that immediately follow A in any string occurring on the right side of productions of the
- grammar
- For example, if A → αBβ is a production, then FOLLOW(B) can be computed using A → αBβ, as shown below:
- FOLLOW(B) = FIRST(β) if FIRST(β) does not contain ∈. If FIRST(β) contains ∈, then FOLLOW(B) also contains everything in FOLLOW(A).
- Therefore, we conclude that when the grammar is not ∈ -free, then the table can be constructed as follows:
- 1. Compute FIRST and FOLLOW for every nonterminal of the grammar.
- 2. For every production A → α, do:
- {
-     for every non-∈ member a in FIRST(α) do
-         TABLE[A, a] = A → α
-     if FIRST(α) contains ∈ then
-         for every b in FOLLOW(A) do
-             TABLE[A, b] = A → α
- }
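- A minimal C sketch of this construction is shown below. It assumes a small illustrative grammar, S → aA, A → b | ∈,
- with FIRST and FOLLOW sets already computed and encoded as bitmasks; the grammar, the encoding, and all names
- here are assumptions made for the illustration.
- #include <stdio.h>
- #define NT 2                 /* nonterminals: 0 = S, 1 = A        */
- #define T  3                 /* terminals:    0 = a, 1 = b, 2 = $ */
- #define EPS_BIT (1u << T)    /* extra bit marking the eps member  */
- struct prod { int lhs; unsigned first_alpha; };
- int main(void)
- {
-     /* productions: 0: S -> aA, 1: A -> b, 2: A -> eps */
-     struct prod p[] = {
-         { 0, 1u << 0 },      /* FIRST(aA)  = { a }   */
-         { 1, 1u << 1 },      /* FIRST(b)   = { b }   */
-         { 1, EPS_BIT },      /* FIRST(eps) = { eps } */
-     };
-     unsigned follow[NT] = { 1u << 2, 1u << 2 };  /* FOLLOW(S) = FOLLOW(A) = { $ } */
-     int table[NT][T];
-     for (int i = 0; i < NT; i++)
-         for (int a = 0; a < T; a++) table[i][a] = -1;   /* -1 = error */
-     for (int k = 0; k < 3; k++) {
-         for (int a = 0; a < T; a++)      /* every non-eps member of FIRST(alpha) */
-             if (p[k].first_alpha & (1u << a)) table[p[k].lhs][a] = k;
-         if (p[k].first_alpha & EPS_BIT)  /* eps: use every b in FOLLOW(A)        */
-             for (int b = 0; b < T; b++)
-                 if (follow[p[k].lhs] & (1u << b)) table[p[k].lhs][b] = k;
-     }
-     const char *nts = "SA", *ts = "ab$";
-     for (int i = 0; i < NT; i++)
-         for (int a = 0; a < T; a++)
-             if (table[i][a] >= 0)
-                 printf("TABLE[%c, %c] = production %d\n", nts[i], ts[a], table[i][a]);
-     return 0;
- }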
- Therefore, we conclude that if the table is constructed using the above algorithm, a top-down parser can be
- constructed that will be a nonbacktracking, or ‘predictive’ parser.
- 4.3.1 Implementation of a Table-Driven Predictive Parser
- A table-driven parser can be implemented using an input buffer, a stack, and a parsing table. The input buffer is used
- to hold the string to be parsed. The string is followed by a "$" symbol that is used as a right-end marker to indicate the
- end of the input string. The stack is used to hold the sequence of grammar symbols. A "$" indicates bottom of the
- stack. Initially, the stack has the start symbol of the grammar above the $. The parsing table is the table obtained by using
- the algorithm presented in the previous section. It is a two-dimensional array TABLE[A, a], where A is a
- nonterminal and a is a terminal or the $ symbol. The parser is controlled by a program that behaves as follows:
- 1. The program considers X, the symbol on the top of the stack, and the next input symbol a.
- 2. If X = a = $, then the parser announces the successful completion of the parsing and halts.
- 3. If X = a ≠ $, then the parser pops X off the stack and advances the input pointer to the next input symbol.
- 4. If X is a nonterminal, then the program consults the parsing table entry TABLE[X, a]. If TABLE[X, a] = X → UVW, then the parser replaces X on the top of the stack by UVW in such a manner that U comes on the top. If TABLE[X, a] = error, then the parser calls the error-recovery routine.
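- The control program can be sketched in C as follows, for the grammar of the example that follows (S → aABb,
- A → c | ∈, B → d | ∈, whose table is shown in Table 4.2); encoding the parsing table as a function and using
- single-character tokens are illustrative assumptions.
- #include <stdio.h>
- #include <string.h>
- /* the parsing table: the production right side to use for nonterminal X
-    on lookahead a, "" for an eps-production, NULL for an error entry */
- static const char *rhs(char X, char a)
- {
-     if (X == 'S' && a == 'a') return "aABb";
-     if (X == 'A') { if (a == 'c') return "c";
-                     if (a == 'b' || a == 'd') return ""; }  /* A -> eps */
-     if (X == 'B') { if (a == 'd') return "d";
-                     if (a == 'b') return ""; }              /* B -> eps */
-     return NULL;
- }
- static int is_nonterminal(char X) { return X == 'S' || X == 'A' || X == 'B'; }
- int main(void)
- {
-     const char *w = "acdb$";               /* input followed by the endmarker */
-     char stack[64] = "$S";                 /* $ below the start symbol        */
-     int top = 1, ip = 0;
-     for (;;) {
-         char X = stack[top], a = w[ip];
-         if (X == '$' && a == '$') { puts("successful completion of the parsing"); return 0; }
-         if (X == a) { top--; ip++; continue; }     /* match: pop and advance   */
-         const char *r = is_nonterminal(X) ? rhs(X, a) : NULL;
-         if (!r) { puts("error"); return 1; }
-         top--;                                     /* pop X                    */
-         for (int i = (int)strlen(r) - 1; i >= 0; i--)
-             stack[++top] = r[i];                   /* push rhs, U on the top   */
-     }
- }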
- For example consider the following grammar:
- FIRST(S) = FIRST(aABb) = { a }
- FIRST(A) = FIRST(c) ∪ FIRST( ∈ ) = { c, ∈ }
- FIRST(B) = FIRST(d) ∪ FIRST( ∈ ) = { d, ∈ }
- Since the right-end marker $ is used to mark the bottom of the stack, $ will initially be immediately below S (the start
- symbol) on the stack; and hence, $ will be in the FOLLOW(S). Therefore:
- Using S → aABb, we get:
- Therefore, the parsing table is as shown in Table 4.2.
- Table 4.2: Production Selections for Parsing Derivations
-        a           b        c        d        $
- S      S → aABb
- A                  A → ∈    A → c    A → ∈
- B                  B → ∈             B → d
- Consider an input string acdb. The various steps in the parsing of this string, in terms of the contents of the stack and
- unspent input, are shown in Table 4.3.
- Table 4.3: Steps Involved in Parsing the String acdb
- Stack Contents    Unspent Input    Moves
- $S                acdb$            Derivation using S → aABb
- $bBAa             acdb$            Popping a off the stack and advancing one position in the input
- $bBA              cdb$             Derivation using A → c
- $bBc              cdb$             Popping c off the stack and advancing one position in the input
- $bB               db$              Derivation using B → d
- $bd               db$              Popping d off the stack and advancing one position in the input
- $b                b$               Popping b off the stack and advancing one position in the input
- $                 $                Announce successful completion of the parsing
- Similarly, for the input string ab, the various steps in the parsing of the string, in terms of the contents of the stack and
- unspent input, are shown in Table 4.4.
- Table 4.4: Steps Involved in Parsing the String ab
- Stack Contents    Unspent Input    Moves
- $S                ab$              Derivation using S → aABb
- $bBAa             ab$              Popping a off the stack and advancing one position in the input
- $bBA              b$               Derivation using A → ∈
- $bB               b$               Derivation using B → ∈
- $b                b$               Popping b off the stack and advancing one position in the input
- $                 $                Announce successful completion of the parsing
- For a string adb, the various steps in the parsing of the string, in terms of the contents of the stack and unspent input,
- are shown in Table 4.5.
- Table 4.5: Steps Involved in Parsing the String adb
- Stack Contents    Unspent Input    Moves
- $S                adb$             Derivation using S → aABb
- $bBAa             adb$             Popping a off the stack and advancing one position in the input
- $bBA              db$              Calling an error-handling routine
- The heart of the table-driven parser is the parsing table: the parser looks at the parsing table to decide which
- alternative is the right choice for the expansion of a nonterminal during the parsing of the input string. Hence,
- constructing a table-driven predictive parser can be considered equivalent to constructing the parsing table.
- A parsing table for any grammar can be obtained by the application of the above algorithm; but for some grammars,
- some of the entries in the parsing table may end up being multiply defined, whereas for other grammars, all
- of the entries in the parsing table are singly defined. If the parsing table contains multiple entries, then the
- parser is still non-deterministic. The parser will be a deterministic recognizer if and only if there are no multiple entries
- in the parsing table. All such grammars (i.e., those grammars that, after applying the algorithm above, contain no
- multiple entries in the parsing table) constitute a subset of CFGs called "LL(1)" grammars. Therefore, a given grammar
- is LL(1) if its parsing table, constructed by the algorithm above, contains no multiple entries. If the table contains multiple
- entries, then the grammar is not LL(1).
- In the acronym LL(1), the first L stands for the left-to-right scan of the input, the second L stands for the left-most
- derivation, and the (1) indicates that the next input symbol is used to decide the next parsing process (i.e., length of the
- lookahead is "1").
- In the LL(1) parsing system, parsing is done by scanning the input from left to right, and an attempt is made to derive
- the input string in a left-most order. The next input symbol is used to decide what is to be done next in the parsing
- process. The predictive parser discussed above, therefore, is an LL(1) parser, because it also scans the input from left
- to right and attempts to obtain the left-most derivation of it; and it also makes use of the next input symbol to decide
- what is to be done next. And if the parsing table used by the predictive parser does not contain multiple entries, then
- the parser acts as a recognizer of only the members of L(G); hence, the grammar is LL(1).
- Therefore, an LL(1) grammar is one for which an LL(1) parser can be constructed, which acts as a deterministic recognizer
- of L(G). If a grammar is LL(1), then a deterministic top-down table-driven recognizer can be constructed to recognize
- L(G). A parsing table constructed for a given grammar G will have multiple entries if the grammar contains multiple
- productions for the same nonterminal; that is, the grammar contains productions A → α | β, and both α and
- β derive strings that start with the same terminal symbol. Therefore, one of the basic requirements for a grammar
- to be LL(1), when the grammar contains multiple productions for the same nonterminal, is:
- for every pair of productions A → α | β
- FIRST(α) ∩ FIRST(β) = φ (i.e., FIRST(α) and FIRST(β) should be disjoint sets for every pair of productions A → α | β)
- For a grammar to be LL(1), the satisfaction of the condition above is necessary as well as sufficient if the grammar is
- ∈-free. When the grammar is not ∈-free, the satisfaction of the above condition is necessary but not sufficient,
- because either FIRST(α) or FIRST(β) might contain ∈, but not both. The above condition would still be satisfied; but if
- FIRST(β) contains ∈, then the production A → β will be added to the table on all terminals in FOLLOW(A). Hence, it is also
- required that FIRST(α) and FOLLOW(A) contain no common symbols. Therefore, an additional condition must be
- satisfied in order for a grammar to be LL(1) when the grammar is not ∈-free: for every pair of productions A → α | β,
- if FIRST( β ) contains ∈ , and FIRST( α ) does not contain ∈ , then
- FIRST( α ) ∩ FOLLOW(A) = φ
- Therefore, for a grammar to be LL(1), the following conditions must be satisfied:
- For every pair of productions A → α | β
- {
- (1) FIRST(α) ∩ FIRST(β) = φ
- and,
- if FIRST(β) contains ∈ and FIRST(α) does not contain ∈,
- then
- (2) FIRST(α) ∩ FOLLOW(A) = φ
- }
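- Once FIRST and FOLLOW are available, these two conditions can be checked mechanically. A small C sketch, with
- sets encoded as bitmasks over the terminals (an illustrative encoding):
- #include <stdio.h>
- /* returns 1 if the pair of alternatives A -> alpha | beta satisfies the
-    LL(1) conditions above, given bitmask sets and an eps bit */
- static int ll1_pair(unsigned first_a, unsigned first_b,
-                     unsigned follow_A, unsigned eps)
- {
-     if (first_a & first_b) return 0;                        /* condition (1) */
-     if ((first_b & eps) && (first_a & follow_A)) return 0;  /* condition (2) */
-     if ((first_a & eps) && (first_b & follow_A)) return 0;  /* condition (2) */
-     return 1;
- }
- int main(void)
- {
-     /* terminals: bit 0 = a, bit 1 = b, bit 2 = $; bit 3 marks eps */
-     unsigned eps = 1u << 3;
-     /* A -> aB | eps with FOLLOW(A) = { a } violates condition (2) */
-     printf("%s\n", ll1_pair(1u << 0, eps, 1u << 0, eps) ? "LL(1)" : "not LL(1)");
-     /* A -> aB | eps with FOLLOW(A) = { $ } satisfies both conditions */
-     printf("%s\n", ll1_pair(1u << 0, eps, 1u << 2, eps) ? "LL(1)" : "not LL(1)");
-     return 0;
- }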
- 4.3.2 Examples
- EXAMPLE 4.3
- Test whether the grammar is LL(1) or not, and construct a predictive parsing table for it.
- Since the grammar contains a pair of productions S → AaAb | BbBa, for the grammar to be LL(1), it is required that:
- Hence, the grammar is LL(1).
- To construct a parsing table, the FIRST and FOLLOW sets are computed, as shown below:
- 1. Using S → AaAb, we get:
- 2. Using S → BbBa, we get:
- Table 4.6: Production Selections for Example 4.3 Parsing Derivations
-        a           b           $
- S      S → AaAb    S → BbBa
- A      A → ∈       A → ∈
- B      B → ∈       B → ∈
- EXAMPLE 4.4
- Consider the following grammar, and test whether the grammar is LL(1) or not.
- For a pair of productions S → 1AB | ∈:
- because FOLLOW(S) = { $ } (i.e., it contains only the end marker). Similarly, for a pair of productions A → 1AC | 0C:
- Hence, the grammar is LL(1). Now, show that no left-recursive grammar can be LL(1).
- One of the basic requirements for a grammar to be LL(1) is: for every pair of productions A → α | β in the grammar's
- set of productions, FIRST(α) and FIRST(β) should be disjoint.
- If a grammar is left-recursive, then the set of productions will contain at least one pair of the form A → Aα | β; and
- hence, FIRST(Aα) and FIRST(β) will not be disjoint sets, because everything in FIRST(β) will also be in
- FIRST(Aα). This violates the condition for an LL(1) grammar. Hence, a grammar containing a pair of productions A
- → Aα | β (i.e., a left-recursive grammar) cannot be LL(1).
- Now, let X be a nullable nonterminal that derives to at least two terminal strings. Show that in LL(1) grammar, no
- production rule can have two consecutive occurrences of X on the right side of the production.
- Since X is nullable (X derives ∈) and X also derives at least two terminal strings w1 and w2 , where w1 and w2 are
- strings of terminals, for a grammar using X to be LL(1), it is required that:
- FIRST(w1) ∩ FIRST(w2) = φ
- FIRST(w1) ∩ FOLLOW(X) = φ and FIRST(w2) ∩ FOLLOW(X) = φ
- If this grammar contains a production rule A → αXXβ (a production whose right side has two consecutive occurrences
- of X), then everything in FIRST(X) will also be in FOLLOW(X); and since FIRST(X) contains FIRST(w1) as well as
- FIRST(w2), the second condition will not be satisfied. Hence, a grammar containing a production of the form
- A → αXXβ will never be LL(1), thereby proving that in an LL(1) grammar, no production rule can have two consecutive
- occurrences of X on the right side of the production.
- EXAMPLE 4.5
- Construct a predictive parsing table for the following grammar, where S′ is the start symbol and # is the end marker.
- Here, # is taken as one of the grammar symbols. And therefore, the initial configuration of the parser will be (S′, w#),
- where the first member of the pair is the contents of the stack and the second member is the contents of the input buffer.
- Therefore, by substituting in (I), we get:
- 1. Using S′ → S# we get:
- 2. Using S → qABC we get:
-    Substituting in (II) we get:
- 3. Using A → bbD we get:
- Therefore, the parsing table is derived as shown in Table 4.7.
- Table 4.7: Production Selections for Example 4.5 Parsing Derivations
-        q           a        b          c        #
- S′     S′ → S#
- S      S → qABC
- A                  A → a    A → bbD
- B                  B → a    B → ∈               B → ∈
- C                           C → b               C → ∈
- D                  D → ∈    D → ∈      D → c    D → ∈
- EXAMPLE 4.6
- Construct predictive parsing table for the following grammar:
- Since the grammar is ∈ -free, FOLLOW sets are not required to be computed in order to enter the productions into the
- parsing table. Therefore the parsing table is as shown in Table 4.8.
- Table 4.8: Production Selections for Example 4.6 Parsing Derivations
-        a         b          f        g        d
- S      S → A                                  S → A
- A      A → aS                                 A → d
- B                B → bBC    B → f
- C                                    C → g
- EXAMPLE 4.7
- Construct a predictive parsing table for the following grammar, where S is a start symbol.
- 1. Using S → iEtSS1:
- 2. Using S1 → eS:
- Therefore, the parsing table is as shown in Table 4.9.
- Table 4.9: Production Selections for Example 4.7 Parsing Derivations
-        i             a        b        e                   t    $
- S      S → iEtSS1    S → a
- S1                                     S1 → eS, S1 → ∈          S1 → ∈
- E                             E → b
- EXAMPLE 4.8
- Construct an LL(1) parsing table for the following grammar:
- Computation of FIRST and FOLLOW:
- Therefore by substituting in (I) we get:
- 1. Using the production S → aBDh, we get:
- 2. Using the production B → cC, we get:
- 3. Using the production C → bC, we get:
- 4. Using the production D → EF, we get:
- Therefore, the parsing table is as shown in Table 4.10.
- Table 4.10: Production Selections for Example 4.8 Parsing Derivations
-        a           b         c         g        f        h        $
- S      S → aBDh
- B                            B → cC
- C                  C → bC              C → ∈    C → ∈    C → ∈
- D                                      D → EF   D → EF   D → EF
- E                                      E → g    E → ∈    E → ∈
- F                                               F → f    F → ∈
- Chapter 5: Bottom-up Parsing
- 5.1 WHAT IS BOTTOM-UP PARSING?
- Bottom-up parsing can be defined as an attempt to reduce the input string w to the start symbol of a grammar by
- tracing out the right-most derivations of w in reverse. This is equivalent to constructing a parse tree for the input string
- w by starting with leaves and proceeding toward the root—that is, attempting to construct the parse tree from the
- bottom, up. This involves searching for the substring that matches the right side of any of the productions of the
- grammar. This substring is replaced by the left-hand-side nonterminal of the production if this replacement leads to the
- generation of the sentential form that comes one step before in the right-most derivation. This process of replacing the
- right side of the production by the left side nonterminal is called "reduction". Hence, reduction is nothing more than
- performing derivations in reverse. The reason why bottom-up parsing tries to trace out the right-most derivations of an
- input string w in reverse and not the left-most derivations is because the parser scans the input string w from the left to
- right, one symbol/token at a time. And to trace out right-most derivations of an input string w in reverse, the tokens of w
- must be made available in a left-to-right order. For example, if the right-most derivation sequence of some w is:
- S → α1 → α2 → … → αn−1 → w
- then the bottom-up parser starts with w and searches for the occurrence of a substring of w that matches the right side
- of some production A → β such that the replacement of β by A will lead to the generation of αn−1. The parser replaces
- β by A; then it searches for the occurrence of a substring of αn−1 that matches the right side of some production B → γ
- such that the replacement of γ by B will lead to the generation of αn−2. This process continues until the entire string w
- is reduced to S, or until the parser encounters an error.
- Therefore, bottom-up parsing involves the selection of a substring that matches the right side of a production, whose
- reduction to the nonterminal on the left side of the production represents one step along the reverse of a right-most
- derivation; that is, it leads to the generation of the previous right sentential form. This means that selecting a
- substring that matches the right side of a production is not enough; the position of this substring in the sentential form is
- also important.
- Tip The substring should occur in the sentential form currently under consideration, and in a position such that, if it is
- replaced by the left-side nonterminal of the production, the replacement leads to the generation of the previous right
- sentential form of the currently considered sentential form. Therefore, finding a substring that matches the right
- side of a production, as well as its position in the current sentential form, are both equally important. In order to take
- both of these factors into account, we define a "handle" of the right sentential form.
- 5.2 A HANDLE OF A RIGHT SENTENTIAL FORM
- A handle of a right sentential form γ is a production A → β together with a position in γ where β may be found and
- replaced by A to produce the previous right sentential form in a right-most derivation of γ. That is, if S ⇒ αAw ⇒ αβw,
- then A → β, in the position following α, is a handle of αβw. Consider the grammar:
- and the right-most derivation:
- The handles of the sentential forms occurring in the above derivation are shown in Table 5.1.
- Table 5.1: Sentential Form Handles
- Sentential Form    Handle
- id + id * id       E → id at the position preceding +
- E + id * id        E → id at the position following +
- E + E * id         E → id at the position following *
- E + E * E          E → E * E at the position following +
- E + E              E → E + E at the position preceding the endmarker
- Therefore, bottom-up parsing is simply an attempt to detect the handle of a right sentential form. And whenever a
- handle is detected, the reduction is performed. This is equivalent to performing right-most derivations in reverse and is
- called "handle pruning".
- Therefore, if the right-most derivation sequence of some w is S → α1 → α2 → α3 → … → αn−1 → w, then handle
- pruning starts with w, the nth right sentential form; the handle βn of w is located, and βn is replaced by the left side of
- some production An → βn in order to obtain αn−1. By continuing this process, if the parser obtains a right sentential
- form that consists of only the start symbol, then it halts and announces the successful completion of parsing.
- EXAMPLE 5.1
- Consider the following grammar, and show the handle of each right sentential form for the string (a,(a, a)).
- The right-most derivation of the string (a, (a, a)) is:
- Table 5.2 presents the handles of the sentential forms occurring in the above derivation.
- Table 5.2: Sentential Form Handles
- Sentential Form    Handle
- (a, (a, a))        S → a at the position preceding the first comma
- (S, (a, a))        L → S at the position preceding the first comma
- (L, (a, a))        S → a at the position preceding the second comma
- (L, (S, a))        L → S at the position preceding the second comma
- (L, (L, a))        S → a at the position following the second comma
- (L, (L, S))        L → L, S at the position following the second left bracket
- (L, (L))           S → (L) at the position following the first comma
- (L, S)             L → L, S at the position following the first left bracket
- (L)                S → (L) at the position before the endmarker
- 5.3 IMPLEMENTATION
- A convenient way to implement a bottom-up parser is to use a shift-reduce technique: a parser goes on shifting the
- input symbols onto the stack until a handle comes on the top of the stack. When a handle appears on the top of the
- stack, it performs reduction. This implementation makes use of a stack to hold grammar symbols and an input buffer to
- hold the string w to be parsed, which is terminated by the right endmarker $, the same symbol used to mark the bottom
- of the stack. The configuration of the parser is given by a token pair: the first component is the stack contents,
- and the second component is the unexpended input.
- Initially, the parser will be in the configuration given by the pair ($, w$); that is, the stack is initially empty, and the
- buffer contains the entire string w. The parser shifts zero or more symbols from the input on to the stack until handle α
- appears on the top of the stack. The parser then reduces α to the left side of the appropriate production. This cycle is
- repeated until the parser either detects an error or until the stack contains a start symbol and the input is empty, giving
- the configuration ($S, $). If the parser enters ($S, $), then it announces the successful completion of parsing. Thus,
- the primary operation of the parser is to shift and reduce.
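- The shift-reduce cycle itself is simple; what varies is how the handle is detected. The following C sketch uses a toy
- grammar, S → (S) | a, and a naive handle-detection rule (match some production's right side against the top of the
- stack), which happens to work for this grammar; as discussed later in this section, practical shift-reduce parsers
- replace this naive search with precedence relations or an LR automaton. All names are illustrative.
- #include <stdio.h>
- #include <string.h>
- static const char *rhs[] = { "(S)", "a" };   /* right sides of productions */
- static const char  lhs[] = { 'S',   'S' };   /* corresponding left sides   */
- int main(void)
- {
-     const char *w = "((a))$";
-     char stack[64] = "$";        /* $ marks the bottom of the stack */
-     int top = 0, ip = 0;
-     for (;;) {
-         int reduced = 0;
-         for (int k = 0; k < 2; k++) {        /* is a handle on the top?  */
-             int n = (int)strlen(rhs[k]);
-             if (top >= n && strncmp(stack + top - n + 1, rhs[k], n) == 0) {
-                 top -= n;                    /* pop the handle           */
-                 stack[++top] = lhs[k];       /* push the left side       */
-                 printf("reduce by %c -> %s\n", lhs[k], rhs[k]);
-                 reduced = 1;
-                 break;
-             }
-         }
-         if (reduced) continue;
-         if (top == 1 && stack[top] == 'S' && w[ip] == '$') {
-             puts("accept");                  /* configuration ($S, $)    */
-             return 0;
-         }
-         if (w[ip] == '$') { puts("error"); return 1; }
-         printf("shift %c\n", w[ip]);
-         stack[++top] = w[ip++];              /* shift the next symbol    */
-     }
- }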
- For example consider the bottom-up parser for the grammar having the productions:
- and the input string: id+id * id. The various steps in parsing this string are shown in Table 5.3 in terms of the contents
- of the stack and unspent input.
- Table 5.3: Steps in Parsing the String id + id * id
- Stack Contents    Input         Moves
- $                 id + id*id$   shift id
- $id               + id*id$      reduce by F → id
- $F                + id*id$      reduce by T → F
- $T                + id*id$      reduce by E → T
- $E                + id*id$      shift +
- $E +              id*id$        shift id
- $E + id           *id$          reduce by F → id
- $E + F            *id$          reduce by T → F
- $E + T            *id$          shift *
- $E + T *          id$           shift id
- $E + T * id       $             reduce by F → id
- $E + T * F        $             reduce by T → T * F
- $E + T            $             reduce by E → E + T
- $E                $             accept
- Shift-reduce implementation does not tell us anything about the technique used for detecting the handles; hence, it is
- possible to make use of any suitable technique to detect handles. Depending upon the technique that is used to detect
- handles, we get different shift-reduce parsers. For example, an operator-precedence parser is a shift-reduce parser
- that uses the precedence relationships between certain pairs of terminals to guide the selection of handles, whereas an
- LR parser makes use of a deterministic finite automaton that recognizes the set of all viable prefixes; by reading the
- stack from bottom to top, it determines what handle, if any, is on the top of the stack.
- 5.4 THE LR PARSER
- The LR parser is a shift-reduce parser that makes use of a deterministic finite automaton that recognizes the set of all
- viable prefixes by reading the stack from bottom to top. It determines what handle, if any, is available. A viable prefix of
- a right sentential form is a prefix that does not extend past the right end of the handle of that sentential form. Therefore, if a
- finite-state machine that recognizes viable prefixes of the right sentential forms is constructed, it can be used to guide
- the handle selection in the shift-reduce parser.
- Since the LR parser makes use of a DFA that recognizes viable prefixes to guide the selection of handles, it must keep
- track of the states of the DFA. Hence, the LR parser stack contains two types of symbols: state symbols used to
- identify the states of the DFA, and grammar symbols. The parser starts with the initial state I0 of the DFA on the stack.
- The parser operates by looking at the next input symbol a and the state symbol Ii on the top of the stack. If there is a
- transition from the state Ii on a in the DFA going to state Ij , then it shifts the symbol a, followed by the state symbol Ij ,
- onto the stack. If there is no transition from Ii on a in the DFA, and if the state Ii on the top of the stack, when entered,
- recognizes a viable prefix that contains the handle A → α, then the parser carries out the reduction by popping α
- and pushing A onto the stack. This is equivalent to making a backward transition from Ii on α in the DFA and then
- making a forward transition on A. Every shift action of the parser corresponds to a transition on a terminal symbol in
- the DFA. Therefore, the current state of the DFA and the next input symbol determine whether the parser shifts the
- next input symbol or goes for a reduction.
- If we construct a table mapping every state and input symbol pair as either "shift," "reduce," "accept," or "error," we get
- a table that can be used to guide the parsing process. Such a table is called a parsing "action" table. When carrying
- out the reduction by A → α, the parser has to pop α and push A onto the stack. This requires knowing where the
- transition on the nonterminal A goes in the DFA from the state brought onto the top of the stack after popping α; and
- hence, we require another table mapping every state and nonterminal pair into a state. The table of transitions on
- the nonterminals in the DFA is called a "goto" table. Therefore, to create an LR parser we require an Action|GOTO
- table.
- If the current state of the DFA has a transition on the terminal symbol a to the state Ij , then the next move will be to shift
- the symbol a and enter the state Ij . But if the current state of the DFA is one that, when entered, recognizes a
- viable prefix containing the handle, then the next move of the parser will be to reduce.
- Therefore, an LR parser is comprised of an input buffer (which holds the input string w to be parsed, assumed to
- be terminated by the right endmarker $), a stack holding the viable prefixes of the right sentential forms, and a parsing
- table that is obtained by mapping the moves of a DFA that recognizes viable prefixes and that controls the parsing actions.
- The configuration of the parser is given by a token pair: the first component is the stack contents, and the second component
- is the unexpended input. If, at a particular instant ($ is also used as the bottom-of-the-stack marker), the parser is
- configured as follows:
- where Ii is a state symbol identifying a state of the DFA recognizing the viable prefixes, and Xi is a grammar symbol.
- The parser consults the parsing action table entry [Im , ai ]. If action[Im , ai ] = Sj , then the parser shifts the next input
- symbol followed by the state Ij onto the stack and enters into the configuration:
- If action[Im , ai ] = reduce by production A → α, then the parser carries out the reduction as follows. If |α| = r, then the
- parser pops 2r symbols off the stack (because every shift action shifts a grammar symbol as well as a state
- symbol), thereby bringing Im−r to the top. It then consults the goto table entry goto[Im−r , A]. If goto[Im−r , A] = Ik , then
- it shifts A followed by Ik onto the stack, thereby entering into the configuration:
- If action[Im , ai ] = accept, then the parser halts and accepts the input string. If action[Im , ai ] = error, then the parser
- invokes a suitable error-recovery routine. Initially, the parser will be in the configuration given by the pair ($I0 , w$).
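- The following C sketch implements this driver loop, assuming the grammar S → CC, C → cC | d (numbered 1: S → CC,
- 2: C → cC, 3: C → d) and the action and goto entries worked out for it in Example 5.2 below. Only the state symbols
- are kept on the stack here: since the grammar symbols are redundant for choosing moves, popping r state entries
- stands in for popping the 2r symbols described above. The encodings are illustrative.
- #include <stdio.h>
- #define ACC 99                        /* marker for the accept action  */
- enum { Tc, Td, Tend };                /* terminal indices: c, d, $     */
- /* action[state][terminal]: >0 shift to that state, <0 reduce by
-    production -value, 0 error, ACC accept                             */
- static const int action[7][3] = {
-     /*        c    d    $   */
-     /* I0 */ {  3,   4,   0   },
-     /* I1 */ {  0,   0,  ACC  },
-     /* I2 */ {  3,   4,   0   },
-     /* I3 */ {  3,   4,   0   },
-     /* I4 */ { -3,  -3,  -3   },
-     /* I5 */ {  0,   0,  -1   },
-     /* I6 */ { -2,  -2,  -2   },
- };
- /* goto table on the nonterminals S (index 0) and C (index 1) */
- static const int go[7][2] = { {1,2}, {0,0}, {0,5}, {0,6}, {0,0}, {0,0}, {0,0} };
- static const int plhs[] = { 0, 0, 1, 1 };   /* lhs index per production */
- static const int plen[] = { 0, 2, 2, 1 };   /* right-side length        */
- static int term(char ch) { return ch == 'c' ? Tc : ch == 'd' ? Td : Tend; }
- int main(void)
- {
-     const char *w = "cdd$";
-     int states[64], top = 0, ip = 0;
-     states[0] = 0;                          /* start in the state I0 */
-     for (;;) {
-         int act = action[states[top]][term(w[ip])];
-         if (act == ACC)    { puts("accept"); return 0; }
-         else if (act > 0)  { states[++top] = act; ip++; }       /* shift  */
-         else if (act < 0) {                                     /* reduce */
-             int p = -act;
-             printf("reduce by production %d\n", p);
-             top -= plen[p];                 /* pop the handle's states    */
-             states[top + 1] = go[states[top]][plhs[p]];
-             top++;
-         }
-         else { puts("error"); return 1; }
-     }
- }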
- Therefore, we conclude that parsing table construction involves constructing a DFA that recognizes the viable prefixes
- of the right sentential forms, using the given grammar, and then mapping its moves into the form of the Action|GOTO
- table. To construct such a DFA, we make use of the items that are part of a grammar's productions. Here, an item,
- called an "LR(0)" item of a production, is a production with a dot placed at some position on the right side of the production.
- For example, if A → XYZ is a production, then the following items can be generated from it:
- A → .XYZ    A → X.YZ    A → XY.Z    A → XYZ.
- If the length of the right side of the production is n, then there are (n+1) different positions on the right side of the
- production where a dot can be placed. Hence, the number of items that can be generated is (n+1).
- The dot's position on the right side tells us how much of the right-hand side of the production is seen in the process of
- parsing. For example, the item A → X.YZ tells us that we have already seen a string derivable from X in the input and
- expect to see the string derivable from YZ next in the input.
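- In a program, an item can be represented simply as a production number paired with a dot position; the following C
- sketch (an illustrative encoding, not the text's notation) enumerates the (n+1) items of A → XYZ.
- #include <stdio.h>
- #include <string.h>
- struct item { int prod; int dot; };   /* dot ranges over 0 .. strlen(rhs) */
- static const char *lhs[] = { "A" };
- static const char *rhs[] = { "XYZ" };
- static void print_item(struct item it)
- {
-     printf("%s -> ", lhs[it.prod]);
-     for (int i = 0; rhs[it.prod][i]; i++) {
-         if (i == it.dot) putchar('.');    /* dot before the i-th symbol */
-         putchar(rhs[it.prod][i]);
-     }
-     if (it.dot == (int)strlen(rhs[it.prod])) putchar('.');
-     putchar('\n');
- }
- int main(void)
- {
-     int n = (int)strlen(rhs[0]);
-     for (int d = 0; d <= n; d++) {        /* n+1 dot positions, as stated */
-         struct item it = { 0, d };
-         print_item(it);
-     }
-     return 0;
- }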
- 5.4.1 Augmented Grammar
- To construct a DFA that recognizes the viable prefixes, we make use of augmented grammar, which is defined as
- follows: if G = (V, T, P, S) is a given grammar, then the augmented grammar will be G 1 = (V ∪ {S 1 }, T, P ∪ {S 1 → S},
- S 1 ); that is, we add a unit production S 1 → S to the grammar G and make S 1 the new start symbol. The resulting
- grammar will be an augmented grammar. The purpose of augmenting the grammar is to make it explicitly clear to the
- parser when to accept the string. Parsing will stop when the parser is on the verge of carrying out the reduction using
- S1 → S. An NFA that recognizes the viable prefixes will be a finite automaton whose states correspond to the production
- items of the augmented grammar. Every item represents one state in the automaton, with the initial state corresponding
- to the item S1 → .S. The transitions in the automaton are defined as follows:
- δ(A → α.Xβ, X) = A → αX.β
- δ(A → α.Bβ, ∈) = B → .γ (This transition is required because, if the current state is A → α.Bβ, we have not
- yet seen a string derivable from the nonterminal B; and since B → γ is a production of the grammar, unless we see γ,
- we will not get B. Therefore, we have to travel the path that recognizes γ, which requires entering the state
- identified by B → .γ without consuming any input symbols.)
- This NFA can then be transformed into a DFA using the subset construction method. For example, consider the
- following grammar:
- The augmented grammar is:
- The items that can be generated using these productions are:
- Therefore, the transition diagram of the NFA that recognizes viable prefixes is as shown in Figure 5.1.
- Figure 5.1: NFA transition diagram recognizes viable prefixes.
- The DFA equivalent of the NFA shown in Figure 5.1 is, by using subset construction, illustrated in Figure 5.2.
- Figure 5.2: Using subset construction, a DFA equivalent is derived from the transition diagram in Figure 5.1.
- Therefore, every state of the DFA that recognizes viable prefixes is a set of items; and hence, the set of DFA states
- will be a collection of sets of items. But an arbitrary collection of sets of items will not correspond to the DFA set of
- states. A set of items that corresponds to the states of a DFA that recognizes viable prefixes is called a "canonical
- collection". Therefore, construction of the DFA involves finding the canonical collection. An algorithm exists that directly
- obtains the canonical collection of LR(0) sets of items, thereby allowing us to obtain the DFA. Using this algorithm, we
- can directly obtain a DFA that recognizes the viable prefixes, rather than going through the NFA-to-DFA transformation
- explained above. The algorithm for finding the canonical collection of LR(0) sets of items makes use of the closure
- and goto functions. The set closure(I), where I is a set of items, is computed as follows:
- 1. Add every item in I to closure(I).
- 2. Repeat
-        For every item of the form A → α.Bβ in closure(I) do
-            For every production B → γ do
-                Add B → .γ to closure(I)
-    Until no new item can be added to closure(I)
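- A C sketch of this closure computation follows, reusing the production-number/dot-position representation of items
- suggested earlier; the grammar encoded here (Z → S, S → CC, C → cC | d, with Z standing in for the augmented start
- symbol) and all names are illustrative assumptions.
- #include <stdio.h>
- #include <ctype.h>
- #define NP 4
- #define MAXI 32
- struct item { int prod; int dot; };
- static const char  lhs[NP] = { 'Z', 'S', 'C', 'C' };
- static const char *rhs[NP] = { "S", "CC", "cC", "d" };
- static int has(struct item *I, int n, struct item it)
- {
-     for (int i = 0; i < n; i++)
-         if (I[i].prod == it.prod && I[i].dot == it.dot) return 1;
-     return 0;
- }
- /* closure(I): keep adding B -> .gamma for every item A -> alpha.B beta */
- static int closure(struct item *I, int n)
- {
-     for (int i = 0; i < n; i++) {                /* n grows as items are added */
-         char B = rhs[I[i].prod][I[i].dot];       /* the symbol after the dot   */
-         if (!isupper((unsigned char)B)) continue;   /* only nonterminals       */
-         for (int k = 0; k < NP; k++)
-             if (lhs[k] == B) {
-                 struct item it = { k, 0 };
-                 if (!has(I, n, it)) I[n++] = it;
-             }
-     }
-     return n;
- }
- int main(void)
- {
-     struct item I[MAXI] = { { 0, 0 } };          /* start from Z -> .S */
-     int n = closure(I, 1);
-     for (int i = 0; i < n; i++)
-         printf("%c -> %.*s.%s\n", lhs[I[i].prod],
-                I[i].dot, rhs[I[i].prod], rhs[I[i].prod] + I[i].dot);
-     return 0;
- }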
- For example, consider the following grammar:
- goto(I, X) = closure({ A → αX.β | A → α.Xβ is in I })
- That is, to find out goto from I on X, first identify all the items in I in which the dot precedes X on the right side. Then,
- move the dot in all the selected items one position to the right (i.e., over X), and then take a closure of the set of these
- items.
- 5.4.2 An Algorithm for Finding the Canonical Collection of Sets of LR(0) Items
- /* Let C be the canonical collection of sets of LR(0) items. We maintain C new and C old to continue the iterations*/
- Input: augmented grammar
- Output: canonical collection of sets of LR(0) items (i.e., set C)
- 1. C old = φ
- 2. add closure({S 1 → .S}) to C new
- 3. while C old ≠ C new do
-    {
-        C old = C new
-        for every set of items I in C old do
-            for every grammar symbol X do
-                add goto(I, X) to C new
-    }
- 4. C = C new
- For example consider the following grammar:
- The augmented grammar is:
- Initially, C old = φ . First we obtain:
- We call it I 0 and add it to C new . Therefore:
- In the first iteration, we obtain the goto from I 0 on every grammar symbol, as shown below:
- Add it to C new :
- Add it to C new :
- Add it to C new :
- Add it to C new :
- Therefore, at the end of first iteration:
- In the second iteration:
- So, in the second iteration, we obtain the goto from {I1 , I2 , I3 , I4 } on every grammar symbol, as shown below:
- Add it to C new :
- Add it to C new :
- Therefore, at the end of the second iteration:
- In the third iteration:
- In the third iteration, we obtain goto from {I 5 , I 6 } on every grammar symbol, as shown below:
- Add it to C new :
- Add it to C new :
- Therefore, at the end of the third iteration:
- In the fourth iteration:
- So, in the fourth iteration, we obtain a goto from {I 7 , I 8 } on every grammar symbol, as shown below:
- At the end of fourth iteration:
- The transition diagram of the DFA is shown in Figure 5.3.
- Figure 5.3: DFA transition diagram showing four iterations for a canonical collection of sets.
- 5.4.3 Construction of a Parsing Action|GOTO Table for an SLR(1) Parser
- The methods for constructing the parsing Action|GOTO table are described below.
- Construction of the Action Table
- 1. For every state Ii in C do
-        for every terminal symbol a do
-            if goto(Ii , a) = Ij , then
-                make action[Ii , a] = Sj /* for shift and enter into the state j */
- 2. For every state Ii in C whose underlying set of LR(0) items contains an item of the form A → α. do
-        for every b in FOLLOW(A) do
-            make action[Ii , b] = Rk /* where k is the number of the production A → α, standing for reduce by A → α */
- 3. Make action[Ii , $] = accept if Ii contains an item S1 → S.
- It is obvious that if a state I i has a transition on a terminal a going to I j , then the parser's next move will be to shift and
- enter into state j. Therefore, the shift entries in the action table are the mappings of the transitions in the DFA on
- terminals. Similarly, if state Ii corresponds to a viable prefix that contains the right side of the production A → α, then
- the parser will call for a reduction by A → α on all those symbols that are in FOLLOW(A). This is because if the next
- input symbol happens to be a terminal that is in FOLLOW(A), then only the reduction by A → α may lead to a
- previous right-most derivation. That is, if the next input symbol belongs to FOLLOW(A), then the position of α can be
- considered to be one where, if α is replaced by A, we might get the previous right-most derivation. Whether or not A
- → α is a handle is decided in this manner.
- The initial state is the one whose underlying set of items contains the item S1 → .S. This method is
- called "SLR(1)", for Simple LR; the (1) indicates a lookahead of length one (the next symbol used by the parser to
- decide its next move). Therefore, this parsing table is an SLR parsing table. (When the parentheses are not
- specified, the length of the lookahead is assumed to be one.)
- Construction of the Goto Table
- A goto table is simply a mapping of transitions in the DFA on nonterminals. Therefore, it is constructed as follows:
- For every Ii in C do
-     for every nonterminal A do
-         if goto(Ii , A) = Ij then
-             make GOTO[Ii , A] = j
- Therefore, the SLR parsing table for the grammar having the following productions is shown in Table 5.4.
- Table 5.4: Action|GOTO SLR Parsing Table
-        Action Table                     GOTO Table
-        id     +      *      $           E    T    F
- I0     S4                               1    2    3
- I1            S5            Accept
- I2            R2     S6     R2
- I3            R4     R4     R4
- I4            R5     R5     R5
- I5     S4                                    7    3
- I6     S4                                         8
- I7            R1     S6     R1
- I8            R3     R3     R3
- The productions are numbered as:
- EXAMPLE 5.2
- Consider the following grammar:
- The augmented grammar is:
- The canonical collection of sets of LR(0) items are computed as follows.
- The transition diagram of the DFA is shown in Figure 5.4.
- Figure 5.4: Transition diagram for Example 5.2 DFA.
- Therefore, the grammar has the following productions:
- which are numbered as:
- has an SLR parsing table as shown in Table 5.5.
- Table 5.5: SLR Parsing Table
-        Action Table              GOTO Table
-        c      d      $           S    C
- I0     S3     S4                 1    2
- I1                   accept
- I2     S3     S4                      5
- I3     S3     S4                      6
- I4     R3     R3     R3
- I5                   R1
- I6     R2     R2     R2
- By using the method discussed above, a parsing table can be obtained for any grammar. But the action table
- obtained by this method will not necessarily be free of multiple entries for every grammar. Therefore, we
- define an SLR(1) grammar as one for which we can obtain an action table without multiple entries by using the method
- described. If the action table contains multiple entries, then the grammar for which the table is obtained is not an SLR(1)
- grammar.
- For example, consider the following grammar:
- The augmented grammar will be:
- The canonical collection sets of LR(0) items are computed as follows.
- The transition diagram for the DFA is shown in Figure 5.5.
- Figure 5.5: DFA Transition diagram.
- Table 5.6 shows the SLR parsing table for the grammar having the following productions:
- Table 5.6: Action|GOTO SLR Parsing Table
-        Action Table                        GOTO Table
-        a         b         $               S    A    B
- I0     R3/R4     R3/R4                     1    2    3
- I1                         Accept
- I2     S4
- I3               S5
- I4     R3        R3                             6
- I5     R4        R4                                  7
- I6               S8
- I7     S9
- I8                         R1
- I9                         R2
- The productions are numbered as follows:
- Since the action table shown in Table 5.6 contains multiple entries, the above grammar is not SLR(1).
- SLR(1) grammars constitute a small subset of context-free grammars, so an SLR parser can only succeed on a small
- number of context-free grammars. That is, an SLR parser is a less-powerful LR parser. (The power of a parser is
- measured in terms of the number of grammars on which it can succeed.) This is because when an SLR parser sees the
- right side α of a production A → α on the top of the stack, it replaces α by the left-side nonterminal A if
- the next input symbol is in FOLLOW(A). But sometimes this reduction may not lead to the generation of the
- previous right-most derivation. For example, the parser constructed above can do the reduction by the production A
- → ∈ in the state I0 if the next input symbol happens to be either a or b, because both a and b are in FOLLOW(A).
- But since the reduction by A → ∈ in I0 leads to the generation of the first instance of A in the sentential form AaAb, this
- reduction proves to be a proper one only if the next input symbol is a, because the first instance of A in the sentential
- form AaAb is followed by a. But if the next input symbol is b, then this is not a proper reduction, because even though
- b follows A, b follows the second instance of A in the sentential form AaAb. Similarly, if the parser carries out the
- reduction by A → ∈ in the state I4 , then it should be done only if the next input symbol is b, because the reduction by A → ∈ in
- I4 leads to the generation of the second instance of A in the sentential form AaAb.
- Therefore, we conclude that if:
- 1. We let terminal a follow the first instance of A and terminal b follow the second instance of A in the sentential form AaAb;
- 2. We associate terminal a with the item A → . in I0 and terminal b with the item A → . in I4; and
- 3. The parser carries out the reduction by A → ∈ in I0 only on those terminals associated with the item A → . in I0, and the reduction by A → ∈ in I4 only on those terminals associated with the item A → . in I4;
- then there would have been no conflict, and the parser would have been more powerful. But this requires associating a list of terminals (lookaheads) with the items. You may recall (see Chapter 4) that lookaheads are symbols that the parser uses to ‘look ahead’ in the input buffer to decide whether or not a reduction is to be done. That is, we have to work with items of the form A → α.Xβ, a. Such an item is called an LR(1) item, because the length of its lookahead is one; an item without a lookahead (lookahead length zero) is an LR(0) item. In the SLR method, we were working with LR(0) items. In general, we define an LR(k) item to be an item using lookaheads of length k. So an LR(1) item is comprised of two parts: the LR(0) item and the lookahead associated with the item.
- Note We conclude that if we work with LR(1) items instead of LR(0) items, then every state of the parser will correspond to a set of LR(1) items. When the parser looks ahead in the input buffer to decide whether a reduction is to be done, the information about the terminals on which to reduce is available in the state of the parser itself, which is not the case with an SLR parser state. Hence, with LR(1), we get a more powerful parser.
- Therefore, if we modify the closure and the goto functions to work suitably with the LR(1) items, by allowing them to compute the lookaheads, we can obtain the canonical collection of sets of LR(1) items, and from this we can obtain the parsing Action|GOTO table. The closure(I), where I is a set of LR(1) items, is computed as follows:
- 1. Add every item in I to closure(I).
- 2. Repeat
-        for every item of the form A → α.Bβ, a in closure(I) do
-            for every production B → γ do
-                add B → .γ, b to closure(I) for every b in FIRST(βa)
-    until no new item can be added to closure(I)
- /* The reduction by B → γ generates the B preceding β in the right side of A → αBβ; hence, the reduction by B → γ is proper only on those symbols that are in FIRST(β). But if β derives the empty string, then a will also follow B. Therefore, the lookaheads of B → .γ are FIRST(βa). */
- The goto function for a set I of LR(1) items and a grammar symbol X is defined as:
- goto(I, X) = closure({A → αX.β, a | A → α.Xβ, a is in I})
- That is, to find goto from I on X, first identify all the items in I with a dot preceding X in the LR(0) part of the item. Then move the dot in all the selected items one position to the right (i.e., over X), and take the closure of this new set.
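- To make these two computations concrete, here is a minimal Python sketch. It is not from the book: the grammar representation (a dictionary from a nonterminal to its right sides), the item encoding (head, body, dot position, lookahead), and the simplifying assumption that no nonterminal derives the empty string are all choices made for this sketch.
- # Hypothetical grammar: S1 -> S, S -> CC, C -> cC | d.
- # Each right side is a tuple of grammar symbols.
- GRAMMAR = {
-     "S1": [("S",)],
-     "S": [("C", "C")],
-     "C": [("c", "C"), ("d",)],
- }
- # An LR(1) item is (head, body, dot, lookahead), e.g. ("S", ("C", "C"), 0, "$").
- def first_of(symbols):
-     # FIRST of a symbol string; assumes no nullable nonterminals,
-     # which holds for this grammar.
-     x = symbols[0]
-     if x not in GRAMMAR:                  # terminal (or $)
-         return {x}
-     result, seen, stack = set(), set(), [x]
-     while stack:                          # walk leftmost symbols of productions
-         nt = stack.pop()
-         seen.add(nt)
-         for body in GRAMMAR[nt]:
-             lead = body[0]
-             if lead in GRAMMAR:
-                 if lead not in seen:
-                     stack.append(lead)
-             else:
-                 result.add(lead)
-     return result
- def closure(items):
-     # For every A -> alpha . B beta, a in the set, add B -> . gamma
-     # with lookaheads FIRST(beta a), until nothing new is added.
-     items, changed = set(items), True
-     while changed:
-         changed = False
-         for (head, body, dot, la) in list(items):
-             if dot < len(body) and body[dot] in GRAMMAR:
-                 beta = body[dot + 1:]
-                 for b in first_of(beta + (la,)):
-                     for gamma in GRAMMAR[body[dot]]:
-                         item = (body[dot], gamma, 0, b)
-                         if item not in items:
-                             items.add(item)
-                             changed = True
-     return frozenset(items)
- def goto(I, X):
-     # Move the dot over X in every item of I having the dot before X.
-     moved = {(h, b, d + 1, la) for (h, b, d, la) in I
-              if d < len(b) and b[d] == X}
-     return closure(moved) if moved else frozenset()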
- 5.4.4 An Algorithm for Finding the Canonical Collection of Sets of LR(1) Items
- /* Let C be the canonical collection of sets of LR(1) items. We maintain C_new and C_old to continue the iterations. */
- Input: the augmented grammar
- Output: the canonical collection of sets of LR(1) items (i.e., the set C)
- 1. C_old = φ
- 2. add closure({S1 → .S, $}) to C_new
- 3. while C_old ≠ C_new do
-        temp = C_new − C_old
-        C_old = C_new
-        for every I in temp do
-            for every X in V ∪ T (i.e., for every grammar symbol X) do
-                if goto(I, X) is not empty and not in C_new then
-                    add goto(I, X) to C_new
- 4. C = C_new
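- The C_old/C_new iteration above is a standard worklist loop. Continuing the previous sketch (it reuses GRAMMAR, closure, and goto from there), a minimal version is:
- from itertools import chain
- def canonical_collection():
-     # Mirrors the algorithm: start from closure({S1 -> .S, $}) and
-     # apply goto on every grammar symbol until no new set appears.
-     symbols = set(chain.from_iterable(
-         body for bodies in GRAMMAR.values() for body in bodies))
-     c_new = {closure({("S1", ("S",), 0, "$")})}
-     c_old = set()
-     while c_old != c_new:
-         temp = c_new - c_old              # sets not yet expanded
-         c_old = set(c_new)
-         for I in temp:
-             for X in symbols:             # every symbol in V union T
-                 J = goto(I, X)
-                 if J and J not in c_new:
-                     c_new.add(J)
-     return c_new
- print(len(canonical_collection()))        # number of LR(1) states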
- For example, consider the following grammar:
- S → AaAb
- S → BbBa
- A → ∈
- B → ∈
- The augmented grammar will be:
- S1 → S
- S → AaAb
- S → BbBa
- A → ∈
- B → ∈
- The canonical collection of sets of LR(1) items is computed as follows:
- The transition diagram of the DFA is shown in Figure 5.6.
- Figure 5.6: Transition diagram for the canonical collection of sets of LR(1) items.
- 5.4.5 Construction of the Action|GOTO Table for the LR(1) Parser
- The following steps will construct the parsing action table for the LR(1) parser:
- 1. for every state Ii in C do
-        for every terminal symbol a do
-            if goto(Ii, a) = Ij then
-                make action[Ii, a] = Sj /* shift and enter into the state j */
- 2. for every state Ii in C whose underlying set of LR(1) items contains an item of the form A → α., a do
-        make action[Ii, a] = Rk /* where k is the number of the production A → α, standing for reduce by A → α */
- 3. make action[Ii, $] = accept if Ii contains the item S1 → S., $
- The goto table is simply a mapping of transitions in the DFA on nonterminals. Therefore, it is constructed as follows:
- for every Ii in C do
-     for every nonterminal A do
-         if goto(Ii, A) = Ij then
-             make GOTO[Ii, A] = j
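- Continuing the same sketch, the three steps above translate almost directly into code. The production numbering below is a choice made for the sketch, and states are represented by the item sets themselves rather than by numbers:
- PRODUCTIONS = [(h, b) for h, bodies in GRAMMAR.items()
-                for b in bodies if h != "S1"]
- def make_tables(states):
-     number = {p: k + 1 for k, p in enumerate(PRODUCTIONS)}
-     action, goto_table = {}, {}
-     for I in states:
-         for (head, body, dot, la) in I:
-             if dot < len(body):                      # step 1: shifts
-                 X = body[dot]
-                 J = goto(I, X)
-                 if X in GRAMMAR:
-                     goto_table[(I, X)] = J           # GOTO on a nonterminal
-                 else:
-                     action.setdefault((I, X), set()).add(("shift", J))
-             elif head == "S1":                       # step 3: accept
-                 action.setdefault((I, "$"), set()).add(("accept",))
-             else:                                    # step 2: reduce on la
-                 action.setdefault((I, la), set()).add(
-                     ("reduce", number[(head, body)]))
-     # any action entry holding more than one move is a conflict
-     conflicts = {k: v for k, v in action.items() if len(v) > 1}
-     return action, goto_table, conflicts
- action, goto_table, conflicts = make_tables(canonical_collection())
- print("conflicts:", len(conflicts))                  # 0 for this grammar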
- This method is called CLR(1) or LR(1), and it is more powerful than SLR(1). The CLR/LR parsing table for the grammar having the following productions:
- S → AaAb
- S → BbBa
- A → ∈
- B → ∈
- is shown in Table 5.7.
- Table 5.7: CLR/LR Parsing Action|GOTO Table
-        Action Table                     GOTO Table
-        a        b        $              S   A   B
- I0     R3       R4                      1   2   3
- I1                       Accept
- I2     S4
- I3              S5
- I4              R3                          6
- I5     R4                                       7
- I6              S8
- I7     S9
- I8                       R1
- I9                       R2
- The productions are numbered as follows:
- 1. S → AaAb
- 2. S → BbBa
- 3. A → ∈
- 4. B → ∈
- By comparing the SLR(1) parser with the CLR(1) parser, we find that the CLR(1) parser is more powerful. But the CLR(1) parser has a greater number of states than the SLR(1) parser; hence, its storage requirement is also greater. Therefore, we can devise a parser that is intermediate between the two: its power will be in between that of SLR(1) and CLR(1), and its storage requirement will be the same as SLR(1)'s. Such a parser, LALR(1), is much more useful: since each of its states corresponds to a set of LR(1) items, the information about the lookaheads is available in the state itself, making it more powerful than an SLR parser. A state of the LALR(1) parser is obtained by combining those states of the CLR parser that have identical LR(0) (core) items but differ in the lookaheads of their item sets. Therefore, even if there is no reduce-reduce conflict in the states of the CLR parser that are combined to form an LALR parser, a conflict may be generated in the resulting LALR state. That is, we may be able to obtain a CLR parsing table without multiple entries for a grammar, yet when we construct the LALR parsing table for the same grammar, it might have multiple entries.
- 5.4.6 Construction of the LALR Parsing Table
- The steps in constructing an LALR parsing table are as follows:
- 1. Obtain the canonical collection of sets of LR(1) items.
- 2. If more than one set of LR(1) items in the canonical collection has an identical core (LR(0) part) and they differ only in their lookaheads, then combine these sets of LR(1) items to obtain a reduced collection, C1, of sets of LR(1) items.
- 3. Construct the parsing table by using this reduced collection, as follows.
- Construction of the Action Table
- 1. for every state Ii in C1 do
-        for every terminal symbol a do
-            if goto(Ii, a) = Ij then
-                make action[Ii, a] = Sj /* shift and enter into the state j */
- 2. for every state Ii in C1 whose underlying set of LR(1) items contains an item of the form A → α., a do
-        make action[Ii, a] = Rk /* where k is the number of the production A → α, standing for reduce by A → α */
- 3. make action[Ii, $] = accept if Ii contains the item S1 → S., $
- Construction of the Goto Table
- The goto table simply maps transitions on nonterminals in the DFA. Therefore, the table is constructed as follows:
- for every Ii in C1 do
-     for every nonterminal A do
-         if goto(Ii, A) = Ij then
-             make GOTO[Ii, A] = j
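- Step 2, combining the sets of LR(1) items whose LR(0) cores are identical, can be sketched as follows, again continuing the earlier sketch; for the c/d grammar used there, the merge is the classic collapse from ten LR(1) states down to seven LALR states:
- def lalr_collection(states):
-     # Group the LR(1) sets by their LR(0) core and union the items,
-     # which unions the lookaheads of matching core items.
-     by_core = {}
-     for I in states:
-         core = frozenset((h, b, d) for (h, b, d, la) in I)
-         by_core.setdefault(core, set()).update(I)
-     return {frozenset(items) for items in by_core.values()}
- c1 = lalr_collection(canonical_collection())
- print(len(canonical_collection()), "LR(1) states ->", len(c1), "LALR states")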
- For example, consider the following grammar:
- S → AA
- A → aA
- A → b
- The augmented grammar is:
- S1 → S
- S → AA
- A → aA
- A → b
- The canonical collection of sets of LR(1) items is computed as follows:
- We see that the states I3 and I6 have identical LR(0) (core) items and differ only in their lookaheads. The same goes for the pair of states I4, I7 and the pair of states I8, I9. Hence, we can combine I3 with I6, I4 with I7, and I8 with I9 to obtain the reduced collection shown below, where I36 stands for the combination of I3 and I6, I47 stands for the combination of I4 and I7, and I89 stands for the combination of I8 and I9. The transition diagram of the DFA using the reduced collection is shown in Figure 5.7.
- Figure 5.7: Transition diagram for a DFA using a reduced collection.
- Therefore, Table 5.8 shows the LALR parsing table for the grammar having the following productions:
- S → AA
- A → aA
- A → b
- which are numbered as:
- 1. S → AA
- 2. A → aA
- 3. A → b
- Table 5.8: LALR Parsing Table for a DFA Using a Reduced Collection
-        Action Table                     GOTO Table
-        a        b        $              S   A
- I0     S36      S47                     1   2
- I1                       Accept
- I2     S36      S47                         5
- I36    S36      S47                         89
- I47    R3       R3       R3
- I5                       R1
- I89    R2       R2       R2
- 5.4.7 Parser Conflicts
- An LR parser may encounter two types of conflicts: shift-reduce conflicts and reduce-reduce conflicts.
- Shift-Reduce Conflict
- A shift-reduce (S-R) conflict occurs in an SLR parser state if the underlying set of LR(0) item representations contains
- items of the form depicted in Figure 5.8, and FOLLOW(B) contains terminal a.
- Figure 5.8: LR(0) underlying set representations that can cause SLR parser conflicts.
- Similarly, an S-R conflict occurs in a state of the CLR or LALR parser if the underlying set of LR(1) items
- representation contains items of the form shown in Figure 5.9.
- Figure 5.9: LR(1) underlying set representations that can cause CLR/LALR parser conflicts.
- Reduce-Reduce Conflict
- A reduce-reduce (R-R) conflict occurs if the underlying set of LR(0) items representation in a state of an SLR parser
- contains items of the form shown in Figure 5.10, and FOLLOW(A) and FOLLOW(B) are not disjoint sets.
- Figure 5.10: LR(0) underlying set representations that can cause an SLR parser reduce-reduce conflict.
- Similarly an R-R conflict occurs if the underlying set of LR(1) items representation in a state of a CLR or LALR parser
- contains items of the form shown in Figure 5.11.
- Figure 5.11: LR(1) underlying set representations that can cause a CLR/LALR parser reduce-reduce conflict.
- If a set of items' representation contains only nonfinal items, then there is no conflict in the corresponding state. (An item in which the dot is in the right-most position, like A → XYZ., is called a final item; an item in which the dot is not in the right-most position, like A → X.YZ, is called a nonfinal item.)
- Even if there is no R-R conflict in the states of a CLR parser, conflicts may be generated in a state of the LALR parser that is obtained by combining states of the CLR parser. We combine the states of the CLR parser in order to form an LALR state, and the lookaheads of the items in the LALR state are obtained by combining the lookaheads of the corresponding items in the states of the CLR parser. Since reduction depends on the lookaheads, even if there is no R-R conflict in the states of the CLR parser, a conflict may be generated in the state of the LALR parser as a result of this combination. For example, consider the sets of LR(1) items that represent two different states of the CLR(1) parser, as shown in Figure 5.12.
- Figure 5.12: Sets of LR(1) items represent two different CLR(1) parser states.
- There is no R-R conflict in these states. But when we combine these states to form an LALR, the state's set of items
- representation will be as shown in Figure 5.13.
- Figure 5.13: States are combined to form an LALR.
- We see that there is an R-R conflict in this state, because the parser will carry out a reduction by A → α as well as by B → γ on both a and b. In contrast, an S-R conflict is never newly created by combining CLR(1) states: if an S-R conflict exists in the LALR(1) state so obtained, it was already present in one of the CLR(1) states. For example, consider the sets of LR(1) items representing the two different states of the CLR(1) parser, as shown in Figure 5.14.
- Figure 5.14: LR(1) items represent two different states of the CLR(1) parser.
- There is no S-R conflict in these states, and when we combine them, the resulting LALR state set, shown in Figure 5.15, has no S-R conflict either.
- Figure 5.15: LALR state set resulting from the combination of CLR(1) state sets.
- 5.4.8 Handling Ambiguous Grammars
- Since every ambiguous grammar fails to be LR, ambiguous grammars do not belong to the SLR, CLR, or LALR grammar classes. But some ambiguous grammars are quite useful for specifying languages. Hence, the question is how to deal with these grammars in the framework of LR parsing. For example, the natural grammar that specifies nonparenthesized expressions with + and * operators is:
- E → E + E
- E → E * E
- E → id
- But this is an ambiguous grammar, because the precedence and associativity of the operators have not been specified. Even so, we prefer this grammar, because we can easily change the precedence and associativity as required, thereby allowing us more flexibility. If we use an unambiguous grammar instead to specify the same language, it will have the following productions:
- E → E + T
- E → T
- T → T * F
- T → F
- F → id
- A parser for this unambiguous grammar will spend a substantial portion of its time carrying out reductions by the unit productions E → T and T → F. These productions are in the grammar only to enforce associativity and precedence, and they make the parsing time excessive. With the ambiguous grammar, every LR parser construction method will have conflicts; but these conflicts can be resolved by using the precedence and associativity information of + and * as per the language's usage. For example, consider the SLR parser construction for the ambiguous grammar. The augmented grammar is:
- E1 → E
- E → E + E
- E → E * E
- E → id
- The canonical collection of sets of LR(0) items is shown below:
- The transition diagram of the DFA for the augmented grammar is shown in Figure 5.16. The SLR parsing table is
- shown in Table 5.9.
- Figure 5.16: Transition diagram for augmented grammar DFA.
- Table 5.9: SLR Parsing Table for Augmented Grammar
-        Action Table                              GOTO Table
-        +        *        id       $              E
- I0                       S2                      1
- I1     S3       S4                accept
- I2     R3       R3                R3
- I3                       S2                      5
- I4                       S2                      6
- I5     S3/R1    S4/R1             R1
- I6     S3/R2    S4/R2             R2
- We see that there are shift-reduce conflicts in I 5 and I 6 on + as well as *. Therefore, for an input string id + id + id$,
- when the parser enters into the state I 5 , the parser will be in the configuration:
- Hence, the parser can either reduce by E → E + E or shift the + onto the stack and enter the state I3. To resolve this conflict, we make use of associativity: if we want left-associativity, then reduction by E → E + E is the right choice; whereas if we want right-associativity, then shift is the right choice.
- Similarly, if the input string is id + id * id$ when the parser enters into the state I 5 , it will be in the configuration:
- Hence, the parser can either reduce by E → E + E or shift the * onto the stack and enter the state I4. To resolve this conflict, we make use of precedence: if we want a higher precedence for +, then reduction by E → E + E is the right choice; whereas if we want a higher precedence for *, then shift is the right choice.
- Similarly if the input string is id * id + id$ when the parser enters into the state I 6 , it will be in the configuration:
- Hence, the parser can either reduce by E → E * E or shift the + onto the stack and enter the state I3. To resolve this conflict, we make use of precedence: if we want a higher precedence for *, then reduction by E → E * E is the right choice; whereas if we want a higher precedence for +, then shift is the right choice.
- Similarly, if the input string is id * id * id$ when the parser enters into the state I6, the parser will be in the configuration:
- The parser can either reduce by E → E * E or shift the * onto the stack and enter the state I4. To resolve this conflict, we make use of associativity: if we want left-associativity, then reduction by E → E * E is the right choice; if we want right-associativity, then shift is the right choice.
- Therefore, giving * a higher precedence than +, and making both + and * left-associative, we get the SLR parsing table shown in Table 5.10.
- Table 5.10: SLR Parsing Table Reflecting Higher Precedence and Left-Associativity
-        Action Table                              GOTO Table
-        +        *        id       $              E
- I0                       S2                      1
- I1     S3       S4                Accept
- I2     R3       R3                R3
- I3                       S2                      5
- I4                       S2                      6
- I5     R1       S4                R1
- I6     R2       R2                R2
- Therefore, we have a way to deal with ambiguous grammars: we can make use of disambiguating rules, based on precedence and associativity, to resolve parsing action conflicts.
- 5.5 DATA STRUCTURES FOR REPRESENTING PARSING TABLES
- Since the goto table contains only a few significant entries, separate data structures are used for the action table and the goto table. These data structures are described below.
- Representing the Action Table
- One of the simplest ways to represent the action table is to use a two-dimensional array. But since many rows of the action table are identical, we can save considerable space (at a negligible cost in processing time) by creating an array of pointers indexed by state; pointers for states with the same actions then point to the same location, as shown in Figure 5.17.
- Figure 5.17: States with actions in common point to the same location via an array.
- To access the information, we assign each terminal a number from zero to one less than the number of terminals, and we use this integer as an offset from the pointer value for the state. A further reduction in space is possible, at the expense of speed, by creating a list of actions for each state. Each node of the list holds a terminal symbol and the action for that terminal; the most frequent action, such as the error action, can be placed at the end of the list as a default. For example, for the state I0 in Table 5.10, the list will be as shown in Figure 5.18.
- Figure 5.18: List that incorporates the ability to append actions.
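- Both space-saving ideas can be sketched in a few lines of Python. This is an illustration, not the book's layout: the terminal numbering, the intern_row helper, and the sample rows are hypothetical.
- TERMINALS = ["+", "*", "id", "$"]          # numbered 0..3 by position
- TERM_NUM = {t: i for i, t in enumerate(TERMINALS)}
- rows = {}           # row contents -> the single shared row object
- state_row = []      # state number -> its (possibly shared) row
- def intern_row(row):
-     return rows.setdefault(tuple(row), tuple(row))
- # Two states with identical actions share one stored row:
- state_row.append(intern_row(["R1", "S4", None, "R1"]))
- state_row.append(intern_row(["R1", "S4", None, "R1"]))
- assert state_row[0] is state_row[1]        # only one copy is stored
- def action(state, terminal):
-     # the terminal's number is the offset into the state's row
-     return state_row[state][TERM_NUM[terminal]]
- # The list form trades speed for space: per-state (terminal, action)
- # pairs, with the most frequent action (here, error) as the default.
- state_lists = [[("id", "S2")]]             # like state I0 of Table 5.10
- def action_from_list(state, terminal, default="error"):
-     for t, a in state_lists[state]:
-         if t == terminal:
-             return a
-     return default
- print(action(0, "+"), action_from_list(0, "$"))   # R1 error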
- Representing the GOTO Table
- An efficient way to represent the goto table is to make a list of pairs for each nonterminal A, where each pair is of the form:
- goto(current-state, A) = next-state
- Since the error entries in the goto table are never consulted, we can replace each error entry by the most common nonerror entry in its column; this default entry is represented by a pair with any in place of current-state.
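- A small sketch of this representation follows; the entries are hypothetical, chosen only to show the any default in action:
- # For each nonterminal, a list of (current_state, next_state) pairs;
- # the pair whose current_state is ANY supplies the column default that
- # replaces the never-consulted error entries.
- ANY = "any"
- GOTO_LISTS = {
-     "E": [(0, 1), (3, 5), (ANY, 6)],
- }
- def goto_entry(nonterminal, state):
-     default = None
-     for cur, nxt in GOTO_LISTS[nonterminal]:
-         if cur == state:
-             return nxt
-         if cur == ANY:
-             default = nxt
-     return default
- print(goto_entry("E", 3))   # 5
- print(goto_entry("E", 4))   # falls back to the column default, 6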
- 5.6 WHY LR PARSING IS ATTRACTIVE
- There are several reasons why LR parsers are attractive:
- 1. An LR parser can be constructed to recognize virtually all programming language constructs for which a CFG can be written.
- 2. The LR parsing method is the most general nonbacktracking shift-reduce method known; yet it can be implemented as efficiently as any other method.
- 3. The class of grammars that can be parsed by using the LR method is a proper superset of the class of grammars that can be parsed with a predictive parser.
- 4. An LR parser can quickly detect a syntactic error via its left-to-right scanning of the input.
- The main drawback of the LR method is that it is too much work to construct an LR parser by hand for a typical programming language grammar. Fortunately, many LR parser generators are available that automatically generate the required LR parser.
- 5.7 EXAMPLES
- The examples that follow further illustrate the concepts covered within this chapter.
- EXAMPLE 5.3
- Construct an SLR(1) parsing table for the following grammar:
- S → xAy
- S → xBy
- S → xAz
- A → qS
- A → q
- B → q
- First, augment the given grammar by adding a production S1 → S to the grammar. Therefore, the augmented grammar is:
- S1 → S
- S → xAy
- S → xBy
- S → xAz
- A → qS
- A → q
- B → q
- Next, we obtain the canonical collection of sets of LR(0) items, as follows:
- The transition diagram of this DFA is shown in Figure 5.19.
- Figure 5.19: Transition diagram for the canonical collection of sets of LR(0) items in Example 5.3.
- The FOLLOW sets of the various nonterminals are computed starting from FOLLOW(S1) = {$}. Therefore:
- 1. Using S1 → S, we get FOLLOW(S) = FOLLOW(S1) = {$}
- 2. Using S → xAy, we get y in FOLLOW(A)
- 3. Using S → xBy, we get FOLLOW(B) = {y}
- 4. Using S → xAz, we get z in FOLLOW(A)
- Therefore, FOLLOW(A) = {y, z}. Using A → qS, we get that FOLLOW(S) includes FOLLOW(A) = {y, z}; therefore, FOLLOW(S) = {y, z, $}. Let the productions of the grammar be numbered as follows:
- 1. S → xAy
- 2. S → xBy
- 3. S → xAz
- 4. A → qS
- 5. A → q
- 6. B → q
- The SLR parsing table for the productions above is shown in Table 5.11.
- Table 5.11: SLR(1) Parsing Table
-        Action Table                                      GOTO Table
-        x       y        z       q       $                S   A   B
- I0     S2                                                1
- I1                                      Accept
- I2                              S5                           3   4
- I3             S6       S7
- I4             S8
- I5     S2      R5/R6    R5                               9
- I6             R1       R1              R1
- I7             R3       R3              R3
- I8             R2       R2              R2
- I9             R4       R4
- EXAMPLE 5.4
- Construct an SLR(1) parsing table for the following grammar:
- S → 0S0
- S → 1S1
- S → 10
- First, augment the given grammar by adding the production S1 → S to the grammar. The augmented grammar is:
- S1 → S
- S → 0S0
- S → 1S1
- S → 10
- Next, we obtain the canonical collection of sets of LR(0) items, as follows:
- The transition diagram of the DFA is shown in Figure 5.20.
- Figure 5.20: DFA transition diagram for Example 5.4.
- The FOLLOW sets of the various nonterminals are computed starting from FOLLOW(S1) = {$}. Therefore:
- 1. Using S1 → S, we get FOLLOW(S) = FOLLOW(S1) = {$}
- 2. Using S → 0S0, we get 0 in FOLLOW(S)
- 3. Using S → 1S1, we get 1 in FOLLOW(S)
- So, FOLLOW(S) = {0, 1, $}. Let the productions be numbered as follows:
- 1. S → 0S0
- 2. S → 1S1
- 3. S → 10
- The SLR parsing table for the production set above is shown in Table 5.12.
- Table 5.12: SLR Parsing Table for Example 5.4
-        Action Table                     GOTO Table
-        0        1        $              S
- I0     S2       S3                      1
- I1                       accept
- I2     S2       S3                      4
- I3     S6       S3                      5
- I4     S7
- I5              S8
- I6     S2/R3    S3/R3    R3             4
- I7     R1       R1       R1
- I8     R2       R2       R2
- EXAMPLE 5.5
- Consider the following grammar, and construct the LR(1) parsing table:
- S → aSbS
- S → bSaS
- S → ∈
- The augmented grammar is:
- S1 → S
- S → aSbS
- S → bSaS
- S → ∈
- The canonical collection of sets of LR(1) items is:
- The parsing table for the grammar above is shown in Table 5.13.
- Table 5.13: Parsing Table for Example 5.5
-        Action Table                     GOTO Table
-        a        b        $              S
- I0     S2       S3       R3             1
- I1                       Accept
- I2     S5       S6/R3                   4
- I3     S8/R3    S9                      7
- I4              S10
- I5     S5       S6/R3                   11
- I6     S8/R3    S9                      12
- I7     S13
- I8     S5       S6/R3                   14
- I9     S8/R3    S9                      15
- I10    S2       S3       R3             16
- I11             S17
- I12    S18
- I13    S2       S3       R3             19
- I14             S20
- I15    S21
- I16                      R1
- I17    S5       S6/R3                   22
- I18    S5       S6/R3                   23
- I19                      R2
- I20    S8/R3    S9                      24
- I21    S8/R3    S9                      25
- I22             R1
- I23             R2
- I24    R1
- I25    R2
- The productions for the grammar are numbered as shown below:
- 1. S → aSbS
- 2. S → bSaS
- 3. S → ∈
- EXAMPLE 5.6
- Construct an LALR(1) parsing table for the following grammar:
- S → Aa
- S → bAc
- S → dc
- S → bda
- A → d
- The augmented grammar is:
- S1 → S
- S → Aa
- S → bAc
- S → dc
- S → bda
- A → d
- The canonical collection of sets of LR(1) items is:
- There are no sets of LR(1) items in the canonical collection that have identical LR(0)-part (core) items and differ only in their lookaheads. So, the LALR(1) parsing table for the above grammar is as shown in Table 5.14.
- Table 5.14: LALR(1) Parsing Table for Example 5.6
-        Action Table                                      GOTO Table
-        a        b        c        d        $             S   A
- I0              S3                S4                     1   2
- I1                                         Accept
- I2     S5
- I3                                S7                         6
- I4     R5                S8
- I5                                         R1
- I6                       S9
- I7     S10               R5
- I8                                         R3
- I9                                         R2
- I10                                        R4
- The productions of the grammar are numbered as shown below:
- 1. S → Aa
- 2. S → bAc
- 3. S → dc
- 4. S → bda
- 5. A → d
- EXAMPLE 5.7
- Construct an LALR(1) parsing table for the following grammar:
- S → Aa
- S → aAc
- S → Bc
- S → bBa
- A → d
- B → d
- The augmented grammar is:
- S1 → S
- S → Aa
- S → aAc
- S → Bc
- S → bBa
- A → d
- B → d
- The canonical collection of sets of LR(1) items is:
- Since no sets of LR(1) items in the canonical collection have identical LR(0)-part (core) items and differ only in their lookaheads, the LALR(1) parsing table for the above grammar is as shown in Table 5.15.
- Table 5.15: LALR(1) Parsing Table for Example 5.7
-        Action Table                                      GOTO Table
-        a        b        c        d        $             S   A   B
- I0     S4       S5                S6                     1   2   3
- I1                                         Accept
- I2     S7
- I3                       S8
- I4                                S10                        9
- I5                                S12                            11
- I6     R5                R6
- I7                                         R1
- I8                                         R3
- I9                       S13
- I10                      R5
- I11    S14
- I12    R6
- I13                                        R2
- I14                                        R4
- The productions of the grammar are numbered as shown below:
- 1. S → Aa
- 2. S → aAc
- 3. S → Bc
- 4. S → bBa
- 5. A → d
- 6. B → d
- EXAMPLE 5.8
- Construct the nonempty sets of LR(1) items for the following grammar:
- The collection of nonempty sets of LR(1) items is shown in Figure 5.21.
- Figure 5.21: Collection of nonempty sets of LR(1) items for Example 5.8.
- Chapter 6: Syntax-Directed Definitions and Translations
- 6.1 SPECIFICATION OF TRANSLATIONS
- The specification of a construct's translation in a programming language involves specifying what the construct is, as well as specifying the translation rules for the construct. Whenever a compiler encounters that construct in a program, it translates the construct according to these rules. Here, the term "translation" is used in a much broader sense: translation does not necessarily mean generating either intermediate code or object code. Translation also involves adding information to the symbol table, as well as performing construct-specific computations. For example, if a construct is a declarative statement, then its translation adds information about the construct's type attribute to the symbol table; whereas if the construct is an expression, then its translation generates the code for evaluating the expression.
- When we specify what the construct is, we specify its syntactic structure; hence, syntactic specification is part of the specification of the construct's translation. Therefore, if we suitably extend the notation that we use for syntactic specification so that it allows us to specify both the syntactic structure and the rules of translation that go along with it, then we can use this notation as a framework for specifying the translation of constructs.
- Translation of a construct involves manipulating the values of various quantities. For example, when translating the
- declarative statement int a, b, c, the compiler needs to extract the type int and add it to the symbol records of a, b,
- and c. This requires that the compiler keep track of the type int, as well as the pointers to the symbol records
- containing a, b, and c.
- Since we use a context-free grammar to specify the syntactic structure of a programming language, we extend that context-free grammar by associating sets of attributes with the grammar symbols; these attributes hold the values of the quantities that the compiler is required to track. With each production of the grammar we associate a set of rules that specify how the attribute values of the grammar symbols in the production are manipulated. These extensions allow us to specify the translations; syntax-directed definitions and translation schemes are examples of such extensions of context-free grammars.
- A syntax-directed definition uses a CFG to specify the syntactic structure of the construct. It associates a set of attributes with each grammar symbol, and with each production it associates a set of semantic rules for computing the values of the attributes of the grammar symbols appearing in that production. Therefore, the grammar and the set of semantic rules constitute a syntax-directed definition.
- 6.2 IMPLEMENTATION OF THE TRANSLATIONS SPECIFIED BY
- SYNTAX-DIRECTED DEFINITIONS
- Attributes are associated with the grammar symbols that label the parse tree nodes; hence, they are associated with the nodes of the parse tree of the construct whose translation is being specified. When a semantic rule is evaluated, the parser computes the value of an attribute at a parse tree node. For example, a semantic rule could specify the computation of the value of an attribute val that is associated with the grammar symbol X (a labeled parse tree node). To refer to the attribute val associated with the grammar symbol X, we use the notation X.val. Therefore, to evaluate the semantic rules and carry out translations, we must traverse the parse tree and compute the values of the attributes at the nodes. The order in which we traverse the parse tree nodes depends on the dependencies among the attributes at the parse tree nodes. That is, if an attribute val at a parse tree node X depends on the attribute val at the parse tree node Y, as shown in Figure 6.1, then the val attribute at node X cannot be computed unless the val attribute at Y is computed first.
- Figure 6.1: The attribute value of node X is inherently dependent on the attribute value of node Y.
- Hence, carrying out the translation specified by the syntax-directed definitions involves:
- 1. Generating the parse tree for the input string W;
- 2. Finding the traversal order of the parse tree nodes by generating a dependency graph and doing a topological sort of that graph; and
- 3. Traversing the parse tree in the proper order and evaluating the semantic rules at the nodes.
- If the parse tree attribute's dependencies are such that an attribute of node X depends on the attributes of nodes
- generated before it in the parse tree-construction process, then it is possible to get X's attribute value during the
- parsing itself; the parser is not required to generate an explicit parse tree, and the translations can be carried out along
- with the parsing. The attributes associated with a grammar symbol are classified into two categories: the synthesized
- and the inherited attributes of the grammar symbol.
- Synthesized Attributes
- An attribute is said to be synthesized if its value at a parse tree node is determined by the attribute values at the child nodes. A synthesized attribute has a desirable property: it can be evaluated during a single bottom-up traversal of the parse tree. Synthesized attributes are, in practice, used extensively. A syntax-directed definition that uses only synthesized attributes is shown below:
- E → E1 + T { E.val = E1.val + T.val }
- E → T { E.val = T.val }
- T → T1 * F { T.val = T1.val * F.val }
- T → F { T.val = F.val }
- F → digit { F.val = digit.lexval }
- These definitions specify the translations that must be carried out by an expression evaluator. A parse tree, along with the values of the attributes at the nodes (called an "annotated parse tree"), for the expression 2 + 3 * 5 is shown in Figure 6.2.
- Figure 6.2: An annotated parse tree.
- Syntax-directed definitions that only use synthesized attributes are known as "S-attributed" definitions. If translations
- are specified using S-attributed definitions, then the semantic rules can be conveniently evaluated by the LR parser
- itself during the parsing, thereby making translation more efficient. Therefore, S-attributed definitions constitute a
- subclass of the syntax-directed definitions that can be implemented using an LR parser.
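- As a concrete illustration, here is a minimal Python sketch of an LR parser's value stack evaluating such rules for 2 + 3 * 5. It hand-feeds the reductions in the order a bottom-up parse would perform them and elides the unit-production reductions E → T and T → F, which leave the stack unchanged; none of this is the book's code.
- def reduce_action(production, value_stack):
-     # Each reduction pops the children's val attributes and pushes
-     # the parent's val, exactly as an S-attributed evaluator does.
-     if production == "E -> E + T":
-         t = value_stack.pop(); value_stack.pop(); e = value_stack.pop()
-         value_stack.append(e + t)        # E.val = E1.val + T.val
-     elif production == "T -> T * F":
-         f = value_stack.pop(); value_stack.pop(); t = value_stack.pop()
-         value_stack.append(t * f)        # T.val = T1.val * F.val
- # Reductions for 2 + 3 * 5, in the order a bottom-up parser makes them:
- stack = [2, "+", 3, "*", 5]              # digits' lexval already pushed
- reduce_action("T -> T * F", stack)       # stack becomes [2, '+', 15]
- reduce_action("E -> E + T", stack)       # stack becomes [17]
- print(stack)                             # [17]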
- Inherited Attributes
- Inherited attributes are those whose values at a parse tree node are defined in terms of the attributes of the parent and/or siblings of that node. For example, a syntax-directed definition that uses inherited attributes is given below:
- D → TL { L.type = T.type }
- T → int { T.type = int }
- T → real { T.type = real }
- L → L1, id { L1.type = L.type; addtype(id.entry, L.type) }
- L → id { addtype(id.entry, L.type) }
- A parse tree, along with the attributes' values at the parse tree nodes, for an input string int id1,id2,id3 is shown in
- Figure 6.3.
- Figure 6.3: Parse tree with node attributes for the string int id1,id2,id3.
- Inherited attributes are convenient for expressing the dependency of a programming language construct on the context in which it appears. When inherited attributes are used, the interdependencies among the attributes at the nodes of the parse tree must be taken into account when evaluating the semantic rules, and these interdependencies are depicted by a directed graph called a "dependency graph". For example, if a semantic rule of the form A.val = f(X.val, Y.val, Z.val) (that is, A.val is a function of X.val, Y.val, and Z.val) is associated with a production A → XYZ, then we conclude that A.val depends on X.val, Y.val, and Z.val. Every semantic rule can be brought into this form, if it is not in it already, by introducing a dummy synthesized attribute.
- Dummy Synthesized Attributes
- If a semantic rule is in the form of a procedure call fun(a1, a2, a3, …, ak), then we can transform it into the form b = fun(a1, a2, a3, …, ak), where b is a dummy synthesized attribute. The dependency graph has a node for each attribute and an edge from the node for b to the node for a if attribute a depends on attribute b. For example, if a production A → XYZ with the rule A.val = f(X.val, Y.val, Z.val) is used in the parse tree, then there will be four nodes in the dependency graph (A.val, X.val, Y.val, and Z.val), with edges from X.val, Y.val, and Z.val to A.val.
- The dependency graph for such a parse tree is shown in Figure 6.4. The ellipses denote the nodes of the dependency
- graph, and the circles denote the nodes of the parse tree.
- Figure 6.4: Dependency graph with four nodes.
- A topological sort of the dependency graph gives an order in which the semantic rules can be evaluated. But for reasons of efficiency, it is better to evaluate the semantic rules (i.e., carry out the translation) during the parsing itself. If the translations are to be carried out during parsing, then the evaluation order of the semantic rules is linked to the order in which the parse tree nodes are created, even though an actual parse tree need not be generated by the parser. Many top-down as well as bottom-up parsers generate nodes in a depth-first, left-to-right order; so the semantic rules must be evaluated in this same order if the translations are to be carried out during parsing. A class of syntax-directed definitions, called "L-attributed" definitions, has attributes that can always be evaluated in depth-first, left-to-right order. Hence, if the translations are specified using L-attributed definitions, then it is possible to carry out the translations during parsing.
- 6.3 L-ATTRIBUTED DEFINITIONS
- A syntax-directed definition is L-attributed if each inherited attribute of Xj, for j between 1 and n, on the right side of a production A → X1X2…Xn depends only on:
- 1. The attributes (both inherited and synthesized) of the symbols X1, X2, …, Xj−1 (i.e., the symbols to the left of Xj in the production); and
- 2. The inherited attributes of A.
- The syntax-directed definition given earlier for declarations is an example of an L-attributed definition, because the inherited attribute L.type depends on T.type, and T is to the left of L in the production D → TL. Similarly, the inherited attribute L1.type depends on the inherited attribute L.type, and L is the parent of L1 in the production L → L1, id.
- When translations are carried out during parsing, the order in which the semantic rules are evaluated by the parser must be explicitly specified. Hence, instead of syntax-directed definitions, we use syntax-directed translation schemes to specify the translations. Syntax-directed definitions are more abstract specifications of translations: they hide many implementation details, freeing the user from having to specify explicitly the order in which translation takes place. Syntax-directed translation schemes, on the other hand, indicate the order in which the semantic rules are evaluated, allowing some implementation details to be specified.
- 6.4 SYNTAX-DIRECTED TRANSLATION SCHEMES
- A syntax-directed translation scheme is a context-free grammar in which attributes are associated with the grammar
- symbols, and semantic actions, enclosed within braces ({ }), are inserted in the right sides of the productions. These
- semantic actions are basically the subroutines that are called at the appropriate times by the parser, enabling the
- translation. The position of the semantic action on the right side of the production indicates the time when it will be
- called for execution by the parser. When we design a translation scheme, we must ensure that an attribute value is
- available when the action refers to it. This requires that:
- An inherited attribute of a symbol on the right side of a production must be computed in an action
- immediately preceding (to the left of) that symbol, because it may be referred to by an action
- computing the inherited attribute of the symbol to the right of (following) it.
- 1.
- An action that computes the synthesized attribute of a nonterminal on the left side of the
- production should be placed at the end of the right side of the production, because it might refer to
- the attributes of any of the right-side grammar symbols. Therefore, unless they are computed, the
- synthesized attribute of a nonterminal on the left cannot be computed.
- 2.
- These restrictions are motivated by the L-attributed definitions. Below is an example of a syntax-directed translation scheme that satisfies these requirements and can be implemented during predictive parsing:
- D → T { L.type = T.type } L
- T → int { T.type = int }
- T → real { T.type = real }
- L → { L1.type = L.type } L1, id { addtype(id.entry, L.type) }
- L → id { addtype(id.entry, L.type) }
- The advantage of a top-down parser is that semantic actions can be called in the middle of productions. Thus, in the above translation scheme, while using the production D → TL to expand D, we call a routine after recognizing T (i.e., after T has been fully expanded), thereby making it easy to handle the inherited attributes. A bottom-up parser, on the other hand, reduces the right side of the production D → TL by popping T and L from the top of the parser stack and replacing them by D; the value of the synthesized attribute T.type is then already on the parser stack at a known position and can be inherited by L. Since L.type is defined by a copy rule, L.type = T.type, the value of T.type can be used in place of L.type. Thus, if the parser stack is implemented as two parallel arrays, state and value, and state[I] holds a grammar symbol X, then value[I] holds a synthesized attribute of X. The translation scheme implemented during bottom-up parsing is as follows, where [top] is the value of the stack top before the reduction and [newtop] is the value of the stack top after the reduction:
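- The parallel-array arrangement can be sketched as follows; the symbols stored in state and the reduce routine for D → TL are illustrative assumptions, not the book's implementation:
- # state[i] holds a grammar symbol (or LR state); value[i] holds that
- # symbol's synthesized attribute. For D -> TL, T.type sits at a known
- # offset below the top, so the copy rule L.type = T.type needs no
- # extra storage.
- state, value = [], []
- def shift(symbol, attribute=None):
-     state.append(symbol)
-     value.append(attribute)
- def reduce_D_TL():
-     top = len(state) - 1             # [top] before the reduction
-     t_type = value[top - 1]          # T.type, read at a known position
-     state[top - 1:] = ["D"]          # pop T and L, push D  ([newtop])
-     value[top - 1:] = [t_type]
- shift("T", "int")                    # after reducing T -> int
- shift("L", None)                     # L.type is read as value[top - 1]
- reduce_D_TL()
- print(state, value)                  # ['D'] ['int']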
- 6.5 INTERMEDIATE CODE GENERATION
- While translating a source program into a functionally equivalent object code representation, a parser may first
- generate an intermediate representation. This makes retargeting of the code possible and allows some optimizations
- to be carried out that would otherwise not be possible. The following are commonly used intermediate representations:
- 1. Postfix notation
- 2. Syntax tree
- 3. Three-address code
- Postfix Notation
- In postfix notation, the operator follows the operands. For example, for the expression (a − b) * (c + d) + (a − b), the postfix representation is:
- a b − c d + * a b − +
- Syntax Tree
- The syntax tree is nothing more than a condensed form of the parse tree. The operator and keyword nodes of the parse tree (Figure 6.5) are moved to their parent, and a chain of single productions is replaced by a single link (Figure 6.6).
- Figure 6.5: Parse tree for the string id+id*id.
- Figure 6.6: Syntax tree for id+id*id.
- Three-Address Code
- Three-address code is a sequence of statements of the form x = y op z. Since a statement involves no more than three references, it is called a "three-address statement," and a sequence of such statements is referred to as three-address code. For example, the three-address code for the expression a + b * c + d is:
- t1 = b * c
- t2 = a + t1
- t3 = t2 + d
- Sometimes a statement might contain fewer than three references, but it is still called a three-address statement.
- following are the three-address statements used to represent various programming language constructs:
- Used for representing arithmetic expressions: x = y op z, x = op y, and x = y
- Used for representing Boolean expressions: if x relop y goto L, and goto L
- Used for representing array references and dereferencing operations: x = y[i], x[i] = y, x = *y, and *x = y
- Used for representing a procedure call: param x, call p, n, and return y
- 6.6 REPRESENTING THREE-ADDRESS STATEMENTS
- Records with fields for the operators and operands can be used to represent three-address statements. It is possible to use a record structure with four fields: the first holds the operator, the next two hold operand1 and operand2, respectively, and the last one holds the result. This representation of a three-address statement is called a "quadruple representation".
- Quadruple Representation
- Using quadruple representation, the three-address statement x = y op z is represented by placing op in the operator field, y in the operand1 field, z in the operand2 field, and x in the result field. The statement x = op y, where op is a unary operator, is represented by placing op in the operator field, y in the operand1 field, and x in the result field; the operand2 field is not used. A statement like param t1 is represented by placing param in the operator field and t1 in the operand1 field; neither the operand2 nor the result field is used. Unconditional and conditional jump statements are represented by placing the target labels in the result field. For example, a quadruple representation of the three-address code for the statement x = (a + b) * − c/d is shown in Table 6.1. The numbers in parentheses are the positions of the statements in the quadruple array.
- Table 6.1: Quadruple Representation of x = (a + b) * − c/d
-        Operator   Operand1   Operand2   Result
- (1)    +          a          b          t1
- (2)    −          c                     t2
- (3)    *          t1         t2         t3
- (4)    /          t3         d          t4
- (5)    =          t4                    x
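- A quadruple record and a gencode-style emitter can be sketched as below. The Quad dataclass, the emit helper, and the uminus operator name are assumptions of this sketch; it reproduces the five statements of Table 6.1:
- from dataclasses import dataclass
- @dataclass
- class Quad:
-     op: str
-     arg1: str = ""
-     arg2: str = ""
-     result: str = ""
- code = []                 # the quadruple array; (i) is just an index
- _tmp = 0
- def gentemp():
-     global _tmp
-     _tmp += 1
-     return f"t{_tmp}"
- def emit(op, arg1="", arg2="", result=""):
-     code.append(Quad(op, arg1, arg2, result))
-     return result
- # x = (a + b) * -c / d, emitted as in Table 6.1 ("uminus" = unary minus)
- t1 = emit("+", "a", "b", gentemp())
- t2 = emit("uminus", "c", "", gentemp())
- t3 = emit("*", t1, t2, gentemp())
- t4 = emit("/", t3, "d", gentemp())
- emit("=", t4, "", "x")
- for i, q in enumerate(code, 1):
-     print(f"({i})", q.op, q.arg1, q.arg2, q.result)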
- Triple Representation
- In the quadruple representation, the contents of the operand1, operand2, and result fields are normally pointers to the symbol table records for the names represented by these fields. Hence, temporary names must be entered into the symbol table as they are created. This can be avoided by using the position of a statement to refer to a temporary value. If this is done, then a record structure with three fields is enough to represent a three-address statement: the first holds the operator, and the next two hold the values of operand1 and operand2, respectively. Such a representation is called a "triple representation". The contents of the operand1 and operand2 fields are either pointers to the symbol table records, or they are pointers to records (for temporary names) within the triple representation itself.
- For example, a triple representation of the three-address code for the statement x = (a+b)* − c/d is shown in Table 6.2.
- Table 6.2: Triple Representation of x = (a + b) * − c/d
-        Operator   Operand1   Operand2
- (1)    +          a          b
- (2)    −          c
- (3)    *          (1)        (2)
- (4)    /          (3)        d
- (5)    =          x          (4)
- Indirect Triple Representation
- Another representation uses an additional array that lists pointers to the triples in the desired order; this is called an indirect triple representation. For example, an indirect triple representation of the three-address code for the statement x = (a + b) * − c/d is shown in Table 6.3.
- Table 6.3: Indirect Triple Representation of x = (a + b) * − c/d
-               Operator   Operand1   Operand2
- (1)    (1)    +          a          b
- (2)    (2)    −          c
- (3)    (3)    *          (1)        (2)
- (4)    (4)    /          (3)        d
- (5)    (5)    =          x          (4)
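- The contrast between triples and indirect triples can be sketched as follows; integer operands stand for statement positions, and the entries mirror Tables 6.2 and 6.3:
- # A triple has no result field; a temporary is referred to by the
- # position of the statement that computes it (an int operand below,
- # 0-based where the tables are 1-based).
- triples = [
-     ("+", "a", "b"),        # (1)
-     ("uminus", "c", None),  # (2)
-     ("*", 0, 1),            # (3)  operands are positions (1) and (2)
-     ("/", 2, "d"),          # (4)
-     ("=", "x", 3),          # (5)
- ]
- # Indirect triples: a separate list gives the execution order, so
- # statements can be reordered without rewriting position references.
- order = [0, 1, 2, 3, 4]
- order[0], order[1] = order[1], order[0]   # moving a statement reorders only
- for pos in order:
-     print(pos + 1, triples[pos])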
- Comparison
- By using quadruples, we can move a statement that computes A without requiring any changes in the statements that use A, because the result field is explicit. However, in a triple representation, if we want to move a statement that defines a temporary value, then we must change all of the pointers in the operand1 and operand2 fields of the records in which this temporary value is used. Thus, quadruple representation is easier to work with in an optimizing compiler, which entails a lot of code movement. Indirect triple representation presents no such problems: a separate list of pointers to the triple structure is maintained, and when statements are moved, this list is reordered with no change to the triple structure itself. Hence, the utility of indirect triples is almost the same as that of quadruples.
- 6.7 SYNTAX-DIRECTED TRANSLATION SCHEMES TO SPECIFY THE
- TRANSLATION OF VARIOUS PROGRAMMING LANGUAGE CONSTRUCTS
- Specifying the translation of a construct involves specifying the construct's syntactic structure, using a CFG, and associating suitable semantic actions with the productions of the CFG. For example, suppose we want to specify the translation of arithmetic expressions into postfix notation so that it can be carried out along with the parsing, and the parsing method is LR. First we write a grammar that specifies the syntactic structure of arithmetic expressions; we then associate suitable semantic actions with the productions of the grammar. These associations are covered below.
- 6.7.1 Arithmetic Expressions
- The grammar that specifies the syntactic structure of the expressions in a typical programming language will have the
- following productions:
- Translating arithmetic expressions involves generating code to evaluate the given expression. Hence, for an expression a + b * c, the three-address code that is required to be generated is:
- t1 = b * c
- t2 = a + t1
- where t1 and t2 are pointers to the symbol table records that contain compiler-generated temporaries, and a, b, and c are pointers to the symbol table records that contain the programmer-defined names a, b, and c, respectively.
- A syntax-directed translation scheme to specify the translation of an expression into postfix notation is as follows:
- where code is a string-valued attribute used to hold the postfix expression, and place is a pointer-valued attribute used to hold the pointer to the symbol table record that contains the name of the identifier. The function getname returns the name of the identifier from the symbol table record that is pointed to by ptr, and concate(s1, s2, s3) returns the concatenation of the strings s1, s2, and s3. For the string a + b * c, the values of the attributes at the parse tree nodes are shown in Figure 6.7.
- Figure 6.7: Values of attributes at the parse tree node for the string a + b * c.
- A syntax-directed translation scheme to specify the translation of an expression into a syntax tree is as follows:
- where ptr is a pointer-valued attribute used to hold the pointer to a node in the syntax tree, and place is a pointer-valued attribute used to hold the pointer to the symbol table record that contains the name of the identifier. The function mkleaf generates leaf nodes, and mknode generates interior nodes.
- For the string a+b*c, the values of the attributes at the parse tree nodes are shown in Figure 6.8.
- Figure 6.8: Values of the attributes at the parse tree nodes for a + b * c.
- A syntax-directed translation scheme to specify the translation of an expression into three-address code is as follows:
- where ptr is a pointer-valued attribute used to hold the pointer to the symbol table record that contains the name of the identifier. For the string a + b * c, the values of the attributes at the parse tree nodes are shown in Figure 6.9.
- Figure 6.9: Values of the attributes at the parse tree nodes for a + b * c.
- 6.7.2 Boolean Expressions
- One way of translating a Boolean expression is to encode the expression's true and false values as the integers one and zero, respectively. Code that evaluates the value of the expression into some temporary is generated as shown below:
- E → E1 relop E2
- {
-     t1 = gentemp();
-     gencode(if E1.place relop.val E2.place goto (nextquad + 3));
-     gencode(t1 = 0);
-     gencode(goto (nextquad + 2));
-     gencode(t1 = 1);
-     E.place = t1;
- }
- where nextquad keeps track of the index in the code array at which the next statement will be inserted; the gencode procedure updates the value of nextquad. This translation scheme translates the expression a < b to the following three-address code:
- 100: if a < b goto 103
- 101: t1 = 0
- 102: goto 104
- 103: t1 = 1
- Similarly, a Boolean expression formed by using logical operators involves generating code to evaluate those operators into some temporary, as shown below:
- E → E1 lop E2
- {
-     t1 = gentemp();
-     gencode(t1 = E1.place lop.val E2.place);
-     E.place = t1;
- }
- E → not E1
- {
-     t1 = gentemp();
-     gencode(t1 = not E1.place);
-     E.place = t1;
- }
- lop → and { lop.val = and }
- lop → or { lop.val = or }
- The translation scheme above translates the expression a < b and c > d to the following three-address code:
- 100: if a < b goto 103
- 101: t1 = 0
- 102: goto 104
- 103: t1 = 1
- 104: if c > d goto 107
- 105: t2 = 0
- 106: goto 108
- 107: t2 = 1
- 108: t3 = t1 and t2
- Another way to translate a Boolean expression is to represent its value by a position in the three-address code sequence. For example, if control reaches the statement labeled L1, then the value of the expression is true (1); whereas if control reaches the statement labeled L2, then the value of the expression is false (0). In this case, using a temporary to hold a one or a zero, depending on the true or false value of the expression, becomes redundant. This also makes it possible to decide the value of the expression without evaluating it completely; such code is called "short-circuit" or "jumping" code. To discover the true/false value of the expression a < b or c > d, it is not necessary to evaluate the expression completely: if a < b is true, then the entire expression is true. Similarly, for the expression a < b and c > d, it is not necessary to evaluate the expression completely, because if a < b is false, then the entire expression is false.
- Tip A Boolean expression can therefore be translated into two three-address statements: a conditional jump and an unconditional jump. But the targets of these jumps are not known at the time of translating the Boolean expression; hence, these jumps are generated without their targets, which are filled in later on.
- Therefore, we must remember the indices of these jumps in the code array, using suitable attributes of E. For this, we use two pointer-valued attributes: E.true and E.false. The attribute E.true holds the pointer to a list containing the index of the conditional jump in the code array, whereas the attribute E.false holds the pointer to a list containing the index of the unconditional jump. The translation scheme for a Boolean expression that uses relational operators is as follows:
- E → E1 relop E2
- {
-     E.true = mklist(nextquad);
-     E.false = mklist(nextquad + 1);
-     gencode(if E1.place relop.val E2.place goto _);
-     gencode(goto _);
- }
- where mklist(ind) is a procedure that creates a list containing ind and returns a pointer to the created list.
- The above translation scheme translates the expression a < b to the following three-address code:
- 100: if a < b goto _
- 101: goto _
- 6.7.3 Short-Circuit Code for Logical Expressions
- There are several methods for handling the various Boolean operators; these are covered by operator below.
- AND
- Logical expressions that use the ‘and’ operator are those defined by the production E → E1 and E2. Generating the short-circuit code for such an expression involves setting the true value of the first expression, E1, to the start of the second expression, E2, in the code array. We make the true value of E the same as the true value of E2, and the false value of E the merge of the false values of E1 and E2. This requires remembering the code array index at which E2 starts, which means we must make a provision for remembering the value of nextquad just before E2 is processed. This can be accomplished by introducing a nullable nonterminal M before E2 in the above production, providing for a reduction by M → ∈ just before the processing of E2. A semantic action associated with this production is executed at that point, giving us a way to remember the value of nextquad just before the E2 code is generated.
- E → E1 and M E2
- {
-     backpatch(E1.true, M.quad);
-     E.true = E2.true;
-     E.false = merge(E1.false, E2.false);
- }
- M → ∈ { M.quad = nextquad; }
- where backpatch(ptr, L) is a procedure that takes a pointer ptr to a list of indices into the code array and fills L in as the target of the statements at those indices.
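- The backpatching machinery (mklist, merge, backpatch, and gencode over a code array) is small enough to sketch whole. The string-based code array and the goto _ placeholder are choices of this sketch, which hand-simulates the E → E1 and M E2 rule for a < b and c > d:
- code = []                                  # nextquad == len(code)
- def gencode(stmt):
-     code.append(stmt)
- def mklist(ind):
-     return [ind]
- def merge(*lists):
-     return [i for lst in lists for i in lst]
- def backpatch(lst, target):
-     for i in lst:
-         code[i] = code[i].replace("goto _", f"goto {target}")
- # E -> E1 relop E2 for "a < b": one conditional, one unconditional jump
- e1_true = mklist(len(code)); gencode("if a < b goto _")
- e1_false = mklist(len(code)); gencode("goto _")
- # E -> E1 and M E2: M remembers nextquad, then E1.true is backpatched
- # to the start of E2's code.
- m_quad = len(code)                         # M -> epsilon, M.quad = nextquad
- e2_true = mklist(len(code)); gencode("if c > d goto _")
- e2_false = mklist(len(code)); gencode("goto _")
- backpatch(e1_true, m_quad)                 # E1 true -> start of E2
- e_true = e2_true
- e_false = merge(e1_false, e2_false)
- for i, s in enumerate(code):
-     print(i, s)                            # targets of e_true/e_false stay _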
- OR
- For an expression using the ‘or’ operator, that is, an expression defined by the production E → E1 or E2, generating the short-circuit code involves setting the false value of the first expression, E1, to the start of E2 in the code array, and making the false value of E the same as the false value of E2. The true value of E is the merge of the true values of E1 and E2. This requires remembering the code array index at which E2 starts, which again requires making a provision for remembering the value of nextquad just before the expression E2 is processed. This can be achieved by introducing a nullable nonterminal M before E2 in the above production, providing for a reduction by M → ∈ just before the processing of E2. The semantic action associated with this production, executed at that point, records the value of nextquad just before the E2 code is generated.
- E → E1 or M E2
- {
-     backpatch(E1.false, M.quad);
-     E.false = E2.false;
-     E.true = merge(E1.true, E2.true);
- }
- M → ∈ { M.quad = nextquad; }
- NOT
- For a logical expression using the ‘not’ operator, that is, one defined by the production E → not E1, generating the short-circuit code involves making the false value of the expression E the same as the true value of E1, and the true value of E the same as the false value of E1.
- E → not E1
- {
-     E.true = E1.false;
-     E.false = E1.true;
- }
- The above translation scheme translates the expression a < b and c > d to the following three-address code:
- 100: if a < b goto 102
- 101: goto _
- 102: if c > d goto _
- 103: goto _
- For example, consider the following Boolean expression:
- When the above translation scheme is used to translate this construct, the three-address code generated for it is as shown below, and the translation is illustrated in Figure 6.10.
- Figure 6.10: Translation scheme for a Boolean expression containing and, not, and or.
- IF-THEN-ELSE
- Since an if-then-else statement is composed of three components (a Boolean expression E, a statement S1 that is to be executed when E is true, and a statement S2 that is to be executed when E is false), the translation of if-then-else involves making provisions for transferring control to the start of S1 if E is true, for transferring control to the start of S2 if E is false, and for transferring control to the next statement after the execution of S1 or S2 is over. This requires remembering the index in the code array where S1 starts as well as where S2 starts.
- This is achieved by introducing a nullable nonterminal M1 before S1 and a nullable nonterminal M2 before S2 in the production. The reduction by M1 → ∈ just before the processing of S1 executes a semantic action that records the value of nextquad just before the S1 code is generated; similarly, the reduction by M2 → ∈ just before the processing of S2 executes a semantic action that records the value of nextquad just before the S2 code is generated.
- In addition, an unconditional jump is required at the end of S1 in order to transfer control to the statement that follows the if-then-else statement. To generate this unconditional jump, we add a nullable nonterminal N after S1 in the production and associate a semantic action with the production N → ∈, which takes care of generating this unconditional jump, as shown in Figure 6.11.
- S → if E then M1 S1 N else M2 S2
- {
-     backpatch(E.true, M1.quad);
-     backpatch(E.false, M2.quad);
-     S.next = merge(S1.next, S2.next, N.next);
- }
- M1 → ∈ { M1.quad = nextquad; }
- M2 → ∈ { M2.quad = nextquad; }
- N → ∈
- {
-     N.next = mklist(nextquad);
-     gencode(goto _);
- }
- Figure 6.11: The addition of the nullable nonterminal N facilitates an unconditional jump.
- Hence, for the statement if a<b then x = y + z else p = q + r, the three-address code that is required to be generated is:
- IF-THEN
- Since an if-then statement is comprised of two components, a Boolean expression E and a statement S1 that will be executed when E is true, the translation of if-then involves making a provision for transferring control to the start of the S1 code if E is true, and for transferring control to the next statement if E is false as well as after the execution of S1 is over. This requires remembering the index at which S1 starts in the code array, and can be achieved by introducing a nullable nonterminal M before S1 in the production. This provides for a reduction by M → ∈ just before the processing of S1; the semantic action associated with this production, executed at that point, records the value of nextquad just before the S1 code is generated, as shown in Figure 6.12 below:
- S → if E then M S1 {
- backpatch (E.true, M.quad);
- S.next = merge(E.false, S1.next)
- }
- M → ∈ { M.quad = nextquad; }
- Figure 6.12: A nullable nonterminal M provisions the translation of if-then.
- Hence, for the statement if a<b then x = y + z, the three-address code that is required to be generated is:
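- 1. if a < b goto(3)
- 2. goto NEXT
- 3. t1 = y + z
- 4. x = t1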
- WHILE
- Since a while statement has two components, a Boolean expression E and a statement S1, which is the statement to
- be executed repeatedly as long as E is true, the translation of while involves transferring control to the start of the
- S1 code if E is true. The expression must be tested again after S1 execution is over, and control must be
- transferred to the next statement if E is false. This requires remembering the index in the code array where the S1 code
- starts as well as where the E code starts. The former is achieved by introducing a nullable nonterminal M2 before S1 in
- the production, providing for the reduction by M2 → ∈ just before the processing of S1. The semantic action associated
- with this production is executed at this point and records the value of nextquad just before the S1 code is generated.
- Similarly, introducing a nullable nonterminal M1 before E provides for the reduction by M1 → ∈ just before the
- processing of E; its semantic action records the value of nextquad just before the E code is generated, as shown in Figure 6.13.
- S → while M1 E do M2 S1 {
- backpatch (E.true, M2.quad)
- backpatch (S1.next, M1.quad)
- S.next = E.false
- gencode (goto (M1.quad))
- }
- M1 → ∈ { M1.quad = nextquad; }
- M2 → ∈ { M2.quad = nextquad; }
- Figure 6.13: The translation of the Boolean while statement is facilitated by the nullable nonterminals M1 and M2.
- Hence, for the statement while a<b do x = y + z, the three-address code that is required to be generated is:
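- 1. if a < b goto(3)
- 2. goto NEXT
- 3. t1 = y + z
- 4. x = t1
- 5. goto(1)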
- DO-WHILE
- Since a do-while statement is comprised of two components, a Boolean expression E and an S1 statement that is
- executed repeatedly as long as E is true (as well as the test for whether E is true or false at the end of S1 execution),
- translation involves provisioning the transfer of control to test the expression after the execution of S1 is over. Control
- must also be transferred to the start of S1 code if E is true, and conversely to the next statement if E is false.
- This requires recalling the S1 start index in the code array as well as the E start index. We introduce a nullable
- nonterminal M1 before S1 in the production, providing for the reduction by M1 → ∈ just before the processing of S1.
- The semantic action associated with this production is executed at this point and records the value of nextquad just
- before the S1 code is generated. Similarly, introducing a nullable nonterminal M2 before E provides for the reduction
- by M2 → ∈ just before the processing of E; its semantic action records the value of nextquad just before the E code is
- generated, as shown in Figure 6.14.
- S → do M1 S1 while M2 E {
- backpatch (E.true, M1.quad)
- backpatch (S1.next, M2.quad)
- S.next = E.false
- }
- M1 → ∈ { M1.quad = nextquad; }
- M2 → ∈ { M2.quad = nextquad; }
- Figure 6.14: Translation of the Boolean do-while.
- Hence, for a statement do x = y + z while a<b, the three-address code that is required to be generated is:
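- 1. t1 = y + z
- 2. x = t1
- 3. if a < b goto(1)
- 4. goto NEXT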
- REPEAT-UNTIL
- Since a repeat-until statement has two components, a Boolean expression E and an S1 statement that is executed
- repeatedly until E becomes true (as well as the test of whether E is true or false at the end of S1), the translation of
- repeat-until involves transferring control to a test of the expression after the execution of S1 is over. We
- must also transfer control to the start of the S1 code if E is false, and to the next statement if E is true.
- This requires recalling the index in the code array where the S1 code starts as well as the index in the code array where
- the E code starts. We achieve this by introducing a nullable nonterminal M1 before S1 in the production. This will provide
- for the reduction by M1 → ∈ just before the processing of S1. The semantic action associated with this production is
- executed at this point and records the value of nextquad just before the S1 code is generated. Similarly, we introduce
- a nullable nonterminal M2 before E, which provides for the reduction by M2 → ∈ just before the processing of E; its
- semantic action records the value of nextquad just before the E code is generated, as shown in Figure 6.15.
- S → repeat M1 S1
- until M2 E {
- backpatch (E.false, M1.quad)
- backpatch (S1.next, M2.quad)
- S.next = E.true
- }
- M1 → ∈ { M1.quad = nextquad; }
- M2 → ∈ { M2.quad = nextquad; }
- Figure 6.15: Translation of Boolean repeat-until.
- Hence, for the Boolean statement repeat x = y + z until a<b, the three-address code that is required to be generated
- is:
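- 1. t1 = y + z
- 2. x = t1
- 3. if a < b goto NEXT
- 4. goto(1)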
- FOR
- A for statement is composed of four components: an expression E1, which is used to initialize the iteration variable; an
- expression E2, which is a Boolean expression used to test whether or not the value of the iteration variable exceeds
- the final value; an expression E3, which is used to specify the step by which the value of the iteration variable is to be
- incremented or decremented; and an S1 statement, which is the statement to be executed as long as the value of the
- iteration variable is less than or equal to the final value. Hence, the translation of a for statement involves transferring
- control to the start of the S1 code if E2 is true, transferring control to the start of the E3 code after the execution
- of S1 is over, transferring control to the start of the E2 code after the E3 code is executed, and transferring control to the next
- statement if E2 is false, as shown in Figure 6.16.
- S → for (E1; M1 E2; M2 E3) M3 S1
- {
- backpatch (E2.true, M3.quad)
- backpatch (M3.next, M1.quad)
- backpatch (S1.next, M2.quad)
- gencode (goto(M2.quad))
- S.next = E2.false
- }
- M1 → ∈ { M1.quad = nextquad; }
- M2 → ∈ { M2.quad = nextquad; }
- M3 → ∈ {
- M3.next = mklist (nextquad)
- gencode (goto ...)
- M3.quad = nextquad;
- }
- Figure 6.16: Handling the translation of the Boolean for.
- Hence, for a statement for(i = 1; i< = 20; i+ +) x = y + z, the three-address code that is required to be generated is:
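- 1. i = 1
- 2. if i <= 20 goto(7)
- 3. goto NEXT
- 4. t1 = i + 1
- 5. i = t1
- 6. goto(2)
- 7. t2 = y + z
- 8. x = t2
- 9. goto(4)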
- 6.8 IMPLEMENTATION OF INCREMENT AND DECREMENT OPERATORS
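- The postfix operators return the value that the identifier held before the update, so the translation of id++ and id-- copies the old value into a temporary before storing the incremented or decremented value; the prefix operators return the updated value, so a single temporary suffices: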
- L → id++ {
- t1 = gentemp();
- t2 = gentemp();
- gencode(t1 = id.place);
- gencode(t2 = id.place +1);
- gencode (id.place = t2);
- L.place = t1;
- }
- L → ++id {
- t1 = gentemp();
- gencode(t1 = id.place +1);
- gencode(id.place = t1);
- L.place = t1;
- }
- L → id-- {
- t1 = gentemp();
- t2 = gentemp();
- gencode(t1 = id.place);
- gencode(t2 = id.place -1);
- gencode(id.place = t2);
- L.place = t1;
- }
- L → --id {
- t1 = gentemp();
- gencode (t1 = id.place -1);
- gencode (id.place = t1);
- L.place = t1;
- }
- 6.9 THE ARRAY REFERENCE
- An array reference is an expression with an l-value. Therefore, to capture its syntactic structure, we add the following
- productions to the grammar:
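- L → id[elist]
- elist → elist1, E
- elist → E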
- An array reference in a source program is replaced by the l-value of the expression that specifies the reference to
- an element of the array. Computing the l-value involves finding the offset of the referred element of the array and then
- adding it to the base address. But since deriving an offset depends on the subscripts used in the array reference, and the
- values of these subscripts are not known during compilation unless the subscripts are constant expressions, a
- compiler has to generate the code for evaluating the l-value of the expression that specifies the reference to an
- element of an array. This l-value computation is achieved as follows:
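- offset = [(i1 - lb1)*(ub2 - lb2 + 1)*(ub3 - lb3 + 1)* … *(ubk - lbk + 1) + (i2 - lb2)*(ub3 - lb3 + 1)* … *(ubk - lbk + 1) + … + (ik - lbk)]*bpw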
- where i1, i2, … , ik are the subscripts, lbi and ubi are the lower and upper bounds of the ith dimension, and bpw is the number of bytes per word.
- If the lower bound of each dimension is one, and the upper bound of the ith dimension is di, then the offset computing
- formula becomes:
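- offset = [(i1 - 1)*d2*d3* … *dk + (i2 - 1)*d3*d4* … *dk + … + (ik - 1)]*bpw
- = [i1*d2*d3* … *dk + i2*d3*d4* … *dk + … + ik]*bpw - [d2*d3* … *dk + d3*d4* … *dk + … + dk + 1]*bpw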
- The [i1*d2*d3* … *dk + i2*d3*d4* … *dk + … + ik]*bpw is the variable part of the offset computation, whereas [d2*d3* … *dk
- + d3*d4* … *dk + … + dk + 1]*bpw is the constant part of the offset computation and is not required to be computed for every
- reference to an array a. It can be computed once, while processing the declaration of the array a. We call this value
- "constant C". Therefore:
- where V is the variable part and C is the constant part defined above.
- Since addr(a) is fixed, we can combine C with addr(a) and store this value in an attribute, L.place, and we can store V
- in another attribute, L.off, so that:
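- l-value = addr(a) + offset = (addr(a) - C) + V = L.place + L.off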
- Hence, the translation of an array reference involves generating code for computing V, and V is made a value of
- attribute L.off. We compute addr(a) − C and make it the value of the attribute L.place. Computing V involves evaluating
- the expression:
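- V = [i1*d2*d3* … *dk + i2*d3*d4* … *dk + … + ik]*bpw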
- This expression can be rewritten as:
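- V = (( … ((i1*d2 + i2)*d3 + i3)*d4 + … )*dk + ik)*bpw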
- Therefore, the three-address code that is required to be generated for computing V is:
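- T1 = i1
- T1 = T1 * d2
- T1 = T1 + i2
- T1 = T1 * d3
- T1 = T1 + i3
- .
- .
- T1 = T1 * dk
- T1 = T1 + ik
- V = T1 * bpw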
- Therefore, the translation scheme is:
- elist → E { initialize queue by adding E.place }
- elist → elist1, E { append E.place to queue }
- L → id[elist] { T1 = gentemp ()
- elist.Ndim = 1
- gencode (T1 = retrieve ())
- while (queue not empty) do
- {
- gencode (T1 = T1 * limit (id.place, elist.Ndim + 1))
- gencode (T1 = T1 + retrieve ())
- elist.Ndim = elist.Ndim + 1
- }
- V = gentemp ();
- U = gentemp ();
- gencode (V = T1 * bpw)
- gencode (U = id.place - C)
- L.off = V
- L.place = U
- }
- where retrieve () is a function that retrieves a value from the queue, and limit (a, i) returns di, the upper bound of the
- ith dimension of the array a.
- In this translation scheme, the attribute id.place cannot be accessed in the semantic action associated with the
- production elist → E or in the semantic action associated with the production elist → elist1, E. So it is not possible to
- make use of the value of the subscript that is available in E.place to get the required three-address statements
- generated at that point. Hence, a queue is necessary in order to store the subscripts. These subscripts are used later
- on for generating the code that computes the offset.
- Another way to approach this is to modify the grammar to make it suitable for translation. This requires rewriting the
- productions in such a manner that both id and E exist in the same production, so that the pointer to the symbol table
- record of the array name is available in id.place. This can be used to retrieve the upper-bound information for each
- dimension of the array. And the value of the subscript is available in E.place; so by using both of these, the required
- three-address statements can be generated, and the value of the subscript does not need to be stored. Therefore, the
- modified grammar, along with the semantic actions, is:
- L → elist ] { U = gentemp (); V = gentemp ();
- gencode (V = elist.place * bpw)
- gencode (U = elist.array - C)
- L.place = U
- L.off = V
- }
- elist → id [ E { elist.place = E.place
- elist.array = id.place
- elist.Ndim = 1; }
- elist → elist1, E { T1 = gentemp ();
- gencode (T1 = elist1.place *
- limit (elist1.array, elist1.Ndim + 1))
- gencode (T1 = T1 + E.place)
- elist.array = elist1.array
- elist.place = T1;
- elist.Ndim = elist1.Ndim + 1
- }
- For example, consider the following assignment statement:
- where a and b are arrays of size 30 × 40, and c and d are arrays of size 20.
- There are four bytes per word, and the arrays are allocated statically. When the above translation scheme is used to
- translate this construct, the three-address code generated is:
- 6.10 SWITCH/CASE
- To capture the syntactic structure of the switch statement, we add the following productions to the grammar. Here,
- break is assumed to be a part of statement that is derivable from a nonterminal S.
- S → switch E { caselist}
- caselist → caselist case V : S
- caselist → case V: S
- caselist → default: S
- caselist → caselist default: S
- A switch statement is comprised of two components: an expression E, which is used to select a particular case from
- the list of cases; and a caselist, which is a list of n number of cases, each of which corresponds to one of the possible
- values of the expression E, perhaps including a default case.
- Note A case statement can be implemented in a variety of different ways. If the number of cases is not too great, then
- a case statement can be implemented by generating a sequence of conditional jumps, each of which tests for an
- individual value and transfers to the code for the corresponding statement. If the number of cases is large, then it
- is more efficient to construct a hash table for the case values with the labels of the various statements as entries.
- A syntax-directed translation scheme that translates a case statement into a sequence of conditional jumps, each of
- which tests for an individual value and transfers to the code for the corresponding statement, is considered below. We
- begin with a typical switch statement:
- switch (E)
- {
- case V1: S1
- case V2: S2
- .
- .
- .
- case Vn:Sn
- }
- The generated three-address code that is required for the statement is shown in Figure 6.17. Here, NEXT is the label of the
- code for the statement that comes after the switch statement in the execution order.
- Figure 6.17: A switch/case three-address translation.
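- In outline, with t holding the value of E, L1, … , Ln labeling the case code, DEFAULT labeling the code of the default case (if any), and NEXT labeling the statement following the switch, the generated code has the following shape:
- t = value of E
- goto TEST
- L1: code for S1
- goto NEXT
- L2: code for S2
- goto NEXT
- . . .
- Ln: code for Sn
- goto NEXT
- TEST: if t = V1 goto L1
- if t = V2 goto L2
- . . .
- if t = Vn goto Ln
- goto DEFAULT
- NEXT: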
- Therefore, switch statement translation involves generating an unconditional jump after the code of every S1, S2, … ,
- Sn statement in order to transfer control to the statement following the switch, as well as remembering where the code
- of each of S1, S2, … , Sn starts so that the conditional jumps can be generated. Each of these jumps tests for an
- individual value and transfers control to the code for the corresponding statement. This requires introducing nullable
- nonterminals before S1, S2, … , Sn, as shown in Figure 6.18.
- Figure 6.18: Nullable nonterminals are introduced into a switch statement translation.
- EXAMPLE 6.1
- Consider the following switch statement:
- switch (i + j )
- {
- case 1: x = y + z
- default: p = q + r
- case 2: u = v + w
- }
- The above translation scheme translates into the following three-address code, which is also shown in Figure 6.19:
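- (with quads numbered from 1 and t1, … , t4 as temporaries)
- 1. t1 = i + j
- 2. goto(12)
- 3. t2 = y + z
- 4. x = t2
- 5. goto NEXT
- 6. t3 = q + r
- 7. p = t3
- 8. goto NEXT
- 9. t4 = v + w
- 10. u = t4
- 11. goto NEXT
- 12. if t1 = 1 goto(3)
- 13. if t1 = 2 goto(9)
- 14. goto(6)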
- Figure 6.19: Contents of queue during the translation.
- EXAMPLE 6.2
- The above translation scheme translates the following switch statement:
- switch (a + b)
- {
- case 2: { x = y; break; }
- case 5: { switch (x)
- {
- case 0: { a = b + 1; break; }
- case 1: { a = b + 3; break; }
- default: { a = 2; break; }
- }
- break;
- }
- case 9: { x = y - 1; break; }
- default: { a = 2; break; }
- }
- The three-address code is:
- 1. t1 = a + b
- 2. goto(23)
- 3. x = y
- 4. goto NEXT
- 5. goto(14)
- 6. t3 = b + 1
- 7. a = t3
- 8. goto NEXT
- 9. t4 = b + 3
- 10. a = t4
- 11. goto NEXT
- 12. a = 2
- 13. goto NEXT
- 14. if x = 0 goto(6)
- 15. if x = 1 goto(9)
- 16. goto(12)
- 17. goto NEXT
- 18. t5 = y - 1
- 19. x = t5
- 20. goto NEXT
- 21. a = 2
- 22. goto NEXT
- 23. if t1 = 2 goto(3)
- 24. if t1 = 5 goto(5)
- 25. if t1 = 9 goto(18)
- 26. goto(21)
- 6.11 THE PROCEDURE CALL
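- A procedure call is translated by generating a param statement for each argument, followed by a call statement that names the procedure and gives the count of the arguments. Since the param statements are to be generated together, immediately before the call, the places of the argument expressions are held on a queue while the arguments are processed: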
- S → call id (arglist)
- { for each value T in the queue do
- gencode (param T);
- gencode (call id.place, arglist.count)
- }
- arglist → arglist1, E { append (queue, E.place);
- arglist.count = arglist1.count + 1 }
- arglist → E { initialize queue by adding E.place;
- arglist.count = 1 }
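- For example, a call such as p(a + b, c) would be translated as:
- t1 = a + b
- param t1
- param c
- call p, 2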
- 6.12 EXAMPLES
- Following are additional examples of syntax-directed definitions and translations.
- EXAMPLE 6.3
- Generate the three-address code for the following C program:
- main()
- { int i = 1;
- int a[10];
- while(i <= 10)
- a[i] = 0;
- }
- The three-address code for the above C program is:
- 1. i = 1
- 2. if i <= 10 goto(4)
- 3. goto(8)
- 4. t1 = i * width
- 5. t2 = addr(a) - width
- 6. t2[t1] = 0
- 7. goto(2)
- where width is the number of bytes required for each element.
- EXAMPLE 6.4
- Generate the three-address code for the following program fragment:
- while (A < C and B > D) do
- if A = 1 then C = C+1
- else
- while A <= D do
- A = A + 3
- The three-address code is:
- 1. if a < c goto(3)
- 2. goto(16)
- 3. if b > d goto(5)
- 4. goto(16)
- 5. if a = 1 goto(7)
- 6. goto(10)
- 7. t1 = c + 1
- 8. c = t1
- 9. goto(1)
- 10. if a <= d goto(12)
- 11. goto(1)
- 12. t2 = a + 3
- 13. a = t2
- 14. goto(10)
- 15. goto(1)
- EXAMPLE 6.5
- Generate the three-address code for the following program fragment, where a and b are arrays of size 20 × 20, and
- there are four bytes per word.
- begin
- add = 0;
- i = 1;
- j = 1;
- do
- begin
- add = add + a[i,j] * b[j,i]
- i = i + 1;
- j = j + 1;
- end
- while i <= 20 and j <= 20;
- end
- The three-address code is:
- 1. add = 0
- 2. i = 1
- 3. j = 1
- 4. t1 = i * 20
- 5. t1 = t1 + j
- 6. t1 = t1 * 4
- 7. t2 = addr(a) - 84
- 8. t3 = t2[t1]
- 9. t4 = j * 20
- 10. t4 = t4 + i
- 11. t4 = t4 * 4
- 12. t5 = addr(b) - 84
- 13. t6 = t5[t4]
- 14. t7 = t3 * t6
- 15. add = add + t7
- 16. t8 = i + 1
- 17. i = t8
- 18. t9 = j + 1
- 19. j = t9
- 20. if i <= 20 goto(22)
- 21. goto NEXT
- 22. if j <= 20 goto(4)
- 23. goto NEXT
- EXAMPLE 6.6
- Consider the program fragment:
- sum = 0
- for(i = 1; i<= 20; i++)
- sum = sum + a[i] + b[i];
- and generate the three-address code for it. There are four bytes per word.
- The three-address code is:
- 1. sum = 0
- 2. i = 1
- 3. if i <= 20 goto(8)
- 4. goto NEXT
- 5. t1 = i + 1
- 6. i = t1
- 7. goto(3)
- 8. t2 = i * 4
- 9. t3 = addr(a) - 4
- 10. t4 = t3[t2]
- 11. t5 = i * 4
- 12. t6 = addr(b) - 4
- 13. t7 = t6[t5]
- 14. t8 = sum + t4
- 15. t8 = t8 + t7
- 16. sum = t8
- 17. goto(5)
- Chapter 7: Symbol Table Management
- 7.1 THE SYMBOL TABLE
- A symbol table is a data structure used by a compiler to keep track of scope/binding information about names. This
- information is used in the source program to identify the various program elements, like variables, constants,
- procedures, and the labels of statements. The symbol table is searched every time a name is encountered in the
- source text. When a new name or new information about an existing name is discovered, the content of the symbol
- table changes. Therefore, a symbol table must have an efficient mechanism for accessing the information held in the
- table as well as for adding new entries to the symbol table.
- For efficiency, our choice of the implementation data structure for the symbol table, and the organization of its contents,
- should minimize the cost of adding new entries and of accessing the information on existing entries. Also, if the
- symbol table can grow dynamically as necessary, then it is more useful for a compiler.
- 7.2 IMPLEMENTATION
- Each entry in a symbol table can be implemented as a record that consists of several fields. These fields are
- dependent on the information to be saved about the name. But since the information about a name depends on the
- usage of the name (i.e., on the program element identified by the name), the entries in the symbol table records will
- not be uniform. Hence, to keep the symbol table records uniform, some of the information about the name is kept
- outside of the symbol table record, and a pointer to this information is stored in the symbol table record, as shown in
- Figure 7.1. Here, the information about the lower and upper bounds of the dimension of the array named a is kept
- outside of the symbol table record, and the pointer to this information is stored within the symbol table record.
- Figure 7.1: A pointer steers the symbol table to remotely stored information for the array a.
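- A minimal sketch in C of such a record layout (the field names here are illustrative, not the book's):
- struct array_info {              /* kept outside the symbol table record */
-     int ndim;                    /* number of dimensions */
-     int lb[10], ub[10];          /* lower and upper bound of each dimension */
- };
- struct sym_record {              /* a uniform symbol table record */
-     char *name;                  /* the name, or a pointer into a string table */
-     int type;                    /* type code */
-     void *info;                  /* e.g., points to a struct array_info */
- };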
- 7.3 ENTERING INFORMATION INTO THE SYMBOL TABLE
- Information is entered into the symbol table in various ways. In some cases, the symbol table record is created by the
- lexical analyzer as soon as the name is encountered in the input, and the attributes of the name are entered when the
- declarations are processed. But very often, the same name is used to denote different objects, perhaps even in the
- same block. For example, in C programming, the same name can be used as a variable name and as a member
- name of a structure, both in the same block. In such cases, the lexical analyzer only returns the name to the parser,
- rather than a pointer to the symbol table record. That is, a symbol table record is not created by the lexical analyzer;
- the string itself is returned to the parser, and the symbol table record is created when the name's syntactic role is
- discovered.
- 7.4 WHERE SHOULD NAMES BE HELD?
- If there is a modest upper bound on the length of the name, then the name can be stored in the symbol table record
- itself. But if there is no such limit, or if the limit is rarely reached, then an indirect scheme of storing names is used. A
- separate array of characters, called a "string table," is used to store the name, and a pointer to the name is kept in the
- symbol table record, as shown in Figure 7.2.
- Figure 7.2: Symbol table names are held either in the symbol table record or in a separate string table.
- 7.5 INFORMATION ABOUT THE RUNTIME STORAGE LOCATION
- The information about the runtime storage location of a name is kept in the symbol table. If the compiler is going to be
- generating assembly code, then the assembler takes care of the storage locations of the various names. After
- generating the assembly code, the compiler scans the symbol table and generates the assembly language data
- definitions. These are appended to the assembly language code for each name. But if machine code is being
- generated, then the compiler must ascertain the position of each data object relative to a fixed origin.
- 7.6 VARIOUS APPROACHES TO SYMBOL TABLE ORGANIZATION
- There are several methods of organizing the symbol table. These methods are discussed below.
- 7.6.1 The Linear List
- A linear list of records is the easiest way to implement a symbol table. The new names are added to the table in the
- order that they arrive. Whenever a new name is to be added to the table, the table is first searched linearly or
- sequentially to check whether or not the name is already present in the table. If the name is not present, then a
- record for the new name is created and added to the list at the position specified by the available pointer, as shown in
- Figure 7.3.
- Figure 7.3: A new record is added to the linear list of records.
- To retrieve the information about the name, the table is searched sequentially, starting from the first record in the
- table. The average number of comparisons, p, required for a search is p = (n + 1)/2 for a successful search and p = n for
- an unsuccessful search, where n is the number of records in the symbol table. The advantage of this organization is that it
- takes less space, and additions to the table are simple. This method's disadvantage is that it has a higher accessing
- time.
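- The following C sketch illustrates this organization (the names and the attribute representation are illustrative only):
- #include <stdlib.h>
- #include <string.h>
- struct entry { char *name; int attr; struct entry *next; };
- static struct entry *first = NULL;       /* head of the linear list */
- /* sequential search: returns the record for name, or NULL on an unsuccessful search */
- struct entry *lookup(const char *name) {
-     for (struct entry *e = first; e != NULL; e = e->next)
-         if (strcmp(e->name, name) == 0) return e;
-     return NULL;
- }
- /* add a new name after an unsuccessful search */
- struct entry *insert(const char *name, int attr) {
-     struct entry *e = lookup(name);
-     if (e != NULL) return e;             /* already present in the table */
-     e = malloc(sizeof *e);
-     e->name = strdup(name);
-     e->attr = attr;
-     e->next = first;                     /* link the new record into the list */
-     first = e;
-     return e;
- }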
- 7.6.2 Search Trees
- A search tree is a more efficient approach to symbol table organization. We add two links, left and right, to each
- record, and these links point to other records in the search tree. Whenever a name is to be added, the name is first
- searched for in the tree. If it does not exist, then a record for the new name is created and added at the proper position in
- the search tree. This organization has the property of alphabetical accessibility; that is, all the names accessible from
- name i by following the left link precede name i in alphabetical order. Similarly, all the names accessible from name i
- by following the right link follow name i in alphabetical order (see Figure 7.4). The expected time needed to enter n
- names and to make m queries is proportional to (m + n) log2 n; so for greater numbers of records (higher n) this method
- has advantages over linear list organization.
- Figure 7.4: The search tree organization approach to a symbol table.
- 7.6.3 Hash Tables
- A hash table is a table of k pointers, numbered from zero to k - 1, each of which points to a list of records within the
- symbol table. To enter a name into the symbol table, we find out the hash value of the name by applying a suitable hash
- function. The hash function maps the name into an integer between zero and k - 1, and using this value as an index in
- the hash table, we search the list of the symbol table records that is built on that hash index. If the name is not present
- in that list, we create a record for the name and insert it at the head of the list. When retrieving the information associated
- with the name, the hash value of the name is first obtained, and then the list that was built on this hash value is
- searched for information about the name (Figure 7.5).
- Figure 7.5: Hash table method of symbol table organization.
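- A minimal C sketch of this organization, assuming k = 211 header pointers and an illustrative hash function:
- #include <stdlib.h>
- #include <string.h>
- #define K 211                            /* number of header pointers */
- struct entry { char *name; int attr; struct entry *next; };
- static struct entry *bucket[K];          /* the hash table: K list headers */
- /* map a name to an integer between 0 and K - 1 */
- static unsigned hash(const char *s) {
-     unsigned h = 0;
-     while (*s) h = h * 31 + (unsigned char)*s++;
-     return h % K;
- }
- /* search the list built on the name's hash index; insert at its head if absent */
- struct entry *enter(const char *name, int attr) {
-     unsigned h = hash(name);
-     for (struct entry *e = bucket[h]; e != NULL; e = e->next)
-         if (strcmp(e->name, name) == 0) return e;
-     struct entry *e = malloc(sizeof *e);
-     e->name = strdup(name);
-     e->attr = attr;
-     e->next = bucket[h];
-     bucket[h] = e;
-     return e;
- }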
- 7.7 REPRESENTING THE SCOPE INFORMATION IN THE SYMBOL TABLE
- Every name possesses a region of validity within the source program, called the "scope" of that name. The rules
- governing the scope of names in a block-structured language are as follows:
- 1. A name declared within a block B is valid only within B.
- 2. If block B1 is nested within B2, then any name that is valid for B2 is also valid for B1, unless the identifier for that name is re-declared in B1.
- These scope rules require a more complicated symbol table organization than simply a list of associations between
- names and attributes. One technique that can be used is to keep multiple symbol tables, one for each active block,
- such as the block that the compiler is currently in. Each table is a list of names and their associated attributes, and the
- tables are organized into a stack. Whenever a new block is entered, a new empty table is pushed onto the stack for
- holding the names that are declared as local to this block. And when a declaration is compiled, the table on top of the
- stack is searched for the name. If the name is not found, then the new name is inserted. When a reference to a name is
- translated, each table is searched, starting from the top table on the stack, ensuring compliance with static scope
- rules. For example, consider the following program structure. The symbol table organization will be as shown in Figure
- 7.6.
- Program main
- Var x, y : integer;
- Procedure P;
- Var x, a : boolean;
- Procedure q;
- Var x, y, z : real;
- begin
- .
- .
- end
- begin
- .
- end
- begin
- .
- end
- Figure 7.6: Symbol table organization that complies with static scope information rules.
- Another technique can be used to represent scope information in the symbol table. We store the nesting depth of
- each procedure block in the symbol table and use the [procedure name, nesting depth] pair as the key for accessing
- the information from the table. The nesting depth of a procedure is a number that is obtained by starting with a value of
- one for the main and adding one to it every time we go from an enclosing to an enclosed procedure. This number is
- basically a count of how many procedures are in the referencing environment of the procedure.
- For example, refer to the program code structure above. The symbol table's contents are shown in Table 7.1.
- Table 7.1: Symbol Table Contents Using a Nesting Depth Approach
- Name   Nesting Depth   Type
- x      4               real
- y      4               real
- z      4               real
- q      3               proc
- a      3               boolean
- x      3               boolean
- P      2               proc
- y      2               integer
- x      2               integer
- Chapter 8: Storage Management
- 8.1 STORAGE ALLOCATION
- One of the important tasks that a compiler must perform is to allocate the resources of the target machine to
- represent the data objects that are being manipulated by the source program. That is, a compiler must decide the
- run-time representation of the data objects in the source program. Source program run-time representations of the
- data objects, such as integers and real variables, usually take the form of equivalent data objects at the machine level;
- whereas data structures, such as arrays and strings, are represented by several words of machine memory.
- The strategies that can be used to allocate storage to the data objects are determined by the rules defining the scope
- and duration of the names in the programming language. The simplest strategy is static allocation, which is used in
- languages like FORTRAN. With static allocation, it is possible to determine the run-time size and relative position of
- each data object during compilation. A more-complex strategy for dynamic memory allocation that involves stacks is
- required for languages that support recursion: an entry to a new block or procedure causes the allocation of space on
- a stack, which is freed on exit from the block or procedure. An even more complex strategy is required for languages
- that allow the allocation and freeing of memory for some data in a non-nested fashion. This storage space can be
- allocated and freed arbitrarily from an area called a "heap". Therefore, implementations of languages like PASCAL and
- C allow data to be allocated under program control. The run-time organization of the memory will be as shown in
- Figure 8.1.
- Figure 8.1: Heap memory storage allows program-controlled data allocation.
- The run-time storage has been subdivided to hold the generated target code, the statically allocated data objects, and
- the stack and the heap. The sizes of the stack and heap can change as the program executes.
- 8.2 ACTIVATION OF THE PROCEDURE AND THE ACTIVATION RECORD
- Each execution of a procedure is referred to as an activation of the procedure. This is different from the procedure
- definition, which in its simplest form is the association of an identifier with a statement; the identifier is the name of the
- procedure, and the statement is the body of the procedure.
- If a procedure is non-recursive, then there exists only one activation of the procedure at any one time. Whereas if a
- procedure is recursive, several activations of that procedure may be active at the same time. The information needed
- by a single execution or a single activation of a procedure is managed using a contiguous block of storage called an
- "activation record" or "activation frame," consisting of a collection of fields. (Very often, registers take the place of
- one or more of the fields in the activation record.) The activation record contains the following information:
- 1. Temporary values, such as those arising during the evaluation of expressions.
- 2. Local data of the procedure.
- 3. The information about the machine state (i.e., the machine status) just before the procedure is called, including the PC value and the values of the registers that must be restored when control is relinquished after the procedure.
- 4. Access links (optional) referring to non-local data held in other activation records. This is not required for a language like FORTRAN, because non-local data is kept in a fixed place. But it is required for Pascal.
- 5. Actual parameters (i.e., the parameters supplied to the called procedure). These parameters may also be passed in machine registers for greater efficiency.
- 6. The return value used by the called procedure to return a value to the calling procedure. Again, for greater efficiency, a machine register may be used for returning values.
- The size of almost all of the fields of the activation record can be determined at compile time. An exception is if a
- called procedure has a local array whose size is determined by the values of the actual parameters.
- The information in the activation record is organized in a manner that enables easy access at execution time. A pointer
- to the activation record is required. This pointer is called the current environment pointer (CEP), and it points to one of
- the fixed fields in the activation record. Using the proper offset from this pointer, and depending upon the format of the
- activation record, the contents of the activation record can be accessed. Figure 8.2 shows the organization of
- information in a typical activation record.
- Figure 8.2: Typical format of an activation record.
- 8.3 STATIC ALLOCATION
- In static allocation, the names are bound to specific storage locations as the program is compiled. These storage
- locations cannot be changed during the program's execution. Since the binding does not change at run time, every
- time a procedure is called, its names are bound to the same storage locations. Hence, if the local names are allocated
- statically, then their values will be retained across the activations of the procedure. The compiler uses the type of a name
- to determine the amount of storage to set aside for that name. The address of this storage consists of an offset from
- the end of the activation record for the procedure. The compiler must decide where the activation records go relative to the
- target code and relative to other activation records. Once this decision is made, the storage position for each name in
- the record is fixed. Therefore, at compile time, it is possible to fill in both the address at which the target code can find
- the data and the address at which information is saved. However, there are some limitations to using static allocation:
- 1. The size of the data object and any constraints on its position in memory must be known at compile time.
- 2. Recursive procedures cannot be permitted, because all activations of a procedure use the same binding for local names.
- 3. Data structures cannot be created dynamically, since there is no mechanism for storage allocation at run time.
- 8.4 STACK ALLOCATION
- In stack allocation, storage is organized as a stack, and activation records are pushed and popped as the activation of
- procedures begin and end, respectively, thereby permitting recursive procedures. The storage for the locals in each
- procedure call is contained in the activation record for that call. Hence, the locals are bound to fresh storage in each
- activation, because a new activation record is pushed onto stack when a call is made. The storage values of locals are
- deleted when the activation ends.
- 8.4.1 The Call and Return Sequence
- Procedure calls are implemented by generating what is called a "call sequence and return sequence" in the target
- code. The job of a call sequence is to set up an activation record. Setting up an activation record means entering the
- information into the fields of the activation record if the storage for the activation record is allocated statically. When
- the storage for the activation record is allocated dynamically, storage is allocated for it on the stack, and the
- information is entered in its fields.
- On the other hand, the job of a return sequence is to restore the state of the machine so that the calling
- procedure can continue executing. This also involves destroying the activation record if it was allocated dynamically on
- the stack.
- The code in a call sequence is often divided between the caller and the callee. But there is no exact division of
- run-time tasks between the caller and callee. It depends on the source language, the target machine, and the
- operating system. Hence, even when using a common language, the call sequence may differ from implementation to
- implementation. But it is desirable to put as much of the calling sequence into the callee as possible, because there
- may be several calls for a procedure. And even though that portion of the calling sequence is generated for each call
- by the various callers, this portion of the calling sequence is shared within the callee, so it is generated only once.
- Figure 8.3 shows the format of a typical activation record. Here, the contents of the activation record are accessed
- using the CEP pointer.
- Figure 8.3: The CEP pointer is used to access the contents of the activation record.
- The stack is assumed to be growing from higher to lower addresses. A positive offset will be used to access the
- contents of the activation record when we want to go in a direction opposite to that of the growth of the stack (in Figure
- 8.3, the field pointed to by the CEP). A negative offset will be used to access the contents of the activation record
- when we want to go in the same direction as the growth of stack. A typical call sequence for caller code to evaluate
- parameters is as follows:
- push ( ) /* for the return value
- push (T1) /* T1 is holding the first argument
- push (T2) /* T2 is holding the second argument
- .
- .
- .
- push (Tn) /* Tn is holding the nth argument
- push (n) /* n is the count of the arguments
- push (return address)
- push (CEP)
- goto start of code segment of callee
- A typical callee code segment is shown in Figure 8.4.
- Call sequence
- Object code of the callee
- Return sequence
- Figure 8.4: Typical callee code segment.
- A typical call sequence in the callee will be:
- CEP = top
- /* code for pushing the local data of the callee */
- And a typical return sequence is:
- top = CEP + 1
- l = *top /* for retrieving the return address
- top = top + 1
- CEP = *CEP /* for resetting the CEP to point to the activation record of the caller
- top = top + *top + 2 /* for resetting top to point to the top of the activation record of the caller
- goto l
- 8.4.2 Access to Nonlocal Names
- The way that the nonlocals are accessed depends on the scope rules of the language (see Chapter 7). There are two
- different types of scope rules: static scope rules and dynamic scope rules.
- Static scope rules determine which declaration a name's reference will be associated with, depending upon the
- program's syntactic structure, thereby determining from where the name's value will be obtained at run time. When static
- scope rules are used, the compiler knows at compilation time how the declarations are bound to name references, and
- hence, from where their values will be obtained at run time. What the compiler has to do is to provision the retrieval of
- the nonlocal name's value when it is accessed at run time.
- Whereas when dynamic scope rules are used, the values of nonlocal names are retrieved at run time by scanning
- down the stack, starting at the top-most activation record. The rule for associating a nonlocal reference with a declaration
- is simple when procedure nesting is not permitted. In the absence of nested procedures, the storage for all names
- declared outside any procedure can be allocated statically. The position of this storage is known at compile time, so if
- a name is nonlocal in some procedure's body, its statically determined address is used; whereas if a name is local, it is
- accessed via the CEP pointer using a suitable offset.
- An important benefit of static allocation for nonlocals is that declared procedures can be freely passed as parameters
- and returned as results. For example, a function in C is passed by address; that is, a pointer to it is passed. When the
- procedures are nested, declarations are bound to name references according to the following rule: if a name x is not
- declared in a procedure P, then an occurrence of x in P is in the scope of a declaration of x in an enclosing procedure
- P1 such that:
- 1. the enclosing procedure P1 has a declaration of x, and
- 2. P1 is more closely nested around P than any other procedure with a declaration of x.
- Therefore, a reference to a nonlocal name x is resolved by associating it with the declaration of x in P1, and the
- compiler is required to provision getting the value of x at run time from the most-recent activation record of P1 by
- generating a suitable call sequence.
- One of the ways to implement this is to add a pointer, called an "access link," to each activation record. If a
- procedure P is nested immediately within Q in the source text, then the access link in an activation record of P points
- to the most-recent activation record of Q. This requires an activation record with a format like that shown in
- Figure 8.5.
- Figure 8.5: An activation record that deals with nonlocal name references.
- The modified call and return sequence, required for setting up the activation record shown in Figure 8.5, is:
- push ( ) /* for the return value
- push (T1) /* T1 is holding the first argument
- push (T2) /* T2 is holding the second argument
- .
- .
- .
- push (Tn) /* Tn is holding the nth argument
- push (n) /* n is the count of the arguments
- push (return address)
- push (CEP)
- code to set up the access link
- goto start of code segment of callee
- A typical callee segment is shown in Figure 8.6.
- Call sequence
- Object code of the callee
- Return sequence
- Figure 8.6: A typical callee segment.
- A typical call sequence in the callee is:
- CEP = top + 1
- /* code for pushing the local data of the callee */
- A typical return sequence is:
- top = CEP + 1
- l = *top /* for retrieving the return address
- top = top + 1
- CEP = *CEP /* for resetting the CEP to point to the activation record of the caller
- top = top + *top + 2 /* for resetting top to point to the top of the activation record of the caller
- goto l
- 8.4.3 Setting Up the Access Link
- To generate the code for setting up the access link, a compiler makes use of the following information: the nesting
- depth of the caller procedure and the nesting depth of the callee procedure. A procedure's nesting depth is a number
- that is obtained by starting with a value of one for the main and adding one to it every time we go from an enclosing to
- an enclosed procedure. This number is basically a count of how many procedures are in the referencing
- environment of the procedure.
- Suppose that a procedure p at nesting depth Np calls a procedure q at nesting depth Nq. Then the access link in the
- activation record of procedure q is set up as follows:
- if (Nq > Np) then
- the access link in the activation record of procedure q is set to point to the activation record of procedure p;
- else if (Nq = Np) then
- the access link in the activation record of procedure p is copied into the activation record of procedure q;
- else if (Nq < Np) then
- follow (Np - Nq) access links from the activation record of p to reach an activation record at depth Nq, and copy the
- access link of that activation record into the activation record of procedure q.
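- In outline, the access link can be set up as follows (a sketch, assuming each activation record is reachable through a pointer and carries an access_link field):
- struct frame { struct frame *access_link; /* ... other fields ... */ };
- /* p: activation record of the caller at depth Np; q: new record at depth Nq */
- void set_access_link(struct frame *q, struct frame *p, int Nq, int Np) {
-     if (Nq > Np) {
-         q->access_link = p;              /* q is nested directly within p */
-     } else {
-         struct frame *r = p;             /* Nq <= Np: follow Np - Nq links ... */
-         for (int i = 0; i < Np - Nq; i++)
-             r = r->access_link;
-         q->access_link = r->access_link; /* ... and copy that record's access link */
-     }
- }
- When Nq = Np, the loop body is not executed, so the access link of p is simply copied, which is exactly the second case above.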
- The Block Statement
- A block is a statement that contains its own local data declarations. Blocks can either be independent (B1 begin and B1
- end, then B2 begin and B2 end) or nested (B1 begin and B2 begin, then B2 end and B1 end). This nesting property is
- sometimes called "block structure". The scope of a declaration in a block-structured language is given by the most
- closely nested rule:
- 1. The scope of a declaration in a block B includes B.
- 2. If a name X is not declared in a block B, then an occurrence of X in B is in the scope of a declaration of X in an enclosing block B′ such that:
- a. B′ has a declaration of X, and
- b. B′ is more closely nested around B than any other block with a declaration of X.
- Block structure can be implemented using stack allocation. Space is allocated for declared names. The block is
- entered by pushing an activation record, and it is de-allocated when control leaves the block and the activation record
- is destroyed. That is, a block is treated like a parameter-less procedure, called only at the entry to the block and
- returned upon exit from the block. An alternative is to allocate storage for a complete procedure body at one time. If
- there are blocks within the procedure, then an allowance is made for the storage needed by the declarations within the
- block, as shown in Figure 8.7. For example, consider the following program structure:
- main ()
- {
- int a;
- {
- int b;
- {
- int c;
- printf ("%d %d\n", b, c);
- }
- {
- int d;
- printf ("%d %d\n", b, d);
- }
- }
- printf ("%d\n", a);
- }
- (Storage blocks: a; b; and c, d, where c and d share the same space because their blocks are never active simultaneously.)
- Figure 8.7: Storage for declared names.
- Chapter 9: Error Handling
- 9.1 ERROR RECOVERY
- One of the important tasks that a compiler must perform is the detection of and recovery from errors. Recovery from
- errors is important, because the compiler will be scanning and compiling the entire program, perhaps in the presence
- of errors; so as many errors as possible need to be detected.
- Every phase of a compilation expects the input to be in a particular format, and whenever that input is not in the
- required format, an error is returned. When detecting an error, a compiler scans some of the tokens that are ahead of
- the error's point of occurrence. The fewer the number of tokens that must be scanned ahead of the point of error
- occurrence, the better the compiler's error-detection capability. For example, consider the following statement:
- if a = b then x := y + z;
- The error in the above statement will be detected in the syntactic analysis phase, but not before the syntax analyzer
- sees the token "then"; but the first token, itself, is in error.
- After detecting an error, the first thing that a compiler is supposed to do is to report the error by producing a suitable
- diagnostic. A good error diagnostic should possess the following properties.
- 1. The message should be produced in terms of the original source program rather than in terms of some internal representation of the source program. For example, the message should be produced along with the line numbers of the source program.
- 2. The error message should be easy for the user to understand.
- 3. The error message should be specific and should localize the problem. For example, an error message should read, "x is not declared in function fun," and not just, "missing declaration".
- 4. The message should not be redundant; that is, the same message should not be produced again and again.
- Therefore, a compiler should report errors by generating messages with the above properties. The errors captured by
- the compiler can be classified as either syntactic errors or semantic errors. Syntactic errors are those errors that are
- detected in the lexical or syntactic analysis phase by the compiler. Semantic errors are those detected by the
- compiler in the semantic analysis phase.
- 9.2 RECOVERY FROM LEXICAL PHASE ERRORS
- The lexical analyzer detects an error when it discovers that an input's prefix does not fit the specification of any token
- class. After detecting an error, the lexical analyzer can invoke an error recovery routine. This can entail a variety of
- remedial actions.
- The simplest possible error recovery is to skip the erroneous characters until the lexical analyzer finds another token.
- But this is likely to cause the parser to see a deletion error, which can cause severe difficulties in the syntax-analysis
- and remaining phases. One way the parser can help the lexical analyzer improve its ability to recover from errors
- is to make its list of legitimate tokens (in the current context) available to the error recovery routine. The error-recovery
- routine can then decide whether a remaining input's prefix matches one of these tokens closely enough to be treated
- as that token.
- 9.3 RECOVERY FROM SYNTACTIC PHASE ERRORS
- A parser detects an error when it has no legal move from its current configuration. The LL(1) and LR(1) parsers use
- the valid prefix property; therefore, they are capable of announcing an error as soon as they read an input that is not a
- valid continuation of the previous input's prefix. This is the earliest time that a left-to-right parser can announce an error.
- But there are a variety of other types of parsers that do not necessarily have this property.
- The advantage of using a parser with a valid-prefix-property capability is that it reports an error as soon as possible,
- and it minimizes the amount of erroneous output passed to subsequent phases of the compiler.
- Panic Mode Recovery
- Panic mode recovery is an error recovery method that can be used in any kind of parsing, whereas most other recovery
- methods depend on the type of parsing technique used. In panic mode recovery, a parser discards input symbols
- until a statement delimiter, such as a semicolon or an end, is encountered. The parser then deletes stack entries until it
- finds an entry that will allow it to continue parsing, given the synchronizing token on the input. This method is simple to
- implement, and it never gets into an infinite loop.
- 9.4 ERROR RECOVERY IN LR PARSING
- A systematic method for error recovery in LR parsing is to scan down the stack until a state S with a goto on a
- particular nonterminal A is found, and then discard zero or more input symbols until a symbol a is found that can
- legitimately follow A. The parser then shifts the state goto [S, A] on the stack and resumes normal parsing.
- There might be more than one choice for the nonterminal A. Normally, these would be nonterminals representing
- major program pieces, such as statements.
- Another method of error recovery that can be implemented is called "phrase level recovery". Each error entry in the LR
- parsing table is examined, and, based on language usage, an appropriate error-recovery procedure is constructed.
- For example, to recover from a construct error that starts with an operator, the error-recovery routine will push an
- imaginary id onto the stack and cover it with the appropriate state. While doing this, the error entries in a particular
- state that call for a particular reduction on some input symbols are replaced by that reduction. This has the effect of
- postponing the error detection until one or more reductions are made; but the error will still be caught before a shift.
- A phrase level error-recovery implementation for an LR parser is shown below. The parsing table is constructed for the following grammar:
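- 1. E → E + E
- 2. E → E * E
- 3. E → id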
- The SLR parsing table for the above grammar is shown in Table 9.1.
- Table 9.1: Parsing Table for E → E + E | E * E | id
- State   id      +        *        $        E
- I0      S2                                 1
- I1              S3       S4       Accept
- I2              R3       R3       R3
- I3      S2                                 5
- I4      S2                                 6
- I5              S3/R1    S4/R1    R1
- I6              S3/R2    S4/R2    R2
- The conflict is resolved by giving higher precedence to * and using left-associativity, as shown in Table 9.2.
- Table 9.2: Higher Precedent * and Left-Associativity
- State   id      +        *        $        E
- I0      S2                                 1
- I1              S3       S4       Accept
- I2              R3       R3       R3
- I3      S2                                 5
- I4      S2                                 6
- I5              R1       S4       R1
- I6              R2       R2       R2
- The parsing table with error routines is shown in Table 9.3, where routine e1 is called from states I0, I3, and I4. It
- pushes an imaginary id onto the stack and covers it with state I2. The routine e2 is called from state I1, and it pushes +
- onto the stack and covers it with state I3.
- Table 9.3: Parsing Table with Error Routines
- State   id      +        *        $        E
- I0      S2      e1       e1       e1       1
- I1      e2      S3       S4       Accept
- I2      R3      R3       R3       R3
- I3      S2      e1       e1       e1       5
- I4      S2      e1       e1       e1       6
- I5      R1      R1       S4       R1
- I6      R2      R2       R2       R2
- For example, if we trace the behavior of the parser described above for the input id + *id $:
- Stack Contents              Unspent Input    Moves
- $I0                         id+*id$          shift and enter into state 2
- $I0 idI2                    +*id$            reduce by production number 3
- $I0 EI1                     +*id$            shift and enter into state 3
- $I0 EI1 +I3                 *id$             call error routine e1
- $I0 EI1 +I3 idI2            *id$             reduce by production number 3 (idI2 pushed by e1)
- $I0 EI1 +I3 EI5             *id$             shift and enter into state 4
- $I0 EI1 +I3 EI5 *I4         id$              shift and enter into state 2
- $I0 EI1 +I3 EI5 *I4 idI2    $                reduce by production number 3
- $I0 EI1 +I3 EI5 *I4 EI6     $                reduce by production number 2
- $I0 EI1 +I3 EI5             $                reduce by production number 1
- $I0 EI1                     $                accept
- Similarly, if we trace the behavior of the parser for the input id id*id $:
- Stack Contents              Unspent Input    Moves
- $I0                         id id*id$        shift and enter into state 2
- $I0 idI2                    id*id$           reduce by production number 3
- $I0 EI1                     id*id$           call error routine e2
- $I0 EI1 +I3                 id*id$           shift and enter into state 2 (+I3 pushed by e2)
- $I0 EI1 +I3 idI2            *id$             reduce by production number 3
- $I0 EI1 +I3 EI5             *id$             shift and enter into state 4
- $I0 EI1 +I3 EI5 *I4         id$              shift and enter into state 2
- $I0 EI1 +I3 EI5 *I4 idI2    $                reduce by production number 3
- $I0 EI1 +I3 EI5 *I4 EI6     $                reduce by production number 2
- $I0 EI1 +I3 EI5             $                reduce by production number 1
- $I0 EI1                     $                accept
- 9.5 AUTOMATIC ERROR RECOVERY IN YACC
- The tool YACC can generate a parser with the ability to automatically recover from errors. Major nonterminals,
- such as those for program blocks or statements, are identified; and then error productions of the form A → error α are
- added to the grammar, where α is usually ∈ .
- When a YACC-generated parser encounters an error, it finds the top-most state on its stack whose underlying set of
- items includes an item of the form A → .error. The parser then shifts the token error, and a reduction to A is
- immediately possible. The parser then invokes the semantic action associated with the production A → error, and this
- semantic action takes care of recovering from the error.
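- For example, a grammar might contain an error production such as the following, whose action (using the standard yyerrok macro) resumes normal parsing once a semicolon is seen:
- stmt : error ';' { yyerrok; }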
- 9.6 PREDICTIVE PARSING ERROR RECOVERY
- An error is detected during predictive parsing when the terminal on the top of the stack does not match the next input
- symbol, or when nonterminal A is on top of the stack, a is the next input symbol, and the parsing table entry M[A, a] is
- an error entry. Panic mode recovery can be used to recover from an error detected by the LL parser. The effectiveness
- of panic mode recovery depends on the choice of the synchronizing token. Several heuristics can be used when
- selecting the synchronizing token in order to ensure quick recovery from common errors:
- 1. All the symbols in FOLLOW(A) must be kept in the set of synchronizing tokens, because if we skip tokens until a symbol in FOLLOW(A) is read and pop A from the stack, it is likely that the parsing can continue.
- 2. Since the syntactic structure of a language is very often hierarchical, we add the symbols that begin higher constructs to the synchronizing set of lower constructs. For example, we add keywords to the synchronizing sets of the nonterminals that generate expressions.
- 3. We also add the symbols in FIRST(A) to the synchronizing set of nonterminal A. This provides for a resumption of parsing according to A if a symbol in FIRST(A) appears in the input.
- 4. A derivation by an ∈ -production can be used as a default. Error detection will be postponed, but the error will still be captured. This method reduces the number of nonterminals that must be considered during error recovery.
- Note Another method of error recovery that can be implemented is called "phrase level recovery". In phrase level
- recovery, each error entry in the LL parsing table is examined, and based on language usage, an appropriate
- error-recovery procedure is constructed. For example, to recover from a construct error that starts with an
- operator, the error-recovery routine will insert an imaginary id into the input. And for rows whose nonterminal
- derives ∈ , the error entries in the row are replaced by the derivation using the ∈ -production. This has the
- effect of postponing error detection.
- A phrase level error-recovery implementation for an LL parser is shown in Tables 9.4 and 9.5. The parsing table is
- constructed for the following grammar:
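- E → TE1
- E1 → +TE1 | ∈
- T → FT1
- T1 → *FT1 | ∈
- F → id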
- Table 9.4: LL Parsing Table
-         id          +            *            $
- E       E → TE1
- T       T → FT1
- F       F → id
- E1                  E1 → +TE1                 E1 → ∈
- T1                  T1 → ∈       T1 → *FT1    T1 → ∈
- id      pop
- +                   pop
- *                                pop
- $                                             accept
- The modified table is shown in Table 9.5. Routine e1, when called, pushes an imaginary id into the input; and routine
- e2, when called, removes all the remaining symbols from the input.
- Table 9.5: Phrase Level Error-Recovery Implementation
-         id          +            *            $
- E       E → TE1     e1           e1           e1
- T       T → FT1     e1           e1           e1
- F       F → id      e1           e1           e1
- E1      E1 → ∈      E1 → +TE1    E1 → ∈       E1 → ∈
- T1      T1 → ∈      T1 → ∈       T1 → *FT1    T1 → ∈
- id      pop
- +                   pop
- *                                pop
- $       e2          e2           e2           accept
- For example, if we trace the behavior of the parser shown in Table 9.5 for the input id + *id $:
- Stack Contents    Unspent Input    Moves
- $E                id+*id$          derive using E → TE1
- $E1T              id+*id$          derive using T → FT1
- $E1T1F            id+*id$          derive using F → id
- $E1T1id           id+*id$          pop
- $E1T1             +*id$            derive using T1 → ∈
- $E1               +*id$            derive using E1 → +TE1
- $E1T+             +*id$            pop
- $E1T              *id$             call error routine e1
- $E1T              id*id$           derive using T → FT1 (imaginary id is pushed by e1)
- $E1T1F            id*id$           derive using F → id
- $E1T1id           id*id$           pop
- $E1T1             *id$             derive using T1 → *FT1
- $E1T1F            id$              derive using F → id
- $E1T1id           id$              pop
- $E1T1             $                derive using T1 → ∈
- $E1               $                derive using E1 → ∈
- $                 $                accept
- Similarly, if we trace the behavior for the input id id*id $:
- Stack Contents    Unspent Input    Moves
- $E                id id*id$        derive using E → TE1
- $E1T              id id*id$        derive using T → FT1
- $E1T1F            id id*id$        derive using F → id
- $E1T1id           id id*id$        pop
- $E1T1             id*id$          derive using T1 → ε
- $E1               id*id$          derive using E1 → ε
- $                 id*id$          call error routine e2
-                                    (id*id$ is removed by e2)
- $                 $               accept
- 9.7 RECOVERY FROM SEMANTIC ERRORS
- The primary sources of semantic errors are undeclared names and type incompatibilities. Recovery from an
- undeclared name is rather straightforward. The first time the undeclared name is encountered, an entry can be made
- in the symbol table for that name, with an attribute that is appropriate to the current context. For example, if a missing
- declaration error for x is encountered, then the error-recovery routine enters the appropriate attribute for x in x's
- symbol table record, depending on the current context of x. A flag is then set in x's symbol table record to indicate
- that the entry was made in response to an error rather than in response to a declaration of x.
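- A minimal sketch of this recovery action follows; the record layout, the err_flag field, and the context-supplied type
- parameter are invented here for illustration and are not part of the book's symbol-table design.
- #include <string.h>
- #include <stdlib.h>
- /* Hypothetical symbol-table record; the fields are stand-ins. */
- struct symbol {
-     char name[32];
-     int  type;        /* attribute inferred from the current context */
-     int  err_flag;    /* set when the entry was made by error recovery */
-     struct symbol *next;
- };
- static struct symbol *symtab = NULL;
- /* Called when a use of an undeclared name is found: install the name with
-    a context-appropriate attribute and flag the entry, so later uses do
-    not trigger the same error again. */
- static struct symbol *recover_undeclared(const char *name, int ctx_type)
- {
-     struct symbol *s = malloc(sizeof *s);
-     strncpy(s->name, name, sizeof s->name - 1);
-     s->name[sizeof s->name - 1] = '\0';
-     s->type = ctx_type;
-     s->err_flag = 1;
-     s->next = symtab;
-     symtab = s;
-     return s;
- }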
- Chapter 10: Code Optimization
- 10.1 INTRODUCTION TO CODE OPTIMIZATION
- The translation of a source program to an object program is not a unique mapping; there are many
- object programs for the same source program, all of which implement the same computations. Some of these
- object programs may be better than others with respect to storage requirements
- and execution speed. Code optimization refers to techniques a compiler can employ in order to produce improved
- object code for a given source program.
- How beneficial the optimization is depends upon the situation. For a program that is only expected to be run a few
- times and then discarded, no optimization is necessary. But if a program is expected to run
- indefinitely, or to run many times, then optimization is useful, because the effort spent on improving the
- program's execution time will be paid back, even if execution time is only reduced by a small percentage.
- What follows are some optimization techniques that are useful when designing optimizing compilers.
- 10.2 WHAT IS CODE OPTIMIZATION?
- Code optimization refers to the techniques used by the compiler to improve the execution efficiency of the generated
- object code. It involves a complex analysis of the intermediate code and the performance of various transformations;
- but every optimizing transformation must also preserve the semantics of the program. That is, a compiler should not
- attempt any optimization that would lead to a change in the program's semantics.
- Optimization can be machine-independent or machine-dependent. Machine-independent optimizations can be
- performed independently of the target machine for which the compiler is generating code; that is, the optimizations are
- not tied to the target machine's specific platform or language. Examples of machine-independent optimizations are:
- elimination of loop invariant computation, induction variable elimination, and elimination of common subexpressions.
- On the other hand, machine-dependent optimization requires knowledge of the target machine. An attempt to generate
- object code that will utilize the target machine's registers more efficiently is an example of machine-dependent code
- optimization. Actually, code optimization is a misnomer; even after performing various optimizing transformations,
- there is no guarantee that the generated object code will be optimal. Hence, we are actually performing code
- improvement. When attempting any optimizing transformation, the following criteria should be applied:
- 1. The optimization should capture most of the potential improvements without an unreasonable
-    amount of effort.
- 2. The optimization should be such that the meaning of the source program is preserved.
- 3. The optimization should, on average, reduce the time and space expended by the object code.
- 10.3 LOOP OPTIMIZATION
- Loop optimization is the most valuable machine-independent optimization because a program's inner loops are good
- candidates for improvement. The important loop optimizations are elimination of loop invariant computations and
- elimination of induction variables. A loop invariant computation is one that computes the same value every time a loop
- is executed. Therefore, moving such a computation outside the loop leads to a reduction in the execution time.
- Induction variables are variables used in a loop whose values change in lock-step with each iteration; hence, it may
- be possible to eliminate all except one of them.
- 10.3.1 Eliminating Loop Invariant Computations
- To eliminate loop invariant computations, we first identify the invariant computations and then move them outside the
- loop, provided the move does not lead to a change in the program's meaning. Identification of loop invariant
- computations requires the detection of loops in the program. Whether a loop exists in the program or not depends on
- the program's control flow, so loop detection requires a control flow analysis. For loop detection, we use a graphical
- representation, called a "program flow graph," which shows how control flows in the program. To obtain
- such a graph, we must partition the intermediate code into basic blocks. This requires identifying leader statements,
- which are defined as follows:
- 1. The first statement is a leader statement.
- 2. The target of a conditional or unconditional goto is a leader.
- 3. A statement that immediately follows a conditional goto is a leader.
- A basic block is a sequence of three-address statements that can be entered only at the beginning, and control ends
- after the execution of the last statement, without a halt or any possibility of branching, except at the end.
- 10.3.2 Algorithm to Partition Three-Address Code into Basic Blocks
- To partition three-address code into basic blocks, we must identify the leader statements in the three-address code
- and then include all the statements, starting from a leader, and up to, but not including, the next leader. The basic
- blocks into which the three-address code is partitioned constitute the nodes or vertices of the program flow graph. The
- edges in the flow graph are decided as follows. If B1 and B2 are the two blocks, then add an edge from B1 to B2 in the
- program flow graph, if the block B2 follows B1 in an execution sequence. The block B2 follows B1 in an execution
- sequence if and only if:
- 1. The first statement of block B2 immediately follows the last statement of block B1 in the
-    three-address code, and the last statement of block B1 is not an unconditional goto statement; or
- 2. The last statement of block B1 is either a conditional or unconditional goto statement, and the first
-    statement of block B2 is the target of the last statement of block B1.
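- The leader rules above translate directly into code. The following C sketch marks the leaders of a statement
- sequence; the three-address statement representation (a struct with an optional goto target index and a
- conditional flag) is a hypothetical stand-in for the book's quadruples.
- /* Hypothetical three-address statement: goto_target is the index of the
-    statement jumped to (or -1 if none), is_cond marks a conditional goto. */
- struct tac { int goto_target; int is_cond; };
- /* Mark the leaders of an n-statement sequence per the three rules above. */
- static void find_leaders(const struct tac *code, int n, int *leader)
- {
-     int i;
-     for (i = 0; i < n; i++) leader[i] = 0;
-     if (n > 0) leader[0] = 1;                  /* rule 1: first statement */
-     for (i = 0; i < n; i++) {
-         if (code[i].goto_target >= 0) {
-             leader[code[i].goto_target] = 1;   /* rule 2: goto target */
-             if (code[i].is_cond && i + 1 < n)
-                 leader[i + 1] = 1;             /* rule 3: after conditional goto */
-         }
-     }
-     /* Each basic block then runs from a leader up to, but not including,
-        the next leader (or to the end of the code). */
- }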
- For example, consider the following program fragment:
- int Fact(int x)
- {
-     int f = 1;
-     int i;
-     for (i = 2; i <= x; i++)
-         f = f * i;
-     return f;
- }
- The three-address-code representation for the program fragment above is:
- 1. f = 1
- 2. i = 2
- 3. if i <= x goto(8)
- 4. f = f * i
- 5. t1 = i + 1
- 6. i = t1
- 7. goto(3)
- 8. goto calling program
- The leader statements are:
- Statement number 1, because it is the first statement.
- Statement number 3, because it is the target of a goto.
- Statement number 4, because it immediately follows a conditional goto statement.
- Statement number 8, because it is a target of a conditional goto statement.
- Therefore, the basic blocks into which the above code can be partitioned are as follows, and the program flow graph is
- shown in Figure 10.1.
- Block B1: statements 1 and 2
- Block B2: statement 3
- Block B3: statements 4 through 7
- Block B4: statement 8
- Figure 10.1: Program flow graph.
- 10.3.3 Loop Detection
- A loop is a cycle in the flow graph that satisfies two properties:
- 1. It should have a single entry node, or header, so that it will be possible to move all of the loop
-    invariant computations to a unique place, called a "preheader," which is a block/node placed
-    outside the loop, just in front of the header.
- 2. It should be strongly connected; that is, it should be possible to go from any node of the loop to
-    any other node while staying within the loop. This is required if at least some part of the loop is
-    to be executed repeatedly.
- One or more loops/cycles exist in the program only if the flow graph contains one or more back edges.
- Therefore, we must identify any back edges in the flow graph.
- 10.3.4 Identification of the Back Edges
- To identify the back edges in the flow graph, we compute the dominators of every node of the program flow graph. A
- node a is a dominator of node b if every path starting at the initial node of the graph that reaches node b goes through
- a. For example, consider the flow graph in Figure 10.2. In this flow graph, the only dominator of node 3 (besides node 3
- itself) is node 1, because the paths reaching node 3 from node 1 do not all go through node 2.
- Figure 10.2: The flow graph back edges are identified by computing the dominators.
- Dominator (dom) relationships have the following properties:
- 1. They are reflexive; that is, every node dominates itself.
- 2. They are transitive; that is, if a dom b and b dom c, this implies a dom c.
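- The dominator sets themselves can be computed by a standard iterative method (not spelled out in the text): the
- initial node dominates only itself; every other node starts with the full node set, which is then repeatedly
- intersected over its predecessors. A small bitmask sketch in C, with a hypothetical pred[] encoding, follows:
- #include <stdint.h>
- #define NNODES 6                           /* illustrative size; assumes NNODES < 32 */
- /* pred[b] is a bitmask of the predecessors of node b; node 0 is the
-    initial node. dom[b] is returned as a bitmask of the dominators of b. */
- static void compute_dominators(const uint32_t pred[NNODES], uint32_t dom[NNODES])
- {
-     uint32_t all = (1u << NNODES) - 1;
-     int b, p, changed = 1;
-     dom[0] = 1u;                           /* initial node dominates only itself */
-     for (b = 1; b < NNODES; b++)
-         dom[b] = all;
-     while (changed) {
-         changed = 0;
-         for (b = 1; b < NNODES; b++) {
-             uint32_t d = all;
-             for (p = 0; p < NNODES; p++)   /* intersect over all predecessors */
-                 if (pred[b] & (1u << p))
-                     d &= dom[p];
-             d |= 1u << b;                  /* every node dominates itself */
-             if (d != dom[b]) { dom[b] = d; changed = 1; }
-         }
-     }
- }
- An edge a → b is then a back edge exactly when bit b is set in dom[a], that is, when the head dominates the tail.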
- 10.3.5 Reducible Flow Graphs
- Several code-optimization transformations are easy to perform on reducible flow graphs. A flow graph G is reducible if
- and only if we can partition the edges into two disjoint groups, forward edges and back edges, with the following two
- properties:
- 1. The forward edges form an acyclic graph in which every node can be reached from the initial
-    node of G.
- 2. The back edges consist only of edges whose heads dominate their tails.
- For example, consider the flow graph shown in Figure 10.3. This flow graph has no back edges, because no edge's
- head dominates the tail of that edge. Hence, it could have been a reducible graph if the entire graph had been acyclic.
- But that is not the case. Therefore, it is not a reducible flow graph.
- Figure 10.3: A flow graph with no back edges.
- After identifying the back edges, if any, the natural loop of every back edge must be identified. The natural loop of a
- back edge a → b is the set of all those nodes that can reach a without going through b, including node b itself.
- Therefore, to find a natural loop of the back edge n → d, we start with node n and add all the predecessors of node n
- to the loop. Then we add the predecessors of the nodes that were just added to the loop; and we continue this process
- until we reach node d. These nodes plus node d constitute the set of all those nodes that can reach node n without
- going through node d. This is the natural loop of the edge n → d. Therefore, the algorithm for detecting the natural loop
- of a back edge is:
- Input: back edge n → d.
- Output: set loop, which is the set of nodes forming the natural
- loop of the back edge n → d.
- main()
- {
-     loop = { d };       /* initialize by adding node d to the set loop */
-     insert(n);          /* call the procedure insert with the node n */
- }
- procedure insert(m)
- {
-     if m is not in loop then
-     {
-         loop = loop ∪ { m }
-         for every predecessor p of m do
-             insert(p);
-     }
- }
- For example, in the flow graph shown in Figure 10.1, the back edge is B3 → B2, and its natural loop is comprised of
- the blocks B2 and B3.
- After the natural loops of the back edges are identified, the next task is to identify the loop invariant computations. A
- three-address statement x = y op z that exists in a basic block B (a part of the loop) is a loop invariant statement if
- all possible definitions of y and z that reach this statement are outside the loop, or if y and z are constants,
- because then the computation y op z will be the same each time the statement is encountered in the loop. Hence, to
- decide whether the statement x = y op z is loop invariant or not, we must compute the u-d (use-definition) chaining
- information. The u-d chaining information is computed by doing a global data flow analysis of the flow graph. All of the
- definitions that are capable of reaching the point immediately before the start of basic block B are computed, and we
- call the set of all such definitions IN(B). The set of all the definitions capable of reaching the point immediately after
- the last statement of block B is called OUT(B). To compute IN(B) and OUT(B) for every block B, we also need GEN(B)
- and KILL(B), which are defined as:
- GEN(B): The set of all the definitions generated in block B.
- KILL(B): The set of all the definitions outside block B that define the same variables as are defined in
- block B.
- Consider the flow graph in Figure 10.4.
- The GEN and KILL sets for the basic blocks are as shown in Table 10.1.
- Table 10.1: GEN and KILL sets for Figure 10.4 Flow Graph
- Block GEN KILL
- B1 {1,2} {6,10,11}
- B2 {3,4} {5,8}
- B3 {5} {4,8}
- B4 {6,7} {2,9,11}
- B5 {8,9} {4,5,7}
- B6 {10,11} {1,2,6}
- Figure 10.4: Flow graph with GEN and KILL block sets.
- IN(B) and OUT(B) are defined by the following set of equations, which are called "data flow equations":
- IN(B) = ∪ OUT(P), taken over all predecessors P of B
- OUT(B) = (IN(B) − KILL(B)) ∪ GEN(B)
- The next step, therefore, is to solve these equations. If there are n nodes, there will be 2n equations in 2n unknowns.
- The solution to these equations is not generally unique. This is because we may have a situation like that shown in
- Figure 10.5, where a block B is a predecessor of itself.
- Figure 10.5: Nonunique solution to a data flow equation, where B is a predecessor of itself.
- If there is a solution to the data flow equations for block B, and if the solution is IN(B) = IN0 and OUT(B) = OUT0, then
- IN0 ∪ {d} and OUT0 ∪ {d}, where d is any definition not in IN0, OUT0, or KILL(B), also satisfy the equations. This is
- because if we take OUT0 ∪ {d} as the value of OUT(B), then since B is one of the predecessors of itself, according to
- IN(B) = ∪ OUT(P), d gets added to IN(B), because d is not in KILL(B). Hence, we get IN(B) = IN0 ∪ {d}. And according
- to OUT(B) = (IN(B) − KILL(B)) ∪ GEN(B), OUT(B) = OUT0 ∪ {d} also gets satisfied. Therefore, IN0, OUT0 is one
- solution, whereas IN0 ∪ {d}, OUT0 ∪ {d} is another; there is no unique solution. What we are interested in is
- finding the smallest solution, that is, the smallest IN(B) and OUT(B) for every block B, which consist of values that are
- in all solutions. For example, since IN0 is contained in IN0 ∪ {d}, and OUT0 in OUT0 ∪ {d}, IN0, OUT0 is the smaller
- solution. And this is what we want, because the smallest IN(B) turns out to be the set of all definitions reaching the
- point just before the beginning of B. The algorithm for computing the smallest IN(B) and OUT(B) is as follows:
- 1. For each block B do
-    {
-        IN(B) = φ
-        OUT(B) = GEN(B)
-    }
- 2. flag = true
- 3. while (flag) do
-    {
-        flag = false
-        for each block B do
-        {
-            INnew(B) = φ
-            for each predecessor P of B
-                INnew(B) = INnew(B) ∪ OUT(P)
-            if INnew(B) ≠ IN(B) then
-            {
-                flag = true
-                IN(B) = INnew(B)
-                OUT(B) = (IN(B) − KILL(B)) ∪ GEN(B)
-            }
-        }
-    }
- Initially, we take IN(B) for every block to be an empty set, and we take OUT(B) to be GEN(B), and we compute
- INnew(B). If it is different from IN(B), we compute a new OUT(B) and go for the next iteration. This is continued until
- IN(B) comes out to be the same for every B in the previous and current iterations. A small implementation sketch is
- given below.
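- As a rough illustration, the iteration above can be coded with bitmasks over definition numbers; the block count and
- the pred[] encoding are hypothetical stand-ins, and the gen/kill values would come from data like Table 10.1.
- #include <stdint.h>
- #define NBLOCKS 6
- /* gen[b], kill[b]: bitmasks of definition numbers (bit i = definition i+1).
-    pred[b]: bitmask of the predecessors of block b. */
- static void reaching_defs(const uint32_t gen[NBLOCKS],
-                           const uint32_t kill[NBLOCKS],
-                           const uint32_t pred[NBLOCKS],
-                           uint32_t in[NBLOCKS], uint32_t out[NBLOCKS])
- {
-     int b, p, flag = 1;
-     for (b = 0; b < NBLOCKS; b++) { in[b] = 0; out[b] = gen[b]; }
-     while (flag) {
-         flag = 0;
-         for (b = 0; b < NBLOCKS; b++) {
-             uint32_t in_new = 0;
-             for (p = 0; p < NBLOCKS; p++)        /* union over predecessors */
-                 if (pred[b] & (1u << p)) in_new |= out[p];
-             if (in_new != in[b]) {
-                 flag = 1;
-                 in[b] = in_new;
-                 out[b] = (in[b] & ~kill[b]) | gen[b];
-             }
-         }
-     }
- }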
- For example, for the flow graph shown in Figure 10.4, the IN and OUT values of the blocks are computed using the
- above algorithm, as shown in Tables 10.2–10.6.
- Table 10.2: Initial IN and OUT Values for the Figure 10.4 Flow Graph
- Block    IN    OUT
- B1       φ     {1,2}
- B2       φ     {3,4}
- B3       φ     {5}
- B4       φ     {6,7}
- B5       φ     {8,9}
- B6       φ     {10,11}
- Table 10.3: First Iteration of the IN and OUT Values
- Block    IN                    OUT
- B1       φ                     {1,2}
- B2       {1,2,6,7}             {1,2,3,4,6,7}
- B3       {3,4,8,9}             {3,5,9}
- B4       {3,4,5}               {3,4,5,6,7}
- B5       {5}                   {8,9}
- B6       {6,7}                 {7,10,11}
- Table 10.4: Second Iteration of the IN and OUT Values
- Block    IN                    OUT
- B1       φ                     {1,2}
- B2       {1,2,3,4,5,6,7}       {1,2,3,4,6,7}
- B3       {1,2,3,4,6,7,8,9}     {1,2,3,5,6,7,9}
- B4       {1,2,3,4,5,6,7,9}     {1,3,4,5,6,7}
- B5       {3,5,9}               {3,8,9}
- B6       {3,4,5,6,7}           {3,4,5,7,10,11}
- Table 10.5: Third Iteration of the IN and OUT Values
- Block    IN                    OUT
- B1       φ                     {1,2}
- B2       {1,2,3,4,5,6,7}       {1,2,3,4,6,7}
- B3       {1,2,3,4,6,7,8,9}     {1,2,3,5,6,7,9}
- B4       {1,2,3,4,5,6,7,9}     {1,3,4,5,6,7}
- B5       {1,2,3,5,6,7,9}       {1,2,3,6,8,9}
- B6       {1,3,4,5,6,7}         {1,3,4,5,7,10,11}
- Table 10.6: Fourth Iteration of the IN and OUT Values
- Block    IN                    OUT
- B1       φ                     {1,2}
- B2       {1,2,3,4,5,6,7}       {1,2,3,4,6,7}
- B3       {1,2,3,4,6,7,8,9}     {1,2,3,5,6,7,9}
- B4       {1,2,3,4,5,6,7,9}     {1,3,4,5,6,7}
- B5       {1,2,3,5,6,7,9}       {1,2,3,6,8,9}
- B6       {1,3,4,5,6,7}         {1,3,4,5,7,10,11}
- The next step is to compute the u-d chains from the reaching definitions information, as follows.
- If the use of A in block B is preceded by a definition of A in B, then the u-d chain of A contains only the last definition
- prior to this use of A. If the use of A in block B is not preceded by any definition of A in B, then the u-d chain for this
- use consists of all the definitions of A in IN(B).
- For example, in the flow graph for which IN and OUT were computed in Tables 10.2–10.6, the use of a in definition 4
- of block B2 is preceded by definition 3, which is a definition of a. Hence, the u-d chain for this use of a contains only
- definition 3. But the use of b in B2 is not preceded by any definition of b in B2. Therefore, the u-d chain for this use of
- b will be {1}, because this is the only definition of b in IN(B2).
- The u-d chain information is used to identify the loop invariant computations. The next step is to perform the code
- motion, which moves a loop invariant statement to a newly created node, called the "preheader," whose only
- successor is the header of the loop. All the predecessors of the header that lie outside the loop become predecessors
- of the preheader.
- But sometimes the movement of a loop invariant statement to the preheader is not possible because such a move
- would alter the semantics of the program. For example, if a loop invariant statement exists in a basic block that is not a
- dominator of all the exits of the loop (where an exit of the loop is the node whose successor is outside the loop), then
- moving the loop invariant statement in the preheader may change the semantics of the program. Therefore, before
- moving a loop invariant statement to the preheader, we must check whether the code motion is legal or not. Consider
- the flow graph shown in Figure 10.6.
- Figure 10.6: A flow graph containing a loop invariant statement.
- In the flow graph shown in Figure 10.6, x = 2 is the loop invariant. But since it occurs in B3, which is not the dominator
- of the exit of loop, if we move it to the preheader, as shown in Figure 10.7, a value of two will always get assigned to y
- in B5; whereas in the original program, y in B5 may get value one as well as two.
- Figure 10.7: Moving a loop invariant statement changes the semantics of the program.
- After Moving x = 2 to the Preheader
- In the flow graph shown in Figure 10.7, if x is not used outside the loop, then the statement x = 2 can be moved to the
- preheader. Therefore, for a code motion to be legal, the following conditions must be met:
- 1. The block in which the loop invariant statement occurs should be a dominator of all exits of the
-    loop, or the name assigned by the statement should not be used outside the loop.
- 2. We cannot move a loop invariant statement that assigns to A into the preheader if there is another
-    statement in the loop that assigns to A. For example, consider the flow graph shown in Figure 10.8.
- Figure 10.8: Moving the preheader changes the meaning of the program.
-    Even though the statement x = 3 in B2 satisfies condition (1), moving it to the preheader will
-    change the meaning of the program, because if x = 3 is moved to the preheader, then the value
-    that will be assigned to y in B5 will be two if the execution path is B1–B2–B3–B4–B2–B4–B5,
-    whereas for the same execution path, the original program assigns a three to y in B5.
- 3. The move is illegal if A is used in the loop, and A is reached by any definition of A other than the
-    statement to be moved. For example, consider the flow graph shown in Figure 10.9.
- Figure 10.9: Moving a value to the preheader changes the original meaning of the program.
- Even though x is not used outside the loop, the statement x = 2 in the block B2 cannot be moved to the preheader,
- because the use of x in B4 is also reached by the definition x = 1 in B1. If we were to move x = 2 to the preheader,
- then the value that would get assigned to a in B4 would always be a two, which is not the case in the original program.
- 10.4 ELIMINATING INDUCTION VARIABLES
- We define basic induction variables of a loop as those names whose only assignments within the loop are of the form I
- = I ± C, where C is a constant or a name whose value does not change within the loop. A basic induction variable may
- or may not form an arithmetic progression at the loop header.
- For example, consider the flow graph shown in Figure 10.10. In the loop formed by B2, I is a basic induction variable.
- Figure 10.10: Flow graph where I is a basic induction variable.
- We then define an induction variable of loop L as either a basic induction variable or a name J for which there is a
- basic induction variable I, such that each time J is assigned in L, J's value is some linear function of the value of I.
- That is, the value of J in L should be C1 * I + C2, where C1 and C2 could be functions of both constants and loop
- invariant names. For example, in loop L, I is a basic induction variable; and T1 is also an induction variable, because
- the only assignment of T1 in the loop assigns a value to T1 that is a linear function of I, computed as 4 * I.
- Algorithm for Detecting and Eliminating Induction Variables
- An algorithm exists that will detect and eliminate induction variables. Its method is as follows:
- 1. Find all of the basic induction variables by scanning the statements of loop L.
- 2. Find any additional induction variables, and for each such additional induction variable A, find the
-    family of some basic induction variable B to which A belongs. (If the value of A at the point of
-    assignment is expressed as C1 * B + C2, then A is said to belong to the family of basic induction
-    variable B.) Specifically, we search for names A with single assignments to A within loop L that
-    have one of the following forms:
-    A = C * B, A = B * C, A = B + C, A = C + B, or A = B − C
-    where C is a loop constant, and B is an induction variable, basic or otherwise. If B is basic, then A
-    is in the family of B. If B is not basic, but instead belongs to the family of D, then there are two
-    additional requirements to be satisfied:
-    a. There must be no assignment to D between the lone point of assignment to B
-       in L and the assignment to A.
-    b. No definition of B outside of L may reach A.
- 3. Consider each basic induction variable B in turn. For every induction variable A in the family of B:
-    a. Create a new name, temp.
-    b. Replace the assignment to A in the loop with A = temp.
-    c. Set temp to C1 * B + C2 at the end of the preheader by adding the statements
-       temp = C1 * B followed by temp = temp + C2.
-    d. Immediately after each assignment B = B + D, where D is a loop invariant,
-       append temp = temp + C1 * D. If D is a loop invariant name, and if C1 ≠ 1, create
-       a new loop invariant name for C1 * D, and use that name in the appended statement.
-    e. For each basic induction variable B whose only uses are to compute other
-       induction variables in its family and in conditional branches, take some A in B's
-       family, preferably one whose function expresses its value simply, and replace
-       each test of the form if B relop X goto Y by a test on A: compute C1 * X + C2 in a
-       new loop invariant name and test A against it. Then delete all assignments to B
-       from the loop, as they will now be useless.
-    f. If there is no assignment to temp between the introduced statement A = temp
-       (step 3b) and the only use of A, then replace all uses of A by temp and delete
-       the statement A = temp.
- In the flow graph shown in Figure 10.10, we see that I is a basic induction variable, and T1 is an additional induction
- variable in the family of I, because the value of T1 at the point of assignment in the loop is expressed as T1 = 4 * I.
- Therefore, according to step 3b, we replace T1 = 4 * I by T1 = temp. And according to step 3c, we add temp = 4 * I
- to the preheader. We then append the statement temp = temp + 4 after statement (10) of Figure 10.10, as per step
- 3d. And according to step 3e, we replace the statement if I ≤ 20 goto B2 by the test if temp ≤ temp2 goto B2, where
- temp2 = 80 is computed as a loop invariant.
- The results of these modifications are shown in Figure 10.11.
- Figure 10.11: Modified flow graph.
- By step 3f, we replace T1 by temp. And by copy propagation, temp = 4 * I in the preheader can be replaced by temp
- = 4, and the statement I = 1 can be eliminated. In B1, the statement if temp ≤ temp2 goto B2 can be replaced by if
- temp ≤ 80 goto B2, and we can eliminate temp2 = 80, as shown in Figure 10.12.
- Figure 10.12: Flow graph preheader modifications.
- 10.5 ELIMINATING LOCAL COMMON SUBEXPRESSIONS
- The first step in eliminating local common subexpressions is to detect the common subexpression in a basic block.
- The common subexpressions in a basic block can be automatically detected if we construct a directed acyclic graph
- (DAG).
- DAG Construction
- For constructing a basic block DAG, we make use of the function node(id), which returns the most recently created
- node associated with id. For every three-address statement x = y op z, x = op y, or x = y in the block, we do the
- following:
- 1. If node(y) is undefined, create a leaf labeled y, and let node(y) be this node. If node(z) is
-    undefined, create a leaf labeled z, and let that leaf be node(z). (If the statement is of the form
-    x = op y or x = y, then only node(y) is needed.)
- 2. If a node exists that is labeled op, whose left child is node(y) and whose right child is node(z)
-    (this catches the common subexpressions), then return this node; otherwise, create such a node
-    and return it. If the statement is of the form x = op y, then check whether a node exists that is
-    labeled op whose only child is node(y); return this node, or create such a node and return it. If
-    the statement is of the form x = y, simply return node(y). Let the returned node be n.
- 3. Append x to the list of identifiers for the node n returned in step 2. Delete x from the list of
-    attached identifiers for node(x), and set node(x) to be node n.
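- A rough sketch of this construction in C follows, using a linear search in place of a real hash table; the node
- representation and the get_node() helper are invented for illustration.
- #include <string.h>
- #define MAXNODES 256
- /* A DAG node: an operator (or 0 for a leaf), children, and one label.
-    A real implementation would keep a list of attached identifiers. */
- struct dnode { char op; int left, right; char label[16]; };
- static struct dnode dag[MAXNODES];
- static int nnodes = 0;
- /* Return an existing node labeled op with the given children (a common
-    subexpression), or create it. Children are node indices, -1 if absent. */
- static int get_node(char op, int left, int right, const char *label)
- {
-     int i;
-     for (i = 0; i < nnodes; i++)
-         if (dag[i].op == op && dag[i].left == left && dag[i].right == right
-             && (op || strcmp(dag[i].label, label) == 0))
-             return i;                  /* reuse: common subexpression */
-     dag[nnodes].op = op;
-     dag[nnodes].left = left;
-     dag[nnodes].right = right;
-     strncpy(dag[nnodes].label, label ? label : "", 15);
-     dag[nnodes].label[15] = '\0';
-     return nnodes++;
- }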
- Therefore, we first go for a DAG representation of the basic block. And if the interior nodes in the DAG have more than
- one label, then those nodes of the DAG represent the common subexpressions in the basic block. After detecting
- these common subexpressions, we eliminate them from the basic block. The following example shows the elimination
- of local common subexpressions, and the DAG is shown in Figure 10.13.
- 1. S1 := 4 * I
- 2. S2 := addr(A) − 4
- 3. S3 := S2[S1]
- 4. S4 := 4 * I
- 5. S5 := addr(B) − 4
- 6. S6 := S5[S4]
- 7. S7 := S3 * S6
- 8. S8 := PROD + S7
- 9. PROD := S8
- 10. S9 := I + 1
- 11. I := S9
- 12. if I ≤ 20 goto (1)
- Figure 10.13: DAG representation of a basic block.
- In Figure 10.13, PROD0 indicates the initial value of PROD, and I0 indicates the initial value of I. We see that the
- same value is assigned to S8 and PROD. Similarly, the value assigned to S9 is the same as I. And the values
- computed for S1 and S4 are the same; hence, we can eliminate these common subexpressions by selecting one of
- the attached identifiers (one that is needed outside the block). We assume that none of the temporaries is needed
- outside the block. The rewritten block will be:
- 1. S1 := 4 * I
- 2. S2 := addr(A) − 4
- 3. S3 := S2[S1]
- 4. S5 := addr(B) − 4
- 5. S6 := S5[S1]
- 6. S7 := S3 * S6
- 7. PROD := PROD + S7
- 8. I := I + 1
- 9. if I ≤ 20 goto (1)
- 10.6 ELIMINATING GLOBAL COMMON SUBEXPRESSIONS
- Global common subexpressions are expressions that compute the same value but in different basic blocks. To detect
- such expressions, we need to compute available expressions.
- 10.6.1 Available Expressions
- An expression x op y is available at a point p if every path from the initial node of the flow graph to p
- evaluates x op y, and if, after the last such evaluation and prior to reaching p, there are no subsequent assignments
- to x or y. To eliminate global common subexpressions, we need to compute the set of all the expressions available at
- the point just before the start of every block. This in turn requires computing the set of all the expressions available
- at the point just after the end of every block. We call these sets IN(b) and OUT(b), respectively. The computation of
- IN(b) and OUT(b) requires computing the set of all expressions generated by the basic block and the set of all
- expressions killed by the basic block:
- A block kills an expression x op y if it assigns to x or y and does not subsequently recompute x op
- y.
- A block generates an expression x op y if it evaluates x op y and subsequently does not redefine x or
- y.
- To compute the available expressions, we solve the following equations, again obtaining the smallest solution:
- IN(b) = ∩ OUT(p), taken over all predecessors p of b (with IN(b1) = φ for the initial block b1)
- OUT(b) = (IN(b) − KILL(b)) ∪ GEN(b)
- The algorithm for computing the smallest IN(b) and OUT(b) is given below, where b1 is the initial block, and U is a
- "universal" set of all expressions appearing on the right of one or more statements of the program.
- 1. IN(b1) = φ
-    OUT(b1) = GEN(b1)
- 2. for (i = 2; i <= n; i++)
-    {
-        IN(bi) = U
-        OUT(bi) = U − KILL(bi)
-    }
- 3. flag = true
- 4. while (flag) do
-    {
-        flag = false
-        for (i = 2; i <= n; i++)
-        {
-            INnew(bi) = U
-            for each predecessor p of bi
-                INnew(bi) = INnew(bi) ∩ OUT(p)
-            if INnew(bi) ≠ IN(bi) then
-            {
-                flag = true
-                IN(bi) = INnew(bi)
-                OUT(bi) = (IN(bi) − KILL(bi)) ∪ GEN(bi)
-            }
-        }
-    }
- After computing IN(b) and OUT(b), eliminating the global common subexpressions is done as follows. For every
- statement s of the form x = y op z, such that y op z is available at the beginning of the block containing s, and neither
- y nor z is defined prior to the statement x = y op z in that block, do:
- 1. Find all the definitions reaching the block containing s that have y op z on the right.
- 2. Create a new name, temp.
- 3. Replace each statement U = y op z found in step 1 by the pair of statements temp = y op z and U = temp.
- 4. Replace the statement x = y op z in the block containing s by x = temp.
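- As a tiny illustration of these steps (the block names and the expression are made up), suppose y + z is computed
- in block B1, B1 is the only predecessor of B2, and neither y nor z is redefined in between; then:
- Before the transformation:
-     B1: a = y + z
-     B2: x = y + z
- After the transformation:
-     B1: temp = y + z
-         a = temp
-     B2: x = temp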
- 10.7 LOOP UNROLLING
- Loop unrolling involves replicating the body of the loop to reduce the required number of tests if the number of
- iterations are constant. For example consider the following loop:
- I = 1
- while (I <= 100)
- {
- x[I] = 0;
- I++;
- }
- In this case, the test I <= 100 will be performed 100 times. But if the body of the loop is replicated, then the number of
- times this test will need to be performed will be 50. After replication of the body, the loop will be:
- I = 1
- while(I<= 100)
- {
- x[I] = 0;
- I++;
- x[I] = 0;
- I++;
- }
- It is possible to choose any divisor of the number of times the loop is executed, and the body will be replicated that
- many times. Unrolling once, that is, replicating the body to form two copies, saves 50% of the maximum possible
- number of tests.
- 10.8 LOOP JAMMING
- Loop jamming is a technique that merges the bodies of two loops if the two loops have the same number of iterations
- and they use the same indices. This eliminates the test of one loop. For example, consider the following loop:
- {
- for (I = 0; I < 10; I++)
- for (J = 0; J < 10; J++)
- X[I,J] = 0;
- for (I = 0; I < 10; I++)
- X[I,I] = 1;
- }
- Here, the bodies of the loops on I can be concatenated. The result of loop jamming will be:
- {
- for (I = 0; I < 10; I++)
- {
- for (J = 0; J < 10; J++)
- X[I,J] = 0;
- X[I,I] = 1;
- }
- }
- The following conditions are sufficient for making loop jamming legal:
- 1. No quantity is computed by the second loop at iteration I if it is computed by the first loop at
-    iteration J ≥ I.
- 2. If a value is computed by the first loop at iteration J ≥ I, then this value should not be used by the
-    second loop at iteration I.
- Chapter 11: Code Generation
- 11.1 AN INTRODUCTION TO CODE GENERATION
- Code generation is the last phase in the compilation process. Because it is a machine-dependent phase, it is not
- possible to generate good code without considering the details of the particular machine for which the compiler is
- expected to generate code. Even so, a carefully selected code-generation algorithm can produce code that is twice
- as fast as code generated by an ill-considered code-generation algorithm.
- In this chapter, we first discuss straightforward code generation from a sequence of three-address statements. This is
- followed by a discussion of the code-generation algorithm that takes into account the flow of control structures in the
- program when assigning registers to names. Then we will look at a code-generation algorithm that is capable of
- generating reasonably good code from a basic block. Finally, various machine-dependent optimizations that are
- capable of improving the efficiency of object code are discussed. Throughout our discussion, we assume that the input
- to the code-generation algorithm is a sequence of three-address statements partitioned into basic blocks.
- 11.2 PROBLEMS THAT HINDER GOOD CODE GENERATION
- There are three main difficulties that we face when attempting to generate efficient object code, namely:
- 1. Selection of the most-efficient instructions to represent the computation specified by the
-    three-address statement;
- 2. Deciding on a computation order that leads to the generation of the more-efficient object code;
-    and
- 3. Deciding which registers to use.
- Selecting the Most-Efficient Instructions to Represent the Computation Specified by the
- Three-Address Statement
- Many machines allow certain computations to be done in more than one way. For example, if a machine permits an
- instruction AOS for incrementing the contents of a storage location directly, then for a three-address statement a = a +
- 1, it is possible to generate the instruction AOS a, rather than a sequence of instructions like the following:
- MOVE a, R
- ADD #1, R
- MOVE R, a
- Now, deciding which instruction sequence is better is the problem. This decision requires extensive knowledge
- about the context in which these three-address statements will appear.
- Deciding on the Computation Order that Will Lead to the Generation of More-Efficient
- Object Code
- Some computation orders require fewer registers to hold intermediate results than others. Now, deciding the best
- order is very difficult. For example, consider the basic block:
-     t1 = a + b
-     t2 = c + d
-     t3 = e − t2
-     t4 = t1 − t3
- If the order of computation used is the one given in the basic block t1-t2-t3-t4, then the number of registers required for
- holding the intermediate result is more than when the order t2-t3-t1-t4 is used.
- Deciding on Registers
- Deciding which register should handle the computation is another problem that stands in the way of good code
- generation. The problem is further complicated when a machine requires register-pairs for some operands and results.
- 11.3 THE MACHINE MODEL
- Being a machine-dependent phase, we will need to describe some of the features of a typical computer in order to
- discuss the various issues involved in code generation. For this purpose, we describe a hypothetical machine model,
- as follows.
- We assume that the machine is byte-addressable with two bytes per word, having 2^16 bytes and eight
- general-purpose registers, R0 to R7, each capable of holding a 16-bit quantity. The instruction format is "op source,
- destination", with a four-bit opcode; the source and destination are each six-bit fields. Since a six-bit field is
- not capable of holding a memory address (a memory address is 16 bits), when the source and destination are
- memory addresses, these six-bit fields hold certain bit patterns that specify that the words following the instruction
- contain the memory addresses used as the source and destination operands, respectively. The following addressing
- modes are assumed to be supported by the machine model:
- 1. r (register addressing)
- 2. *r (indirect register)
- 3. X (absolute address)
- 4. #data (immediate)
- 5. X(r) (indexed address)
- 6. *X(r) (indirect indexed address)
- We assume that opcodes like the ones listed below are available:
- MOV (for moving source to destination),
- ADD (for adding source to destination), and
- SUB (for subtracting source from destination), and so on.
- The cost of the instruction is considered to be its length, because generating a shorter instruction not only reduces the
- storage requirement of the object code, but it also reduces the time taken to perform the operation. This is because
- most machines spend more time fetching words from memory than they spend in executing the instruction. Hence, by
- minimizing the instruction length, we minimize the time taken to perform the instruction, as well.
- For example, the length of the instruction MOV R0, R1 is one memory word, because a three-bit code is enough for
- uniquely identifying each of the registers. Therefore, the six-bit fields for the source and destination operands can
- easily hold the three-bit register codes, as shown in Table 11.1.
- Table 11.1: Instruction Layout for MOV R0, R1 (one word)
- MOV | R0 | R1
- Similarly, the length of the instruction MOV R0, M is two memory words, because the destination operand is a
- memory address, so it occupies the word following the instruction, as shown in Table 11.2.
- Table 11.2: Instruction Layout for MOV R0, M (two words)
- MOV | R0 | bit pattern
- M
- Similarly, the length of the instruction MOV M1, M2 is three memory words, because the source and the destination
- operands, being memory addresses, occupy the two words following the instruction, as shown in Table 11.3.
- Table 11.3: Instruction Layout for MOV M1, M2 (three words)
- MOV | bit pattern | bit pattern
- M1
- M2
- For example, consider the three-address statement a = b + c. We can generate several different instruction
- sequences for this statement, depending upon where the values of the operands b and c can be found.
- 1. If the values of b and c are found in the memory locations of the same name, then either of the
-    following instruction sequences can be generated:
-    MOV b, R0
-    ADD c, R0
-    MOV R0, a      (length = six words)
-    or
-    MOV b, a
-    ADD c, a       (length = six words)
- 2. If the addresses of a, b, and c are assumed to be in registers R0, R1, and R2, respectively, then
-    the following instruction sequence can be generated:
-    MOV *R1, *R0
-    ADD *R2, *R0   (length = two words)
- 3. If the values of b and c are assumed to be in registers R1 and R2, respectively, then the following
-    instruction sequence can be generated:
-    ADD R2, R1
-    MOV R1, a      (length = three words)
- Therefore, we conclude that for generating good code, we must utilize the addressing capabilities of the machine
- efficiently. And this will be possible if we keep the l-value or the r-value of a name in a register if it is going to be
- used in the future.
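- Under the cost model above (one word for the instruction itself, plus one word for each operand that cannot be
- encoded in its six-bit field), the length of an instruction can be sketched as follows; the operand classification is a
- made-up enum, not part of the book's machine description:
- /* Hypothetical operand classes for the machine model above. */
- enum operand { REG, IND_REG, ABSOLUTE, IMMEDIATE, INDEXED, IND_INDEXED };
- /* Register forms fit in the six-bit field; absolute addresses, immediates,
-    and indexed forms each occupy an extra word after the instruction. */
- static int operand_words(enum operand m)
- {
-     return (m == REG || m == IND_REG) ? 0 : 1;
- }
- /* Cost (in memory words) of one instruction: opcode word + operand words. */
- static int instr_words(enum operand src, enum operand dst)
- {
-     return 1 + operand_words(src) + operand_words(dst);
- }
- With this, instr_words(REG, REG) gives one word for MOV R0, R1, instr_words(REG, ABSOLUTE) gives two words
- for MOV R0, M, and instr_words(ABSOLUTE, ABSOLUTE) gives three words for MOV M1, M2, matching Tables
- 11.1 through 11.3.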
- 11.4 STRAIGHTFORWARD CODE GENERATION
- Given a sequence of three-address statements partitioned into basic blocks, straightforward code generation
- involves generating code for each three-address statement in turn, taking advantage of any operands of the
- three-address statements that are already in registers, and leaving the computed result in a register as long as
- possible.
- We store it only if the register is needed for another computation or just before a procedure call, jump, or labeled
- statement, such as at the end of a basic block. The reason for this is that after leaving a basic block, we may go to
- several different blocks, or we may go to one particular block that can be reached from several others. In either case,
- we cannot assume that a datum used by a block appears in the same register, no matter how the program's control
- reached that block. Hence, to avoid possible error, our code-generation strategy stores everything across the basic
- block boundaries.
- When generating code by using the above strategy, we need to keep track of what is currently in
- each register. For this, we maintain what is called a "register descriptor," which is simply a pointer
- to a list that contains information about what is currently in each of the registers. Initially, all of the
- registers are empty.
- We also need to keep track of the locations for each name—where the current value of the name can be found at run
- time. For this, we maintain what is called an "address descriptor" for each name in the block. This information can be
- stored in the symbol table.
- We also need a location to perform the computation specified by each of the three-address statements. For this, we
- make use of the function getreg(). When called, getreg() returns a location in which the computation specified by a
- three-address statement should be performed. For example, if x = y op z is to be performed, getreg() returns a
- location L where the computation y op z should be carried out; and if possible, it returns a register.
- Algorithm for the Function Getreg()
- What follows is an algorithm for storing and returning the register locations for three-address statements by using the
- function getreg().
- {
-     For every three-address statement of the form x = y op z in the basic block do
-     {
-         1. Call getreg() to obtain the location L in which the computation y op z should be performed.
-            /* This requires passing the three-address statement x = y op z as a parameter to getreg(),
-               which can be done by passing the index of this statement in the quadruple array. */
-         2. Obtain the current location of the operand y by consulting its address descriptor; if the value
-            of y is currently both in a memory location as well as in a register, then prefer the register. If
-            the value of y is currently not available in L, then generate an instruction MOV y, L (where y
-            is assumed to represent the current location of y).
-         3. Generate the instruction OP z, L, and update the address descriptor of x to indicate that x is
-            now available in L; and if L is a register, then update its descriptor to indicate that it will
-            contain the run-time value of x.
-         4. If the current values of y and/or z are in registers, we have no further uses for them, and they
-            are not live at the end of the block, then alter the register descriptors to indicate that, after
-            the execution of the statement x = y op z, those registers will no longer contain y and/or z.
-     }
-     Store all the results.
- }
- The function getreg(), when called upon to return a location where the computation specified by the three-address
- statement x = y op z should be performed, returns a location L as follows:
- 1. First, it searches for a register that already contains the name y. If such a register exists, and if y
-    has no further use after the execution of x = y op z, and if it is not live at the end of the block and
-    holds the value of no other name, then return that register for L.
- 2. Otherwise, getreg() searches for an empty register; and if an empty register is available, then it
-    returns it for L.
- 3. If no empty register exists, and if x has further use in the block, or op is an operator, such as
-    indexing, that requires a register, then getreg() finds a suitable, occupied register. The register is
-    emptied by storing its value in the proper memory location M, the address descriptor is updated,
-    and the register is returned for L. (A least-recently used strategy can be used to find a suitable,
-    occupied register to be emptied.)
- 4. If x is not used in the block, or no suitable, occupied register can be found, getreg() selects the
-    memory location of x and returns it for L.
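- A compressed C sketch of this selection logic follows; the descriptor arrays, the EMPTY marker, and the spill stub
- are hypothetical stand-ins for the bookkeeping described above.
- #define NREGS 8
- #define EMPTY (-1)
- /* Hypothetical descriptors; next_use[] and live[] would be filled in by
-    earlier passes over the block. */
- static int reg_holds[NREGS];    /* name held by each register, EMPTY if none */
- static int next_use[128];       /* next-use info for each name, EMPTY if none */
- static int live[128];           /* liveness at the end of the block */
- static void spill(int r) { (void)r; /* would emit MOV Rr, M and update descriptors */ }
- /* Returns a register number for L, or EMPTY to signal "use x's memory". */
- static int getreg(int x, int y, int needs_reg)
- {
-     int r;
-     /* Rule 1: reuse the register holding y when y dies here. */
-     for (r = 0; r < NREGS; r++)
-         if (reg_holds[r] == y && next_use[y] == EMPTY && !live[y])
-             return r;
-     /* Rule 2: any empty register. */
-     for (r = 0; r < NREGS; r++)
-         if (reg_holds[r] == EMPTY)
-             return r;
-     /* Rule 3: spill a suitable occupied register (LRU in a real compiler). */
-     if (next_use[x] != EMPTY || needs_reg) {
-         spill(0);
-         reg_holds[0] = EMPTY;
-         return 0;
-     }
-     /* Rule 4: fall back to x's memory location. */
-     return EMPTY;
- }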
- EXAMPLE 11.1
- Consider the expression x = (a + b) − ((c + d) − e).
- The three-address code for this is:
-     t1 = a + b
-     t2 = c + d
-     t3 = t2 − e
-     x = t1 − t3
- Applying the algorithm above results in Table 11.4.
- Table 11.4: Computation for the Expression x = (a + b) − ((c + d) − e)
- Statement     L    Instructions Generated    Register Descriptor    Address Descriptor
-                                              All registers empty
- t1 = a + b    R0   MOV a, R0                 R0 will hold t1        t1 is in R0
-                    ADD b, R0
- t2 = c + d    R1   MOV c, R1                 R1 will hold t2        t2 is in R1
-                    ADD d, R1
- t3 = t2 − e   R1   SUB e, R1                 R1 will hold t3        t3 is in R1
- x = t1 − t3   R0   SUB R1, R0                R0 will hold x         x is in R0
-                    MOV R0, x                                        x is in R0 and memory
- The algorithm makes use of the next-use information of each name in order to make more-informed decisions
- regarding register allocation. Therefore, it is required to compute the next-use information. If:
- A statement at the index i in a block assigns a value to name x,
- And if a statement at the index j in the same block uses x as an operand,
- And if the path from the statement at index i to the statement at index j is a path without any
- intervening assignment to name x, then
- we say that the value of x computed by the statement at index i is used in the statement at index j. Hence, the next use
- of the name x in the statement i is statement j. For each three-address statement i, we need to compute information
- about those three-address statements in the block that are the next uses of the names coming in statement i. This
- requires the backward scanning of the basic block, which will allow us to attach to every statement i under
- consideration the information about those statements that are the next uses of each name in the statement i. The
- algorithm is as follows:
- For each statement i of the form x = y op z, scanned backward from the end of the block, do
- {
-     attach the information currently recorded for x, y, and z to statement i
-     set the information for x to "no next use"    /* this information can be
-         kept in the symbol table */
-     set the information for y and z to "next used in statement i"
- }
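- A sketch of this backward scan over a quadruple array follows; the quad layout and the NO_NEXT_USE marker are
- invented for illustration, and next_use[] is assumed to be initialized from the liveness information at the end of the
- block before the scan starts.
- #define NO_NEXT_USE (-1)
- /* Hypothetical quadruple: result x, operands y and z (-1 if absent), plus
-    next-use fields filled in by the scan. */
- struct quad { int x, y, z; int x_next, y_next, z_next; };
- /* next_use[name] is the index of the statement that next uses the name,
-    or NO_NEXT_USE; in practice this lives in the symbol table. */
- static void compute_next_use(struct quad *q, int n, int *next_use)
- {
-     int i;
-     for (i = n - 1; i >= 0; i--) {           /* backward over the block */
-         q[i].x_next = next_use[q[i].x];      /* attach current info */
-         q[i].y_next = (q[i].y >= 0) ? next_use[q[i].y] : NO_NEXT_USE;
-         q[i].z_next = (q[i].z >= 0) ? next_use[q[i].z] : NO_NEXT_USE;
-         next_use[q[i].x] = NO_NEXT_USE;      /* x is assigned here */
-         if (q[i].y >= 0) next_use[q[i].y] = i;
-         if (q[i].z >= 0) next_use[q[i].z] = i;
-     }
- }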
- Consider the basic block:
-     t1 = a + b
-     t2 = c + d
-     t3 = e − t2
-     x = t1 − t3
- When straightforward code generation is done using the above algorithm, and if only two registers, R0 and R1, are
- available, then the generated code is as shown in Table 11.5.
- Table 11.5: Generated Code with Only Two Available Registers, R0 and R1
- Statement     L    Instructions Generated     Cost      Register Descriptor      Address Descriptor
-                                                         R0 and R1 empty
- t1 = a + b    R0   MOV a, R0                  2 words   R0 will hold t1          t1 is in R0
-                    ADD b, R0                  2 words
- t2 = c + d    R1   MOV c, R1                  2 words   R1 will hold t2          t2 is in R1
-                    ADD d, R1                  2 words
- t3 = e − t2   R0   MOV R0, t1                 2 words   R0 will hold t3          t1 is in memory
-                    (generated by getreg())              R1 will be empty,        t3 is in R0
-                    MOV e, R0                  2 words   because t2 has no
-                    SUB R1, R0                 1 word    next use
- x = t1 − t3   R1   MOV t1, R1                 2 words   R1 will hold x           x is in R1
-                    SUB R0, R1                 1 word    R0 will be empty,
-                                                         because t3 has no
-                                                         next use
-                    MOV R1, x                  2 words                            x is in R1 and memory
- We see that the total length of the instruction sequence generated is 18 memory words. If we rearrange the final
- computations as:
-     t2 = c + d
-     t3 = e − t2
-     t1 = a + b
-     x = t1 − t3
- and then generate the code, we get Table 11.6.
- Table 11.6: Generated Code with Rearranged Computations
- Statement     L    Instructions Generated     Cost      Register Descriptor      Address Descriptor
-                                                         R0 and R1 empty
- t2 = c + d    R0   MOV c, R0                  2 words   R0 will hold t2          t2 is in R0
-                    ADD d, R0                  2 words
- t3 = e − t2   R1   MOV e, R1                  2 words   R1 will hold t3          t3 is in R1
-                    SUB R0, R1                 1 word    R0 will be empty,
-                                                         because t2 has no
-                                                         next use
- t1 = a + b    R0   MOV a, R0                  2 words   R0 will hold t1          t1 is in R0
-                    ADD b, R0                  2 words
- x = t1 − t3   R0   SUB R1, R0                 1 word    R0 will hold x           x is in R0
-                                                         R1 will be empty,
-                                                         because t3 has no
-                                                         next use
-                    MOV R0, x                  2 words                            x is in R0 and memory
- Here, the length of the instruction sequence generated is 14 memory words. This indicates that the order of the
- computation is a deciding factor in the cost of the code generated. In the above example, the cost is reduced when the
- order t2-t3-t1-t4 is used, because t1 gets computed immediately before the statement that computes t4, which uses t1
- as its left operand. Hence, no intermediate store-and-load is required, as is the case when the order t1-t2-t3-t4 is used.
- Good code generation requires rearranging the final computation order, and this can be done conveniently with a DAG
- representation of a basic block rather than with a linear sequence of three-address statements.
- 11.5 USING DAG FOR CODE GENERATION
- To rearrange the final computation order for more-efficient code generation, we first obtain a DAG representation of
- the basic block, and then we order the nodes of the DAG using a heuristic. The heuristic attempts to order the nodes
- of a DAG so that, if possible, a node immediately follows the evaluation of its left-most operand.
- 11.5.1 Heuristic DAG Ordering
- The algorithm for heuristic ordering is given below. It lists the nodes of a DAG such that the reverse of this listing
- gives the computation order:
- {
-     while there exists an unlisted interior node do
-     {
-         select an unlisted node n, all of whose parents have been listed
-         list n
-         while there exists a left-most child m of n that has no
-         unlisted parents and m is not a leaf do
-         {
-             list m
-             n = m
-         }
-     }
-     order = reverse of the order of listing of the nodes
- }
- EXAMPLE 11.2
- Consider the DAG shown in Figure 11.1.
- Figure 11.1: DAG Representation.
- The order in which the nodes are listed by the heuristic ordering is shown in Figure 11.2.
- Figure 11.2: DAG Representation with heuristic ordering.
- Therefore, the computation order is:
- If the DAG representation turns out to be a tree, then for the machine model described above, we can obtain the
- optimal order using the algorithm described in Section 11.5.2, below. Here, an optimal order means the order that
- yields the shortest instruction sequence.
- 11.5.2 The Labeling Algorithm
- This algorithm works on the tree representation of a sequence of three-address statements. It could also be made to
- work if the intermediate code form was a parse tree. This algorithm has two parts: the first part labels each node of the
- tree from the bottom up, with an integer that denotes the minimum number of registers required to evaluate the tree
- and with no storing of intermediate results. The second part of the algorithm is a tree traversal that travels the tree in
- an order governed by the computed labels in the first part, and which generates the code during the tree traversal.
- {
- if n is a leaf node then
- if n is the left-most child of its parent then
- label(n) = 1
- else
- label(n) = 0
- else
- label(n) = max[label(n i ) + (i - 1)]
- for i = 1 to k
- /* where n 1 , n 2 ,..., n k are the children of n, ordered by their labels; that is,
- label(n 1 ) ≥ label(n 2 ) ≥ ... ≥ label(n k ) */
- }
- For k = 2, the formula label(n) = max[label(ni) + (i − 1)] becomes:
- label(n) = max[l1, l2 + 1]
- where l1 is label(n1) and l2 is label(n2), with l1 ≥ l2. Since either l1 and l2 are the same, or they differ by at least
- one, we get:
- label(n) = l1 + 1 if l1 = l2
- label(n) = max(l1, l2) if l1 ≠ l2
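- The labeling pass can be written directly as a recursive function; the tree node struct used here is a made-up
- representation, not the book's:
- #include <stddef.h>
- /* Hypothetical expression-tree node: leaves have no children. */
- struct tnode { struct tnode *left, *right; int label; };
- /* Bottom-up labeling: a left-most leaf needs one register, any other leaf
-    none; an interior node needs max(l1, l2) registers if the children's
-    labels differ, and l1 + 1 if they are equal. */
- static int label_tree(struct tnode *n, int is_leftmost)
- {
-     int l1, l2;
-     if (n->left == NULL && n->right == NULL) {
-         n->label = is_leftmost ? 1 : 0;
-         return n->label;
-     }
-     l1 = label_tree(n->left, 1);     /* left child is a left-most child */
-     l2 = label_tree(n->right, 0);    /* right child is not */
-     n->label = (l1 == l2) ? l1 + 1 : (l1 > l2 ? l1 : l2);
-     return n->label;
- }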
- EXAMPLE 11.3
- Consider the following three-address code and its DAG representation, shown in Figure 11.3:
- Figure 11.3: DAG representation of three-address code for Example 11.3.
- The tree, after labeling, is shown in Figure 11.4.
- Figure 11.4: DAG representation tree after labeling.
- 11.5.3 Code Generation by Traversing the Labeled Tree
- We will now examine an algorithm that traverses the labeled tree and generates machine code to evaluate the tree in
- the register R0. The content of R0 can then be stored in the appropriate memory location. We assume that only binary
- operators are used in the tree. The algorithm uses a recursive procedure, gencode(n), to generate the code for
- evaluating into a register a subtree that has its root in node n. This procedure makes use of RSTACK to allocate
- registers.
- Initially, RSTACK contains all available registers. We assume the order of the registers to be R0, R1, … , from top to
- bottom. A call to gencode() may find a subset of registers, perhaps in a different order in RSTACK, but when
- gencode() returns, it leaves the registers in RSTACK in the same order in which they were found. The resulting code
- computes the value of the tree in the top register of RSTACK. It also makes use of TSTACK to allocate temporary
- memory locations. Depending upon the type of node n with which gencode() is called, gencode() performs the
- following:
- 1. If n is a leaf node and is the left-most child of its parent, then gencode() generates a load
-    instruction, MOV name, RSTACK[top], for loading the top register of RSTACK with the label of node n.
- 2. If n is an interior node, it will be an operator node labeled op, with children n1 and n2, where
-    n2 is a simple operand and not the root of a subtree, as shown in Figure 11.5.
- Figure 11.5: The node n2 is an operand and not a subtree root.
-    In this case, gencode() will first generate the code to evaluate the subtree rooted at n1 into
-    RSTACK[top]. It will then generate the instruction OP name, RSTACK[top].
- 3. If n is an interior node, it will be an operator node labeled op, with children n1 and n2, where
-    both n1 and n2 are roots of subtrees, as shown in Figure 11.6.
- Figure 11.6: The node n is an operator, and n1 and n2 are subtree roots.
-    In this case, gencode() examines the labels of n1 and n2. If label(n2) > label(n1), then n2 requires
-    a greater number of registers to evaluate without storing intermediate results than n1 does.
-    Therefore, gencode() checks whether the total number of available registers, r, is greater than
-    label(n1). If it is, then the subtree rooted at n1 can be evaluated without storing intermediate
-    results. gencode() first swaps the top two registers of RSTACK, then generates the code for
-    evaluating the subtree rooted at n2, the harder subtree, into RSTACK[top]. It removes the
-    top-most register from RSTACK and stores it in R, then generates code for evaluating the subtree
-    rooted at n1 into RSTACK[top]. An instruction, OP R, RSTACK[top], is generated, and R is pushed
-    onto RSTACK. The top two registers are then swapped so that the register holding the value of n
-    will be the top register of RSTACK.
- 4. If label(n2) <= label(n1), then n1 requires at least as many registers to evaluate without storing
-    intermediate results as n2 does. Therefore, gencode() checks whether the total number of
-    available registers, r, is greater than label(n2). If it is, then the subtree rooted at n2 can be
-    evaluated without storing intermediate results. Hence, gencode() first generates the code for
-    evaluating the subtree rooted at n1, the harder subtree, into RSTACK[top], removes the top-most
-    register from RSTACK, and stores it in R. It then generates code for evaluating the subtree rooted
-    at n2 into RSTACK[top]. An instruction, OP RSTACK[top], R, is generated, and register R is pushed
-    onto RSTACK. In this case, the top register after pushing R onto RSTACK holds the value of n.
-    Therefore, no swapping and reswapping is needed.
- 5. If label(n1) as well as label(n2) is greater than or equal to r (i.e., both subtrees require r or more
-    registers to evaluate without intermediate storage), a temporary memory location is required. In
-    this case, gencode() first generates the code for evaluating n2 into a temporary memory location,
-    then generates code to evaluate n1, followed by an instruction to evaluate root n in the top
-    register of RSTACK.
- Algorithm for Implementing Gencode()
- The procedure for gencode() is outlined as follows:
- Procedure gencode(n)
- {
-     if n is a leaf node and the leftmost child of its parent then
-         generate MOV name, RSTACK[top] /* name is the operand
-             represented by n */
-     if n is an interior node with children n1 and n2, and
-         label(n2) = 0 then
-     {
-         gencode(n1)
-         generate op name, RSTACK[top] /* name is the operand
-             represented by n2, and op is the operator represented by n */
-     }
-     if n is an interior node with children n1 and n2,
-         label(n2) > label(n1), and label(n1) < r then
-     {
-         swap top two registers of RSTACK
-         gencode(n2)
-         R = pop(RSTACK)
-         gencode(n1)
-         generate op R, RSTACK[top] /* op is the operator represented by n */
-         PUSH(R, RSTACK)
-         swap top two registers of RSTACK
-     }
-     if n is an interior node with children n1 and n2,
-         label(n2) <= label(n1), and label(n2) < r then
-     {
-         gencode(n1)
-         R = pop(RSTACK)
-         gencode(n2)
-         generate op RSTACK[top], R /* op is the operator represented by n */
-         PUSH(R, RSTACK)
-     }
-     if n is an interior node with children n1 and n2, and
-         label(n1) >= r as well as label(n2) >= r then
-     {
-         gencode(n2)
-         T = pop(TSTACK)
-         generate MOV RSTACK[top], T
-         gencode(n1)
-         PUSH(T, TSTACK)
-         generate op T, RSTACK[top] /* op is the operator represented by n */
-     }
- }
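- To make the procedure concrete, here is a minimal, runnable Python sketch of the labeling and gencode()
- algorithms described above. The Node class, the field names, and the fixed register and temporary stacks are
- assumptions made for illustration; the emitted strings follow this chapter's MOV source, destination convention,
- in which op src, dst leaves the result in dst.
- class Node:
-     def __init__(self, op, left=None, right=None):
-         self.op = op            # operator name, or operand name for a leaf
-         self.left = left
-         self.right = right
-         self.label = 0
- def label_tree(n, is_left=True):
-     # Labeling rule: a leftmost leaf gets 1 and any other leaf gets 0;
-     # an interior node gets the larger of its children's labels when
-     # they differ, and the common label plus one when they are equal.
-     if n.left is None:
-         n.label = 1 if is_left else 0
-     else:
-         label_tree(n.left, True)
-         label_tree(n.right, False)
-         l1, l2 = n.left.label, n.right.label
-         n.label = max(l1, l2) if l1 != l2 else l1 + 1
-     return n.label
- code = []                       # generated instructions, in order
- RSTACK = ["R1", "R0"]           # end of the list is RSTACK[top]
- TSTACK = ["T1", "T0"]           # temporary memory locations
- def gencode(n):
-     r = len(RSTACK)             # registers currently available
-     if n.left is None:
-         # leaf: only ever reached as the leftmost child of its parent
-         code.append("MOV %s, %s" % (n.op, RSTACK[-1]))
-     elif n.right.label == 0:
-         # right child is an operand; fold it into the instruction
-         gencode(n.left)
-         code.append("%s %s, %s" % (n.op, n.right.op, RSTACK[-1]))
-     elif n.left.label < n.right.label and n.left.label < r:
-         # harder right subtree first, in the swapped top register
-         RSTACK[-1], RSTACK[-2] = RSTACK[-2], RSTACK[-1]
-         gencode(n.right)
-         R = RSTACK.pop()
-         gencode(n.left)
-         code.append("%s %s, %s" % (n.op, R, RSTACK[-1]))
-         RSTACK.append(R)
-         RSTACK[-1], RSTACK[-2] = RSTACK[-2], RSTACK[-1]
-     elif n.right.label <= n.left.label and n.right.label < r:
-         # harder left subtree first; no swapping needed
-         gencode(n.left)
-         R = RSTACK.pop()
-         gencode(n.right)
-         code.append("%s %s, %s" % (n.op, RSTACK[-1], R))
-         RSTACK.append(R)
-     else:
-         # both subtrees need r or more registers: spill to a temporary
-         gencode(n.right)
-         T = TSTACK.pop()
-         code.append("MOV %s, %s" % (RSTACK[-1], T))
-         gencode(n.left)
-         TSTACK.append(T)
-         code.append("%s %s, %s" % (n.op, T, RSTACK[-1]))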
- The algorithm above can be used when the DAG represented is a tree; but when there are common subexpressions in
- the basic block, the DAG representation will no longer be a tree, because the common subexpressions will correspond
- to nodes with more than one parent. These are called "shared nodes". In this case, we can apply the labeling and the
- gencode() algorithms by partitioning the DAG into a set of trees: for each shared node, as well as for the root n, we
- find the maximal subtree with n as its root that includes no other shared nodes, except as leaves. For example, consider
- the DAG shown in Figure 11.7. It is not a tree, but it can be partitioned into the set of trees shown in Figure 11.8. The
- procedure gencode() can then be used to generate code for each tree in this set.
- Figure 11.7: A nontree DAG.
- Figure 11.8: A DAG that has been partitioned into a set of trees.
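- A hedged sketch of this partitioning step, continuing the Python sketch above: the parent-count map is assumed
- to be precomputed while building the DAG, and a shared child is simply treated as the root of its own tree (code
- generation then refers to the temporary or register that holds its value).
- def partition_dag(root, parent_count):
-     # parent_count maps id(node) to its number of parents in the DAG.
-     # Returns the roots of the maximal trees: the DAG root plus every
-     # shared node; traversal of a tree stops at shared children, which
-     # therefore appear only as leaves of that tree.
-     roots, trees, seen = [root], [], set()
-     while roots:
-         t = roots.pop()
-         if id(t) in seen:
-             continue
-         seen.add(id(t))
-         trees.append(t)
-         stack = [t]
-         while stack:
-             n = stack.pop()
-             for child in (n.left, n.right):
-                 if child is None:
-                     continue
-                 if parent_count.get(id(child), 1) > 1:
-                     roots.append(child)   # shared node: roots its own tree
-                 else:
-                     stack.append(child)
-     return trees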
- EXAMPLE 11.4
- Consider the labeled tree shown in Figure 11.9.
- Figure 11.9: Labeled tree for Example 11.4.
- The code generated by gencode() when this tree is given as input, along with the recursive calls of gencode(), is
- shown in Table 11.7. The process starts with a call to gencode(t4); initially, the top two registers of RSTACK are R0
- and R1.
- Table 11.7: Recursive Gencode Calls
- Call to gencode() | Action taken | Top two registers of RSTACK | Code generated (for the subtree handled by the call)
- (initial) | | R0, R1 |
- gencode(t4) | swap top two registers; call gencode(t3); pop R1; call gencode(t1); generate SUB R1, R0; push R1; swap top two registers | R1, R0 during the call; R0, R1 after the final swap | MOV E, R1; MOV C, R0; ADD D, R0; SUB R0, R1; MOV A, R0; ADD B, R0; SUB R1, R0
- gencode(t3) | call gencode(E); pop R1; call gencode(t2); generate SUB R0, R1; push R1 | R1, R0 (only R0 while R1 is popped) | MOV E, R1; MOV C, R0; ADD D, R0; SUB R0, R1
- gencode(E) | generate MOV E, R1 | R1, R0 | MOV E, R1
- gencode(t2) | call gencode(C); generate ADD D, R0 | R0 | MOV C, R0; ADD D, R0
- gencode(C) | generate MOV C, R0 | R0 | MOV C, R0
- gencode(t1) | call gencode(A); generate ADD B, R0 | R0 | MOV A, R0; ADD B, R0
- gencode(A) | generate MOV A, R0 | R0 | MOV A, R0
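- As a check, the sketch above reproduces Table 11.7. Building the tree of Example 11.4 as implied by the
- generated code (t1 = A + B, t2 = C + D, t3 = E − t2, t4 = t1 − t3; the names are those used in the table):
- t1 = Node("ADD", Node("A"), Node("B"))
- t2 = Node("ADD", Node("C"), Node("D"))
- t3 = Node("SUB", Node("E"), t2)
- t4 = Node("SUB", t1, t3)
- label_tree(t4)
- gencode(t4)
- print("\n".join(code))
- This prints the seven instructions of Table 11.7, ending with SUB R1, R0, which leaves the value of t4 in R0.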
- 11.6 USING ALGEBRAIC PROPERTIES TO REDUCE THE REGISTER
- REQUIREMENT
- It is possible to make use of algebraic properties like operator commutativity and associativity to reduce the register
- requirements of the tree. For example, consider the tree shown in Figure 11.10.
- Figure 11.10: Tree with a label of two.
- The label of the tree in Figure 11.10 is two, but since + is a commutative operator, we can interchange the left and the
- right subtrees, as shown in Figure 11.11. This brings the register requirement of the tree down to one.
- Figure 11.11: The left and right subtrees have been interchanged, reducing the register requirement to one.
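- A small addition to the earlier sketch captures this transformation; treating only ADD and MUL as commutative
- is an assumption made here for illustration.
- def use_commutativity(n):
-     # If the operator is commutative, the left child is a leaf, and the
-     # right child is an interior node, swap the children: the leaf then
-     # becomes the right operand (label 0) and can be folded directly
-     # into the instruction, as in Figures 11.10 and 11.11.
-     if n.left is None:
-         return
-     use_commutativity(n.left)
-     use_commutativity(n.right)
-     if n.op in ("ADD", "MUL") and n.left.left is None and n.right.left is not None:
-         n.left, n.right = n.right, n.left
- After restructuring the tree, the labels must be recomputed with label_tree() before gencode() is called.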
- Similarly, associativity can be used to reduce the register requirement. Consider the tree shown in Figure 11.12.
- Figure 11.12: Associativity is used to reduce a tree's register requirement.
- 11.7 PEEPHOLE OPTIMIZATION
- Code generated by using the statement-by-statement code-generation strategy contains redundant instructions and
- suboptimal constructs. Therefore, to improve the quality of the target code, optimization is required. Peephole
- optimization is an effective technique for locally improving the target code. Short sequences of target code instructions
- are examined and replaced with faster sequences wherever possible. Typical optimizations that can be performed
- are:
- Elimination of redundant loads and stores
- Elimination of multiple jumps
- Elimination of unreachable code
- Algebraic simplifications
- Strength reduction
- Use of machine idioms
- Eliminating Redundant Loads and Stores
- If the target code contains the instruction sequence:
- 1. MOV R, a
- 2. MOV a, R
- we can delete the second instruction if it is an unlabeled instruction. This is because the first instruction ensures that
- the value of a is already in the register R. If the second instruction is labeled, there is no guarantee that instruction 1
- will always be executed before instruction 2.
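- A hedged sketch of this check over a list of instruction strings; the "MOV source, destination" string format and
- the set of labeled instruction indices are assumptions of the sketch.
- def eliminate_redundant_loads(code, labeled):
-     # code: instructions such as "MOV R0, a"; labeled: indices of
-     # instructions that carry a label and so may be jump targets.
-     out = []
-     for i, ins in enumerate(code):
-         if out and i not in labeled and ins.startswith("MOV") and out[-1].startswith("MOV"):
-             src, dst = ins[4:].replace(" ", "").split(",")
-             psrc, pdst = out[-1][4:].replace(" ", "").split(",")
-             if src == pdst and dst == psrc:
-                 continue        # the value of a is already in register R
-         out.append(ins)
-     return out
- For example, eliminate_redundant_loads(["MOV R0, a", "MOV a, R0"], set()) returns ["MOV R0, a"].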
- Eliminating Multiple Jumps
- If we have jumps to other jumps, then the unnecessary jumps can be eliminated in either intermediate code or the
- target code. If we have a jump sequence:
- goto L1
- ...
- L1: goto L2
- then this can be replaced by:
- goto L2
- ...
- L1: goto L2
- If there are now no jumps to L1, then it may be possible to eliminate the statement, provided it is preceded by an
- unconditional jump. Similarly, the sequence:
- if a < b goto L1
- ...
- L1: goto L2
- can be replaced by:
- if a < b goto L2
- ...
- L1: goto L2
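- A sketch of this rewrite over instruction strings; the "L1: goto L2" label syntax and the cycle guard are
- assumptions of the sketch.
- def collapse_jump_chains(code):
-     # Collect labels whose only statement is an unconditional goto,
-     # then retarget every jump to the end of its chain.
-     chain = {}
-     for ins in code:
-         if ":" in ins:
-             lab, stmt = ins.split(":", 1)
-             if stmt.strip().startswith("goto "):
-                 chain[lab.strip()] = stmt.split()[-1]
-     def final(lab):
-         seen = set()
-         while lab in chain and lab not in seen:
-             seen.add(lab)
-             lab = chain[lab]
-         return lab
-     out = []
-     for ins in code:
-         parts = ins.split()
-         if "goto" in parts:
-             i = parts.index("goto")
-             parts[i + 1] = final(parts[i + 1])
-             out.append(" ".join(parts))
-         else:
-             out.append(ins)
-     return out
- Jumps to L1 then become jumps to L2, after which the bypassed statement L1: goto L2 can be removed under the
- conditions described above.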
- Eliminating Unreachable Code
- An unlabeled instruction that immediately follows an unconditional jump can possibly be removed, and this operation
- can be repeated in order to eliminate a sequence of instructions. For debugging purposes, a large program may have
- within it certain segments that are executed only if a debug variable is one. For example, the source code may be:
- #define debug 0
- ...
- if (debug)
- {
- print debugging information
- }
- This if statement is translated in the intermediate code to:
- if debug = 1 goto L1
- goto L2
- L1 : print debugging information
- L2 :
- One of the optimizations is to replace the pair:
- if debug = 1 goto L1
- goto L2
- in the intermediate code with a single conditional goto statement by negating the condition and changing its target, as
- shown below:
- if debug ≠ 1 goto L2
- print debugging information
- L2 :
- Since debug is a constant zero by constant propagation, this code will become:
- if 0 ≠ 1 goto L2
- Print debugging information
- L2 :
- Since 0 ≠ 1 is always true, this will become:
- goto L2
- Print debugging information
- L2 :
- Therefore, the statements that print the debugging information are unreachable and can be eliminated, one at a time.
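- The following sketch drops unlabeled instructions that follow an unconditional jump; recognizing a label by a
- leading "name:" token is an assumption of the sketch.
- def remove_unreachable(code):
-     # After an unconditional goto, skip every instruction until the
-     # next labeled one, which may still be reachable by some jump.
-     out, skipping = [], False
-     for ins in code:
-         words = ins.split()
-         if words and words[0].endswith(":"):
-             skipping = False
-         if not skipping:
-             out.append(ins)
-             if words and words[0] == "goto":
-                 skipping = True
-     return out
- Applied to the fragment above, the print statement between goto L2 and L2: is removed in one pass.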
- Algebraic Simplifications
- If statements like x = x + 0 or x = x * 1
- are generated in the code, they can be eliminated, because zero is an additive identity, and one is a multiplicative
- identity.
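- A sketch of this cleanup over three-address strings; the "x = y op z" statement format is an assumption.
- import re
- def remove_identities(code):
-     # Drop statements of the form "x = x + 0" and "x = x * 1".
-     out = []
-     for ins in code:
-         m = re.match(r"(\w+) = (\w+) ([+*]) (\d+)$", ins)
-         if m and m.group(1) == m.group(2):
-             op, k = m.group(3), m.group(4)
-             if (op == "+" and k == "0") or (op == "*" and k == "1"):
-                 continue
-         out.append(ins)
-     return out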
- Reducing Strength
- Certain machine instructions are considered to be cheaper than others. Hence, if we replace expensive operations by
- equivalent cheaper ones on the target machine, efficiency improves. For example, x^2 is invariably cheaper to
- implement as x * x than as a call to an exponentiation routine. Similarly, fixed-point multiplication or division by a
- power of two is cheaper to implement as a shift.
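- A sketch of two such rewrites; the statement format and the << shift notation are assumptions made for
- illustration.
- import re
- def reduce_strength(ins):
-     # x = y ** 2  ->  x = y * y (avoids an exponentiation routine)
-     m = re.match(r"(\w+) = (\w+) \*\* 2$", ins)
-     if m:
-         return "%s = %s * %s" % (m.group(1), m.group(2), m.group(2))
-     # x = y * 2^k  ->  x = y << k (fixed-point multiply by a power of two)
-     m = re.match(r"(\w+) = (\w+) \* (\d+)$", ins)
-     if m:
-         c = int(m.group(3))
-         if c > 0 and c & (c - 1) == 0:
-             return "%s = %s << %d" % (m.group(1), m.group(2), c.bit_length() - 1)
-     return ins
- For example, reduce_strength("a = b * 8") returns "a = b << 3".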
- Using Machine Idioms
- The target machine may have hardware instructions to implement certain specific operations efficiently. Detecting
- situations that permit the use of these instructions can reduce execution time significantly. For example, some
- machines have auto-increment and auto-decrement addressing modes. Using these modes can greatly improve the
- quality of the code when pushing or popping a stack. These modes can also be used for implementing statements like
- a = a + 1.
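- A sketch of this pattern match; INC and DEC are assumed target-machine instructions, not taken from the text.
- import re
- def use_machine_idioms(code):
-     # Rewrite "a = a + 1" and "a = a - 1" as single INC/DEC
-     # instructions where the target machine provides them.
-     out = []
-     for ins in code:
-         m = re.match(r"(\w+) = (\w+) ([+-]) 1$", ins)
-         if m and m.group(1) == m.group(2):
-             out.append(("INC " if m.group(3) == "+" else "DEC ") + m.group(1))
-         else:
-             out.append(ins)
-     return out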
- Chapter 12: Exercises
- The exercises that follow are designed to provide further examples of the concepts covered in this book. Their
- purpose is to put these concepts to work in practical contexts that will enable you, as a programmer, to use
- algorithms better and more efficiently when designing your compiler.
- EXERCISE 12.1
- Construct the regular expression that corresponds to the state transition diagram shown in Figure 12.1.
- Figure 12.1: State transition diagram.
- EXERCISE 12.2
- Prove that regular sets are closed under intersection. Present a method for constructing a DFA with an intersection of
- two regular sets.
- EXERCISE 12.3
- Transform the following NFA into an optimal/minimal state DFA.
- State | 0 | 1 | ∈
- A | A, C | B | D
- B | B | D | C
- C | C | A, C | D
- D | D | A | −
- EXERCISE 12.4
- Obtain the canonical collection of sets of LR(1) items for the following grammar:
- EXERCISE 12.5
- Construct an LR(1) parsing table for the following grammar:
- EXERCISE 12.6
- Construct an LALR(1) parsing table for the following grammar:
- EXERCISE 12.7
- Construct an SLR(1) parsing table for the following grammar:
- EXERCISE 12.8
- Consider the following code fragment. Generate the three-address-code for it.
- if a < b then
- while c > d do
- x = x + y
- else
- do
- p = p + q
- while e <= f
- EXERCISE 12.9
- Consider the following code fragment. Generate the three-address code for it.
- for (i = 1; i <= 10; i++)
- if a < b then x = y + z
- EXERCISE 12.10
- Consider the following code fragment. Generate the three-address-code for it.
- switch a + b
- {
- case 1: x = x + 1
- case 2: y = y + 2
- case 3: z = z + 3
- default: c = c -1
- }
- EXERCISE 12.11
- Write the syntax-directed translations to go along with the LR parser for the following:
- EXERCISE 12.12
- Write the syntax-directed translations to go along with the LR parser for the following:
- EXERCISE 12.13
- There are syntactic errors in the following constructs. For each of these constructs, find out which of the input's next
- tokens will be detected as an error by the LR parser.
- 1. while a = b do x = y + z
- 2. a + b = c
- 3. a *+ b + c
- EXERCISE 12.14
- Comment on whether the following statements are true or false:
- 1. Given a finite automata M(Q, Σ, δ, q0, F) that accepts L(M), the automata M1(Q, Σ, δ, q0, Q − F)
- accepts the complement of L(M). If M is an optimal or minimal state automata, then M1 is also a
- minimal state automata.
- 2. Every subset of a regular set is also a regular set.
- 3. In a top-down backtracking parser, the order in which the various alternatives are tried may affect
- the language accepted by the parser.
- 4. An LR parser detects an error when the symbol coming next in the input is not a valid continuation
- of the prefix of the input seen by the parser.
- 5. Grammar ambiguity necessarily implies ambiguity in the language generated by that grammar.
- 6. Every name is added to the symbol table during the lexical analysis phase, irrespective of the
- semantic role played by each name.
- 7. Given a grammar with no useless symbols, but containing unit productions, if the unit productions
- are eliminated from the grammar, then it is possible that some of the grammar symbols in the
- resulting grammar may become useless.
- 8. In any nonambiguous grammar without useless symbols, the handle of a given right-sentential
- form is unique.
- Index
- A
- Action specification in LEX, 46-47
- Action tables
- Action | GOTO tables, 140
- arrays to represent, 178-179
- LALR parsing tables, 165-169
- for LR(1) parser, 163-165
- for SLR(1) parser, 152-161
- Activation records, 248-249
- Addressing modes, machine model and, 297-299
- Algebraic properties, register requirements reduced with, 317-318
- Alphabet, defined for lexical analysis, 6
- Ambiguous grammars and bottom-up parsing, 172-177
- AND operator and translation, 214-215
- Arithmetic expressions, translation of, 208-211
- Array references, 225-229
- Arrays, to represent action tables, 178-179
- Attributes
- defined, 196
- dummy synthesized attributes, 199-201
- inherited attributes, 198-199
- synthesized attributes, 197-198
- Augmented grammars, 142-146, 175-176
- Automata, equivalence of, 51-52
- Index
- B
- Back end compilers, 4
- Back-patching, 5
- Backtracking parsers, 95
- recursive descent parsers, 94-118
- Block statements and stack allocation, 256-257
- Boolean expressions, translation of, 211-214
- Bootstrap compilers, defined, 1-2
- Bottom-up parsing
- Action | GOTO tables, 140
- ambiguous grammars, 172-177
- canonical collection of sets algorithm, 146-152
- defined and described, 135-136
- handles of right sentential form, 136-138
- implementation of, 138-140
- LALR parsing, 165-166, 190-194
- LR parsers, 140-142
- LR(1) parsing, 163-165, 179-194
- Braces {} in syntax-directed translation schemes, 202-203
- Index
- C
- Call and return sequences, stack allocation and, 250-253
- Canonical collection of sets
- algorithm for, 146-152
- exercises, 324
- of LR(1), algorithm, 161-163
- Cartesian products, set operation, 7
- CASE statements, 229-234
- Closure
- property closure of a relation, 9
- set operation, 7-8
- Closure operations, regular sets and, 47
- Code generation phase, 2, 3, 4
- DAGs and, 305-316
- difficulties encountered during, 296-297
- getreg() function and, 300-305
- labeled trees and, 307-316
- straightforward strategy for, 299-305
- Code optimization phase, 2, 3
- algebraic properties to reduce register requirements, 317-318
- algebraic simplifications, 320
- defined and described, 269-270
- global common subexpressions, eliminating, 290-292
- jumps, eliminate multiple, 319
- loads and stores, eliminating redundancy, 319
- local common subexpressions, eliminating, 288-290
- loop optimization, 270-284
- machine idioms and, 321
- partitioning three-address code into basic blocks, 271-273
- peephole optimization, 318-321
- reducible flow graphs and, 274-284
- strength reduction, 321
- unreachable code, eliminating, 319-320
- Compilation, process described, 2-5
- Compilers
- defined, 1
- front-end vs. back-end compilers, 4
- organization of, 4
- Computational order, 296
- Concatenation
- defined, 6
- set operation, 7
- Concatenation operation, regular sets and, 47
- Context-free grammars (CFGs)
- algorithm for identifying useless symbols, 64
- defined and described, 54
- derivation in, 55-56
- ∈ -productions and, 70-73
- left linear grammar, 86-90
- left-recursive grammar, 75-77
- productions (P) in, 54
- reduction of grammar, 61-70
- regular grammar as, 77-85
- right linear grammar, 85-86
- SLR(1) grammars, 152-161
- start symbol (S) in, 54
- in syntax analysis phase, 53-54
- terminals (T) in, 54
- unit productions and, 73-75
- variables (V) or nonterminals in, 54, 56
- Cross-compilers, defined, 1-2
- Index
- D
- DAGs. See Directed acyclic graphs (DAGs)
- Data storage. See Storage management
- Data structures for representing parsing tables, 178-179
- Dead states of DFAs, 27
- detection of, 31
- Decrement operators, implementation of, 224-225
- Dependency graphs, 199-201
- Derivation
- in context-free grammar, 55-56
- derivation trees in CFG, 56-61
- Detection, of DFA unreachable and dead states, 28-31
- Deterministic finite automata (DFA)
- Action | GOTO tables, 141-142
- augmented grammar and, 142-146
- equivalent to NFAs with ∈ -moves, 23-27
- exercises, 323-324
- minimization/optimization of, 27-31
- transforming NFAs into, 16-18
- DFA. See Deterministic finite automata (DFA)
- Directed acyclic graphs (DAGs), 288-290
- code generation and, 305-316
- heuristic DAG ordering, 305-307
- labeling algorithm and, 307-309
- DO-WHILE statements and translation, 220-221
- Dummy synthesized attributes, 199-201
- Index
- E
- ∈ -closure(q), finding, 19-20
- ∈ -moves
- acceptance of strings by NFAs with, 19
- equivalence of NFAs with and without, 21-22
- finding ∈ -closure(q), 19-20
- NFAs with, 18-27
- ∈ -productions
- defined, 70
- eliminating, 71-73
- and nonnullable nonterminals, 70-71
- regular grammar and, 77-84
- ∈ -transitions, 18
- Equivalence of automata, 51-52
- Error handling
- detection and report of errors, 259-260
- exercises, 325
- lexical phase errors, 260
- in LR parsing, 261-264
- panic mode recovery, 261
- phase level recovery, 261-264
- predictive parsing error recovery, 264-267
- semantic errors and, 268
- YACC and, 264
- Errors. See Error handling
- Index
- F
- Finite automata
- construction of, 31-38
- defined, 11
- exercises, 326
- non-deterministic finite automata (NFA), 14-16
- specification of, 11-14
- strings and, 13, 15-16
- FOR statements and translation, 223-224
- Front-end compilers, 4
- Index
- G
- Gencode() function, 313-316
- Getreg() function, 300-305
- Global common subexpressions, eliminating, 290-292
- GOTO tables, 140
- construction of, 152-161
- for LR(1) parser, 163-165
- Grammars, exercises
- ambiguous grammars, 172-177
- augmented grammar, 142-146, 175-176
- left-recursive grammar, 75-77
- useless grammar symbols (reduction of), 61-70
- Index
- H
- Handle pruning, 137
- Hash tables for organization of symbol tables, 243-244
- Index
- I
- IF-THEN-ELSE statements and translation, 216-218
- IF-THEN statements and translation, 218-219
- Increment operators, implementation of, 224-225
- Indirect triple representation, 206-207
- Induction variables of loops
- defined, 284-285
- detecting and eliminating, 285-288
- Inherited attributes, 198-199
- Input files, LEX, 46-47
- Intermediate code generation phase, 2, 3
- Intersection, set operation, 7
- Index
- J-K
- Jumps
- and Boolean translation, 213-214
- eliminating multiple, 319
- Index
- L
- LALR parsing, 165-166, 190-194
- Language, defined for lexical analysis, 6
- Language tokens, lexical analysis and, 5
- L-attributed definitions, 201
- Left linear grammar, 86-90
- LEX compiler-writing tool, 45-46
- action specification in, 46-47
- format for input or source files, 46-47
- pattern specification in, 46-47
- Lexemes, 5
- Lexical analysis
- design of lexical analyzers, 45-47
- phase of compiling, 2-3, 5, 260
- Lexical analyzers, design of, 45-47
- Lexical phase, 2-3, 5
- error recovery, 260
- Linear lists for organization of symbol tables, 242
- Local common subexpressions, eliminating, 288-290
- Logical expressions
- AND operator, 214-215
- DO-WHILE statements, 220-221
- FOR statements, 223-224
- IF-THEN-ELSE statements, 216-218
- IF-THEN statements, 218-219
- NOT operator, 215-216
- OR operator, 215
- REPEAT statements, 222-223
- translation and, 214-224
- WHILE statements, 219-220
- Loop invariant computations, 271
- Loop jamming, 293-294
- Loop optimizations, 270-284
- back edge identification, 273-274
- induction variables, reduction of, 284-288
- loop detection, 273
- loop jamming, 293-294
- loop unrolling, 292-293
- reducible flow graphs and, 274-284
- Loop unrolling, 292-293
- LR parsers and parsing, 140-142, 179-194
- LR(1) parsers and parsing
- action tables, 163-165
- exercises, 324
- Index
- M
- Machine model described, 297-299
- Memory. See Storage management
- Memory addresses, machine model and, 297-299
- Index
- N
- Names
- access to nonlocal names, 253-255
- address descriptors and, 299
- held in symbol tables, 241
- runtime name storage, 241
- scope of name, 244-246
- Non-deterministic finite automata (NFA)
- defined and described, 14
- DFA equivalents of, 23-27
- with ∈ -moves, 18-27
- equivalence and ∈ -moves, 21-22
- strings and, 15-16
- transformation into deterministic (DFA), 16-18
- Nondistinguishable states of DFAs, 27
- Nonlocal names, 253-255
- Nonterminals in context-free grammar, 54, 56
- NOT operator and translation, 215-216
- Index
- O
- Opcodes, machine model and, 297-299
- Operators
- for regular expressions, 40
- translation and Boolean operators, 214-216
- Optimizations
- of DFAs, 27-31; see also Code optimization phase
- OR operator and translation, 215
- Index
- P
- Panic mode recovery, 261
- Parsers and parsing
- action tables, 140
- backtracking parsers, 95
- conflicts, 169-171
- data structures for representing parsing tables, 178-179
- defined and described, 91
- LALR parsing, 165-169, 190-194
- LR parsers, 140-142
- LR(1) parsers, action tables, 163-165
- predictive top-down parsers, 118-133
- table-driven predictive parsers, 123-133
- see also Bottom-up parsing; Parse trees; Syntax analysis phase; Top-down parsing
- Parse trees
- in CFG, 56-61
- derivation trees in CFG, 56-61
- labeled trees and code generation, 307-316
- node labeling algorithm, 307-309
- symbol table organization with, 242-243
- syntax trees, 203-204
- Pattern specification in LEX, 46-47
- Peephole optimization, 318-321
- Postfix notation, 203
- Power set, set operation, 7
- Predictive parsing
- error recovery and, 264-267
- predictive top-down parsers, 118-133
- Predictive top-down parsers, 118-133
- Prefixes, defined, 6
- Procedure calls, 234-235
- Productions (P) in context-free grammar, 54
- Index
- Q
- Quadruple representation, 205-206, 207
- Index
- R
- Recursion, eliminating left recursion, 75-77
- Recursive descent parsers, implementation, 94-118
- Reduce-reduce conflicts, 170-171
- Reducible flow graphs
- and code optimization, 274-284
- loop invariant statements and, 282-283
- Reduction of grammar, 61-70
- algorithm for identifying useless symbols, 64
- bottom-up parsing and, 135-136
- Registers
- algebraic properties to reduce requirements for, 317-318
- register descriptors, 299
- RSTACK to allocate, 309-313
- selecting for computation, 297
- Regular expression notation
- finite automata definitions, 6-8
- role in lexical analysis, 5
- Regular expressions
- defined and described, 39-43
- exercise, 323
- lexical analyzer design and, 45
- obtained from finite automata, 43-44
- obtained from regular grammar, 84-85
- operators for, 40; see also Regular expression notation
- Regular grammar, 77-85
- defined, 77
- ∈ -productions and, 77
- regular expressions from, 84-85
- Regular sets, 39
- exercises, 326
- lexical analyzer design and, 45
- properties of, 47-51
- Relations
- defined and described, 8
- properties of, 8-9
- property closure of, 9
- symbol for in CFG, 54
- REPEAT statements and translation, 222-223
- Return sequences, stack allocation and, 250-253
- Right linear grammar, 85-86
- RSTACKs, allocating registers with, 309-313
- Index
- S
- Scope rules and scope information, 244-246, 253
- Search trees for organization of symbol tables, 242-243
- Sentential form handles, 136-138
- Set difference, set operation, 7
- Set operations, defined, 7
- Sets
- defined, 7
- regular sets, 39, 45, 47-51
- relations between, 8-9
- Shift-reduce conflicts, 169
- SLR(1)
- exercises, 324
- grammars, 152-161
- SLR parsing, 151-162, 176-177, 180-190
- Source files, LEX, 46-47
- Stack allocation
- access link set up, 255-257
- access to nonlocal names and, 253-255
- block statements and, 256-257
- call and return sequences, 250-253
- Start symbol (S) in context-free grammar, 54
- Storage management
- heap memory storage, 247-248
- procedure activation and activation records, 248-249
- stack allocation, 250-257
- static allocation, 250
- storage allocation, 247-248
- Strings, defined, 6
- Suffixes, defined, 6
- SWITCH statements, translation of, 229-234
- Symbol tables
- defined and described, 239
- exercises, 326
- hash tables for organization of, 243-244
- implementation of, 239-240
- information entry for, 240
- linear lists for organization of, 242
- names held in, 241
- scope information, 244-246
- search trees for organization of, 242-243
- Syntactic phase error recovery, 260-261
- Syntax analysis phase, 2-3
- context-free grammar and, 53-54
- error recovery during syntactic phase, 260-261
- Syntax-directed definitions
- L-attributed definitions, 201
- translation and, 195-201
- Syntax directed translations and translation schemes, 202-203
- Syntax trees, 203-204
- Synthesized attributes, 197-198
- dummy synthesized attributes, 199-201
- Index
- T
- Table-driven predictive parsers, implementation, 123-133
- Terminals (T) in context-free grammar, 54
- Three-address code, 204-205
- exercises, 324-325
- partitioning into basic blocks, 271-273
- Three-address statements, representation of, 205-207, 296
- Tokens, lexical analysis and, 5
- Top-down parsing
- defined and described, 91-92
- exercises, 326
- implementation, 94-118
- predictive top-down parsers, 118-133
- Translations and translation schemes
- of arithmetic expressions, 208-211
- of array references, 225-229
- of Boolean expressions, 211-214
- of decrement and increment operators, 224-225
- examples of, 235-238
- exercises, 325
- intermediate code generation and, 203-205
- of logical expressions, 214-224
- procedure calls and, 234-235
- specification of, 195-196
- of SWITCH / CASE statements, 229-234
- syntax-directed definitions, 195-201
- Trees. See Parse trees
- Triple representation, 206
- Index
- U
- Union set operation, 7
- and regular sets, 47
- Unit productions
- defined, 73
- elimination of, 73-75
- Unreachable states of DFAs, 27
- detecting, 28-31
- Index
- V
- Variables (V) in context-free grammar, 54, 56
- Index
- W-X
- WHILE statements and translation, 219-220
- Index
- Y-Z
- YACC, error handling and, 264
- List of Figures
- Chapter 1: Introduction
- Figure 1.1: Compilation process phases.
- Figure 1.2: Syntax analysis imposes a structure hierarchy on the token string.
- Chapter 2: Finite Automata and Regular Expressions
- Figure 2.1: Transition diagram for finite automata δ (p, a) = q.
- Figure 2.2: Transition diagram for finite automata that handles several transitions.
- Figure 2.3: Transition diagram for M = ({q0, q1, q2, q3}, {0, 1}, δ, q0, {q3}).
- Figure 2.4: Finite automata with ∈ -moves.
- Figure 2.5: Transitioning from an ∈ -move NFA to a non- ∈ -move NFA.
- Figure 2.6: Making the initial state of the NFA one of the final states.
- Figure 2.7: Example 2.1 NFA.
- Figure 2.8: Example 2.2 DFA equivalent to an NFA.
- Figure 2.9: Partitioning down to a single state.
- Figure 2.10: Merging nondistinguishable states B and C into a single state B1.
- Figure 2.11: Transition diagram for Example 2.3 finite automata.
- Figure 2.12: Finite automata containing even number of zeros and odd number of ones.
- Figure 2.13: Finite automata containing odd number of zeros and even number of ones.
- Figure 2.14: Example 2.6 finite automata considers the set prefix.
- Figure 2.15: Finite automata accepts strings containing the substring 101.
- Figure 2.16: DFA using the names A-D and q0-q5.
- Figure 2.17: Complement to Figure 2.16 automata.
- Figure 2.18: DFA after minimization.
- Figure 2.19: Finite automata that accepts string decimals that are divisible by three.
- Figure 2.20: Finite automata accepts strings containing 101.
- Figure 2.21: Finite automata identified by the state names A-D and q0-q5.
- Figure 2.22: Complement to Figure 2.21 automata.
- Figure 2.23: Minimization of nondistinguishable states of Figure 2.22.
- Figure 2.24: Automata that accepts binary strings that are divisible by three.
- Figure 2.25: Transition diagram for (a + b).
- Figure 2.26: Transition diagram for (a + b)*.
- Figure 2.27: Transition diagram for a. (a + b)*.
- Figure 2.28: Automata for a.(a + b)* .b.
- Figure 2.29: Automata for a.(a + b)*.b.b.
- Figure 2.30: Deriving the regular expression for a regular set.
- Figure 2.31: Transition diagram.
- Figure 2.32: Complement to transition diagram in Figure 2.31.
- Figure 2.33: Transition diagram of automata M 1 .
- Figure 2.34: Transition diagram of automata M 2 .
- Chapter 3: Context-Free Grammar and Syntax Analysis
- Figure 3.1: Derivation tree for the string id + id * id.
- Figure 3.2: Parse tree resulting from leaf-node concatenation.
- Figure 3.3: Multiple parse trees.
- Figure 3.4: Ambiguous grammar parse trees.
- Figure 3.5: Parse tree generated by using both the right- and left-most derivation orders.
- Figure 3.6: Parse tree generated from both the left- and right-most orders of derivation.
- Figure 3.7: Transition diagram for automata that accepts the regular grammar of Example 3.13.
- Figure 3.8: Deterministic equivalent of the non-deterministic automata shown in Figure 3.7.
- Figure 3.9: Non-deterministic automata.
- Figure 3.10: Transition diagram for deterministic automata equivalent shown in Figure 3.9.
- Figure 3.11: Regular-grammar automata.
- Figure 3.12: Transition diagram of automata that accepts L(G1).
- Figure 3.13: Transition diagram of automata after removal of state D.
- Figure 3.14: Transition diagram for the automata that results from merged states.
- Figure 3.15: Non-deterministic automata that accepts L(G2).
- Figure 3.16: Transition diagram of the equivalent deterministic automata for Figure 3.15.
- Figure 3.17: Finite automata accepting the right linear grammar for a regular expression.
- Figure 3.18: Transition diagram for a finite automata specified by a reversed regular expression.
- Chapter 4: Top-Down Parsing
- Figure 4.1: Parser uses the S-production to expand the parse tree.
- Figure 4.2: Parser uses the first alternative for A in order to expand the tree.
- Figure 4.3: If the parser fails to match a leaf, the point of failure, d, reroutes (backtracks) the pointer to
- alternative paths from A.
- Figure 4.4: The parser first expands S and fails to accept w = acdb.
- Figure 4.5: The parser advances to c and considers nonterminal A for expansion.
- Figure 4.6: The parser first expands S.
- Figure 4.7: The parser advances the pointer to a second occurrence of a.
- Figure 4.8: The parser expands the next leaf labeled S.
- Figure 4.9: The parser finds no match, so it backtracks.
- Figure 4.10: The parser tries an alternate aa.
- Figure 4.11: There is no further alternate of S that can be tried, so the parser will backtrack one more step.
- Figure 4.12: The parser again finds a mismatch; hence, it backtracks.
- Figure 4.13: The parser tries an alternate aa.
- Figure 4.14: Since no alternate of S remains to be tried, the parser backtracks one more step.
- Figure 4.15: The parser tries an alternate aa.
- Figure 4.16: The parser arrives at the required parse tree.
- Figure 4.17: The parser first expands S.
- Figure 4.18: The parser advances the pointer to a second occurrence of a.
- Figure 4.19: The parser considers the next leaf labeled by S.
- Figure 4.20: The parser matches the third input symbol and moves on to the next leaf labeled by S.
- Figure 4.21: The parser considers the fourth occurrence of the input symbol a.
- Figure 4.22: The parser finds no match, so it backtracks.
- Figure 4.23: The parser tries an alternate aa.
- Figure 4.24: No alternate of S can be tried, so the parser will backtrack one more step.
- Figure 4.25: Again finding a mismatch, the parser backtracks.
- Figure 4.26: The parser then tries an alternate.
- Figure 4.27: No alternate of S remains to be tried, so the parser will backtrack one more step.
- Figure 4.28: The parser again finds a mismatch; therefore, it backtracks.
- Figure 4.29: The parser tries an alternate aa.
- Figure 4.30: The parser then tries an alternate aa.
- Figure 4.31: The parser successfully generates the parse tree for aaaa.
- Figure 4.32: The parser expands S.
- Figure 4.33: The parser matches the first symbol, advances to the second occurrence of a, and considers S for
- expansion.
- Figure 4.34: The parser finds a match for the second occurrence of a and expands S.
- Figure 4.35: The parser matches the third input symbol, considers the next leaf, and expands S.
- Figure 4.36: The parser matches the fourth input symbol, considers the next leaf, and expands S.
- Figure 4.37: A match is found for the fifth input symbol, so the parser considers the next leaf, and expands S.
- Figure 4.38: The sixth input symbol also matches. So the next leaf is considered, and S is expanded.
- Figure 4.39: No match is found, so the parser backtracks to S.
- Figure 4.40: The parser backtracks one more step.
- Figure 4.41: The parser tries the alternate aa.
- Figure 4.42: Again, a mismatch is found. So, the parser backtracks.
- Figure 4.43: No alternate of S remains, so the parser will backtrack one more step.
- Figure 4.44: The parser tries an alternate aa.
- Figure 4.45: Again, a mismatch is found. The parser backtracks.
- Figure 4.46: The parser then tries an alternate aa.
- Figure 4.47: A mismatch is found, and the parser backtracks.
- Figure 4.48: The parser tries for the alternate aa, fails to find a match, and cannot generate the parse tree for six
- occurrences of a.
- Chapter 5: Bottom-up Parsing
- Figure 5.1: NFA transition diagram recognizes viable prefixes.
- Figure 5.2: Using subset construction, a DFA equivalent is derived from the transition diagram in Figure 5.1.
- Figure 5.3: DFA transition diagram showing four iterations for a canonical collection of sets.
- Figure 5.4: Transition diagram for Example 5.2 DFA.
- Figure 5.5: DFA Transition diagram.
- Figure 5.6: Transition diagram for the canonical collection of sets of LR(1) items.
- Figure 5.7: Transition diagram for a DFA using a reduced collection.
- Figure 5.8: LR(0) underlying set representations that can cause SLR parser conflicts.
- Figure 5.9: LR(1) underlying set representations that can cause CLR/LALR parser conflicts.
- Figure 5.10: LR(0) underlying set representations that can cause an SLR parser reduce-reduce conflict.
- Figure 5.11: LR(1) underlying set representations that can cause a CLR/LALR parser reduce-reduce conflict.
- Figure 5.12: Sets of LR(1) items represent two different CLR(1) parser states.
- Figure 5.13: States are combined to form an LALR.
- Figure 5.14: LR(1) items represent two different states of the CLR(1) parser.
- Figure 5.15: LALR state set resulting from the combination of CLR(1) state sets.
- Figure 5.16: Transition diagram for augmented grammar DFA.
- Figure 5.17: States with actions in common point to the same location via an array.
- Figure 5.18: List that incorporates the ability to append actions.
- Figure 5.19: Transition diagram for the canonical collection of sets of LR(0) items in Example 5.3.
- Figure 5.20: DFA transition diagram for Example 5.4.
- Figure 5.21: Collection of nonempty sets of LR(1) items for Example 5.7.
- Chapter 6: Syntax-Directed Definitions and Translations
- Figure 6.1: The attribute value of node X is inherently dependent on the attribute value of node Y.
- Figure 6.2: An annotated parse tree.
- Figure 6.3: Parse tree with node attributes for the string int id1,id2,id3.
- Figure 6.4: Dependency graph with four nodes.
- Figure 6.5: Parse tree for the string id+id*id.
- Figure 6.6: Syntax tree for id+id*id.
- Figure 6.7: Values of attributes at the parse tree node for the string a + b * c.
- Figure 6.8: Values of the attributes at the parse tree nodes for a + b * c, id.place = addr(symtab rec of a).
- Figure 6.9: Values of the attributes at the parse tree nodes for a + b * c, id.place = addr(symtab rec of a).
- Figure 6.10: Translation scheme for a Boolean expression containing and, not, and or.
- Figure 6.11: The addition of the nullable nonterminal N facilitates an unconditional jump.
- Figure 6.12: A nullable nonterminal M provisions the translation of if-then.
- Figure 6.13: The translation of the Boolean while statement is facilitated by a nullable nonterminal M.
- Figure 6.14: Translation of the Boolean do-while.
- Figure 6.15: Translation of Boolean repeat-until.
- Figure 6.16: Handling the translation of the Boolean for.
- Figure 6.17: A switch/case three-address translation.
- Figure 6.18: Nullable nonterminals are introduced into a switch statement translation.
- Figure 6.19: Contents of queue during the translation.
- Chapter 7: Symbol Table Management
- Figure 7.1: A pointer steers the symbol table to remotely stored information for the array a.
- Figure 7.2: Symbol table names are held either in the symbol table record or in a separate string table.
- Figure 7.3: A new record is added to the linear list of records.
- Figure 7.4: The search tree organization approach to a symbol table.
- Figure 7.5: Hash table method of symbol table organization.
- Figure 7.6: Symbol table organization that complies with static scope information rules.
- Chapter 8: Storage Management
- Figure 8.1: Heap memory storage allows program-controlled data allocation.
- Figure 8.2: Typical format of an activation record.
- Figure 8.3: The CEP pointer is used to access the contents of the activation record.
- Figure 8.4: Typical callee code segment.
- Figure 8.5: An activation record that deals with nonlocal name references.
- Figure 8.6: A typical callee segment.
- Figure 8.7: Storage for declared names.
- Chapter 10: Code Optimization
- Figure 10.1: Program flow graph.
- Figure 10.2: The flow graph back edges are identified by computing the dominators.
- Figure 10.3: A flow graph with no back edges.
- Figure 10.4: Flow graph with GEN and KILL block sets.
- Figure 10.5: Nonunique solution to a data flow equation, where B is a predecessor of itself.
- Figure 10.6: A flow graph containing a loop invariant statement.
- Figure 10.7: Moving a loop invariant statement changes the semantics of the program.
- Figure 10.8: Moving the preheader changes the meaning of the program.
- Figure 10.9: Moving a value to the preheader changes the original meaning of the program.
- Figure 10.10: Flow graph where I is a basic induction variable.
- Figure 10.11: Modified flow graph.
- Figure 10.12: Flow graph preheader modifications.
- Figure 10.13: DAG representation of a basic block.
- Chapter 11: Code Generation
- Figure 11.1: DAG Representation.
- Figure 11.2: DAG Representation with heuristic ordering.
- Figure 11.3: DAG representation of three-address code for Example 11.3.
- Figure 11.4: DAG representation tree after labeling.
- Figure 11.5: The node n is an operand and not a subtree root.
- Figure 11.6: The node n is an operator, and n 1 and n 2 are subtree roots.
- Figure 11.7: A nontree DAG.
- Figure 11.8: A DAG that has been partitioned into a set of trees.
- Figure 11.9: Labeled tree for Example 11.4.
- Figure 11.10: Tree with a label of two.
- Figure 11.11: The left and right subtrees have been interchanged, reducing the register requirement to one.
- Figure 11.12: Associativity is used to reduce a tree's register requirement.
- Chapter 12: Exercises
- Figure 12.1: State transition diagram.
- List of Tables
- Chapter 4: Top-Down Parsing
- Table 4.1: Production Selections for Parsing Derivations
- Table 4.2: Production Selections for Parsing Derivations
- Table 4.3: Steps Involved in Parsing the String acdb
- Table 4.4: Production Selections for String ab Parsing Derivations
- Table 4.5: Production Selections for Parsing Derivations for the String adb
- Table 4.6: Production Selections for Example 4.3 Parsing Derivations
- Table 4.7: Production Selections for Example 4.5 Parsing Derivations
- Table 4.8: Production Selections for Example 4.6 Parsing Derivations
- Table 4.9: Production Selections for Example 4.7 Parsing Derivations
- Table 4.10: Production Selections for Example 4.8 Parsing Derivations
- Chapter 5: Bottom-up Parsing
- Table 5.1: Sentential Form Handles
- Table 5.2: Sentential Form Handles
- Table 5.3: Steps in Parsing the String id + id * id
- Table 5.4: Action|GOTO SLR Parsing Table
- Table 5.5: SLR Parsing Table
- Table 5.6: Action | GOTO SLR Parsing Table
- Table 5.7: CLR/LR Parsing Action | GOTO Table
- Table 5.8: LALR Parsing Table for a DFA Using a Reduced Collection
- Table 5.9: SLR Parsing Table for Augmented Grammar
- Table 5.10: SLR Parsing Table Reflects Higher Precedence and Left-Associativity
- Table 5.11: SLR(1) Parsing Table
- Table 5.12: SLR Parsing Table for Example 5.4
- Table 5.13: Parsing Table for Example 5.5
- Table 5.14: LALR(1) Parsing Table for Example 5.5
- Table 5.15: LALR(1) Parsing Table for Example 5.6
- Chapter 6: Syntax-Directed Definitions and Translations
- Table 6.1: Quadruple Representation of x = (a + b) * − c/d
- Table 6.2: Triple Representation of x = (a + b) * − c/d
- Table 6.3: Indirect Triple Representation of x = (a + b) * − c/d
- Chapter 7: Symbol Table Management
- Table 7.1: Symbol Table Contents Using a Nesting Depth Approach
- Chapter 9: Error Handling
- Table 9.1: Parsing Table for E → E + E | E * E | id
- Table 9.2: Higher Precedent * and Left-Associativity
- Table 9.3: Parsing Table with Error Routines
- Table 9.4: LR Parsing Table
- Table 9.5: Phrase Level Error-Recovery Implementation
- Chapter 10: Code Optimization
- Table 10.1: GEN and KILL sets for Figure 10.4 Flow Graph
- Table 10.2: IN and OUT Computation for Figure 10.5
- Table 10.3: First Iteration for the IN and OUT Values
- Table 10.4: Second Iteration for the IN and OUT Values
- Table 10.5: Third Iteration for the IN and OUT Values
- Table 10.6: Fourth Iteration for the IN and OUT Values
- Chapter 11: Code Generation
- Table 11.1: Six-Bit Registers for the Instruction MOV R0, R1
- Table 11.2: Six-Bit Registers for the Instruction MOV R0, R2
- Table 11.3: Six-Bit Registers for the Instruction MOV M1, M2
- Table 11.4: Computation for the Expression x = (a + b) − ((c + d) − e)
- Table 11.5: Generated Code with Only Two Available Registers, R0 and R1
- Table 11.6: Generated Code with Rearranged Computations
- Table 11.7: Recursive Gencode Calls
- List of Examples
- Chapter 2: Finite Automata and Regular Expressions
- EXAMPLE 2.1
- EXAMPLE 2.2
- EXAMPLE 2.3
- EXAMPLE 2.4
- EXAMPLE 2.5
- EXAMPLE 2.6
- EXAMPLE 2.7
- EXAMPLE 2.8
- EXAMPLE 2.9
- EXAMPLE 2.10
- Chapter 3: Context-Free Grammar and Syntax Analysis
- EXAMPLE 3.1
- EXAMPLE 3.2
- EXAMPLE 3.3
- EXAMPLE 3.4
- EXAMPLE 3.5
- EXAMPLE 3.6
- EXAMPLE 3.7
- EXAMPLE 3.8
- EXAMPLE 3.9
- EXAMPLE 3.10
- EXAMPLE 3.11
- EXAMPLE 3.12
- EXAMPLE 3.13
- EXAMPLE 3.14
- EXAMPLE 3.15
- Chapter 4: Top-Down Parsing
- EXAMPLE 4.1
- EXAMPLE 4.2
- EXAMPLE 4.3
- EXAMPLE 4.4
- EXAMPLE 4.5
- EXAMPLE 4.6
- EXAMPLE 4.7
- EXAMPLE 4.8
- Chapter 5: Bottom-up Parsing
- EXAMPLE 5.1
- EXAMPLE 5.2
- EXAMPLE 5.3
- EXAMPLE 5.4
- EXAMPLE 5.5
- EXAMPLE 5.6
- EXAMPLE 5.7
- EXAMPLE 5.8
- Chapter 6: Syntax-Directed Definitions and Translations
- EXAMPLE 6.1
- EXAMPLE 6.2
- EXAMPLE 6.3
- EXAMPLE 6.4
- EXAMPLE 6.5
- EXAMPLE 6.6
- Chapter 11: Code Generation
- EXAMPLE 11.1
- EXAMPLE 11.2
- EXAMPLE 11.3
- EXAMPLE 11.4
- Chapter 12: Exercises
- EXERCISE 12.1
- EXERCISE 12.2
- EXERCISE 12.3
- EXERCISE 12.4
- EXERCISE 12.5
- EXERCISE 12.6
- EXERCISE 12.7
- EXERCISE 12.8
- EXERCISE 12.9
- EXERCISE 12.10
- EXERCISE 12.11
- EXERCISE 12.12
- EXERCISE 12.13
- EXERCISE 12.14