  2. .Algorithms for Compiler Design
  3. by O.G. Kakde
  4. ISBN:1584501006
  5. Charles River Media © 2002 (334 pages)
  6. This text teaches the fundamental algorithms that underlie modern compilers, and focuses on the
  7. "front-end" of compiler design--lexical analysis, parsing, and syntax.
  8. Table of Contents
  9. Algorithms for Compiler Design
  10. Preface
  11. Chapter 1 -Introduction
  12. Chapter 2 -Finite Automata and Regular Expressions
  13. Chapter 3 -Context-Free Grammar and Syntax Analysis
  14. Chapter 4 -Top-Down Parsing
  15. Chapter 5 -Bottom-up Parsing
  16. Chapter 6 -Syntax-Directed Definitions and Translations
  17. Chapter 7 -Symbol Table Management
  18. Chapter 8 -Storage Management
  19. Chapter 9 -Error Handling
  20. Chapter 10-Code Optimization
  21. Chapter 11-Code Generation
  22. Chapter 12-Exercises
  23. Index
  24. List of Figures
  25. List of Tables
  26. List of Examples
  27. Back Cover
  28. A compiler translates a high-level language program into a functionally equivalent low-level language program that can be
  29. understood and executed by the computer. Crucial to any computer system, effective compiler design is also one of the most
  30. complex areas of system development. Before any code for a modern compiler is even written, many programmers have
  31. difficulty with the high-level algorithms that will be necessary for the compiler to function. Written with this in mind, Algorithms
  32. for Compiler Design teaches the fundamental algorithms that underlie modern compilers. The book focuses on the “front-end”
  33. of compiler design: lexical analysis, parsing, and syntax. Blending theory with practical examples throughout, the book
  34. presents these difficult topics clearly and thoroughly. The final chapters on code generation and optimization complete a
  35. solid foundation for learning the broader requirements of an entire compiler design.
  36. FEATURES
  37. Focuses on the “front-end” of compiler design—lexical analysis, parsing, and syntax—topics basic to any
  38. introduction to compiler design
  39. Covers storage management, error handling, and recovery
  40. Introduces important “back-end” programming concepts, including code generation and optimization
  41. Algorithms for Compiler Design
  42. O.G. Kakde
  43. CHARLES RIVER MEDIA, INC.
  44. Copyright © 2002, 2003 Laxmi Publications, LTD.
  45. O.G. Kakde. Algorithms for Compiler Design
  46. 1-58450-100-6
  47. No part of this publication may be reproduced in any way, stored in a retrieval system of any type, or transmitted by
  48. any means or media, electronic or mechanical, including, but not limited to, photocopy, recording, or scanning, without
  49. prior permission in writing from the publisher.
  50. Publisher: David Pallai
  51. Production: Laxmi Publications, LTD.
  52. Cover Design: The Printed Image
  53. CHARLES RIVER MEDIA, INC.
  54. 20 Downer Avenue, Suite 3
  55. Hingham, Massachusetts 02043
  56. 781-740-0400
  57. 781-740-8816 (FAX)
  58. info@charlesriver.com
  59. http://www.charlesriver.com
  60. Original Copyright 2002, 2003 by Laxmi Publications, LTD.
  61. O.G. Kakde. Algorithms for Compiler Design.
  62. Original ISBN: 81-7008-100-6
  63. All brand names and product names mentioned in this book are trademarks or service marks of their respective
  64. companies. Any omission or misuse (of any kind) of service marks or trademarks should not be regarded as intent to
  65. infringe on the property of others. The publisher recognizes and respects all marks used by companies,
  66. manufacturers, and developers as a means to distinguish their products.
  67. 02 7 6 5 4 3 2 First Edition
  68. CHARLES RIVER MEDIA titles are available for site license or bulk purchase by institutions, user groups,
  69. corporations, etc. For additional information, please contact the Special Sales Department at 781-740-0400.
  70. Acknowledgments
  71. The author wishes to thank all of the colleagues in the Department of Electronics and Computer Science Engineering
  72. at Visvesvaraya Regional College of Engineering Nagpur, whose constant encouragement and timely help have
  73. resulted in the completion of this book. Special thanks go to Dr. C. S. Moghe, with whom the author had long technical
  74. discussions, which found their place in this book. Thanks are due to the institution for providing all of the infrastructural
  75. facilities and tools for a timely completion of this book. The author would particularly like to acknowledge Mr. P. S.
  76. Deshpande and Mr. A. S. Mokhade for their invaluable help and support from time to time. Finally, the author wishes
  77. to thank all of his students.
  78. Preface
  79. This book on algorithms for compiler design covers the various aspects of designing a language translator in depth.
  80. The book is intended to be a basic reading material in compiler design.
  81. Enough examples and algorithms have been used to effectively explain various tools of compiler design. The first
  82. chapter gives a brief introduction of the compiler and is thus important for the rest of the book.
Other issues like context-free grammars, parsing techniques, syntax-directed definitions, symbol tables, code
optimization, and more are explained in the various chapters of the book.
  85. The final chapter has some exercises for the readers for practice.
  86. Chapter 1: Introduction
  87. 1.1 WHAT IS A COMPILER?
  88. A compiler is a program that translates a high-level language program into a functionally equivalent low-level
  89. language program. So, a compiler is basically a translator whose source language (i.e., language to be translated) is
  90. the high-level language, and the target language is a low-level language; that is, a compiler is used to implement a
  91. high-level language on a computer.
  92. 1.2 WHAT IS A CROSS-COMPILER?
  93. A cross-compiler is a compiler that runs on one machine and produces object code for another machine. The
  94. cross-compiler is used to implement the compiler, which is characterized by three languages:
1. The source language,
2. The object language, and
3. The language in which it is written.
  98. If a compiler has been implemented in its own language, then this arrangement is called a "bootstrap" arrangement.
  99. The implementation of a compiler in its own language can be done as follows.
  100. Implementing a Bootstrap Compiler
Suppose we have a new language, L, that we want to make available on machines A and B. As a first step, we can
write a small compiler, SCAA, which will translate an S subset of L into object code for machine A, and which is itself
written in a language available on A. We then write a compiler, SCSA, which is written in the S subset of L and
generates object code for machine A. But SCSA will not be able to execute unless and until it is translated by SCAA;
therefore, SCSA is given as input to SCAA, as shown below, producing a compiler for L that will run on machine A and
self-generate code for machine A: SCAA.
Now, if we want to produce another compiler to run on and produce code for machine B, the compiler can be written,
itself, in L and made available on machine B by using the following steps:
  109. 1.3 COMPILATION
  110. Compilation refers to the compiler's process of translating a high-level language program into a low-level language
  111. program. This process is very complex; hence, from the logical as well as an implementation point of view, it is
  112. customary to partition the compilation process into several phases, which are nothing more than logically cohesive
  113. operations that input one representation of a source program and output another representation.
  114. A typical compilation, broken down into phases, is shown in Figure 1.1.
  115. Figure 1.1: Compilation process phases.
  116. The initial process phases analyze the source program. The lexical analysis phase reads the characters in the source
  117. program and groups them into streams of tokens; each token represents a logically cohesive sequence of characters,
  118. such as identifiers, operators, and keywords. The character sequence that forms a token is called a "lexeme". Certain
  119. tokens are augmented by the lexical value; that is, when an identifier like xyz is found, the lexical analyzer not only
  120. returns id, but it also enters the lexeme xyz into the symbol table if it does not already exist there. It returns a pointer to
this symbol table entry as a lexical value associated with this occurrence of the token id. Therefore, a statement like
X := Y + Z will, after lexical analysis, be internally represented as id1 := id2 + id3.
The subscripts 1, 2, and 3 are used for convenience; the actual token is id. The syntax analysis phase imposes a
  124. hierarchical structure on the token string, as shown in Figure 1.2.
  125. Figure 1.2: Syntax analysis imposes a structure hierarchy on the token string.
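The following is a minimal Python sketch (not from the book; the function and the regular-expression pattern are illustrative) of the behavior just described for the statement X := Y + Z: each identifier is returned as the token id together with a pointer, here simply an index, into the symbol table.

import re

def tokenize(source, symbol_table):
    # group the characters of the source into (token, lexical value) pairs
    tokens = []
    for lexeme in re.findall(r":=|\+|[A-Za-z_]\w*", source):
        if lexeme in (":=", "+"):
            tokens.append((lexeme, None))                 # operators carry no lexical value
        else:
            if lexeme not in symbol_table:
                symbol_table[lexeme] = len(symbol_table)  # enter the lexeme if not already there
            tokens.append(("id", symbol_table[lexeme]))   # token plus symbol-table pointer
    return tokens

symtab = {}
print(tokenize("X := Y + Z", symtab))   # [('id', 0), (':=', None), ('id', 1), ('+', None), ('id', 2)]
print(symtab)                           # {'X': 0, 'Y': 1, 'Z': 2}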
  126. Intermediate Code Generation
  127. Some compilers generate an explicit intermediate code representation of the source program. The intermediate code
  128. can have a variety of forms. For example, a three-address code (TAC) representation for the tree shown in Figure 1.2
  129. will be:
where T1 and T2 are compiler-generated temporaries.
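The three-address code listing itself did not survive in this copy. As an illustration only (not necessarily the book's exact listing), a typical TAC rendering of the assignment X := Y + Z is:

T1 := Y + Z
X  := T1

A larger right-hand side is what brings further temporaries such as T2 into play; for instance, X := Y + Z * W could be rendered as T1 := Z * W, T2 := Y + T1, X := T2.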
  131. Code Optimization
  132. In the optimization phase, the compiler performs various transformations in order to improve the intermediate code.
  133. These transformations will result in faster-running machine code.
  134. Code Generation
  135. The final phase in the compilation process is the generation of target code. This process involves selecting memory
  136. locations for each variable used by the program. Then, each intermediate instruction is translated into a sequence of
  137. machine instructions that performs the same task.
  138. Compiler Phase Organization
This is the logical organization of the compiler. It reveals that certain phases of the compiler are heavily dependent on the
  140. source language and are independent of the code requirements of the target machine. All such phases, when grouped
  141. together, constitute the front end of the compiler; whereas those phases that are dependent on the target machine
  142. constitute the back end of the compiler. Grouping the compilation phases in the front and back ends facilitates the
  143. re-targeting of the code; implementation of the same source language on different machines can be done by rewriting
  144. only the back end.
  145. Note Different languages can also be implemented on the same machine by rewriting the front end and using the
  146. same back end. But to do this, all of the front ends are required to produce the same intermediate code; and this
  147. is difficult, because the front end depends on the source language, and different languages are designed with
  148. different viewpoints. Therefore, it becomes difficult to write the front ends for different languages by using a
  149. common intermediate code.
  150. Having relatively few passes is desirable from the point of view of reducing the compilation time. To reduce the
  151. number of passes, it is required to group several phases in one pass. For some of the phases, being grouped into one
  152. pass is not a major problem. For example, the lexical analyzer and syntax analyzer can easily be grouped into one
  153. pass, because the interface between them is a single token; that is, the processing required by the token is
  154. independent of other tokens. Therefore, these phases can be easily grouped together, with the lexical analyzer
working as a subroutine of the syntax analyzer, which is in charge of the entire analysis activity.
  156. Conversely, grouping some of the phases into one pass is not that easy. Grouping intermediate and object
  157. code-generation phases is difficult, because it is often very hard to perform object code generation until a sufficient
number of intermediate code statements have been generated. Here, the interface between the two is not based on
only one intermediate instruction: certain languages permit the use of a variable before it is declared. Similarly, many
languages also permit forward jumps. Therefore, it is not possible to generate object code for a construct until
sufficient intermediate code statements have been generated. To overcome this problem and enable the merging of
intermediate and object code generation into one pass, the technique called "back-patching" is used; the object code
is generated by leaving 'statement holes,' which will be filled in later when the information becomes available.
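As a rough sketch of back-patching (a minimal illustration, not the book's scheme; all names are invented), the Python fragment below emits a forward jump whose target is not yet known, records the resulting hole, and fills the hole in once the target address becomes available.

code = []      # generated instructions
holes = {}     # label -> indices of instructions still waiting for that label's address

def emit(instruction):
    code.append(instruction)

def emit_jump(label):
    holes.setdefault(label, []).append(len(code))
    code.append(("JMP", None))              # the target is left as a hole for now

def define_label(label):
    address = len(code)                     # the label is bound to the next instruction
    for index in holes.pop(label, []):
        code[index] = ("JMP", address)      # back-patch every jump that was waiting

emit_jump("end")                            # forward jump: target unknown at this point
emit(("ADD", "T1", "Y", "Z"))
define_label("end")                         # the hole at index 0 now becomes JMP 2
print(code)                                 # [('JMP', 2), ('ADD', 'T1', 'Y', 'Z')]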
  164. 1.3.1 Lexical Analysis Phase
  165. In the lexical analysis phase, the compiler scans the characters of the source program, one character at a time.
  166. Whenever it gets a sufficient number of characters to constitute a token of the specified language, it outputs that
  167. token. In order to perform this task, the lexical analyzer must know the keywords, identifiers, operators, delimiters, and
  168. punctuation symbols of the language to be implemented. So, when it scans the source program, it will be able to
  169. return a suitable token whenever it encounters a token lexeme. (Lexeme refers to the sequence of characters in the
  170. source program that is matched by language's character patterns that specify identifiers, operators, keywords,
  171. delimiters, punctuation symbols, and so forth.) Therefore, the lexical analyzer design must:
1. Specify the tokens of the language, and
2. Suitably recognize the tokens.
  174. We cannot specify the language tokens by enumerating each and every identifier, operator, keyword, delimiter, and
  175. punctuation symbol; our specification would end up spanning several pages—and perhaps never end, especially for
  176. those languages that do not limit the number of characters that an identifier can have. Therefore, token specification
  177. should be generated by specifying the rules that govern the way that the language's alphabet symbols can be
  178. combined, so that the result of the combination will be a token of that language's identifiers, operators, and keywords.
  179. This requires the use of suitable language-specific notation.
  180. Regular Expression Notation
Regular expression notation can be used for the specification of tokens because tokens constitute a regular set. It is
compact and precise, and for every regular expression there exists a deterministic finite automata (DFA) that accepts
the language specified by the regular expression. The DFA is used to recognize the language specified by the regular
expression notation, making the automatic construction of a recognizer of tokens possible. Therefore, the study of
regular expression notation and finite automata becomes necessary. Some definitions of the various terms used are
described below.
  186. 1.4 REGULAR EXPRESSION NOTATION/FINITE AUTOMATA DEFINITIONS
  187. String
  188. A string is a finite sequence of symbols. We use a letter, such as w, to denote a string. If w is the string, then the
  189. length of string is denoted as | w |, and it is a count of number of symbols of w. For example, if w = xyz, | w | = 3. If | w |
  190. = 0, then the string is called an "empty" string, and we use ∈ to denote the empty string.
  191. Prefix
  192. A string's prefix is the string formed by taking any number of leading symbols of string. For example, if w = abc, then ∈ ,
  193. a, ab, and abc are the prefixes of w. Any prefix of a string other than the string itself is called a "proper" prefix of the
  194. string.
  195. Suffix
  196. A string's suffix is formed by taking any number of trailing symbols of a string. For example, if w = abc, then ∈ , c, bc,
  197. and abc are the suffixes of the w. Similar to prefixes, any suffix of a string other than the string itself is called a "proper"
  198. suffix of the string.
  199. Concatenation
If w1 and w2 are two strings, then the concatenation of w1 and w2 is denoted as w1.w2; it is simply the string obtained by
writing w1 followed by w2 without any space in between (i.e., a juxtaposition of w1 and w2). For example, if w1 = xyz
and w2 = abc, then w1.w2 = xyzabc. If w is a string, then w.∈ = w, and ∈.w = w. Therefore, we conclude that ∈ (the empty
string) is a concatenation identity.
  204. Alphabet
  205. An alphabet is a finite set of symbols denoted by the symbol Σ .
  206. Language
  207. A language is a set of strings formed by using the symbols belonging to some previously chosen alphabet. For
  208. example, if Σ = { 0, 1 }, then one of the languages that can be defined over this Σ will be L = { ∈ , 0, 00, 000, 1, 11, 111,
  209. … }.
  210. Set
A set is a collection of objects. It can be denoted by the following methods:
1. We can enumerate the members by placing them within curly brackets ({ }). For example, the set
A is defined by: A = { 0, 1, 2 }.
2. We can use a predicate notation in which the set is denoted as: A = { x | P(x) }. This means
that A is the set of all those elements x for which the predicate P(x) is true. For example, a set of all
integers divisible by three will be denoted as: A = { x | x is an integer and x mod 3 = 0 }.
  219. Set Operations
Union: If A and B are two sets, then the union of A and B is denoted as: A ∪ B = { x | x is in A or x is in B }.
Intersection: If A and B are two sets, then the intersection of A and B is denoted as: A ∩ B = { x | x is in A and x is in B }.
Set difference: If A and B are two sets, then the difference of A and B is denoted as: A − B = { x | x is in A but not in B }.
Cartesian product: If A and B are two sets, then the Cartesian product of A and B is denoted as: A × B = { (a, b) | a is in A and b is in B }.
Power set: If A is a set, then the power set of A is denoted as: 2^A = { P | P is a subset of A } (i.e., the set of all possible subsets of A). For example, if A = { 1, 2 }, then 2^A = { φ, {1}, {2}, {1, 2} }.
Concatenation: If A and B are two sets, then the concatenation of A and B is denoted as: AB = { ab | a is in A and b is in B }. For example, if A = { 0, 1 } and B = { 1, 2 }, then AB = { 01, 02, 11, 12 }.
Closure: If A is a set, then the closure of A is denoted as: A* = A^0 ∪ A^1 ∪ A^2 ∪ … ∪ A^∞, where A^i is the ith power of the set A, defined as A^i = A.A.A … i times. Here:
A^0 = { ∈ } (i.e., the set of all possible combinations of members of A of length 0),
A^1 = A (i.e., the set of all possible combinations of members of A of length 1), and
A^2 = A.A (i.e., the set of all possible combinations of members of A of length 2).
Therefore, A* is the set of all possible combinations of the members of A. For example, if Σ = { 0, 1 }, then Σ* will be the
set of all possible combinations of zeros and ones, which is one of the languages defined over Σ.
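A small Python sketch of the concatenation and closure operations just defined, with the closure truncated at a length bound (A* itself is an infinite set):

def concat(A, B):
    return {a + b for a in A for b in B}

def closure(A, max_len):
    # A* restricted to strings of length <= max_len
    result, power = {""}, {""}              # A^0 = { ∈ }, represented here by the empty string
    while True:
        power = {w for w in concat(power, A) if len(w) <= max_len}
        if not power or power <= result:
            return result
        result |= power

A, B = {"0", "1"}, {"1", "2"}
print(sorted(concat(A, B)))                 # ['01', '02', '11', '12']
print(sorted(closure({"0", "1"}, 2)))       # ['', '0', '00', '01', '1', '10', '11']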
  239. 1.5 RELATIONS
Let A and B be two sets; then a relation R between A and B is nothing more than a set of ordered pairs (a, b)
such that a is in A and b is in B, and a is related to b by the relation R. That is:
R = { (a, b) | a is in A and b is in B, and a is related to b by R }
For example, if A = { 0, 1 } and B = { 1, 2 }, then we can define a relation of 'less than,' denoted by <, as follows:
< = { (0, 1), (0, 2), (1, 2) }
A pair (1, 1) will not belong to the < relation, because one is not less than one. Therefore, we conclude that a relation R
between sets A and B is a subset of A × B.
  246. If a pair (a, b) is in R, then aRb is true; otherwise, aRb is false.
  247. A is called a "domain" of the relation, and B is called a "range" of the relation. If the domain of a relation R is a set A,
  248. and the range is also a set A, then R is called as a relation on set A rather than calling a relation between sets A and
  249. B. For example, if A = { 0, 1, 2 }, then a < relation defined on A will result in: < = { (0, 1), (0, 2), (1, 2) }.
  250. 1.5.1 Properties of the Relation
Let R be some relation defined on a set A. Then:
1. R is said to be reflexive if aRa is true for every a in A; that is, if every element of A is related to
itself by the relation R, then R is called a reflexive relation.
2. If every aRb implies bRa (i.e., when a is related to b by R, b is also related to a by the same
relation R), then the relation R is a symmetric relation.
3. If every aRb and bRc implies aRc, then the relation R is said to be transitive; that is, when a is
related to b by R, and b is related to c by R, then a is also related to c by the relation R.
If R is reflexive and transitive, as well as symmetric, then R is an equivalence relation.
  263. Property Closure of a Relation
Let R be a relation defined on a set A, and let P be a set of properties. The property closure of the relation R, denoted
as the P-closure, is the smallest relation R′ that has the properties mentioned in P. It is obtained by adding every pair
(a, b) in R to R′, and then adding those pairs of members of A that will make the relation R′ have the properties in P. If
P contains only the transitivity property, then the P-closure is called the transitive closure of the relation, and we
denote the transitive closure of a relation R by R+; whereas when P contains the transitive as well as the reflexive
property, the P-closure is called the reflexive-transitive closure of the relation R, and we denote it by R*. R+ can be
obtained from R as follows:
R+ = R ∪ R.R ∪ R.R.R ∪ …
that is, a pair (a, c) is in R+ whenever there is a chain of pairs in R leading from a to c.
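The worked example that belongs here did not survive in this copy. As an illustration of the definition, the following Python sketch computes R+ for a relation given as a set of ordered pairs, by repeatedly adding the pairs obtained from chains of length two until nothing new appears:

def transitive_closure(R):
    closure = set(R)
    while True:
        new_pairs = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new_pairs <= closure:
            return closure
        closure |= new_pairs

R = {(0, 1), (1, 2)}
print(sorted(transitive_closure(R)))        # [(0, 1), (0, 2), (1, 2)]
# The reflexive-transitive closure R* would additionally contain (a, a) for every a in A.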
  272. Chapter 2: Finite Automata and Regular Expressions
  273. 2.1 FINITE AUTOMATA
A finite automata consists of a finite number of states and a finite number of transitions, and these transitions are
defined on certain, specific symbols called input symbols. One of the states of the finite automata is identified as the
initial state: the state in which the automata always starts. Similarly, certain states are identified as final states.
Therefore, a finite automata is specified using five things:
1. The states of the finite automata;
2. The input symbols on which the transitions are made;
3. The transitions, specifying from which state, on which input symbol, to where the transition goes;
4. The initial state; and
5. The set of final states.
Therefore, formally, a finite automata is a five-tuple:
M = (Q, Σ, δ, q0, F)
  284. where:
  285. Q is a set of states of the finite automata,
  286. Σ is a set of input symbols, and
  287. δ specifies the transitions in the automata.
  288. If from a state p there exists a transition going to state q on an input symbol a, then we write δ (p, a) = q. Hence, δ is a
  289. function whose domain is a set of ordered pairs, (p, a), where p is a state and a is an input symbol, and the range is a
  290. set of states.
Therefore, we conclude that δ defines a mapping whose domain will be a set of ordered pairs of the form (p, a) and
whose range will be a set of states. That is, δ defines a mapping from Q × Σ to Q.
q0 is the initial state, and F is a set of final states of the automata. For example:
  294. where
  295. A directed graph exists that can be associated with finite automata. This
  296. graph is called a "transition diagram of finite automata." To associate a graph with finite automata, the vertices of the
  297. graph correspond to the states of the automata, and the edges in the transition diagram are determined as follows.
  298. If δ (p, a) = q, then put an edge from the vertex, which corresponds to state p, to the vertex that corresponds to state q,
  299. labeled by a. To indicate the initial state, we place an arrow with its head pointing to the vertex that corresponds to the
  300. initial state of the automata, and we label that arrow "start." We then encircle the vertices twice, which correspond to
  301. the final states of the automata. Therefore, the transition diagram for the described finite automata will resemble Figure
  302. 2.1.
  303. Figure 2.1: Transition diagram for finite automata δ (p, a) = q.
  304. A tabular representation can also be used to specify the finite automata. A table whose number of rows is equal to the
  305. number of states, and whose number of columns equals the number of input symbols, is used to specify the transitions
  306. in the automata. The first row specifies the transitions from the initial state; the rows specifying the transitions from the
  307. final states are marked as *. For example, the automata above can be specified as follows:
  308. A finite automata can be used to accept some particular set of strings. If x is a string made of symbols belonging to Σ
  309. of the finite automata, then x is accepted by the finite automata if a path corresponding to x in a finite automata starts
  310. in an initial state and ends in one of the final states of the automata; that is, there must exist a sequence of moves for x
  311. in the finite automata that takes the transitions from the initial state to one of the final states of the automata. Since x is
a member of Σ*, we define a new transition function, δ1, which defines a mapping from Q × Σ* to Q. And if δ1(q0, x) is
a member of F, then x is accepted by the finite automata. If x is written as wa, where a is the last symbol of x, and w is
a string of the remaining symbols of x, then:
δ1(q0, wa) = δ(δ1(q0, w), a)
  315. For example:
  316. where
Let x be 010. To find out if x is accepted by the automata or not, we proceed as follows:
δ1(q0, 0) = δ(q0, 0) = q1
Therefore, δ1(q0, 01) = δ(δ1(q0, 0), 1) = δ(q1, 1) = q0
Therefore, δ1(q0, 010) = δ(δ1(q0, 01), 0) = δ(q0, 0) = q1
Since q1 is a member of F, x = 010 is accepted by the automata.
If x = 0101, then δ1(q0, 0101) = δ(δ1(q0, 010), 1) = δ(q1, 1) = q0
Since q0 is not a member of F, x is not accepted by the above automata.
Therefore, if M is the finite automata, then the language accepted by the finite automata is denoted as
L(M) = { x | δ1(q0, x) is a member of F }.
In the finite automata discussed above, since δ defines a mapping from Q × Σ to Q, there exists exactly one transition
from a state on an input symbol; and therefore, this finite automata is considered a deterministic finite automata (DFA).
Therefore, we define the DFA as the finite automata:
M = (Q, Σ, δ, q0, F), such that there exists exactly one transition from a state on an input symbol.
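The tabular representation and the extended transition function δ1 can be sketched in Python as follows. Only δ(q0, 0) = q1 and δ(q1, 1) = q0 are actually fixed by the worked example above; the two transitions marked "assumed" are filled in purely so that the sketch is runnable.

delta = {
    ("q0", "0"): "q1",
    ("q1", "1"): "q0",
    ("q0", "1"): "q0",    # assumed for illustration
    ("q1", "0"): "q1",    # assumed for illustration
}
start, finals = "q0", {"q1"}

def accepts(x):
    state = start                       # δ1(q0, ∈) = q0
    for a in x:
        state = delta[(state, a)]       # δ1(q0, wa) = δ(δ1(q0, w), a)
    return state in finals

print(accepts("010"))    # True:  δ1(q0, 010) = q1, a final state
print(accepts("0101"))   # False: δ1(q0, 0101) = q0, not a final state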
  330. 2.2 NON-DETERMINISTIC FINITE AUTOMATA
  331. If the basic finite automata model is modified in such a way that from a state on an input symbol zero, one or more
  332. transitions are permitted, then the corresponding finite automata is called a "non-deterministic finite automata" (NFA).
Therefore, an NFA is a finite automata in which there may exist more than one path corresponding to x in Σ* (because
zero, one, or more transitions are permitted from a state on an input symbol), whereas in a DFA there exists exactly
one path corresponding to x in Σ*. Hence, an NFA is nothing more than a finite automata:
in which δ defines a mapping from Q × Σ to 2^Q (to take care of zero, one, or more transitions). For example, consider the
finite automata shown below:
  338. where:
  339. The transition diagram of this automata is:
  340. Figure 2.2: Transition diagram for finite automata that handles several transitions.
  341. 2.2.1 Acceptance of Strings by Non-deterministic Finite Automata
  342. Since an NFA is a finite automata in which there may exist more than one path corresponding to x in Σ *, and if this is,
  343. indeed, the case, then we are required to test the multiple paths corresponding to x in order to decide whether or not x
  344. is accepted by the NFA, because, for the NFA to accept x, at least one path corresponding to x is required in the NFA.
  345. This path should start in the initial state and end in one of the final states. Whereas in a DFA, since there exists exactly
  346. one path corresponding to x in Σ *, it is enough to test whether or not that path starts in the initial state and ends in one
  347. of the final states in order to decide whether x is accepted by the DFA or not.
  348. Therefore, if x is a string made of symbols in Σ of the NFA (i.e., x is in Σ *), then x is accepted by the NFA if at least one
  349. path exists that corresponds to x in the NFA, which starts in an initial state and ends in one of the final states of the
NFA. Since x is a member of Σ* and there may exist zero, one, or more transitions from a state on an input symbol, we
define a new transition function, δ1, which defines a mapping from 2^Q × Σ* to 2^Q; and if δ1({q0}, x) = P, where P is a set
containing at least one member of F, then x is accepted by the NFA. If x is written as wa, where a is the last symbol of
x, and w is a string made of the remaining symbols of x, then:
  354. For example, consider the finite automata shown below:
  355. where:
If x = 0111, then to find out whether or not x is accepted by the NFA, we proceed as follows:
Since δ1({q0}, 0111) = {q1, q2, q3}, which contains q3, a member of F of the NFA, x = 0111 is accepted by the NFA.
Therefore, if M is an NFA, then the language accepted by the NFA is defined as:
L(M) = { x | δ1({q0}, x) = P, where P contains at least one member of F }.
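A corresponding Python sketch for an NFA: δ1 is simulated by carrying along the whole set of states reachable so far. The transition relation is a dict from (state, symbol) to a set of states, and a missing entry means there is no transition. The example NFA used below (strings ending in 01) is invented for illustration and is not the automata of the text.

def nfa_accepts(delta, start, finals, x):
    current = {start}                                                   # δ1({q0}, ∈) = {q0}
    for a in x:
        current = set().union(*(delta.get((q, a), set()) for q in current))
    return bool(current & finals)

delta = {("q0", "0"): {"q0", "q1"}, ("q0", "1"): {"q0"}, ("q1", "1"): {"q2"}}
print(nfa_accepts(delta, "q0", {"q2"}, "1101"))   # True
print(nfa_accepts(delta, "q0", {"q2"}, "0110"))   # False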
  361. 2.3 TRANSFORMING NFA TO DFA
  362. For every non-deterministic finite automata, there exists an equivalent deterministic finite automata. The equivalence
  363. between the two is defined in terms of language acceptance. Since an NFA is a nothing more than a finite automata in
  364. which zero, one, or more transitions on an input symbol is permitted, we can always construct a finite automata that
  365. will simulate all the moves of the NFA on a particular input symbol in parallel. We then get a finite automata in which
  366. there will be exactly one transition on an input symbol; hence, it will be a DFA equivalent to the NFA.
Since the DFA equivalent of the NFA simulates (parallels) the moves of the NFA, every state of the DFA will be a
combination of one or more states of the NFA. Hence, every state of the DFA will be represented by some subset of the
set of states of the NFA; and therefore, the transformation from NFA to DFA is normally called the "subset
construction." Therefore, if a given NFA has n states, then the equivalent DFA will have at most 2^n states, with the initial
state corresponding to the subset {q0}. Therefore, the transformation from NFA to DFA involves finding all possible
subsets of the set of states of the NFA, considering each subset to be a state of a DFA, and then finding the transition
from it on every input symbol. But all the states of a DFA obtained in this way might not be reachable from the initial
state; and if a state is not reachable from the initial state on any possible input sequence, then such a state does not
play a role in deciding what language is accepted by the DFA. (Such states are those states of the DFA that have
  376. outgoing transitions on the input symbols—but either no incoming transitions, or they only have incoming transitions
  377. from other unreachable states.) Hence, the amount of work involved in transforming an NFA to a DFA can be
  378. reduced if we attempt to generate only reachable states of a DFA. This can be done by proceeding as follows:
Let M = (Q, Σ, δ, q0, F) be an NFA to be transformed into a DFA.
Let Q1 be the set of states of the equivalent DFA.
begin:
    Q1old = Φ
    Q1new = { {q0} }
    while (Q1old ≠ Q1new)
    {
        Temp = Q1new − Q1old
        Q1old = Q1new
        for every subset P in Temp do
            for every a in Σ do
                if the transition from P on a goes to a new subset S of Q
                (the transition from P on a is obtained by finding the
                transitions from every member of P on a in the given NFA
                and then taking the union of all such transitions)
                then
                    Q1new = Q1new ∪ { S }
    }
    Q1 = Q1new
end
A subset P in Q1 will be a final state of the DFA if P contains at least one member of F of the NFA. For example,
  401. consider the following finite automata:
  402. where:
The DFA equivalent of this NFA can be obtained as follows:

                    0            1
{q0}                {q1}         Φ
{q1}                {q1}         {q1, q2}
{q1, q2}            {q1}         {q1, q2, q3}
*{q1, q2, q3}       {q1, q3}     {q1, q2, q3}
*{q1, q3}           {q1, q3}     {q1, q2, q3}
Φ                   Φ            Φ
  412. The transition diagram associated with this DFA is shown in Figure 2.3.
Figure 2.3: Transition diagram for M = ({q0, q1, q2, q3}, {0, 1}, δ, q0, {q3}).
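The procedure above can be rendered in Python roughly as follows (an illustrative sketch, reusing the dictionary representation of the earlier examples). DFA states are frozensets of NFA states, and only the subsets reachable from {q0} are generated.

def nfa_to_dfa(delta, start, finals, sigma):
    start_set = frozenset({start})
    dfa_delta, seen, worklist = {}, {start_set}, [start_set]
    while worklist:
        P = worklist.pop()
        for a in sigma:
            # union of the transitions on a from every member of P
            S = frozenset().union(*(delta.get((q, a), set()) for q in P))
            dfa_delta[(P, a)] = S
            if S not in seen:
                seen.add(S)
                worklist.append(S)
    dfa_finals = {P for P in seen if P & finals}      # final if P contains a member of F
    return dfa_delta, start_set, dfa_finals

Applied to the small illustrative NFA of the previous sketch (strings ending in 01), this yields exactly three reachable DFA states: {q0}, {q0, q1}, and {q0, q2}.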
  414. 2.4 THE NFA WITH ∈ -MOVES
  415. If a finite automata is modified to permit transitions without input symbols, along with zero, one, or more transitions on
  416. the input symbols, then we get an NFA with ‘ ∈ -moves,’ because the transitions made without symbols are called
  417. " ∈ -transitions."
  418. Consider the NFA shown in Figure 2.4.
  419. Figure 2.4: Finite automata with ∈ -moves.
  420. This is an NFA with ∈ -moves because it is possible to transition from state q 0 to q 1 without consuming any of the
  421. input symbols. Similarly, we can also transition from state q 1 to q 2 without consuming any input symbols. Since it is a
finite automata, an NFA with ∈-moves will also be denoted as a five-tuple:
M = (Q, Σ, δ, q0, F)
where Q, Σ, q0, and F have the usual meanings, and δ defines a mapping from Q × (Σ ∪ { ∈ }) to 2^Q
(to take care of the ∈-transitions as well as the non-∈-transitions).
  425. Acceptance of a String by the NFA with ∈-Moves
A string x in Σ* will be accepted by the NFA with ∈-moves if at least one path exists that corresponds to x, starting in an
initial state and ending in one of the final states. But since this path may be formed by ∈-transitions as well as
non-∈-transitions, to find out whether x is accepted or not by the NFA with ∈-moves, we must define a function,
∈-closure(q), where q is a state of the automata.
The function ∈-closure(q) is defined as follows:
∈-closure(q) = the set of all those states of the automata that can be reached from q on a path labeled by ∈.
  433. For example, in the NFA with ∈ -moves given above:
  434. ∈ -closure(q 0 ) = { q 0 , q 1 , q 2 }
  435. ∈ -closure(q 1 ) = { q 1 , q 2 }
  436. ∈ -closure(q 2 ) = { q 2 }
The function ∈-closure(q) will never be an empty set, because q is always reachable from itself without consuming any
input symbol; that is, on a path labeled by ∈, q will always exist in ∈-closure(q).
  440. If P is a set of states, then the ∈ -closure function can be extended to find ∈ -closure(P ), as follows:
  441. 2.4.1 Algorithm for Finding ∈ -Closure(q)
  442. Let T be the set that will comprise ∈ -closure(q). We begin by adding q to T, and then initialize the stack by pushing q
  443. onto stack:
  444. while (stack not empty) do
  445. {
  446. p = pop (stack)
  447. R = δ (p, ∈ )
  448. for every member of R do
  449. if it is not present in T then
  450. {
  451. add that member to T
  452. push member of R on stack
  453. }
  454. }
Since x is a member of Σ*, and there may exist zero, one, or more transitions from a state on an input symbol, we
define a new transition function, δ1, which defines a mapping from 2^Q × Σ* to 2^Q. If x is written as wa, where a is the
last symbol of x and w is a string made of the remaining symbols of x, then:
since δ1 defines a mapping from 2^Q × Σ* to 2^Q.
The string x is accepted by the automata if δ1(∈-closure(q0), x) = P
such that P contains at least one member of F and:
  460. For example, in the NFA with ∈ -moves, given above, if x = 01, then to find out whether x is accepted by the automata
  461. or not, we proceed as follows:
Therefore:
∈-closure(δ1(∈-closure(q0), 01)) = ∈-closure({q1}) = {q1, q2}
Since q2 is a final state, x = 01 is accepted by the automata.
  465. Equivalence of NFA with ∈-Moves to NFA Without ∈-Moves
  466. For every NFA with ∈ -moves, there exists an equivalent NFA without ∈ -moves that accepts the same language. To
  467. obtain an equivalent NFA without ∈ -moves, given an NFA with ∈ -moves, what is required is an elimination of
  468. ∈ -transitions from a given automata. But simply eliminating the ∈ -transitions from a given NFA with ∈ -moves will
  469. change the language accepted by the automata. Hence, for every ∈ -transition to be eliminated, we have to add some
non-∈-transitions as substitutes in order to maintain the language's acceptance by the automata. Therefore,
transforming an NFA with ∈-moves to an NFA without ∈-moves involves finding the non-∈-transitions that must be
added to the automata for every ∈-transition to be eliminated.
  473. Consider the NFA with ∈ -moves shown in Figure 2.5.
  474. Figure 2.5: Transitioning from an ∈ -move NFA to a non- ∈ -move NFA.
  475. There are ∈ -transitions from state q 0 to q 1 and from state q 1 to q 2 . To eliminate these ∈ -transitions, we must add a
  476. transition on 0 from q 0 to q 1 , as well as from state q 0 to q 2 . Similarly, a transition must be added on 1 from q 0 to q 1 , as
  477. well as from state q 0 to q 2 , because the presence of these ∈ -transitions in a given automata makes it possible to
  478. reach from q 0 to q 1 on consuming only 0, and it is possible to reach from q 0 to q 2 on consuming only 0. Similarly, it is
  479. possible to reach from q 0 to q 1 on consuming only 1, and it is possible to reach from q 0 to q 2 on consuming only 1. It is
  480. also possible to reach from q 1 to q 2 on consuming 0 as well as 1; and therefore, a transition from q 1 to q 2 on 0 and 1 is
also required to be added. Since ∈ is also accepted by the given NFA with ∈-moves, to accept ∈, the initial state of the
NFA without ∈-moves is required to be marked as one of the final states. Therefore, by adding these
non-∈-transitions, and by making the initial state one of the final states, we get the automata shown in Figure 2.6.
  484. Figure 2.6: Making the initial state of the NFA one of the final states.
  485. Therefore, when transforming an NFA with ∈ -moves into an NFA without ∈ -moves, only the transitions are required
to be changed; the states are not required to be changed. But if a given NFA with ∈-moves accepts ∈ (i.e., if
∈-closure(q0) contains a member of F), then q0 is also required to be marked as one of the final states if it is not
already a member of F. Hence:
If M = (Q, Σ, δ, q0, F) is an NFA with ∈-moves, then its equivalent NFA without ∈-moves will be M1 = (Q, Σ, δ1, q0, F1),
where δ1(q, a) = ∈-closure(δ(∈-closure(q), a))
and
F1 = F ∪ {q0} if ∈-closure(q0) contains a member of F
F1 = F otherwise
  494. For example, consider the following NFA with ∈ -moves:
where δ is given by:

        0       1       ∈
q0      {q0}    φ       {q1}
q1      φ       {q1}    {q2}
q2      φ       {q2}    φ
  508. Its equivalent NFA without ∈ -moves will be:
where δ1 is given by:

        0               1
q0      {q0, q1, q2}    {q1, q2}
q1      φ               {q1, q2}
q2      φ               {q2}
  519. Since there exists a DFA for every NFA without ∈ -moves, and for every NFA with ∈ -moves there exists an equivalent
  520. NFA without ∈ -moves, we conclude that for every NFA with ∈ -moves there exists a DFA.
  521. 2.5 THE NFA WITH ∈ -MOVES TO THE DFA
There always exists a DFA equivalent to an NFA with ∈-moves, and it can be obtained as follows. The initial state of
the DFA is ∈-closure(q0), and the transition from a subset P on an input symbol a is found by taking the transitions on a
from every member of P in the given NFA and then taking the ∈-closure of their union.
If this transition generates a new subset of Q, then it will be added to Q1, and the next time, transitions from it are found;
we continue in this way until we cannot add any new states to Q1. After this, we identify those states of the DFA whose
subset representations contain at least one member of F. If ∈-closure(q0) does not contain a member of F, then the set
of such states of the DFA constitutes F1; but if ∈-closure(q0) contains a member of F, then we identify those members of
Q1 whose subset representations contain at least one member of F or contain q0, and F1 will be the set of these states.
  530. Consider the following NFA with ∈ -moves:
where δ is given by:

        0       1       ∈
q0      {q0}    φ       {q1}
q1      φ       {q1}    {q2}
q2      φ       {q2}    φ
  544. A DFA equivalent to this will be:
where δ1 is given by:

                    0               1
{q0, q1, q2}        {q0, q1, q2}    {q1, q2}
{q1, q2}            φ               {q1, q2}
φ                   φ               φ
If we identify the subsets {q0, q1, q2}, {q1, q2}, and φ as A, B, and C, respectively, then the automata will be:
where δ1 is given by:

        0   1
A       A   B
B       C   B
C       C   C
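Combining the two earlier sketches gives a direct ∈-NFA-to-DFA conversion in Python: the start state is ∈-closure(q0), and after every symbol read the ∈-closure of the reached set is taken. This is only a sketch, but applied to the automata tabulated above it yields exactly the three states named A, B, and C.

def enfa_to_dfa(delta, start, finals, sigma):
    def close(P):
        # ∈-closure of a set of states, reusing epsilon_closure from the earlier sketch
        return frozenset().union(*(epsilon_closure(delta, q) for q in P))
    start_set = close({start})
    dfa_delta, seen, worklist = {}, {start_set}, [start_set]
    while worklist:
        P = worklist.pop()
        for a in sigma:
            moved = set().union(*(delta.get((q, a), set()) for q in P))
            S = close(moved)
            dfa_delta[(P, a)] = S
            if S not in seen:
                seen.add(S)
                worklist.append(S)
    return dfa_delta, start_set, {P for P in seen if P & finals}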
  560. EXAMPLE 2.1
  561. Obtain a DFA equivalent to the NFA shown in Figure 2.7.
  562. Figure 2.7: Example 2.1 NFA.
  563. A DFA equivalent to NFA in Figure 2.7 will be:
  564. 0 1
  565. {q 0 } {q 0 , q 1 } {q 0 }
  566. {q 0 , q 1 } {q 0 , q 1 } {q 0 , q 2 }
  567. {q 0 , q 2 } {q 0 , q 1 } {q 0 , q 3 }
  568. {q 0 , q 2 , q 3 }* {q 0 , q 1 , q 3 } {q 0 , q 3 }
  569. {q 0 , q 1 , q 3 }* {q 0 , q 3 } {q 0 , q 2 , q 3 }
  570. {q 0 , q 3 }* {q 0 , q 1 , q 3 } {q 0 , q 3 }
where {q0} corresponds to the initial state of the automata, and the states marked with * are final states. If we rename
the states as follows:
  573. {q 0 } A
  574. {q 0 , q 1 } B
  575. {q 0 , q 2 } C
  576. {q 0 , q 2 , q 3 } D
  577. {q 0 , q 1 , q 3 } E
  578. {q 0 , q 3 } F
  579. then the transition table will be:
  580. 0 1
  581. A B A
  582. B B C
  583. C B F
  584. D* E F
  585. E* F D
  586. F* E F
  587. EXAMPLE 2.2
  588. Obtain a DFA equivalent to the NFA illustrated in Figure 2.8.
  589. Figure 2.8: Example 2.2 DFA equivalent to an NFA.
  590. A DFA equivalent to the NFA shown in Figure 2.8 will be:
  591. 0 1
  592. {q 0 } {q 0 } {q 0 , q 1 }
  593. {q 0 , q 1 } {q 0 , q 2 } {q 0 , q 1 }
  594. {q 0 , q 2 } {q 0 } {q 0 , q 1 , q 3 }
  595. {q 0 , q 1 , q 3 }* {q 0 , q 2 , q 3 } {q 0 , q 1 , q 3 }
  596. {q 0 , q 2 , q 3 }* {q 0 , q 3 } {q 0 , q 1 , q 3 }
  597. {q 0 , q 3 }* {q 0 , q 3 } {q 0 , q 1 , q 3 }
  598. where {q 0 } corresponds to the initial state of the automata, and the states marked as * are final states. If we rename
  599. the states as follows:
  600. {q 0 } A
  601. {q 0 , q 1 } B
  602. {q 0 , q 2 } C
  603. {q 0 , q 2 , q 3 } D
  604. {q 0 , q 1 , q 3 } E
  605. {q 0 , q 3 } F
  606. then the transition table will be:
  607. 0 1
  608. A A B
  609. B C B
  610. C A E
  611. D* F E
  612. E* D E
  613. F* F E
  614. 2.6 MINIMIZATION/OPTIMIZATION OF A DFA
  615. Minimization/optimization of a deterministic finite automata refers to detecting those states of a DFA whose presence
  616. or absence in a DFA does not affect the language accepted by the automata. Hence, these states can be eliminated
  617. from the automata without affecting the language accepted by the automata. Such states are:
  618. Unreachable States: Unreachable states of a DFA are not reachable from the initial state of DFA on any
  619. possible input sequence.
Dead States: A dead state is a nonfinal state of a DFA whose transitions on every input symbol
terminate on itself. For example, q is a dead state if q is in Q − F, and δ(q, a) = q for every a in Σ.
  622. Nondistinguishable States: Nondistinguishable states are those states of a DFA for which there exist no
  623. distinguishing strings; hence, they cannot be distinguished from one another.
Therefore, optimization entails:
1. Detection of unreachable states and eliminating them from the DFA;
2. Identification of nondistinguishable states and merging them together; and
3. Detecting dead states and eliminating them from the DFA.
  628. 2.6.1 Algorithm to Detect Unreachable States
Input: M = (Q, Σ, δ, q0, F)
Output: the set U of unreachable states
{Let R be the set of reachable states of the DFA. We take two copies of R, Rnew and Rold, so that we will be able to
perform iterations in the process of detecting unreachable states.}
begin
    Rold = φ
    Rnew = {q0}
    while (Rold ≠ Rnew) do
    begin
        temp1 = Rnew − Rold
        Rold = Rnew
        temp2 = φ
        for every a in Σ do
            temp2 = temp2 ∪ δ(temp1, a)
        Rnew = Rnew ∪ temp2
    end
    U = Q − Rnew
end
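A Python rendering of the same iteration (illustrative; δ is a dict from (state, symbol) to a state, and missing entries are simply skipped):

def unreachable_states(Q, sigma, delta, q0):
    reachable, frontier = {q0}, {q0}
    while frontier:
        frontier = {delta[(q, a)] for q in frontier for a in sigma
                    if (q, a) in delta} - reachable
        reachable |= frontier
    return set(Q) - reachable                  # U, the set of unreachable states

Q = {"A", "B", "C"}
delta = {("A", "0"): "B", ("A", "1"): "A", ("B", "0"): "B", ("B", "1"): "A",
         ("C", "0"): "A", ("C", "1"): "B"}
print(unreachable_states(Q, {"0", "1"}, delta, "A"))   # {'C'}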
  647. If p and q are the two states of a DFA, then p and q are said to be ‘distinguishable’ states if a distinguishing string w
  648. exists that distinguishes p and q.
  649. A string w is a distinguishing string for states p and q if transitions from p on w go to a nonfinal state, whereas
  650. transitions from q on w go to a final state, or vice versa.
  651. Therefore, to find nondistinguishable states of a DFA, we must find out whether some distinguishing string w, which
  652. distinguishes the states, exists. If no such string exists, then the states are nondistinguishable and can be merged
  653. together.
The technique that we use to find nondistinguishable states is the method of successive partitioning. We start with two
groups/partitions: one contains all the nonfinal states, and the other contains all the final states. This is because every
final state is known to be distinguishable from every nonfinal state. We then find the transitions from the members of a
partition on every input symbol. If on a particular input symbol a we find that the transitions from some of the members
of a partition go to one place, whereas the transitions from the other members of the partition go to another place, then
we conclude that the members whose transitions go to one place are distinguishable from the members whose
transitions go to another place. Therefore, we divide the partition in two; and we continue this partitioning until we get
partitions that cannot be partitioned further. This happens either when a partition contains only one state, or when a
partition contains more than one state but they are not distinguishable from one another. When we get such a
partition, we merge all of the states of this partition into a single state. For example, consider the transition diagram in
Figure 2.9.
  664. Figure 2.9: Partitioning down to a single state.
  665. Initially, we have two groups, as shown below:
  666. Since
  667. Partitioning of Group I is not possible, because the transitions from all the members of Group I go only to Group I. But
  668. since
  669. state F is distinguishable from the rest of the members of Group I. Hence, we divide Group I into two groups: one
  670. containing A, B, C, E, and the other containing F, as shown below:
  671. Since
  672. partitioning of Group I is not possible, because the transitions from all the members of Group I go only to Group I. But
  673. since
  674. states A and E are distinguishable from states B and C. Hence, we further divide Group I into two groups: one
  675. containing A and E, and the other containing B and C, as shown below:
  676. Since
  677. state A is distinguishable from state E. Hence, we divide Group I into two groups: one containing A and the other
  678. containing E, as shown below:
  679. Since
  680. partitioning of Group III is not possible, because the transitions from all the members of Group III on a go to group III
  681. only. Similarly,
  682. partitioning of Group III is not possible, because the transitions from all the members of Group III on b also only go to
  683. Group III.
  684. Hence, B and C are nondistinguishable states; therefore, we merge B and C to form a single state, B 1 , as shown in
  685. Figure 2.10.
  686. Figure 2.10: Merging nondistinguishable states B and C into a single state B 1 .
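A compact Python sketch of the successive-partitioning method described above (it assumes δ is total, i.e., defined for every state and input symbol). Each group in the result is a set of mutually nondistinguishable states and can be merged into a single DFA state.

def nondistinguishable_groups(Q, sigma, delta, finals):
    # start with the nonfinal / final split
    partition = [g for g in (set(Q) - set(finals), set(finals)) if g]
    while True:
        def group_of(state):
            return next(i for i, g in enumerate(partition) if state in g)
        new_partition = []
        for group in partition:
            buckets = {}
            for q in group:
                # which group does each symbol lead to from q?
                key = tuple(group_of(delta[(q, a)]) for a in sorted(sigma))
                buckets.setdefault(key, set()).add(q)
            new_partition.extend(buckets.values())
        if len(new_partition) == len(partition):   # no group was split: done
            return partition
        partition = new_partition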
  687. 2.6.2 Algorithm for Detection of Dead States
Input: M = (Q, Σ, δ, q0, F)
Output: the set X of dead states
{
    X = φ
    for every q in (Q − F) do
    {
        flag = true
        for every a in Σ do
            if (δ(q, a) ≠ q) then
            {
                flag = false
                break
            }
        if flag = true then
            X = X ∪ {q}
    }
}
  705. 2.7 EXAMPLES OF FINITE AUTOMATA CONSTRUCTION
  706. EXAMPLE 2.3
  707. Construct a finite automata accepting the set of all strings of zeros and ones, with at most one pair of consecutive
  708. zeros and at most one pair of consecutive ones.
  709. A transition diagram of the finite automata accepting the set of all strings of zeros and ones, with at most one pair of
  710. consecutive zeros and at most one pair of consecutive ones is shown in Figure 2.11.
  711. Figure 2.11: Transition diagram for Example 2.3 finite automata.
  712. EXAMPLE 2.4
  713. Construct a finite automata that will accept strings of zeros and ones that contain even numbers of zeros and odd
  714. numbers of ones.
  715. A transition diagram of the finite automata that accepts the set of all strings of zeros and ones that contains even
  716. numbers of zeros and odd numbers of ones is shown in Figure 2.12.
  717. Figure 2.12: Finite automata containing even number of zeros and odd number of ones.
  718. EXAMPLE 2.5
  719. Construct a finite automata that will accept a string of zeros and ones that contains an odd number of zeros and an
  720. even number of ones.
  721. A transition diagram of finite automata accepting the set of all strings of zeros and ones that contains an odd number
  722. of zeros and an even number of ones is shown in Figure 2.13.
  723. Figure 2.13: Finite automata containing odd number of zeros and even number of ones.
  724. EXAMPLE 2.6
  725. Construct the finite automata for accepting strings of zeros and ones that contain equal numbers of zeros and ones,
  726. and no prefix of the string should contain two more zeros than ones or two more ones than zeros.
  727. A transition diagram of the finite automata that will accept the set of all strings of zeros and ones, contain equal
  728. numbers of zeros and ones, and contain no string prefixes of two more zeros than ones or two more ones than zeros
  729. is shown in Figure 2.14.
  730. Figure 2.14: Example 2.6 finite automata considers the set prefix.
  731. EXAMPLE 2.7
  732. Construct a finite automata for accepting all possible strings of zeros and ones that do not contain 101 as a substring.
  733. Figure 2.15 shows a transition diagram of the finite automata that accepts the strings containing 101 as a substring.
  734. Figure 2.15: Finite automata accepts strings containing the substring 101.
A DFA equivalent to this NFA will be:

                0           1
{A}             {A}         {A, B}
{A, B}          {A, C}      {A, B}
{A, C}          {A}         {A, B, D}
{A, B, D}*      {A, C, D}   {A, B, D}
{A, C, D}*      {A, D}      {A, B, D}
{A, D}*         {A, D}      {A, B, D}
  743. Let us identify the states of this DFA using the names given below:
  744. {A} q 0
  745. {A, B} q 1
  746. {A, C} q 2
  747. {A, B, D} q 3
  748. {A, C, D} q 4
  749. {A, D} q 5
  750. The transition diagram of this automata is shown in Figure 2.16.
  751. Figure 2.16: DFA using the names A-D and q 0 − 5 .
  752. The complement of the automata in Figure 2.16 is shown in Figure 2.17.
  753. Figure 2.17: Complement to Figure 2.16 automata.
After minimization, we get the DFA shown in Figure 2.18, because states q3, q4, and q5 are nondistinguishable states.
Hence, they get combined, and this combination becomes a dead state and can be eliminated.
  756. Figure 2.18: DFA after minimization.
  757. EXAMPLE 2.8
  758. Construct a finite automata that will accept those strings of decimal digits that are divisible by three (see Figure 2.19).
  759. Figure 2.19: Finite automata that accepts string decimals that are divisible by three.
  760. EXAMPLE 2.9
  761. Construct a finite automata that accepts all possible strings of zeros and ones that do not contain 011 as a substring.
Figure 2.20 shows a transition diagram of the automata that accepts the strings containing 011 as a substring.
Figure 2.20: Finite automata accepting strings containing 011.
  764. A DFA equivalent to this NFA will be:
  765. 0 1
  766. {A} {A, B} {A}
  767. {A, B} {A, B} {A, C}
  768. {A, C} {A, B} {A, D}
  769. {A, D}* {A, B, D} {A, D}
  770. {A, B, D}* {A, B, D} {A, C, D}
  771. {A, C, D}* {A, B, D} {A, D}
  772. Let us identify the states of this DFA using the names given below:
  773. {A} q 0
  774. {A, B} q 1
  775. {A, C} q 2
  776. {A, D} q 3
  777. {A, B, D} q 4
  778. {A, C, D} q 5
  779. The transition diagram of this automata is shown in Figure 2.21.
  780. Figure 2.21: Finite automata identified by the name states A-D and q 0 − 5 .
  781. The complement of automata shown in Figure 2.21 is illustrated in Figure 2.22.
  782. Figure 2.22: Complement to Figure 2.21 automata.
  783. After minimization, we get the DFA shown in Figure 2.23, because the states q 3 , q 4 , and q 5 are nondistinguishable
  784. states. Hence, they get combined, and this combination becomes a dead state that can be eliminated.
  785. Figure 2.23: Minimization of nondistinguishable states of Figure 2.22.
  786. EXAMPLE 2.10
  787. Construct a finite automata that will accept those strings of a binary number that are divisible by three.
  788. The transition diagram of this automata is shown in Figure 2.24.
  789. Figure 2.24: Automata that accepts binary strings that are divisible by three.
  790. 2.8 REGULAR SETS AND REGULAR EXPRESSIONS
  791. 2.8.1 Regular Sets
  792. A regular set is a set of strings for which there exists some finite automata that accepts that set. That is, if R is a
  793. regular set, then R = L(M) for some finite automata M. Similarly, if M is a finite automata, then L(M) is always a regular
  794. set.
  795. 2.8.2 Regular Expression
  796. A regular expression is a notation to specify a regular set. Hence, for every regular expression, there exists a finite
  797. automata that accepts the language specified by the regular expression. Similarly, for every finite automata M, there
  798. exists a regular expression notation specifying L(M). Regular expressions and the regular sets they specify are shown
  799. in the following table.
Regular expression                  Regular set
φ                                   { }
∈                                   { ∈ }
Every a in Σ is a regular           { a }
expression
r1 + r2 (or r1 | r2) is a           R1 ∪ R2 (where R1 and R2 are the regular sets corresponding to r1 and r2,
regular expression                  respectively); its finite automata is built from N1 and N2, where N1 is a
                                    finite automata accepting R1, and N2 is a finite automata accepting R2
r1.r2 is a regular expression       R1.R2 (where R1 and R2 are the regular sets corresponding to r1 and r2,
                                    respectively); its finite automata is built from N1, a finite automata
                                    accepting R1, and N2, a finite automata accepting R2
r* is a regular expression          R* (where R is the regular set corresponding to r); its finite automata is
                                    built from N, a finite automata accepting R
Hence, we have only three regular-expression operators: | or + to denote the union operation, . (dot) for the
concatenation operation, and * for the closure operation. The precedence of the operators, in decreasing order, is: *
first, followed by . , followed by |. For example, consider the following regular expression:
a.(a + b)*.b.b
  837. To construct a finite automata for this regular expression, we proceed as follows: the basic regular expressions
  838. involved are a and b, and we start with automata for a and automata for b. Since brackets are evaluated first, we
  839. initially construct the automata for a + b using the automata for a and the automata for b, as shown in Figure 2.25.
  840. Figure 2.25: Transition diagram for (a + b).
  841. Since closure is required next, we construct the automata for (a + b)*, using the automata for a + b, as shown in
  842. Figure 2.26.
  843. Figure 2.26: Transition diagram for (a + b)*.
  844. The next step is concatenation. We construct the automata for a. (a + b)* using the automata for (a + b)* and a, as
  845. shown in Figure 2.27.
  846. Figure 2.27: Transition diagram for a. (a + b)*.
  847. Next we construct the automata for a.(a + b)*.b, as shown in Figure 2.28.
  848. Figure 2.28: Automata for a.(a + b)* .b.
  849. Finally, we construct the automata for a.(a + b)*.b.b (Figure 2.29).
  850. Figure 2.29: Automata for a.(a + b)*.b.b.
  851. This is an NFA with ∈ -moves, but an algorithm exists to transform the NFA to a DFA. So, we can obtain a DFA from
  852. this NFA.
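For illustration only, here is a small C sketch that simulates an NFA for the same language a.(a + b)*.b.b by tracking the set of states reachable so far (on-the-fly subset construction). The four-state NFA hard-coded below is a hand-built equivalent without ∈-moves; it is not the automaton produced step by step in Figures 2.25 through 2.29.

#include <stdio.h>

#define NSTATES 4

/* delta[state][symbol] is a bit mask of successor states;
 * symbol 0 stands for 'a' and symbol 1 for 'b'.
 * States: 0 = start, 3 = accepting; language = a (a|b)* b b. */
static const unsigned delta[NSTATES][2] = {
    /* state 0 */ { 1u << 1, 0                     },  /* a -> {1}          */
    /* state 1 */ { 1u << 1, (1u << 1) | (1u << 2) },  /* a -> {1}, b -> {1,2} */
    /* state 2 */ { 0,       1u << 3               },  /* b -> {3}          */
    /* state 3 */ { 0,       0                     },
};

static int accepts(const char *w)
{
    unsigned current = 1u << 0;                 /* start in state 0         */
    for (; *w; w++) {
        int sym = (*w == 'b');                  /* map 'a'/'b' to 0/1       */
        unsigned next = 0;
        for (int s = 0; s < NSTATES; s++)
            if (current & (1u << s))
                next |= delta[s][sym];          /* union of successor sets  */
        current = next;
    }
    return (current & (1u << 3)) != 0;          /* accept if state 3 reached */
}

int main(void)
{
    const char *tests[] = { "abb", "aabbbb", "ab", "bb", "abba" };
    for (int i = 0; i < 5; i++)
        printf("%-8s -> %s\n", tests[i], accepts(tests[i]) ? "accepted" : "rejected");
    return 0;
}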
  853. 2.9 OBTAINING THE REGULAR EXPRESSION FROM THE FINITE
  854. AUTOMATA
  855. Given a finite automata, to obtain a regular expression that specifies the regular set accepted by the given finite
  856. automata, the following steps are necessary:
1. Associate suitable variables (e.g., A, B, C, etc.) with the states of the finite automata.
2. Form a set of equations using the following rules:
   a. If there exists a transition from the state associated with variable A to the state
      associated with variable B on an input symbol a, then add the equation A = aB.
   b. If the state associated with variable A is a final state, add A = ∈ to the set of
      equations.
   c. If we have the two equations A = aB and A = bc, then they can be combined
      as A = aB | bc.
3. Solve these equations to get the value of the variable associated with the starting state of the
   automata. In order to solve these equations, it is necessary to bring each equation into the
   following form:
       S = aS | b
   where S is a variable, and a and b are expressions that do not contain S. The solution to this equation is S = a*b.
  874. (Here, the concatenation operator is between a* and b, and is not explicitly shown.) For example, consider the finite
  875. automata whose transition diagram is shown in Figure 2.30.
  876. Figure 2.30: Deriving the regular expression for a regular set.
  877. We use the names of the states of the automata as the variable names associated with the states.
  878. The set of equations obtained by the application of the rules are:
  879. To solve these equations, we do the substitution of (II) and (III) in (I), to obtain:
880. Therefore, the value of the variable S comes out to be:
  881. Therefore, the regular expression specifying the regular set accepted by the given finite automata is
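As a further, self-contained illustration of the method, suppose (hypothetically) that an automata has start state A, a single final state B, and the transitions δ(A, a) = A, δ(A, b) = B, and δ(B, b) = B. The rules above give the equations:

    A = aA | bB        (transitions out of A)
    B = bB | ∈         (transitions out of B; B is final)

Solving the second equation with the rule above (S = aS | b gives S = a*b) yields B = b*∈ = b*. Substituting this into the first equation gives A = aA | bb*, and applying the rule once more gives A = a*bb*. Hence, for this hypothetical automata the regular expression is a*bb*, i.e., zero or more a's followed by one or more b's.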
  882. 2.10 LEXICAL ANALYZER DESIGN
  883. Since the function of the lexical analyzer is to scan the source program and produce a stream of tokens as output, the
  884. issues involved in the design of lexical analyzer are:
1. Identifying the tokens of the language for which the lexical analyzer is to be built, and specifying
   these tokens by using suitable notation, and
2. Constructing a suitable recognizer for these tokens.
  889. Therefore, the first thing that is required is to identify what the keywords are, what the operators are, and what the
  890. delimiters are. These are the tokens of the language. After identifying the tokens of the language, we must use
suitable notation to specify these tokens. This notation should be compact, precise, and easy to understand. Regular
expressions can be used to specify a set of strings, and a set of strings that can be specified by using
regular-expression notation is called a "regular set." The tokens of a programming language constitute a regular set.
Hence, this regular set can be specified by using regular-expression notation. Therefore, we write regular expressions
for things like operators, keywords, and identifiers. For example, the regular expressions specifying a subset of the
tokens of a typical programming language are as follows:
  897. operators = +| -| * |/ | mod|div
  898. keywords = if|while|do|then
  899. letter = a|b|c|d|....|z|A|B|C|....|Z
  900. digit = 0|1|2|3|4|5|6|7|8|9
  901. identifier = letter (letter|digit)*
  902. The advantage of using regular-expression notation for specifying tokens is that when regular expressions are used,
  903. the recognizer for the tokens ends up being a DFA. Therefore, the next step is the construction of a DFA from the
  904. regular expression that specifies the tokens of the language. But the DFA is a flow-chart (graphical) representation of
905. the lexical analyzer. Therefore, after constructing the DFA, the next step is to write a program in a suitable programming
  906. language that will simulate the DFA. This program acts as a token recognizer or lexical analyzer. Therefore, we find
  907. that by using regular expressions for specifying the tokens, designing a lexical analyzer becomes a simple mechanical
  908. process that involves transforming regular expressions into finite automata and generating the program for simulating
  909. the finite automata.
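As an illustration of what such a simulating program looks like, here is a minimal C sketch (illustrative, not part of the book's example) that hard-codes the two-state DFA for the single token identifier = letter (letter|digit)*; a real lexical analyzer would also check the keyword patterns first, as discussed below for LEX.

#include <ctype.h>
#include <stdio.h>

/* Two-state DFA for  identifier = letter (letter | digit)* :
 * state 0 = nothing matched yet, state 1 = a valid identifier so far. */
static int is_identifier(const char *s)
{
    int state = 0;
    for (; *s; s++) {
        if (state == 0 && isalpha((unsigned char)*s))
            state = 1;                 /* first character must be a letter   */
        else if (state == 1 &&
                 (isalpha((unsigned char)*s) || isdigit((unsigned char)*s)))
            state = 1;                 /* letters and digits may follow      */
        else
            return 0;                  /* no transition: reject              */
    }
    return state == 1;                 /* accept only if a letter was seen   */
}

int main(void)
{
    const char *tests[] = { "count1", "x", "9lives", "_tmp" };
    for (int i = 0; i < 4; i++)
        printf("%-8s -> %s\n", tests[i],
               is_identifier(tests[i]) ? "identifier" : "not an identifier");
    return 0;
}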
Therefore, it is possible to automate the procedure of obtaining the lexical analyzer from the regular expressions
specifying the tokens, and this is precisely what the tool LEX is used to do. LEX is a compiler-writing tool that
facilitates writing the lexical analyzer, and hence a compiler. It takes as input the regular expressions that specify the
tokens to be recognized, and it generates as output a C program that acts as a lexical analyzer for the tokens specified
by the input regular expressions.
  915. 2.10.1 Format of the Input or Source File of LEX
  916. The LEX source file contains two things:
1. Auxiliary definitions, having the format: name = regular expression.
   The purpose of the auxiliary definitions is to identify the larger regular expressions by using
   suitable names. LEX makes use of the auxiliary definitions to replace the names used for specifying
   the patterns of the corresponding regular expressions.
2. The translation rules, having the format:
   pattern {action}
  926. The ‘pattern’ specification is a regular expression that specifies the tokens, and ‘{action}’ is a program fragment written
  927. in C to specify the action to be taken by the lexical analyzer generated by LEX when it encounters a string matching
  928. the pattern. Normally, the action taken by the lexical analyzer is to return a pair to the parser or syntax analyzer. The
  929. first member of the pair is a token, and the second member is the value or attribute of the token. For example, if the
  930. token is an identifier, then the value of the token is a pointer to the symbol-table record that contains the
  931. corresponding name of the identifier. Hence, the action taken by the lexical analyzer is to install the name in the
  932. symbol table and return the token as an id, and to set the value of the token as a pointer to the symbol table record
  933. where the name is installed. Consider the following sample source program:
letter  [a-zA-Z]
digit   [0-9]
%%
begin                         { return ("BEGIN"); }
end                           { return ("END"); }
if                            { return ("IF"); }
{letter}({letter}|{digit})*   { install ( );
                                return ("identifier");
                              }
"<"                           { return ("LT"); }
"<="                          { return ("LE"); }
%%
definition of install()
  947. In the above specification, we find that the keyword ‘begin’ can be matched against two patterns one specifying the
  948. keyword and the other specifying identifiers. In this case, pattern-matching is done against whichever pattern comes
  949. first in the physical order of the specification. Hence, ‘begin’ will be recognized as a keyword and not as an identifier.
Therefore, patterns that specify keywords of the language are required to be listed before the pattern specifying
identifiers; otherwise, every keyword will get recognized as an identifier. A lexical analyzer generated by LEX always tries
  952. to recognize the longest prefix of the input as a token. Hence, if < = is read, it will be recognized as a token "LE" not
  953. "LT."
  954. 2.11 PROPERTIES OF REGULAR SETS
Since the union of two regular sets is always a regular set, regular sets are closed under the union operation. Similarly,
regular sets are closed under the concatenation and closure operations, because the concatenation of two regular sets is
also a regular set, and the closure of a regular set is also a regular set.
  958. Regular sets are also closed under the complement operation, because if L(M) is a language accepted by a finite
  959. automata M, then the complement of L(M) is Σ * − L(M). If we make all final states of M nonfinal, and we make all
  960. nonfinal states of M final, then the automata accepts Σ * − L(M); hence, we conclude that the complement of L(M) is also
  961. a regular set. For example, consider the transition diagram in Figure 2.31.
  962. Figure 2.31: Transition diagram.
  963. The transition diagram of the complement to the automata shown in Figure 2.31 is shown in Figure 2.32.
  964. Figure 2.32: Complement to transition diagram in Figure 2.31.
  965. Since the regular sets are closed under complement as well as union operations, they are closed under intersection
  966. operations also, because intersection can be expressed in terms of both union and complement operations, as shown
below:
    L1 ∩ L2 = complement( complement(L1) ∪ complement(L2) )
where complement(L1) denotes the complement of L1 (that is, Σ* − L1).
An automata accepting L1 ∩ L2 has to simulate the moves of an automata that accepts L1 as well as
the moves of an automata that accepts L2 on the input string x. Hence, every state of the automata that accepts L1 ∩
L2 will be an ordered pair [p, q], where p is a state of the automata accepting L1 and q is a state of the automata
accepting L2.
  973. Therefore, if M 1 = (Q 1 , Σ , δ 1 , q 1 , F 1 ) is an automata accepting L 1 , and if M 2 = (Q 2 , Σ , δ 2 , q 2 , F 2 ) is an automata
  974. accepting L 2 , then the automata accepting L 1 ∩ L 2 will be: M = (Q 1 × Q 2 , Σ , δ , [q 1 , q 2 ], F 1 × F 2 ) where δ ([p, q], a) = [ δ 1
  975. (p, a), δ 2 (q, a)]. But all the members of Q 1 × Q 2 may not necessarily represent reachable states of M. Hence, to
  976. reduce the amount of work, we start with a pair [q 1 , q 2 ] and find transitions on every member of Σ from [q 1 , q 2 ]. If some
  977. transitions go to a new pair, then we only generate that pair, because it will then represent a reachable state of M.
  978. We next consider the newly generated pairs to find out the transitions from them. We continue this until no new pairs
  979. can be generated.
Let M1 = (Q1, Σ, δ1, q1, F1) be an automata accepting L1, and let M2 = (Q2, Σ, δ2, q2, F2) be an automata accepting L2.
M = (Q, Σ, δ, q0, F) will be an automata accepting L1 ∩ L2.
begin
    Qold = Φ
    Qnew = { [q1, q2] }
    while (Qold ≠ Qnew)
    {
        Temp = Qnew − Qold
        Qold = Qnew
        for every pair [p, q] in Temp do
            for every a in Σ do
                Qnew = Qnew ∪ { δ([p, q], a) }
    }
    Q = Qnew
end
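The following C sketch (illustrative only; the two DFAs are hypothetical and not those of Figures 2.33 and 2.34) carries out this pair-generation procedure: starting from [q1, q2], it explores the transitions on every input symbol and generates a pair only when it is newly reached.

#include <stdio.h>

/* Hypothetical example DFAs:
 *   M1: two states; accepts strings with an even number of a's (final state 0).
 *   M2: two states; accepts strings that end in b              (final state 1).
 * Symbol 0 stands for 'a' and symbol 1 for 'b'. */
static const int d1[2][2] = { {1, 0}, {0, 1} };   /* transition function of M1 */
static const int d2[2][2] = { {0, 1}, {0, 1} };   /* transition function of M2 */

int main(void)
{
    int queue[4][2];                 /* at most 2 x 2 reachable pairs           */
    int head = 0, tail = 0;
    int seen[2][2] = { {0} };

    queue[tail][0] = 0;              /* start with the pair [q1, q2] = [0, 0]   */
    queue[tail][1] = 0;
    tail++;
    seen[0][0] = 1;

    printf("state        a        b\n");
    while (head < tail) {            /* process pairs in the order generated    */
        int p = queue[head][0], q = queue[head][1];
        head++;
        printf("[%d,%d]%c   ", p, q, (p == 0 && q == 1) ? '*' : ' ');
        for (int sym = 0; sym < 2; sym++) {
            int np = d1[p][sym], nq = d2[q][sym];   /* componentwise move       */
            printf("  [%d,%d]", np, nq);
            if (!seen[np][nq]) {     /* generate a pair only when newly reached */
                seen[np][nq] = 1;
                queue[tail][0] = np;
                queue[tail][1] = nq;
                tail++;
            }
        }
        printf("\n");
    }
    return 0;
}

A pair is marked with * when both components are final in their respective automata; these are the final states of the product automata.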
  995. Consider the automatas and their transition diagrams shown in Figure 2.33 and Figure 2.34.
  996. Figure 2.33: Transition diagram of automata M 1 .
  997. Figure 2.34: Transition diagram of automata M 2 .
  998. The transition table for the automata accepting L(M 1 ) ∩ L(M 2 ) is:
δ          a          b
[1, 1]     [1, 1]     [2, 4]
[2, 4]     [3, 3]     [4, 2]
[3, 3]     [2, 2]     [1, 1]
[4, 2]     [1, 1]     [2, 4]
[2, 2]     [3, 1]     [4, 4]
[3, 1]     [2, 1]     [1, 4]
[4, 4]     [1, 3]     [2, 2]
[2, 1]     [3, 1]     [4, 4]
[1, 4]*    [1, 3]     [2, 2]
[1, 3]     [1, 2]     [2, 1]
[1, 2]*    [1, 1]     [2, 4]
  1012. We associate the names with states of the automata obtained, as shown below:
  1013. [1, 1] A
  1014. [2, 4] B
  1015. [3, 3] C
  1016. [4, 2] D
  1017. [2, 2] E
  1018. [3, 1] F
  1019. [4, 4] G
  1020. [2, 1] H
  1021. [1, 4] I
  1022. [1, 3] J
  1023. [1, 2] K
  1024. The transition table of the automata using the names associated above is:
δ     a    b
A     A    B
B     C    D
C     E    A
D     A    B
E     F    G
F     H    I
G     J    E
H     F    G
I*    J    E
J     K    H
K*    A    B
  1038. 2.12 EQUIVALENCE OF TWO AUTOMATAS
  1039. Automatas M 1 and M 2 are said to be equivalent if they accept the same language; that is, L(M 1 ) = L(M 2 ). It is possible
  1040. to test whether the automatas M 1 and M 2 accept the same language—and hence, whether they are equivalent or not.
  1041. One method of doing this is to minimize both M 1 and M 2 , and if the minimal state automatas obtained from M 1 and M 2
  1042. are identical, then M 1 is equivalent to M 2 .
Another method to test whether or not M1 is equivalent to M2 is to find out whether:
    L(M1) ∩ complement(L(M2)) = φ   and   L(M2) ∩ complement(L(M1)) = φ
For this, complement M2, and construct an automata that accepts the intersection of the language accepted by M1
and the complement of M2. If this automata accepts an empty set, then it means that there is no string acceptable to
  1046. M 1 that is not acceptable to M 2 . Similarly, construct an automata that accepts the intersection of language accepted by
  1047. M 2 and the complement of M 1 . If this automata accepts an empty set, then it means that there is no string acceptable
  1048. to M 2 that is not acceptable to M 1 . Hence, the language accepted by M 1 is same as the language accepted by M 2 .
  1049. Chapter 3: Context-Free Grammar and Syntax Analysis
  1050. 3.1 SYNTAX ANALYSIS
  1051. In the syntax-analysis phase, a compiler verifies whether or not the tokens generated by the lexical analyzer are
  1052. grouped according to the syntactic rules of the language. If the tokens in a string are grouped according to the
  1053. language's rules of syntax, then the string of tokens generated by the lexical analyzer is accepted as a valid construct
  1054. of the language; otherwise, an error handler is called. Hence, two issues are involved when designing the
  1055. syntax-analysis phase of a compilation process:
1. All valid constructs of a programming language must be specified; and by using these
   specifications, a valid program is formed. That is, we form a specification of what tokens the
   lexical analyzer will return, and we specify in what manner these tokens are to be grouped so that
   the result of the grouping will be a valid construct of the language.
2. A suitable recognizer will be designed to recognize whether a string of tokens generated by the
   lexical analyzer is a valid construct or not.
  1064. Therefore, suitable notation must be used to specify the constructs of a language. The notation for the construct
  1065. specifications should be compact, precise, and easy to understand. The syntax-structure specification for the
  1066. programming language (i.e., the valid constructs of the language) uses context-free grammar (CFG), because for
  1067. certain classes of grammar, we can automatically construct an efficient parser that determines if a source program is
1068. syntactically correct. Hence, CFG notation is a required topic for study.
  1069. 3.2 CONTEXT-FREE GRAMMAR
  1070. CFG notation specifies a context-free language that consists of terminals, nonterminals, a start symbol, and
  1071. productions. The terminals are nothing more than tokens of the language, used to form the language constructs.
  1072. Nonterminals are the variables that denote a set of strings. For example, S and E are nonterminals that denote
  1073. statement strings and expression strings, respectively, in a typical programming language. The nonterminals define
  1074. the sets of strings that are used to define the language generated by the grammar.
  1075. They also impose a hierarchical structure on the language, which is useful for both syntax analysis and translation.
  1076. Grammar productions specify the manner in which the terminals and string sets, defined by the nonterminals, can be
  1077. combined to form a set of strings defined by a particular nonterminal. For example, consider the production S → aSb.
  1078. This production specifies that the set of strings defined by the nonterminal S are obtained by concatenating terminal a
  1079. with any string belonging to the set of strings defined by nonterminal S, and then with terminal b. Each production
  1080. consists of a nonterminal on the left-hand side, and a string of terminals and nonterminals on the right-hand side. The
  1081. left-hand side of a production is separated from the right-hand side using the " → " symbol, which is used to identify a
  1082. relation on a set (V ∪ T)*.
Therefore, a context-free grammar is a four-tuple denoted as:
    G = (V, T, P, S)
where:
1. V is a finite set of symbols called nonterminals or variables,
2. T is a finite set of symbols called terminals,
3. P is a set of productions, and
4. S is a member of V, called the start symbol.
  1089. For example:
  1090. 3.2.1 Derivation
  1091. Derivation refers to replacing an instance of a given string's nonterminal, by the right-hand side of the production rule,
  1092. whose left-hand side contains the nonterminal to be replaced. Derivation produces a new string from a given string;
  1093. therefore, derivation can be used repeatedly to obtain a new string from a given string. If the string obtained as a result
  1094. of the derivation contains only terminal symbols, then no further derivations are possible. For example, consider the
following grammar for a string S, where P contains the following productions:
    S → aSa | bSb | ∈
  1097. It is possible to replace the nonterminal S by a string aSa. Therefore, we obtain aSa from S by deriving S to aSa. It is
  1098. possible to replace S in aSa by ∈ , to obtain a string aa, which cannot be further derived.
If α1 and α2 are two strings, and if α2 can be obtained from α1, then we say α1 is related to α2 by the "derives to"
relation, which is denoted by "→". Hence, we write α1 → α2, which translates to: α1 derives to α2. The symbol →
denotes a derives-to relation that relates the two strings α1 and α2 such that α2 is a direct derivative of α1 (that is, α2 can be
obtained from α1 by a derivation of only one step). Therefore, →+ will denote the transitive closure of the derives-to
relation; and if we have the two strings α1 and α2 such that α2 can be obtained from α1 by derivation, but α2 may not
be a direct derivative of α1, then we write α1 →+ α2, which translates to: α1 derives to α2 through one or more
derivations.
Similarly, →* denotes the reflexive transitive closure of the derives-to relation; and if we have two strings α1 and α2 such
that α1 derives to α2 in zero, one, or more derivations, then we write α1 →* α2. For example, in the grammar above,
we find that S → aSa → abSba → abba. Therefore, we can write S →* abba.
The language defined by a CFG is nothing but the set of strings of terminals that can be
generated from the start symbol S as a result of derivations using productions of the grammar. Hence, it is defined as the set of
those strings of terminals that are derivable from the grammar's start symbol. Therefore, if G = (V, T, P, S) is a
grammar, then the language generated by the grammar is denoted as L(G) and defined as:
    L(G) = { w | w is in T* and S →* w }
The above grammar can generate the strings ∈, aa, bb, abba, …, but not aba.
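To see that the grammar does generate exactly such strings, the following C sketch (illustrative, not from the book) enumerates every terminal string derivable from S with the productions S → aSa | bSb | ∈ up to a given length; each recursive call mirrors one derivation step, and the output "" corresponds to ∈.

#include <stdio.h>
#include <string.h>

/* Enumerate the terminal strings derivable from S with  S -> aSa | bSb | <empty>,
 * up to the length limit maxlen.  The current sentential form is x S y, kept as
 * the pair (prefix x, suffix y); each call first applies S -> <empty> (printing
 * x y), then tries S -> aSa and S -> bSb. */
static void derive(const char *prefix, const char *suffix, int maxlen)
{
    char p[64], s[64];

    printf("\"%s%s\"\n", prefix, suffix);            /* S -> <empty>             */
    if ((int)(strlen(prefix) + strlen(suffix)) + 2 > maxlen)
        return;                                      /* further wrapping too long */

    snprintf(p, sizeof p, "%sa", prefix);            /* S -> a S a               */
    snprintf(s, sizeof s, "a%s", suffix);
    derive(p, s, maxlen);

    snprintf(p, sizeof p, "%sb", prefix);            /* S -> b S b               */
    snprintf(s, sizeof s, "b%s", suffix);
    derive(p, s, maxlen);
}

int main(void)
{
    derive("", "", 4);   /* prints "", "aa", "aaaa", "abba", "bb", "baab", "bbbb" */
    return 0;
}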
  1114. 3.2.2 Standard Notation
1. The capital letters toward the start of the alphabet (e.g., A, B, C, etc.) are used to denote
   nonterminals.
2. Lowercase letters toward the start of the alphabet (e.g., a, b, c, etc.) are used to denote terminals.
3. S is used to denote the start symbol.
4. Lowercase letters toward the end of the alphabet (e.g., u, v, w, etc.) are used to denote strings of
   terminals.
5. The symbols α, β, γ, and so forth are used to denote strings of terminals as well as strings of
   nonterminals.
6. The capital letters toward the end of the alphabet (e.g., X, Y, and Z) are used to denote grammar
   symbols, and they may be terminals or nonterminals.
  1129. The benefit of using these notations is that it is not required to explicitly specify all four grammar components. A
  1130. grammar can be specified by only giving the list of productions; and from this list, we can easily get information about
  1131. the terminals, nonterminals, and start symbols of the grammar.
  1132. 3.2.3 Derivation Tree or Parse Tree
  1133. When deriving a string w from S, if every derivation is considered to be a step in the tree construction, then we get the
  1134. graphical display of the derivation of string w as a tree. This is called a "derivation tree" or a "parse tree" of string w.
  1135. Therefore, a derivation tree or parse tree is the display of the derivations as a tree. Note that a tree is a derivation tree
  1136. if it satisfies the following requirements:
1. All the leaf nodes of the tree are labeled by terminals of the grammar.
2. The root node of the tree is labeled by the start symbol of the grammar.
3. The interior nodes are labeled by the nonterminals.
4. If an interior node has a label A, and it has n descendents with labels X1, X2, …, Xn from left to
   right, then the production rule A → X1X2X3……Xn must exist in the grammar.
  1143. For example, consider a grammar whose list of productions is:
  1144. The tree shown in Figure 3.1 is a derivation tree for a string id + id * id.
  1145. Figure 3.1: Derivation tree for the string id + id * id.
  1146. Given a parse (derivation) tree, a string whose derivation is represented by the given tree is one obtained by
  1147. concatenating the labels of the leaf nodes of the parse tree in a left-to-right order.
  1148. Consider the parse tree shown in Figure 3.2. A string whose derivation is represented by this parse tree is abba.
  1149. Figure 3.2: Parse tree resulting from leaf-node concatenation.
Since a parse tree displays derivations as a tree, given a grammar G = (V, T, P, S), for every w in T* that is
derivable from S there exists a parse tree displaying the derivation of w as a tree. Therefore, we can define the
language generated by the grammar as the set of all w in T* for which a parse tree with root S and yield w exists in G.
  1153. For some w in L(G), there may exist more than one parse tree. That means that more than one way may exist to
derive w from S, using the productions of the grammar. For example, consider a grammar having the productions
listed below:
    E → E + E | E * E | id
We find that for the string id + id * id, there exists more than one parse tree, as shown in Figure 3.3.
  1157. Figure 3.3: Multiple parse trees.
  1158. If more than one parse tree exists for some w in L(G), then G is said to be an "ambiguous" grammar. Therefore, the
  1159. grammar having the productions E → E + E | E * E | id is an ambiguous grammar, because there exists more than one
  1160. parse tree for the string id + id * id in L(G) of this grammar.
  1161. Consider a grammar having the following productions:
  1162. This grammar is also an ambiguous grammar, because more than one parse tree exists for a string abab in L(G), as
  1163. shown in Figure 3.4.
  1164. Figure 3.4: Ambiguous grammar parse trees.
  1165. The parse tree construction process is such that the order in which the nonterminals are considered for replacement
  1166. does not matter. That is, given a string w, the parse tree for that string (if it exists) can be constructed by considering
  1167. the nonterminals for derivation in any order. The two specific orders of derivation, which are important from the point of
  1168. view of parsing, are:
1. Left-most order of derivation
2. Right-most order of derivation
  1171. The left-most order of derivation is that order of derivation in which a left-most nonterminal is considered first for
  1172. derivation at every stage in the derivation process. For example, one of the left-most orders of derivation for a string id
  1173. + id * id is:
  1174. In a right-most order of derivation, the right-most nonterminal is considered first. For example, one of the right-most
  1175. orders of derivation for id + id* id is:
  1176. The parse tree generated by using the left-most order of derivation of id + id*id and the parse tree generated by using
  1177. the right-most order of derivation of id + id*id are the same; hence, these orders are equivalent. A parse tree
  1178. generated using these orders is shown in Figure 3.5.
  1179. Figure 3.5: Parse tree generated by using both the right- and left-most derivation orders.
  1180. Another left-most order of derivation of id + id* id is given below:
  1181. And here is another right-most order of derivation of id + id*id:
  1182. The parse tree generated by using the left-most order of derivation of id + id* id and the parse tree generated using the
  1183. right-most order of derivation of id + id* id are the same. Hence, these orders are equivalent. A parse tree generated
  1184. using these orders is shown in Figure 3.6.
  1185. Figure 3.6: Parse tree generated from both the left- and right-most orders of derivation.
  1186. Therefore, we conclude that for every left-most order of derivation of a string w, there exists an equivalent right-most
  1187. order of derivation of w, generating the same parse tree.
  1188. Note If a grammar G is unambiguous, then for every w in L(G), there exists exactly one parse tree. Hence, there exists
  1189. exactly one left-most order of derivation and (equivalently) one right-most order of derivation for every w in L(G).
  1190. But if grammar G is ambiguous, then for some w in L(G), there exists more than one parse tree. Therefore, there
  1191. is more than one left-most order of derivation; and equivalently, there is more than one right-most order of
  1192. derivation.
  1193. 3.2.4 Reduction of Grammar
  1194. Reduction of a grammar refers to the identification of those grammar symbols (called "useless grammar symbols"),
  1195. and hence those productions, that do not play any role in the derivation of any w in L(G), and which we eliminate from
  1196. the grammar. This has no effect on the language generated by the grammar. For example, a grammar symbol X is
  1197. useful if and only if:
1. It derives to a string of terminals, and
2. It is used in the derivation of at least one w in L(G).
Thus, X is useful if and only if:
1. X →* w, where w is in T*, and
2. S →* αXβ →* w, where w is in L(G).
Therefore, reduction of a given grammar G involves:
1. Identification of those grammar symbols that are not capable of deriving to a w in T* and
   eliminating them from the grammar; and
2. Identification of those grammar symbols that are not used in any derivation and eliminating them
   from the grammar.
  1210. When identifying the grammar symbols that do not derive a w in T *, only nonterminals need be tested, because every
  1211. terminal member of T will also be in T *; and by default, they satisfy the first condition. A simple, iterative algorithm can
  1212. be used to identify those nonterminals that do not derive to w in T *: we start with those productions that are of the form
A → w (that is, those productions whose right side is a w in T*). We mark the nonterminal A on the left side of every such
production as capable of deriving to w in T*, and then we consider every production of the form A → X1X2…Xn,
where A is not yet marked. If every Xi (for 1 <= i <= n) is either a terminal or a nonterminal that is already marked, then
we mark A (the nonterminal on the left side of the production).
  1217. We repeat this process until no new nonterminals can be marked. The nonterminals that are not marked are those not
  1218. deriving to w in T *. After identifying the nonterminals that do not derive to w in T *, we eliminate all productions
  1219. containing these nonterminals in order to obtain a grammar that does not contain any nonterminals that do not derive
  1220. in T *. The algorithm for identifying as well as eliminating the nonterminals that do not derive to w in T * is given below:
  1221. Input: G = (V, T, P, S)
  1222. Output: G 1 = (V 1 , T, P 1 , S)
  1223. { where V 1 is the set of nonterminals deriving to w in T *, we maintain V 1 old and V 1 new to continue
  1224. iterations, and P 1 is the set of productions that do not contain nonterminals that do not derive to w in T
  1225. * }
  1226. Let U be the set of nonterminals that are not capable of deriving to w in T *.
  1227. Then,
begin
    V1old = φ
    V1new = φ
    for every production of the form A → w do
        V1new = V1new ∪ { A }
    while (V1old ≠ V1new) do
    begin
        temp = V − V1new
        V1old = V1new
        for every A in temp do
            for every A-production of the form A → X1X2...Xn in P do
                if each Xi is either in T or in V1old then
                begin
                    V1new = V1new ∪ { A }
                    break;
                end
    end
    V1 = V1new
    U = V − V1
    for every production in P do
        if it does not contain a member of U then
            add the production to P1
end
  1251. If S is itself a useless nonterminal, then the reduced grammar is a ‘null’ grammar.
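A compact C sketch of this marking procedure is shown below (illustrative only; the grammar is a small hypothetical one, with uppercase letters standing for nonterminals and lowercase letters for terminals).

#include <ctype.h>
#include <stdio.h>

/* Iterative marking of the nonterminals that derive some terminal string.
 * Hypothetical grammar:  S -> AC,  A -> bASC,  A -> a,  B -> aB,  C -> ad
 * Each production is written as "lhs:right-hand side". */
static const char *prods[] = { "S:AC", "A:bASC", "A:a", "B:aB", "C:ad" };
#define NPRODS 5

int main(void)
{
    int derives[26] = {0};                /* derives[X-'A'] = 1 once X is marked  */
    int changed = 1;
    while (changed) {                     /* repeat until no new nonterminal marked */
        changed = 0;
        for (int i = 0; i < NPRODS; i++) {
            int lhs = prods[i][0] - 'A';
            if (derives[lhs])
                continue;
            int ok = 1;
            for (const char *p = prods[i] + 2; *p; p++)     /* scan the right side */
                if (isupper((unsigned char)*p) && !derives[*p - 'A'])
                    ok = 0;               /* an unmarked nonterminal blocks the rule */
            if (ok) {
                derives[lhs] = 1;
                changed = 1;
            }
        }
    }
    const char *nts = "SABC";
    for (int i = 0; nts[i]; i++)
        printf("%c: %s\n", nts[i],
               derives[nts[i] - 'A'] ? "derives a terminal string"
                                     : "does not; its productions can be eliminated");
    return 0;
}

For this hypothetical grammar the output marks S, A, and C, while B remains unmarked, so the B-productions would be eliminated.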
  1252. When identifying the grammar symbols that are not used in the derivation of any w in L(G), terminals as well as
  1253. nonterminals must be tested. A simple, iterative algorithm can be used to identify those grammar symbols that are not
  1254. used in the derivation of any w in L(G): we start with S-productions and mark every grammar symbol X on the right
  1255. side of every S-production. We then consider every production of the form A → X 1 X 2 … X n , where A is an
  1256. already-marked nonterminal; and we mark every X on the right side of these productions. We repeat this process until
no new nonterminals can be marked. The terminals and nonterminals that remain unmarked are those not used in the derivation of any
w in L(G). After identifying the terminals and nonterminals not used in the derivation of any w in L(G), we eliminate all
productions containing them; thus, we obtain a grammar that does not contain any useless symbols; hence, a reduced
grammar.
  1261. The algorithm for identifying as well as eliminating grammar symbols that are not used in the derivation of any w in
  1262. L(G) is given below:
  1263. Input: G 1 = (V 1 , T, P 1 , S)
  1264. { The grammar obtained after elimination of the nonterminals not deriving to w in T * }
  1265. Output: G 2 = (V 2 , T 2 , P 2 , S)
  1266. { where V 2 is the set of nonterminals used in derivation of some w in L(G), and T 2 is set of terminals
  1267. used in the derivation of some w in L(G), and P 2 is set of productions containing the members of V 2
  1268. and T 2 only. We maintain V 2 old and V 2 new to continue iterations }
begin
    T2 = φ
    V2old = φ
    P2 = φ
    V2new = { S }
    while (V2old ≠ V2new) do
    begin
        temp = V2new − V2old
        V2old = V2new
        for every A in temp do
            for every A-production of the form A → X1X2...Xn in P1 do
                for each Xi (1 <= i <= n) do
                begin
                    if (Xi is in V1) then
                        V2new = V2new ∪ { Xi }
                    if (Xi is in T) then
                        T2 = T2 ∪ { Xi }
                end
    end
    V2 = V2new
    temp1 = V1 − V2
    temp2 = T − T2
    for every production in P1 do
        add the production to P2 if it does not contain a member of temp1 or temp2
    G2 = (V2, T2, P2, S)
end
  1294. EXAMPLE 3.1
  1295. Find the reduced grammar equivalent to CFG
  1296. where P contains
Since the productions A → a and C → ad exist in the form A → w, nonterminals A and C are derivable to w in T*. The
production S → AC also exists, the right side of which contains the nonterminals A and C, which are derivable to w in T
  1299. *. Hence, S is also derivable to w in T *. But since the right side of both of the B-productions contain B, the nonterminal
  1300. B is not derivable to w in T *.
  1301. Hence, B can be eliminated from the grammar, and the following grammar is obtained:
  1302. where P 1 contains
  1303. Since the right side of the S-production of this grammar contains the nonterminals A and C, A and C will be used in the
  1304. derivation of some w in L(G). Similarly, the right side of the A-production contains bASC and a; hence, the terminals a
  1305. and b will be used. The right side of the C-production contains ad, so terminal d will also be useful. Therefore, every
  1306. terminal, as well as the nonterminal in G1, is useful. So the reduced grammar is:
  1307. where P 1 contains
  1308. 3.2.5 Useless Grammar Symbols
A grammar symbol is a useless grammar symbol if it fails to satisfy either of the following conditions:
1. X →* w, where w is in T*, and
2. S →* αXβ →* w, where w is in L(G).
That is, a grammar symbol X is useless if it does not derive to a string of terminals. And even if it does derive to a string of
  1311. terminals, X is a useless grammar symbol if it does not occur in a derivation sequence of any w in L(G). For example,
  1312. consider the following grammar:
  1313. First, we find those nonterminals that do not derive to the string of terminals so that they can be separated out. The
  1314. nonterminals A and X directly derive to the string of terminals because the production A → q and X → ad exist in a
  1315. grammar. There also exists a production S → bX, where b is a terminal and X is a nonterminal, which is already known
  1316. to derive to a string of terminals. Therefore, S also derives to string of terminals, and the nonterminals that are capable
  1317. of deriving to a string of terminals are: S, A, and X. B ends up being a useless nonterminal; and therefore, the
  1318. productions containing B can be eliminated from the given grammar to obtain the grammar given below:
  1319. We next find in the grammar obtained those terminals and nonterminals that occur in the derivation sequence of some
  1320. w in L(G). Since every derivation sequence starts with S, S will always occur in the derivation sequence of every w in
  1321. L(G). We then consider those productions whose left-hand side is S, such as S → bX, since the right side of this
  1322. production contains a terminal b and a nonterminal X. We conclude that the terminal b will occur in the derivation
  1323. sequence, and a nonterminal X will also occur in the derivation sequence. Therefore, we next consider those
  1324. productions whose left-hand side is a nonterminal X. The production is X → ad. Since the right side of this production
  1325. contains terminals a and d, these terminals will occur in the derivation sequence. But since no new nonterminal is
  1326. found, we conclude that the nonterminals S and X, and the terminals a, b, and d are the grammar symbols that can
  1327. occur in the derivation sequence. Therefore, we conclude that the nonterminal A will be a useless nonterminal, even
  1328. though it derives to the string of terminals. So we eliminate the productions containing A to obtain a reduced grammar,
  1329. given below:
  1330. EXAMPLE 3.2
  1331. Consider the following grammar, and obtain an equivalent grammar containing no useless grammar symbols.
  1332. Since A → xyz and Z → z are the productions of the form A → w, where w is in T *, nonterminals A and Z are capable
  1333. of deriving to w in T *. There are two X-productions: X → Xz and X → xYx. The right side of these productions contain
  1334. nonterminals X and Y, respectively. Similarly, there are two Y-productions: Y → yYy and Y → XZ. The right side of
  1335. these productions contain nonterminals Y and X, respectively. Hence, both X and Y are not capable of deriving to w in
  1336. T *. Therefore, by eliminating the productions containing X and Y, we get:
  1337. Since A is a start symbol, it will always be used in the derivation of every w in L(G). And since A → xyz is a production
  1338. in the grammar, the terminals x, y, and z will also be used in the derivation. But no nonterminal Z occurs on the right
  1339. side of the A-production, so Z will not be used in the derivation of any w in L(G). Hence, by eliminating the productions
  1340. containing nonterminal Z, we get:
  1341. which is a grammar containing no useless grammar symbols.
  1342. EXAMPLE 3.3
  1343. Find the reduced grammar that is equivalent to the CFG given below:
  1344. Since C → ad is the production of the form A → w, where w is in T *, nonterminal C is capable of deriving to w in T *.
  1345. The production S → aC contains a terminal a on the right side as well as a nonterminal C that is known to be capable
  1346. of deriving to w in T *.
  1347. Hence, nonterminal S is also capable of deriving to w in T *. The right side of the production A → bSCa contains the
  1348. nonterminals S and C, which are known to be capable of deriving to w in T *. Hence, nonterminal A is also capable of
  1349. deriving to w in T *. There are two B-productions: B → aSB and B → bBC. The right side of these productions contain
  1350. the nonterminals S, B, and C; and even though S and C are known to be capable of deriving to w in T *, nonterminal B
  1351. is not. Hence, by eliminating the productions containing B, we get:
  1352. Since S is a start symbol, it will always be used in the derivation of every w in L(G). And since S → aC is a production
  1353. in the grammar, terminal a as well as nonterminal C will also be used in the derivation. But since a nonterminal C
  1354. occurs on the right side of the S-production, and C → ad is a production, terminal d will be used along with terminal a
  1355. in the derivation. A nonterminal A, though, occurs nowhere in the right side of either the S-production or the
  1356. C-production; it will not be used in the derivation of any w in L(G). Hence, by eliminating the productions containing
  1357. nonterminal A, we get:
  1358. which is a reduced grammar equivalent to the given grammar, but it contains no useless grammar symbols.
  1359. EXAMPLE 3.4
  1360. Find the useless symbols in the following grammar, and modify the grammar so that it has no useless symbols.
  1361. Since S → 0 and B → 1 are productions of the form A → w, where w is in T *, the nonterminals S and B are capable of
  1362. deriving to w in T *. The production A → AB contains the nonterminals A and B on the right side; and even though B is
  1363. known to be capable of deriving to w in T *, nonterminal A is not capable of deriving to w in T *. Therefore, by
  1364. eliminating the productions containing A, we get:
  1365. Since S is a start symbol, it will always be used in the derivation of any w in L(G). And because S → 0 is a production
in the grammar, terminal 0 will also be used in the derivation. But since nonterminal B does not occur anywhere in the right
side of the S-production, it will not be used in the derivation of any w in L(G). Hence, by eliminating the productions
  1368. containing nonterminal B, we get:
  1369. which is a grammar equivalent to the given grammar and contains no useless grammar symbols.
  1370. EXAMPLE 3.5
  1371. Find the useless symbols in the following grammar, and modify the grammar to obtain one that has no useless
  1372. symbols.
  1373. Since A → a and C → b are productions of the form A → w, where w is in T *, the nonterminals A and C are capable of
  1374. deriving to w in T *. The right side of the production S → CA contains nonterminals C and A, both of which are known
  1375. to be derivable to w in T *.
  1376. Hence, S is also capable of deriving to w in T *. There are two B-productions, B → BC and B → AB. The right side of
  1377. these productions contain the nonterminals A, B, and C. Even though A and C are known to be capable of deriving to
  1378. w in T *, nonterminal B is not capable of deriving to w in T *. Therefore, by eliminating the productions containing B, we
  1379. get:
  1380. Since S is a start symbol, it will always be used in the derivation of every w in L(G). And since S → CA is a production
  1381. in the grammar, nonterminals C and A will both be used in the derivation. For the productions A → a and C → b, the
  1382. terminals a and b will also be used in the derivation. Hence, every grammar symbol in the above grammar is useful.
  1383. Therefore, a grammar equivalent to the given grammar that contains no useless grammar symbols is:
  1384. 3.2.6 ∈ -Productions and Nullable Nonterminals
A production of the form A → ∈ is called an "∈-production". If A is a nonterminal, and if A →* ∈ (i.e., if A derives to an
empty string in zero, one, or more derivations), then A is called a "nullable nonterminal".
  1387. Algorithm for Identifying Nullable Nonterminals
  1388. Input: G = (V, T, P, S)
  1389. Output: Set N (i.e., the set of nullable nonterminals)
  1390. { we maintain N old and N new to continue iterations }
begin
    Nold = φ
    Nnew = φ
    for every production of the form A → ∈ do
        Nnew = Nnew ∪ { A }
    while (Nold ≠ Nnew) do
    begin
        temp = V − Nnew
        Nold = Nnew
        for every A in temp do
            for every A-production of the form A → X1X2...Xn in P do
                if each Xi is in Nold then
                    Nnew = Nnew ∪ { A }
    end
    N = Nnew
end
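The same fixed-point idea can be coded directly. The C sketch below (illustrative only, for the hypothetical grammar S → AB, A → aA | ∈, B → bB | ∈) computes the set N of nullable nonterminals.

#include <ctype.h>
#include <stdio.h>

/* Fixed-point computation of nullable nonterminals for the hypothetical grammar
 *   S -> AB,  A -> aA,  A -> <empty>,  B -> bB,  B -> <empty>
 * An empty right-hand side stands for an epsilon-production. */
static const char lhs[]  = { 'S', 'A', 'A', 'B', 'B' };
static const char *rhs[] = { "AB", "aA", "",  "bB", "" };
#define NPRODS 5

int main(void)
{
    int nullable[26] = {0};
    int changed = 1;
    while (changed) {                      /* iterate until N_old equals N_new     */
        changed = 0;
        for (int i = 0; i < NPRODS; i++) {
            if (nullable[lhs[i] - 'A'])
                continue;
            int all = 1;
            for (const char *p = rhs[i]; *p; p++)
                if (!isupper((unsigned char)*p) || !nullable[*p - 'A'])
                    all = 0;               /* a terminal or non-nullable symbol blocks it */
            if (all) {
                nullable[lhs[i] - 'A'] = 1;
                changed = 1;
            }
        }
    }
    printf("Nullable nonterminals:");
    for (int x = 'A'; x <= 'Z'; x++)
        if (nullable[x - 'A'])
            printf(" %c", x);
    printf("\n");
    return 0;
}

For this grammar the program reports A, B, and S as nullable, mirroring the hand computation of Example 3.6.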
  1407. EXAMPLE 3.6
  1408. Consider the following grammar and identify the nullable nonterminals.
  1409. By applying the above algorithm, the results after each iteration are shown below:
  1410. Initially:
  1411. After the first execution of the for loop:
  1412. After the first iteration of the while loop:
  1413. After the second iteration of the while loop:
  1414. After the third iteration of the while loop:
  1415. Therefore, N = { S, A, B, C }; and hence, all the nonterminals of the grammar are nullable.
  1416. 3.2.7 Eliminating ∈ -Productions
  1417. Given a grammar G that contains ∈ -productions, if L(G) does not contain ∈ , then it is possible to eliminate all
  1418. ∈ -productions in the given grammar G. Whereas, if L(G) contains ∈ , then elimination of all ∈ -productions from G
1419. gives a grammar G1 in which L(G1) = L(G) − { ∈ }. To eliminate the ∈-productions from a grammar, we use the
  1420. following technique.
  1421. If A → ∈ is an ∈ -production to be eliminated, then we look for all those productions in the grammar whose right side
contains A, and we erase each occurrence of A in these productions (that is, we replace that occurrence of A by ∈). Thus, we obtain the non-∈-productions to be
added to the grammar so that the language generated remains the same. For example, consider the following
  1424. grammar:
1425. To eliminate A → ∈ from the above grammar, we erase A on the right side of the production S → aA and obtain a
  1426. non- ∈ -production, S → a, which is added to the grammar as a substitute in order to keep the language generated by
  1427. the grammar the same. Therefore, the ∈ -free grammar equivalent to the given grammar is:
  1428. EXAMPLE 3.7
  1429. Consider the following grammar, and eliminate all the ∈ -productions from the grammar without changing the language
  1430. generated by the grammar.
  1431. To eliminate A → ∈ from this grammar, the non- ∈ -productions to be added are obtained as follows: the list of the
  1432. productions containing A on the right-hand side is:
  1433. Replace each occurrence of A in each of these productions in order to obtain the non- ∈ -productions to be added to
  1434. the grammar. The list of these productions is:
  1435. Add these productions to the grammar, and eliminate A → ∈ from the grammar. This gives us the following grammar:
  1436. To eliminate B → ∈ from the grammar, the non- ∈ -productions to be added are obtained as follows. The productions
  1437. containing B on the right-hand side are:
  1438. Replace each occurrence of B in these productions in order to obtain the non- ∈ -productions to be added to the
  1439. grammar. The list of these productions is:
1440. Add these productions to the grammar, and eliminate B → ∈ from the grammar in order to obtain the following:
  1441. EXAMPLE 3.8
  1442. Consider the following grammar and eliminate all the ∈ -productions without changing the language generated by the
  1443. grammar.
  1444. To eliminate A → ∈ from the grammar, the non- ∈ -productions to be added are obtained as follows: the list of
productions containing A on the right is:
Replace each occurrence of A in these productions to obtain the non-∈-productions to be added to the grammar. These
are:
  1448. Add these productions to the grammar, and eliminate A → ∈ from the grammar to obtain the following:
  1449. 3.2.8 Eliminating Unit Productions
  1450. A production of the form A → B, where A and B are both nonterminals, is called a "unit production". Unit productions in
  1451. the grammar increase the cost of derivations. The following algorithm can be used to eliminate unit productions from
  1452. the grammar:
while there exists a unit production A → B in the grammar do
{
    select a unit production A → B such that there exists
        at least one nonunit production B → α
    for every nonunit production B → α do
        add the production A → α to the grammar
    eliminate A → B from the grammar
}
  1462. EXAMPLE 3.9
  1463. Given the grammar shown below, eliminate all the unit productions from the grammar.
  1464. The given grammar contains the productions:
  1465. which are the unit productions. To eliminate these productions from the given grammar, we first select the unit
  1466. production B → C. But since no nonunit C-productions exist in the grammar, we then select C → D. But since no
  1467. nonunit D-productions exist in the grammar, we next select D → E. There does exist a nonunit E-production: E → a.
  1468. Hence, we add D → a to the grammar and eliminate D → E. But since B → C and C → D are still there, we once again
  1469. select unit production B → C. Since no nonunit C-production exists in the grammar, we select C → D. Now there exists
  1470. a nonunit production D → a in the grammar. Hence, we add C → a to the grammar and eliminate C → D. But since B
  1471. → C is still there in the grammar, we once again select unit production B → C. Now there exists a nonunit production C
  1472. → a in the grammar, so we add B → a to the grammar and eliminate B → C. Now no unit productions exist in the
  1473. grammar. Therefore, the grammar that we get that does not contain unit productions is:
  1474. But we see that the grammar symbols C, D, and E become useless as a result of the elimination of unit productions,
  1475. because they will not be used in the derivation of any w in L(G). Hence, we can eliminate them from the grammar to
  1476. obtain:
  1477. Therefore, we conclude that to obtain the grammar in the most simplified form, we have to eliminate unit productions
  1478. first. We then eliminate the useless grammar symbols.
  1479. 3.2.9 Eliminating Left Recursion
  1480. If a grammar contains a pair of productions of the form A → A α | β , then the grammar is a "left-recursive grammar". If
a left-recursive grammar is used for specification of the language, then a top-down parser for the language specified by
the grammar may enter into an infinite loop during the parsing process on some inputs. This is because a
top-down parser attempts to obtain the left-most derivation of the input string w; hence, the parser may see the same
nonterminal A every time as the left-most nonterminal. And every time, it may do the derivation using A → Aα.
Therefore, for top-down parsing, a nonleft-recursive grammar should be used. Left-recursion can be eliminated from the
grammar by replacing A → Aα | β with the productions A → βB and B → αB | ∈. In general, if a grammar contains the
productions:
    A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn
then the left-recursion can be eliminated by adding the following productions in place of the ones above:
    A → β1B | β2B | … | βnB
    B → α1B | α2B | … | αmB | ∈
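The following C sketch (illustrative, not from the book) shows the practical effect of this transformation on a hypothetical grammar A → Aa | b: a recursive routine written directly from the left-recursive production calls itself before consuming any input and never terminates, whereas routines written from the transformed productions A → bB, B → aB | ∈ consume a token before each recursive call.

#include <stdio.h>

/* Hypothetical left-recursive grammar:      A -> A a | b
 * Transformed (nonleft-recursive) grammar:  A -> b B      B -> a B | <empty>
 * Both generate the language: b followed by any number of a's. */
static const char *input;

/* Written directly from A -> A a: recurses before reading anything, so it
 * never terminates.  Shown only to illustrate the problem; it is never called. */
static void A_left_recursive(void)
{
    A_left_recursive();            /* tries A -> A a first: infinite recursion */
}

static void B(void)                /* B -> a B | <empty> */
{
    if (*input == 'a') {
        input++;                   /* consume one 'a', then recurse            */
        B();
    }                              /* otherwise take B -> <empty>              */
}

static int A(void)                 /* A -> b B */
{
    if (*input != 'b')
        return 0;
    input++;
    B();
    return *input == '\0';         /* succeed only if the whole input is used  */
}

int main(void)
{
    input = "baaa";
    printf("baaa %s\n", A() ? "is parsed" : "is rejected");
    return 0;
}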
  1489. EXAMPLE 3.10
  1490. Consider the following grammar:
  1491. The grammar is left-recursive because it contains a pair of productions, B → Bb | c. To eliminate the left-recursion from
  1492. the grammar, replace this pair of productions with the following productions:
  1493. Therefore, the grammar that we get after the elimination of left-recursion is:
  1494. EXAMPLE 3.11
  1495. Consider the following grammar:
  1496. The grammar is left-recursive because it contains the productions A → Ad | Ae | aB | aC. To eliminate the left-recursion
  1497. from the grammar, replace these productions by the following productions:
  1498. Therefore, the resulting grammar after the elimination of left-recursion is:
  1499. EXAMPLE 3.12
  1500. Consider the following grammar:
  1501. The grammar is left-recursive because it contains the productions L → L, S | S. To eliminate the left-recursion from the
  1502. grammar, replace these productions by the following productions:
  1503. Therefore, after the elimination of left-recursion, we get:
  1504. 3.3 REGULAR GRAMMAR
  1505. Regular grammar is a context-free grammar in which every production is restricted to one of the following forms:
1. A → aB, or
2. A → w, where A and B are the nonterminals, a is a terminal symbol, and w is in T*.
  1508. The ∈ -productions are permitted as a special case when L(G) contains ∈ . This grammar is called "regular grammar,"
  1509. because if the format of every production in CFG is restricted to A → aB or A → a, then the grammar can specify only
  1510. regular sets. Hence, a finite automata exists that accepts L(G), if G is regular grammar. Given a regular grammar G, a
  1511. finite automata accepting L(G) can be obtained as follows:
1. The number of states of the automata will be equal to the number of nonterminals of the grammar
   plus one; that is, there will be a state corresponding to every nonterminal of the grammar, and one
   more state, which will be the final state of the automata. The state corresponding to
   the start symbol of the grammar will be the initial state of the automata. If L(G) contains ∈, then
   make the start state also a final state.
2. The transitions in the automata can be obtained as follows:
       for every production A → aB do
           add a transition on a from the state corresponding to A to the state corresponding to B
       for every production of the form A → a do
           add a transition on a from the state corresponding to A to the final state
  1522. EXAMPLE 3.13
  1523. Consider the regular grammar shown below and the transition diagram of the automata, shown in Figure 3.7, that
  1524. accepts the language generated by the grammar.
  1525. Figure 3.7: Transition diagram for automata that accepts the regular grammar of Example 3.13.
  1526. This is a non-deterministic automata. Its deterministic equivalent can be obtained as follows:
             0           1
{ S }        { A, C }    { B, C }
*{ A, C }    { S }       { B, C }
*{ B, C }    { A }       { S }
{ A }        { S }       { B, C }
  1532. The transition diagram of the automata is shown in Figure 3.8.
  1533. Figure 3.8: Deterministic equivalent of the non-deterministic automata shown in Figure 3.7.
  1534. Consider the following grammar:
  1535. The transition diagram of the finite automata that accepts the language generated by the above grammar is shown in
  1536. Figure 3.9.
  1537. Figure 3.9: Non-deterministic automata.
  1538. This is a non-deterministic automata. Its deterministic equivalent can be obtained as follows, and the transition
  1539. diagram is shown in Figure 3.10.
  1540. Figure 3.10: Transition diagram for deterministic automata equivalent shown in Figure 3.9.
  1541. Given a finite automata M, a regular grammar G that generates L(M) can be obtained as follows:
1. Associate suitable variables like A, B, C, etc., with the states of the automata. The labels of the
   states can also be used as variable names.
2. Obtain the productions of the grammar as follows: if δ(A, a) = B, then add a production A → aB to
   the list of productions of the grammar. If B is a final state, then add either A → a or B → ∈ to the
   grammar's list of productions.
3. The variable associated with the initial state of the automata is the start symbol of the grammar.
  1550. For example consider the automata shown in Figure 3.11.
  1551. Figure 3.11: Regular-grammar automata.
  1552. The regular grammar that generates the language accepted by the automata shown in Figure 3.11 will have the
  1553. following productions:
  1554. or
  1555. where A is the start symbol. Both the grammars are same, but the first one contains ∈ -productions, whereas the
  1556. second is ∈ -free.
  1557. EXAMPLE 3.14
1558. Find out whether the following grammars generate the same language.
  1559. G 1 :
  1560. G 2 :
  1561. Since the grammars G 1 and G 2 are the regular grammars, L(G 1 ) = L(G 2 ) if the minimal state automata accepting
  1562. L(G 1 ), and the minimal state automata accepting L(G 2 ) are identical. The transition diagram of the automata accepting
  1563. L(G 1 ) is shown in Figure 3.12.
  1564. Figure 3.12: Transition diagram of automata that accepts L(G 1 ).
1565. The automata is deterministic. Hence, to minimize it, we proceed as follows. Since state D is an unreachable state,
  1566. eliminate it first. So, after eliminating state D, we get the transition diagram shown in Figure 3.13.
  1567. Figure 3.13: Transition diagram of automata after removal of state D.
  1568. We then identify the nondistinguishable states of the automata shown in Figure 3.13, as follows. Initially, we have two
  1569. groups:
  1570. Since
  1571. state B is distinguishable from rest of the members of Group I. Hence, we divide Group I into two groups—one
  1572. containing A, and other containing E and C, as shown below:
  1573. Since
  1574. partitioning of Group II is not possible, because the transitions from all the members of Group II only go to Group II.
  1575. Similarly:
  1576. Partitioning of Group II is not possible, because the transitions from all the members of Group II only go to Group I. And
  1577. since:
  1578. partitioning of Group III is not possible, because the transitions from all the members of Group III only go to Group I.
  1579. Similarly:
  1580. Partitioning of Group III is not possible, because the transitions from all the members of Group III only go to Group III.
  1581. Hence, states E and C are nondistinguishable states. States B and F are also nondistinguishable states. Therefore, if
  1582. we merge E and C to form a state E 1 , and we merge B and F to form B 1 , we get the automata shown in Figure 3.14.
  1583. Figure 3.14: Transition diagram for the automata that results from merged states.
  1584. Since no dead states exist in the automata shown in Figure 3.14, it is a minimal state automata that accepts L(G 1 ).
  1585. The transition diagram of the non-deterministic automata that accepts L(G 2 ) is shown in Figure 3.15.
  1586. Figure 3.15: Non-deterministic automata that accepts L(G 2 ).
  1587. Its equivalent deterministic automata is as follows, and the transition diagram is shown in Figure 3.16.
             0           1
{ X }        { Y, F }    { Z }
*{ Y, F }    { X }       { Y, F }
{ Z }        { Z }       { X }
  1592. Figure 3.16: Transition diagram of the equivalent deterministic automata for Figure 3.15.
  1593. This automata does not contain unreachable, nondistinguishable states or dead states. Hence, it is a minimal state
1594. automata accepting L(G 2 ), and since it is identical to the minimal state automata accepting L(G 1 ), L(G 1 ) = L(G 2 ); and
  1595. therefore, G 1 and G 2 generate the same language.
  1596. Obtaining a Regular Expression from the Regular Grammar
  1597. Given a regular grammar G, a regular expression that specifies L(G) can be directly obtained as follows:
1. Replace the " → " symbols in the grammar's productions with "=" symbols to get a set of equations.
2. Solve the set of equations obtained above to obtain the value of the variable S, where S is the start symbol of the grammar. The result is the regular expression specifying L(G).
For example, consider the following regular grammar:
Replacing the " → " symbol in the productions of the grammar with the "=" symbol, we get the following set of equations:
  1606. From equation (III) we get:
  1607. because equation (III) is of the form A = aA | b, where a and b are the expressions that do not contain variable A, and
  1608. the solution of this is A = a*b. Similarly, from equation (II) we get:
  1609. Substituting the values of A in (I) gives:
  1610. Hence, the required regular expression is:
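To see the solution method end to end, here is a small worked illustration on a made-up regular grammar (it is not the grammar of the example above, whose equations are not reproduced here). Suppose the productions are S → aS | bA and A → bA | c. Replacing → with = gives the equations

S = aS | bA     ... (I)
A = bA | c      ... (II)

Equation (II) is of the form A = aA | b, with neither part containing A, so by the rule stated above its solution is A = b*c. Substituting this value of A in (I) gives S = aS | bb*c, which is again of the same form; hence S = a*bb*c, and the regular expression specifying the language of this illustrative grammar is a*bb*c.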
  1611. 3.4 RIGHT LINEAR AND LEFT LINEAR GRAMMAR
  1612. 3.4.1 Right Linear Grammar
  1613. Right linear grammar is a context-free grammar in which every production is restricted to one of the following forms:
  1614. A → wB 1.
  1615. A → w, where A and B are the nonterminals, and w is in T * 2.
  1616. Since w is in T *, w can also be a single terminal; hence, every regular grammar, by default, satisfies this requirement
of a right linear grammar. Therefore, every regular grammar is a right linear grammar. Similarly, when | w | > 1, a production containing w on the right side can be split, by using additional nonterminals, into more than one production, each containing only one terminal and at most one nonterminal on the right side, because w can be written as ay, where a is the first terminal symbol of w and y is the string made of the remaining symbols of w. Therefore, a production A → wB
  1621. can be split into the productions A → aB 1 and B 1 → yB without affecting the language generated by the grammar. The
  1622. production B 1 → yB can be further split in a similar manner. And this can continue until | y | becomes one. A production
  1623. A → w can also be split into the productions A → aB 1 and B 1 → y without affecting the language generated by the
  1624. grammar. The production B 1 → y can be further split in a similar manner, and this can continue until | y | becomes one,
bringing the productions into the form required by a regular grammar. Therefore, we conclude that every right linear grammar can be rewritten in such a manner that every production of the grammar satisfies the requirements of a regular grammar. For example, consider the following grammar:
  1628. The grammar is a right linear grammar; the production S → aaB can be split into the productions S → aC and C → aB
  1629. without affecting what is derived from S. Similarly, the production S → ab can be split into the productions S → aD and
  1630. D → a. The production B → bb can also be split into the productions B → bE and E → b. Therefore, the above
  1631. grammar can be rewritten as:
  1632. which is a regular grammar.
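The splitting just described is mechanical, so it is easy to automate. Below is a minimal C sketch (the function, its names, and the way fresh nonterminals are chosen are assumptions of this illustration, not something given in the book) that rewrites a single right linear production A → wB, or A → w, into an equivalent chain of productions in the regular-grammar form; running it on S → aaB, S → ab, and B → bb reproduces the kind of splitting shown above, with machine-chosen names for the new nonterminals.

#include <stdio.h>
#include <string.h>

/* Illustrative sketch (not from the book): split a right linear production
 * lhs -> w B, where w is a string of terminals and B is a nonterminal, into a
 * chain of productions with one terminal and at most one nonterminal on the
 * right side.  Pass B = '\0' for a production of the form lhs -> w.
 * Fresh nonterminals are taken from next_name (assumed unused in the grammar). */
static char next_name = 'P';

void split_production(char lhs, const char *w, char B)
{
    size_t n = strlen(w);
    char cur = lhs;
    if (n == 0)
        return;                                     /* nothing to split       */
    for (size_t i = 0; i + 1 < n; i++) {
        char fresh = next_name++;
        printf("%c -> %c%c\n", cur, w[i], fresh);   /* cur -> terminal fresh  */
        cur = fresh;
    }
    if (B != '\0')
        printf("%c -> %c%c\n", cur, w[n - 1], B);   /* last terminal, then B  */
    else
        printf("%c -> %c\n", cur, w[n - 1]);        /* ends in a terminal     */
}

int main(void)
{
    split_production('S', "aa", 'B');   /* S -> aaB  becomes  S -> aP, P -> aB */
    split_production('S', "ab", '\0');  /* S -> ab   becomes  S -> aQ, Q -> b  */
    split_production('B', "bb", '\0');  /* B -> bb   becomes  B -> bR, R -> b  */
    return 0;
}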
  1633. 3.4.2 Left Linear Grammar
  1634. Left linear grammar is a context-free grammar in which every production is restricted to one of the following forms:
  1635. A → Bw 1.
  1636. A → w, where A and B are the nonterminals, and w is in T * 2.
  1637. For every left linear grammar, there exists an equivalent right linear grammar that generates the same language, and
  1638. vice versa. Hence, we conclude that every linear grammar (left or right) is a regular grammar. Given a right linear
  1639. grammar, an equivalent left linear grammar can be obtained as follows:
1. Obtain a regular expression for the language generated by the given grammar.
2. Reverse the regular expression obtained in step 1, above.
3. Obtain the regular, right linear grammar for the regular expression obtained in step 2.
4. Reverse the right side of every production of the grammar obtained in step 3. The resulting grammar will be an equivalent left linear grammar.
  1646. For example consider the right linear grammar given below:
  1647. The regular expression for the above grammar is obtained as follows. Replace the → by = in the above productions
  1648. to obtain the equations:
  1649. Solving equation (II) gives:
  1650. By substituting the value of B in (I), we get:
  1651. Therefore, the required regular expression is:
  1652. And the reverse regular expression is:
  1653. The finite automata accepting the language specified by the above regular expression is shown in Figure 3.17.
  1654. Figure 3.17: Finite automata accepting the right linear grammar for a regular expression.
  1655. Therefore, the right linear grammar that generates the language accepted by the automata in Figure 3.17 is:
  1656. Since C is not useful, eliminating C gives:
  1657. which can be further simplified by replacing D in B → 1D, using D → 0 to give:
  1658. Reversing the right side of the productions yields:
  1659. which is the equivalent left linear grammar. So, given a left linear grammar, an equivalent right linear grammar can be
  1660. obtained as follows:
1. Reverse the right side of every production of the given grammar.
2. Obtain a regular expression for the language generated by the grammar obtained in step 1, above.
3. Reverse the regular expression obtained in step 2.
4. Obtain the regular, right linear grammar for the regular expression obtained in step 3.
The resulting grammar will be an equivalent right linear grammar. For example, consider the following left linear grammar:
  1669. Reversing the right side of the productions gives us:
  1670. The regular expression that specifies the language generated by the above grammar can be obtained as follows.
  1671. Replace the → symbols with "=" symbols in the productions of the above grammar to get the following set of
  1672. equations:
  1673. From equation (II), we get:
  1674. Substituting this value in (I) gives us:
  1675. Therefore,
  1676. and the regular expression is:
  1677. The reversed regular expression is:
  1678. The finite automata that accepts the language specified by the reversed regular expression is shown in Figure 3.18.
  1679. Figure 3.18: Transition diagram for a finite automata specified by a reversed regular expression.
  1680. Therefore, the regular grammar that generates the language accepted by the automata shown in Figure 3.18 is:
  1681. which can be reduced to:
  1682. which is the required right linear grammar.
  1683. EXAMPLE 3.15
  1684. Consider the following grammar to obtain an equivalent left linear grammar.
  1685. The regular expression for the above grammar is obtained as follows. Replace the → by = in the above productions
  1686. to obtain the equations:
By substituting (III) in (II) we get:
  1688. Therefore, A = (a | gg)A | g and A = (a | gg)*g. By substituting this value in (I) we get:
  1689. And the regular expression is:
  1690. Therefore, the reversed regular expression is:
But since (a | gg)* is the same as (gg | a)*, the reversed regular expression is the same as the original one. Hence, the regular, right linear grammar that generates the language specified by the reversed regular expression is the given grammar itself.
  1693. Therefore, an equivalent left linear grammar can be obtained by reversing the right side of the productions of the given
  1694. grammar:
  1695. Chapter 4: Top-Down Parsing
  1696. INTRODUCTION
  1697. A syntax analyzer or parser is a program that performs syntax analysis. A parser obtains a string of tokens from the
  1698. lexical analyzer and verifies whether or not the string is a valid construct of the source language-that is, whether or not
  1699. it can be generated by the grammar for the source language. And for this, the parser either attempts to derive the
  1700. string of tokens w from the start symbol S, or it attempts to reduce w to the start symbol of the grammar by tracing the
  1701. derivations of w in reverse. An attempt to derive w from the grammar's start symbol S is equivalent to an attempt to
  1702. construct the top-down parse tree; that is, it starts from the root node and proceeds toward the leaves. Similarly, an
  1703. attempt to reduce w to the grammar's start symbol S is equivalent to an attempt to construct a bottom-up parse tree;
  1704. that is, it starts with w and traces the derivations in reverse, obtaining the root S.
  1705. 4.1 TOP-DOWN PARSING
  1706. Top-down parsing attempts to find the left-most derivations for an input string w, which is equivalent to constructing a
  1707. parse tree for the input string w that starts from the root and creates the nodes of the parse tree in a predefined order.
  1708. The reason that top-down parsing seeks the left-most derivations for an input string w and not the right-most
  1709. derivations is that the input string w is scanned by the parser from left to right, one symbol/token at a time, and the
  1710. left-most derivations generate the leaves of the parse tree in left-to-right order, which matches the input scan order.
  1711. Since top-down parsing attempts to find the left-most derivations for an input string w, a top-down parser may require
  1712. backtracking (i.e., repeated scanning of the input); because in the attempt to obtain the left-most derivation of the input
  1713. string w, a parser may encounter a situation in which a nonterminal A is required to be derived next, and there are
  1714. multiple A-productions, such as A → α 1 | α 2 | … | α n . In such a situation, deciding which A-production to use for the
  1715. derivation of A is a problem. Therefore, the parser will select one of the A-productions to derive A, and if this derivation
  1716. finally leads to the derivation of w, then the parser announces the successful completion of parsing. Otherwise, the
  1717. parser resets the input pointer to where it was when the nonterminal A was derived, and it tries another A-production.
  1718. The parser will continue this until it either announces the successful completion of the parsing or reports failure after
  1719. trying all of the alternatives. For example, consider the top-down parser for the following grammar:
  1720. Let the input string be w = acb. The parser initially creates a tree consisting of a single node, labeled S, and the input
  1721. pointer points to a, the first symbol of input string w. The parser then uses the S-production S → aAb to expand the
  1722. tree as shown in Figure 4.1.
  1723. Figure 4.1: Parser uses the S-production to expand the parse tree.
  1724. The left-most leaf, labeled a, matches the first input symbol of w. Hence, the parser will now advance the input pointer
  1725. to c, the second symbol of string w, and consider the next leaf labeled A. It will then expand A, using the first
  1726. alternative for A in order to obtain the tree shown in Figure 4.2.
  1727. Figure 4.2: Parser uses the first alternative for A in order to expand the tree.
The parser now has a match for the second input symbol. So, it advances the pointer to b, the third symbol of w, and compares it with the label of the next leaf. Since the next leaf is labeled d, which does not match b, it reports failure and goes back (backtracks) to A, as shown in Figure 4.3. The parser will also reset the input pointer to the second input symbol (the position it had when the parser encountered A), and it will try the second alternative for A in order to obtain the tree. If the leaf c
  1732. matches the second symbol, and if the next leaf b matches the third symbol of w, then the parser will halt and
  1733. announce the successful completion of parsing.
  1734. Figure 4.3: If the parser fails to match a leaf, the point of failure, d, reroutes (backtracks) the pointer to
  1735. alternative paths from A.
  1736. 4.2 IMPLEMENTATION
  1737. A top-down parser can be implemented by writing a set of recursive procedures to process the input. One procedure
  1738. will take care of the left-most derivations for each nonterminal while processing the input. Each procedure should also
  1739. provide for the storing of the input pointer in some local variable so that it can be reset properly when the parser
  1740. backtracks. This implementation, called a "recursive descent parser," is a top-down parser for the above-described
  1741. grammar that can be implemented by writing the following set of procedures:
S( )
{
    if (input == 'a')
    {
        advance( );
        if (A( ) != error)
        {
            if (input == 'b')
            {
                advance( );
                if (input == endmarker)
                    return(success);
                else
                    return(error);
            }
            else
                return(error);
        }
        else
            return(error);           /* A could not be derived */
    }
    else
        return(error);
}
A( )
{
    if (input == 'c')
    {
        advance( );
        if (input == 'd')            /* try A -> cd; otherwise settle for A -> c */
            advance( );
        return(success);
    }
    else
        return(error);
}
main( )
{
    /* Append the endmarker to the string w to be parsed;  */
    /* Set the input pointer to the left-most token of w;  */
    if (S( ) != error)
        printf("Successful completion of the parsing");
    else
        printf("Failure");
}
where advance() is a routine that, when called, advances the input pointer to the next symbol of the input string w.
  1781. Caution In a backtracking parser, the order in which alternatives are tried affects the language accepted by the parser.
  1782. For example, in the above parser, if a production A → c is tried before A → cd, then the parser will fail to accept
  1783. the string w = acdb, because it first expands S, as shown in Figure 4.4.
  1784. Figure 4.4: The parser first expands S and fails to accept w = acdb.
  1785. The first input symbol matches the left-most leaf; and therefore, the parser will advance the pointer to c and consider
  1786. the nonterminal A for expansion in order to obtain the tree shown in Figure 4.5.
  1787. Figure 4.5: The parser advances to c and considers nonterminal A for expension.
  1788. The second input symbol also matches. Therefore, the parser will advance the pointer to d, the third input symbol,
  1789. and consider the next leaf, labeled b in Figure 4.5. It finds that there is no match; and therefore, it will backtrack to S
  1790. (as shown in Figure 4.5 by the thick arrow). But since there is no alternative to S that can be tried, the parser will return
  1791. failure. Because the point of mismatch is the descendent of a node labeled by S, the parser will backtrack to S. It
  1792. cannot backtrack to A. Therefore, the parser will not accept the string acdb. Whereas, if the parser tries the alternative
  1793. A → cd first and A → c second, then the parser is capable of accepting the string acdb as well as acb because, for the
  1794. string w = acb, when the parser encounters a mismatch, it is at a node labeled by d, which is a descendent of a node
  1795. labeled by A. Hence, it will backtrack to A and try A → c, and end up in the parse tree for acb. Hence, we conclude that
the order in which alternatives are tried in a backtracking parser affects the language accepted by the compiler or parser.
  1798. EXAMPLE 4.1
  1799. Consider a grammar S → aa | aSa. If a top-down backtracking parser for this grammar tries S → aSa before S → aa,
  1800. show that the parser succeeds on two occurrences of a and four occurrences of a, but not on six occurrences of a.
  1801. In the case of two occurrences of a, the parser will first expand S, as shown in Figure 4.6.
  1802. Figure 4.6: The parser first expands S.
  1803. The first input symbol matches the left-most leaf. Therefore, the parser will advance the pointer to a second a and
  1804. consider the nonterminal S for expansion in order to obtain the tree shown in Figure 4.7.
  1805. Figure 4.7: The parser advances the pointer to a second occurrence of a.
  1806. The second input symbol also matches. Therefore, the parser will consider the next leaf labeled S and expand it, as
  1807. shown in Figure 4.8.
  1808. Figure 4.8: The parser expands the next leaf labeled S.
  1809. The parser now finds that there is no match. Therefore, it will backtrack to S, as shown by the thick arrow in Figure
  1810. 4.9. The parser then continues matching and backtracking, as shown in Figures 4.10 through 4.15, until it arrives at the
  1811. required parse tree, shown in Figure 4.16.
  1812. Figure 4.9: The parser finds no match, so it backtracks.
  1813. Figure 4.10: The parser tries an alternate aa.
  1814. Figure 4.11: There is no further alternate of S that can be tried, so the parser will backtrack one more step.
  1815. Figure 4.12: The parser again finds a mismatch; hence, it backtracks.
  1816. Figure 4.13: The parser tries an alternate aa.
  1817. Figure 4.14: Since no alternate of S remains to be tried, the parser backtracks one more step.
  1818. Figure 4.15: The parser tries an alternate aa.
  1819. Figure 4.16: The parser arrives at the required parse tree.
  1820. Now, consider a string of four occurrences of a. The parser will first expand S, as shown in Figure 4.17.
  1821. Figure 4.17: The parser first expands S.
  1822. The first input symbol matches the left-most leaf. Therefore, the parser will advance the pointer to a second a and
  1823. consider the nonterminal S for expansion, obtaining the tree shown in Figure 4.18.
  1824. Figure 4.18: The parser advances the pointer to a second occurrence of a.
  1825. The second input symbol also matches. Therefore, the parser will consider the next leaf labeled by S and expand it, as
  1826. shown in Figure 4.19.
  1827. Figure 4.19: The parser considers the next leaf labeled by S.
  1828. The third input symbol also matches. So, the parser moves on to the next leaf labeled by S and expands it, as shown
  1829. in Figure 4.20.
  1830. Figure 4.20: The parser matches the third input symbol and moves on to the next leaf labeled by S.
  1831. The fourth input symbol also matches. Therefore, the next leaf labeled by S is considered. The parser expands it, as
  1832. shown in Figure 4.21.
  1833. Figure 4.21: The parser considers the fourth occurrence of the input symbol a.
  1834. Now it finds that there is no match. Therefore, it will backtrack to S (Figure 4.22) and continue backtracking, as shown
  1835. in Figures 4.23 through 4.30, until the parser finally arrives at the successful generation of a parse tree for aaaa in
  1836. Figure 4.31.
  1837. Figure 4.22: The parser finds no match, so it backtracks.
  1838. Figure 4.23: The parser tries an alternate aa.
  1839. Figure 4.24: No alternate of S can be tried, so the parser will backtrack one more step.
  1840. Figure 4.25: Again finding a mismatch, the parser backtracks.
  1841. Figure 4.26: The parser then tries an alternate.
  1842. Figure 4.27: No alternate of S remains to be tried, so the parser will backtrack one more step.
  1843. Figure 4.28: The parser again finds a mismatch; therefore, it backtracks.
  1844. Figure 4.29: The parser tries an alternate aa.
  1845. Figure 4.30: The parser then tries an alternate aa.
  1846. Figure 4.31: The parser successfully generates the parse tree for aaaa.
  1847. Now consider a string of six occurrences of a. The parser will first expand S, as shown in Figure 4.32.
  1848. Figure 4.32: The parser expands S.
  1849. The first input symbol matches the left-most leaf. Therefore, the parser will advance the pointer to the second a and
  1850. consider the nonterminal S for expansion. The tree shown in Figure 4.33 is obtained.
  1851. Figure 4.33: The parser matches the first symbol, advances to the second occurrence of a, and considers S for
  1852. expansion.
  1853. The second input symbol also matches. Therefore, the parser will consider next leaf labeled S and expand it, as
  1854. shown in Figure 4.34.
  1855. Figure 4.34: The parser finds a match for the second occurrence of a and expands S.
  1856. The third input symbol also matches, as do the fourth through sixth symbols. In each case, the parser will consider
  1857. next leaf labeled S and expand it, as shown in Figures 4.35 through 4.38.
  1858. Figure 4.35: The parser matches the third input symbol, considers the next leaf, and expands S.
  1859. Figure 4.36: The parser matches the fourth input symbol, considers the next leaf, and expands S.
  1860. Figure 4.37: A match is found for the fifth input symbol, so the parser considers the next leaf, and expands S.
  1861. Figure 4.38: The sixth input symbol also matches. So the next leaf is considered, and S is expanded.
  1862. Now the parser finds that there is no match. Therefore, it will backtrack to S, as shown by the thick arrow in Figure
  1863. 4.39.
  1864. Figure 4.39: No match is found, so the parser backtracks to S.
  1865. Since there is no alternate of S that can be tried, the parser will backtrack one more step, as shown in Figure 4.40.
  1866. This procedure continues (Figures 4.41 through 4.47), until the parser tries the sixth alternate aa (Figure 4.48) and
  1867. fails to find a match.
  1868. Figure 4.40: The parser backtracks one more step.
  1869. Figure 4.41: The parser tries the alternate aa.
  1870. Figure 4.42: Again, a mismatch is found. So, the parser backtracks.
  1871. Figure 4.43: No alternate of S remains, so the parser will back-track one more step.
  1872. Figure 4.44: The parser tries an alternate aa.
  1873. Figure 4.45: Again, a mismatch is found. The parser backtracks.
  1874. Figure 4.46: The parser then tries an alternate aa.
  1875. Figure 4.47: A mismatch is found, and the parser backtracks.
  1876. Figure 4.48: The parser tries for the alternate aa, fails to find a match, and cannot generate the parse tree for six
  1877. occurrences of a.
  1878. 4.3 THE PREDICTIVE TOP-DOWN PARSER
  1879. A backtracking parser is a non-deterministic recognizer of the language generated by the grammar. The backtracking
  1880. problems in the top-down parser can be solved; that is, a top-down parser can function as a deterministic recognizer if
  1881. it is capable of predicting or detecting which alternatives are right choices for the expansion of nonterminals (that
  1882. derive to more than one alternative) during the parsing of input string w. By carefully writing a grammar, eliminating
left recursion, and left-factoring the result, we obtain a grammar that can be parsed by a top-down parser. With such a grammar, the parser will be able to predict the right alternative for the expansion of a nonterminal during the parsing process; and hence, it need not backtrack.
  1886. If A → α 1 | α 2 | … | α n are the A-productions in the grammar, then a top-down parser can decide if a nonterminal A is
  1887. to be expanded or not. And if it is to be expanded, the parser decides which A-production should be used. It looks at
the next input symbol and finds out which of the α i derives to a string that starts with the terminal symbol coming next in the input. If none of the α i derives to a string starting with that terminal symbol, the parser reports failure; otherwise,
  1890. it carries out the derivation of A using a production A → α i , where α i derives to a string whose first terminal symbol is
  1891. the symbol coming next in the input. Therefore, we conclude that if the set of first-terminal symbols of the strings
  1892. derivable from α i is computed for each α i , and this set is made available to the parser, then the parser can predict the
  1893. right choice for the expansion of nonterminal A. This information can be easily computed using the productions of the
  1894. grammar. We define a function FIRST( α ), where α is in (V ∪ T)*, as follows:
  1895. FIRST( α ) = Set of those terminals with which the strings derivable from α start
  1896. If α = XYZ, then FIRST( α ) is computed as follows:
  1897. FIRST( α ) = FIRST(XYZ) = { X } if X is terminal.
  1898. Otherwise,
  1899. FIRST( α ) = FIRST(XYZ) = FIRST(X) if X does not derive to an empty string; that is, if
  1900. FIRST(X) does not contain ∈ .
  1901. If FIRST(X) contains ∈ , then
  1902. FIRST( α ) = FIRST(XYZ) = FIRST(X) − { ∈ } ∪ FIRST(YZ)
  1903. FIRST(YZ) is computed in an identical manner:
  1904. FIRST(YZ) = { Y } if Y is terminal.
  1905. Otherwise,
  1906. FIRST(YZ) = FIRST(Y) if Y does not derive to an empty string (i.e., if FIRST(Y) does not contain ∈ ). If FIRST(Y)
  1907. contains ∈ , then
  1908. FIRST(YZ) = FIRST(Y) − { ∈ } ∪ FIRST(Z)
  1909. For example, consider the grammar:
  1910. FIRST(S) = FIRST(ACB) ∪ FIRST(CbB) ∪
  1911. FIRST(A) = FIRST(da) ∪ FIRST(BC)
  1912. FIRST(B) = FIRST(g) ∪ FIRST( ∈ )
  1913. FIRST(C) = FIRST(h) ∪ FIRST( ∈ )
  1914. Therefore:
  1915. FIRST(BC) = FIRST(B) − { ∈ } ∪ FIRST(C)
  1916. Substituting in (II) we get:
  1917. FIRST(A)={ d } ∪ { g, h, ∈ }
  1918. FIRST(ACB) =FIRST(A) − { ∈ } ∪ FIRST(CB)
  1919. FIRST(CB) =FIRST(C) − { ∈ } ∪ FIRST(B)
  1920. Therefore, substituting in (III) we get:
  1921. FIRST(ACB)={ d, g, h, ∈ } ∪ { g, h, ∈ }
  1922. Similarly,
  1923. FIRST(CbB) =FIRST(C) − { ∈ } ∪ FIRST(bB)
  1924. Similarly,
  1925. FIRST(Ba) =FIRST(B) − { ∈ } ∪ FIRST(a)
  1926. Therefore, substituting in (I), we get:
  1927. FIRST(S)={ d, g, h, ∈ } ∪ { b, h, ∈ } ∪ { a, g, ∈ }
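The two rules above lend themselves to a simple fixed-point computation: keep applying them to every production until no FIRST set grows any further. The following C sketch is illustrative only (the representation, the grammar chosen, and all names are assumptions of this sketch, not the book's code); it computes FIRST for a small grammar with ∈-productions, S → aABb, A → c | ∈, B → d | ∈, using the character '#' to stand for ∈.

#include <stdio.h>
#include <ctype.h>

/* Illustrative sketch: iterative (fixed-point) computation of FIRST sets.
 * Nonterminals are upper-case letters, terminals are lower-case letters, an
 * empty right side stands for an epsilon-production, and '#' stands for
 * epsilon inside a FIRST set. */
#define EPS '#'

struct prod { char lhs; const char *rhs; };

static struct prod g[] = {
    { 'S', "aABb" }, { 'A', "c" }, { 'A', "" }, { 'B', "d" }, { 'B', "" }
};
static int nprod = sizeof g / sizeof g[0];

static int first[128][128];        /* first[X][a] != 0 means a is in FIRST(X) */

static int add(int X, int a)       /* returns 1 if a was newly added */
{
    if (first[X][a]) return 0;
    first[X][a] = 1;
    return 1;
}

static void compute_first(void)
{
    int changed = 1;
    while (changed) {              /* repeat until no FIRST set grows */
        changed = 0;
        for (int i = 0; i < nprod; i++) {
            const char *r = g[i].rhs;
            int A = g[i].lhs, nullable = 1;
            for (int j = 0; r[j] != '\0' && nullable; j++) {
                int X = r[j];
                nullable = 0;
                if (islower(X)) {                   /* terminal: FIRST is { X }      */
                    changed |= add(A, X);
                } else {                            /* nonterminal: FIRST(X) - {EPS} */
                    for (int a = 0; a < 128; a++)
                        if (first[X][a] && a != EPS)
                            changed |= add(A, a);
                    if (first[X][EPS])
                        nullable = 1;               /* X can vanish: look further    */
                }
            }
            if (nullable)                           /* whole right side can vanish   */
                changed |= add(A, EPS);
        }
    }
}

int main(void)
{
    compute_first();
    for (int X = 'A'; X <= 'Z'; X++) {
        int empty = 1;
        for (int a = 0; a < 128; a++) if (first[X][a]) empty = 0;
        if (empty) continue;
        printf("FIRST(%c) = {", X);
        for (int a = 0; a < 128; a++)
            if (first[X][a]) printf(" %c", a);
        printf(" }\n");
    }
    return 0;
}

For this illustrative grammar the computation yields FIRST(S) = { a }, FIRST(A) = { c, ∈ }, and FIRST(B) = { d, ∈ }.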
  1928. EXAMPLE 4.2
  1929. Consider the following grammar:
  1930. FIRST(aAb)= { a }
  1931. FIRST(cd)= { c }, and
  1932. FIRST(ef)= { e }
  1933. Hence, while deriving S, the parser looks at the next input symbol. And if it happens to be the terminal a, then the
  1934. parser derives S using S → aAb. Otherwise, the parser reports an error. Similarly, when expanding A, the parser looks
  1935. at the next input symbol; if it happens to be the terminal c, then the parser derives A using A → cd. If the next terminal
  1936. input symbol happens to be e, then the parser derives A using A → ef. Otherwise, an error is reported.
  1937. Therefore, we conclude that if the right-hand FIRST for the production S → aAb is computed, we can decide when the
  1938. parser should do the derivation using the production S → aAb. Similarly, if the right-hand FIRST for the productions A
  1939. → cd and A → ef are computed, then we can decide when derivation is to be done using A → cd and A → ef,
respectively. These decisions can be encoded in the form of a table, as shown in Table 4.1, and can be made available to the parser for the correct selection of productions for derivations during parsing.
Table 4.1: Production Selections for Parsing Derivations
            a          b      c          d      e          f      $
  S     S → aAb
  A                           A → cd            A → ef
The number of rows of the table is equal to the number of nonterminals, whereas the number of columns is equal to the number of terminals, including the end marker. The parser uses the nonterminal to be derived as the row index of the table, and the next input symbol as the column index, when it decides which production is to be used for the derivation. Here, the production S → aAb is added in the table at [S, a] because FIRST(aAb) contains the terminal a.
  1954. Hence, S must be derived using S → aAb if and only if the terminal symbol coming next in the input is a. Similarly, the
  1955. production A → cd is added at [A, c], because FIRST(cd) contain c. Hence, A must be derived using A → cd if and only
  1956. if the terminal symbol coming next in the input is c. Finally, A must be derived using A → ef if and only if the terminal
  1957. symbol coming next in the input is e. Hence, the production A → ef is added at [A, e]. Therefore, we conclude that the
  1958. table can be constructed as follows:
  1959. for every production A → α do
  1960. for every a in FIRST( α ) do
  1961. TABLE[A, a] = A → α
  1962. Using the above method, every production of the grammar gets added into the table at the proper place when the
  1963. grammar is ∈ -free. But when the grammar is not ∈ -free, ∈ -productions will not get added to the table. If there is an
  1964. ∈ -production A → ∈ in the grammar, then deciding when A is to be derived to ∈ is not possible using the production's
  1965. right-hand FIRST. Some additional information is required to decide where the production A → ∈ is to be added to the
  1966. table.
  1967. Tip The derivation by A → ∈ is a right choice when the parser is on the verge of expanding the nonterminal A and the
  1968. next input symbol happens to be a terminal, which can occur immediately following A in any string occurring on the
  1969. right side of the production. This will lead to the expansion of A to ∈ , and the next leaf in the parse tree will be
  1970. considered, which is labeled by the symbol immediately following A and, therefore, may match the next input
  1971. symbol.
  1972. Therefore, we conclude that the production A → ∈ is to be added in the table at [A, b] for every b that immediately
  1973. follows A in any of the production grammar's right-hand strings. To compute the set of all such terminals, we make use
  1974. of the function FOLLOW(A), where A is a nonterminal, as defined below:
  1975. FOLLOW(A) = Set of terminals that immediately follow A in any string occurring on the right side of productions of the
  1976. grammar
  1977. For example, if A → α B β is a production, then FOLLOW(B) can be computed using A → α B β , as shown below:
FOLLOW(B) = FIRST( β ) if FIRST( β ) does not contain ∈. If FIRST( β ) contains ∈ (or if β is empty), then everything in FOLLOW(A) is also in FOLLOW(B); and since the right-end marker $ follows the start symbol, $ is placed in FOLLOW(S), where S is the start symbol.
  1979. Therefore, we conclude that when the grammar is not ∈ -free, then the table can be constructed as follows:
1. Compute FIRST and FOLLOW for every nonterminal of the grammar.
2. For every production A → α, do:
   {
       for every non-∈ member a in FIRST(α) do
           TABLE[A, a] = A → α
       if FIRST(α) contains ∈ then
           for every b in FOLLOW(A) do
               TABLE[A, b] = A → α
   }
  1990. Therefore, we conclude that if the table is constructed using the above algorithm, a top-down parser can be
  1991. constructed that will be a nonbacktracking, or ‘predictive’ parser.
  1992. 4.3.1 Implementation of a Table-Driven Predictive Parser
  1993. A table-driven parser can be implemented using an input buffer, a stack, and a parsing table. The input buffer is used
to hold the string to be parsed. The string is followed by a "$" symbol that is used as a right-end marker to indicate the end of the input string. The stack is used to hold the sequence of grammar symbols. A "$" indicates the bottom of the stack. Initially, the stack has the start symbol of the grammar above the $. The parsing table is a table obtained by using the algorithm presented in the previous section. It is a two-dimensional array TABLE[A, a], where A is a
  1998. nonterminal and a is a terminal, or $ symbol. The parser is controlled by a program that behaves as follows:
1. The program considers X, the symbol on the top of the stack, and the next input symbol a.
2. If X = a = $, then the parser announces the successful completion of the parsing and halts.
3. If X = a ≠ $, then the parser pops X off the stack and advances the input pointer to the next input symbol.
4. If X is a nonterminal, then the program consults the parsing table entry TABLE[X, a]. If TABLE[X, a] = X → UVW, then the parser replaces X on the top of the stack by UVW in such a manner that U will come on the top. If TABLE[X, a] = error, then the parser calls the error-recovery routine.
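The control program described in steps 1 through 4 can be written down almost directly. The following C sketch is illustrative only (the representation, the hard-coded table() function, and all names are assumptions of this sketch, not the book's code). It assumes the grammar of the example that follows, S → aABb, A → c | ∈, B → d | ∈, with ∈ written as the empty string, and it expects the input string to end with the $ marker.

#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* Illustrative table-driven predictive parser for S -> aABb, A -> c | "",
 * B -> d | "".  The table() function plays the role of TABLE[X, a];
 * NULL stands for an error entry. */
const char *table(char X, char a)
{
    if (X == 'S' && a == 'a') return "aABb";
    if (X == 'A') {
        if (a == 'c') return "c";
        if (a == 'b' || a == 'd') return "";      /* A -> epsilon */
    }
    if (X == 'B') {
        if (a == 'd') return "d";
        if (a == 'b') return "";                  /* B -> epsilon */
    }
    return NULL;                                  /* error entry  */
}

int parse(const char *w)                          /* w must end in '$' */
{
    char stack[100];
    int top = 0, i = 0;                           /* i is the input pointer */
    stack[top++] = '$';                           /* bottom-of-stack marker */
    stack[top++] = 'S';                           /* start symbol above $   */

    for (;;) {
        char X = stack[top - 1], a = w[i];
        if (X == '$' && a == '$') return 1;       /* successful completion  */
        if (X == a) { top--; i++; continue; }     /* match: pop and advance */
        if (isupper(X)) {                         /* nonterminal: expand    */
            const char *rhs = table(X, a);
            if (rhs == NULL) return 0;            /* error entry            */
            top--;                                /* pop X ...              */
            for (int k = (int)strlen(rhs) - 1; k >= 0; k--)
                stack[top++] = rhs[k];            /* ... push rhs reversed  */
            continue;
        }
        return 0;    /* terminal on top that does not match the input */
    }
}

int main(void)
{
    printf("acdb : %s\n", parse("acdb$") ? "accepted" : "error");
    printf("ab   : %s\n", parse("ab$")   ? "accepted" : "error");
    return 0;
}

For the inputs acdb and ab this sketch announces success, mirroring the moves shown in Tables 4.3 and 4.4 below.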
  2008. For example consider the following grammar:
  2009. FIRST(S) = FIRST(aABb) = { a }
  2010. FIRST(A) = FIRST(c) ∪ FIRST( ∈ ) = { c, ∈ }
  2011. FIRST(B) = FIRST(d) ∪ FIRST( ∈ ) = { d, ∈ }
  2012. Since the right-end marker $ is used to mark the bottom of the stack, $ will initially be immediately below S (the start
  2013. symbol) on the stack; and hence, $ will be in the FOLLOW(S). Therefore:
  2014. Using S → aABb, we get:
  2015. Therefore, the parsing table is as shown in Table 4.2.
Table 4.2: Production Selections for Parsing Derivations
            a           b          c          d          $
  S     S → aABb
  A                     A → ∈      A → c      A → ∈
  B                     B → ∈                 B → d
  2026. Consider an input string acdb. The various steps in the parsing of this string, in terms of the contents of the stack and
  2027. unspent input, are shown in Table 4.3.
Table 4.3: Steps Involved in Parsing the String acdb
Stack Contents    Unspent Input    Moves
$S                acdb$            Derivation using S → aABb
$bBAa             acdb$            Popping a off the stack and advancing one position in the input
$bBA              cdb$             Derivation using A → c
$bBc              cdb$             Popping c off the stack and advancing one position in the input
$bB               db$              Derivation using B → d
$bd               db$              Popping d off the stack and advancing one position in the input
$b                b$               Popping b off the stack and advancing one position in the input
$                 $                Announce successful completion of the parsing
  2041. Similarly, for the input string ab, the various steps in the parsing of the string, in terms of the contents of the stack and
  2042. unspent input, are shown in Table 4.4.
Table 4.4: Production Selections for String ab Parsing Derivations
Stack Contents    Unspent Input    Moves
$S                ab$              Derivation using S → aABb
$bBAa             ab$              Popping a off the stack and advancing one position in the input
$bBA              b$               Derivation using A → ∈
$bB               b$               Derivation using B → ∈
$b                b$               Popping b off the stack and advancing one position in the input
$                 $                Announce successful completion of the parsing
For a string aab, the various steps in the parsing of the string, in terms of the contents of the stack and unspent input, are shown in Table 4.5.
Table 4.5: Production Selections for Parsing Derivations for the String aab
Stack Contents    Unspent Input    Moves
$S                aab$             Derivation using S → aABb
$bBAa             aab$             Popping a off the stack and advancing one position in the input
$bBA              ab$              Calling an error-handling routine, because TABLE[A, a] is an error entry
  2062. The heart of the table-driven parser is the parsing table-the parser looks at the parsing table to decide which
  2063. alternative is a right choice for the expansion of a nonterminal during the parsing of the input string. Hence,
  2064. constructing a table-driven predictive parser can be considered as equivalent to constructing the parsing table.
  2065. A parsing table for any grammar can be obtained by the application of the above algorithm; but for some grammars,
some of the entries in the parsing table may end up being multiply defined entries, whereas for certain grammars, all
  2067. of the entries in the parsing table are singly defined entries. If the parsing table contains multiple entries, then the
  2068. parser is still non-deterministic. The parser will be a deterministic recognizer if and only if there are no multiple entries
  2069. in the parsing table. All such grammars (i.e., those grammars that, after applying the algorithm above, contain no
  2070. multiple entries in the parsing table) constitute a subset of CFGs called "LL(1)" grammars. Therefore, a given grammar
is LL(1) if its parsing table, constructed by the algorithm above, contains no multiple entries. If the table contains multiple
  2072. entries, then the grammar is not LL(1).
  2073. In the acronym LL(1), the first L stands for the left-to-right scan of the input, the second L stands for the left-most
  2074. derivation, and the (1) indicates that the next input symbol is used to decide the next parsing process (i.e., length of the
  2075. lookahead is "1").
  2076. In the LL(1) parsing system, parsing is done by scanning the input from left to right, and an attempt is made to derive
  2077. the input string in a left-most order. The next input symbol is used to decide what is to be done next in the parsing
  2078. process. The predictive parser discussed above, therefore, is a LL(1) parser, because it also scans the input from left
  2079. to right and attempts to obtain the left-most derivation of it; and it also makes use of the next input symbol to decide
  2080. what is to be done next. And if the parsing table used by the predictive parser does not contain multiple entries, then
  2081. the parser acts as a recognizer of only the members of L(G); hence, the grammar is LL(1).
  2082. Therefore, LL(1) is the grammar for which an LL(1) parser can be constructed, which acts as a deterministic recognizer
  2083. of L(G). If a grammar is LL(1), then a deterministic top-down table-driven recognizer can be constructed to recognize
L(G). A parsing table constructed for a given grammar G will have multiple entries if the grammar contains alternative productions for the same nonterminal; that is, the grammar contains productions A → α | β, and both α and β derive to a string that starts with the same terminal symbol. Therefore, one of the basic requirements for a grammar to be considered LL(1), when the grammar contains alternative productions for the same nonterminal, is that:
  2088. for every pair of productions A → α | β
  2089. FIRST( α ) ∩ FIRST( β ) = φ (i.e., FIRST( α ) and FIRST( β ) should be disjoint sets for every pair of productions A → α | β )
For a grammar to be LL(1), the satisfaction of the condition above is necessary as well as sufficient if the grammar is ∈-free. When the grammar is not ∈-free, then the satisfaction of the above condition is necessary but not sufficient, because either FIRST(α) or FIRST(β) might contain ∈, but not both. The above condition will still be satisfied; but if FIRST(β) contains ∈, then the production A → β will be added in the table on all terminals in FOLLOW(A). Hence, it is also required that FIRST(α) and FOLLOW(A) contain no common symbols. Therefore, an additional condition must be
  2095. satisfied in order for a grammar to be LL(1). When the grammar is not ∈ -free: for every pair of productions A → α | β
  2096. if FIRST( β ) contains ∈ , and FIRST( α ) does not contain ∈ , then
  2097. FIRST( α ) ∩ FOLLOW(A) = φ
  2098. Therefore, for a grammar to be LL(1), the following conditions must be satisfied:
For every pair of productions A → α | β:
(1) FIRST( α ) ∩ FIRST( β ) = φ
and
(2) if FIRST( β ) contains ∈, and FIRST( α ) does not contain ∈, then
    FIRST( α ) ∩ FOLLOW(A) = φ
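Once FIRST and FOLLOW are available, the two conditions above can be checked mechanically for each pair of alternatives. The following small C function is an illustrative sketch (the bit-mask representation and all names are assumptions of this sketch, not the book's): each set is a bit mask over the terminals, with one bit reserved for ∈.

#include <stdio.h>

/* Illustrative sketch: checking the two LL(1) conditions for one pair of
 * alternatives A -> alpha | beta.  Each set is a bit mask over the
 * terminals; bit 0 is reserved for epsilon. */
#define EPS_BIT 1ul

int ll1_pair_ok(unsigned long first_alpha,
                unsigned long first_beta,
                unsigned long follow_A)
{
    if (first_alpha & first_beta)           /* (1) FIRST sets must be disjoint */
        return 0;
    if ((first_beta & EPS_BIT) &&           /* (2) if beta can vanish,         */
        (first_alpha & follow_A))           /*     FIRST(alpha) and FOLLOW(A)  */
        return 0;                           /*     must also be disjoint       */
    if ((first_alpha & EPS_BIT) &&          /*     and symmetrically for alpha */
        (first_beta & follow_A))
        return 0;
    return 1;
}

int main(void)
{
    /* Toy usage for the pair S -> AaAb | BbBa of Example 4.3 below:
     * FIRST(AaAb) = { a }, FIRST(BbBa) = { b }; bit 1 stands for a, bit 2
     * for b.  FOLLOW(S) is irrelevant here, as neither alternative derives
     * the empty string. */
    unsigned long first_alpha = 1ul << 1, first_beta = 1ul << 2;
    printf("%s\n", ll1_pair_ok(first_alpha, first_beta, 0ul)
                       ? "pair satisfies the LL(1) conditions" : "conflict");
    return 0;
}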
  2107. 4.3.2 Examples
  2108. EXAMPLE 4.3
  2109. Test whether the grammar is LL(1) or not, and construct a predictive parsing table for it.
  2110. Since the grammar contains a pair of productions S → AaAb | BbBa, for the grammar to be LL(1), it is required that:
  2111. Hence, the grammar is LL(1).
  2112. To construct a parsing table, the FIRST and FOLLOW sets are computed, as shown below:
  2113. Using S → AaAb, we get: 1.
  2114. Using S → BbBa, we get 2.
Table 4.6: Production Selections for Example 4.3 Parsing Derivations
            a             b             $
  S     S → AaAb      S → BbBa
  A     A → ∈         A → ∈
  B     B → ∈         B → ∈
  2124. EXAMPLE 4.4
  2125. Consider the following grammar, and test whether the grammar is LL(1) or not.
  2126. For a pair of productions S → 1AB | ∈ :
because FOLLOW(S) = { $ } (i.e., it contains only the end marker). Similarly, for a pair of productions A → 1AC | 0C:
  2128. Hence, the grammar is LL(1). Now, show that no left-recursive grammar can be LL(1).
  2129. One of the basic requirements for a grammar to be LL(1) is: for every pair of productions A → α | β in the grammar's
  2130. set of productions, FIRST( α ) and FIRST( β ) should be disjointed.
  2131. If a grammar is left-recursive, then the set of productions will contain at least one pair of the form A → A α | β ; and
  2132. hence, FIRST(A α ) and FIRST( β ) will not be disjointed sets, because everything in the FIRST( β ) will also be in the
  2133. FIRST(A α ). It thereby violates the condition for LL(1) grammar. Hence, a grammar containing a pair of productions A
  2134. → A α | β (i.e., a left-recursive grammar) cannot be LL(1).
  2135. Now, let X be a nullable nonterminal that derives to at least two terminal strings. Show that in LL(1) grammar, no
  2136. production rule can have two consecutive occurrences of X on the right side of the production.
Since X is nullable (i.e., X derives ∈) and X also derives at least two terminal strings w 1 and w 2 , where w 1 and w 2 are strings of terminals, for a grammar using X to be LL(1) it is required that:
FIRST(w 1 ) ∩ FIRST(w 2 ) = φ
FIRST(w 1 ) ∩ FOLLOW(X) = φ and FIRST(w 2 ) ∩ FOLLOW(X) = φ
  2141. If this grammar contains a production rule A → α XX β -a production whose right side has two consecutive occurrences
  2142. of X-then everything in FIRST(X) will also be in the FOLLOW(X); and since FIRST(X) contains FIRST(w 1 ) as well as
  2143. FIRST(w 2 ), the second condition will therefore not be satisfied. Hence, a grammar containing a production of the form
  2144. A → α XX β will never be LL(1), thereby proving that in LL(1) grammar, no production rule can have two consecutive
  2145. occurrences of X on the right side of the production.
  2146. EXAMPLE 4.5
  2147. Construct a predictive parsing table for the following grammar where S| is a start symbol and # is the end marker.
  2148. Here, # is taken as one of the grammar symbols. And therefore, the initial configuration of the parser will be (S|, w#),
  2149. where the first member of the pair is the contents of the stack and the second member is the contents of input buffer.
  2150. Therefore, by substituting in (I), we get:
  2151. Using S| → S# we get: 1.
  2152. Using S → qABC we get:
  2153. Substituting in (II) we get:
  2154. 2.
  2155. Using A → bbD we get: 3.
  2156. Therefore, the parsing table is derived as shown in Table 4.7.
Table 4.7: Production Selections for Example 4.5 Parsing Derivations
             q             a          b            c          #
  S|     S| → S#
  S      S → qABC
  A                    A → a      A → bbD
  B                    B → a      B → ∈                     B → ∈
  C                               C → b                     C → ∈
  D                    D → ∈      D → ∈        D → c        D → ∈
  2174. EXAMPLE 4.6
  2175. Construct predictive parsing table for the following grammar:
  2176. Since the grammar is ∈ -free, FOLLOW sets are not required to be computed in order to enter the productions into the
  2177. parsing table. Therefore the parsing table is as shown in Table 4.8.
Table 4.8: Production Selections for Example 4.6 Parsing Derivations
            a           b           f          g          d
  S     S → A
  A     A → aS                                            A → d
  B                 B → bBC     B → f
  C                                        C → g
  2189. EXAMPLE 4.7
  2190. Construct a predictive parsing table for the following grammar, where S is a start symbol.
  2191. Using S → iEtSS 1 : 1.
  2192. Using S 1 → eS: 2.
  2193. Therefore, the parsing table is as shown in Table 4.9.
Table 4.9: Production Selections for Example 4.7 Parsing Derivations
             i               a         b         e                       t        $
  S      S → iEtSS 1     S → a
  S 1                                        S 1 → eS, S 1 → ∈                S 1 → ∈
  E                                E → b
  2206. EXAMPLE 4.8
  2207. Construct an LL(1) parsing table for the following grammar:
  2208. Computation of FIRST and FOLLOW:
  2209. Therefore by substituting in (I) we get:
  2210. Using the production S → aBDh we get: 1.
  2211. Using the production B → cC, we get: 2.
  2212. Using the production C → bC, we get: 3.
  2213. Using the production D → EF, we get: 4.
  2214. Therefore, the parsing table is as shown in Table 4.10.
Table 4.10: Production Selections for Example 4.8 Parsing Derivations
            a            b           c          g          f          h         $
  S     S → aBDh
  B                               B → cC
  C                  C → bC                 C → ∈      C → ∈      C → ∈
  D                                          D → EF     D → EF     D → EF
  E                                          E → g      E → ∈      E → ∈
  F                                                     F → f      F → ∈
  2231. Chapter 5: Bottom-up Parsing
  2232. 5.1 WHAT IS BOTTOM-UP PARSING?
  2233. Bottom-up parsing can be defined as an attempt to reduce the input string w to the start symbol of a grammar by
  2234. tracing out the right-most derivations of w in reverse. This is equivalent to constructing a parse tree for the input string
  2235. w by starting with leaves and proceeding toward the root—that is, attempting to construct the parse tree from the
  2236. bottom, up. This involves searching for the substring that matches the right side of any of the productions of the
  2237. grammar. This substring is replaced by the left-hand-side nonterminal of the production if this replacement leads to the
  2238. generation of the sentential form that comes one step before in the right-most derivation. This process of replacing the
  2239. right side of the production by the left side nonterminal is called "reduction". Hence, reduction is nothing more than
  2240. performing derivations in reverse. The reason why bottom-up parsing tries to trace out the right-most derivations of an
  2241. input string w in reverse and not the left-most derivations is because the parser scans the input string w from the left to
  2242. right, one symbol/token at a time. And to trace out right-most derivations of an input string w in reverse, the tokens of w
  2243. must be made available in a left-to-right order. For example, if the right-most derivation sequence of some w is:
  2244. then the bottom-up parser starts with w and searches for the occurrence of a substring of w that matches the right side
  2245. of some production A → β such that the replacement of β by A will lead to the generation of α n − 1 . The parser replaces
  2246. β by A, then it searches for the occurrence of a substring of α n − 1 that matches the right side of some production B → γ
  2247. such that replacement of γ by B will lead to the generation of α n − 2 . This process continues until the entire w substring
  2248. is reduced to S, or until the parser encounters an error.
  2249. Therefore, bottom-up parsing involves the selection of a substring that matches the right side of the production, whose
  2250. reduction to the nonterminal on the left side of the production represents one step along the reverse of a right-most
  2251. derivation. That is, it leads to the generation of the previous right-most derivation. This means that selecting a
  2252. substring that matches the right side of production is not enough; the position of this substring in the sentential form is
  2253. also important.
  2254. Tip The substring should occur in the position and sentential form that is currently under consideration and, if it is
  2255. replaced by the left-side nonterminal of the production, that it leads to the generation of the previous right-hand
  2256. sentential form of the currently considered sentential form. Therefore, finding a substring that matches the right
  2257. side of a production, as well as its position in the current sentential form, are both equally important. In order to take
  2258. both of these factors into account, we will define a "handle" of the right sentential form.
  2259. 5.2 A HANDLE OF A RIGHT SENTENTIAL FORM
A handle of a right sentential form γ is a production A → β and a position of β in γ . The string β will be found and replaced by A to produce the previous right sentential form in the right-most derivation of γ . That is, if S → α A w → α β w is a right-most derivation, then A → β , in the position following α , is a handle of α β w. Consider the grammar:
  2263. and the right-most derivation:
  2264. The handles of the sentential forms occurring in the above derivation are shown in Table 5.1.
Table 5.1: Sentential Form Handles
Sentential Form    Handle
id + id * id       E → id at the position preceding +
E + id * id        E → id at the position following +
E + E * id         E → id at the position following *
E + E * E          E → E * E at the position following +
E + E              E → E + E at the position preceding the end marker
  2277. Therefore, the bottom-up parsing is only an attempt to detect the handle of a right sentential form. And whenever a
  2278. handle is detected, the reduction is performed. This is equivalent to performing right-most derivations in reverse and is
  2279. called "handle pruning".
  2280. Therefore, if the right-most derivation sequence of some w is S → α 1 → α 2 → α 3 → … → α n − 1 → w, then handle
  2281. pruning starts with w, the nth right sentential form, the handle β n of w is located, and β n is replaced by the left side of
  2282. some production A n → β n in order to obtain α n − 1 . By continuing this process, if the parser obtains a right sentential
  2283. form that consists of only a start symbol, then it halts and announces the successful completion of parsing.
  2284. EXAMPLE 5.1
  2285. Consider the following grammar, and show the handle of each right sentential form for the string (a,(a, a)).
  2286. The right-most derivation of the string (a, (a, a)) is:
  2287. Table 5.2 presents the handles of the sentential forms occurring in the above derivation.
Table 5.2: Sentential Form Handles
Sentential Form    Handle
(a, (a, a))        S → a at the position preceding the first comma
(S, (a, a))        L → S at the position preceding the first comma
(L, (a, a))        S → a at the position preceding the second comma
(L, (S, a))        L → S at the position preceding the second comma
(L, (L, a))        S → a at the position following the second comma
(L, (L, S))        L → L, S at the position following the second left bracket
(L, (L))           S → (L) at the position following the first comma
(L, S)             L → L, S at the position following the first left bracket
(L)                S → (L) at the position before the endmarker
  2308. 5.3 IMPLEMENTATION
  2309. A convenient way to implement a bottom-up parser is to use a shift-reduce technique: a parser goes on shifting the
  2310. input symbols onto the stack until a handle comes on the top of the stack. When a handle appears on the top of the
  2311. stack, it performs reduction. This implementation makes use of a stack to hold grammar symbols and an input buffer to
  2312. hold the string w to be parsed, which is terminated by the right endmarker $, the same symbol used to mark the bottom
  2313. of the stack. The configuration of the parser is given by a token pair-the first component of which is a stack content,
  2314. and second component is an unexpended input.
  2315. Initially, the parser will be in the configuration given by the pair ($, w$); that is, the stack is initially empty, and the
  2316. buffer contains the entire string w. The parser shifts zero or more symbols from the input on to the stack until handle α
  2317. appears on the top of the stack. The parser then reduces α to the left side of the appropriate production. This cycle is
  2318. repeated until the parser either detects an error or until the stack contains a start symbol and the input is empty, giving
  2319. the configuration ($S, $). If the parser enters ($S, $), then it announces the successful completion of parsing. Thus,
  2320. the primary operation of the parser is to shift and reduce.
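The shift-reduce cycle itself is easy to write down once handle detection is factored out. The following C skeleton is purely illustrative (the grammar, the function names, and above all the naive find_handle used here are assumptions of this sketch): the driver shifts input symbols onto the stack until find_handle reports that a handle is on top, then replaces the handle by the left-side nonterminal, and accepts when the configuration ($S, $) is reached. The placeholder find_handle works only for the toy grammar S → aSb | ab, where a handle is simply the suffix aSb or ab on the top of the stack; as the discussion below points out, real shift-reduce parsers differ precisely in how this detection is done.

#include <stdio.h>
#include <string.h>

/* Skeleton of the shift-reduce cycle described above; everything here is
 * illustrative.  find_handle() reports the length of a handle lying on top
 * of the stack and the nonterminal it reduces to, or returns 0 if no handle
 * is on top yet. */
int find_handle(const char *stack, char *lhs);

int shift_reduce_parse(const char *w)              /* w is terminated by '$' */
{
    char stack[200] = "$";                         /* '$' marks the bottom    */
    int i = 0;
    for (;;) {
        char lhs;
        int len = find_handle(stack, &lhs);
        size_t n = strlen(stack);
        if (len > 0) {                             /* reduce: handle -> lhs   */
            stack[n - len] = lhs;
            stack[n - len + 1] = '\0';
        } else if (w[i] != '$') {                  /* no handle yet: shift    */
            stack[n] = w[i++];
            stack[n + 1] = '\0';
        } else {
            return 0;                              /* input used up, no handle: error */
        }
        if (strcmp(stack, "$S") == 0 && w[i] == '$')
            return 1;                              /* configuration ($S, $): accept   */
    }
}

/* Toy placeholder, good only for the illustrative grammar S -> aSb | ab:
 * a handle is simply the suffix "aSb" or "ab" on the top of the stack. */
int find_handle(const char *stack, char *lhs)
{
    size_t n = strlen(stack);
    if (n >= 3 && strcmp(stack + n - 3, "aSb") == 0) { *lhs = 'S'; return 3; }
    if (n >= 2 && strcmp(stack + n - 2, "ab")  == 0) { *lhs = 'S'; return 2; }
    return 0;
}

int main(void)
{
    printf("aabb : %s\n", shift_reduce_parse("aabb$") ? "accepted" : "error");
    printf("aab  : %s\n", shift_reduce_parse("aab$")  ? "accepted" : "error");
    return 0;
}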
  2321. For example consider the bottom-up parser for the grammar having the productions:
  2322. and the input string: id+id * id. The various steps in parsing this string are shown in Table 5.3 in terms of the contents
  2323. of the stack and unspent input.
Table 5.3: Steps in Parsing the String id + id * id
Stack Contents    Input            Moves
$                 id + id * id$    shift id
$id               + id * id$       reduce by F → id
$F                + id * id$       reduce by T → F
$T                + id * id$       reduce by E → T
$E                + id * id$       shift +
$E +              id * id$         shift id
$E + id           * id$            reduce by F → id
$E + F            * id$            reduce by T → F
$E + T            * id$            shift *
$E + T *          id$              shift id
$E + T * id       $                reduce by F → id
$E + T * F        $                reduce by T → T * F
$E + T            $                reduce by E → E + T
$E                $                accept
  2348. Shift-reduce implementation does not tell us anything about the technique used for detecting the handles; hence, it is
  2349. possible to make use of any suitable technique to detect handles. Depending upon the technique that is used to detect
  2350. handles, we get different shift-reduce parsers. For example, an operator-precedence parser is a shift-reduce parser
  2351. that uses the precedence relationship between certain pairs of terminals to guide the selection of handles. Whereas
  2352. LR parsers make use of a deterministic finite automata that recognizes the set of all viable prefixes; by reading the
  2353. stack from bottom to top, it determines what handle, if any, is on the top of the stack.
  2354. 5.4 THE LR PARSER
  2355. The LR parser is a shift-reduce parser that makes use of a deterministic finite automata, recognizing the set of all
  2356. viable prefixes by reading the stack from bottom to top. It determines what handle, if any, is available. A viable prefix of
  2357. a right sentential form is that prefix that contains a handle, but no symbol to the right of the handle. Therefore, if a
  2358. finite-state machine that recognizes viable prefixes of the right sentential forms is constructed, it can be used to guide
  2359. the handle selection in the shift-reduce parser.
  2360. Since the LR parser makes use of a DFA that recognizes viable prefixes to guide the selection of handles, it must keep
  2361. track of the states of the DFA. Hence, the LR parser stack contains two types of symbols: state symbols used to
identify the states of the DFA, and grammar symbols. The parser starts with the initial state I 0 of the DFA on the stack.
  2363. The parser operates by looking at the next input symbol a and the state symbol I i on the top of the stack. If there is a
  2364. transition from the state I i on a in the DFA going to state I j , then it shifts the symbol a, followed by the state symbol I j ,
  2365. onto the stack. If there is no transition from I i on a in the DFA, and if the state I i on the top of the stack recognizes,
  2366. when entered, a viable prefix that contains the handle A → α , then the parser carries out the reduction by popping α
  2367. and pushing A onto the stack. This is equivalent to making a backward transition from I i on α in the DFA and then
  2368. making a forward transition on A. Every shift action of the parser corresponds to a transition on a terminal symbol in
  2369. the DFA. Therefore, the current state of the DFA and the next input symbol determine whether the parser shifts the
next input symbol or goes for reduction.
  2371. If we construct a table mapping every state and input symbol pair as either "shift," "reduce," "accept," or "error," we get
  2372. a table that can be used to guide the parsing process. Such a table is called a parsing "action" table. When carrying
  2373. out the reduction by A → α , the parser has to pop α and push A onto the stack. This requires knowledge of where the
  2374. transition goes in a DFA from the state brought onto the top of the stack after popping α on the nonterminal A; and
  2375. hence, we require another table mapping of every state and nonterminal pair into a state. The table of transitions on
  2376. the nonterminals in the DFA is called a "goto" table. Therefore, to create an LR parser we require an Action|GOTO
  2377. table.
  2378. If the current state of a DFA has a transition on the terminal symbol a to the state I j , then the next move will be to shift
  2379. the symbol a and enter the state I j . But if the current state of the DFA is one in which when entered recognizes that a
  2380. viable prefix contains the handle, then the next move of the parser will be to reduce.
  2381. Therefore, an LR parser is comprised of an input buffer (which holds the input string w to be parsed and assumed to
  2382. be terminated by the right endmarker $), a stack holding the viable prefixes of the right sentential forms, and a parsing
  2383. table that is obtained by mapping the moves of a DFA that recognizes viable prefixes and controls the parsing actions.
  2384. The configuration of the parser is given by a pair: the first component is the stack's contents, and the second component
  2385. is the unexpended input. If, at a particular instant ($ is also used as the bottom-of-the-stack marker), the parser is
  2386. configured as follows:
($I 0 X 1 I 1 X 2 I 2 … X m I m , a i a i+1 … a n $)
  2387. where I i is a state symbol identifying the state of a DFA recognizing the viable prefixes, and X i is the grammar symbol.
  2388. The parser consults the parsing action table entry, [I m , a i ]. If action[I m , a i ] = S j , then the parser shifts the next input
  2389. symbol followed by the state I j on the stack and enters into the configuration:
($I 0 X 1 I 1 X 2 I 2 … X m I m a i I j , a i+1 … a n $)
  2390. If action[I m , a i ] = reduce by production A → α , then the parser carries out the reduction as follows. If | α | = r, then the
  2391. parser pops 2r symbols from the stack (because every shift action shifts a grammar symbol as well as a state
  2392. symbol), thereby bringing I m − r on the top. It then consults the goto table entry, goto[I m − r , A]. If goto[I m − r , A] = I k , then
  2393. it shifts A followed by I k onto the stack, thereby entering into the configuration:
($I 0 X 1 I 1 X 2 I 2 … X m − r I m − r A I k , a i a i+1 … a n $)
  2394. If action[I m , a i ] = accept, then the parser halts and accepts the input string. If action[I m , a i ] = error, then the parser
  2395. invokes a suitable error-recovery routine. Initially the parser will be in the configuration given by the pair ($I 0 , w$).
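To make these moves concrete, the following is a minimal sketch (in Python; it is not from the book) of the driver loop just described. The action and goto tables, the numbered productions, and the tokenized input are assumed to be supplied by the caller, and only the state symbols are kept on the stack, since the grammar symbols are implied by them.

def lr_parse(tokens, action, goto, productions):
    # action[(state, terminal)] is ('shift', j), ('reduce', k), or ('accept',);
    # goto[(state, nonterminal)] = j; productions[k] = (A, alpha), alpha a tuple.
    stack = [0]                          # start in the initial state I0
    tokens = list(tokens) + ['$']        # input is terminated by the right endmarker
    i = 0
    while True:
        state, a = stack[-1], tokens[i]
        move = action.get((state, a))
        if move is None:                 # error entry: invoke error recovery
            raise SyntaxError("unexpected %r in state %d" % (a, state))
        if move[0] == 'shift':
            stack.append(move[1])        # shift a and enter state I_j
            i += 1
        elif move[0] == 'reduce':
            lhs, rhs = productions[move[1]]
            del stack[len(stack) - len(rhs):]     # pop the states for alpha
            stack.append(goto[(stack[-1], lhs)])  # goto on the nonterminal A
        else:
            return True                  # accept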
  2396. Therefore, we conclude that parsing table construction involves constructing a DFA that recognizes the viable prefixes
  2397. of the right sentential forms, using the given grammar, and then mapping its moves into the form of the Action|GOTO
  2398. table. To construct such a DFA, we make use of the items that are part of a grammar's productions. Here, an item,
  2399. called an "LR(0) item" of a production, is a production with a dot placed at some position on the right side of the production.
  2400. For example, if A → XYZ is a production, then the following items can be generated from it:
A → .XYZ, A → X.YZ, A → XY.Z, and A → XYZ.
  2401. If the length of the right side of the production is n, then there are (n+1) different positions on the right side of a
  2402. production where a dot can be placed. Hence, the number of items that can be generated is (n+1).
  2403. The dot's position on the right side tells us how much of the right-hand side of the production is seen in the process of
  2404. parsing. For example, the item A → X.YZ tells us that we have already seen a string derivable from X in the input and
  2405. expect to see the string derivable from YZ next in the input.
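As a small illustration (a Python sketch, with the item representation chosen here purely for convenience), an LR(0) item can be held as a production together with a dot position, which makes the (n+1) count immediate:

# A production A -> XYZ is written as ('A', ('X', 'Y', 'Z')); an item adds a dot position.
def items_of(production):
    lhs, rhs = production
    return [(lhs, rhs, dot) for dot in range(len(rhs) + 1)]

print(items_of(('A', ('X', 'Y', 'Z'))))
# dot positions 0..3 give the four items A -> .XYZ, A -> X.YZ, A -> XY.Z, A -> XYZ.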
  2406. 5.4.1 Augmented Grammar
  2407. To construct a DFA that recognizes the viable prefixes, we make use of augmented grammar, which is defined as
  2408. follows: if G = (V, T, P, S) is a given grammar, then the augmented grammar will be G 1 = (V ∪ {S 1 }, T, P ∪ {S 1 → S},
  2409. S 1 ); that is, we add a unit production S 1 → S to the grammar G and make S 1 the new start symbol. The resulting
  2410. grammar will be an augmented grammar. The purpose of augmenting the grammar is to make it explicitly clear to the
  2411. parser when to accept the string. Parsing will stop when the parser is on the verge of carrying out the reduction using
  2412. S 1 → S. An NFA that recognizes the viable prefixes will be a finite automaton whose states correspond to the items of
  2413. the augmented grammar's productions. Every item represents one state in the automaton, with the initial state corresponding
  2414. to the item S 1 → .S. The transitions in the automaton are defined as follows:
δ (A → α .X β , X) = A → α X. β , for every grammar symbol X; and
δ (A → α .B β , ∈ ) = B → . γ , for every production B → γ . (This ∈ -transition is required because, if the current state is A → α .B β , that means we have not
yet seen a string derivable from the nonterminal B; and since B → γ is a production of the grammar, unless we see γ ,
we will not get B. Therefore, we have to travel the path that recognizes γ , which requires entering into the state
identified by B → . γ without consuming any input symbols.)
  2419. This NFA can then be transformed into a DFA using the subset construction method. For example, consider the
  2420. following grammar:
  2421. The augmented grammar is:
  2422. The items that can be generated using these productions are:
  2423. Therefore, the transition diagram of the NFA that recognizes viable prefixes is as shown in Figure 5.1.
  2424. Figure 5.1: NFA transition diagram that recognizes viable prefixes.
  2425. The DFA equivalent of the NFA shown in Figure 5.1, obtained by using subset construction, is illustrated in Figure 5.2.
  2426. Figure 5.2: Using subset construction, a DFA equivalent is derived from the transition diagram in Figure 5.1.
  2427. Therefore, every state of the DFA that recognizes viable prefixes is a set of items; and hence, the set of DFA states
  2428. will be a collection of sets of items—but any arbitrary collection of set of items will not correspond to the DFA set of
  2429. states. A set of items that corresponds to the states of a DFA that recognizes viable prefixes is called a "canonical
  2430. collection". Therefore, construction of the DFA involves finding the canonical collection. An algorithm exists that directly
  2431. obtains the canonical collection of LR(0) sets of items, thereby allowing us to obtain the DFA. Using this algorithm, we
  2432. can directly obtain a DFA that recognizes the viable prefixes, rather than going through NFA to DFA transformation, as
  2433. explained above. The algorithm for finding out the canonical collection of LR(0) sets of items makes use of the closure
  2434. and goto functions. The set closure(I), where I is a set of items, is computed as follows:
  2435. Add every item in I to closure(I) 1.
  2436. Repeat
  2437. For every item of the form A → α .B β in closure(I) do
  2438. For every production B → γ do
  2439. Add B → . γ to closure(I)
  2440. Until no new item can be added to closure(I)
  2441. 2.
  2442. For example, consider the following grammar:
  2443. The function goto(I, X) is defined as goto(I, X) = closure({A → α X. β | A → α .X β is in I}). That is, to find out goto from I on X, first identify all the items in I in which the dot precedes X on the right side. Then,
  2444. move the dot in all the selected items one position to the right (i.e., over X), and then take a closure of the set of these
  2445. items.
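A sketch of these two functions for LR(0) items, in Python and using the item representation from the earlier sketch (the grammar is assumed to be a list of (lhs, rhs) productions), might look like this:

def closure(items, grammar):
    result = set(items)
    changed = True
    while changed:                                  # repeat until no new item can be added
        changed = False
        for (lhs, rhs, dot) in list(result):
            if dot < len(rhs):                      # item of the form A -> alpha .B beta
                B = rhs[dot]
                for (l, r) in grammar:
                    if l == B and (B, r, 0) not in result:
                        result.add((B, r, 0))       # add B -> .gamma
                        changed = True
    return frozenset(result)

def goto(items, X, grammar):
    moved = {(lhs, rhs, dot + 1)                    # move the dot over X
             for (lhs, rhs, dot) in items
             if dot < len(rhs) and rhs[dot] == X}   # items where the dot precedes X
    return closure(moved, grammar) if moved else frozenset()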
  2446. 5.4.2 An Algorithm for Finding the Canonical Collection of Sets of LR(0) Items
  2447. /* Let C be the canonical collection of sets of LR(0) items. We maintain C new and C old to continue the iterations*/
  2448. Input: augmented grammar
  2449. Output: canonical collection of sets of LR(0) items (i.e., set C)
C old = φ 1.
add closure({S 1 → .S}) to C 2.
while C old ≠ C new do 3.
   temp = C new − C old
   C old = C new
   for every I in temp do
      for every X in V ∪ T (i.e., for every grammar symbol X) do
         if goto(I, X) is not empty and not in C new , then
            add goto(I, X) to C new
C = C new 4.
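Using the closure and goto sketches given earlier, the iteration of this algorithm can be written as follows (a sketch; the grammar symbols V ∪ T are simply taken to be all symbols occurring in the productions):

def canonical_collection(grammar, start_item):
    # start_item is the item S1 -> .S of the augmented grammar, e.g. ('S1', ('S',), 0)
    symbols = {s for (_, rhs) in grammar for s in rhs} | {lhs for (lhs, _) in grammar}
    c_new = {closure({start_item}, grammar)}        # the set I0
    c_old = set()
    while c_old != c_new:
        temp = c_new - c_old
        c_old = set(c_new)
        for I in temp:
            for X in symbols:                       # every grammar symbol
                J = goto(I, X, grammar)
                if J and J not in c_new:
                    c_new.add(J)
    return c_new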
  2454. For example consider the following grammar:
  2455. The augmented grammar is:
  2456. Initially, C old = φ . First we obtain:
  2457. We call it I 0 and add it to C new . Therefore:
  2458. In the first iteration, we obtain the goto from I 0 on every grammar symbol, as shown below:
  2459. Add it to C new :
  2460. Add it to C new :
  2461. Add it to C new :
  2462. Add it to C new :
  2463. Therefore, at the end of first iteration:
  2464. In the second iteration:
  2465. So, in the second iteration, we obtain goto from {I 1 , I 2 , I 3 , I 4 }on every grammar symbol, as shown below:
  2466. Add it to C new :
  2467. Add it to C new :
  2468. Therefore, at the end of the second iteration:
  2469. In the third iteration:
  2470. In the third iteration, we obtain goto from {I 5 , I 6 } on every grammar symbol, as shown below:
  2471. Add it to C new :
  2472. Add it to C new :
  2473. Therefore, at the end of the third iteration:
  2474. In the fourth iteration:
  2475. So, in the fourth iteration, we obtain a goto from {I 7 , I 8 } on every grammar symbol, as shown below:
  2476. At the end of fourth iteration:
  2477. The transition diagram of the DFA is shown in Figure 5.3.
  2478. Figure 5.3: DFA transition diagram showing four iterations for a canonical collection of sets.
  2479. 5.4.3 Construction of a Parsing Action|GOTO Table for an SLR(1) Parser
  2480. The methods for constructing the parsing Action|GOTO table are described below.
  2481. Construction of the Action Table
  2482. For every state I i in C do
  2483. for every terminal symbol a do
  2484. if goto(I i , a) = I j , then
  2485. make action[I i , a] = S j /*for shift and enter into the state j*/
  2486. 1.
  2487. For every state I i in C whose underlying set of LR(0) items contains an item of the form A → α . do
  2488. for every b in FOLLOW(A) do
  2489. make action[I i , b] = R k /*where k is the number of the production A → α standing for reduce by A
  2490. → α */
  2491. 2.
  2492. Make action[I i , $] = accept if I i contains an item S 1 → S. 3.
  2493. It is obvious that if a state I i has a transition on a terminal a going to I j , then the parser's next move will be to shift and
  2494. enter into state j. Therefore, the shift entries in the action table are the mappings of the transitions in the DFA on
  2495. terminals. Similarly, if state I i corresponds to the viable prefix that contains the right side of the production A → α , then
  2496. the parser will call a reduction by A → α on all those symbols that are in the FOLLOW(A). This is because if the next
  2497. input symbol happens to be a terminal symbol that can follow A, then only the reduction by A → α may lead to a
  2498. previous right-most derivation. That is, if the next input symbol belongs to FOLLOW(A), then the position of α can be
  2499. considered to be the one where, if it is replaced by A, we might get a previous right-most derivation. Whether or not A
  2500. → α is a handle is decided in this manner.
  2501. The initial state is the one whose underlying set of items contains the item S 1 → .S. This method is
  2502. called "SLR(1)", for Simple LR; and the (1) indicates that a lookahead of length one (the next symbol used by the parser to
  2503. decide its next move) is used. Therefore, this parsing table is an SLR parsing table. (When the parentheses are not
  2504. specified, the length of the lookahead is assumed to be one.)
  2505. Construction of the Goto Table
  2506. A goto table is simply a mapping of transitions in the DFA on nonterminals. Therefore, it is constructed as follows:
  2507. For every I i in C do
  2508. For every nonterminal A do
  2509. if goto(I i , A) = I j then
  2510. Make GOTO[I i , A] = j
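Putting the two constructions together, a sketch of the SLR table builder might look like the following. It assumes the canonical collection C is a list of item sets (their positions giving the state numbers), that FOLLOW sets have already been computed, and that prods lists the productions in the numbering used for the R k entries (0-based here); these names are assumptions of the sketch, not the book's notation.

def build_slr_table(C, grammar, prods, follow, terminals, nonterminals, start):
    action, goto_table = {}, {}
    index = {I: i for i, I in enumerate(C)}
    for i, I in enumerate(C):
        for a in terminals:                               # shift entries: DFA moves on terminals
            J = goto(I, a, grammar)
            if J:
                action[(i, a)] = ('shift', index[J])
        for A in nonterminals:                            # goto entries: DFA moves on nonterminals
            J = goto(I, A, grammar)
            if J:
                goto_table[(i, A)] = index[J]
        for (lhs, rhs, dot) in I:
            if dot == len(rhs):                           # final item A -> alpha.
                if lhs == start:                          # S1 -> S. : accept on $
                    action[(i, '$')] = ('accept',)
                else:
                    k = prods.index((lhs, rhs))
                    for b in follow[lhs]:                 # reduce on every b in FOLLOW(A)
                        if (i, b) in action:              # multiple entries: grammar is not SLR(1)
                            raise ValueError("conflict in state %d on %r" % (i, b))
                        action[(i, b)] = ('reduce', k)
    return action, goto_table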
  2511. Therefore, the SLR parsing table for the grammar having the following productions is shown in Table 5.4.
  2512. Table 5.4: Action|GOTO SLR Parsing Table
              Action Table                         GOTO Table
        id        +         *         $           E      T      F
I 0     S 4                                        1      2      3
I 1               S 5                 Accept
I 2               R 2       S 6       R 2
I 3               R 4       R 4       R 4
I 4               R 5       R 5       R 5
I 5     S 4                                               7      3
I 6     S 4                                                      8
I 7               R 1       S 6       R 1
I 8               R 3       R 3       R 3
  2534. The productions are numbered as: 1. E → E + T 2. E → T 3. T → T * F 4. T → F 5. F → id
  2535. EXAMPLE 5.2
  2536. Consider the following grammar:
  2537. The augmented grammar is:
  2538. The canonical collection of sets of LR(0) items are computed as follows.
  2539. The transition diagram of the DFA is shown in Figure 5.4.
  2540. Figure 5.4: Transition diagram for Example 5.2 DFA.
  2541. Therefore, the grammar having the following productions, which are numbered as
  2542. 1. S → CC 2. C → cC 3. C → d
  2543. has an SLR parsing table as shown in Table 5.5.
  2544. Table 5.5: SLR Parsing Table
              Action Table              GOTO Table
        c         d         $           S      C
I 0     S 3       S 4                   1      2
I 1                         accept
I 2     S 3       S 4                          5
I 3     S 3       S 4                          6
I 4     R 3       R 3       R 3
I 5                         R 1
I 6     R 2       R 2       R 2
  2559. By using the method discussed above, a parsing table can be obtained for any grammar. But the action table
  2560. obtained from the method above will not necessarily be without multiple entries for every grammar. Therefore, we
  2561. define an SLR(1) grammar as one for which we can obtain an action table without multiple entries by using the method
  2562. described. If the action table contains multiple entries, then the grammar for which the table is obtained is not an SLR(1)
  2563. grammar.
  2564. For example, consider the following grammar:
  2565. The augmented grammar will be:
  2566. The canonical collection sets of LR(0) items are computed as follows.
  2567. The transition diagram for the DFA is shown in Figure 5.5.
  2568. Figure 5.5: DFA Transition diagram.
  2569. Table 5.6 shows the SLR parsing table for the grammar having the following productions:
  2570. Table 5.6: Action | GOTO SLR Parsing Table
              Action Table                         GOTO Table
        a             b             $              S      A      B
I 0     R 3 /R 4      R 3 /R 4                     1      2      3
I 1                                 Accept
I 2     S 4
I 3                   S 5
I 4     R 3           R 3                                 6
I 5     R 4           R 4                                        7
I 6                   S 8
I 7     S 9
I 8                                 R 1
I 9                                 R 2
  2591. The productions are numbered as follows: 1. S → AaAb 2. S → BbBa 3. A → ∈ 4. B → ∈
  2592. Since the action table shown in Table 5.6 contains multiple entries, the above grammar is not SLR(1).
  2593. SLR(1) grammars constitute a small subset of context-free grammars, so an SLR parser can only succeed on a small
  2594. number of context-free grammars. That means an SLR parser is a less-powerful LR parser. (The power of the parser is
  2595. measured in terms of the number of grammars on which it can succeed.) This is because when an SLR parser sees a
  2596. right-hand-side production rule A → α on the top of the stack, it replaces this rule by the left-hand-side nonterminal A if
  2597. the next input symbol can FOLLOW the nonterminal A. But sometimes this reduction may not lead to the generation of
  2598. previous right-most derivations. For example, the parser constructed above can do the reduction by the production A
  2599. → ∈ in the state I 0 if the next input symbol happens to be either a or b, because both a and b are in the FOLLOW(A).
  2600. But since the reduction by A → ∈ in I 0 leads to the generation of a first instance of A in the sentential form AaAb, this
  2601. reduction proves to be a proper one if the next input symbol is a. This is because the first instance of A in the sentential
  2602. form AaAb is followed by a. But if the next input symbol is b, then this is not a proper reduction, because even though
  2603. b follows A, b follows a second instance of A in the sentential form AaAb. Similarly, if the parser carries out the
  2604. reduction by A → ∈ in the state I 4 , then it should be done if the next input symbol is b, because reduction by A → ∈ in
  2605. I 4 leads to the generation of a second instance of A in the sentential form AaAb.
  2606. Therefore, we conclude that if:
  2607. We let terminal a follow the first instance of A and let terminal b follow the second instance of A in
  2608. the sentential form AaAb;
  2609. 1.
  2610. We associate a with the item A → . in I 0 and terminal b with item A → . in I 4 ; 2.
  2611. The parser has been asked to carry out a reduction by A → ∈ in I 0 on those terminals associated
  2612. 3.
  2613. with the item A → . in I 0 , and carry out the reduction A → ∈ in I 4 on those terminals associated with
  2614. the item A → . in I 4 ;
  2615. then there would have been no conflict and the parser would have been more powerful. But this requires associating a
  2616. list of terminals (lookaheads) with the items. You may recall (see Chapter 4) that lookaheads are symbols that the
  2617. parser uses to ‘look ahead’ in the input buffer to decide whether or not reduction is to be done. That is, we have to
  2618. work with items of the form A → α .X β , a. Such an item is called an LR(1) item, because the length of the lookahead is
  2619. one; therefore, an item without a lookahead is one with a lookahead of length zero, that is, an LR(0) item. In the SLR method,
  2620. we were working with LR(0) items. Therefore, we define an LR(k) item to be an item using lookaheads of length k. So,
  2621. an LR(1) item is comprised of two parts: the LR(0) item and the lookahead associated with the item.
  2622. Note We conclude that if we work with LR(1) items instead of using LR(0) items, then every state of the parser will
  2623. correspond to a set of LR(1) items. When the parser looks ahead in the input buffer to decide whether reduction
  2624. is to be done or not, the information about the terminals will be available in the state of the parser itself, which is
  2625. not the case with the SLR parser state. Hence, with LR(1), we get a more powerful parser.
  2626. Therefore, if we modify the closure and the goto functions to work suitably with the LR(1) items, by allowing them to
  2627. compute the lookaheads, we can obtain the canonical collection of sets of LR(1). And from this we can obtain the
  2628. parsing Action|GOTO table. For example, closure(I), where I is a set of LR(1) items, is computed as follows:
  2629. Add every item in I to closure(I). 1.
  2630. Repeat
  2631. For every item of the form A → α .B β , a in closure(I) do
  2632. For every production B → γ do
  2633. Add B → . γ , FIRST( β a) to closure(I)
  2634. 2.
  2635. /* because the reduction by B → γ generates B preceding β in the right side of A → α B β ; and hence, the reduction by B
  2636. → γ is proper only on those symbols that are in the FIRST( β ). But if β derives to an empty string, then a will also follow
  2637. B, and the lookaheads of B → γ will be FIRST( β a). */
  2638. until no new item can be added to closure(I)
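A Python sketch of this LR(1) closure, where an item carries its lookahead as a fourth component and first() is an assumed helper that returns FIRST of a string of grammar symbols:

def closure_lr1(items, grammar, first):
    # an LR(1) item is (lhs, rhs, dot, lookahead)
    result = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot, a) in list(result):
            if dot < len(rhs):                        # item A -> alpha .B beta, a
                B, beta = rhs[dot], rhs[dot + 1:]
                for (l, r) in grammar:
                    if l == B:
                        for b in first(beta + (a,)):  # lookaheads are FIRST(beta a)
                            if (B, r, 0, b) not in result:
                                result.add((B, r, 0, b))
                                changed = True
    return frozenset(result)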
  2639. For example, consider the following grammar:
  2640. goto(I, X) = closure({A → α X. β , a | A → α .X β ,a is in I})
  2641. That is, to find out goto from I on X, first identify all the items in I with a dot preceding X in the LR(0) section of the item.
  2642. Then, move the dot in all the selected items one position to the right (i.e., over X), and then take this new set's closure.
  2643. For example:
  2644. 5.4.4 An Algorithm for Finding the Canonical Collection of Sets of LR(1) Items
  2645. /* Let C be the canonical collection of sets of LR(1) items. We maintain C new and C old to continue the iterations */
  2646. Input : augmented grammar
  2647. Output: canonical collection of sets of LR(1) items (i.e., set C)
  2648. C old = φ 1.
  2649. add closure({S 1 → .S, $}) to C 2.
  2650. while C old ≠ C new do
  2651. temp = C new − C old
  2652. C old = C new
  2653. for every I in temp do
  2654. for every X in V ∪ T (i.e., for every grammar symbol X) do
  2655. if goto(I, X) is not empty and not in C new , then
  2656. add goto(I, X) to C new
  2657. }
  2658. 3.
  2659. C = C new 4.
  2660. For example, consider the following grammar:
  2661. The augmented grammar will be:
  2662. The canonical collection of sets of LR(1) items are computed as follows:
  2663. The transition diagram of the DFA is shown in Figure 5.6.
  2664. Figure 5.6: Transition diagram for the canonical collection of sets of LR(1) items.
  2665. 5.4.5 Construction of the Action|GOTO Table for the LR(1) Parser
  2666. The following steps will construct the parsing action table for the LR(1) parser:
  2667. for every state I i in C do
  2668. for every terminal symbol a do
  2669. if goto(I i , a) = I j then
  2670. make action[I i , a] = S j /*for shift and enter
  2671. into the state j*/
  2672. 1.
  2673. for every state I i in C whose underlying set of LR(1) items contains an item of the form A → α ., a
  2674. do
  2675. make action[I i , a] = R k /*where k is the number of
  2676. the production A → α , standing for reduce by A → α */
  2677. 2.
  2678. make action[I i , $] = accept if I i contains an item S 1 → S., $ 3.
  2679. The goto table is simply a mapping of transitions in the DFA on nonterminals. Therefore, it is constructed as follows:
  2680. for every I i in C do
  2681. for every nonterminal A do
  2682. if goto (I i , A) = I j then
  2683. make goto[I i , A] = j
  2684. This method is called CLR(1) or LR(1), and is more powerful than SLR(1). Therefore, the CLR or LR parsing table
  2685. for the grammar having the following productions is:
  2686. Table 5.7: CLR/LR Parsing Action | GOTO Table
  2687. Action Table GOTO Table
  2688. a b $ S A B
  2689. I 0 R 3 R 4
  2690. 1 2 3
  2691. I 1
  2692. Accept
  2693. I 2 S 4
  2694. I 3
  2695. S 5
  2696. I 4 R 3 R 3
  2697. 6
  2698. I 5 R 4 R 4
  2699. 7
  2700. I 6
  2701. S 8
  2702. I 7 S 9
  2703. I 8
  2704. R 1
  2705. I 9
  2706. R 2
  2707. The productions are numbered as follows: 1. S → AaAb 2. S → BbBa 3. A → ∈ 4. B → ∈
  2708. By comparing the SLR(1) parser with the CLR(1) parser, we find that the CLR(1) parser is more powerful. But the
  2709. CLR(1) has a greater number of states than the SLR(1) parser; hence, its storage requirement is also greater than the
  2710. SLR(1) parser. Therefore, we can devise a parser that is an intermediate between the two; that is, the parser's power
  2711. will be in between that of SLR(1) and CLR(1), and its storage requirement will be the same as SLR(1)'s. Such a parser,
  2712. LALR(1), will be much more useful: since each of its states corresponds to the set of LR(1) items, the information
  2713. about the lookaheads is available in the state itself, making it more powerful than the SLR parser. But a state of the
  2714. LALR(1) parser is obtained by combining those states of the CLR parser that have identical LR(0) (core) items, but
  2715. which differ in the lookaheads in their item set representations. Therefore, even if there is no reduce-reduce conflict in
  2716. the states of the CLR parser that has been combined to form an LALR parser, a conflict may get generated in the state
  2717. of LALR parser. We may be able to obtain a CLR parsing table without multiple entries for a grammar, but when we
  2718. construct the LALR parsing table for the same grammar, it might have multiple entries.
  2719. 5.4.6 Construction of the LALR Parsing Table
  2720. The steps in constructing an LALR parsing table are as follows:
  2721. Obtain the canonical collection of sets of LR(1) items. 1.
  2722. If more than one set of LR(1) items exists in the canonical collection obtained that have identical
  2723. cores or LR(0) parts, but which differ in their lookaheads, then combine these sets of LR(1) items to
  2724. obtain a reduced collection, C 1 , of sets of LR(1) items.
  2725. 2.
  2726. Construct the parsing table by using this reduced collection, as follows. 3.
  2727. Construction of the Action Table
  2728. for every state I i in C 1 do
  2729. for every terminal symbol a do
  2730. if goto(I i , a) = I j then
  2731. make action[I i , a] = S j /*for shift and enter
  2732. into the state j*/
  2733. 1.
  2734. for every state I i in C 1 whose underlying set of LR(1) items contains an item of the form A → α ., a,
  2735. do
  2736. make action[I i , a] = R k /*where k is the number of the production
  2737. A → α standing for reduce by A → α */
  2738. 2.
  2739. make action[I i , $] = accept if I i contains an item S 1 → S., $ 3.
  2740. Construction of the Goto Table
  2741. The goto table simply maps transitions on nonterminals in the DFA. Therefore, the table is constructed as follows:
  2742. for every I i in C 1 do
  2743. for every nonterminal A do
  2744. if goto(I i , A) = I j then
  2745. make goto(I i , A) = j
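The merging step (step 2 of the construction) can be sketched as follows: group the LR(1) item sets of the canonical collection by their LR(0) cores and take the union of the items in each group, which combines the lookaheads. This is a sketch only; the item representation follows the earlier sketches.

def merge_to_lalr(C1):
    # C1: collection of LR(1) item sets; an item is (lhs, rhs, dot, lookahead)
    def core(I):
        return frozenset((lhs, rhs, dot) for (lhs, rhs, dot, _) in I)
    groups = {}
    for I in C1:
        groups.setdefault(core(I), set()).update(I)   # same core => same LALR state
    return [frozenset(items) for items in groups.values()]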
  2746. For example, consider the following grammar:
  2747. The augmented grammar is:
  2748. The canonical collection of sets of LR(1) items are computed as follows:
  2749. We see that the states I 3 and I 6 have identical sets of LR(0) items and differ only in their lookaheads. The same goes for
  2750. the pair of states I 4 , I 7 and the pair of states I 8 , I 9 . Hence, we can combine I 3 with I 6 , I 4 with I 7 , and I 8 with I 9 to obtain
  2751. the reduced collection shown below:
  2752. where I 36 stands for combination of I 3 and I 6 , I 47 stands for the combination of I 4 and I 7 , and I 89 stands for the
  2753. combination of I 8 and I 9 . The transition diagram of the DFA using the reduced collection is shown in Figure 5.7.
  2754. Figure 5.7: Transition diagram for a DFA using a reduced collection.
  2755. Therefore, Table 5.8 shows the LALR parsing table for the grammar having the following productions,
  2756. which are numbered as: 1. S → AA 2. A → aA 3. A → b
  2757. Table 5.8: LALR Parsing Table for a DFA Using a Reduced Collection
              Action Table                GOTO Table
        a          b          $           S      A
I 0     S 36       S 47                   1      2
I 1                           Accept
I 2     S 36       S 47                          5
I 36    S 36       S 47                          89
I 47    R 3        R 3        R 3
I 5                           R 1
I 89    R 2        R 2        R 2
  2772. 5.4.7 Parser Conflicts
  2773. An LR parser may encounter two types of conflicts: shift-reduce conflicts and reduce-reduce conflicts.
  2774. Shift-Reduce Conflict
  2775. A shift-reduce (S-R) conflict occurs in an SLR parser state if the underlying set of LR(0) item representations contains
  2776. items of the form depicted in Figure 5.8, and FOLLOW(B) contains terminal a.
  2777. Figure 5.8: LR(0) underlying set representations that can cause SLR parser conflicts.
  2778. Similarly, an S-R conflict occurs in a state of the CLR or LALR parser if the underlying set of LR(1) items
  2779. representation contains items of the form shown in Figure 5.9.
  2780. Figure 5.9: LR(1) underlying set representations that can cause CLR/LALR parser conflicts.
  2781. Reduce-Reduce Conflict
  2782. A reduce-reduce (R-R) conflict occurs if the underlying set of LR(0) items representation in a state of an SLR parser
  2783. contains items of the form shown in Figure 5.10, and FOLLOW(A) and FOLLOW(B) are not disjoint sets.
  2784. Figure 5.10: LR(0) underlying set representations that can cause an SLR parser reduce-reduce conflict.
  2785. Similarly an R-R conflict occurs if the underlying set of LR(1) items representation in a state of a CLR or LALR parser
  2786. contains items of the form shown in Figure 5.11.
  2787. Figure 5.11: LR(1) underlying set representations that can cause a CLR/LALR parser reduce-reduce conflict.
  2788. If a set of items' representation contains only nonfinal items, then there is no conflict in the corresponding state. (An
  2789. item in which the dot is in the right-most position, like A → XYZ., is called a final item; and an item in which the dot
  2790. is not in the right-most position, like A → X.YZ, is a nonfinal item).
  2791. Even if there is no R-R conflict in the states of a CLR parser, conflicts may be generated in the state of a LALR parser
  2792. that is obtained by combining the states of the CLR parser. We combine the states of the CLR parser in order to form
  2793. an LALR state. The items' lookaheads in the LALR parser state are obtained by combining the lookaheads of the
  2794. corresponding items in the states of the CLR parser. And since reduction depends on the lookaheads, even if there is
  2795. no R-R conflict in the states of the CLR parser, a conflict may be generated in the state of the LALR parser as a
  2796. result of this state combination. For example, consider the sets of LR(1) items that represent the two different states of
  2797. the CLR(1) parser, as shown in Figure 5.12.
  2798. Figure 5.12: Sets of LR(1) items represent two different CLR(1) parser states.
  2799. There is no R-R conflict in these states. But when we combine these states to form an LALR, the state's set of items
  2800. representation will be as shown in Figure 5.13.
  2801. Figure 5.13: States are combined to form an LALR.
  2802. We see that there is an R-R conflict in this state, because the parser will call a reduction by A → α as well as by B → γ
  2803. on both a and b. If there is no S-R conflict in the CLR(1) states, one will never be generated in the LALR(1) state obtained by
  2804. combining the CLR(1) states. For example, consider the sets of LR(1) items representing the two different states of the
  2805. CLR(1) parser as shown in Figure 5.14.
  2806. Figure 5.14: LR(1) items represent two different states of the CLR(1) parser.
  2807. There is no S-R conflict in these states. But when we combine these states, the resulting LALR state set will be as
  2808. shown in Figure 5.15. There is no S-R conflict in this state, as well.
  2809. Figure 5.15: LALR state set resulting from the combination of CLR(1) state sets.
  2810. 5.4.8 Handling Ambiguous Grammars
  2811. Since every ambiguous grammar fails to be LR, it will not belong to the SLR, CLR, or LALR grammar
  2812. classes. But some ambiguous grammars are quite useful for specifying languages. Hence, the question is how to deal
  2813. with these grammars in the framework of LR parsing. For example, the natural grammar that specifies
  2814. nonparenthesized expressions with + and * operators is: E → E + E | E * E | id
  2815. But this is an ambiguous grammar, because the precedence and associativity of the operators have not been specified.
  2816. Even so, we prefer this grammar, because we can easily change the precedence and associativity as required,
  2817. thereby allowing us more flexibility. Similarly, if we use an unambiguous grammar instead of the above grammar to
  2818. specify the same language, it will have the following productions: E → E + T | T, T → T * F | F, F → id
  2819. This parser will spend a substantial portion of its time in carrying out reductions by the unit productions E → T and T → F.
  2820. These productions are in the grammar only to enforce associativity and precedence, thereby making the parsing time
  2821. excessive. With an ambiguous grammar, every LR parser construction method will have conflicts. But these conflicts
  2822. can be resolved by using the precedence and association information of + and * as per the language's usage. For
  2823. example, consider the SLR parser construction for the above grammar. The augmented grammar is:
  2824. The canonical collection of sets of LR(0) items is shown below:
  2825. The transition diagram of the DFA for the augmented grammar is shown in Figure 5.16. The SLR parsing table is
  2826. shown in Table 5.9.
  2827. Figure 5.16: Transition diagram for augmented grammar DFA.
  2828. Table 5.9: SLR Parsing Table for Augmented Grammar
              Action Table                              GOTO Table
        +            *            id        $           E
I 0                               S 2                   1
I 1     S 3          S 4                    accept
I 2     R 3          R 3                    R 3
I 3                               S 2                   5
I 4                               S 2                   6
I 5     S 3 /R 1     S 4 /R 1               R 1
I 6     S 3 /R 2     S 4 /R 2               R 2
  2848. We see that there are shift-reduce conflicts in I 5 and I 6 on + as well as *. Therefore, for an input string id + id + id$,
  2849. when the parser enters into the state I 5 , the parser will be in the configuration:
  2850. Hence, the parser can either reduce by E → E + E or shift the + onto the stack and enter into the state I 3 . To resolve
  2851. this conflict, we make use of associativity. If we want left-associativity, then a reduction by E → E + E is the right
  2852. choice. Whereas if we want right-associativity, then shift is the right choice.
  2853. Similarly, if the input string is id + id * id$ when the parser enters into the state I 5 , it will be in the configuration:
  2854. Hence, the parser can either reduce by E → E + E or shift the * onto the stack and enter into the state I 4 . To resolve
  2855. this conflict, we must make use of precedence. If we want a higher precedence for +, then the reduction by E →
  2856. E + E is the right choice. If we want a higher precedence for *, then shift is the right choice.
  2857. Similarly if the input string is id * id + id$ when the parser enters into the state I 6 , it will be in the configuration:
  2858. Hence, the parser can either reduce by E → E * E or shift the + onto the stack and enter into the state I 3 . To resolve
  2859. this conflict, we have to make use of precedence. If we want a higher precedence for *, then reduction by E →
  2860. E * E is the right choice. Whereas if we want a higher precedence for +, then shift is the right choice.
  2861. Similarly, if the input string is id * id * id$ when the parser enters into the state I6, the parser will be in the configuration:
  2862. The parser can either reduce by E → E * E or shift the * onto the stack and enter into the state I 4 . To resolve this
  2863. conflict, we have to make use of associativity. If we want left-associativity, then a reduction by E → E * E is the right
  2864. choice. If we want right-associativity, then shift is the right choice.
  2865. Therefore, for a higher precedence for *, and for left-associativity for both + and *, we get the SLR parsing table shown
  2866. in Table 5.10.
  2867. Table 5.10: SLR Parsing Table Reflects Higher Precedence and Left-Associativity
              Action Table                   GOTO Table
        +         *         id        $            E
I 0                         S 2                    1
I 1     S 3       S 4                 Accept
I 2     R 3       R 3                 R 3
I 3                         S 2                    5
I 4                         S 2                    6
I 5     R 1       S 4                 R 1
I 6     R 2       R 2                 R 2
  2887. Therefore, we have a way to deal with ambiguous grammars: we can make use of disambiguating rules, such as
  2888. precedence and associativity, to resolve parsing action conflicts.
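The way a parser applies such rules can be sketched as a small helper that compares the precedence of the operator involved in the candidate reduction with that of the incoming operator, falling back on associativity when they are equal. The precedence table and the function names are illustrative assumptions, in the spirit of what parser generators such as yacc do, not part of the book's construction.

PRECEDENCE = {'+': (1, 'left'), '*': (2, 'left')}   # higher number = higher precedence

def resolve_shift_reduce(stack_op, input_op):
    # stack_op: operator of the production we could reduce by (e.g. E -> E + E)
    # input_op: the next input operator we could shift
    sp, sassoc = PRECEDENCE[stack_op]
    ip, _ = PRECEDENCE[input_op]
    if sp > ip:
        return 'reduce'
    if sp < ip:
        return 'shift'
    return 'reduce' if sassoc == 'left' else 'shift'  # equal precedence: use associativity

print(resolve_shift_reduce('+', '*'))   # shift  : * binds tighter than +
print(resolve_shift_reduce('*', '+'))   # reduce : reduce by E -> E * E first
print(resolve_shift_reduce('+', '+'))   # reduce : + is left-associative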
  2889. 5.5 DATA STRUCTURES FOR REPRESENTING PARSING TABLES
  2890. Since there are only a few entries in the goto table, separate data structures must be used for the action table and the
  2891. goto table. These data structures are described below.
  2892. Representing the Action Table
  2893. One of the simplest ways to represent the action table is to use a two-dimensional array. But since many rows of the
  2894. action table are identical, we can save considerable space (and expend a negligible cost in processing time) by
  2895. creating an array of pointers for each state. Then, pointers for states with the same actions will point to the same
  2896. location, as shown in Figure 5.17.
  2897. Figure 5.17: States with actions in common point to the same location via an array.
  2898. To access information, we assign each terminal a number from zero to one less than the number of terminals. We use
  2899. this integer as an offset from the pointer value for each state. Further reduction in the space is possible at the expense
  2900. of speed by creating a list of actions for each state. Each node on a list will be comprised of a terminal symbol and the
  2901. action for that terminal symbol. It is here that the most frequent actions, like error actions, can be appended at the end
  2902. of the list. For example, for the state I 0 in Table 5.10, the list will be as shown in Figure 5.18.
  2903. Figure 5.18: List that incorporates the ability to append actions.
  2904. Representing the GOTO Table
  2905. An efficient way to represent the goto table is to make a list of pairs for each nonterminal A. Each pair is of the form:
  2906. goto(current-state, A) = next-state
  2907. Since the error entries in the goto table are never consulted, we can replace each error entry by the most common
  2908. nonerror entry in its column; this common entry is represented by writing any in place of current-state .
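A sketch of the space-saving representations just described, with illustrative state numbers and entries loosely based on Table 5.10; the names used here are assumptions of the sketch, not the book's notation.

# Each terminal is numbered and used as an offset into a shared row of actions.
TERM_NO = {'+': 0, '*': 1, 'id': 2, '$': 3}

row_shared = [None, None, ('shift', 2), None]        # identical action rows share storage
action_row = {3: row_shared, 4: row_shared}          # states 3 and 4 point to the same row

def lookup(state, terminal):
    return action_row[state][TERM_NO[terminal]]      # None stands for an error entry

# Alternatively, a list of (terminal, action) pairs per state, with the most
# common action appended once at the end as the default.
action_list = {0: [('id', ('shift', 2)), ('default', ('error',))]}

def lookup_list(state, terminal):
    for t, act in action_list[state]:
        if t == terminal or t == 'default':
            return act

# The goto table can be kept as a list of (current-state, next-state) pairs per nonterminal.
goto_pairs = {'E': [(0, 1), (3, 5), (4, 6)]}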
  2909. 5.6 WHY LR PARSING IS ATTRACTIVE
  2910. There are several reasons why LR parsers are attractive:
  2911. An LR parser can be constructed to recognize virtually all programming language constructs for
  2912. which a CFG can be written.
  2913. 1.
  2914. The LR parsing method is the most general, nonbacktracking shift-reduce method known. Yet it
  2915. can be implemented as efficiently as any other method.
  2916. 2.
  2917. The class of grammars that can be parsed by using the LR method is a proper superset of the
  2918. class of grammars that can be parsed with a predictive parser.
  2919. 3.
  2920. The LR parser can quickly detect a syntactic error via the left-to-right scanning of input. 4.
  2921. The main drawback of the LR method is that it is too much work to construct an LR parser by hand for a typical
  2922. programming language grammar. But fortunately, many LR parser generators are available that automatically
  2923. generate the required LR parser.
  2924. 5.7 EXAMPLES
  2925. The examples that follow further illustrate the concepts covered within this chapter.
  2926. EXAMPLE 5.3
  2927. Construct an SLR(1) parsing table for the following grammar:
  2928. First, augment the given grammar by adding a production S1 → S to the grammar. Therefore, the augmented
  2929. grammar is:
  2930. Next, we obtain the canonical collection of sets of LR(0) items, as follows:
  2931. The transition diagram of this DFA is shown in Figure 5.19.
  2932. Figure 5.19: Transition diagram for the canonical collection of sets of LR(0) items in Example 5.3.
  2933. The FOLLOW sets of the various nonterminals are FOLLOW(S 1 ) = {$}. Therefore:
  2934. Using S 1 → S, we get FOLLOW(S) = FOLLOW(S 1 ) = {$} 1.
  2935. Using S → xAy, we get FOLLOW(A) = {y} 2.
  2936. Using S → xBy, we get FOLLOW(B) = {y} 3.
  2937. Using S → xAz, we get FOLLOW(A) = {z} 4.
  2938. Therefore, FOLLOW(A) = {y, z}. Using A → qS, we get FOLLOW(S) = FOLLOW(A) = {y, z}. Therefore, FOLLOW(S) =
  2939. {y, z, $}. Let the productions of the grammar be numbered as follows:
  2940. The SLR parsing table for the productions above is shown in Table 5.11.
  2941. Table 5.11: SLR(1) Parsing Table
  2942. Action Table GOTO Table
  2943. x Y Z q $ S A B
  2944. I 0 S 2 R 3 /R 4
  2945. 1
  2946. I 1
  2947. Accept
  2948. I 2
  2949. S 5
  2950. 3 4
  2951. I 3
  2952. S 6 S 7
  2953. I 4
  2954. S 8
  2955. I 5 S 2 R 5 /R 6 R 5
  2956. 9
  2957. I 6
  2958. R 1 R 1
  2959. R 1
  2960. I 7
  2961. R 3 R 3
  2962. R 3
  2963. I 8
  2964. R 2 R 2
  2965. R 2
  2966. I 9
  2967. R 4 R 4
  2968. EXAMPLE 5.4
  2969. Construct an SLR(1) parsing table for the following grammar:
  2970. First, augment the given grammar by adding the production S 1 → S to the grammar. The augmented grammar is:
  2971. Next, we obtain the canonical collection of sets of LR(0) items, as follows:
  2972. The transition diagram of the DFA is shown in Figure 5.20.
  2973. Figure 5.20: DFA transition diagram for Example 5.4.
  2974. The FOLLOW sets of the various nonterminals are FOLLOW(S 1 ) = {$}. Therefore:
  2975. Using S 1 → S, we get FOLLOW(S) = FOLLOW(S 1 ) = {$} 1.
  2976. Using S → 0S0, we get FOLLOW(S) = { 0 } 2.
  2977. Using S → 1S1, we get FOLLOW(S) = {1} 3.
  2978. So, FOLLOW(S) = {0, 1, $}. Let the productions be numbered as follows:
  2979. The SLR parsing table for the production set above is shown in Table 5.12.
  2980. Table 5.12: SLR Parsing Table for Example 5.4
  2981. Action Table GOTO Table
  2982. 0 1 $ S
  2983. I 0 S 2 S 3
  2984. 1
  2985. I 1
  2986. accept
  2987. I 2 S 2 S 3
  2988. 4
  2989. I 3 S 6 S 3
  2990. 5
  2991. I 4 S 7
  2992. I 5
  2993. S 8
  2994. I 6 S2/R 3 S 3 /R 3 R 3 4
  2995. I 7 R 1 R 1
  2996. R 1
  2997. I 8 R 2 R 2
  2998. R 2
  2999. EXAMPLE 5.5
  3000. Consider the following grammar, and construct the LR(1) parsing table.
  3001. The augmented grammar is:
  3002. The canonical collection of sets of LR(1) items is:
  3003. The parsing table for the production above is shown in Table 5.13.
  3004. Table 5.13: Parsing Table for Example 5.5
  3005. Action Table GOTO Table
  3006. A B $ S
  3007. I 0 S 2 S 3 R 3 1
  3008. I 1
  3009. Accept
  3010. I 2 S 5 S 6 /R 3
  3011. 4
  3012. I 3 S 8 /R 3 S 9
  3013. 7
  3014. I 4
  3015. S 10
  3016. I 5 S 5 S 6 /R 3
  3017. 11
  3018. I 6 S 8 /R 3 S 9
  3019. 12
  3020. I 7 S 13
  3021. I 8 S 5 S 6 /R 3
  3022. 14
  3023. I 9 S 8 /R 3 S 9
  3024. 15
  3025. I 10 S 2 S 3 R 3 16
  3026. I 11
  3027. S 17
  3028. I 12 S 18
  3029. I 13 S 2 S 3 R 3 19
  3030. I 14
  3031. S 20
  3032. I 15
  3033. S 21
  3034. I 16
  3035. R 1
  3036. I 17 S 5 S 6 /R 3
  3037. 22
  3038. I 18 S 5 S 6 /R 3
  3039. 23
  3040. I 19
  3041. R 2
  3042. I 20 S 8 /R 3 S 9
  3043. 24
  3044. I 21 S 8 /R 3 S 9
  3045. 25
  3046. I 22
  3047. R 1
  3048. I 23
  3049. R 2
  3050. I 24 R 1
  3051. I 25 R 2
  3052. The productions for the grammar are numbered as shown below:
  3053. EXAMPLE 5.6
  3054. Construct an LALR(1) parsing table for the following grammar:
  3055. The augmented grammar is:
  3056. The canonical collection of sets of LR(1) items is:
  3057. There are no sets of LR(1) items in the canonical collection that have identical LR(0)-part items and that differ only in their
  3058. lookaheads. So, the LALR(1) parsing table for the above grammar is as shown in Table 5.14.
  3059. Table 5.14: LALR(1) Parsing Table for Example 5.6
  3060. Action Table GOTO Table
  3061. a b c d $ S A
  3062. I 0
  3063. S 3
  3064. S 4
  3065. 1 2
  3066. I 1
  3067. Accept
  3068. I 2 S 5
  3069. I 3
  3070. S 7
  3071. 1
  3072. I 4 R 5
  3073. S 8
  3074. I 5
  3075. R 1
  3076. I 6 S 10
  3077. S 9
  3078. I 7
  3079. R 5
  3080. I 8
  3081. R 3
  3082. I 9
  3083. R 2
  3084. I 10
  3085. R 4
  3086. The productions of the grammar are numbered as shown below:
  3087. S → Aa 1.
  3088. S → bAc 2.
  3089. S → dc 3.
  3090. S → bda 4.
  3091. A → d 5.
  3092. EXAMPLE 5.7
  3093. Construct an LALR(1) parsing table for the following grammar:
  3094. The augmented grammar is:
  3095. The canonical collection of sets of LR(1) items is:
  3096. Since no sets of LR(1) items in the canonical collection have identical LR(0)-part items and differ only in their
  3097. lookaheads, the LALR(1) parsing table for the above grammar is as shown in Table 5.15.
  3098. Table 5.15: LALR(1) Parsing Table for Example 5.7
  3099. Action Table GOTO Table
  3100. a b c d $ S A B
  3101. I 0 S 4 S 5
  3102. S 6
  3103. 1 2 3
  3104. I 1
  3105. Accept
  3106. I 2 S 7
  3107. I 3
  3108. S 8
  3109. I 4
  3110. S 10
  3111. 9
  3112. I 5
  3113. S 12
  3114. 11
  3115. I 6 R 5
  3116. R 6
  3117. I 7
  3118. R 1
  3119. I 8
  3120. R 3
  3121. I 9
  3122. S 13
  3123. I 10
  3124. R 5
  3125. I 11 S 14
  3126. I 12 R 6
  3127. I 13
  3128. R 2
  3129. I 14
  3130. R 4
  3131. The productions of the grammar are numbered as shown below:
  3132. S → Aa 1.
  3133. S → aAc 2.
  3134. S → Bc 3.
  3135. S → bBa 4.
  3136. A → d 5.
  3137. B → d 6.
  3138. EXAMPLE 5.8
  3139. Construct the nonempty sets of LR(1) items for the following grammar:
  3140. The collection of nonempty sets of LR(1) items is shown in Figure 5.21.
  3141. Figure 5.21: Collection of nonempty sets of LR(1) items for Example 5.8.
  3142. Chapter 6: Syntax-Directed Definitions and Translations
  3143. 6.1 SPECIFICATION OF TRANSLATIONS
  3144. The specification of a construct's translation in a programming language involves specifying what the construct is, as
  3145. well as specifying the translating rules for the construct. Whenever a compiler encounters that construct in a program,
  3146. it will translate the construct according to the rules of translation. Here, the term "translation" is used in a much broader
  3147. sense. Translation does not necessarily mean generating either intermediate code or object code. Translation also
  3148. involves adding information into the symbol table as well as performing construct-specific computations. For example,
  3149. if a construct is a declarative statement, then its translation adds information about the construct's type attribute into
  3150. the symbol table. Whereas, if the construct is an expression, then its translation generates the code for evaluating the
  3151. expression.
  3152. When we specify what the construct is, we specify the syntactic structure of the construct; hence, syntactic
  3153. specification is the part of the specification of the construct's translation. Therefore, if we suitably extend the notation
  3154. that we use for syntactic specification so that it will allow for both the syntactic structure and the rules of translation that
  3155. go along with it, then we can use this notation as a framework for the specification of the construct translation.
  3156. Translation of a construct involves manipulating the values of various quantities. For example, when translating the
  3157. declarative statement int a, b, c, the compiler needs to extract the type int and add it to the symbol records of a, b,
  3158. and c. This requires that the compiler keep track of the type int, as well as the pointers to the symbol records
  3159. containing a, b, and c.
  3160. Since we use a context-free grammar to specify the syntactic structure of a programming language, we extend that
  3161. context-free grammar by associating sets of attributes with the grammar symbols. These sets hold the values of the
  3162. quantities that a compiler is required to track. With each production rule of the grammar, we associate semantic rules that
  3163. specify how the attribute values of the grammar symbols of the production are manipulated. These extensions
  3164. allow us to specify the translations. Syntax-directed definitions and translation schemes are examples of these
  3165. extensions of context-free grammars, allowing us to specify the translations.
  3166. A syntax-directed definition uses a CFG to specify the syntactic structure of the construct. It associates a set of attributes
  3167. with each grammar symbol; and with each production, it associates a set of semantic rules for computing the values of
  3168. the attributes of the grammar symbols appearing in that production. Therefore, the grammar and the set of semantic
  3169. rules constitute syntax-directed definitions.
  3170. 6.2 IMPLEMENTATION OF THE TRANSLATIONS SPECIFIED BY
  3171. SYNTAX-DIRECTED DEFINITIONS
  3172. Attributes are associated with the grammar symbols that label the parse tree nodes. They are thus
  3173. associated with the nodes of the parse tree of the construct whose translation is being specified. Therefore, when a semantic rule is evaluated, the
  3174. parser computes the value of an attribute at a parse tree node. For example, a semantic rule could specify the
  3175. computation of the value of an attribute val that is associated with the grammar symbol X (a labeled parse tree node).
  3176. To refer to the attribute val associated with the grammar symbol X, we use the notation X.val. Therefore, to evaluate
  3177. the semantic rules and carry out translations, we must traverse the parse tree and get the values of the attributes at
  3178. the nodes computed. The order in which we traverse the parse tree nodes depends on the dependencies of the
  3179. attributes at the parse tree nodes. That is, if an attribute val at a parse tree node X depends on the attribute val at the
  3180. parse tree node Y, as shown in Figure 6.1, then the val attribute at node X cannot be computed unless the val attribute
  3181. at Y is also computed.
  3182. Figure 6.1: The attribute value of node X is inherently dependent on the attribute value of node Y.
  3183. Hence, carrying out the translation specified by the syntax-directed definitions involves:
  3184. Generating the parse tree for the input string W, 1.
  3185. Finding out the traversal order of the parse tree nodes by generating a dependency graph and
  3186. doing a topological sort of that graph, and
  3187. 2.
  3188. Traversing the parse tree in the proper order and getting the semantic rules evaluated. 3.
  3189. If the parse tree attribute's dependencies are such that an attribute of node X depends on the attributes of nodes
  3190. generated before it in the parse tree-construction process, then it is possible to get X's attribute value during the
  3191. parsing itself; the parser is not required to generate an explicit parse tree, and the translations can be carried out along
  3192. with the parsing. The attributes associated with a grammar symbol are classified into two categories: the synthesized
  3193. and the inherited attributes of the grammar symbol.
  3194. Synthesized Attributes
  3195. An attribute is said to be synthesized if its value at a parse tree node is determined by the attribute values at the child
  3196. nodes. A synthesized attribute has a desirable property; it can be evaluated during a single bottom-up traversal of the
  3197. parse tree. Synthesized attributes are, in practice, extensively used. Syntax-directed definitions that only use
  3198. synthesized attributes are shown below:
  3199. These definitions specify the translations that must be carried out by the expression evaluator. A parse tree, along with the
  3200. values of the attributes at the nodes (called an "annotated parse tree"), for an expression 2+3*5 is shown in Figure 6.2.
  3201. Figure 6.2: An annotated parse tree.
  3202. Syntax-directed definitions that only use synthesized attributes are known as "S-attributed" definitions. If translations
  3203. are specified using S-attributed definitions, then the semantic rules can be conveniently evaluated by the LR parser
  3204. itself during the parsing, thereby making translation more efficient. Therefore, S-attributed definitions constitute a
  3205. subclass of the syntax-directed definitions that can be implemented using an LR parser.
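As an illustration (a Python sketch, not the book's notation), an S-attributed definition for arithmetic expressions can be evaluated by keeping a value stack in parallel with the parser's stack and executing one semantic rule at each reduction; only the semantic part is shown here, with the parsing itself abstracted away.

# One semantic rule per production, run when the parser reduces by that production.
def reduce_E_plus_T(vals):    # E -> E + T : E.val = E1.val + T.val
    e, _plus, t = vals
    return e + t

def reduce_T_times_F(vals):   # T -> T * F : T.val = T1.val * F.val
    t, _star, f = vals
    return t * f

def reduce_F_digit(vals):     # F -> digit : F.val = digit.lexval
    return vals[0]

# At a reduction by a production with |rhs| = r, the top r attribute values are
# popped and replaced by the synthesized value of the left-hand side:
value_stack = [2, '+', 3]
value_stack[-3:] = [reduce_E_plus_T(value_stack[-3:])]
print(value_stack)            # [5]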
  3206. Inherited Attributes
  3207. Inherited attributes are those whose initial value at a node in the parse tree is defined in terms of the attributes of the
  3208. parent and/or siblings of that node. For example, syntax-directed definitions that use inherited attributes are given
  3209. below:
  3210. A parse tree, along with the attributes' values at the parse tree nodes, for an input string int id1,id2,id3 is shown in
  3211. Figure 6.3.
  3212. Figure 6.3: Parse tree with node attributes for the string int id1,id2,id3.
  3213. Inherited attributes are convenient for expressing the dependency of a programming language construct on the
  3214. context in which it appears. When inherited attributes are used, then the interdependencies among the attributes at
  3215. the nodes of the parse tree must be taken into account when evaluating their semantic rules, and these
  3216. interdependencies among attributes are depicted by a directed graph called a "dependency graph". For example, if a
  3217. semantic rule is of the form A.val = f(X.val, Y.val, Z.val), that is, if A.val is a function of X.val, Y.val, and Z.val, and is
  3218. associated with a production A → XYZ, then we conclude that A.val depends on X.val, Y.val, and Z.val. Therefore,
  3219. every semantic rule must adopt the above form (if it hasn't already) by introducing a dummy synthesized attribute.
  3220. Dummy Synthesized Attributes
  3221. If the semantic rule is in the form of a procedure call fun(al,a2,a3, … , ak), then we can transform it into the form b =
  3222. fun(a1,a2,a3, … , ak), where b is a dummy synthesized attribute. The dependency graph has a node for each attribute
  3223. and an edge from node b to node a if attribute a depends on attribute b. For example, if a production A → XYZ is
  3224. used in the parse tree, then there will be four nodes in the dependency graph—A.val, X.val, Y.val, and Z.val—with
  3225. edges from X.val, Y.val, and Z.val to A.val.
  3226. The dependency graph for such a parse tree is shown in Figure 6.4. The ellipses denote the nodes of the dependency
  3227. graph, and the circles denote the nodes of the parse tree.
  3228. Figure 6.4: Dependency graph with four nodes.
  3229. A topological sort of the dependency graph results in an order in which the semantic rules can be evaluated. But for
  3230. reasons of efficiency, it is better to get the semantic rules evaluated (i.e., carry out the translation) during the parsing
  3231. itself. If the translations are to be carried out during the parsing, then the evaluation order of the semantic rules gets
  3232. linked to the order in which the parse tree nodes are created, even though the actual parse tree is not required to be
  3233. generated by the parser. Many top-down as well as bottom-up parsers generate nodes in a depth-first left-to-right
  3234. order; so the semantic rules must be evaluated in this same order if the translations are to be carried out during the
  3235. parsing itself. A class of syntax-directed definitions, called "L-attributed" definitions, has attributes that can always be
  3236. evaluated in depth-first, left-to-right order. Hence, if the translations are specified using L-attributed definitions, then it
  3237. is possible to carry out translations during the parsing.
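A sketch of the dependency-graph route: record an edge from b to a whenever attribute a depends on attribute b, then evaluate the semantic rules in a topologically sorted order (a simple Kahn-style sort is used here; the attribute names are illustrative).

from collections import defaultdict, deque

def topological_order(edges):
    # edges: list of (b, a) pairs meaning "attribute a depends on attribute b"
    succ, indeg, nodes = defaultdict(list), defaultdict(int), set()
    for b, a in edges:
        succ[b].append(a)
        indeg[a] += 1
        nodes.update((a, b))
    queue = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)                 # n's semantic rule can be evaluated now
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return order

# For A -> XYZ with A.val = f(X.val, Y.val, Z.val), A.val is ordered after the others:
print(topological_order([('X.val', 'A.val'), ('Y.val', 'A.val'), ('Z.val', 'A.val')]))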
  3238. 6.3 L-ATTRIBUTED DEFINITIONS
  3239. A syntax-directed definition is L-attributed if each inherited attribute of X j , for j between 1 and n, on the right side of
  3240. production A → X 1 X 2 … X n depends only on:
  3241. The attributes (both inherited as well as synthesized) of the symbols X 1 , X 2 , … , X j − 1 (i.e., the
  3242. symbols to the left of X j in the production), and
  3243. 1.
  3244. The inherited attributes of A. 2.
  3245. The syntax-directed definition above is an example of the L-attributed definition, because the inherited attribute L.type
  3246. depends on T.type, and T is to the left of L in the production D → TL. Similarly, the inherited attribute L 1 .type depends
  3247. on the inherited attribute L.type, and L is parent of L 1 in the production L → L 1 ,id.
  3248. When translations are carried out during parsing, the order in which the semantic rules are evaluated by the parser must
  3249. be explicitly specified. Hence, instead of using the syntax-directed definitions, we use syntax-directed translation
  3250. schemes to specify the translations. Syntax-directed definitions are more abstract specifications for translations;
  3251. therefore, they hide many implementation details, freeing the user from having to explicitly specify the order in which
  3252. translation takes place. Whereas the syntax-directed translation schemes indicate the order in which semantic rules
  3253. are evaluated, allowing some implementation details to be specified.
  3254. 6.4 SYNTAX-DIRECTED TRANSLATION SCHEMES
  3255. A syntax-directed translation scheme is a context-free grammar in which attributes are associated with the grammar
  3256. symbols, and semantic actions, enclosed within braces ({ }), are inserted in the right sides of the productions. These
  3257. semantic actions are basically the subroutines that are called at the appropriate times by the parser, enabling the
  3258. translation. The position of the semantic action on the right side of the production indicates the time when it will be
  3259. called for execution by the parser. When we design a translation scheme, we must ensure that an attribute value is
  3260. available when the action refers to it. This requires that:
  3261. An inherited attribute of a symbol on the right side of a production must be computed in an action
  3262. immediately preceding (to the left of) that symbol, because it may be referred to by an action
  3263. computing the inherited attribute of the symbol to the right of (following) it.
  3264. 1.
  3265. An action that computes the synthesized attribute of a nonterminal on the left side of the
  3266. production should be placed at the end of the right side of the production, because it might refer to
  3267. the attributes of any of the right-side grammar symbols. Therefore, unless they are computed, the
  3268. synthesized attribute of a nonterminal on the left cannot be computed.
  3269. 2.
  3270. These restrictions are motivated by the L-attributed definitions. Below is an example of a syntax-directed translation
  3271. scheme that satisfies these requirements, which are implemented during predictive parsing:
  3272. The advantage of a top-down parser is that semantic actions can be called in the middle of the productions. Thus, in
  3273. the above translation scheme, while using the production D → TL to expand D, we call a routine after recognizing T
  3274. (i.e., after T has been fully expanded), thereby making it easier to handle the inherited attributes. When a bottom-up
  3275. parser reduces the right side of the production D → TL by popping T and L from the top of the parser stack and
  3276. replacing them by D, the value of the synthesized attribute T.type is already on the parser stack at a known position. It
  3277. can be inherited by L. Since L.type is defined by a copy rule, L.type = T.type, the value of T.type can be used in place
  3278. of L.type. Thus, if the parser stack is implemented as two parallel arrays—state and value—and state [I] holds a
  3279. grammar symbol X, then value [I] holds a synthesized attribute of X. Therefore, the translation scheme implemented
  3280. during bottom-up parsing is as follows, where [top] is the value of the stack top before the reduction and [newtop] is the value
  3281. of the stack top after the reduction:
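As an illustration only (a Python sketch, not necessarily the book's exact scheme), the reductions for a declaration grammar D → TL, T → int | real, L → L,id | id can consult the value array at fixed offsets from the top, because T.type is left on the stack below the symbols of L; the addtype routine and the offsets shown are assumptions of the sketch.

symbol_table = {}

def addtype(name, typ):
    symbol_table[name] = typ                 # enter the declared type into the symbol table

def on_reduce(production, val, top):
    # val[] is the value stack, parallel to the state stack; top indexes its top
    if production == 'T -> int':
        val[top] = 'int'                     # T.type stays on the stack
    elif production == 'T -> real':
        val[top] = 'real'
    elif production == 'L -> id':
        addtype(val[top], val[top - 1])      # stack: ... T id      -> T.type at top-1
    elif production == 'L -> L , id':
        addtype(val[top], val[top - 3])      # stack: ... T L , id  -> T.type at top-3
    # D -> T L : no action is needed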
  3282. 6.5 INTERMEDIATE CODE GENERATION
  3283. While translating a source program into a functionally equivalent object code representation, a parser may first
  3284. generate an intermediate representation. This makes retargeting of the code possible and allows some optimizations
  3285. to be carried out that would otherwise not be possible. The following are commonly used intermediate representations:
1. Postfix notation
2. Syntax tree
3. Three-address code
  3289. Postfix Notation
In postfix notation, the operator follows its operands. For example, for the expression (a − b) * (c + d) + (a − b), the postfix representation is:
a b − c d + * a b − +
  3292. Syntax Tree
  3293. The syntax tree is nothing more than a condensed form of the parse tree. The operator and keyword nodes of the
parse tree (Figure 6.5) are moved to their parents, and a chain of single productions is replaced by a single link (Figure
  3295. 6.6).
  3296. Figure 6.5: Parse tree for the string id+id*id.
  3297. Figure 6.6: Syntax tree for id+id*id.
  3298. Three-Address Code
  3299. Three address code is a sequence of statements of the form x = y op z. Since a statement involves no more than
  3300. three references, it is called a "three-address statement," and a sequence of such statements is referred to as
three-address code. For example, the three-address code for the expression a + b * c + d is:
t1 = b * c
t2 = a + t1
t3 = t2 + d
Sometimes a statement contains fewer than three references, but it is still called a three-address statement. The following kinds of three-address statements are used to represent various programming language constructs (representative forms are sketched after this list):
  3304. Used for representing arithmetic expressions:
  3305. Used for representing Boolean expressions:
  3306. Used for representing array references and dereferencing operations:
  3307. Used for representing a procedure call:
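Typical statement forms in each of these categories are listed below; this is a representative set, and the exact statements used in the text may differ slightly:
Arithmetic expressions: x = y op z, x = op y, x = y
Boolean expressions and control flow: if x relop y goto L, goto L
Array references and dereferencing: x = y[i], x[i] = y, x = &y, x = *y, *x = y
Procedure calls: param x, call p, n, return y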
  3308. 6.6 REPRESENTING THREE-ADDRESS STATEMENTS
  3309. Records with fields for the operators and operands can be used to represent three-address statements. It is possible
  3310. to use a record structure with four fields: the first holds the operator, the next two hold the operand1 and operand2,
  3311. respectively, and the last one holds the result. This representation of a three-address statement is called a "quadruple
  3312. representation".
  3313. Quadruple Representation
  3314. Using quadruple representation, the three-address statement x = y op z is represented by placing op in the operator
  3315. field, y in the operand1 field, z in the operand2 field, and x in the result field. The statement x = op y, where op is a
  3316. unary operator, is represented by placing op in the operator field, y in the operand1 field, and x in the result field; the
  3317. operand2 field is not used. A statement like param t1 is represented by placing param in the operator field and t1 in the
  3318. operand1 field; neither operand2 nor the result field are used. Unconditional and conditional jump statements are
  3319. represented by placing the target labels in the result field. For example, a quadruple representation of the
three-address code for the statement x = (a + b) * − c/d is shown in Table 6.1. The parenthesized numbers are the indices of the three-address statements; in the triple representations given later, such numbers also serve as pointers into the triple structure.
  3322. Table 6.1: Quadruple Representation of x = ( a + b ) * − c/d
     Operator    Operand1    Operand2    Result
(1)  +           a           b           t1
(2)  −           c                       t2
(3)  *           t1          t2          t3
(4)  /           t3          d           t4
(5)  =           t4                      x
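A minimal C sketch of the quadruple record described above (field names and types are illustrative; in practice the operand and result fields would be pointers to symbol table records rather than strings):

struct quad {
    char *op;        /* operator: "+", "uminus", "goto", "=", ...    */
    char *operand1;  /* first operand, or NULL if unused             */
    char *operand2;  /* second operand, or NULL if unused            */
    char *result;    /* result name, or the jump target for goto/if  */
};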
  3332. Triple Representation
In the quadruple representation, the contents of the operand1, operand2, and result fields are normally pointers to the symbol table records for the names they represent. Hence, it becomes necessary to enter temporary names into the symbol table as they are created. This can be avoided by using the position of a statement to refer to a temporary value. If this is done, then a record structure with three fields is enough to represent the three-address statements: the first holds the operator, and the next two hold the values for operand1 and operand2, respectively. Such a
  3338. representation is called a "triple representation". The contents of the operand1 and operand2 fields are either pointers
  3339. to the symbol table records, or they are pointers to records (for temporary names) within the triple representation itself.
  3340. For example, a triple representation of the three-address code for the statement x = (a+b)* − c/d is shown in Table 6.2.
  3341. Table 6.2: Triple Representation of x = ( a + b ) * − c/d
     Operator    Operand1    Operand2
(1)  +           a           b
(2)  −           c
(3)  *           (1)         (2)
(4)  /           (3)         d
(5)  =           x           (4)
  3349. Indirect Triple Representation
  3350. Another representation uses an additional array to list the pointers to the triples in the desired order. This is called an
indirect triple representation. For example, an indirect triple representation of the three-address code for the statement x = (a+b)* − c/d is shown in Table 6.3; the first parenthesized column gives the position in the statement list, and the second gives the triple that the list entry points to.
  3353. Table 6.3: Indirect Triple Representation of x = ( a + b ) * − c/d
  3354. Operator Operand1 Operand2
  3355. (1) (1) + a b
  3356. (2) (2)
  3357. c
  3358. (3) (3) * (1) (2)
  3359. (4) (4) / (3) d
  3360. (5) (5) = x (4)
  3361. Comparison
  3362. By using quadruples, we can move a statement that computes A without requiring any changes in the statements
  3363. using A, because the result field is explicit. However, in a triple representation, if we want to move a statement that
  3364. defines a temporary value, then we must change all of the pointers in the operand1 and operand2 fields of the records
  3365. in which this temporary value is used. Thus, quadruple representation is easier to work with when using an optimizing
  3366. compiler, which entails a lot of code movement. Indirect triple representation presents no such problems, because a
  3367. separate list of pointers to the triple structure is maintained. When statements are moved, this list is reordered, and no
  3368. change in the triple structure is necessary; hence, the utility of indirect triples is almost the same as that of quadruples.
  3369. 6.7 SYNTAX-DIRECTED TRANSLATION SCHEMES TO SPECIFY THE
  3370. TRANSLATION OF VARIOUS PROGRAMMING LANGUAGE CONSTRUCTS
  3371. Specifying the translation of the construct involves specifying the construct's syntactic structure, using CFG, and
  3372. associating suitable semantic actions with the productions of the CFG. For example, if we want to specify the
  3373. translation of the arithmetic expressions into postfix notation so they can be carried along with the parsing, and if the
  3374. parsing method is LR, then first we write a grammar that specifies the syntactic structure of the arithmetic expressions.
  3375. We then associate suitable semantic actions with the productions of the grammar. The expressions used for these
  3376. associations are covered below.
  3377. 6.7.1 Arithmetic Expressions
  3378. The grammar that specifies the syntactic structure of the expressions in a typical programming language will have the
  3379. following productions:
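A representative set of productions (illustrative; the exact grammar used in the text may differ) is:
E → E + T | E − T | T
T → T * F | T / F | F
F → ( E ) | id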
  3380. Translating arithmetic expressions involves generating code to evaluate the given expression. Hence, for an
expression a + b * c, the three-address code that is required to be generated is:
t1 = b * c
t2 = a + t1
  3382. where t1 and t2 are pointers to the symbol table records that contain compiler-generated temporaries, and a, b, and c
  3383. are pointers to the symbol table records that contain the programmer-defined names a, b, and c, respectively.
  3384. Syntax-directed translation schemes to specify the translation of an expression into postfix notation are as follows:
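A sketch of such a scheme (representative only, built on a typical expression grammar and on the attributes and helper routines described just below):
E → E1 + T { E.code = concate(E1.code, T.code, '+') }
E → T { E.code = T.code }
T → T1 * F { T.code = concate(T1.code, F.code, '*') }
T → F { T.code = F.code }
F → ( E ) { F.code = E.code }
F → id { F.code = getname(id.place) }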
where code is a string-valued attribute used to hold the postfix expression, and place is a pointer-valued attribute that links to the symbol table record containing the name of the identifier. The function getname(ptr) returns the name of the identifier from the symbol table record pointed to by ptr, and concate(s1, s2, s3) returns the concatenation of the strings s1, s2, and s3. For the string a+b*c, the values of the attributes at the parse tree nodes are
  3389. shown in Figure 6.7.
  3390. Figure 6.7: Values of attributes at the parse tree node for the string a + b * c.
  3391. id.place = addr(symtab rec of a)
  3392. Syntax-directed translation schemes to specify the translation of an expression into the syntax tree are as follows:
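A sketch of such a scheme (representative only; mkleaf and mknode are described just below):
E → E1 + T { E.ptr = mknode('+', E1.ptr, T.ptr) }
E → T { E.ptr = T.ptr }
T → T1 * F { T.ptr = mknode('*', T1.ptr, F.ptr) }
T → F { T.ptr = F.ptr }
F → ( E ) { F.ptr = E.ptr }
F → id { F.ptr = mkleaf(id, id.place) }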
where ptr is a pointer-valued attribute that links to a node in the syntax tree, and place is a pointer-valued attribute that links to the symbol table record containing the name of the identifier. The mkleaf function generates leaf nodes, and mknode generates intermediate nodes.
  3396. For the string a+b*c, the values of the attributes at the parse tree nodes are shown in Figure 6.8.
  3397. Figure 6.8: Values of the attributes at the parse tree nodes for a + b * c, id.place = addr(symtab rec of a).
  3399. Syntax-directed translation schemes specify the translation of an expression into three-address code, as follows:
  3400. where ptr is a pointer value attribute used to link the pointer to the symbol record that contains the name of the
identifier, mkleaf generates leaf nodes, and mknode generates intermediate nodes. For the string a+b*c, the values of
  3402. the attributes at the parse tree nodes are shown in Figure 6.9.
Figure 6.9: Values of the attributes at the parse tree nodes for a + b * c, id.place = addr(symtab rec of a).
  3404. 6.7.2 Boolean Expressions
  3405. One way of translating a Boolean expression is to encode the expression's true and false values as the integers one
  3406. and zero, respectively. The code to evaluate the value of the expression in some temporary is generated as shown
  3407. below:
E → E1 relop E2
{
t1 = gentemp();
gencode(if E1.place relop.val E2.place goto(nextquad + 3));
gencode(t1 = 0);
gencode(goto(nextquad + 2));
gencode(t1 = 1);
E.place = t1;
}
where nextquad keeps track of the index in the code array at which the next statement will be inserted by the gencode procedure; gencode also updates the value of nextquad. This translation scheme translates the expression a < b to the following three-address code:
1. if a < b goto(4)
2. t1 = 0
3. goto(5)
4. t1 = 1
  3421. Similarly, a Boolean expression formed by using logical operators involves generating code to evaluate those
  3422. operators in some temporary form, as shown below:
  3423. E → E1 lop E2
  3424. {
  3425. t1 = gentemp();
  3426. gencode (t1 = E1.place lop.val E2.place);
  3427. E.place = t1;
  3428. }
  3429. E → not E1
  3430. {
  3431. t1 = gentemp();
  3432. gencode (t1 = not E1.place)
  3433. E.place = t1
  3434. }
  3435. lop → and { lop.val = and}
  3436. lop → or { lop.val = or}
The translation schemes above translate the expression a < b and c > d to the following three-address code:
1. if a < b goto(4)
2. t1 = 0
3. goto(5)
4. t1 = 1
5. if c > d goto(8)
6. t2 = 0
7. goto(9)
8. t2 = 1
9. t3 = t1 and t2
  3438. Another way to translate a Boolean expression is to represent its value by a position in the three-address code
  3439. sequence. For example, if we point to the statement labeled L1, then the value of the expression is true (1); whereas if
  3440. we point to the statement labeled L2, then the value of the expression is false (0). In this case, the use of a temporary
to hold either a one or zero, depending upon the true or false value of the expression, becomes redundant. This also
  3442. makes it possible to decide the value of the expression without evaluating it completely. This is called a "short circuit"
  3443. or "jumping the code". To discover the true/false value of the expression a<b or c>d, it is not necessary to completely
evaluate the expression; if a<b is true, then the entire expression will be true. Similarly, to discover the true/false value
  3445. of the expression a<b and c>d, it is not necessary to completely evaluate the expression, because if a<b is false, then
  3446. the entire expression will be false.
Tip A relational Boolean expression can therefore be translated into just two three-address statements: a conditional jump and an unconditional jump. But the targets of these jumps are not yet known at the time the Boolean expression is translated; hence, these jumps are generated without their targets, which are filled in later on.
  3450. Therefore, we must remember the indices of these jumps in the code array by using suitable attributes of E. For this,
  3451. we use two pointer value attributes: E.true and E.false. The attribute E.true will hold the pointer to the list that contains
  3452. the index of the conditional jump in the code array, whereas the attribute E.false will hold the pointer to the list that
  3453. contains the index of the unconditional jump. The translation scheme for the Boolean expression that uses relational
  3454. operators is as follows:
E → E1 relop E2
{
E.true = mklist(nextquad);
E.false = mklist(nextquad + 1);
gencode(if E1.place relop.val E2.place goto _);
gencode(goto _);
}
  3462. where mklist(ind) is a procedure that creates a list containing ind and returns a pointer to the created list.
The above translation scheme translates the expression a < b to the following three-address code, with both jump targets left to be filled in (backpatched) later:
1. if a < b goto _
2. goto _
  3464. 6.7.3 Short-Circuit Code for Logical Expressions
The generation of short-circuit code for expressions built with the various Boolean operators is covered, operator by operator, below.
  3467. AND
  3468. Logical expressions that use the ‘and’ operator are expressions defined by the production E → E1 and E2. Generating
  3469. the short-circuit code for these logical expressions involves setting the true value of the first expression, E1, to the
  3470. start of the second expression, E2, in the code array. We make the true value of E the same as the true value of
expression E2, and we make the false value of E the same as the false values of both E1 and E2. This requires remembering where E2 starts in the code array, which means we must record the value of nextquad just before E2 is processed. This can be accomplished by introducing a nullable nonterminal M before E2 in the above production, providing for a reduction by M → ∈ just before the processing of E2. The semantic action associated with this production is executed at exactly that point, giving us a way to remember the value of nextquad just before the E2 code is generated.
E → E1 and M E2 { backpatch(E1.true, M.quad);
  3478. E.true = E2.true;
  3479. E.false = merge(E1.false, E2.false);
  3480. }
  3481. M → ∈ {M.quad = nextquad; }
  3482. where backpatch(ptr,L) is a procedure that takes a pointer ptr to a list containing indices of the code array and fills the
  3483. target of the statements at these indices in the code array by L.
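A minimal C sketch of these list-handling helpers (not the book's code; the code array is represented here only by its jump-target field, and the names are illustrative):

#include <stdlib.h>

#define CODE_SIZE 1000
int target[CODE_SIZE];                 /* jump target of each generated quad */

struct list { int quad; struct list *next; };   /* one pending jump index    */

struct list *mklist(int quad)                   /* list holding a single index */
{
    struct list *p = malloc(sizeof *p);
    p->quad = quad;
    p->next = NULL;
    return p;
}

struct list *merge(struct list *a, struct list *b)   /* concatenate two lists */
{
    if (a == NULL) return b;
    struct list *p = a;
    while (p->next != NULL) p = p->next;
    p->next = b;
    return a;
}

void backpatch(struct list *ptr, int L)         /* fill in the missing targets */
{
    for (; ptr != NULL; ptr = ptr->next)
        target[ptr->quad] = L;
}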
  3484. OR
For an expression using the ‘or’ operator, that is, an expression defined by the production E → E1 or E2, generating
  3486. the short-circuit code involves setting the false value of the first expression, E1, to the start of E2 in the code array,
  3487. and making the false value of E the same as the false value of E2. The true value of E is assigned the same true value
  3488. as both E1 and E2. This requires remembering where E2 starts in the code array index, which requires making a
provision for remembering the value of nextquad just before the expression E2 is processed. This can be achieved by
  3490. introducing a nullable nonterminal M before E2 in the above production, providing for a reduction by M → ∈ just before
  3491. the processing of E2. Hence, we obtain a semantic action that is associated with this production and executed at this
  3492. point; therefore, we have provisioned the recall of the value of nextquad just before the E2 code is generated.
  3493. E → E1 or M E2 { backpatch(E1.false, M.quad);
  3494. E.false = E2.false;
  3495. E.true = merge(E1.true, E2.true);
  3496. }
  3497. M → ∈ {M.quad = nextquad; }
  3498. NOT
  3499. For the logical expression using the ‘not’ operator, that is, one defined by the production E → not E1, generating the
  3500. short-circuit code involves making the false value of the expression E the same as the true value of E1. And the true
  3501. value of E is assigned the false value of E1.
  3502. E → not E1 {
  3503. E.true = E1.false
  3504. E.false = E1.true
  3505. }
The above translation scheme translates the expression a < b and c > d to the following three-address code (the remaining jump targets are filled in when the enclosing statement is translated):
1. if a < b goto(3)
2. goto _
3. if c > d goto _
4. goto _
  3507. For example, consider the following Boolean expression:
  3508. When the above translation scheme is used to translate this construct, the three-address code generated for it is as
  3509. shown below, and the translation scheme is shown in Figure 6.10.
  3510. Figure 6.10: Translation scheme for a Boolean expression containing and, not, and or.
  3511. IF-THEN-ELSE
  3512. Since an if-then-else statement is composed of three components—a boolean expression E, a statement S1 that is to
  3513. be executed when E is true, and a statement S2 that is to be executed when E is false—the translation of if-then-else
  3514. involves making a provision for transferring control to the start of S1 if E is true, for transferring control to the start of
  3515. S2 if E is false, and for transferring control to the next statement after the execution of S1 and S2 is over. This
  3516. requires remembering where S1 starts in the index of the code array as well as remembering where S2 starts in the
  3517. index of the code array.
  3518. This is achieved by introducing a nullable nonterminal M1 before the S1 and a nullable nonterminal M2 before the S2
  3519. in the above production, providing for the reduction by M1 → ∈ just before processing S1. Hence, we get a semantic
  3520. action associated with this production and executed at this point, which enables the recall of the value of nextquad just
  3521. before generating S1 code. Similarly, it provides for the reduction by M2 → ∈ just before processing S2, and we get a
  3522. semantic action associated with production executed at this point, enabling the recall of the value of nextquad just
  3523. before generating S2 code.
  3524. In addition, an unconditional jump is required at the end of S1 in order to transfer control to the statement that follows
  3525. the if-then-else statement. To generate this unconditional jump, we add a nullable nonterminal N after S1 to the
  3526. production and associate a semantic action with the production N → ∈ , which takes care of generating this
  3527. unconditional jump, as shown in Figure 6.11.
  3528. S → if E then M1 S1 N
  3529. else M2 S2 {
  3530. backpatch (E.true, M1.quad)
  3531. backpatch (E.false, M2.quad)
S.next = merge (S1.next, S2.next, N.next)
  3534. }
  3535. M1 → ∈ { M1.quad = nextquad;}
  3536. M2 → ∈ { M2.quad = nextquad}
  3537. N → ∈ {
  3538. N.next = mklist (nextquad);
  3539. gencode (goto...);
  3540. }
  3541. Figure 6.11: The addition of the nullable nonterminal N facilitates an unconditional jump.
Hence, for the statement if a<b then x = y + z else p = q + r, the three-address code that is required to be generated is:
1. if a < b goto(3)
2. goto(6)
3. t1 = y + z
4. x = t1
5. goto NEXT
6. t2 = q + r
7. p = t2
where NEXT is the index of the statement that follows the if-then-else; it is filled in when the enclosing construct is translated.
  3543. IF-THEN
  3544. Since an if-then statement is comprised of two components, a Boolean expression E and an S1 statement that will be
executed when E is true, the translation of if-then involves making a provision for transferring control to the start of the S1 code if E is true, and for transferring control to the next statement both when E is false and after the execution of S1 is over. This requires remembering the index in the code array at which the S1 code starts, and can be achieved by introducing a nullable nonterminal M before S1 in the production. This provides for a reduction by M → ∈ just before the processing of S1. Hence, we get a semantic action associated with this production and executed at this point, which makes a provision for remembering the value of nextquad just before the S1 code is generated, as shown in Figure 6.12 below:
  3552. S → if E then M S1 {
  3553. backpatch (E.true, M.quad);
  3554. S.next = merge(E.false, S1.next)
  3555. }
  3556. M → ∈ { M.quad = nextquad; }
  3557. Figure 6.12: A nullable nonterminal M provisions the translation of if-then.
Hence, for the statement if a<b then x = y + z, the three-address code that is required to be generated is:
1. if a < b goto(3)
2. goto NEXT
3. t1 = y + z
4. x = t1
  3559. WHILE
  3560. Since a while statement has two components, a Boolean expression E and a statement S1, which is the statement to
  3561. be executed repeatedly as long as E is true, the translation of while involves provisioning the transfer of control to the
start of the S1 code if E is true; the expression must be tested again after the S1 execution is over, and control must be transferred to the next statement if E is false. This requires remembering the index in the code array where the S1 code starts as well as where the E code starts. This can be achieved by introducing a nullable nonterminal M2 before S1 in the production. This will provide for the reduction by M2 → ∈ just before the processing of S1. Hence, a semantic action is associated with this production and is executed at this point, enabling the recall of the value of nextquad just before generating the S1 code. Similarly, introducing a nullable nonterminal M1 before E will provide for the reduction by M1
  3568. → ∈ just before the processing of E. Hence, a semantic action is now associated with this production and is executed
  3569. at this point, provisioning the recall of the value of nextquad just before E code is generated, as shown in Figure 6.13.
  3570. S → while M1 E
  3571. do M2 S1 {
  3572. backpatch (E.true, M2.quad)
  3573. backpatch (S1.next, M1.quad)
  3574. S.next = E.false
  3575. gencode (goto(M1.quad))
  3576. }
  3577. M1 →∈ { M1.quad = nextquad; }
  3578. M2 →∈ { M2.quad = nextquad; }
  3579. Figure 6.13: The translation of the Boolean while statement is facilitated by a nullable nonterminal M.
Hence, for the statement while a<b do x = y + z, the three-address code that is required to be generated is:
1. if a < b goto(3)
2. goto NEXT
3. t1 = y + z
4. x = t1
5. goto(1)
  3581. DO-WHILE
  3582. Since a do-while statement is comprised of two components, a Boolean expression E and an S1 statement that is
  3583. executed repeatedly as long as E is true (as well as the test for whether E is true or false at the end of S1 execution),
  3584. translation involves provisioning the transfer of control to test the expression after the execution of S1 is over. Control
  3585. must also be transferred to the start of S1 code if E is true, and conversely to the next statement if E is false.
  3586. This requires recalling the S1 start index in the code array as well as the E start index. We introduce a nullable
  3587. nonterminal M1 before S1 in the production, providing for the reduction by M1 → ∈ just before the processing of S1.
  3588. Hence, a semantic action is now associated with this production and is executed at this point, provisioning the recall of
  3589. the value of nextquad just before S1 code generates. Similarly, introducing a nullable nonterminal M2 before E will
  3590. provide for the reduction by M2 → ∈ just before the processing of E. We then have a semantic action associated with
  3591. this production and executed at this point, and which provisions the recall of the value of nextquad just before E code
  3592. generates, as shown in Figure 6.14.
  3593. S → do M1 S1 while M2 E {
  3594. backpatch (E.true, M1.quad)
  3595. backpatch (S1.next, M2.quad)
S.next = E.false
  3597. }
  3598. M1 → ∈ { M1.quad = nextquad; }
  3599. M2 → ∈ { M2.quad = nextquad; }
  3600. Figure 6.14: Translation of the Boolean do-while.
Hence, for a statement do x = y + z while a<b, the three-address code that is required to be generated is:
1. t1 = y + z
2. x = t1
3. if a < b goto(1)
4. goto NEXT
  3602. REPEAT-UNTIL
  3603. Since a repeat-until statement has two components, a Boolean expression E and an S1 statement that is executed
  3604. repeatedly until E becomes true (as well as the test of whether E is true or false at the end of S1), the translation of
repeat-until involves provisioning the transfer of control to a test of the expression after the execution of S1 is over. We must also transfer control to the start of the S1 code if E is false, and to the next statement if E is true.
  3607. This requires recalling the index in the code array where S1 code starts as well as the index in the code array where E
  3608. code starts. We achieve this by introducing a nullable nonterminal M1 before S1 in the production. This will provide for
the reduction by M1 → ∈ just before the processing of S1. Hence, we get a semantic action that is associated with this production and is executed at this point, which makes a provision for remembering the value of nextquad just before the S1 code is generated. Similarly, we introduce a nullable nonterminal M2 before E, which provides for the reduction by M2 → ∈ just before the processing of E. The semantic action associated with this production is executed at this point, remembering the value of nextquad just before the E code is generated, as shown in Figure 6.15.
  3615. S → repeat M1 S1
  3616. until M2 E {
  3617. backpatch (E.false, M1.quad)
  3618. backpatch (S1.next, M2.quad)
  3619. S.next = E.true
  3620. }
  3621. M1 → ∈ { M1.quad = nextquad; }
  3622. M2 → ∈ { M2.quad = nextquad; }
  3623. Figure 6.15: Translation of Boolean repeat-until.
Hence, for the Boolean statement repeat x = y + z until a<b, the three-address code that is required to be generated is:
1. t1 = y + z
2. x = t1
3. if a < b goto NEXT
4. goto(1)
  3626. FOR
  3627. A for statement is composed of four components: an expression E1, which is used to initialize the iteration variable; an
  3628. expression E2, which is a Boolean expression used to test whether or not the value of the iteration variable exceeds
  3629. the final value; an expression E3, which is used to specify the step by which the value of the iteration variable is to be
  3630. incremented or decremented; and an S1 statement, which is the statement to be executed as long as the value of the
  3631. iteration variable is less than or equal to the final value. Hence, the translation of a for statement involves provisioning
the transfer of control to the start of the S1 code if E2 is true, transferring control to the start of the E3 code after the execution
  3633. of S1 is over, transferring control to the start of E2 code after E3 code is executed, and transferring control to the next
  3634. statement if E2 is false, as shown in Figure 6.16.
  3635. S → for (E1; M1 E2; M2 E3) M3 S1
  3636. {
  3637. backpatch (E2.true, M3.quad)
  3638. backpatch (M3.next, M1.quad)
  3639. backpatch (S1.next, M2.quad)
  3640. gencode (goto(M2.quad))
  3641. S.next = E2.false
  3642. }
  3643. M1 → ∈ { M1.quad = nextquad; }
  3644. M2 → ∈ { M2.quad = nextquad; }
  3645. M3 → ∈ {
M3.next = mklist (nextquad)
  3647. gencode (goto...)
  3648. M3.quad = nextquad;
  3649. }
  3650. Figure 6.16: Handling the translation of the Boolean for.
Hence, for a statement for(i = 1; i <= 20; i++) x = y + z, the three-address code that is required to be generated is:
1. i = 1
2. if i <= 20 goto(7)
3. goto NEXT
4. t1 = i + 1
5. i = t1
6. goto(2)
7. t2 = y + z
8. x = t2
9. goto(4)
6.8 IMPLEMENTATION OF INCREMENT AND DECREMENT OPERATORS
The following translation schemes handle the postfix and prefix forms of the increment and decrement operators. In the postfix forms, L.place holds the value of id before the update; in the prefix forms, it holds the updated value.
  3653. L → id++ {
  3654. t1 = gentemp();
  3655. t2 = gentemp();
  3656. gencode(t1 = id.place);
  3657. gencode(t2 = id.place +1);
  3658. gencode (id.place = t2);
  3659. L.place = t1;
  3660. }
  3661. L → ++id {
  3662. t1 = gentemp();
  3663. gencode(t1 = id.place +1);
  3664. gencode(id.place = t1);
  3665. L.place = t1;
  3666. }
L → id-- {
  3668. t1 = gentemp();
  3669. t2 = gentemp();
  3670. gencode(t1 = id.place);
  3671. gencode(t2 = id.place -1);
  3672. gencode(id.place = t2);
  3673. L.place = t1;
  3674. }
L → --id {
  3676. t1 = gentemp();
  3677. gencode (t1 = id.place -1);
  3678. gencode (id.place = t1);
  3679. L.place = t1;
  3680. }
  3681. 6.9 THE ARRAY REFERENCE
  3682. An array reference is an expression with an l-value. Therefore, to capture its syntactic structure, we add the following
  3683. productions to the grammar:
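A representative set of productions (illustrative; the modified grammar actually used for translation appears later in this section):
S → L = E
E → E + E | ( E ) | L
L → id [ elist ] | id
elist → elist , E | E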
An array reference in a source program is replaced by code that computes the l-value of the referenced array element. Computing the l-value involves finding the offset of the referenced element of the array and then
  3686. adding it to the base. But since deriving an offset depends on the subscripts used in an array reference, and the
  3687. values of these subscripts are not known during the compilation, unless the subscripts are constant expressions, a
  3688. compiler has to generate the code for evaluating the l-value of an expression that specifies the reference to an
element of an array. For a k-dimensional array reference a[i1, i2, …, ik], with the array stored in row-major order, this l-value computation is achieved as follows:
l-value = addr(a) + offset
offset = [(i1 − lb1)*n2*n3* … *nk + (i2 − lb2)*n3*n4* … *nk + … + (ik − lbk)]*bpw
where lbi and ubi are the lower and upper bounds of the ith dimension, ni = ubi − lbi + 1 is the number of elements in the ith dimension, and bpw is the number of bytes per word.
If the lower bound of each dimension is one, and the upper bound of the ith dimension is di, then the offset-computing formula becomes:
offset = [i1*d2*d3* … *dk + i2*d3*d4* … *dk + … + ik]*bpw − [d2*d3* … *dk + d3*d4* … *dk + … + dk + 1]*bpw
The [i1*d2*d3* … *dk + i2*d3*d4* … *dk + … + ik]*bpw is a variable part of the offset computation, whereas [d2*d3* … *dk + d3*d4* … *dk + … + dk + 1]*bpw is a constant part of the offset computation and is not required to be computed for every reference to the array a. It can be computed once, while processing the declaration of the array a. We call this value "constant C". Therefore:
offset = V − C
where V is the variable part, and
C = [d2*d3* … *dk + d3*d4* … *dk + … + dk + 1]*bpw
For example, for a 20 × 20 array with four bytes per word, C = (20 + 1) * 4 = 84, which is why the code in Example 6.5 refers to addr(a) − 84.
Since addr(a) is fixed, we can combine C with addr(a) and store this value in an attribute, L.place, and we can store V in another attribute, L.off, so that:
L.place = addr(a) − C, L.off = V, and l-value = L.place + L.off
Hence, the translation of an array reference involves generating code for computing V, and V is made the value of attribute L.off. We compute addr(a) − C and make it the value of the attribute L.place. Computing V involves evaluating the expression:
[i1*d2*d3* … *dk + i2*d3*d4* … *dk + … + ik]*bpw
This expression can be rewritten as:
(( … ((i1*d2 + i2)*d3 + i3)*d4 + … )*dk + ik)*bpw
Therefore, the three-address code that is required to be generated for computing V is:
T1 = i1 * d2
T1 = T1 + i2
T1 = T1 * d3
T1 = T1 + i3
…
T1 = T1 * dk
T1 = T1 + ik
V = T1 * bpw
  3705. Therefore, the translation scheme is:
  3706. elist → E (Initialize queue by adding E.place)
  3707. elist → elist1, E (Append E.place to queue)
  3708. L → id[elist] { T1: = gentemp ( )
  3709. elist.Ndim = 1
gencode(T1 = retrieve());
  3711. while (queue not empty ) do
  3712. {
  3713. gencode (T1= T1 * limit (id.place, elist.Ndim))
  3714. gencode (T1 : = T1 + retrieve())
  3715. elist.Ndim = elist.Ndim + 1
  3716. }
  3717. V = gentemp();
  3718. U = gentemp();
  3719. gencode (V : = T1 * bpw)
  3720. gencode (U : = id.place - C)
  3721. L.off = V
  3722. L.place: = U
  3723. }
  3724. where retrieve() is a function that retrieves a value from the queue, and limit() returns the upper bound of the
  3725. dimension of the array.
  3726. In this translation scheme, the attribute id.place cannot be accessed in the semantic action associated with the
  3727. production elist → E or in the semantic action associated with the production elist → elist l, E. So it is not possible to
  3728. make use of the value of the subscript that is available in E.place to get the required three-address statements
  3729. generated. Hence, a queue is necessary in order to maintain the subscripts' storage. These subscripts are used later
  3730. on for generating the code for computing the offset.
  3731. Another way to approach this is to modify the grammar to make it suitable for translation. This requires rewriting the
  3732. productions in such a manner that both id and E exist in the same production so that the pointer to the symbol table
  3733. record of the array name is available in id.place. This can be used to retrieve the upper-bound dimension information
  3734. of the array. And the value of the subscript is available E.place; so by using both of these, the required three-address
  3735. statements can be generated, and the value of the subscript does not need to be stored. Therefore, the modified
  3736. grammar, along with the semantic actions, is:
L → elist ] { U = newtemp(); V = newtemp();
gencode(V = elist.place * bpw);
gencode(U = elist.array − C);
L.place = U;
L.off = V;
}
elist → id [ E { elist.place = E.place;
elist.array = id.place;
elist.Ndim = 1; }
elist → elist1 , E { T1 = newtemp();
gencode(T1 = elist1.place * limit(elist1.array, elist1.Ndim + 1));
gencode(T1 = T1 + E.place);
elist.array = elist1.array;
elist.place = T1;
elist.Ndim = elist1.Ndim + 1;
}
  3754. For example, consider the following assignment statement:
  3755. where a and b are arrays of size 30 × 40, and c and d are arrays of size 20.
  3756. There are four bytes per word, and the arrays are allocated statically. When the above translation scheme is used to
  3757. translate this construct, the three-address code generated is:
  3758. 6.10 SWITCH/CASE
  3759. To capture the syntactic structure of the switch statement, we add the following productions to the grammar. Here,
  3760. break is assumed to be a part of statement that is derivable from a nonterminal S.
  3761. S → switch E { caselist}
  3762. caselist → caselist case V : S
  3763. caselist → case V: S
  3764. caselist → default: S
  3765. caselist → caselist default: S
  3766. A switch statement is comprised of two components: an expression E, which is used to select a particular case from
  3767. the list of cases; and a caselist, which is a list of n number of cases, each of which corresponds to one of the possible
  3768. values of the expression E, perhaps including a default case.
  3769. Note A case statement can be implemented in a variety of different ways. If the number of cases is not too great, then
  3770. a case statement can be implemented by generating a sequence of conditional jumps, each of which tests for an
  3771. individual value and transfers to the code for the corresponding statement. If the number of cases is large, then it
  3772. is more efficient to construct a hash table for the case values with the labels of the various statements as entries.
  3773. A syntax-directed translation scheme that translates a case statement into a sequence of conditional jumps, each of
  3774. which tests for an individual value and transfers to the code for the corresponding statement, is considered below. We
  3775. begin with a typical switch statement:
  3776. switch (E)
  3777. {
  3778. case V1: S1
  3779. case V2: S2
  3780. .
  3781. .
  3782. .
  3783. case Vn:Sn
  3784. }
The generated three-address code that is required for the statement is shown in Figure 6.17. Here, next is the label of the
  3786. code for the statement that comes next in the switch statement execution order.
  3787. Figure 6.17: A switch/case three-address translation.
  3788. Therefore, switch statement translation involves generating an unconditional jump after the code of every S1, S2, … ,
Sn statement in order to transfer control to the statement following the switch, as well as to remember the code start of S1, S2, … , Sn, and to generate the conditional jumps. Each of these jumps tests for an individual value and transfers to the code for the corresponding statement. This requires introducing a nullable nonterminal before each of S1, S2, … , Sn, as
  3792. shown in Figure 6.18.
  3793. Figure 6.18: Nullable nonterminals are introduced into a switch statement translation.
  3794. EXAMPLE 6.1
  3795. Consider the following switch statement:
  3796. switch (i + j )
  3797. {
  3798. case 1: x = y + z
  3799. default: p = q + r
  3800. case 2: u = v + w
  3801. }
  3802. The above translation scheme translates into the following three-address code, which is also shown in Figure 6.19:
  3803. Figure 6.19: Contents of queue during the translation.
  3804. EXAMPLE 6.2
  3805. Using the above translation scheme translates the following switch statement:
  3806. switch (a+b)
  3807. {
  3808. case 2: { x = y; break; }
  3809. case 5: {switch x
  3810. {
  3811. case 0: { a = b + 1; break; }
  3812. case 1: { a = b + 3; break; }
  3813. default: { a = 2; break; }
  3814. }
  3815. break;
  3816. case 9: { x = y - 1; break; }
  3817. default: { a = 2; break; }
  3818. }
  3819. The three address code is:
1. t1 = a + b
2. goto(23)
3. x = y
4. goto NEXT
5. goto(14)
6. t3 = b + 1
7. a = t3
8. goto NEXT
9. t4 = b + 3
10. a = t4
11. goto NEXT
12. a = 2
13. goto NEXT
14. if x = 0 goto(6)
15. if x = 1 goto(9)
16. goto(12)
17. goto NEXT
18. t5 = y − 1
19. x = t5
20. goto NEXT
21. a = 2
22. goto NEXT
23. if t1 = 2 goto(3)
24. if t1 = 5 goto(5)
25. if t1 = 9 goto(18)
26. goto(21)
6.11 THE PROCEDURE CALL
The translation of a procedure call involves generating a param statement for each argument, followed by a call statement. The places of the argument expressions are kept in a queue so that all of the param statements can be emitted together, just before the call:
S → call id ( arglist )
{ for every value T in the queue, generate gencode(param T);
gencode(call id.place, arglist.count);
}
arglist → arglist1 , E { append(queue, E.place);
arglist.count = arglist1.count + 1 }
arglist → E { initialize the queue with E.place;
arglist.count = 1 }
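As an illustration (the procedure and variable names here are hypothetical, not taken from the text), the call p(a + b, c) would translate to:
t1 = a + b
param t1
param c
call p, 2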
  3856. 6.12 EXAMPLES
  3857. Following are additional examples of syntax-directed definitions and translations.
  3858. EXAMPLE 6.3
  3859. Generate the three-address code for the following C program:
  3860. main()
  3861. { int i = 1;
  3862. int a[10];
  3863. while(i <= 10)
a[i] = 0;
  3865. }
  3866. The three-address code for the above C program is:
1. i = 1
2. if i <= 10 goto(4)
3. goto(8)
4. t1 = i * width
5. t2 = addr(a) − width
6. t2[t1] = 0
7. goto(2)
  3874. where width is the number of bytes required for each element.
  3875. EXAMPLE 6.4
  3876. Generate the three-address code for the following program fragment:
  3877. while (A < C and B > D) do
  3878. if A = 1 then C = C+1
  3879. else
  3880. while A <= D do
  3881. A = A + 3
  3882. The three-address code is:
1. if a < c goto(3)
2. goto(16)
3. if b > d goto(5)
4. goto(16)
5. if a = 1 goto(7)
6. goto(10)
7. t1 = c + 1
8. c = t1
9. goto(1)
10. if a <= d goto(12)
11. goto(1)
12. t2 = a + 3
13. a = t2
14. goto(10)
15. goto(1)
  3898. EXAMPLE 6.5
  3899. Generate the three-address code for the following program fragment, where a and b are arrays of size 20 × 20, and
  3900. there are four bytes per word.
  3901. begin
  3902. add = 0;
  3903. i = 1;
  3904. j = 1;
  3905. do
  3906. begin
  3907. add = add + a[i,j] * b[j,i]
  3908. i = i + 1;
  3909. j = j + 1;
  3910. end
  3911. while i <= 20 and j <= 20;
  3912. end
  3913. The three-address code is:
1. add = 0
2. i = 1
3. j = 1
4. t1 = i * 20
5. t1 = t1 + j
6. t1 = t1 * 4
7. t2 = addr(a) − 84
8. t3 = t2[t1]
9. t4 = j * 20
10. t4 = t4 + i
11. t4 = t4 * 4
12. t5 = addr(b) − 84
13. t6 = t5[t4]
14. t7 = t3 * t6
15. t7 = add + t7
16. t8 = i + 1
17. i = t8
18. t9 = j + 1
19. j = t9
20. if i <= 20 goto(22)
21. goto NEXT
22. if j <= 20 goto(4)
23. goto NEXT
  3937. EXAMPLE 6.6
  3938. Consider the program fragment:
  3939. sum = 0
  3940. for(i = 1; i<= 20; i++)
  3941. sum = sum + a[i] + b[i];
  3942. and generate the three-address code for it. There are four bytes per word.
  3943. The three address code is:
1. sum = 0
2. i = 1
3. if i <= 20 goto(8)
4. goto NEXT
5. t1 = i + 1
6. i = t1
7. goto(3)
8. t2 = i * 4
9. t3 = addr(a) − 4
10. t4 = t3[t2]
11. t5 = i * 4
12. t6 = addr(b) − 4
13. t7 = t6[t5]
14. t8 = sum + t4
15. t8 = t8 + t7
16. sum = t8
17. goto(5)
  3961. Chapter 7: Symbol Table Management
  3962. 7.1 THE SYMBOL TABLE
  3963. A symbol table is a data structure used by a compiler to keep track of scope/ binding information about names. This
  3964. information is used in the source program to identify the various program elements, like variables, constants,
  3965. procedures, and the labels of statements. The symbol table is searched every time a name is encountered in the
  3966. source text. When a new name or new information about an existing name is discovered, the content of the symbol
  3967. table changes. Therefore, a symbol table must have an efficient mechanism for accessing the information held in the
  3968. table as well as for adding new entries to the symbol table.
For efficiency, our choice of the implementation data structure for the symbol table, and the organization of its contents, should impose minimal cost when adding new entries or accessing the information on existing entries. Also, if the symbol table can grow dynamically as necessary, then it is more useful for a compiler.
  3972. 7.2 IMPLEMENTATION
  3973. Each entry in a symbol table can be implemented as a record that consists of several fields. These fields are
  3974. dependent on the information to be saved about the name. But since the information about a name depends on the
  3975. usage of the name (i.e., on the program element identified by the name), the entries in the symbol table records will
  3976. not be uniform. Hence, to keep the symbol table records uniform, some of the information about the name is kept
  3977. outside of the symbol table record, and a pointer to this information is stored in the symbol table record, as shown in
  3978. Figure 7.1. Here, the information about the lower and upper bounds of the dimension of the array named a is kept
  3979. outside of the symbol table record, and the pointer to this information is stored within the symbol table record.
  3980. Figure 7.1: A pointer steers the symbol table to remotely stored information for the array a.
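A C sketch of this arrangement (field names and types are illustrative, not taken from the text):

#define MAXDIM 10

/* dimension information kept outside the symbol table record (Figure 7.1) */
struct array_info {
    int ndim;            /* number of dimensions              */
    int lb[MAXDIM];      /* lower bound of each dimension     */
    int ub[MAXDIM];      /* upper bound of each dimension     */
};

/* one symbol table record */
struct sym_record {
    char *name;                 /* or a pointer into a string table      */
    int   type;
    struct array_info *dims;    /* NULL unless the name denotes an array */
};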
  3981. 7.3 ENTERING INFORMATION INTO THE SYMBOL TABLE
  3982. Information is entered into the symbol table in various ways. In some cases, the symbol table record is created by the
  3983. lexical analyzer as soon as the name is encountered in the input, and the attributes of the name are entered when the
  3984. declarations are processed. But very often, the same name is used to denote different objects, perhaps even in the
  3985. same block. For example, in C programming, the same name can be used as a variable name and as a member
  3986. name of a structure, both in the same block. In such cases, the lexical analyzer only returns the name to the parser,
  3987. rather than a pointer to the symbol table record. That is, a symbol table record is not created by the lexical analyzer;
  3988. the string itself is returned to the parser, and the symbol table record is created when the name's syntactic role is
  3989. discovered.
  3990. 7.4 WHERE SHOULD NAMES BE HELD?
  3991. If there is a modest upper bound on the length of the name, then the name can be stored in the symbol table record
  3992. itself. But if there is no such limit, or if the limit is rarely reached, then an indirect scheme of storing name is used. A
  3993. separate array of characters, called a "string table," is used to store the name, and a pointer to the name is kept in the
  3994. symbol table record, as shown in Figure 7.2.
  3995. Figure 7.2: Symbol table names are held either in the symbol table record or in a separate string table.
  3996. 7.5 INFORMATION ABOUT THE RUNTIME STORAGE LOCATION
  3997. The information about the runtime, name storage location is kept in the symbol table. If the compiler is going to be
  3998. generating assembly code, then the assembler takes care of the storage locations of the various names. After
  3999. generating the assembly code, the compiler scans the symbol table and generates the assembly language data
  4000. definitions. These are appended to the assembly language code for each name. But if machine code is being
  4001. generated, then the compiler must ascertain the position of each data object relative to a fixed origin.
  4002. 7.6 VARIOUS APPROACHES TO SYMBOL TABLE ORGANIZATION
  4003. There are several methods of organizing the symbol table. These methods are discussed below.
  4004. 7.6.1 The Linear List
  4005. A linear list of records is the easiest way to implement a symbol table. The new names are added to the table in the
  4006. order that they arrive. Whenever a new name is to be added to the table, the table is first searched linearly or
  4007. sequentially to check whether or not the name is already present in the table. If the name is not present, then the
  4008. record for new name is created and added to the list at a position specified by the available pointer, as shown in the
  4009. Figure 7.3.
  4010. Figure 7.3: A new record is added to the linear list of records.
  4011. To retrieve the information about the name, the table is searched sequentially, starting from the first record in the
table. The average number of comparisons, p, required for a search is p = (n + 1)/2 for a successful search and p = n for an unsuccessful search, where n is the number of records in the symbol table. The advantage of this organization is that it
  4014. takes less space, and additions to the table are simple. This method's disadvantage is that it has a higher accessing
  4015. time.
  4016. 7.6.2 Search Trees
  4017. A search tree is a more efficient approach to symbol table organization. We add two links, left and right, in each
  4018. record, and these links point to the record in the search tree. Whenever a name is to be added, first the name is
  4019. searched in the tree. If it does not exist, then a record for the new name is created and added at the proper position in
  4020. the search tree. This organization has the property of alphabetical accessibility; that is, all the names accessible from
name i by following the left link will precede name i in alphabetical order. Similarly, all the names accessible from name i by following the right link will follow name i in alphabetical order (see Figure 7.4). The expected time needed to enter n names and to make m queries is proportional to (m + n) log2 n; so for greater numbers of records (higher n) this method
  4024. has advantages over linear list organization.
  4025. Figure 7.4: The search tree organization approach to a symbol table.
  4026. 7.6.3 Hash Tables
  4027. A hash table is a table of k pointers numbered from zero to k − 1 that point to the symbol table and a record within the
  4028. symbol table. To enter a name into symbol table, we find out the hash value of the name by applying a suitable hash
  4029. function. The hash function maps the name into an integer between zero and k − 1, and using this value as an index in
  4030. the hash table, we search the list of the symbol table records that is built on that hash index. If the name is not present
  4031. in that list, we create a record for name and insert it at the head of the list. When retrieving the information associated
  4032. with the name, the hash value of the name is first obtained, and then the list that was built on this hash value is
  4033. searched for information about the name (Figure 7.5).
  4034. Figure 7.5: Hash table method of symbol table organization.
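A minimal C sketch of this organization (not the book's code; the record fields, table size, and hash function are illustrative):

#include <stdlib.h>
#include <string.h>

#define K 211                              /* number of hash-table buckets */

struct symrec { char *name; int type; struct symrec *next; };
static struct symrec *bucket[K];           /* K pointers into the symbol table */

static unsigned hash(const char *s)        /* map a name to 0 .. K-1 */
{
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % K;
}

struct symrec *lookup(const char *name)    /* search the list built on the hash index */
{
    for (struct symrec *p = bucket[hash(name)]; p != NULL; p = p->next)
        if (strcmp(p->name, name) == 0) return p;
    return NULL;
}

struct symrec *insert(const char *name, int type)  /* add at the head of the list */
{
    unsigned h = hash(name);
    struct symrec *p = malloc(sizeof *p);
    p->name = malloc(strlen(name) + 1);
    strcpy(p->name, name);
    p->type = type;
    p->next = bucket[h];
    bucket[h] = p;
    return p;
}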
  4035. 7.7 REPRESENTING THE SCOPE INFORMATION IN THE SYMBOL TABLE
  4036. Every name possesses a region of validity within the source program, called the "scope" of that name. The rules
  4037. governing the scope of names in a block-structured language are as follows:
1. A name declared within a block B is valid only within B.
2. If block B1 is nested within B2, then any name that is valid for B2 is also valid for B1, unless the identifier for that name is re-declared in B1.
  4042. These scope rules require a more complicated symbol table organization than simply a list of associations between
  4043. names and attributes. One technique that can be used is to keep multiple symbol tables, one for each active block,
such as the block that the compiler is currently in. Each table is a list of names and their associated attributes, and the tables are organized into a stack. Whenever a new block is entered, a new empty table is pushed onto the stack for holding the names that are declared as local to this block. When a declaration is compiled, the table on top of the stack is searched for the name; if the name is not found there, the new name is inserted. When a reference to a name is
  4048. translated, each table is searched, starting from the top table on the stack, ensuring compliance with static scope
  4049. rules. For example, consider following program structure. The symbol table organization will be as shown in Figure
  4050. 7.6.
Program main
Var x, y : integer;
Procedure P;
Var x, a : boolean;
Procedure q;
Var x, y, z : real;
begin
.
.
end
begin
.
end
begin
.
end
  4065. Figure 7.6: Symbol table organization that complies with static scope information rules.
  4066. Another technique can be used to represent scope information in the symbol table. We store the nesting depth of
  4067. each procedure block in the symbol table and use the [procedure name, nesting depth] pair as the key to accessing
  4068. the information from the table. A nesting depth of a procedure is a number that is obtained by starting with a value of
  4069. one for the main and adding one to it every time we go from an enclosing to an enclosed procedure. This number is
  4070. basically a count of how many procedures are there in the referencing environment of the procedure.
  4071. For example, refer to the program code structure above. The symbol table's contents are shown in Table 7.1.
  4072. Table 7.1: Symbol Table Contents Using a Nesting Depth Approach
Name    Nesting Depth    Type
X       1                real
Y       1                real
Z       1                real
q       3                proc
a       3                Boolean
X       3                Boolean
P       2                proc
Y       2                integer
X       2                integer
  4082. Chapter 8: Storage Management
  4083. 8.1 STORAGE ALLOCATION
  4084. One of the important tasks that a compiler must perform is to allocate the resources of the target machine to
  4085. represent the data objects that are being manipulated by the source program. That is, a compiler must decide the
  4086. run-time representation of the data objects in the source program. Source program run-time representations of the
  4087. data objects, such as integers and real variables, usually take the form of equivalent data objects at the machine level;
  4088. whereas data structures, such as arrays and strings, are represented by several words of machine memory.
  4089. The strategies that can be used to allocate storage to the data objects are determined by the rules defining the scope
  4090. and duration of the names in the programming language. The simplest strategy is static allocation, which is used in
  4091. languages like FORTRAN. With static allocation, it is possible to determine the run-time size and relative position of
  4092. each data object during compilation. A more-complex strategy for dynamic memory allocation that involves stacks is
  4093. required for languages that support recursion: an entry to a new block or procedure causes the allocation of space on
a stack, which is freed on exit from the block or procedure. An even more complex strategy is required for languages that allow the allocation and freeing of memory for some data in a non-nested fashion. This storage space can be allocated and freed arbitrarily from an area called a "heap". Implementations of languages like Pascal and C therefore allow data to be allocated under program control. The run-time organization of the memory will be as shown in
  4098. Figure 8.1.
  4099. Figure 8.1: Heap memory storage allows program-controlled data allocation.
The run-time storage is subdivided to hold the generated target code, the statically allocated data objects, and the stack and heap. The sizes of the stack and heap can change as the program executes.
  4102. 8.2 ACTIVATION OF THE PROCEDURE AND THE ACTIVATION RECORD
  4103. Each execution of a procedure is referred to as an activation of the procedure. This is different from the procedure
  4104. definition, which in its simplest form is the association of an identifier with a statement; the identifier is the name of the
  4105. procedure, and the statement is the body of the procedure.
  4106. If a procedure is non-recursive, then there exists only one activation of procedure at any one time. Whereas if a
  4107. procedure is recursive, several activations of that procedure may be active at the same time. The information needed
  4108. by a single execution or a single activation of a procedure is managed using a contiguous block of storage called an
  4109. "activation record" or fiactivation framefl consisting of the collection of fields. (Very often, registers take the place of
  4110. one or more of the fields in the activation record.) The activation record contains the following information:
1. Temporary values, such as those arising during the evaluation of expressions.
2. Local data of the procedure.
3. Information about the machine state (i.e., the machine status) just before the procedure is called, including the PC value and the values of those registers that must be restored when control returns after the procedure.
4. Access links (optional) referring to non-local data held in other activation records. This is not required for a language like FORTRAN, because non-local data is kept in a fixed place, but it is required for Pascal.
5. Actual parameters (i.e., the parameters supplied to the called procedure). These parameters may also be passed in machine registers for greater efficiency.
6. The return value used by the called procedure to return a value to the calling procedure. Again, for greater efficiency, a machine register may be used for returning values.
  4127. The size of almost all of the fields of the activation record can be determined at compile time. An exception is if a
  4128. called procedure has a local array whose size is determined by the values of the actual parameters.
  4129. The information in the activation record is organized in a manner that enables easy access at execution time. A pointer
  4130. to the activation record is required. This pointer is called the current environment pointer (CEP), and it points to one of
  4131. the fixed fields in the activation record. Using the proper offset from this pointer, and depending upon the format of the
  4132. activation record, the contents of the activation record can be accessed. Figure 8.2 shows the organization of
  4133. information in a typical activation record.
  4134. Figure 8.2: Typical format of an activation record.
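A C sketch of one possible layout (the sizes, types, and field order are illustrative only; real layouts depend on the language and the target machine):

#define MAX_PARAMS 8
#define MAX_LOCALS 16
#define MAX_TEMPS  16

/* one activation record, laid out roughly as in Figure 8.2 */
struct activation_record {
    int   return_value;              /* value returned to the caller        */
    int   actuals[MAX_PARAMS];       /* actual parameters                   */
    void *access_link;               /* optional link to non-local data     */
    void *saved_machine_state;       /* saved PC, registers, old CEP value  */
    int   locals[MAX_LOCALS];        /* local data of the procedure         */
    int   temps[MAX_TEMPS];          /* temporaries used in expressions     */
};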
  4135. 8.3 STATIC ALLOCATION
  4136. In static allocation, the names are bound to specific storage locations as the program is compiled. These storage
  4137. locations cannot be changed during the program's execution. Since the binding does not change at run time, every
  4138. time a procedure is called, its names are bound to the same storage locations. Hence, if the local names are allocated
  4139. statically, then their values will be retained throughout the activation of a procedure. The compiler uses the name type
to determine the amount of storage to set aside for that name. The address of this storage consists of an offset from the end of the activation record for the procedure. The compiler must decide where the activation records go relative to the
  4142. target code and relative to other activation records. Once this decision is made, the storage position for each name in
  4143. the record is fixed. Therefore, at compile time, it is possible to fill in both the address at which the target code can find
  4144. the data and the address at which information is saved. However, there are some limitations to using static allocation:
1. The size of the data object and any constraints on its position in memory must be known at compile time.
2. Recursive procedures cannot be permitted, because all activations of a procedure use the same binding for local names (a small illustration of this problem follows the list).
3. Data structures cannot be created dynamically, since there is no mechanism for storage allocation at run time.
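To see why limitation 2 matters, consider a small recursive C function. Under purely static allocation, the single cell bound to n and the single saved return point would be overwritten by each nested activation. The fragment is only an illustration of the problem, not an example taken from the text:

#include <stdio.h>

/* Each recursive activation needs its own copy of n and its own return point;
 * if n were bound to one static location, the inner call would overwrite the
 * outer call's value, so a stack of activation records is required.          */
static long fact(long n)
{
    if (n <= 1)
        return 1;
    return n * fact(n - 1);    /* a nested activation of the same procedure */
}

int main(void)
{
    printf("%ld\n", fact(5));  /* prints 120 */
    return 0;
}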
  4154. 8.4 STACK ALLOCATION
In stack allocation, storage is organized as a stack, and activation records are pushed and popped as activations of procedures begin and end, respectively, thereby permitting recursive procedures. The storage for the locals in each procedure call is contained in the activation record for that call. Hence, the locals are bound to fresh storage in each activation, because a new activation record is pushed onto the stack when a call is made; the values of the locals are discarded when the activation ends.
  4160. 8.4.1 The Call and Return Sequence
  4161. Procedure calls are implemented by generating what is called a "call sequence and return sequence" in the target
  4162. code. The job of a call sequence is to set up an activation record. Setting up an activation record means entering the
  4163. information into the fields of the activation record if the storage for the activation record is allocated statically. When
  4164. the storage for the activation record is allocated dynamically, storage is allocated for it on the stack, and the
  4165. information is entered in its fields.
  4166. On the other hand, the job of a return sequence is to restore the state of machine so that the machine's calling
  4167. procedure can continue executing. This also involves destroying the activation record if it was allocated dynamically on
  4168. the stack.
  4169. The code in a call sequence is often divided between the caller and the callee. But there is no exact division of
  4170. run-time tasks between the caller and callee. It depends on the source language, the target machine, and the
  4171. operating system. Hence, even when using a common language, the call sequence may differ from implementation to
implementation. It is desirable, however, to put as much of the calling sequence as possible into the callee, because a procedure may be called from many places: code placed in the caller is duplicated at every call site, whereas code placed in the callee is generated only once and is shared by all of the calls.
  4175. Figure 8.3 shows the format of a typical activation record. Here, the contents of the activation record are accessed
  4176. using the CEP pointer.
  4177. Figure 8.3: The CEP pointer is used to access the contents of the activation record.
  4178. The stack is assumed to be growing from higher to lower addresses. A positive offset will be used to access the
  4179. contents of the activation record when we want to go in a direction opposite to that of the growth of the stack (in Figure
  4180. 8.3, the field pointed to by the CEP). A negative offset will be used to access the contents of the activation record
  4181. when we want to go in the same direction as the growth of stack. A typical call sequence for caller code to evaluate
  4182. parameters is as follows:
  4183. push ( ) /* for return value
  4184. push (T 1 ) /* T 1 is holding the first argument
  4185. push (T 2 ) /* T 2 is holding the second argument
  4186. .
  4187. .
  4188. .
  4189. push (T n ) /* T n is holding the nth argument
  4190. push (n) /* n is the count of arguments
  4191. push (return address)
  4192. push (CEP)
  4193. goto start of code segment of callee
  4194. A typical callee code segment is shown in Figure 8.4.
Figure 8.4: Typical callee code segment (the call sequence, followed by the object code of the callee, followed by the return sequence).
A typical call sequence in the callee will be:
CEP = top
/* code for pushing the local data of the callee */
And a typical return sequence is:
top = CEP + 1
l = *top              /* retrieve the return address */
top = top + 1
CEP = *CEP            /* reset CEP to point to the activation record of the caller */
top = top + *top + 2  /* reset top to point to the top of the caller's activation record */
goto l
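The effect of these sequences can be simulated with an ordinary array standing in for the run-time stack. The sketch below is a simplification under stated assumptions: the stack is modelled as an array growing upward, the field order follows the push order shown above, and all names (call_sequence, return_sequence, and so on) are invented for the illustration; it is not the code a compiler would emit.

#include <stdio.h>

#define STACK_SIZE 256
static long stack_area[STACK_SIZE];
static int  top = -1;   /* index of the top-most used slot          */
static int  cep = -1;   /* current environment pointer (an index)   */

static void push(long v) { stack_area[++top] = v; }

/* Caller side: build the callee's activation record. */
static void call_sequence(const long args[], int n, long return_addr)
{
    push(0);                         /* slot for the return value     */
    for (int i = 0; i < n; i++)
        push(args[i]);               /* actual parameters             */
    push(n);                         /* count of arguments            */
    push(return_addr);               /* saved return address          */
    push(cep);                       /* saved CEP of the caller       */
    cep = top;                       /* callee: CEP = top             */
}

/* Callee side: tear the activation record down again. */
static long return_sequence(void)
{
    long return_addr = stack_area[cep - 1];       /* retrieve the return address  */
    int  n           = (int)stack_area[cep - 2];  /* argument count               */
    int  old_cep     = (int)stack_area[cep];      /* caller's activation record   */
    top = cep - n - 4;                            /* pop the whole record         */
    cep = old_cep;
    return return_addr;
}

int main(void)
{
    long args[2] = { 10, 20 };
    call_sequence(args, 2, 42);
    printf("first arg = %ld, second arg = %ld\n",
           stack_area[cep - 4], stack_area[cep - 3]);
    printf("returning to address %ld\n", return_sequence());
    return 0;
}

The book's version grows the stack toward lower addresses, so its offsets have the opposite sign; the structure of the two sequences is the same.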
  4209. 8.4.2 Access to Nonlocal Names
  4210. The way that the nonlocals are accessed depends on the scope rules of the language (see Chapter 7). There are two
  4211. different types of scope rules: static scope rules and dynamic scope rules.
Static scope rules determine which declaration a name's reference will be associated with by examining the text of the program, thereby determining from where the name's value will be obtained at run time. When static scope rules are used, the compiler knows at compile time how declarations are bound to name references and, hence, from where their values will be obtained at run time; what the compiler has to do is make provision for retrieving the value of a nonlocal name when it is accessed at run time.
  4217. Whereas when dynamic scope rules are used, the values of nonlocal names are retrieved at run time by scanning
  4218. down the stack, starting at the top-most activation record. The rule for associating a nonlocal reference to a declaration
  4219. is simple when procedure nesting is not permitted. In the absence of nested procedures, the storage for all names
  4220. declared outside any procedure can be allocated statically. The position of this storage is known at compile time, so if
a name is nonlocal in some procedure's body, its statically determined address is used; whereas if a name is local, it is accessed via the CEP pointer using a suitable offset.
An important benefit of static allocation for nonlocals is that declared procedures can be freely passed as parameters and returned as results. For example, a function in C is passed by address; that is, a pointer to it is passed. When the
  4225. procedures are nested, declarations are bound to name references according to the following rule: if a name x is not
  4226. declared in a procedure P, then an occurrence of x in P is in the scope of a declaration of x in an enclosing procedure
P1 such that:
1. the enclosing procedure P1 has a declaration of x, and
2. P1 is more closely nested around P than any other procedure with a declaration of x.
  4230. Therefore, a reference to a nonlocal name x is resolved by associating it with the declaration of x in P 1 , and the
  4231. compiler is required to provision getting the value of x at run time from the most-recent activation record of P 1 by
  4232. generating a suitable call sequence.
One of the ways to implement this is to add a pointer, called an "access link," to each activation record. If a procedure P is nested immediately within Q in the source text, then the access link in an activation record of P is made to point to the most recent activation record of Q. This requires an activation record with a format like that shown in Figure 8.5.
  4237. Figure 8.5: An activation record that deals with nonlocal name references.
  4238. The modified call and return sequence, required for setting up the activation record shown in Figure 8.5, is:
  4239. push ( ) /* for return value
  4240. push (T 1 ) /* T 1 is holding the first argument
  4241. push (T 2 ) /* T 2 is holding the second argument
  4242. .
  4243. .
  4244. .
  4245. push (T n ) /* T n is holding the nth argument
  4246. push(n) /* n is the count of arguments
  4247. push (return address)
  4248. push (CEP)
  4249. code to set up access link
  4250. goto start of code segment of callee
  4251. A typical callee segment is shown in Figure 8.6.
Figure 8.6: A typical callee segment (the call sequence, followed by the object code of the callee, followed by the return sequence).
A typical call sequence in the callee is:
CEP = top + 1
/* code for pushing the local data of the callee */
A typical return sequence is:
top = CEP + 1
l = *top              /* retrieve the return address */
top = top + 1
CEP = *CEP            /* reset CEP to point to the activation record of the caller */
top = top + *top + 2  /* reset top to point to the top of the caller's activation record */
goto l
  4264. 8.4.3 Setting Up the Access Link
  4265. To generate the code for setting up the access link, a compiler makes use of the following information: the nesting
depth of the caller procedure and the nesting depth of the callee procedure. A procedure's nesting depth is a number obtained by starting with a value of one for the main program and adding one to it every time we go from an enclosing to an enclosed procedure. This number is basically a count of how many procedures there are in the referencing environment of the procedure.
  4270. Suppose that procedure p at a nesting depth Np calls a procedure at nesting depth Nq. Then the access link in the
  4271. activation record of procedure q is set up as follows:
  4272. if (Nq > Np) then
  4273. The access link in the activation record of procedure q is set to point to the activation record of procedure p.
  4274. else
if (Nq = Np) then
Copy the access link in the activation record of procedure p into the activation record of procedure q.
else
if (Nq < Np) then
Follow (Np − Nq) access links from the activation record of procedure p to reach an activation record, and copy the access link of that activation record into the activation record of procedure q.
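The rule above can be rendered directly in code. In the sketch below, activation records are modelled as heap objects with an explicit access field; the names AR and set_access_link and the three-procedure example are all assumptions introduced for the illustration.

#include <stdio.h>

struct AR {
    const char *proc;     /* procedure the record belongs to */
    struct AR  *access;   /* access link                     */
};

/* Set up the access link of callee q (nesting depth nq),
 * called from caller p (nesting depth np).                 */
static void set_access_link(struct AR *q, int nq, struct AR *p, int np)
{
    if (nq > np) {                          /* Nq > Np: point at the caller's record   */
        q->access = p;
    } else {                                /* Nq = Np: copy the caller's access link; */
        struct AR *a = p->access;           /* Nq < Np: follow Np - Nq links and copy  */
        for (int i = 0; i < np - nq; i++)   /* that record's access link               */
            a = a->access;
        q->access = a;
    }
}

int main(void)
{
    /* main (depth 1) encloses P (depth 2), which encloses Q (depth 3). */
    struct AR m = { "main", NULL };
    struct AR p = { "P",    &m   };    /* activation of P, called from main */
    struct AR q = { "Q",    NULL };

    set_access_link(&q, 3, &p, 2);     /* P calls Q: Nq > Np                */
    printf("Q's access link points to %s\n", q.access->proc);
    return 0;
}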
  4281. The Block Statement
A block is a statement that contains its own local data declarations. Blocks can either be disjoint (for example, B1 begin ... B1 end followed by B2 begin ... B2 end) or nested (for example, B2 begin ... B2 end appearing inside B1 begin ... B1 end). This nesting property is sometimes called "block structure". The scope of a declaration in a block-structured language is given by the most closely nested rule:
1. The scope of a declaration in a block B includes B.
2. If a name X is not declared in a block B, then an occurrence of X in B is in the scope of a declaration of X in an enclosing block B′, such that:
a. B′ has a declaration of X, and
b. B′ is more closely nested around B than any other block with a declaration of X.
  4294. Block structure can be implemented using stack allocation. Space is allocated for declared names. The block is
  4295. entered by pushing an activation record, and it is de-allocated when control leaves the block and the activation record
  4296. is destroyed. That is, a block is treated like a parameter-less procedure, called only at the entry to the block and
  4297. returned upon exit from the block. An alternative is to allocate storage for a complete procedure body at one time. If
  4298. there are blocks within the procedure, then an allowance is made for the storage needed by the declarations within the
  4299. block, as shown in Figure 8.7. For example, consider the following program structure:
main ()
{
    int a;
    {
        int b;
        {
            int c;
            printf("%d %d\n", b, c);
        }
        {
            int d;
            printf("%d %d\n", b, d);
        }
    }
    printf("%d\n", a);
}
Figure 8.7: Storage for declared names (first a, then b, then c and d, with c and d sharing the same storage).
  4320. Chapter 9: Error Handling
  4321. 9.1 ERROR RECOVERY
One of the important tasks that a compiler must perform is the detection of and recovery from errors. Recovery from errors is important, because it allows the compiler to continue scanning and compiling the rest of the program even in the presence of errors, so that as many errors as possible are detected in a single run.
  4325. Every phase of a compilation expects the input to be in a particular format, and whenever that input is not in the
  4326. required format, an error is returned. When detecting an error, a compiler scans some of the tokens that are ahead of
  4327. the error's point of occurrence. The fewer the number of tokens that must be scanned ahead of the point of error
  4328. occurrence, the better the compiler's error-detection capability. For example, consider the following statement:
  4329. if a = b then x: = y + z;
  4330. The error in the above statement will be detected in the syntactic analysis phase, but not before the syntax analyzer
  4331. sees the token "then"; but the first token, itself, is in error.
  4332. After detecting an error, the first thing that a compiler is supposed to do is to report the error by producing a suitable
  4333. diagnostic. A good error diagnostic should possess the following properties.
1. The message should be produced in terms of the original source program rather than in terms of some internal representation of the source program. For example, the message should be produced along with the line numbers of the source program.
2. The error message should be easy for the user to understand.
3. The error message should be specific and should localize the problem. For example, an error message should read, "x is not declared in function fun," and not just, "missing declaration".
4. The message should not be redundant; that is, the same message should not be produced again and again.
Therefore, a compiler should report errors by generating messages with the above properties. The errors captured by the compiler can be classified as either syntactic errors or semantic errors. Syntactic errors are those detected in the lexical or syntactic analysis phase by the compiler; semantic errors are those detected during the semantic analysis phase.
  4349. 9.2 RECOVERY FROM LEXICAL PHASE ERRORS
  4350. The lexical analyzer detects an error when it discovers that an input's prefix does not fit the specification of any token
  4351. class. After detecting an error, the lexical analyzer can invoke an error recovery routine. This can entail a variety of
  4352. remedial actions.
The simplest possible error recovery is to skip the erroneous characters until the lexical analyzer finds another token. But the resulting deletion is likely to confuse the parser, which can cause severe difficulties in the syntax-analysis and remaining phases. One way the parser can help the lexical analyzer improve its ability to recover from errors is to make its list of legitimate tokens (in the current context) available to the error-recovery routine. The error-recovery routine can then decide whether a remaining input's prefix matches one of these tokens closely enough to be treated
  4358. as that token.
  4359. 9.3 RECOVERY FROM SYNTACTIC PHASE ERRORS
A parser detects an error when it has no legal move from its current configuration. The LL(1) and LR(1) parsers have the valid-prefix property; therefore, they are capable of announcing an error as soon as they read an input symbol that is not a valid continuation of the prefix of the input read so far. This is the earliest time that a left-to-right parser can announce an error. But there are a variety of other types of parsers that do not necessarily have this property.
The advantage of using a parser with the valid-prefix property is that it reports an error as soon as possible, and it minimizes the amount of erroneous output passed to subsequent phases of the compiler.
  4366. Panic Mode Recovery
Panic mode recovery is an error recovery method that can be used in any kind of parsing, because it depends very little on the particular parsing technique used. In panic mode recovery, a parser discards input symbols until a statement delimiter, such as a semicolon or an end, is encountered. The parser then deletes stack entries until it finds an entry that will allow it to continue parsing, given the synchronizing token on the input. This method is simple to implement, and it never gets into an infinite loop.
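A minimal sketch of the input-discarding part of this idea is shown below, assuming the tokens are already in an array and that the semicolon and end are the synchronizing delimiters; the token names and the function panic_recover are invented for the illustration.

#include <stdio.h>

enum { TOK_ID, TOK_NUM, TOK_OP, TOK_SEMI, TOK_END, TOK_EOF };

static int is_sync(int tok)
{
    return tok == TOK_SEMI || tok == TOK_END || tok == TOK_EOF;
}

/* Discard input symbols until a synchronizing token is found. */
static int panic_recover(const int tokens[], int pos)
{
    while (!is_sync(tokens[pos]))
        pos++;                   /* throw the erroneous symbols away */
    return pos;                  /* parsing resumes from here        */
}

int main(void)
{
    /* id op op num ; id  -- suppose the error is detected at index 2 */
    int tokens[] = { TOK_ID, TOK_OP, TOK_OP, TOK_NUM, TOK_SEMI, TOK_ID, TOK_EOF };
    printf("resume parsing at token index %d\n", panic_recover(tokens, 2));
    return 0;
}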
  4372. 9.4 ERROR RECOVERY IN LR PARSING
  4373. A systematic method for error recovery in LR parsing is to scan down the stack until a state S with a goto on a
  4374. particular nonterminal A is found, and then discard zero or more input symbols until a symbol a is found that can
  4375. legitimately follow A. The parser then shifts the state goto [S, A] on the stack and resumes normal parsing.
  4376. There might be more than one choice for the nonterminal A. Normally, these would be nonterminals representing
  4377. major program pieces, such as statements.
  4378. Another method of error recovery that can be implemented is called "phrase level recovery". Each error entry in the LR
  4379. parsing table is examined, and, based on language usage, an appropriate error-recovery procedure is constructed.
For example, to recover from an erroneous construct that starts with an operator, the error-recovery routine will push an
  4381. imaginary id onto the stack and cover it with the appropriate state. While doing this, the error entries in a particular
  4382. state that call for a particular reduction on some input symbols are replaced by that reduction. This has the effect of
  4383. postponing the error detection until one or more reductions are made; but the error will still be caught before a shift.
A phrase level error-recovery implementation for an LR parser is shown below. The parsing table is constructed for the following grammar:
1. E → E + E
2. E → E * E
3. E → id
The SLR parsing table for the above grammar is shown in Table 9.1.
Table 9.1: Parsing Table for E → E + E | E * E | id

        id      +        *        $         E
I0      S2                                  1
I1              S3       S4       Accept
I2              R3       R3       R3
I3      S2                                  5
I4      S2                                  6
I5              S3/R1    S4/R1    R1
I6              S3/R2    S4/R2    R2
The conflicts are resolved by giving higher precedence to * and using left-associativity, as shown in Table 9.2.
Table 9.2: Higher-Precedence * and Left-Associativity

        id      +        *        $         E
I0      S2                                  1
I1              S3       S4       Accept
I2              R3       R3       R3
I3      S2                                  5
I4      S2                                  6
I5              R1       S4       R1
I6              R2       R2       R2
The parsing table with error routines is shown in Table 9.3. Routine e1 is called from states I0, I3, and I4; it pushes an imaginary id onto the stack and covers it with state I2. Routine e2 is called from state I1; it pushes + onto the stack and covers it with state I3.
Table 9.3: Parsing Table with Error Routines

        id      +        *        $         E
I0      S2      e1       e1       e1        1
I1      e2      S3       S4       Accept
I2      R3      R3       R3       R3
I3      S2      e1       e1       e1        5
I4      S2      e1       e1       e1        6
I5      R1      R1       S4       R1
I6      R2      R2       R2       R2
  4431. For example, if we trace the behavior of the parser described above for the input id + *id $:
Stack Contents                   Unspent Input   Moves
$I0                              id+*id$         shift and enter into state 2
$I0 id I2                        +*id$           reduce by production number 3
$I0 E I1                         +*id$           shift and enter into state 3
$I0 E I1 + I3                    *id$            call error routine e1 (e1 pushes id I2)
$I0 E I1 + I3 id I2              *id$            reduce by production number 3
$I0 E I1 + I3 E I5               *id$            shift and enter into state 4
$I0 E I1 + I3 E I5 * I4          id$             shift and enter into state 2
$I0 E I1 + I3 E I5 * I4 id I2    $               reduce by production number 3
$I0 E I1 + I3 E I5 * I4 E I6     $               reduce by production number 2
$I0 E I1 + I3 E I5               $               reduce by production number 1
$I0 E I1                         $               accept
  4446. Similarly, if we trace the behavior of the parser for the input id id*id $:
Stack Contents                   Unspent Input   Moves
$I0                              id id*id$       shift and enter into state 2
$I0 id I2                        id*id$          reduce by production number 3
$I0 E I1                         id*id$          call error routine e2 (e2 pushes + I3)
$I0 E I1 + I3                    id*id$          shift and enter into state 2
$I0 E I1 + I3 id I2              *id$            reduce by production number 3
$I0 E I1 + I3 E I5               *id$            shift and enter into state 4
$I0 E I1 + I3 E I5 * I4          id$             shift and enter into state 2
$I0 E I1 + I3 E I5 * I4 id I2    $               reduce by production number 3
$I0 E I1 + I3 E I5 * I4 E I6     $               reduce by production number 2
$I0 E I1 + I3 E I5               $               reduce by production number 1
$I0 E I1                         $               accept
  4460. 9.5 AUTOMATIC ERROR RECOVERY IN YACC
The tool YACC can generate a parser with the ability to recover from errors automatically. Major nonterminals, such as those for program blocks or statements, are identified; and then error productions of the form A → error α are added to the grammar, where α is usually ∈.
When a YACC-generated parser encounters an error, it finds the top-most state on its stack whose underlying set of items includes an item of the form A → .error α. The parser then shifts the token error, and a reduction to A is immediately possible. The parser then invokes the semantic action associated with the production A → error α, and this semantic action takes care of recovering from the error.
  4468. 9.6 PREDICTIVE PARSING ERROR RECOVERY
An error is detected during predictive parsing when the terminal on top of the stack does not match the next input symbol, or when a nonterminal A is on top of the stack, a is the next input symbol, and the parsing table entry M[A, a] is an error entry. Panic mode recovery can be used to recover from an error detected by the LL parser. The effectiveness
  4472. of panic mode recovery depends on the choice of the synchronizing token. Several heuristics can be used when
  4473. selecting the synchronizing token in order to ensure quick recovery from common errors:
1. All the symbols in FOLLOW(A) should be kept in the set of synchronizing tokens, because if we skip input until a symbol in FOLLOW(A) is read and then pop A from the stack, it is likely that parsing can continue.
2. Since the syntactic structure of a language is very often hierarchical, we add the symbols that begin higher constructs to the synchronizing set of lower constructs. For example, we add keywords to the synchronizing sets of the nonterminals that generate expressions.
3. We also add the symbols in FIRST(A) to the synchronizing set of nonterminal A. This provides for a resumption of parsing according to A if a symbol in FIRST(A) appears in the input.
4. A derivation by an ∈-production can be used as a default. Error detection will be postponed, but the error will still be captured. This method reduces the number of nonterminals that must be considered during error recovery.
Note Another method of error recovery that can be implemented is called "phrase level recovery". In phrase level recovery, each error entry in the LL parsing table is examined, and, based on language usage, an appropriate error-recovery procedure is constructed. For example, to recover from an erroneous construct that starts with an operator, the error-recovery routine will insert an imaginary id into the input. And where the nonterminal on top of the stack can derive ∈, the error entries for that nonterminal are replaced by the derivation using the ∈-production. This has the effect of postponing error detection.
A phrase level error-recovery implementation for an LL parser is shown in Tables 9.4 and 9.5. The parsing table is constructed for the following grammar:
E → TE1
E1 → +TE1 | ∈
T → FT1
T1 → *FT1 | ∈
F → id
Table 9.4: LL Parsing Table

        id           +            *            $
E       E → TE1
T       T → FT1
F       F → id
E1                   E1 → +TE1                 E1 → ∈
T1                   T1 → ∈       T1 → *FT1    T1 → ∈
id      pop
+                    pop
*                                 pop
$                                              accept
  4517. The modified table is shown in Table 9.5. Routine e 1 , when called, pushes an imaginary id into the input; and routine
  4518. e 2 , when called, removes all the remaining symbols from the input.
Table 9.5: Phrase Level Error-Recovery Implementation

        id           +            *            $
E       E → TE1      e1           e1           e1
T       T → FT1      e1           e1           e1
F       F → id       e1           e1           e1
E1      E1 → ∈       E1 → +TE1    E1 → ∈       E1 → ∈
T1      T1 → ∈       T1 → ∈       T1 → *FT1    T1 → ∈
id      pop
+                    pop
*                                 pop
$       e2           e2           e2           accept
  4539. For example, if we trace the behavior of the parser shown in Table 9.5 for the input id + *id $:
Stack Contents   Unspent Input   Moves
$E               id+*id$         derive using E → TE1
$E1 T            id+*id$         derive using T → FT1
$E1 T1 F         id+*id$         derive using F → id
$E1 T1 id        id+*id$         pop
$E1 T1           +*id$           derive using T1 → ∈
$E1              +*id$           derive using E1 → +TE1
$E1 T +          +*id$           pop
$E1 T            *id$            call error routine e1
$E1 T            id*id$          derive using T → FT1 (the imaginary id was pushed onto the input by e1)
$E1 T1 F         id*id$          derive using F → id
$E1 T1 id        id*id$          pop
$E1 T1           *id$            derive using T1 → *FT1
$E1 T1 F         id$             derive using F → id
$E1 T1 id        id$             pop
$E1 T1           $               derive using T1 → ∈
$E1              $               derive using E1 → ∈
$                $               accept
  4571. Similarly, if we trace the behavior for the input id id*id $:
Stack Contents   Unspent Input   Moves
$E               id id*id$       derive using E → TE1
$E1 T            id id*id$       derive using T → FT1
$E1 T1 F         id id*id$       derive using F → id
$E1 T1 id        id id*id$       pop
$E1 T1           id*id$          derive using T1 → ∈
$E1              id*id$          derive using E1 → ∈
$                id*id$          call error routine e2 (id*id$ is removed from the input by e2)
$                $               accept
  4587. 9.7 RECOVERY FROM SEMANTIC ERRORS
  4588. The primary sources of semantic errors are undeclared names and type incompatibilities. Recovery from an
undeclared name is rather straightforward. The first time the undeclared name is encountered, an entry can be made in the symbol table for that name with an attribute that is appropriate to the current context. For example, if a missing-declaration error for x is encountered, then the error-recovery routine enters the appropriate attribute for x in x's symbol table entry, depending on the current context of x. A flag is then set in x's symbol table record to indicate that the attribute was added in order to recover from an error, and not in response to a declaration of x.
  4594. Chapter 10: Code Optimization
  4595. 10.1 INTRODUCTION TO CODE OPTIMIZATION
The translation of a source program to an object program is basically a one-to-many mapping; that is, there are many object programs for the same source program, all of which implement the same computations. Some of these object programs may be better than others when it comes to storage requirements and execution speed. Code optimization refers to techniques a compiler can employ in order to produce an improved
  4600. object code for a given source program.
  4601. How beneficial the optimization is depends upon the situation. For a program that is only expected to be run a few
  4602. times, and which will then be discarded, no optimization is necessary. Whereas if a program is expected to run
  4603. indefinitely, or if it is expected to run many times, then optimization is useful, because the effort spent on improving the
  4604. program's execution time will be paid back, even if execution time is only reduced by a small percentage.
  4605. What follows are some optimization techniques that are useful when designing optimizing compilers.
  4606. 10.2 WHAT IS CODE OPTIMIZATION?
  4607. Code optimization refers to the techniques used by the compiler to improve the execution efficiency of the generated
  4608. object code. It involves a complex analysis of the intermediate code and the performance of various transformations;
  4609. but every optimizing transformation must also preserve the semantics of the program. That is, a compiler should not
  4610. attempt any optimization that would lead to a change in the program's semantics.
  4611. Optimization can be machine-independent or machine-dependent. Machine-independent optimizations can be
  4612. performed independently of the target machine for which the compiler is generating code; that is, the optimizations are
  4613. not tied to the target machine's specific platform or language. Examples of machine-independent optimizations are:
  4614. elimination of loop invariant computation, induction variable elimination, and elimination of common subexpressions.
  4615. On the other hand, machine-dependent optimization requires knowledge of the target machine. An attempt to generate
  4616. object code that will utilize the target machine's registers more efficiently is an example of machine-dependent code
  4617. optimization. Actually, code optimization is a misnomer; even after performing various optimizing transformations,
  4618. there is no guarantee that the generated object code will be optimal. Hence, we are actually performing code
  4619. improvement. When attempting any optimizing transformation, the following criteria should be applied:
1. The optimization should capture most of the potential improvements without an unreasonable amount of effort.
2. The optimization should be such that the meaning of the source program is preserved.
3. The optimization should, on average, reduce the time and space expended by the object code.
  4625. 10.3 LOOP OPTIMIZATION
  4626. Loop optimization is the most valuable machine-independent optimization because a program's inner loops are good
  4627. candidates for improvement. The important loop optimizations are elimination of loop invariant computations and
  4628. elimination of induction variables. A loop invariant computation is one that computes the same value every time a loop
  4629. is executed. Therefore, moving such a computation outside the loop leads to a reduction in the execution time.
Induction variables are variables used in a loop whose values change in lock-step with one another on every iteration; hence, it may be possible to eliminate all of them except one.
  4632. 10.3.1 Eliminating Loop Invariant Computations
To eliminate loop invariant computations, we first identify the invariant computations and then move them outside the loop if the move does not lead to a change in the program's meaning. Identification of loop invariant computations requires the detection of loops in the program. Whether a loop exists in the program or not depends on the program's control flow, which therefore requires a control flow analysis. For loop detection, a graphical representation called a "program flow graph" is used; it shows how control flows in the program. To obtain such a graph, we must partition the intermediate code into basic blocks. This requires identifying leader statements,
  4639. which are defined as follows:
1. The first statement is a leader statement.
2. The target of a conditional or unconditional goto is a leader.
3. A statement that immediately follows a conditional goto is a leader.
  4643. A basic block is a sequence of three-address statements that can be entered only at the beginning, and control ends
  4644. after the execution of the last statement, without a halt or any possibility of branching, except at the end.
  4645. 10.3.2 Algorithm to Partition Three-Address Code into Basic Blocks
  4646. To partition three-address code into basic blocks, we must identify the leader statements in the three-address code
  4647. and then include all the statements, starting from a leader, and up to, but not including, the next leader. The basic
  4648. blocks into which the three-address code is partitioned constitute the nodes or vertices of the program flow graph. The
  4649. edges in the flow graph are decided as follows. If B1 and B2 are the two blocks, then add an edge from B1 to B2 in the
  4650. program flow graph, if the block B2 follows B1 in an execution sequence. The block B2 follows B1 in an execution
  4651. sequence if and only if:
1. The first statement of block B2 immediately follows the last statement of block B1 in the three-address code, and the last statement of block B1 is not an unconditional goto statement.
2. The last statement of block B1 is either a conditional or unconditional goto statement, and the first statement of block B2 is the target of the last statement of block B1.
  4658. For example, consider the following program fragment:
  4659. Fact(x)
  4660. {
  4661. int f = 1;
  4662. for(i = 2; i<=x; i++)
  4663. f = f*i;
  4664. return(f);
  4665. }
  4666. The three-address-code representation for the program fragment above is:
1. f = 1
2. i = 2
3. if i <= x goto(8)
4. f = f * i
5. t1 = i + 1
6. i = t1
7. goto(3)
8. goto calling program
  4675. The leader statements are:
  4676. Statement number 1, because it is the first statement.
  4677. Statement number 3, because it is the target of a goto.
  4678. Statement number 4, because it immediately follows a conditional goto statement.
  4679. Statement number 8, because it is a target of a conditional goto statement.
  4680. Therefore, the basic blocks into which the above code can be partitioned are as follows, and the program flow graph is
  4681. shown in Figure 10.1.
Block B1: statements 1-2
Block B2: statement 3
Block B3: statements 4-7
Block B4: statement 8
  4686. Figure 10.1: Program flow graph.
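The partitioning just carried out by hand can also be expressed as a short program. The sketch below encodes only the goto structure of the eight statements above (which statement each goto targets and whether it is conditional) and applies the three leader rules; the array names and the encoding are assumptions made for the illustration.

#include <stdio.h>

#define N 8   /* statements 1..8 of the three-address code above */

/* goto_target[i] = statement that statement i+1 jumps to (0 if none);
 * is_cond[i] = 1 if that jump is conditional.                         */
static const int goto_target[N] = { 0, 0, 8, 0, 0, 0, 3, 0 };
static const int is_cond[N]     = { 0, 0, 1, 0, 0, 0, 0, 0 };

int main(void)
{
    int leader[N + 2] = { 0 };

    leader[1] = 1;                                /* rule 1: first statement        */
    for (int i = 1; i <= N; i++) {
        if (goto_target[i - 1]) {
            leader[goto_target[i - 1]] = 1;       /* rule 2: target of a goto       */
            if (is_cond[i - 1] && i < N)
                leader[i + 1] = 1;                /* rule 3: follows a conditional  */
        }
    }

    int block = 0;
    for (int i = 1; i <= N; i++) {                /* a block runs from one leader   */
        if (leader[i]) {                          /* up to, not including, the next */
            if (block) printf("\n");
            printf("Block B%d: statements %d", ++block, i);
        } else {
            printf(", %d", i);
        }
    }
    printf("\n");
    return 0;
}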
  4687. 10.3.3 Loop Detection
  4688. A loop is a cycle in the flow graph that satisfies two properties:
1. It should have a single entry node or header, so that it will be possible to move all of the loop invariant computations to a unique place, called a "preheader," which is a block/node placed outside the loop, just in front of the header.
2. It should be strongly connected; that is, it should be possible to go from any node of the loop to any other node while staying within the loop. This is required because at least some part of a loop is executed repeatedly.
If the flow graph contains one or more back edges, then one or more loops/cycles exist in the program. Therefore, we must identify any back edges in the flow graph.
  4699. 10.3.4 Identification of the Back Edges
To identify the back edges in the flow graph, we compute the dominators of every node of the program flow graph. A node a is a dominator of node b if every path starting at the initial node of the graph and reaching node b goes through a. For example, consider the flow graph in Figure 10.2. In this flow graph, the only dominator of node 3 (other than itself) is node 1, because not all of the paths from node 1 to node 3 go through node 2.
  4704. Figure 10.2: The flow graph back edges are identified by computing the dominators.
  4705. Dominator (dom) relationships have the following properties:
1. They are reflexive; that is, every node dominates itself.
2. They are transitive; that is, if a dom b and b dom c, this implies a dom c.
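Because dominance is reflexive and transitive, the dominator sets can be computed by a simple iterative intersection: the initial node dominates only itself, and every other node is dominated by itself plus the intersection of its predecessors' dominator sets. The sketch below does this with one bit per node on a small diamond-shaped graph invented for the illustration (node 0 is the initial node); an edge a → b is then a back edge exactly when b is in dom(a).

#include <stdio.h>

#define N 4   /* nodes 0..3; 0 is the initial node (illustrative graph)   */
/* pred[i] is a bit mask of the predecessors of node i:
 * the edges are 0->1, 0->2, 1->3, 2->3.                                  */
static const unsigned pred[N] = { 0x0, 0x1, 0x1, 0x6 };

int main(void)
{
    unsigned all = (1u << N) - 1, dom[N];
    dom[0] = 0x1;                       /* the initial node dominates itself only */
    for (int i = 1; i < N; i++)
        dom[i] = all;                   /* start from "every node"                */

    int changed = 1;
    while (changed) {
        changed = 0;
        for (int i = 1; i < N; i++) {
            unsigned meet = all;
            for (int p = 0; p < N; p++)
                if (pred[i] & (1u << p))
                    meet &= dom[p];     /* intersect the predecessors' sets       */
            unsigned d = meet | (1u << i);
            if (d != dom[i]) { dom[i] = d; changed = 1; }
        }
    }

    for (int i = 0; i < N; i++)
        printf("dom(%d) = 0x%x\n", i, dom[i]);   /* e.g. dom(3) = {0,3}           */
    return 0;
}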
  4708. 10.3.5 Reducible Flow Graphs
Several code-optimization transformations are easy to perform on reducible flow graphs. A flow graph G is reducible if and only if we can partition the edges into two disjoint groups, forward edges and back edges, with the following two properties:
1. The forward edges form an acyclic graph in which every node can be reached from the initial node of G.
2. The back edges consist only of edges whose heads dominate their tails.
For example, consider the flow graph shown in Figure 10.3. This flow graph has no back edges, because no edge's head dominates the tail of that edge. Hence, it could have been a reducible graph only if the entire graph had been acyclic; but that is not the case. Therefore, it is not a reducible flow graph.
  4719. Figure 10.3: A flow graph with no back edges.
  4720. After identifying the back edges, if any, the natural loop of every back edge must be identified. The natural loop of a
  4721. back edge a → b is the set of all those nodes that can reach a without going through b, including node b itself.
  4722. Therefore, to find a natural loop of the back edge n → d, we start with node n and add all the predecessors of node n
  4723. to the loop. Then we add the predecessors of the nodes that were just added to the loop; and we continue this process
  4724. until we reach node d. These nodes plus node d constitute the set of all those nodes that can reach node n without
  4725. going through node d. This is the natural loop of the edge n → d. Therefore, the algorithm for detecting the natural loop
  4726. of a back edge is:
  4727. Input : back edge n → d.
  4728. Output: set loop, which is a set of nodes forming the natural
  4729. loop of the back edge n → d.
  4730. main()
  4731. {
loop = { d }   /* initialize by adding node d to the set loop */
  4733. insert(n); /* call a procedure insert with the node n */
  4734. }
  4735. procedure insert(m)
  4736. {
  4737. if m is not in the loop then
  4738. {
  4739. loop = loop ∪ { m }
  4740. for every predecessor p of m do
  4741. insert(p);
  4742. }
  4743. }
For example, in the flow graph shown in Figure 10.1, the only back edge is B3 → B2, and its natural loop comprises the blocks B2 and B3.
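The insert procedure above translates almost line for line into C. In the sketch below the flow graph of Figure 10.1 is encoded by predecessor lists (B1 and B3 are the predecessors of B2, and B2 is the predecessor of both B3 and B4); the array names are assumptions made for the illustration.

#include <stdio.h>

#define NODES 5                           /* blocks B1..B4; index 0 unused       */
static const int pred[NODES][NODES] = {   /* pred[m][p] = 1 if p -> m is an edge */
    {0}, {0},
    { 0, 1, 0, 1, 0 },                    /* predecessors of B2: B1 and B3       */
    { 0, 0, 1, 0, 0 },                    /* predecessor  of B3: B2              */
    { 0, 0, 1, 0, 0 },                    /* predecessor  of B4: B2              */
};
static int in_loop[NODES];                /* in_loop[m] = 1 if m is in the loop  */

static void insert(int m)
{
    if (!in_loop[m]) {
        in_loop[m] = 1;
        for (int p = 1; p < NODES; p++)   /* add every predecessor of m          */
            if (pred[m][p])
                insert(p);
    }
}

int main(void)
{
    int n = 3, d = 2;                     /* back edge B3 -> B2                  */
    in_loop[d] = 1;                       /* loop = { d }                        */
    insert(n);                            /* insert(n)                           */
    printf("natural loop of B%d -> B%d:", n, d);
    for (int m = 1; m < NODES; m++)
        if (in_loop[m])
            printf(" B%d", m);
    printf("\n");                         /* prints: B2 B3                       */
    return 0;
}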
After the natural loops of the back edges are identified, the next task is to identify the loop invariant computations. A three-address statement x = y op z that occurs in a basic block B (a part of the loop) is a loop invariant statement if all possible definitions of y and z that reach this statement are outside the loop, or if y and z are constants, because then the computation y op z will be the same each time the statement is encountered in the loop. Hence, to decide whether the statement x = y op z is loop invariant or not, we must compute the u-d (use-definition) chaining information. The u-d chaining information is computed by doing a global data flow analysis of the flow graph. All of the definitions that are capable of reaching a point immediately before the start of a basic block B are computed, and we call the set of all such definitions IN(B). The set of all the definitions capable of reaching a point immediately after the last statement of block B will be called OUT(B). Computing IN(B) and OUT(B) for every block B also requires GEN(B) and KILL(B), which are defined as:
  4756. GEN(B): The set of all the definitions generated in block B.
  4757. KILL(B): The set of all the definitions outside block B that define the same variables as are defined in
  4758. block B.
  4759. Consider the flow graph in Figure 10.4.
  4760. The GEN and KILL sets for the basic blocks are as shown in Table 10.1.
  4761. Table 10.1: GEN and KILL sets for Figure 10.4 Flow Graph
  4762. Block GEN KILL
  4763. B1 {1,2} {6,10,11}
  4764. B2 {3,4} {5,8}
  4765. B3 {5} {4,8}
  4766. B4 {6,7} {2,9,11}
  4767. B5 {8,9} {4,5,7}
  4768. B6 {10,11} {1,2,6}
  4769. Figure 10.4: Flow graph with GEN and KILL block sets.
IN(B) and OUT(B) are defined by the following set of equations, which are called "data flow equations":
IN(B) = ∪ OUT(P), taken over all predecessors P of B
OUT(B) = (IN(B) − KILL(B)) ∪ GEN(B)
  4771. The next step, therefore, is to solve these equations. If there are n nodes, there will be 2n equations in 2n unknowns.
  4772. The solution to these equations is not generally unique. This is because we may have a situation like that shown in
  4773. Figure 10.5, where a block B is a predecessor of itself.
  4774. Figure 10.5: Nonunique solution to a data flow equation, where B is a predecessor of itself.
If there is a solution to the data flow equations for block B, say IN(B) = IN0 and OUT(B) = OUT0, then IN0 ∪ {d} and OUT0 ∪ {d}, where d is any definition not in IN0, OUT0, or KILL(B), also satisfy the equations. This is because if we take OUT0 ∪ {d} as the value of OUT(B), then, since B is one of the predecessors of itself, according to IN(B) = ∪ OUT(P) the definition d gets added to IN(B); and because d is not in KILL(B), we get IN(B) = IN0 ∪ {d}. According to OUT(B) = (IN(B) − KILL(B)) ∪ GEN(B), OUT(B) = OUT0 ∪ {d} is then also satisfied. Therefore, IN0, OUT0 is one of the solutions, whereas IN0 ∪ {d}, OUT0 ∪ {d} is another solution: there is no unique solution. What we are interested in is finding the smallest solution, that is, the smallest IN(B) and OUT(B) for every block B, consisting of values that are in all solutions. For example, since IN0 is contained in IN0 ∪ {d} and OUT0 is contained in OUT0 ∪ {d}, IN0, OUT0 is the smaller of the two solutions, and this is what we want, because the smallest IN(B) turns out to be the set of all definitions reaching the point just before the beginning of B. The algorithm for computing the smallest IN(B) and OUT(B) is as follows:
1. For each block B do
   {
       IN(B) = φ
       OUT(B) = GEN(B)
   }
2. flag = true
3. while (flag) do
   {
       flag = false
       for each block B do
       {
           INnew(B) = φ
           for each predecessor P of B
               INnew(B) = INnew(B) ∪ OUT(P)
           if INnew(B) ≠ IN(B) then
           {
               flag = true
               IN(B) = INnew(B)
               OUT(B) = (IN(B) − KILL(B)) ∪ GEN(B)
           }
       }
   }
Initially, we take IN(B) to be the empty set and OUT(B) to be GEN(B) for every block B, and we compute INnew(B). If it is different from IN(B), we compute a new OUT(B) and go for the next iteration. This is continued until, for every B, IN(B) in the current iteration is the same as in the previous iteration.
For example, for the flow graph shown in Figure 10.4, the IN and OUT iterations for the blocks are computed using the above algorithm, as shown in Tables 10.2-10.6.
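The same computation can be done mechanically with bit vectors, one bit per definition. The sketch below uses the GEN and KILL sets of Table 10.1 and the predecessor lists implied by the IN/OUT values of Tables 10.2 through 10.6 (B2's predecessors are B1 and B4, B3's are B2 and B5, B4's are B2 and B3, B5's is B3, and B6's is B4); everything else about the encoding is an assumption made for the illustration. The fixed point it reaches is the one shown in Table 10.6.

#include <stdio.h>

#define NB 7                         /* blocks B1..B6; index 0 unused        */
#define BIT(d) (1u << (d))           /* definition d is represented by bit d */

static const unsigned GEN[NB]  = { 0,
    BIT(1)|BIT(2), BIT(3)|BIT(4), BIT(5), BIT(6)|BIT(7), BIT(8)|BIT(9), BIT(10)|BIT(11) };
static const unsigned KILL[NB] = { 0,
    BIT(6)|BIT(10)|BIT(11), BIT(5)|BIT(8), BIT(4)|BIT(8),
    BIT(2)|BIT(9)|BIT(11), BIT(4)|BIT(5)|BIT(7), BIT(1)|BIT(2)|BIT(6) };
static const unsigned PRED[NB] = { 0,   /* bit mask of the predecessor blocks */
    0, BIT(1)|BIT(4), BIT(2)|BIT(5), BIT(2)|BIT(3), BIT(3), BIT(4) };

int main(void)
{
    unsigned in[NB] = { 0 }, out[NB] = { 0 };
    for (int b = 1; b < NB; b++)
        out[b] = GEN[b];                          /* initially OUT(B) = GEN(B)     */

    int changed = 1;
    while (changed) {
        changed = 0;
        for (int b = 1; b < NB; b++) {
            unsigned in_new = 0;
            for (int p = 1; p < NB; p++)
                if (PRED[b] & BIT(p))
                    in_new |= out[p];             /* IN(B) = union of OUT(P)       */
            if (in_new != in[b]) {
                changed = 1;
                in[b]   = in_new;
                out[b]  = (in[b] & ~KILL[b]) | GEN[b];
            }
        }
    }

    for (int b = 1; b < NB; b++)                  /* final values match Table 10.6 */
        printf("B%d: IN = 0x%04x  OUT = 0x%04x\n", b, in[b], out[b]);
    return 0;
}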
Table 10.2: Initial IN and OUT Values for the Flow Graph of Figure 10.4

Block   IN      OUT
B1      φ       {1,2}
B2      φ       {3,4}
B3      φ       {5}
B4      φ       {6,7}
B5      φ       {8,9}
B6      φ       {10,11}
Table 10.3: First Iteration for the IN and OUT Values

Block   IN              OUT
B1      φ               {1,2}
B2      {1,2,6,7}       {1,2,3,4,6,7}
B3      {3,4,8,9}       {3,5,9}
B4      {3,4,5}         {3,4,5,6,7}
B5      {5}             {8,9}
B6      {6,7}           {7,10,11}
Table 10.4: Second Iteration for the IN and OUT Values

Block   IN                    OUT
B1      φ                     {1,2}
B2      {1,2,3,4,5,6,7}       {1,2,3,4,6,7}
B3      {1,2,3,4,6,7,8,9}     {1,2,3,5,6,7,9}
B4      {1,2,3,4,5,6,7,9}     {1,3,4,5,6,7}
B5      {3,5,9}               {3,8,9}
B6      {3,4,5,6,7}           {3,4,5,7,10,11}
Table 10.5: Third Iteration for the IN and OUT Values

Block   IN                    OUT
B1      φ                     {1,2}
B2      {1,2,3,4,5,6,7}       {1,2,3,4,6,7}
B3      {1,2,3,4,6,7,8,9}     {1,2,3,5,6,7,9}
B4      {1,2,3,4,5,6,7,9}     {1,3,4,5,6,7}
B5      {1,2,3,5,6,7,9}       {1,2,3,6,8,9}
B6      {1,3,4,5,6,7}         {1,3,4,5,7,10,11}
Table 10.6: Fourth Iteration for the IN and OUT Values

Block   IN                    OUT
B1      φ                     {1,2}
B2      {1,2,3,4,5,6,7}       {1,2,3,4,6,7}
B3      {1,2,3,4,6,7,8,9}     {1,2,3,5,6,7,9}
B4      {1,2,3,4,5,6,7,9}     {1,3,4,5,6,7}
B5      {1,2,3,5,6,7,9}       {1,2,3,6,8,9}
B6      {1,3,4,5,6,7}         {1,3,4,5,7,10,11}
  4874. The next step is to compute the u − d chains from the reaching definitions information, as follows.
  4875. If the use of A in block B is preceded by its definition, then the u − d chain of A contains only the last definition prior to
  4876. this use of A. If the use of A in block B is not preceded by any definition of A, then the u − d chain for this use consists of
  4877. all definitions of A in IN(B).
  4878. For example, in the flow graph for which IN and OUT were computed in Tables 10.2–10.6, the use of a in definition 4,
  4879. block B2 is preceded by definition 3, which is the definition of a. Hence, the u − d chain for this use of a only contains
definition 3. But the use of b in B2 is not preceded by any definition of b in B2. Therefore, the u-d chain for this use of b will be {1}, because this is the only definition of b in IN(B2).
  4882. The u − d chain information is used to identify the loop invariant computations. The next step is to perform the code
  4883. motion, which moves a loop invariant statement to a newly created node, called "preheader," whose only successor is
  4884. a header of the loop. All the predecessors of the header that lie outside the loop will become predecessors of the
  4885. preheader.
  4886. But sometimes the movement of a loop invariant statement to the preheader is not possible because such a move
  4887. would alter the semantics of the program. For example, if a loop invariant statement exists in a basic block that is not a
  4888. dominator of all the exits of the loop (where an exit of the loop is the node whose successor is outside the loop), then
  4889. moving the loop invariant statement in the preheader may change the semantics of the program. Therefore, before
  4890. moving a loop invariant statement to the preheader, we must check whether the code motion is legal or not. Consider
  4891. the flow graph shown in Figure 10.6.
  4892. Figure 10.6: A flow graph containing a loop invariant statement.
In the flow graph shown in Figure 10.6, x = 2 is loop invariant. But since it occurs in B3, which is not a dominator of the exit of the loop, if we move it to the preheader, as shown in Figure 10.7, a value of two will always get assigned to y in B5; whereas in the original program, y in B5 may get the value one as well as two.
Figure 10.7: Moving a loop invariant statement changes the semantics of the program (after moving x = 2 to the preheader).
In the flow graph shown in Figure 10.7, if x is not used outside the loop, then the statement x = 2 can be moved to the preheader. Therefore, for a code motion to be legal, the following conditions must be met:
1. The block in which the loop invariant statement occurs should be a dominator of all exits of the loop, or the name assigned by the statement should not be used outside the loop.
2. We cannot move a loop invariant statement that assigns to A into the preheader if there is another statement in the loop that also assigns to A. For example, consider the flow graph shown in Figure 10.8.
Figure 10.8: Moving the statement to the preheader changes the meaning of the program.
Even though the statement x = 3 in B2 satisfies condition (1), moving it to the preheader will change the meaning of the program, because if x = 3 is moved to the preheader, then the value that will be assigned to y in B5 will be two if the execution path is B1-B2-B3-B4-B2-B4-B5, whereas for the same execution path the original program assigns a three to y in B5.
3. The move is illegal if A is used in the loop, and A is reached by any definition of A other than the statement to be moved. For example, consider the flow graph shown in Figure 10.9.
  4916. Figure 10.9: Moving a value to the preheader changes the original meaning of the program.
  4917. Even though x is not used outside the loop, the statement x = 2 in the block B2 cannot be moved to the preheader,
  4918. because the use of x in B4 is also reached by the definition x = 1 in B1. Therefore, if we move x = 2 to the preheader,
then the value that will get assigned to a in B4 will always be two, which is not the case in the original program.
  4920. 10.4 ELIMINATING INDUCTION VARIABLES
  4921. We define basic induction variables of a loop as those names whose only assignments within the loop are of the form I
  4922. = I ± C, where C is a constant or a name whose value does not change within the loop. A basic induction variable may
  4923. or may not form an arithmetic progression at the loop header.
  4924. For example, consider the flow graph shown in Figure 10.10. In the loop formed by B2, I is a basic induction variable.
  4925. Figure 10.10: Flow graph where I is a basic induction variable.
We then define an induction variable of a loop L as either a basic induction variable or a name J for which there is a basic induction variable I such that, each time J is assigned in L, J's value is some linear function of the value of I. That is, the value of J in L should be C1 * I + C2, where C1 and C2 could be functions of constants and loop invariant names. For example, in the loop of Figure 10.10, I is a basic induction variable; and T1 is also an induction variable, because the only assignment of T1 in the loop assigns a value to T1 that is a linear function of I, computed as 4 * I.
  4931. Algorithm for Detecting and Eliminating Induction Variables
  4932. An algorithm exists that will detect and eliminate induction variables. Its method is as follows:
1. Find all of the basic induction variables by scanning the statements of loop L.
2. Find any additional induction variables, and for each such additional induction variable A, find the family of some basic induction variable B to which A belongs. (If the value of A at the point of assignment is expressed as C1 * B + C2, then A is said to belong to the family of the basic induction variable B.) Specifically, we search for names A with a single assignment within loop L having one of a few simple forms (such as A = C * B), where C is a loop constant, and B is an induction variable, basic or otherwise. If B is basic, then A is in the family of B. If B is not basic, say B is in the family of D, then the additional requirements to be satisfied are:
a. There must be no assignment to D between the lone point of assignment to B in L and the assignment to A.
b. No definition of B from outside L may reach the assignment to A.
3. Consider each basic induction variable B in turn. For every induction variable A in the family of B:
a. Create a new name, temp.
b. Replace the assignment to A in the loop with A = temp.
c. Set temp to C1 * B + C2 at the end of the preheader by adding the appropriate statements.
d. Immediately after each assignment B = B + D, where D is a loop invariant, append the corresponding update of temp. If D is a loop invariant name and C1 ≠ 1, create a new loop invariant name for C1 * D, and add the statements that keep temp up to date.
e. For each basic induction variable B whose only uses are to compute other induction variables in its family and in conditional branches, take some A in B's family, preferably one whose function expresses its value simply, and replace each test of the form "B relop X goto Y" by the equivalent test on A. Delete all assignments to B from the loop, as they will now be useless.
f. If there is no assignment to temp between the introduced statement A = temp (step b) and the only use of A, then replace all uses of A by temp and delete the statement A = temp.
In the flow graph shown in Figure 10.10, we see that I is a basic induction variable, and T1 is the additional induction variable in the family of I, because
  4969. the value of T1 at the point of assignment in the loop is expressed as T1 = 4 *
  4970. I. Therefore, according to step 3b, we replace T1 = 4 * I by T1 = temp. And
  4971. according to step 3c, we add temp = 4 * I to the preheader. We then append
the statement temp = temp + 4 after statement (10) of Figure 10.10, as per step 3d. And according to step 3e, we replace the statement if I ≤ 20 goto B2 by a test on temp (temp2 = 4 * 20 followed by if temp ≤ temp2 goto B2).
The results of these modifications are shown in Figure 10.11.
  4975. Figure 10.11: Modified flow graph.
  4976. By step 3f, replace T1 by temp. And by copy propagation, temp = 4 * I, in the preheader, can be replaced by temp =
  4977. 4, and the statement I = 1 can be eliminated. In B1, the statement if temp ≤ temp2 goto B2 can be replaced by if temp
  4978. ≤ 80 goto B2, and we can eliminate temp2 = 80, as shown in Figure 10.12.
  4979. Figure 10.12: Flow graph preheader modifications.
  4980. 10.5 ELIMINATING LOCAL COMMON SUBEXPRESSIONS
  4981. The first step in eliminating local common subexpressions is to detect the common subexpression in a basic block.
  4982. The common subexpressions in a basic block can be automatically detected if we construct a directed acyclic graph
  4983. (DAG).
  4984. DAG Construction
  4985. For constructing a basic block DAG, we make use of the function node(id), which returns the most recently created
  4986. node associated with id. For every three-address statement x = y op z, x = op y, or x = y in the block we:
do
{
1. If node(y) is undefined, create a leaf labeled y, and let node(y) be this node. If node(z) is undefined, create a leaf labeled z, and let that leaf be node(z). If the statement is of the form x = op y or x = y, then only node(y) is handled in this way.
2. If a node exists that is labeled op whose left child is node(y) and whose right child is node(z) (this is how common subexpressions are caught), then return this node; otherwise, create such a node and return it. If the statement is of the form x = op y, then check whether a node exists that is labeled op whose only child is node(y); return this node, or otherwise create such a node and return it. Let the returned node be n.
3. Append x to the list of identifiers for the node n returned in step 2. Delete x from the list of attached identifiers for node(x), and set node(x) to be node n.
}
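The heart of the construction is the lookup in steps 1 and 2: node(id) returns the node currently associated with a name, and an interior node is reused whenever one with the same operator and the same children already exists, which is exactly how a common subexpression is caught. The fragment below sketches just that lookup for binary operators (it omits step 3, the re-labelling of identifiers); the table layout and all names in it are assumptions made for the illustration.

#include <stdio.h>
#include <string.h>

#define MAX_NODES 64

struct Node { char op[8]; int left, right; char label[16]; };
static struct Node dag[MAX_NODES];
static int node_count = 0;

/* node(id): return the most recently created node associated with id,
 * creating a leaf for it if none exists yet.                           */
static int node_of(const char *id)
{
    for (int i = node_count - 1; i >= 0; i--)
        if (strcmp(dag[i].label, id) == 0)
            return i;
    strcpy(dag[node_count].op, "leaf");
    strcpy(dag[node_count].label, id);
    dag[node_count].left = dag[node_count].right = -1;
    return node_count++;
}

/* Step 2 for x = y op z: reuse an existing op node with the same children
 * (a common subexpression), or create a new one.                          */
static int op_node(const char *op, int left, int right)
{
    for (int i = 0; i < node_count; i++)
        if (strcmp(dag[i].op, op) == 0 && dag[i].left == left && dag[i].right == right)
            return i;                              /* common subexpression found */
    strcpy(dag[node_count].op, op);
    dag[node_count].label[0] = '\0';
    dag[node_count].left  = left;
    dag[node_count].right = right;
    return node_count++;
}

int main(void)
{
    int s1 = op_node("*", node_of("4"), node_of("I"));   /* S1 := 4 * I */
    int s4 = op_node("*", node_of("4"), node_of("I"));   /* S4 := 4 * I */
    printf("S1 and S4 map to the same node: %s\n", s1 == s4 ? "yes" : "no");
    return 0;
}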
  5003. Therefore, we first go for a DAG representation of the basic block. And if the interior nodes in the DAG have more than
  5004. one label, then those nodes of the DAG represent the common subexpressions in the basic block. After detecting
  5005. these common subexpressions, we eliminate them from the basic block. The following example shows the elimination
  5006. of local common subexpressions, and the DAG is shown in Figure 10.13.
1. S1 := 4 * I
2. S2 := addr(A) − 4
3. S3 := S2[S1]
4. S4 := 4 * I
5. S5 := addr(B) − 4
6. S6 := S5[S4]
7. S7 := S3 * S6
8. S8 := PROD + S7
9. PROD := S8
10. S9 := I + 1
11. I := S9
12. if I ≤ 20 goto (1)
  5019. Figure 10.13: DAG representation of a basic block.
In Figure 10.13, PROD0 indicates the initial value of PROD, and I0 indicates the initial value of I. We see that the same value is assigned to S8 and PROD. Similarly, the value assigned to S9 is the same as I. And the values computed for S1 and S4 are the same; hence, we can eliminate these common subexpressions by selecting one of the attached identifiers (one that is needed outside the block). We assume that none of the temporaries is needed outside the block. The rewritten block will be:
1. S1 := 4 * I
2. S2 := addr(A) − 4
3. S3 := S2[S1]
4. S5 := addr(B) − 4
5. S6 := S5[S1]
6. S7 := S3 * S6
7. PROD := PROD + S7
8. I := I + 1
9. if I ≤ 20 goto (1)
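A minimal C sketch of the DAG construction described above is given below. It is an illustration of the technique rather than code from the book: it handles only statements of the form x = y op z, omits the removal of an identifier from the node it was previously attached to, and hard-codes the first few statements of the block above, writing the constants 4 and 1 as the leaf names "4" and "1".

#include <stdio.h>
#include <string.h>

#define MAXNODES 32
#define MAXIDS   32

typedef struct {              /* one DAG node */
    char op;                  /* '\0' for a leaf */
    int  left, right;         /* child node indices, -1 if none */
    char label[8];            /* leaf name (initial value of an identifier) */
    char ids[4][8];           /* identifiers currently attached to this node */
    int  nids;
} Node;

static Node nodes[MAXNODES];
static int  nnodes = 0;

/* node(id): the node currently associated with an identifier */
static char id_name[MAXIDS][8];
static int  id_node[MAXIDS];
static int  nentries = 0;

static int node_of(const char *id) {
    for (int i = 0; i < nentries; i++)
        if (strcmp(id_name[i], id) == 0) return id_node[i];
    return -1;
}

static void set_node_of(const char *id, int n) {
    for (int i = 0; i < nentries; i++)
        if (strcmp(id_name[i], id) == 0) { id_node[i] = n; return; }
    strcpy(id_name[nentries], id);
    id_node[nentries++] = n;
}

/* step 1: create a leaf for y if node(y) is undefined */
static int leaf(const char *y) {
    int n = node_of(y);
    if (n != -1) return n;
    nodes[nnodes].op = '\0';
    nodes[nnodes].left = nodes[nnodes].right = -1;
    nodes[nnodes].nids = 0;
    strcpy(nodes[nnodes].label, y);
    set_node_of(y, nnodes);
    return nnodes++;
}

/* process a statement x = y op z (steps 1-3 of the construction) */
static void stmt(const char *x, const char *y, char op, const char *z) {
    int ny = leaf(y), nz = leaf(z), n = -1;
    for (int i = 0; i < nnodes; i++)          /* step 2: reuse node op(node(y), node(z)) */
        if (nodes[i].op == op && nodes[i].left == ny && nodes[i].right == nz) { n = i; break; }
    if (n != -1)
        printf("%s = %s %c %s is a common subexpression (node %d)\n", x, y, op, z, n);
    else {
        n = nnodes++;
        nodes[n].op = op; nodes[n].left = ny; nodes[n].right = nz; nodes[n].nids = 0;
    }
    strcpy(nodes[n].ids[nodes[n].nids++], x); /* step 3: attach x to node n */
    set_node_of(x, n);                        /* (deletion of x from its old node omitted) */
}

int main(void) {
    stmt("S1", "4", '*', "I");
    stmt("S4", "4", '*', "I");                /* reported as a common subexpression */
    stmt("S9", "I", '+', "1");
    return 0;
}

Running the sketch reports S4 = 4 * I as a common subexpression, because a node for 4 * I already exists when the second statement is processed.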
  5034. 10.6 ELIMINATING GLOBAL COMMON SUBEXPRESSIONS
  5035. Global common subexpressions are expressions that compute the same value but in different basic blocks. To detect
  5036. such expressions, we need to compute available expressions.
  5037. 10.6.1 Available Expressions
An expression x op y is available at a point p if every path from the initial node of the flow graph to p evaluates x op y, and if, after the last such evaluation and prior to reaching p, there are no subsequent assignments to x or y. To eliminate global common subexpressions, we need to compute the set of all expressions available at the point just before the start of every block, which in turn requires computing the set of all expressions available at the point just after the end of every block. We call these sets IN(b) and OUT(b), respectively. Computing IN(b) and OUT(b) requires the set of all expressions generated by each basic block, GEN(b), and the set of all expressions killed by it, KILL(b):
A block kills an expression x op y if it assigns to x or y and does not subsequently recompute x op y.
A block generates an expression x op y if it evaluates x op y and does not subsequently redefine x or y.
To compute the available expressions, we solve the following equations:
OUT(b) = (IN(b) − KILL(b)) ∪ GEN(b)
IN(b) = ∩ OUT(p), taken over all predecessors p of b, for b ≠ b1, with IN(b1) = φ
Here, we want the largest solution that satisfies these equations.
The algorithm for computing the largest IN(b) and OUT(b) is given below, where b1 is the initial block, b2, ..., bn are the remaining blocks, and U is a "universal" set of all expressions appearing on the right of one or more statements of the program.
1.  IN(b1) = φ
    OUT(b1) = GEN(b1)
2.  for (i = 2; i <= n; i++)
    {
        IN(bi) = U
        OUT(bi) = U − KILL(bi)
    }
3.  flag = true
4.  while (flag) do
    {
        flag = false
        for (i = 2; i <= n; i++)
        {
            INnew(bi) = U
            for each predecessor p of bi
                INnew(bi) = INnew(bi) ∩ OUT(p)
            if INnew(bi) ≠ IN(bi) then
            {
                flag = true
                IN(bi) = INnew(bi)
                OUT(bi) = (IN(bi) − KILL(bi)) ∪ GEN(bi)
            }
        }
    }
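The iterative computation above maps naturally onto bit vectors, with one bit per expression in U. The following C sketch is only an illustration under assumed inputs: the flow graph, the number of expressions, and the GEN and KILL masks are invented for the example and are not taken from the book.

#include <stdio.h>
#include <stdint.h>

#define NBLOCKS 4   /* b1..b4; index 0 is the initial block */

/* Assumed example: three expressions in U, encoded as bits 0..2.
   GEN and KILL would normally be computed per block from its statements. */
static uint32_t GEN[NBLOCKS]  = {0x1, 0x2, 0x3, 0x0};
static uint32_t KILL[NBLOCKS] = {0x0, 0x1, 0x0, 0x2};
static const uint32_t U = 0x7;                        /* all expressions */

/* predecessor lists of each block (-1 terminates; hypothetical CFG) */
static int preds[NBLOCKS][NBLOCKS] = {{-1}, {0, -1}, {1, -1}, {1, 2, -1}};

int main(void) {
    uint32_t IN[NBLOCKS], OUT[NBLOCKS];
    IN[0] = 0; OUT[0] = GEN[0];                       /* step 1 */
    for (int i = 1; i < NBLOCKS; i++) {               /* step 2 */
        IN[i] = U;
        OUT[i] = U & ~KILL[i];
    }
    int changed = 1;                                  /* steps 3 and 4 */
    while (changed) {
        changed = 0;
        for (int i = 1; i < NBLOCKS; i++) {
            uint32_t in_new = U;
            for (int k = 0; preds[i][k] != -1; k++)   /* intersect predecessors' OUT */
                in_new &= OUT[preds[i][k]];
            if (in_new != IN[i]) {
                changed = 1;
                IN[i] = in_new;
                OUT[i] = (IN[i] & ~KILL[i]) | GEN[i];
            }
        }
    }
    for (int i = 0; i < NBLOCKS; i++)
        printf("IN(b%d) = %#x   OUT(b%d) = %#x\n", i + 1, IN[i], i + 1, OUT[i]);
    return 0;
}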
After computing IN(b) and OUT(b), eliminating the global common subexpressions is done as follows. For every statement s of the form x = y op z such that y op z is available at the beginning of the block containing s, and neither y nor z is defined prior to the statement x = y op z in that block, do the following (a small illustration follows this list):
1. Find all definitions reaching the block of s that have y op z on the right.
2. Create a new temporary, temp.
3. Replace each statement u = y op z found in step 1 by:
   temp = y op z
   u = temp
4. Replace the statement x = y op z in the block of s by x = temp.
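As a small hand-worked illustration (my own, not taken from the text), suppose y + z is computed in both predecessors B1 and B2 of a block B3 that also computes it, so that y + z is available on entry to B3.

Before the transformation:
    B1: u = y + z
    B2: v = y + z
    B3: x = y + z

After steps 1 through 4, using the new temporary temp:
    B1: temp = y + z
        u = temp
    B2: temp = y + z
        v = temp
    B3: x = temp

The recomputation of y + z in B3 has been replaced by a copy from temp.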
  5087. 10.7 LOOP UNROLLING
Loop unrolling involves replicating the body of the loop in order to reduce the required number of tests, provided the number of iterations is a compile-time constant. For example, consider the following loop:
  5090. I = 1
  5091. while (I <= 100)
  5092. {
  5093. x[I] = 0;
  5094. I++;
  5095. }
In this case, the test I <= 100 will be performed 100 times. But if the body of the loop is replicated, the test will only need to be performed 50 times. After replicating the body once, the loop becomes:
  5098. I = 1
  5099. while(I<= 100)
  5100. {
  5101. x[I] = 0;
  5102. I++;
x[I] = 0;
  5104. I++;
  5105. }
Any divisor of the number of iterations can be chosen, and the body replicated that many times. Unrolling once, that is, replicating the body to form two copies, saves 50% of the tests, which is the maximum possible saving.
  5109. 10.8 LOOP JAMMING
  5110. Loop jamming is a technique that merges the bodies of two loops if the two loops have the same number of iterations
  5111. and they use the same indices. This eliminates the test of one loop. For example, consider the following loop:
  5112. {
  5113. for (I = 0; I < 10; I++)
  5114. for (J = 0; J < 10; J++)
  5115. X[I,J] = 0;
  5116. for (I = 0; I < 10; I++)
  5117. X[I,I] = 1;
  5118. }
  5119. Here, the bodies of the loops on I can be concatenated. The result of loop jamming will be:
  5120. {
  5121. for (I = 0; I < 10; I++)
  5122. {
  5123. for (J = 0; J < 10; J++)
  5124. X[I,J] = 0;
  5125. X[I,I] = 1;
  5126. }
  5127. }
The following conditions are sufficient for making loop jamming legal:
1. No quantity is computed by the second loop at iteration I if it is computed by the first loop at iteration J ≥ I.
2. If a value is computed by the first loop at iteration J ≥ I, then this value should not be used by the second loop at iteration I.
  5135. Chapter 11: Code Generation
  5136. 11.1 AN INTRODUCTION TO CODE GENERATION
Code generation is the last phase in the compilation process. Because this phase is machine dependent, it is not possible to generate good code without considering the details of the particular machine for which the compiler is expected to generate code. Even so, a carefully selected code-generation algorithm can produce code that is twice as fast as code generated by an ill-considered one.
  5141. In this chapter, we first discuss straightforward code generation from a sequence of three-address statements. This is
  5142. followed by a discussion of the code-generation algorithm that takes into account the flow of control structures in the
  5143. program when assigning registers to names. Then we will look at a code-generation algorithm that is capable of
  5144. generating reasonably good code from a basic block. Finally, various machine-dependent optimizations that are
  5145. capable of improving the efficiency of object code are discussed. Throughout our discussion, we assume that the input
  5146. to the code-generation algorithm is a sequence of three-address statements partitioned into basic blocks.
  5147. 11.2 PROBLEMS THAT HINDER GOOD CODE GENERATION
  5148. There are three main difficulties that we face when attempting to generate efficient object code, namely:
1. Selecting the most-efficient instructions to represent the computation specified by the three-address statement;
2. Deciding on a computation order that leads to the generation of more-efficient object code; and
3. Deciding which registers to use.
  5156. Selecting the Most-Efficient Instructions to Represent the Computation Specified by the
  5157. Three-Address Statement
  5158. Many machines allow certain computations to be done in more than one way. For example, if a machine permits an
  5159. instruction AOS for incrementing the contents of a storage location directly, then for a three-address statement a = a +
  5160. 1, it is possible to generate the instruction AOS a, rather than a sequence of instructions like the following:
  5161. MOVE a, R
  5162. ADD #1, R
  5163. MOVE R, a
  5164. Now, deciding which instruction sequence is better is the problem. This decision requires an extensive knowledge
  5165. about the context in which these three-address statements will appear.
  5166. Deciding on the Computation Order that Will Lead to the Generation of More-Efficient
  5167. Object Code
  5168. Some computation orders require fewer registers to hold intermediate results than others. Now, deciding the best
order is very difficult. For example, consider the basic block:
t1 = a + b
t2 = c + d
t3 = e − t2
t4 = t1 − t3
If the computation order used is the one given in the basic block, t1-t2-t3-t4, then the number of registers required to hold the intermediate results is greater than when the order t2-t3-t1-t4 is used.
  5172. Deciding on Registers
  5173. Deciding which register should handle the computation is another problem that stands in the way of good code
  5174. generation. The problem is further complicated when a machine requires register-pairs for some operands and results.
  5175. 11.3 THE MACHINE MODEL
  5176. Being a machine-dependent phase, we will need to describe some of the features of a typical computer in order to
  5177. discuss the various issues involved in code generation. For this purpose, we describe a hypothetical machine model,
  5178. as follows.
We assume that the machine is byte-addressable with two bytes per word, has 2^16 bytes of memory, and has eight general-purpose registers, R0 to R7, each capable of holding a 16-bit quantity. The instruction format is "op source, destination", with a four-bit opcode and six-bit fields for the source and destination. Since a six-bit field cannot hold a memory address (a memory address is 16 bits), when the source or destination is a memory address, the six-bit field holds a bit pattern specifying that a word following the instruction contains the memory address to be used as the source or destination operand, respectively. The following addressing modes are assumed to be supported by the machine model:
1. r (register addressing)
2. *r (indirect register)
3. X (absolute address)
4. #data (immediate)
5. X(r) (indexed address)
6. *X(r) (indirect indexed address)
We assume that opcodes like the ones listed below are available:
  5193. MOV (for moving source to destination),
  5194. ADD (for adding source to destination), and
  5195. SUB (for subtracting source from destination), and so on.
  5196. The cost of the instruction is considered to be its length, because generating a shorter instruction not only reduces the
  5197. storage requirement of the object code, but it also reduces the time taken to perform the operation. This is because
  5198. most machines spend more time fetching words from memory than they spend in executing the instruction. Hence, by
  5199. minimizing the instruction length, we minimize the time taken to perform the instruction, as well.
For example, the length of the instruction MOV R0, R1 is one memory word, because a three-bit code is enough to uniquely identify each of the registers. Therefore, the six-bit source and destination fields can easily hold the three-bit register codes, as shown in Table 11.1.
  5203. Table 11.1: Six-Bit Registers for the Instruction MOV R0, R1
MOV | R0 | R1          (one word)
  5205. Similarly, the length of the instruction MOV R0, M is two memory words, because since the destination operand is a
  5206. memory address, it will occupy the word following an instruction, as shown in Table 11.2.
Table 11.2: Six-Bit Registers for the Instruction MOV R0, M
MOV | R0 | bit pattern   (first word)
M                        (second word)
  5210. Similarly, the length of the instruction MOV M1, M2 is three memory words, because the source and the destination
  5211. operands, being memory addresses, will occupy the words following the instruction, as shown in Table 11.3.
  5212. Table 11.3: Six-Bit Registers for the Instruction MOV M1, M2
MOV | bit pattern | bit pattern   (first word)
M1                                (second word)
M2                                (third word)
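Under this cost model, an instruction occupies one word for the opcode and mode fields, plus one extra word for every operand that is a memory address or a literal. The following small C function is my own illustration of how the lengths quoted above can be computed; the enum names are assumptions made for the example, not part of the machine model's definition.

#include <stdio.h>

typedef enum { REG, REG_INDIRECT, ABSOLUTE, IMMEDIATE, INDEXED, INDEXED_INDIRECT } Mode;

/* extra words needed to hold the operand itself */
static int extra_words(Mode m) {
    switch (m) {
    case REG:
    case REG_INDIRECT: return 0;   /* fits entirely in the six-bit field     */
    default:           return 1;   /* an address or literal word must follow */
    }
}

/* length (and, under this model, cost) of "op source, destination" */
static int instr_length(Mode src, Mode dst) {
    return 1 + extra_words(src) + extra_words(dst);
}

int main(void) {
    printf("MOV R0, R1 : %d word(s)\n", instr_length(REG, REG));           /* 1 */
    printf("MOV R0, M  : %d word(s)\n", instr_length(REG, ABSOLUTE));      /* 2 */
    printf("MOV M1, M2 : %d word(s)\n", instr_length(ABSOLUTE, ABSOLUTE)); /* 3 */
    return 0;
}

The three printed values reproduce the lengths of one, two, and three words discussed above.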
For example, consider a three-address statement, a = b + c. We can generate several different instruction sequences for this statement, depending upon where the values of the operands b and c can be found.
If the values of b and c are found in the memory locations of the same names, then either of the following sequences can be generated:
1.  MOV b, R0
    ADD c, R0
    MOV R0, a          (length = six words)
2.  MOV b, a
    ADD c, a           (length = six words)
If the addresses of a, b, and c are assumed to be in registers R0, R1, and R2, respectively, then the following sequence can be generated:
3.  MOV *R1, *R0
    ADD *R2, *R0       (length = two words)
If the values of b and c are assumed to be in registers R1 and R2, respectively, then the following sequence can be generated:
4.  ADD R2, R1
    MOV R1, a          (length = three words)
Therefore, we conclude that to generate good code, we must utilize the addressing capabilities of the machine efficiently. This is possible if we keep the l-value or the r-value of a name in a register when it is going to be used in the future.
  5240. 11.4 STRAIGHTFORWARD CODE GENERATION
  5241. Given a sequence of three-address statements partitioned into basic blocks, straightforward code generation involves
generating code for each three-address statement in turn, taking advantage of any operands of the statement that are already in registers, and leaving the computed result in a register as long as possible. We store the result only if its register is needed for another computation, or just before a procedure call, jump, or labeled
  5245. statement, such as at the end of a basic block. The reason for this is that after leaving a basic block, we may go to
  5246. several different blocks, or we may go to one particular block that can be reached from several others. In either case,
  5247. we cannot assume that a datum used by a block appears in the same register, no matter how the program's control
  5248. reached that block. Hence, to avoid possible error, our code-generation strategy stores everything across the basic
  5249. block boundaries.
  5250. When generating code by using the above strategy, we need to keep track of what is currently in
  5251. each register. For this, we maintain what is called a "register descriptor," which is simply a pointer
  5252. to a list that contains information about what is currently in each of the registers. Initially, all of the
  5253. registers are empty.
  5254. We also need to keep track of the locations for each name—where the current value of the name can be found at run
  5255. time. For this, we maintain what is called an "address descriptor" for each name in the block. This information can be
  5256. stored in the symbol table.
  5257. We also need a location to perform the computation specified by each of the three-address statements. For this, we
make use of the function getreg(). When called, getreg() returns a location in which the computation specified by a three-address statement should be performed. For example, for x = y op z, getreg() returns a location L where the computation y
  5260. op z should be performed; and if possible, it returns a register.
  5261. Algorithm for the Function Getreg()
  5262. What follows is an algorithm for storing and returning the register locations for three-address statements by using the
  5263. function getreg().
{
For every three-address statement of the form x = y op z in the basic block do
{
1. Call getreg() to obtain the location L in which the computation y op z should be performed. /* This requires passing the three-address statement x = y op z as a parameter to getreg(), which can be done by passing the index of the statement in the quadruple array. */
2. Obtain the current location of the operand y by consulting its address descriptor; if the value of y is currently both in a memory location and in a register, prefer the register. If the value of y is not already in L, generate an instruction MOV y, L (where y is assumed to denote the current location of y).
3. Generate the instruction OP z, L, and update the address descriptor of x to indicate that x is now available in L; if L is a register, update its register descriptor to indicate that it contains the run-time value of x.
4. If the current values of y and/or z are in registers, have no further uses, and are not live at the end of the block, then alter the register descriptors to indicate that, after execution of x = y op z, those registers no longer contain y and/or z.
}
Store all the results at the end of the block.
}
The function getreg(), when called upon to return a location where the computation specified by the three-address statement x = y op z should be performed, returns a location L as follows (a small C sketch follows this list):
1. First, it searches for a register that already contains the name y. If such a register exists, and if y has no further use after the execution of x = y op z, is not live at the end of the block, and holds the value of no other name, then that register is returned for L.
2. Otherwise, getreg() searches for an empty register; if one is available, it is returned for L.
3. If no empty register exists, and x has a further use in the block, or op is an operator (such as indexing) that requires a register, then getreg() finds a suitable occupied register. The register is emptied by storing its value in the proper memory location M, the address descriptor is updated, and the register is returned for L. (A least-recently used strategy can be used to choose the occupied register to be emptied.)
4. If x is not used later in the block, or no suitable occupied register can be found, getreg() selects the memory location of x and returns it for L.
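The following C sketch illustrates these four rules. It is an assumption-laden illustration rather than the book's implementation: the register count, the descriptor layout, and the flags standing in for next-use and liveness information are all invented for the example, and rule 3 simply spills register R0 instead of applying a least-recently-used policy.

#include <stdio.h>
#include <string.h>

#define NREGS 2

/* register descriptor: which name (if any) each register currently holds */
typedef struct { char name[8]; } RegDesc;

/* Pick a location for x = y op z.  Returns a register index, or -1 meaning
 * "use the memory location of x".                                          */
int getreg(RegDesc regs[], const char *y, int y_dead_after, int x_needed_later) {
    /* Rule 1: reuse the register that already holds y, if y dies here */
    for (int r = 0; r < NREGS; r++)
        if (strcmp(regs[r].name, y) == 0 && y_dead_after)
            return r;
    /* Rule 2: any empty register */
    for (int r = 0; r < NREGS; r++)
        if (regs[r].name[0] == '\0')
            return r;
    /* Rule 3: spill an occupied register (register 0 here, for illustration) */
    if (x_needed_later) {
        printf("MOV R0, %s    ; spill before reuse\n", regs[0].name);
        regs[0].name[0] = '\0';
        return 0;
    }
    /* Rule 4: compute x directly into its memory location */
    return -1;
}

int main(void) {
    RegDesc regs[NREGS] = {{"t1"}, {"t2"}};
    int L = getreg(regs, "t2", /*y_dead_after=*/0, /*x_needed_later=*/1);
    printf("chosen location: %s\n", L >= 0 ? "a register" : "memory");
    return 0;
}

With both registers occupied and x needed later, the sketch spills R0 and returns it, mirroring rule 3 above.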
  5306. EXAMPLE 11.1
Consider the expression:
x = (a + b) − ((c + d) − e)
The three-address code for this is:
t1 = a + b
t2 = c + d
t3 = t2 − e
x = t1 − t3
Applying the algorithm above results in Table 11.4.
  5310. Table 11.4: Computation for the Expression x = ( a + b ) − (( c + d ) − e )
Statement   | L  | Instructions Generated | Register Descriptor | Address Descriptor
            |    |                        | All registers empty |
t1 = a + b  | R0 | MOV a, R0              | R0 will hold t1     | t1 is in R0
            |    | ADD b, R0              |                     |
t2 = c + d  | R1 | MOV c, R1              | R1 will hold t2     | t2 is in R1
            |    | ADD d, R1              |                     |
t3 = t2 − e | R1 | SUB e, R1              | R1 will hold t3     | t3 is in R1
x = t1 − t3 | R0 | SUB R1, R0             | R0 will hold x      | x is in R0
            |    | MOV R0, x              |                     | x is in R0 and memory
  5321. The algorithm makes use of the next-use information of each name in order to make more-informed decisions
  5322. regarding register allocation. Therefore, it is required to compute the next-use information. If:
  5323. A statement at the index i in a block assigns a value to name x,
  5324. And if a statement at the index j in the same block uses x as an operand,
  5325. And if the path from the statement at index i to the statement at index j is a path without any
  5326. intervening assignment to name x, then
  5327. we say that the value of x computed by the statement at index i is used in the statement at index j. Hence, the next use
  5328. of the name x in the statement i is statement j. For each three-address statement i, we need to compute information
  5329. about those three-address statements in the block that are the next uses of the names coming in statement i. This
  5330. requires the backward scanning of the basic block, which will allow us to attach to every statement i under
  5331. consideration the information about those statements that are the next uses of each name in the statement i. The
  5332. algorithm is as follows:
For each statement i of the form x = y op z (scanning the block backward) do
{
    attach to statement i the information currently recorded about the next uses of x, y, and z
    set the information for x to "no next use" /* this information can be kept in the symbol table */
    set the information for y and z to indicate that their next use is in statement i
}
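The backward scan can be sketched in C as follows. This is an illustration only: the block is the one used in Table 11.5, hard-coded; the operators are ignored, since only the names matter for next-use computation; and only the uses of y and z are recorded against each statement.

#include <stdio.h>
#include <string.h>

#define NSTMTS 4
#define NO_NEXT_USE -1

typedef struct { char x[4], y[4], z[4]; } Quad;   /* statement: x = y op z */

/* the block used in Table 11.5 */
static Quad block[NSTMTS] = {
    {"t1", "a", "b"}, {"t2", "c", "d"}, {"t3", "e", "t2"}, {"x", "t1", "t3"}
};

/* "symbol table" entry: the statement index of the next use of each name */
static char names[16][4];
static int  next_use[16], nnames = 0;

static int *entry(const char *n) {
    for (int i = 0; i < nnames; i++)
        if (strcmp(names[i], n) == 0) return &next_use[i];
    strcpy(names[nnames], n);
    next_use[nnames] = NO_NEXT_USE;          /* names are assumed dead at block end */
    return &next_use[nnames++];
}

int main(void) {
    int y_use[NSTMTS], z_use[NSTMTS];        /* information attached to each statement */
    for (int i = NSTMTS - 1; i >= 0; i--) {  /* backward scan of the block */
        y_use[i] = *entry(block[i].y);       /* record the current next-use information */
        z_use[i] = *entry(block[i].z);
        *entry(block[i].x) = NO_NEXT_USE;    /* x: no next use before statement i */
        *entry(block[i].y) = i;              /* y and z: next used at statement i */
        *entry(block[i].z) = i;
    }
    for (int i = 0; i < NSTMTS; i++)
        printf("stmt %d: next use of %s is stmt %d, of %s is stmt %d  (-1 = none)\n",
               i, block[i].y, y_use[i], block[i].z, z_use[i]);
    return 0;
}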
Consider the basic block:
t1 = a + b
t2 = c + d
t3 = e − t2
x = t1 − t3
  5343. When straightforward code generation is done using the above algorithm, and if only two registers, R0 and R1, are
  5344. available, then the generated code is as shown in Table 11.5.
  5345. Table 11.5: Generated Code with Only Two Available Registers, R0 and R1
Statement   | L  | Instructions Generated              | Cost    | Register Descriptor              | Address Descriptor
            |    |                                     |         | R0 and R1 empty                  |
t1 = a + b  | R0 | MOV a, R0                           | 2 words | R0 will hold t1                  | t1 is in R0
            |    | ADD b, R0                           | 2 words |                                  |
t2 = c + d  | R1 | MOV c, R1                           | 2 words | R1 will hold t2                  | t2 is in R1
            |    | ADD d, R1                           | 2 words |                                  |
t3 = e − t2 | R0 | MOV R0, t1 (generated by getreg())  | 2 words | R0 will hold t3; R1 will be      | t1 is in memory;
            |    | MOV e, R0                           | 2 words | empty, because t2 has no         | t3 is in R0
            |    | SUB R1, R0                          | 1 word  | next use                         |
x = t1 − t3 | R1 | MOV t1, R1                          | 2 words | R1 will hold x; R0 will be       | x is in R1
            |    | SUB R0, R1                          | 1 word  | empty, because t3 has no         |
            |    |                                     |         | next use                         |
            |    | MOV R1, x                           | 2 words |                                  | x is in R1 and memory
We see that the total length of the instruction sequence generated is 18 memory words. If we rearrange the final computations as:
t2 = c + d
t3 = e − t2
t1 = a + b
x = t1 − t3
and then generate the code, we get Table 11.6.
  5393. Table 11.6: Generated Code with Rearranged Computations
Statement   | L  | Instructions Generated | Cost    | Register Descriptor              | Address Descriptor
            |    |                        |         | R0 and R1 empty                  |
t2 = c + d  | R0 | MOV c, R0              | 2 words | R0 will hold t2                  | t2 is in R0
            |    | ADD d, R0              | 2 words |                                  |
t3 = e − t2 | R1 | MOV e, R1              | 2 words | R1 will hold t3; R0 will be      | t3 is in R1
            |    | SUB R0, R1             | 1 word  | empty, because t2 has no         |
            |    |                        |         | next use                         |
t1 = a + b  | R0 | MOV a, R0              | 2 words | R0 will hold t1                  | t1 is in R0
            |    | ADD b, R0              | 2 words |                                  |
x = t1 − t3 | R0 | SUB R1, R0             | 1 word  | R0 will hold x; R1 will be       | x is in R0
            |    |                        |         | empty, because t3 has no         |
            |    |                        |         | next use                         |
            |    | MOV R0, x              | 2 words |                                  | x is in R0 and memory
  5430. Here, the length of the instruction sequence generated is 14 memory words. This indicates that the order of the
  5431. computation is a deciding factor in the cost of the code generated. In the above example, the cost is reduced when the
  5432. order t2-t3-t1-t4 is used, because t1 gets computed immediately before the statement that computes t4, which uses t1
  5433. as its left operand. Hence, no intermediate store-and-load is required, as is the case when the order t1-t2-t3-t4 is used.
  5434. Good code generation requires rearranging the final computation order, and this can be done conveniently with a DAG
  5435. representation of a basic block rather than with a linear sequence of three-address statements.
  5436. 11.5 USING DAG FOR CODE GENERATION
  5437. To rearrange the final computation order for more-efficient code-generation, we first obtain a DAG representation of
the basic block, and then we order the nodes of the DAG using a heuristic. The heuristic attempts to order the nodes of the DAG so that, if possible, the evaluation of a node immediately follows the evaluation of its left-most operand.
  5440. 11.5.1 Heuristic DAG Ordering
The algorithm for heuristic ordering is given below. It lists the nodes of the DAG; the reverse of this listing gives the computation order.
{
    while there exists an unlisted interior node do
    {
        select an unlisted interior node n, all of whose parents have been listed
        list n
        while the left-most child m of n has no unlisted parents and m is not a leaf do
        {
            list m
            n = m
        }
    }
    order = reverse of the order in which the nodes were listed
}
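The listing loop can be sketched in C as shown below. The DAG here is a small hypothetical one built by hand (it is not the DAG of Figure 11.1, which is not reproduced in the text); interior nodes are listed only when all of their parents have been listed, and the evaluation order is the reverse of the listing.

#include <stdio.h>

#define NNODES 6

/* children of each node, left to right; leaves have none (-1) */
static int children[NNODES][2] = {
    {1, 2},      /* 0: root              */
    {3, 4},      /* 1                    */
    {4, 5},      /* 2                    */
    {-1, -1},    /* 3: leaf              */
    {-1, -1},    /* 4: leaf (shared)     */
    {-1, -1}     /* 5: leaf              */
};
static int unlisted_parents[NNODES];   /* how many parents are still unlisted */
static int listed[NNODES], order[NNODES], nlisted = 0;

static int is_leaf(int n) { return children[n][0] == -1; }

static void list_node(int n) {
    listed[n] = 1;
    order[nlisted++] = n;
    for (int c = 0; c < 2; c++)
        if (children[n][c] != -1) unlisted_parents[children[n][c]]--;
}

int main(void) {
    for (int n = 0; n < NNODES; n++)           /* count parents of every node */
        for (int c = 0; c < 2; c++)
            if (children[n][c] != -1) unlisted_parents[children[n][c]]++;

    while (1) {
        int n = -1;                            /* an unlisted interior node   */
        for (int i = 0; i < NNODES; i++)       /* all of whose parents are listed */
            if (!listed[i] && !is_leaf(i) && unlisted_parents[i] == 0) { n = i; break; }
        if (n == -1) break;
        list_node(n);
        int m = children[n][0];                /* follow left-most children down */
        while (m != -1 && !is_leaf(m) && unlisted_parents[m] == 0) {
            list_node(m);
            m = children[m][0];
        }
    }
    printf("evaluation order (reverse of listing):");
    for (int i = nlisted - 1; i >= 0; i--) printf(" n%d", order[i]);
    printf("\n");
    return 0;
}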
  5457. EXAMPLE 11.2
  5458. Consider the DAG shown in Figure 11.1.
  5459. Figure 11.1: DAG Representation.
  5460. The order in which the nodes are listed by the heuristic ordering is shown in Figure 11.2.
  5461. Figure 11.2: DAG Representation with heuristic ordering.
  5462. Therefore, the computation order is:
  5463. If the DAG representation turns out to be a tree, then for the machine model described above, we can obtain the
  5464. optimal order using the algorithm described in Section 11.5.2, below. Here, an optimal order means the order that
  5465. yields the shortest instruction sequence.
  5466. 11.5.2 The Labeling Algorithm
  5467. This algorithm works on the tree representation of a sequence of three-address statements. It could also be made to
  5468. work if the intermediate code form was a parse tree. This algorithm has two parts: the first part labels each node of the
tree from the bottom up with an integer that denotes the minimum number of registers required to evaluate the tree without storing any intermediate results. The second part of the algorithm is a tree traversal that visits the tree in an order governed by the labels computed in the first part, generating the code during the traversal.
{
    if n is a leaf node then
        if n is the left-most child of its parent then
            label(n) = 1
        else
            label(n) = 0
    else
        label(n) = max[label(ni) + (i − 1)] for i = 1 to k
        /* where n1, n2, ..., nk are the children of n, ordered by their labels;
           that is, label(n1) ≥ label(n2) ≥ ... ≥ label(nk) */
}
For k = 2, the formula label(n) = max[label(ni) + (i − 1)] becomes:
label(n) = max[l1, l2 + 1]
where l1 is label(n1) and l2 is label(n2), with l1 ≥ l2. Since either l1 and l2 are equal, or they differ by at least one (i.e., l1 − l2 ≥ 1), we get:
label(n) = l1 + 1 if l1 = l2, and label(n) = max(l1, l2) otherwise.
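The labeling rule for binary trees can be written directly as a small recursive C function. This is my own sketch; the tree built in main() is the expression of Example 11.1, x = (a + b) − ((c + d) − e), for which the function reports that two registers suffice.

#include <stdio.h>

/* binary expression tree node */
typedef struct Tree {
    struct Tree *left, *right;   /* NULL for leaves */
    int is_leftmost_leaf;        /* meaningful only for leaves */
    int label;
} Tree;

/* label(n): minimum number of registers needed to evaluate the subtree at n
 * without storing intermediate results (the k = 2 formula above)            */
int label(Tree *n) {
    if (n->left == NULL) {                       /* leaf */
        n->label = n->is_leftmost_leaf ? 1 : 0;
    } else {
        int l1 = label(n->left), l2 = label(n->right);
        n->label = (l1 == l2) ? l1 + 1 : (l1 > l2 ? l1 : l2);
    }
    return n->label;
}

int main(void) {
    /* (a + b) − ((c + d) − e), built by hand; leaves a and c are leftmost children */
    Tree a = {0, 0, 1, 0}, b = {0, 0, 0, 0}, c = {0, 0, 1, 0}, d = {0, 0, 0, 0}, e = {0, 0, 0, 0};
    Tree t1 = {&a, &b, 0, 0}, t2 = {&c, &d, 0, 0}, t3 = {&t2, &e, 0, 0}, root = {&t1, &t3, 0, 0};
    printf("registers needed: %d\n", label(&root));   /* prints 2 */
    return 0;
}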
  5488. EXAMPLE 11.3
  5489. Consider the following three-address code and its DAG representation, shown in Figure 11.3:
  5490. Figure 11.3: DAG representation of three-address code for Example 11.3.
  5491. The tree, after labeling, is shown in Figure 11.4.
  5492. Figure 11.4: DAG representation tree after labeling.
  5493. 11.5.3 Code Generation by Traversing the Labeled Tree
  5494. We will now examine an algorithm that traverses the labeled tree and generates machine code to evaluate the tree in
  5495. the register R0. The content of R0 can then be stored in the appropriate memory location. We assume that only binary
  5496. operators are used in the tree. The algorithm uses a recursive procedure, gencode(n), to generate the code for
  5497. evaluating into a register a subtree that has its root in node n. This procedure makes use of RSTACK to allocate
  5498. registers.
  5499. Initially, RSTACK contains all available registers. We assume the order of the registers to be R0, R1, … , from top to
  5500. bottom. A call to gencode() may find a subset of registers, perhaps in a different order in RSTACK, but when
  5501. gencode() returns, it leaves the registers in RSTACK in the same order in which they were found. The resulting code
  5502. computes the value of the tree in the top register of RSTACK. It also makes use of TSTACK to allocate temporary
  5503. memory locations. Depending upon the type of node n with which gencode() is called, gencode() performs the
  5504. following:
1. If n is a leaf node and is the left-most child of its parent, then gencode() generates a load instruction that loads the name labeling node n into the top register of RSTACK.
2. If n is an interior node, it will be an operator node labeled by op with children n1 and n2, where n2 is a simple operand and not the root of a subtree, as shown in Figure 11.5.
Figure 11.5: The node n2 is an operand and not a subtree root.
In this case, gencode() first generates the code to evaluate the subtree rooted at n1 in RSTACK[top]. It then generates the instruction OP name, RSTACK[top].
3. If n is an interior node, it will be an operator node labeled by op with children n1 and n2, where both n1 and n2 are roots of subtrees, as shown in Figure 11.6.
Figure 11.6: The node n is an operator, and n1 and n2 are subtree roots.
In this case, gencode() examines the labels of n1 and n2. If label(n2) > label(n1), then n2 requires a greater number of registers than n1 to evaluate without storing intermediate results. Therefore, gencode() checks whether the number of available registers, r, is greater than label(n1). If it is, then the subtree rooted at n1 can be evaluated without storing intermediate results. gencode() first swaps the top two registers of RSTACK, then generates the code for evaluating the subtree rooted at n2, the harder one, in RSTACK[top]. It removes the top-most register from RSTACK, calling it R, and generates code for evaluating the subtree rooted at n1 in the new RSTACK[top]. The instruction OP R, RSTACK[top] is then generated, R is pushed back onto RSTACK, and the top two registers are swapped so that the register holding the value of n is again the top register of RSTACK.
4. If label(n2) <= label(n1), then n1 requires a greater number of registers than n2 to evaluate without storing intermediate results. Therefore, gencode() checks whether the number of available registers, r, is greater than label(n2). If it is, then the subtree rooted at n2 can be evaluated without storing intermediate results. gencode() first generates the code for evaluating the subtree rooted at n1, the harder one, in RSTACK[top], removes the top-most register from RSTACK, and calls it R. It then generates code for evaluating the subtree rooted at n2 in the new RSTACK[top]. The instruction OP RSTACK[top], R is generated, and register R is pushed back onto RSTACK. In this case the top register, after pushing R onto RSTACK, already holds the value of n, so no swapping and reswapping is needed.
5. If label(n1) as well as label(n2) is greater than or equal to r (i.e., both subtrees require r or more registers to evaluate without intermediate storage), a temporary memory location is required. In this case, gencode() first generates the code for evaluating n2 into a temporary memory location, then generates the code to evaluate n1, followed by an instruction that evaluates the root n in the top register of RSTACK.
  5544. Algorithm for Implementing Gencode()
  5545. The procedure for gencode() is outlined as follows:
Procedure gencode(n)
{
    if n is a leaf node and the left-most child of its parent then
        generate MOV name, RSTACK[top]
    if n is an interior node with children n1 and n2, and label(n2) = 0 then
    {
        gencode(n1)
        generate op name, RSTACK[top]   /* name is the operand represented by n2, and op is the operator represented by n */
    }
    if n is an interior node with children n1 and n2, label(n2) > label(n1), and label(n1) < r then
    {
        swap the top two registers of RSTACK
        gencode(n2)
        R = pop(RSTACK)
        gencode(n1)
        generate op R, RSTACK[top]      /* op is the operator represented by n */
        push(R, RSTACK)
        swap the top two registers of RSTACK
    }
    if n is an interior node with children n1 and n2, label(n2) <= label(n1), and label(n2) < r then
    {
        gencode(n1)
        R = pop(RSTACK)
        gencode(n2)
        generate op RSTACK[top], R      /* op is the operator represented by n */
        push(R, RSTACK)
    }
    if n is an interior node with children n1 and n2, and both label(n1) >= r and label(n2) >= r then
    {
        gencode(n2)
        T = pop(TSTACK)
        generate MOV RSTACK[top], T
        gencode(n1)
        push(T, TSTACK)
        generate op T, RSTACK[top]      /* op is the operator represented by n */
    }
}
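The procedure can be turned into a compact C sketch, shown below. It is an illustration under simplifying assumptions: registers and temporaries are just printed names, TSTACK is replaced by a counter of temporary names, and the tree in main() is the one of Example 11.4. With two registers, the output reproduces the instruction sequence of Table 11.7.

#include <stdio.h>

typedef struct Tree {
    struct Tree *left, *right;   /* NULL for leaves */
    const char *name;            /* leaf: operand name; interior: opcode (ADD/SUB) */
    int leftmost;                /* this leaf is the left-most child of its parent */
    int label;
} Tree;

#define R 2                                     /* number of available registers        */
static const char *rstack[R] = {"R0", "R1"};    /* register stack; the top is rstack[top] */
static int top = 0;
static int ntemp = 0;                           /* temporaries T0, T1, ... for spills   */

static void swap_top2(void) {
    const char *t = rstack[top]; rstack[top] = rstack[top + 1]; rstack[top + 1] = t;
}

static int labelit(Tree *n) {                   /* the labeling algorithm, k = 2 */
    if (!n->left) return n->label = n->leftmost ? 1 : 0;
    int l1 = labelit(n->left), l2 = labelit(n->right);
    return n->label = (l1 == l2) ? l1 + 1 : (l1 > l2 ? l1 : l2);
}

static void gencode(Tree *n) {
    if (!n->left) {                                            /* case 1: left-most leaf */
        printf("MOV %s, %s\n", n->name, rstack[top]);
    } else if (n->right->label == 0) {                         /* case 2: right operand is a simple name */
        gencode(n->left);
        printf("%s %s, %s\n", n->name, n->right->name, rstack[top]);
    } else if (n->right->label > n->left->label && n->left->label < R) {    /* case 3 */
        swap_top2();
        gencode(n->right);
        const char *r = rstack[top++];                         /* pop: r holds n2's value */
        gencode(n->left);
        printf("%s %s, %s\n", n->name, r, rstack[top]);
        top--;                                                 /* push r back */
        swap_top2();                                           /* result register back on top */
    } else if (n->right->label <= n->left->label && n->right->label < R) {  /* case 4 */
        gencode(n->left);
        const char *r = rstack[top++];                         /* pop: r holds n1's value */
        gencode(n->right);
        printf("%s %s, %s\n", n->name, rstack[top], r);
        top--;                                                 /* push r back; result already on top */
    } else {                                                   /* case 5: both subtrees need >= R registers */
        char t[16];
        sprintf(t, "T%d", ntemp++);
        gencode(n->right);
        printf("MOV %s, %s\n", rstack[top], t);                /* spill n2's value to a temporary */
        gencode(n->left);
        printf("%s %s, %s\n", n->name, t, rstack[top]);
    }
}

int main(void) {
    /* the labeled tree of Example 11.4: t4 = t1 - t3, t1 = A + B, t3 = E - t2, t2 = C + D */
    Tree A = {0, 0, "A", 1, 0}, B = {0, 0, "B", 0, 0}, C = {0, 0, "C", 1, 0},
         D = {0, 0, "D", 0, 0}, E = {0, 0, "E", 1, 0};
    Tree t1 = {&A, &B, "ADD", 0, 0}, t2 = {&C, &D, "ADD", 0, 0},
         t3 = {&E, &t2, "SUB", 0, 0}, t4 = {&t1, &t3, "SUB", 0, 0};
    labelit(&t4);
    gencode(&t4);    /* prints the instruction sequence of Table 11.7 */
    return 0;
}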
  5593. The algorithm above can be used when the DAG represented is a tree; but when there are common subexpressions in
  5594. the basic block, the DAG representation will no longer be a tree, because the common subexpressions will correspond
  5595. to nodes with more than one parent. These are called "shared nodes". In this case, we can apply the labeling and the
  5596. gencode() algorithm by partitioning the DAG into a set of trees. We find, for each shared node as well as root n, the
  5597. maximal subtree with n as a root that includes no other shared nodes, except as leaves. For example, consider the
  5598. DAG shown in Figure 11.7. It is not a tree, but it can be partitioned into the set of trees shown in Figure 11.8. The
  5599. procedure gencode() can be used to generate code for each node of this tree.
  5600. Figure 11.7: A nontree DAG.
  5601. Figure 11.8: A DAG that has been partitioned into a set of trees.
  5602. EXAMPLE 11.4
  5603. Consider the labeled tree shown in Figure 11.9.
  5604. Figure 11.9: Labeled tree for Example 11.4.
The code generated by gencode() for this tree, along with the recursive calls of gencode(), is shown in Table 11.7. The process starts with a call to gencode() on t4. Initially, the top two registers are R0 and R1.
  5607. Table 11.7: Recursive Gencode Calls
Call to Gencode() | Action Taken | RSTACK (top two registers) | Code Generated
(initially)       |              | R0, R1                     |
gencode(t4)       | Swap top two registers; call gencode(t3); pop R1; call gencode(t1); generate SUB R1, R0; push R1; swap top two registers | R1, R0 / R0, R1 / R1, R0 / R0, R1 | MOV E, R1; MOV C, R0; ADD D, R0; SUB R0, R1; MOV A, R0; ADD B, R0; SUB R1, R0
gencode(t3)       | Call gencode(E); pop R1; call gencode(t2); generate SUB R0, R1; push R1 | R1, R0 / R0 / R1, R0 | MOV E, R1; MOV C, R0; ADD D, R0; SUB R0, R1
gencode(E)        | Generate MOV E, R1 | R1, R0 | MOV E, R1
gencode(t2)       | Call gencode(C); generate ADD D, R0 | R0 | MOV C, R0; ADD D, R0
gencode(C)        | Generate MOV C, R0 | R0 | MOV C, R0
gencode(t1)       | Call gencode(A); generate ADD B, R0 | R0 | MOV A, R0; ADD B, R0
gencode(A)        | Generate MOV A, R0 | R0 | MOV A, R0
  5646. 11.6 USING ALGEBRAIC PROPERTIES TO REDUCE THE REGISTER
  5647. REQUIREMENT
  5648. It is possible to make use of algebraic properties like operator commutativity and associativity to reduce the register
  5649. requirements of the tree. For example, consider the tree shown in Figure 11.10.
  5650. Figure 11.10: Tree with a label of two.
  5651. The label of the tree in Figure 11.10 is two, but since + is a commutative operator, we can interchange the left and the
  5652. right subtrees, as shown in Figure 11.11. This brings the register requirement of the tree down to one.
  5653. Figure 11.11: The left and right subtrees have been interchanged, reducing the register requirement to one.
  5654. Similarly, associativity can be used to reduce the register requirement. Consider the tree shown in Figure 11.12.
  5655. Figure 11.12: Associativity is used to reduce a tree's register requirement.
  5656. 11.7 PEEPHOLE OPTIMIZATION
  5657. Code generated by using the statement-by-statement code-generation strategy contains redundant instructions and
  5658. suboptimal constructs. Therefore, to improve the quality of the target code, optimization is required. Peephole
optimization is an effective technique for locally improving the target code. Short sequences of target-code instructions are examined and replaced by shorter or faster sequences wherever possible. Typical optimizations that can be performed
  5661. are:
  5662. Elimination of redundant loads and stores
  5663. Elimination of multiple jumps
  5664. Elimination of unreachable code
  5665. Algebraic simplifications
Strength reduction
  5667. Use of machine idioms
  5668. Eliminating Redundant Loads and Stores
  5669. If the target code contains the instruction sequence:
1. MOV R, a
2. MOV a, R
we can delete instruction 2 if it is an unlabeled instruction. This is because instruction 1 ensures that the value of a is already in the register R. If instruction 2 is labeled, there is no guarantee that instruction 1 will always be executed before it.
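A peephole pass for this particular pattern can be sketched in C as follows. The instruction encoding (plain strings plus a "labeled" flag) and the sample instruction stream are assumptions made for the illustration; a real pass would work on the compiler's own instruction representation.

#include <stdio.h>
#include <string.h>

#define NINSTR 4

typedef struct {
    char text[32];
    int  labeled;      /* the instruction is a branch target */
} Instr;

int main(void) {
    /* assumed instruction stream: (2) is redundant; (4) is not removable
       because it is labeled and may be reached without executing (3)      */
    Instr code[NINSTR] = {
        {"MOV R0, a", 0}, {"MOV a, R0", 0}, {"MOV R1, b", 0}, {"MOV b, R1", 1}
    };
    for (int i = 0; i + 1 < NINSTR; i++) {
        char r[8], x[8], load[32];
        /* does instruction i store register r to x, and i+1 load x back into r? */
        if (sscanf(code[i].text, "MOV %7[^,], %7s", r, x) == 2) {
            snprintf(load, sizeof load, "MOV %s, %s", x, r);
            if (!code[i + 1].labeled && strcmp(code[i + 1].text, load) == 0) {
                printf("deleting redundant: %s\n", code[i + 1].text);
                code[i + 1].text[0] = '\0';   /* mark as deleted */
            }
        }
    }
    for (int i = 0; i < NINSTR; i++)
        if (code[i].text[0]) printf("%s\n", code[i].text);
    return 0;
}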
  5675. Eliminating Multiple Jumps
  5676. If we have jumps to other jumps, then the unnecessary jumps can be eliminated in either intermediate code or the
  5677. target code. If we have a jump sequence:
  5678. goto L1
  5679. ...
  5680. L1: goto L2
  5681. then this can be replaced by:
  5682. goto L2
  5683. ...
  5684. L1: goto L2
If there are now no jumps to L1, then it may be possible to eliminate the statement L1: goto L2, provided it is preceded by an unconditional jump. Similarly, the sequence:
  5687. if a < b goto L1
  5688. ...
  5689. L1: goto L2
  5690. can be replaced by:
  5691. if a < b goto L2
  5692. ...
  5693. L1: goto L2
  5694. Eliminating Unreachable Code
  5695. An unlabeled instruction that immediately follows an unconditional jump can possibly be removed, and this operation
  5696. can be repeated in order to eliminate a sequence of instructions. For debugging purposes, a large program may have
  5697. within it certain segments that are executed only if a debug variable is one. For example, the source code may be:
  5698. #define debug 0
  5699. ...
  5700. if (debug)
  5701. {
  5702. print debugging information
  5703. }
  5704. This if statement is translated in the intermediate code to:
if debug = 1 goto L1
goto L2
L1: print debugging information
L2:
  5708. One of the optimizations is to replace the pair:
  5709. if debug = 1 goto L1
  5710. goto L2
within the statements with a single conditional goto statement, by negating the condition and changing its target, as shown below:
if debug ≠ 1 goto L2
print debugging information
L2:
  5715. Since debug is a constant zero by constant propagation, this code will become:
  5716. if 0 ≠ 1 goto L2
  5717. Print debugging information
  5718. L2 :
  5719. Since 0 ≠ 1 is always true this will become:
  5720. goto L2
  5721. Print debugging information
  5722. L2 :
  5723. Therefore, the statements that print the debugging information are unreachable and can be eliminated, one at a time.
  5724. Algebraic Simplifications
If statements like:
x = x + 0
x = x * 1
are generated in the code, they can be eliminated, because zero is an additive identity and one is a multiplicative identity.
  5728. Reducing Strength
Certain machine instructions are considered to be cheaper than others. Hence, if we replace expensive operations by equivalent cheaper ones on the target machine, efficiency improves. For example, x^2 is invariably cheaper to implement as x * x than as a call to an exponentiation routine. Similarly, fixed-point multiplication or division by a power of two is cheaper to implement as a shift.
  5733. Using Machine Idioms
  5734. The target machine may have hardware instructions to implement certain specific operations efficiently. Detecting
  5735. situations that permit the use of these instructions can reduce execution time significantly. For example, some
  5736. machines have auto-increment and auto-decrement addressing modes. Using these modes can greatly improve the
  5737. quality of the code when pushing or popping a stack. These modes can also be used for implementing statements like
  5738. a = a + 1.
  5739. Chapter 12: Exercises
  5740. The exercises that follow are designed to provide further examples of the concepts covered in this book. Their
  5741. purpose is to put these concepts to work in practical contexts that will enable you, as a programmer, to better and
  5742. more-efficiently use algorithms when designing your compiler.
  5743. EXERCISE 12.1
  5744. Construct the regular expression that corresponds to the state transition diagram shown in Figure 12.1.
  5745. Figure 12.1: State transition diagram.
  5746. EXERCISE 12.2
  5747. Prove that regular sets are closed under intersection. Present a method for constructing a DFA with an intersection of
  5748. two regular sets.
  5749. EXERCISE 12.3
  5750. Transform the following NFA into an optimal/minimal state DFA.
  5751. 0 1
  5752. A A, C B D
  5753. B B D C
  5754. C C A, C D
  5755. D D A
  5756. EXERCISE 12.4
  5757. Obtain the canonical collection of sets of LR(1) items for the following grammar:
  5758. EXERCISE 12.5
  5759. Construct an LR(1) parsing table for the following grammar:
  5760. EXERCISE 12.6
  5761. Construct an LALR(1) parsing table for the following grammar:
  5762. EXERCISE 12.7
  5763. Construct an SLR(1) parsing table for the following grammar:
  5764. EXERCISE 12.8
  5765. Consider the following code fragment. Generate the three-address-code for it.
  5766. if a < b then
  5767. while c > d do
  5768. x = x + y
  5769. else
  5770. do
  5771. p = p + q
  5772. while e <= f
  5773. EXERCISE 12.9
  5774. Consider the following code fragment. Generate the three-address code for it.
  5775. for (i = 1; i <= 10; i++)
  5776. if a < b then x = y + z
  5777. EXERCISE 12.10
  5778. Consider the following code fragment. Generate the three-address-code for it.
  5779. switch a + b
  5780. {
  5781. case 1: x = x + 1
  5782. case 2: y = y + 2
  5783. case 3: z = z + 3
  5784. default: c = c -1
  5785. }
  5786. EXERCISE 12.11
  5787. Write the syntax-directed translations to go along with the LR parser for the following:
  5788. EXERCISE 12.12
  5789. Write the syntax-directed translations to go along with the LR parser for the following:
  5790. EXERCISE 12.13
  5791. There are syntactic errors in the following constructs. For each of these constructs, find out which of the input's next
  5792. tokens will be detected as an error by the LR parser.
1. while a = b do x = y + z
2. a + b = c
3. a *+ b + c
  5796. EXERCISE 12.14
  5797. Comment on whether the following statements are true or false:
1. Given a finite automaton M(Q, Σ, δ, q0, F) that accepts L(M), the automaton M1(Q, Σ, δ, q0, Q − F) accepts the complement of L(M). If M is an optimal (minimal-state) automaton, then M1 is also a minimal-state automaton.
2. Every subset of a regular set is also a regular set.
3. In a top-down backtracking parser, the order in which the various alternatives are tried may affect the language accepted by the parser.
4. An LR parser detects an error when the symbol coming next in the input is not a valid continuation of the prefix of the input seen so far by the parser.
5. Grammar ambiguity necessarily implies ambiguity in the language generated by that grammar.
6. Every name is added to the symbol table during the lexical analysis phase, irrespective of the semantic role played by the name.
7. Given a grammar with no useless symbols but containing unit productions, if the unit productions are eliminated from the grammar, then it is possible that some of the grammar symbols in the resulting grammar may become useless.
8. In any unambiguous grammar without useless symbols, the handle of a given right-sentential form is unique.
  5820. Index
  5821. A
  5822. Action specification in LEX, 46-47
  5823. Action tables
  5824. Action | GOTO tables, 140
  5825. arrays to represent, 178-179
  5826. LALR parsing tables, 165-169
  5827. for LR(1) parser, 163-165
  5828. for SLR(1) parser, 152-161
  5829. Activation records, 248-249
  5830. Addressing modes, machine model and, 297-299
  5831. Algebraic properties, register requirements reduced with, 317-318
  5832. Alphabet, defined for lexical analysis, 6
  5833. Ambiguous grammars and bottom-up parsing, 172-177
  5834. AND operator and translation, 214-215
  5835. Arithmetic expressions, translation of, 208-211
  5836. Array references, 225-229
  5837. Arrays, to represent action tables, 178-179
  5838. Attributes
  5839. defined, 196
  5840. dummy synthesized attributes, 199-201
  5841. inherited attributes, 198-199
  5842. synthesized attributes, 197-198
  5843. Augmented grammars, 142-146, 175-176
  5844. Automatas, equivalence of, 51-52
  5846. B
  5847. Back end compilers, 4
  5848. Back-patching, 5
  5849. Backtracking parsers, 95
  5850. recursive descent parsers, 94-118
  5851. Block statements and stack allocation, 256-257
  5852. Boolean expressions, translation of, 211-214
  5853. Bootstrap compilers, defined, 1-2
  5854. Bottom-up parsing
  5855. Action | GOTO tables, 140
  5856. ambiguous grammars, 172-177
  5857. canonical collection of sets algorithm, 146-152
  5858. defined and described, 135-136
  5859. handles of right sentential form, 136-138
  5860. implementation of, 138-140
  5861. LALR parsing, 165-166, 190-194
  5862. LR parsers, 140-142
  5863. LR(1) parsing, 163-165, 179-194
  5864. Braces {} in syntax-directed translation schemes, 202-203
  5866. C
  5867. Call and return sequences, stack allocation and, 250-253
  5868. Canonical collection of sets
  5869. algorithm for, 146-152
  5870. exercises, 324
  5871. of LR(1), algorithm, 161-163
  5872. Cartesian products, set operation, 7
  5873. CASE statements, 229-234
  5874. Closure
  5875. property closure of a relation, 9
  5876. set operation, 7-8
  5877. Closure operations, regular sets and, 47
  5878. Code generation phase, 2, 3, 4
  5879. DAGs and, 305-316
  5880. difficulties encountered during, 296-297
  5881. getreg() function and, 300-305
  5882. labeled trees and, 307-316
  5883. straightforward strategy for, 299-305
  5884. Code optimization phase, 2, 3
  5885. algebraic properties to reduce register requirements, 317-318
  5886. algebraic simplifications, 320
  5887. defined and described, 269-270
  5888. global common subexpressions, eliminating, 290-292
  5889. jumps, eliminate multiple, 319
  5890. loads and stores, eliminating redundancy, 319
  5891. local common subexpressions, eliminating, 288-290
  5892. loop optimization, 270-284
  5893. machine idioms and, 321
  5894. partitioning three-address code into basic blocks, 271-273
  5895. peephole optimization, 318-321
  5896. reducible flow graphs and, 274-284
  5897. strength reduction, 321
  5898. unreachable code, eliminating, 319-320
  5899. Compilation, process described, 2-5
  5900. Compilers
  5901. defined, 1
  5902. front-end vs. back-end compilers, 4
  5903. organization of, 4
  5904. Computational order, 296
  5905. Concatenation
  5906. defined, 6
  5907. set operation, 7
  5908. Concatenation operation, regular sets and, 47
  5909. Context-free grammars (CFGs)
  5910. algorithm for identifying useless symbols, 64
  5911. defined and described, 54
  5912. derivation in, 55-56
  5913. ∈ -productions and, 70-73
  5914. left linear grammar, 86-90
  5915. left-recursive grammar, 75-77
  5916. productions (P) in, 54
  5917. reduction of grammar, 61-70
  5918. regular grammar as, 77-85
  5919. right linear grammar, 85-86
  5920. SLR(1) grammars, 152-161
  5921. start symbol (S) in, 54
  5922. in syntax analysis phase, 53-54
  5923. terminals (T) in, 54
  5924. unit productions and, 73-75
  5925. variables (V) or nonterminals in, 54, 56
  5926. Cross-compilers, defined, 1-2
  5928. D
  5929. DAGs. See Directed acyclic graphs (DAGs)
  5930. Data storage. See Storage management
  5931. Data structures for representing parsing tables, 178-179
  5932. Dead states of DFAs, 27
  5933. detection of, 31
  5934. Decrement operators, implementation of, 224-225
  5935. Dependency graphs, 199-201
  5936. Derivation
  5937. in context-free grammar, 55-56
  5938. derivation trees in CFG, 56-61
  5939. Detection, of DFA unreachable and dead states, 28-31
  5940. Deterministic finite automata (DFA)
  5941. Action | GOTO tables, 141-142
  5942. augmented grammar and, 142-146
  5943. equivalent to NFAs with ∈ -moves, 23-27
  5944. exercises, 323-324
  5945. minimization of, 27-31
  5946. minimization/optimization of, 27-31
  5947. transforming NFAs into, 16-18
  5948. DFA. See Deterministic finite automata (DFA)
  5949. Directed acyclic graphs (DAGs), 288-290
  5950. code generation and, 305-316
  5951. heuristic DAG ordering, 305-307
  5952. labeling algorithm and, 307-309
  5953. DO-WHILE statements and translation, 220-221
  5954. Dummy synthesized attributes, 199-201
  5956. E
  5957. ∈ -closure(q), finding, 19-20
  5958. ∈ -moves
  5959. acceptance of strings by NFAs with, 19
  5960. equivalence of NFAs with and without, 21-22
  5961. finding ∈ -closure(q), 19-20
  5962. NFAs with, 18-27
  5963. ∈ -productions
  5964. defined, 70
  5965. eliminating, 71-73
  5966. and nonnullable nonterminals, 70-71
  5967. regular grammar and, 77-84
  5968. ∈ -transitions, 18
  5969. Equivalence of automata, 51-52
  5970. Error handling
  5971. detection and report of errors, 259-260
  5972. exercises, 325
  5973. lexical phase errors, 260
  5974. in LR parsing, 261-264
  5975. panic mode recovery, 261
  5976. phase level recovery, 261-264
  5977. predictive parsing error recovery, 264-267
  5978. semantic errors and, 268
  5979. YACC and, 264
  5980. Errors. See Error handling
  5982. F
  5983. Finite automata
  5984. construction of, 31-38
  5985. defined, 11
  5986. exercises, 326
  5987. non-deterministic finite automata (NFA), 14-16
  5988. specification of, 11-14
  5989. strings and, 13, 15-16
  5990. FOR statements and translation, 223-224
  5991. Front-end compilers, 4
  5993. G
  5994. Gencode() function, 313-316
  5995. Getreg() function, 300-305
  5996. Global common subexpressions, eliminating, 290-292
  5997. GOTO tables, 140
  5998. construction of, 152-161
  5999. for LR(1) parser, 163-165
  6000. Grammars, exercises
  6001. ambiguous grammars, 172-177
  6002. augmented grammar, 142-146, 175-176
  6003. left-recursive grammar, 75-77
  6004. useless grammar symbols (reduction of), 61-70
  6006. H
  6007. Handle pruning, 137
  6008. Hash tables for organization of symbol tables, 243-244
  6010. I
  6011. IF-THEN-ELSE statements and translation, 216-218
  6012. IF-THEN statements and translation, 218-219
  6013. Increment operators, implementation of, 224-225
  6014. Indirect triple representation, 206-207
  6015. Induction variables of loops
  6016. defined, 284-285
  6017. detecting and eliminating, 285-288
  6018. Inherited attributes, 198-199
  6019. Input files, LEX, 46-47
  6020. Intermediate code generation phase, 2, 3
  6021. Intersection, set operation, 7
  6023. J-K
  6024. Jumps
  6025. and Boolean translation, 213-214
  6026. eliminating multiple, 319
  6028. L
  6029. LALR parsing, 165-166, 190-194
  6030. Language, defined for lexical analysis, 6
  6031. Language tokens, lexical analysis and, 5
  6032. L-attributed definitions, 201
  6033. Left linear grammar, 86-90
  6034. LEX compiler-writing tool, 45-46
  6035. action specification in, 46-47
  6036. format for input or source files, 46-47
  6037. pattern specification in, 46-47
  6038. Lexemes, 5
  6039. Lexical analysis
  6040. design of lexical analyzers, 45-47
  6041. phase of compiling, 2-3, 5, 260
  6042. Lexical analyzers, design of, 45-47
  6043. Lexical phase, 2-3, 5
  6044. error recovery, 260
  6045. Linear lists for organization of symbol tables, 242
  6046. Local common subexpressions, eliminating, 288-290
  6047. Logical expressions
  6048. AND operator, 214-215
  6049. DO-WHILE statements, 220-221
  6050. FOR statements, 223-224
  6051. IF-THEN-ELSE statements, 216-218
  6052. IF-THEN statements, 218-219
  6053. NOT operator, 215-216
  6054. OR operator, 215
  6055. REPEAT statements, 222-223
  6056. translation and, 214-224
  6057. WHILE statements, 219-220
  6058. Loop invariant computations, 271
  6059. Loop jamming, 293-294
  6060. Loop optimizations, 270-284
  6061. back edge identification, 273-274
  6062. induction variables, reduction of, 284-288
  6063. loop detection, 273
  6064. loop jamming, 293-294
  6065. loop unrolling, 292-293
  6066. reducible flow graphs and, 274-284
  6067. Loop unrolling, 292-293
  6068. LR parsers and parsing, 140-142, 179-194
  6069. LR(1) parsers and parsing
  6070. action tables, 163-165
  6071. exercises, 324
  6073. M
  6074. Machine model described, 297-299
  6075. Memory. See Storage management
  6076. Memory addresses, machine model and, 297-299
  6078. N
  6079. Names
  6080. access to nonlocal names, 253-255
  6081. address descriptors and, 299
  6082. held in symbol tables, 241
  6083. runtime name storage, 241
  6084. scope of name, 244-246
  6085. Non-deterministic finite automata (NFA)
  6086. defined and described, 14
  6087. DFA equivalents of, 23-27
  6088. with ∈ -moves, 18-27
  6089. equivalence and ∈ -moves, 21-22
  6090. strings and, 15-16
  6091. transformation into deterministic (DFA), 16-18
  6092. Nondistinguishable states of DFAs, 27
  6093. Nonlocal names, 253-255
  6094. Nonterminals in context-free grammar, 54, 56
  6095. NOT operator and translation, 215-216
  6097. O
  6098. Opcodes, machine model and, 297-299
  6099. Operators
  6100. for regular expressions, 40
  6101. translation and Boolean operators, 214-216
  6102. Optimizations
  6103. of DFAs, 27-31 see also Code optimization phase
  6104. OR operator and translation, 215
  6106. P
  6107. Panic mode recovery, 261
  6108. Parsers and parsing
  6109. action tables, 140
  6110. backtracking parsers, 95
  6111. conflicts, 169-171
  6112. data structures for representing parsing tables, 178-179
  6113. defined and described, 91
  6114. LALR parsing, 165-169, 190-194
  6115. LR parsers, 140-142
  6116. LR(1) parsers, action tables, 163-165
  6117. predictive top-down parsers, 118-133
  6118. table-driven predictive parsers, 123-133
  6119. see also Bottom-up parsing; Parse trees; Syntax analysis phase; Top-down parsing
  6120. Parse trees
  6121. in CFG, 56-61
  6122. derivation trees in CFG, 56-61
  6123. labeled trees and code generation, 307-316
  6124. node labeling algorithm, 307-309
  6125. symbol table organization with, 242-243
  6126. syntax trees, 203-204
  6127. Pattern specification in LEX, 46-47
  6128. Peephole optimization, 318-321
  6129. Postfix notation, 203
  6130. Power set, set operation, 7
  6131. Predictive parsing
  6132. error recovery and, 264-267
  6133. predictive top-down parsers, 118-133
  6134. Predictive top-down parsers, 118-133
  6135. Prefixes, defined, 6
  6136. Procedure calls, 234-235
  6137. Productions (P) in context-free grammar, 54
  6139. Q
  6140. Quadruple representation, 205-206, 207
  6142. R
  6143. Recursion, eliminating left recursion, 75-77
  6144. Recursive descent parsers, implementation, 94-118
  6145. Reduce-reduce conflicts, 170-171
  6146. Reducible flow graphs
  6147. and code optimization, 274-284
  6148. loop invariant statements and, 282-283
  6149. Reduction of grammar, 61-70
  6150. algorithm for identifying useless symbols, 64
  6151. bottom-up parsing and, 135-136
  6152. Registers
  6153. algebraic properties to reduce requirements for, 317-318
  6154. register descriptors, 299
  6155. RSTACK to allocate, 309-313
  6156. selecting for computation, 297
  6157. Regular expression notation
  6158. finite automata definitions, 6-8
  6159. role in lexical analysis, 5
  6160. Regular expressions
  6161. defined and described, 39-43
  6162. exercise, 323
  6163. lexical analyzer design and, 45
  6164. obtained from finite automata, 43-44
  6165. obtained from regular grammar, 84-85
  6166. operators for, 40 see also Regular expression notation
  6167. Regular grammar, 77-85
  6168. defined, 77
  6169. ∈-productions and, 77
  6170. regular expressions from, 84-85
  6171. Regular sets, 39
  6172. exercises, 326
  6173. lexical analyzer design and, 45
  6174. properties of, 47-51
  6175. Relations
  6176. defined and described, 8
  6177. properties of, 8-9
  6178. property closure of, 9
  6179. symbol for in CFG, 54
  6180. REPEAT statements and translation, 222-223
  6181. Return sequences, stack allocation and, 250-253
  6182. Right linear grammar, 85-86
  6183. RSTACKs, allocating registers with, 309-313
  6185. S
  6186. Scope rules and scope information, 244-246, 253
  6187. Search trees for organization of symbol tables, 242-243
  6188. Sentential form handles, 136-138
  6189. Set difference, set operation, 7
  6190. Set operations, defined, 7
  6191. Sets
  6192. defined, 7
  6193. regular sets, 39, 45, 47-51
  6194. relations between, 8-9
  6195. Shift-reduce conflicts, 169
  6196. SLR(1)
  6197. exercises, 324
  6198. grammars, 152-161
  6199. SLR parsing, 151-162, 176-177, 180-190
  6200. Source files, LEX, 46-47
  6201. Stack allocation
  6202. access link set up, 255-257
  6203. access to nonlocal names and, 253-255
  6204. block statements and, 256-257
  6205. call and return sequences, 250-253
  6206. Start symbol (S) in context-free grammar, 54
  6207. Storage management
  6208. heap memory storage, 247-248
  6209. procedure activation and activation records, 248-249
  6210. stack allocation, 250-257
  6211. static allocation, 250
  6212. storage allocation, 247-248
  6213. Strings, defined, 6
  6214. Suffixes, defined, 6
  6215. SWITCH statements, translation of, 229-234
  6216. Symbol tables
  6217. defined and described, 239
  6218. exercises, 326
  6219. hash tables for organization of, 243-244
  6220. implementation of, 239-240
  6221. information entry for, 240
  6222. linear lists for organization of, 242
  6223. names held in, 241
  6224. scope information, 244-246
  6225. search trees for organization of, 242-243
  6226. Syntactic phase error recovery, 260-261
  6227. Syntax analysis phase, 2-3
  6228. context-free grammar and, 53-54
  6229. error recovery during syntactic phase, 260-261
  6230. Syntax-directed definitions
  6231. L-attributed definitions, 201
  6232. translation and, 195-201
  6233. Syntax-directed translations and translation schemes, 202-203
  6234. Syntax trees, 203-204
  6235. Synthesized attributes, 197-198
  6236. dummy synthesized attributes, 199-201
  6238. T
  6239. Table-driven predictive parsers, implementation, 123-133
  6240. Terminals (T) in context-free grammar, 54
  6241. Three-address code, 204-205
  6242. exercises, 324-325
  6243. partitioning into basic blocks, 271-273
  6244. Three-address statements, representation of, 205-207, 296
  6245. Tokens, lexical analysis and, 5
  6246. Top-down parsing
  6247. defined and described, 91-92
  6248. exercises, 326
  6249. implementation, 94-118
  6250. predictive top-down parsers, 118-133
  6251. Translations and translation schemes
  6252. of arithmetic expressions, 208-211
  6253. of array references, 225-229
  6254. of Boolean expressions, 211-214
  6255. of decrement and increment operators, 224-225
  6256. examples of, 235-238
  6257. exercises, 325
  6258. intermediate code generation and, 203-205
  6259. of logical expressions, 214-224
  6260. procedure calls and, 234-235
  6261. specification of, 195-196
  6262. of SWITCH / CASE statements, 229-234
  6263. syntax-directed definitions, 195-201
  6264. Trees. See Parse trees
  6265. Triple representation, 206
  6267. U
  6268. Union set operation, 7
  6269. and regular sets, 47
  6270. Unit productions defined, 73
  6271. elimination of, 73-75
  6272. Unreachable states of DFAs, 27
  6273. detecting, 28-31
  6275. V
  6276. Variables (V) in context-free grammar, 54, 56
  6278. W-X
  6279. WHILE statements and translation, 219-220
  6281. Y-Z
  6282. YACC, error handling and, 264
  6283. List of Figures
  6284. Chapter 1: Introduction
  6285. Figure 1.1: Compilation process phases.
  6286. Figure 1.2: Syntax analysis imposes a structure hierarchy on the token string.
  6287. Chapter 2: Finite Automata and Regular Expressions
  6288. Figure 2.1: Transition diagram for finite automata δ(p, a) = q.
  6289. Figure 2.2: Transition diagram for finite automata that handles several transitions.
  6290. Figure 2.3: Transition diagram for M = ({q0, q1, q2, q3}, {0, 1}, δ, q0, {q3}).
  6291. Figure 2.4: Finite automata with ∈-moves.
  6292. Figure 2.5: Transitioning from an ∈-move NFA to a non-∈-move NFA.
  6293. Figure 2.6: Making the initial state of the NFA one of the final states.
  6294. Figure 2.7: Example 2.1 NFA.
  6295. Figure 2.8: Example 2.2 DFA equivalent to an NFA.
  6296. Figure 2.9: Partitioning down to a single state.
  6297. Figure 2.10: Merging nondistinguishable states B and C into a single state B1.
  6298. Figure 2.11: Transition diagram for Example 2.3 finite automata.
  6299. Figure 2.12: Finite automata containing even number of zeros and odd number of ones.
  6300. Figure 2.13: Finite automata containing odd number of zeros and even number of ones.
  6301. Figure 2.14: Example 2.6 finite automata considers the set prefix.
  6302. Figure 2.15: Finite automata accepts strings containing the substring 101.
  6303. Figure 2.16: DFA using the names A-D and q0-q5.
  6304. Figure 2.17: Complement to Figure 2.16 automata.
  6305. Figure 2.18: DFA after minimization.
  6306. Figure 2.19: Finite automata that accepts decimal strings that are divisible by three.
  6307. Figure 2.20: Finite automata accepts strings containing 101.
  6308. Figure 2.21: Finite automata identified by the state names A-D and q0-q5.
  6309. Figure 2.22: Complement to Figure 2.21 automata.
  6310. Figure 2.23: Minimization of nondistinguishable states of Figure 2.22.
  6311. Figure 2.24: Automata that accepts binary strings that are divisible by three.
  6312. Figure 2.25: Transition diagram for (a + b).
  6313. Figure 2.26: Transition diagram for (a + b)*.
  6314. Figure 2.27: Transition diagram for a.(a + b)*.
  6315. Figure 2.28: Automata for a.(a + b)*.b.
  6316. Figure 2.29: Automata for a.(a + b)*.b.b.
  6317. Figure 2.30: Deriving the regular expression for a regular set.
  6318. Figure 2.31: Transition diagram.
  6319. Figure 2.32: Complement to transition diagram in Figure 2.31.
  6320. Figure 2.33: Transition diagram of automata M1.
  6321. Figure 2.34: Transition diagram of automata M2.
  6322. Chapter 3: Context-Free Grammar and Syntax Analysis
  6323. Figure 3.1: Derivation tree for the string id + id * id.
  6324. Figure 3.2: Parse tree resulting from leaf-node concatenation.
  6325. Figure 3.3: Multiple parse trees.
  6326. Figure 3.4: Ambiguous grammar parse trees.
  6327. Figure 3.5: Parse tree generated by using both the right- and left-most derivation orders.
  6328. Figure 3.6: Parse tree generated from both the left- and right-most orders of derivation.
  6329. Figure 3.7: Transition diagram for automata that accepts the regular grammar of Example 3.13.
  6330. Figure 3.8: Deterministic equivalent of the non-deterministic automata shown in Figure 3.7.
  6331. Figure 3.9: Non-deterministic automata.
  6332. Figure 3.10: Transition diagram for deterministic automata equivalent shown in Figure 3.9.
  6333. Figure 3.11: Regular-grammar automata.
  6334. Figure 3.12: Transition diagram of automata that accepts L(G1).
  6335. Figure 3.13: Transition diagram of automata after removal of state D.
  6336. Figure 3.14: Transition diagram for the automata that results from merged states.
  6337. Figure 3.15: Non-deterministic automata that accepts L(G2).
  6338. Figure 3.16: Transition diagram of the equivalent deterministic automata for Figure 3.15.
  6339. Figure 3.17: Finite automata accepting the right linear grammar for a regular expression.
  6340. Figure 3.18: Transition diagram for a finite automata specified by a reversed regular expression.
  6341. Chapter 4: Top-Down Parsing
  6342. Figure 4.1: Parser uses the S-production to expand the parse tree.
  6343. Figure 4.2: Parser uses the first alternative for A in order to expand the tree.
  6344. Figure 4.3: If the parser fails to match a leaf, the point of failure, d, reroutes (backtracks) the pointer to
  6345. alternative paths from A.
  6346. Figure 4.4: The parser first expands S and fails to accept w = acdb.
  6347. Figure 4.5: The parser advances to c and considers nonterminal A for expansion.
  6348. Figure 4.6: The parser first expands S.
  6349. Figure 4.7: The parser advances the pointer to a second occurrence of a.
  6350. Figure 4.8: The parser expands the next leaf labeled S.
  6351. Figure 4.9: The parser finds no match, so it backtracks.
  6352. Figure 4.10: The parser tries an alternate aa.
  6353. Figure 4.11: There is no further alternate of S that can be tried, so the parser will backtrack one more step.
  6354. Figure 4.12: The parser again finds a mismatch; hence, it backtracks.
  6355. Figure 4.13: The parser tries an alternate aa.
  6356. Figure 4.14: Since no alternate of S remains to be tried, the parser backtracks one more step.
  6357. Figure 4.15: The parser tries an alternate aa.
  6358. Figure 4.16: The parser arrives at the required parse tree.
  6359. Figure 4.17: The parser first expands S.
  6360. Figure 4.18: The parser advances the pointer to a second occurrence of a.
  6361. Figure 4.19: The parser considers the next leaf labeled by S.
  6362. Figure 4.20: The parser matches the third input symbol and moves on to the next leaf labeled by S.
  6363. Figure 4.21: The parser considers the fourth occurrence of the input symbol a.
  6364. Figure 4.22: The parser finds no match, so it backtracks.
  6365. Figure 4.23: The parser tries an alternate aa.
  6366. Figure 4.24: No alternate of S can be tried, so the parser will backtrack one more step.
  6367. Figure 4.25: Again finding a mismatch, the parser backtracks.
  6368. Figure 4.26: The parser then tries an alternate.
  6369. Figure 4.27: No alternate of S remains to be tried, so the parser will backtrack one more step.
  6370. Figure 4.28: The parser again finds a mismatch; therefore, it backtracks.
  6371. Figure 4.29: The parser tries an alternate aa.
  6372. Figure 4.30: The parser then tries an alternate aa.
  6373. Figure 4.31: The parser successfully generates the parse tree for aaaa.
  6374. Figure 4.32: The parser expands S.
  6375. Figure 4.33: The parser matches the first symbol, advances to the second occurrence of a, and considers S for
  6376. expansion.
  6377. Figure 4.34: The parser finds a match for the second occurrence of a and expands S.
  6378. Figure 4.35: The parser matches the third input symbol, considers the next leaf, and expands S.
  6379. Figure 4.36: The parser matches the fourth input symbol, considers the next leaf, and expands S.
  6380. Figure 4.37: A match is found for the fifth input symbol, so the parser considers the next leaf, and expands S.
  6381. Figure 4.38: The sixth input symbol also matches. So the next leaf is considered, and S is expanded.
  6382. Figure 4.39: No match is found, so the parser backtracks to S.
  6383. Figure 4.40: The parser backtracks one more step.
  6384. Figure 4.41: The parser tries the alternate aa.
  6385. Figure 4.42: Again, a mismatch is found. So, the parser backtracks.
  6386. Figure 4.43: No alternate of S remains, so the parser will backtrack one more step.
  6387. Figure 4.44: The parser tries an alternate aa.
  6388. Figure 4.45: Again, a mismatch is found. The parser backtracks.
  6389. Figure 4.46: The parser then tries an alternate aa.
  6390. Figure 4.47: A mismatch is found, and the parser backtracks.
  6391. Figure 4.48: The parser tries for the alternate aa, fails to find a match, and cannot generate the parse tree for six
  6392. occurrences of a.
  6393. Chapter 5: Bottom-up Parsing
  6394. Figure 5.1: NFA transition diagram recognizes viable prefixes.
  6395. Figure 5.2: Using subset construction, a DFA equivalent is derived from the transition diagram in Figure 5.1.
  6396. Figure 5.3: DFA transition diagram showing four iterations for a canonical collection of sets.
  6397. Figure 5.4: Transition diagram for Example 5.2 DFA.
  6398. Figure 5.5: DFA Transition diagram.
  6399. Figure 5.6: Transition diagram for the canonical collection of sets of LR(1) items.
  6400. Figure 5.7: Transition diagram for a DFA using a reduced collection.
  6401. Figure 5.8: LR(0) underlying set representations that can cause SLR parser conflicts.
  6402. Figure 5.9: LR(1) underlying set representations that can cause CLR/LALR parser conflicts.
  6403. Figure 5.10: LR(0) underlying set representations that can cause an SLR parser reduce-reduce conflict.
  6404. Figure 5.11: LR(1) underlying set representations that can cause a CLR/LALR parser reduce-reduce conflict.
  6405. Figure 5.12: Sets of LR(1) items represent two different CLR(1) parser states.
  6406. Figure 5.13: States are combined to form an LALR state.
  6407. Figure 5.14: LR(1) items represent two different states of the CLR(1) parser.
  6408. Figure 5.15: LALR state set resulting from the combination of CLR(1) state sets.
  6409. Figure 5.16: Transition diagram for augmented grammar DFA.
  6410. Figure 5.17: States with actions in common point to the same location via an array.
  6411. Figure 5.18: List that incorporates the ability to append actions.
  6412. Figure 5.19: Transition diagram for the canonical collection of sets of LR(0) items in Example 5.3.
  6413. Figure 5.20: DFA transition diagram for Example 5.4.
  6414. Figure 5.21: Collection of nonempty sets of LR(1) items for Example 5.7.
  6415. Chapter 6: Syntax-Directed Definitions and Translations
  6416. Figure 6.1: The attribute value of node X is inherently dependent on the attribute value of node Y.
  6417. Figure 6.2: An annotated parse tree.
  6418. Figure 6.3: Parse tree with node attributes for the string int id1,id2,id3.
  6419. Figure 6.4: Dependency graph with four nodes.
  6420. Figure 6.5: Parse tree for the string id+id*id.
  6421. Figure 6.6: Syntax tree for id+id*id.
  6422. Figure 6.7: Values of attributes at the parse tree node for the string a + b * c.
  6423. Figure 6.8: Values of the attributes at the parse tree nodes for a + b * c, id.place = addr(symtab rec of a).
  6424. Figure 6.9: Values of the attributes at the parse tree nodes for a + b * c, id.place = addr(symtab rec of a).
  6425. Figure 6.10: Translation scheme for a Boolean expression containing and, not, and or.
  6426. Figure 6.11: The addition of the nullable nonterminal N facilitates an unconditional jump.
  6427. Figure 6.12: A nullable nonterminal M provisions the translation of if-then.
  6428. Figure 6.13: The translation of the Boolean while statement is facilitated by a nullable nonterminal M.
  6429. Figure 6.14: Translation of the Boolean do-while.
  6430. Figure 6.15: Translation of Boolean repeat-until.
  6431. Figure 6.16: Handling the translation of the Boolean for.
  6432. Figure 6.17: A switch/case three-address translation.
  6433. Figure 6.18: Nullable nonterminals are introduced into a switch statement translation.
  6434. Figure 6.19: Contents of queue during the translation.
  6435. Chapter 7: Symbol Table Management
  6436. Figure 7.1: A pointer steers the symbol table to remotely stored information for the array a.
  6437. Figure 7.2: Symbol table names are held either in the symbol table record or in a separate string table.
  6438. Figure 7.3: A new record is added to the linear list of records.
  6439. Figure 7.4: The search tree organization approach to a symbol table.
  6440. Figure 7.5: Hash table method of symbol table organization.
  6441. Figure 7.6: Symbol table organization that complies with static scope information rules.
  6442. Chapter 8: Storage Management
  6443. Figure 8.1: Heap memory storage allows program-controlled data allocation.
  6444. Figure 8.2: Typical format of an activation record.
  6445. Figure 8.3: The CEP pointer is used to access the contents of the activation record.
  6446. Figure 8.4: Typical callee code segment.
  6447. Figure 8.5: An activation record that deals with nonlocal name references.
  6448. Figure 8.6: A typical callee segment.
  6449. Figure 8.7: Storage for declared names.
  6450. Chapter 10: Code Optimization
  6451. Figure 10.1: Program flow graph.
  6452. Figure 10.2: The flow graph back edges are identified by computing the dominators.
  6453. Figure 10.3: A flow graph with no back edges.
  6454. Figure 10.4: Flow graph with GEN and KILL block sets.
  6455. Figure 10.5: Nonunique solution to a data flow equation, where B is a predecessor of itself.
  6456. Figure 10.6: A flow graph containing a loop invariant statement.
  6457. Figure 10.7: Moving a loop invariant statement changes the semantics of the program.
  6458. Figure 10.8: Moving the preheader changes the meaning of the program.
  6459. Figure 10.9: Moving a value to the preheader changes the original meaning of the program.
  6460. Figure 10.10: Flow graph where I is a basic induction variable.
  6461. Figure 10.11: Modified flow graph.
  6462. Figure 10.12: Flow graph preheader modifications.
  6463. Figure 10.13: DAG representation of a basic block.
  6464. Chapter 11: Code Generation
  6465. Figure 11.1: DAG Representation.
  6466. Figure 11.2: DAG Representation with heuristic ordering.
  6467. Figure 11.3: DAG representation of three-address code for Example 11.3.
  6468. Figure 11.4: DAG representation tree after labeling.
  6469. Figure 11.5: The node n is an operand and not a subtree root.
  6470. Figure 11.6: The node n is an operator, and n1 and n2 are subtree roots.
  6471. Figure 11.7: A nontree DAG.
  6472. Figure 11.8: A DAG that has been partitioned into a set of trees.
  6473. Figure 11.9: Labeled tree for Example 11.4.
  6474. Figure 11.10: Tree with a label of two.
  6475. Figure 11.11: The left and right subtrees have been interchanged, reducing the register requirement to one.
  6476. Figure 11.12: Associativity is used to reduce a tree's register requirement.
  6477. Chapter 12: Exercises
  6478. Figure 12.1: State transition diagram.
  6479. List of Tables
  6480. Chapter 4: Top-Down Parsing
  6481. Table 4.1: Production Selections for Parsing Derivations
  6482. Table 4.2: Production Selections for Parsing Derivations
  6483. Table 4.3: Steps Involved in Parsing the String acdb
  6484. Table 4.4: Production Selections for String ab Parsing Derivations
  6485. Table 4.5: Production Selections for Parsing Derivations for the String adb
  6486. Table 4.6: Production Selections for Example 4.3 Parsing Derivations
  6487. Table 4.7: Production Selections for Example 4.5 Parsing Derivations
  6488. Table 4.8: Production Selections for Example 4.6 Parsing Derivations
  6489. Table 4.9: Production Selections for Example 4.7 Parsing Derivations
  6490. Table 4.10: Production Selections for Example 4.8 Parsing Derivations
  6491. Chapter 5: Bottom-up Parsing
  6492. Table 5.1: Sentential Form Handles
  6493. Table 5.2: Sentential Form Handles
  6494. Table 5.3: Steps in Parsing the String id + id * id
  6495. Table 5.4: Action|GOTO SLR Parsing Table
  6496. Table 5.5: SLR Parsing Table
  6497. Table 5.6: Action | GOTO SLR Parsing Table
  6498. Table 5.7: CLR/LR Parsing Action | GOTO Table
  6499. Table 5.8: LALR Parsing Table for a DFA Using a Reduced Collection
  6500. Table 5.9: SLR Parsing Table for Augmented Grammar
  6501. Table 5.10: SLR Parsing Table Reflects Higher Precedence and Left-Associativity
  6502. Table 5.11: SLR(1) Parsing Table
  6503. Table 5.12: SLR Parsing Table for Example 5.4
  6504. Table 5.13: Parsing Table for Example 5.5
  6505. Table 5.14: LALR(1) Parsing Table for Example 5.5
  6506. Table 5.15: LALR(1) Parsing Table for Example 5.6
  6507. Chapter 6: Syntax-Directed Definitions and Translations
  6508. Table 6.1: Quadruple Representation of x = (a + b) * − c/d
  6509. Table 6.2: Triple Representation of x = (a + b) * − c/d
  6510. Table 6.3: Indirect Triple Representation of x = (a + b) * − c/d
  6511. Chapter 7: Symbol Table Management
  6512. Table 7.1: Symbol Table Contents Using a Nesting Depth Approach
  6513. Chapter 9: Error Handling
  6514. Table 9.1: Parsing Table for E → E + E | E * E | id
  6515. Table 9.2: Higher Precedence * and Left-Associativity
  6516. Table 9.3: Parsing Table with Error Routines
  6517. Table 9.4: LR Parsing Table
  6518. Table 9.5: Phrase Level Error-Recovery Implementation
  6519. Chapter 10: Code Optimization
  6520. Table 10.1: GEN and KILL sets for Figure 10.4 Flow Graph
  6521. Table 10.2: IN and OUT Computation for Figure 10.5
  6522. Table 10.3: First Iteration for the IN and OUT Values
  6523. Table 10.4: Second Iteration for the IN and OUT Values
  6524. Table 10.5: Third Iteration for the IN and OUT Values
  6525. Table 10.6: Fourth Iteration for the IN and OUT Values
  6526. Chapter 11: Code Generation
  6527. Table 11.1: Six-Bit Registers for the Instruction MOV R0, R1
  6528. Table 11.2: Six-Bit Registers for the Instruction MOV R0, R2
  6529. Table 11.3: Six-Bit Registers for the Instruction MOV M1, M2
  6530. Table 11.4: Computation for the Expression x = (a + b) − ((c + d) − e)
  6531. Table 11.5: Generated Code with Only Two Available Registers, R0 and R1
  6532. Table 11.6: Generated Code with Rearranged Computations
  6533. Table 11.7: Recursive Gencode Calls
  6534. List of Examples
  6535. Chapter 2: Finite Automata and Regular Expressions
  6536. EXAMPLE 2.1
  6537. EXAMPLE 2.2
  6538. EXAMPLE 2.3
  6539. EXAMPLE 2.4
  6540. EXAMPLE 2.5
  6541. EXAMPLE 2.6
  6542. EXAMPLE 2.7
  6543. EXAMPLE 2.8
  6544. EXAMPLE 2.9
  6545. EXAMPLE 2.10
  6546. Chapter 3: Context-Free Grammar and Syntax Analysis
  6547. EXAMPLE 3.1
  6548. EXAMPLE 3.2
  6549. EXAMPLE 3.3
  6550. EXAMPLE 3.4
  6551. EXAMPLE 3.5
  6552. EXAMPLE 3.6
  6553. EXAMPLE 3.7
  6554. EXAMPLE 3.8
  6555. EXAMPLE 3.9
  6556. EXAMPLE 3.10
  6557. EXAMPLE 3.11
  6558. EXAMPLE 3.12
  6559. EXAMPLE 3.13
  6560. EXAMPLE 3.14
  6561. EXAMPLE 3.15
  6562. Chapter 4: Top-Down Parsing
  6563. EXAMPLE 4.1
  6564. EXAMPLE 4.2
  6565. EXAMPLE 4.3
  6566. EXAMPLE 4.4
  6567. EXAMPLE 4.5
  6568. EXAMPLE 4.6
  6569. EXAMPLE 4.7
  6570. EXAMPLE 4.8
  6571. Chapter 5: Bottom-up Parsing
  6572. EXAMPLE 5.1
  6573. EXAMPLE 5.2
  6574. EXAMPLE 5.3
  6575. EXAMPLE 5.4
  6576. EXAMPLE 5.5
  6577. EXAMPLE 5.6
  6578. EXAMPLE 5.7
  6579. EXAMPLE 5.8
  6580. Chapter 6: Syntax-Directed Definitions and Translations
  6581. EXAMPLE 6.1
  6582. EXAMPLE 6.2
  6583. EXAMPLE 6.3
  6584. EXAMPLE 6.4
  6585. EXAMPLE 6.5
  6586. EXAMPLE 6.6
  6587. Chapter 11: Code Generation
  6588. EXAMPLE 11.1
  6589. EXAMPLE 11.2
  6590. EXAMPLE 11.3
  6591. EXAMPLE 11.4
  6592. Chapter 12: Exercises
  6593. EXERCISE 12.1
  6594. EXERCISE 12.2
  6595. EXERCISE 12.3
  6596. EXERCISE 12.4
  6597. EXERCISE 12.5
  6598. EXERCISE 12.6
  6599. EXERCISE 12.7
  6600. EXERCISE 12.8
  6601. EXERCISE 12.9
  6602. EXERCISE 12.10
  6603. EXERCISE 12.11
  6604. EXERCISE 12.12
  6605. EXERCISE 12.13
  6606. EXERCISE 12.14