Compiler Construction
A Practical Approach

F.J.F. Benders
J.W. Haaring
T.H. Janssen
D. Meffert
A.C. van Oostenrijk

January 11, 2003
Abstract

Plenty of literature is available to learn about compiler construction, but most of it is either too easy, covering only the very basics, or too difficult and accessible only to academics. We find that, most notably, literature about code generation is lacking and it is in this area that this book attempts to fill in the gaps.

In this book, we design a new language (Inger), and explain how to write a compiler for it that compiles to Intel assembly language. We discuss lexical analysis (scanning), LL(1) grammars, recursive descent parsing, syntax error recovery, identification, type checking and code generation using templates, and give practical advice on tackling each part of the compiler.
Acknowledgements

The authors would like to extend their gratitude to a number of people who were invaluable in the conception and writing of this book. The compiler construction project of which this book is the result was started with the help of Frits Feldbrugge and Robert Holwerda. The Inger language was named after Inger Vermeir (in the good tradition of naming languages after people, like Ada). The project team was coordinated by Marco Devillers, who proved to be a valuable source of advice.

We would also like to thank Carola Doumen for her help in structuring the project and coordinating presentations given about the project. Cees Haaring helped us get a number of copies of the book printed.

Furthermore, we thank Mike Wertman for letting us study his source of a Pascal compiler written in Java, and Hans Meijer of the University of Nijmegen for his invaluable compiler construction course.

Finally, we would like to thank the University of Arnhem and Nijmegen for letting us use a project room and computer equipment for as long as we wanted.
Contents

1 Introduction
  1.1 Translation and Interpretation
  1.2 Roadmap
  1.3 A Sample Interpreter

2 Compiler History
  2.1 Procedural Programming
  2.2 Functional Programming
  2.3 Object Oriented Programming
  2.4 Timeline

I Inger

3 Language Specification
  3.1 Introduction
  3.2 Program Structure
  3.3 Notation
  3.4 Data
    3.4.1 bool
    3.4.2 int
    3.4.3 float
    3.4.4 char
    3.4.5 untyped
  3.5 Declarations
  3.6 Action
    3.6.1 Simple Statements
    3.6.2 Compound Statements
    3.6.3 Repetitive Statements
    3.6.4 Conditional Statements
    3.6.5 Flow Control Statements
  3.7 Array
  3.8 Pointers
  3.9 Functions
  3.10 Modules
  3.11 Libraries
  3.12 Conclusion

II Syntax

4 Lexical Analyzer
  4.1 Introduction
  4.2 Regular Language Theory
  4.3 Sample Regular Expressions
  4.4 UNIX Regular Expressions
  4.5 States
  4.6 Common Regular Expressions
  4.7 Lexical Analyzer Generators
  4.8 Inger Lexical Analyzer Specification

5 Grammar
  5.1 Introduction
  5.2 Languages
  5.3 Syntax and Semantics
  5.4 Production Rules
  5.5 Context-free Grammars
  5.6 The Chomsky Hierarchy
  5.7 Additional Notation
  5.8 Syntax Trees
  5.9 Precedence
  5.10 Associativity
  5.11 A Logic Language
  5.12 Common Pitfalls

6 Parsing
  6.1 Introduction
  6.2 Prefix code
  6.3 Parsing Theory
  6.4 Top-down Parsing
  6.5 Bottom-up Parsing
  6.6 Direction Sets
  6.7 Parser Code
  6.8 Conclusion

7 Preprocessor
  7.1 What is a preprocessor?
  7.2 Features of the Inger preprocessor
    7.2.1 Multiple file inclusion
    7.2.2 Circular References

8 Error Recovery
  8.1 Introduction
  8.2 Error handling
  8.3 Error detection
  8.4 Error reporting
  8.5 Error recovery
  8.6 Synchronization

III Semantics

9 Symbol table
  9.1 Introduction to symbol identification
  9.2 Scoping
  9.3 The Symbol Table
    9.3.1 Dynamic vs. Static
  9.4 Data structure selection
    9.4.1 Criteria
    9.4.2 Data structures compared
    9.4.3 Data structure selection
  9.5 Types
  9.6 An Example

10 Type Checking
  10.1 Introduction
  10.2 Implementation
    10.2.1 Decorate the AST with types
    10.2.2 Coercion
  10.3 Overview
    10.3.1 Conclusion

11 Miscellaneous Semantic Checks
  11.1 Left Hand Values
    11.1.1 Check Algorithm
  11.2 Function Parameters
  11.3 Return Keywords
    11.3.1 Unreachable Code
    11.3.2 Non-void Function Returns
  11.4 Duplicate Cases
  11.5 Goto Labels

IV Code Generation

12 Code Generation
  12.1 Introduction
  12.2 Boilerplate Code
  12.3 Globals
  12.4 Resource Calculation
  12.5 Intermediate Results of Expressions
  12.6 Function calls
  12.7 Control Flow Structures
  12.8 Conclusion

13 Code Templates

14 Bootstrapping

15 Conclusion

A Requirements
  A.1 Introduction
  A.2 Running Inger
  A.3 Inger Development
  A.4 Required Development Skills

B Software Packages

C Summary of Operations
  C.1 Operator Precedence Table
  C.2 Operand and Result Types

D Backus-Naur Form

E Syntax Diagrams

F Inger Lexical Analyzer Source
  F.1 tokens.h
  F.2 lexer.l

G Logic Language Parser Source
  G.1 Lexical Analyzer
  G.2 Parser Header
  G.3 Parser Source
Chapter 1

Introduction

This book is about constructing a compiler. But what, precisely, is a compiler? We must give a clear and complete answer to this question before we can begin building our own compiler.

In this chapter, we will introduce the concept of a translator, and more specifically, a compiler. It serves as an introduction to the rest of the book and presents some basic definitions that we will assume to be clear throughout the remainder of the book.
1.1 Translation and Interpretation

A compiler is a special form of a translator:

Definition 1.1 (Translator)
A translator is a program, or a system, that converts an input text in some language to a text in another language, with the same meaning.
□

One translator could translate English text to Dutch text, and another could translate a Pascal program to assembly or machine code. Yet another translator might translate chess notation to an actual representation of a chess board, or translate a web page description in HTML to the actual web page. The latter two examples are in fact not so much translators as they are interpreters:

Definition 1.2 (Interpreter)
An interpreter is a translator that converts an input text to its meaning, as defined by its semantics.
□

A BASIC interpreter like GW-BASIC is a classic and familiar example of an interpreter. Conversely, a translator translates the expression 2 + 3 to, for example, machine code that evaluates to 5. It does not translate directly to 5. The processor (CPU) that executes the machine code is the actual interpreter, delivering the final result. These observations lead to the following relationship:
interpreters ⊂ translators
Sometimes the difference between the translation of an input text and its meaning is not immediately clear, and it can be difficult to decide whether a certain translator is an interpreter or not.

A compiler is a translator that converts program source code to some target code, such as Pascal to assembly code, C to machine code and so on. Such translators differ from translators for, for example, natural languages because their input is expected to follow very strict rules for form (syntax) and the meaning of an input text must always be clear, i.e. follow a set of semantic rules.

Many programs can be considered translators, not just the ones that deal with text. Other types of input and output can also be viewed as structured text (SQL queries, vector graphics, XML) which adheres to a certain syntax, and can therefore be treated the same way. Many conversion tools (conversion between graphics formats, or HTML to LaTeX) are in fact translators. In order to think of some process as a translator, one must find out which alphabet is used (the set of allowed words) and which sentences are spoken. An interesting exercise is writing a program that converts chess notation to a chess board diagram.

Meijer [1] presents a set of definitions that clarify the distinction between translation and interpretation. If the input text to a translator is a program, then that program can have its own input stream. Such a program can be translated without knowledge of the contents of the input stream, but it cannot be interpreted.

Let p be the program that must be translated, in programming language P, and let i be the input. Then the interpreter is a function v_P, and the result of the interpretation of p with input i is denoted as:

v_P(p, i)

If c is a translator, the same result is obtained by applying the translation c(p), in a programming language M, to the input stream i:

v_M(c(p), i)
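For a concrete (hypothetical) illustration: let p be a one-line program that reads a number and prints its double, and let i = 21. Interpreting p directly gives v_P(p, 21) = 42. Compiling p first yields a machine-code program c(p), and running c(p) on the same input gives v_M(c(p), 21) = 42 as well: translation followed by execution produces the same result as interpretation.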
Interpreters are quite common. Many popular programming languages cannot be compiled but must be interpreted, such as (early forms of) BASIC, Smalltalk, Ruby, Perl and PHP. Other programming languages provide both the option of compilation and the option of interpretation.

The rest of this book will focus on compilers, which translate program input texts to some target language. We are specifically interested in translating program input text in a programming language (in particular, Inger) to Intel assembly language.
1.2 Roadmap

Constructing a compiler involves specifying the programming language for which you wish to build a compiler, and then writing a grammar for it. The compiler then reads source programs written in the new programming language and checks that they are syntactically valid (well-formed). After that, the compiler verifies that the meaning of the program is correct, i.e. it checks the program's semantics. The final step in the compilation is generating code in the target language.

To help you visualize where you are in the compiler construction process, every chapter begins with a copy of the roadmap, with one of the squares on the map highlighted.
1.3 A Sample Interpreter

In this section, we will discuss a very simple sample interpreter that calculates the result of simple mathematical expressions, using the operators + (addition), - (subtraction), * (multiplication) and / (division). We will work only with numbers consisting of one digit (0 through 9).

We will now devise a systematic approach to calculate the result of the expression

1 + 2 * 3 - 4

This is traditionally done by reading the input string on a character by character basis. Initially, the read pointer is set at the beginning of the string, just before the number 1:

1 + 2 * 3 - 4
^

We now proceed by reading the first character (or code), which happens to be 1. This is not an operator, so we cannot calculate anything yet. We must store the 1 we just read away for later use, and we do so by creating a stack (a last-in, first-out data structure) and placing 1 on it. We illustrate this by drawing a vertical line between the items on the stack (on the left) and the items on the input stream (on the right):

1 | + 2 * 3 - 4
    ^

The read pointer is now at the + operator. This operator needs two operands, only one of which is known at this time. So all we can do is store the + on the stack and move the read pointer forward one position.

1 + | 2 * 3 - 4
      ^
The next character read is 2. We must now resist the temptation to combine this new operand with the operator and operand already on the stack and evaluate 1 + 2, since the rules of precedence dictate that we must evaluate 2 * 3 first, and then add the result to 1. Therefore, we place (shift) the value 2 on the stack:

1 + 2 | * 3 - 4
        ^

We now read another operator (*), which needs two operands. We shift it onto the stack because the second operand is not yet known. The read pointer is once again moved to the right and we read the number 3. This number is also placed on the stack and the read pointer now points to the operator -:

1 + 2 * 3 | - 4
            ^

We are now in a position to fold up (reduce) some of the contents of the stack. The operator - is of lower priority than the operator *. According to the rules of precedence, we may now calculate 2 * 3, whose operator and operands happen to be the topmost three items on the stack (which, as you will remember, is a last-in, first-out data structure). We pop the last three items off the stack, calculate the result, and shift it back onto the stack. This is the process of reduction.

1 + 6 | - 4
        ^

We now compare the priority of the operator - with the priority of the operator + and find that, according to the rules of precedence, they have equal priority. This means we can either evaluate the current stack contents or continue shifting items onto the stack. In order to keep the contents of the stack to a minimum (consider what would happen if an endless number of + and - operators were encountered in succession), we reduce the contents of the stack first, by calculating 1 + 6:

7 | - 4
    ^

The stack can be simplified no further, so we direct our attention to the next operator in the input stream (-). This operator needs two operands, so we must shift the read pointer still further to the right:

7 - 4 |
      ^

We have now reached the end of the stream but are able to reduce the contents of the stack to a final result. The expression 7 - 4 is evaluated, yielding 3. Evaluation of the entire expression 1 + 2 * 3 - 4 is now complete, and the algorithm used in the process is simple. There are a couple of interesting points:
1. Since the tokens already read from the input stream are placed on a stack to wait for evaluation, the operations shift and reduce are in fact equivalent to the stack operations push and pop.

2. The relative precedence of the operators encountered in the input stream determines the order in which the contents of the stack are evaluated.
Operators not only have priority, but also associativity. Consider the expression

1 - 2 - 3

The order in which the two operators are evaluated is significant, as the following two possible orders show:

(1 - 2) - 3 = -4
1 - (2 - 3) = 2

Of course, the correct answer is -4 and we may conclude that the - operator associates to the left. There are also (but fewer) operators that associate to the right, like the "to the power of" operator (^):

(2^3)^2 = 8^2 = 64    (incorrect)
2^(3^2) = 2^9 = 512   (correct)

A final class of operators is nonassociative, like +:

(1 + 4) + 3 = 5 + 3 = 8
1 + (4 + 3) = 1 + 7 = 8

Such operators may be evaluated either to the left or to the right; it does not really matter. In compiler construction, nonassociative operators are often treated as left-associative operators for simplicity.

The importance of priority and associativity in the evaluation of mathematical expressions leads to the observation that an operator priority list is required by the interpreter. The following table could be used:
operator   priority   associativity
^          1          right
*          2          left
/          2          left
+          3          left
-          3          left
The parentheses ( and ) can also be considered operators, with the highest priority (and could therefore be added to the priority list). At this point, the priority relation is still incomplete. We also need invisible markers to indicate the beginning and end of an expression. The begin-marker [ should be of the lowest priority (in order to cause every other operator that gets shifted onto an otherwise empty stack not to evaluate). The end-marker ] should also be of the lowest priority (the same as [) for the same reasons. The new, full priority relation is then:

{ [, ] } < { +, - } < { *, / } < { ^ }
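To make the algorithm concrete, the following small C program is a minimal sketch (not part of any real compiler; the fixed stack sizes and the absence of error checking are simplifications) of the shift-reduce evaluator described above. Note that it encodes priority with larger numbers binding more tightly, the inverse of the table above, and it lets the terminating '\0' of the input string play the role of the end-marker ]:

#include <stdio.h>
#include <math.h>

static int priority(char op)
{
    switch (op) {
    case '^':           return 3;   /* binds most tightly */
    case '*': case '/': return 2;
    case '+': case '-': return 1;
    default:            return 0;   /* '[' begin marker and '\0' end marker */
    }
}

static double apply(double a, char op, double b)
{
    switch (op) {
    case '+': return a + b;
    case '-': return a - b;
    case '*': return a * b;
    case '/': return a / b;
    default:  return pow(a, b);     /* '^' */
    }
}

double evaluate(const char *expr)
{
    double vals[64]; int nv = 0;    /* operand stack */
    char   ops[64];  int no = 0;    /* operator stack */
    ops[no++] = '[';                /* begin marker: lowest priority */

    for (const char *p = expr; ; ++p) {
        if (*p == ' ')
            continue;               /* skip whitespace */
        if (*p >= '0' && *p <= '9') {
            vals[nv++] = *p - '0';  /* shift a one-digit operand */
            continue;
        }
        /* *p is an operator, or '\0' acting as the end marker:
           reduce while the stacked operator binds at least as tightly;
           '^' is right-associative, so equal priority does not reduce it */
        while (no > 1) {
            int pt = priority(ops[no - 1]), pc = priority(*p);
            if (pt < pc || (pt == pc && *p == '^'))
                break;
            double b = vals[--nv], a = vals[--nv];
            vals[nv++] = apply(a, ops[--no], b);
        }
        if (*p == '\0')
            break;                  /* end of input: result is on the stack */
        ops[no++] = *p;             /* shift the operator */
    }
    return vals[0];
}

int main(void)
{
    printf("%g\n", evaluate("1 + 2 * 3 - 4"));   /* prints 3 */
    return 0;
}

Pushing the begin-marker [ first means the reduction loop always stops at the bottom of the stack, which is exactly the purpose the marker serves in the walkthrough above.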
The language discussed in our example supports only one-digit numbers. In order to support numbers of arbitrary length while still reading one digit at a time and working with the stack-based shift-reduce approach, we could introduce a new implicit concatenation operator:

1 . 2 = 1 * 10 + 2 = 12

Numerous other problems accompany the introduction of numbers of arbitrary length, which will not be discussed here (but most certainly in the rest of this book). This concludes the simple interpreter which we have crafted by hand. In the remaining chapters, you will learn how an actual compiler may be built using standard methods and techniques, which you can apply to your own programming language.
Bibliography

[1] H. Meijer: Inleiding Vertalerbouw, University of Nijmegen, Subfaculty of Computer Science, 2002.

[2] M.L. Scott: Programming Language Pragmatics, Morgan Kaufmann Publishers, 2000.
Chapter 2

Compiler History

This chapter gives an overview of compiler history. Programming languages can roughly be divided into three classes: procedural or imperative programming languages, functional programming languages and object-oriented programming languages. Compilers exist for all three classes, and each class has its own quirks and specialties.
2.1 Procedural Programming

The first programming languages evolved from machine code. Instead of writing numbers to specify addresses and instructions, you could write symbolic names. The computer executes this sequence of instructions when the user runs the program. This style of programming is known as procedural or imperative programming.

Some of the first procedural languages include:

• FORTRAN, created in 1957 by IBM to produce programs to solve mathematical problems. FORTRAN is short for formula translation.

• Algol 60, created in the late fifties with the goal of providing a "universal programming language". Even though the language was not widely used, its syntax became the standard language for describing algorithms.

• COBOL, a "data processing" language developed by Sammet, which introduced many new data types and implicit type conversion.

• PL/I, a programming language that would combine the best features of FORTRAN, Algol and COBOL.

• Algol 68, the successor to Algol 60. Also not widely used, though the ideas it introduced have been widely imitated.

• Pascal, created by Wirth to demonstrate that a powerful programming language can be simple too, as opposed to the complex Algol 68, PL/I and others.

• Modula-2, also created by Wirth as an improvement to Pascal, with modules as its most important new feature.

• C, designed by Ritchie as a low-level language mainly for the task of system programming. C became very popular because UNIX was very popular and depended heavily on it.

• Ada, a large and complex language created by Whitaker and one of the latest attempts at designing a procedural language.
2.2 Functional Programming

Functional programming is based on the abstract model of programming Turing introduced, known as the Turing Machine. The theory of recursive functions introduced by Kleene and Church also plays an important role in functional programming. The big difference with procedural programming is the insight that everything can be done with expressions, as opposed to commands.

Some important functional programming languages:

• LISP, the language that introduced functional programming. Developed by John McCarthy in 1958.

• Scheme, a language with a syntax and semantics very similar to LISP, but simpler and more consistent. Designed by Guy L. Steele Jr. and Gerald Jay Sussman.

• SASL, short for St Andrews Static Language. It was created by David Turner and has an Algol-like syntax.

• SML, designed by Milner, Tofte and Harper as a "metalanguage".
2.3 Object Oriented Programming

Object oriented programming is entirely focused on objects, not functions. It has some major advantages over procedural programming. Well-written code consists of objects that keep their data private, accessible only through certain methods (the interface); this concept is known as encapsulation. Another important object oriented programming concept is inheritance: a mechanism by which objects inherit state and behaviour from their superclass.

Important object oriented programming languages:

• Simula, developed in Norway by Kristen Nygaard and Ole-Johan Dahl. A language to model systems, which are collections of interacting processes (objects), which in turn are represented by multiple procedures.

• Smalltalk, originated with Alan Kay's ideas on computers and programming. It is influenced by LISP and Simula.

• CLU, a language introduced by Liskov and Zilles to support the idea of information hiding (the interface of a module should be public, while the implementation remains private).

• C++, created at Bell Labs by Bjarne Stroustrup as a programming language to replace C. It is a hybrid language: it supports both imperative and object oriented programming.

• Eiffel, a strictly object oriented language which is strongly focused on software engineering.

• Java, an object oriented programming language with a syntax which looks like C++, but much simpler. Java is compiled to byte code, which makes it portable. This is also the reason it became very popular for web development.

• Kevo, a language based on prototypes instead of classes.
2.4 Timeline

In this section, we give a compact overview of the timeline of compiler construction. As described in the overview article [4], the conception of the first computer language goes back as far as 1946. In this year (or thereabouts), Konrad Zuse, a German engineer working alone while hiding out in the Bavarian Alps, develops Plankalkül. He applies the language to, among other things, chess. Not long after that, the first compiled language appears: Short Code, the first computer language actually used on an electronic computing device. It is, however, a "hand-compiled" language.

In 1951, Grace Hopper, working for Remington Rand, begins design work on the first widely known compiler, named A-0. When the language is released by Rand in 1957, it is called MATH-MATIC. Less well known is the fact that almost simultaneously, a rudimentary compiler was developed at a much less professional level: Alick E. Glennie, in his spare time at the University of Manchester, devises a compiler called AUTOCODE.

A few years after that, in 1957, the world-famous programming language FORTRAN (FORmula TRANslation) is conceived. John Backus (responsible for the Backus-Naur Form for syntax specification) leads the development of FORTRAN and later on works on the ALGOL programming language. The publication of FORTRAN was quickly followed by FORTRAN II (1958), which supported subroutines (a major innovation at the time, giving birth to the concept of modular programming).

Also in 1958, John McCarthy at M.I.T. begins work on LISP (LISt Processing), the precursor of (almost) all functional programming languages we know today. This is also the year in which the ALGOL programming language appears (at least, the specification). The specification of ALGOL does not describe how data will be input or output; that is left to the individual implementations.

1959 was another year of much innovation. LISP 1.5 appears and the functional programming paradigm is settled. Also, COBOL is created by the Conference on Data Systems Languages (CODASYL). In the next year, the first actual implementation of ALGOL appears (ALGOL 60). It is the root of the family tree that will ultimately produce the likes of Pascal by Niklaus Wirth. ALGOL goes on to become the most popular language in Europe in the mid- to late 1960s.
Sometime in the early 1960s, Kenneth Iverson begins work on the language that will become APL (A Programming Language). It uses a specialized character set that, for proper use, requires APL-compatible I/O devices. In 1962, Iverson publishes a book on his new language (titled, aptly, A Programming Language). 1962 is also the year in which FORTRAN IV appears, as well as SNOBOL (StriNg-Oriented symBOlic Language) and associated compilers.

In 1963, the new language PL/I is conceived. This language will later form the basis for many other languages. In the year after, APL/360 is implemented, and at Dartmouth College, professors John G. Kemeny and Thomas E. Kurtz invent BASIC. The first implementation is a compiler. The first BASIC program runs at about 4:00 a.m. on May 1, 1964.

Languages start appearing rapidly now: 1965 - SNOBOL3. 1966 - FORTRAN 66 and LISP 2. Work begins on LOGO at Bolt, Beranek & Newman. The team is headed by Wally Feurzeig and includes Seymour Papert. LOGO is best known for its "turtle graphics". Lest we forget: 1967 - SNOBOL4.

In 1968, the aptly named ALGOL 68 appears. This new language is not altogether a success, and some members of the specifications committee, including C.A.R. Hoare and Niklaus Wirth, protest its approval. ALGOL 68 proves difficult to implement. Wirth begins work on his new language Pascal in this year, which also sees the birth of ALTRAN, a FORTRAN variant, and the official definition of COBOL by the American National Standards Institute (ANSI). Compiler construction attracts a lot of interest: in 1969, 500 people attend an APL conference at IBM's headquarters in Armonk, New York. The demands for APL's distribution are so great that the event is later referred to as "The March on Armonk".

Sometime in the early 1970s, Charles Moore writes the first significant programs in his new language, Forth. Work on Prolog begins about this time. Also sometime in the early 1970s, work on Smalltalk begins at Xerox PARC, led by Alan Kay. Early versions will include Smalltalk-72, Smalltalk-74 and Smalltalk-76. An implementation of Pascal appears on a CDC 6000-series computer. Icon, a descendant of SNOBOL4, appears.

Remember 1946? In 1972, the manuscript for Konrad Zuse's Plankalkül (see 1946) is finally published. In the same year, Dennis Ritchie and Brian Kernighan produce C. The definitive reference manual for it will not appear until 1974. The first implementation of Prolog, by Alain Colmerauer and Philippe Roussel, appears. Three years later, in 1975, Tiny BASIC by Bob Albrecht and Dennis Allison (implementation by Dick Whipple and John Arnold) runs on a microcomputer in 2 KB of RAM; a 4 KB machine was sizable at the time, which left 2 KB available for the program. Bill Gates and Paul Allen write a version of BASIC that they sell to MITS (Micro Instrumentation and Telemetry Systems)
on a per-copy royalty basis. MITS is producing the Altair, an 8080-based microcomputer. Also in 1975, Scheme, a LISP dialect by G.L. Steele and G.J. Sussman, appears. Pascal User Manual and Report, by Jensen and Wirth (also extensively used in the conception of Inger), is published.

B.W. Kernighan describes RATFOR (RATional FORTRAN). It is a preprocessor that allows C-like control structures in FORTRAN. RATFOR is used in Kernighan and Plauger's "Software Tools", which appears in 1976. In that same year, the Design System Language, a precursor to PostScript (which was not developed until much later), appears.

In 1977, ANSI defines a standard for MUMPS: the Massachusetts General Hospital Utility Multi-Programming System. Used originally to handle medical records, MUMPS recognizes only a string data type. It was later renamed M. The design competition (ordered by the Department of Defense) that will produce Ada begins. A team led by Jean Ichbiah will win the competition. Also, sometime in the late 1970s, Kenneth Bowles produces UCSD Pascal, which makes Pascal available on PDP-11 and Z80-based (remember the ZX Spectrum) computers and thus for "home use". Niklaus Wirth begins work on Modula, forerunner of Modula-2 and successor to Pascal.

The text-processing language AWK (named after its designers: Aho, Weinberger and Kernighan) becomes available in 1978. So does the ANSI standard for FORTRAN 77. Two years later, the first "real" implementation of Smalltalk (Smalltalk-80) appears. So does Modula-2. Bjarne Stroustrup develops "C With Classes", which will eventually become C++.

In 1981, design begins on Common LISP, a version of LISP that must unify the many different dialects in use at the time. Japan begins the "Fifth Generation Computer System" project. The primary language is Prolog. In the next year, the International Organization for Standardization (ISO) publishes its Pascal standard. PostScript is published (after DSL).

The famous book on Smalltalk, Smalltalk-80: The Language and Its Implementation by Adele Goldberg, is published. Ada appears, the language named after Lady Augusta Ada Byron, Countess of Lovelace and daughter of the English poet Byron. She has been called the first computer programmer because of her work on Charles Babbage's analytical engine. In 1983, the Department of Defense (DoD) directs that all new "mission-critical" applications be written in Ada.

In late 1983 and early 1984, Microsoft and Digital Research both release the first C compilers for microcomputers. The use of compilers by back-bedroom programmers becomes almost feasible. In July, the first implementation of C++ appears. It is in 1984 that Borland produces its famous Turbo Pascal. A reference manual for APL2 appears, an extension of APL that permits nested arrays.

An important year for computer languages is 1985. It is the year in which Forth controls the submersible sled that locates the wreck of the Titanic. Methods, a line-oriented Smalltalk for personal computers, is introduced. In 1986, Smalltalk/V appears, the first widely available version of Smalltalk for microcomputers. Apple releases Object Pascal for the Mac, greatly popularizing the Pascal language. Borland extends its "Turbo" product line with Turbo Prolog.

Charles Duff releases Actor, an object-oriented language for developing Microsoft Windows applications. Eiffel, an object-oriented language, appears. So does C++. Borland produces the fourth incarnation of Turbo Pascal (1987). In 1988, the specification of CLOS (Common LISP Object System) is finally published. Wirth finishes Oberon, his follow-up to Modula-2 and his third major language so far.

In 1989, the ANSI specification for C is published, leveraging the already popular language even further. C++ 2.0 arrives in the form of a draft reference manual. The 2.0 version adds features such as multiple inheritance (not approved by everyone) and pointers to members. A year later, the Annotated C++ Reference Manual by Bjarne Stroustrup is published, adding templates and exception-handling features. FORTRAN 90 includes such new elements as case statements and derived types. Kenneth Iverson and Roger Hui present J at the APL90 conference.

Dylan, named for Dylan Thomas, an object-oriented language resembling Scheme, is released by Apple in 1992. A year later, ANSI releases the X3J4.1 technical report, the first-draft proposal for object-oriented COBOL. The standard is expected to be finalized in 1997.

In 1994, Microsoft incorporates Visual Basic for Applications into Excel, and in 1995, ISO accepts the 1995 revision of the Ada language. Called Ada 95, it includes OOP features and support for real-time systems.

This concludes the compact timeline of the evolution of programming languages. Of course, in the present day, another revolution is taking place, in the form of the Microsoft .NET platform. This platform is worth an entire book unto itself, and much literature is in fact already available. We will not discuss the .NET platform and the common language specification any further in this book. It is now time to move on to the first part of building our own compiler.
Bibliography

[1] T. Dodd: An Advanced Logic Programming Language - Prolog-2 Encyclopedia, Blackwell Scientific Publications Ltd., 1990.

[2] M. Looijen: Grepen uit de Geschiedenis van de Automatisering, Kluwer Bedrijfswetenschappen, Deventer, 1992.

[3] G. Moody: Rebel Code - Inside Linux and the Open Source Revolution, Perseus Publishing, 2001.

[4] N.N.: A Brief History of Programming Languages, BYTE Magazine, 20th anniversary issue, 1995.

[5] P.H. Salus: Handbook of Programming Languages, Volume I: Object-oriented Programming Languages, Macmillan Technical Publishing, 1998.

[6] P.H. Salus: Handbook of Programming Languages, Volume II: Imperative Programming Languages, Macmillan Technical Publishing, 1998.

[7] P.H. Salus: Handbook of Programming Languages, Volume III: Little Languages and Tools, Macmillan Technical Publishing, 1998.

[8] P.H. Salus: Handbook of Programming Languages, Volume IV: Functional and Logic Programming Languages, Macmillan Technical Publishing, 1998.
Part I

Inger
Chapter 3

Language Specification

3.1 Introduction

This chapter gives a detailed introduction to the Inger language. The reader is assumed to have some familiarity with the concept of a programming language, and some experience with mathematics.

To give the reader an introduction to programming in general, we cite a short fragment of the introduction to the Pascal User Manual and Report by Niklaus Wirth [7]:

    An algorithm or computer program consists of two essential parts, a description of the actions which are to be performed, and a description of the data, which are manipulated by these actions. Actions are described by so-called statements, and data are described by so-called declarations and definitions.

Inger provides language constructs (declarations) to define the data a program requires, and numerous ways to manipulate that data. In the next sections, we will explore Inger in some detail.
3.2 Program Structure

A program consists of one or more named modules, all of which contribute data or actions to the final program. Every module resides in its own source file. The best way to get to know Inger is to examine a small module source file. The program "factorial" (listing 3.1) calculates the factorial of the number 6. The output is 6! = 720.

All modules begin with the module name, which can be any name the programmer desires. A module then contains zero or more functions, which encapsulate (parts of) algorithms, and zero or more global variables (global data). The functions and data declarations can occur in any order, which is made clear in a syntax diagram (figure 3.1). By starting from the left, one can trace
/* factor.i - test program.
   Contains a function that calculates
   the factorial of the number 6.
   This program tests the while loop. */

module test_module;

factor: int n → int
{
    int factor = 1;
    int i = 1;

    while( i <= n ) do
    {
        factor = factor * i;
        i = i + 1;
    }

    return( factor );
}

start main: void → void
{
    int f;
    f = factor( 6 );
}

Listing 3.1: Inger Factorial Program
the lines leading through boxes and rounded enclosures. Boxes represent additional syntax diagrams, while rounded enclosures contain terminal symbols (those actually written in an Inger program). A syntactically valid program is constructed by following the lines and always taking smooth turns, never sharp turns. Note that dotted lines are used to break a syntax diagram in half that is too wide to fit on the page.

Figure 3.1: Syntax diagram for module
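A brief aside: for readers more familiar with C, listing 3.1 transliterates as follows. This is a sketch for comparison only (it keeps the listing's variable names, including the local variable factor that shadows the function name, which is legal C) and prints 720, the expected output:

#include <stdio.h>

int factor(int n)
{
    int factor = 1;             /* mirrors the Inger listing */
    int i = 1;
    while (i <= n) {
        factor = factor * i;    /* accumulate i into the product */
        i = i + 1;
    }
    return factor;
}

int main(void)
{
    printf("%d\n", factor(6));  /* 6! = 720 */
    return 0;
}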
Example 3.1 (Tracing a Syntax Diagram)

As an example, we will show two valid programs that are generated by tracing the syntax diagram for module. These are not complete programs; they still contain the names of the additional syntax diagrams function and declaration that must be traced.

module Program_One;

Program_One is a correct program that contains no functions or declarations. The syntax diagram for module allows this, because the loop leading through either function or declaration is taken zero times.

module Program_Two;
extern function;
declaration;
function;

Program_Two is also correct. It contains two functions and one declaration. One of the functions is marked extern; the keyword extern is optional, as the syntax diagram for module shows.
□
Syntax diagrams are a very descriptive way of writing down language syntax, but not very compact. We may also use Backus-Naur Form (BNF) to denote the syntax for the program structure, as shown in listing 3.2.

In BNF, each syntax diagram is denoted using one or more lines. The line begins with the name of the syntax diagram (a nonterminal), followed by a colon. The contents of the syntax diagram are written after the colon: nonterminals (which have their own syntax diagrams), and terminals, which
module: module identifier ; globals.

globals: ε.
globals: global globals.
globals: extern global globals.

global: function.
global: declaration.

Listing 3.2: Backus-Naur Form for module
are printed in bold. Since nonterminals may have syntax diagrams of their own, a single syntax diagram may be expressed using multiple lines of BNF. A line of BNF is also called a production rule. It provides information on how to "produce" actual code from a nonterminal. In the following example, we produce the programs "One" and "Two" from the previous example using the BNF productions.

Example 3.2 (BNF Derivations)

Here is the listing for program "One" again:

module Program_One;

To derive this program, we start with the topmost BNF nonterminal, module. This is called the start symbol. There is only one production for this nonterminal:

module: module identifier ; globals.

We now replace the nonterminal module with the right hand side of this production:

module → module identifier; globals

Note that we have underlined the nonterminal to be replaced. In the new string we now have, there is a new nonterminal to replace: globals. There are multiple production rules for globals:

globals: ε.
globals: global globals.
globals: extern global globals.

Program "One" does not have any globals (declarations or functions), so we replace the nonterminal globals with the empty string (ε). Finally, we replace the nonterminal identifier. We provide no BNF rule for this, but it suffices to say that we may replace identifier with any word consisting of letters, digits and underscores and starting with either a letter or an underscore:
module
→ module identifier; globals
→ module Program_One; globals
→ module Program_One;

And we have created a valid program! The above list of production rule applications is called a derivation. A derivation is the application of production rules until there are no nonterminals left to replace. We now create a derivation for program "Two", which contains two functions (one of which is extern, more on that later) and a declaration. We will not derive further than the function and declaration level, because these language structures will be explained in a subsequent section. Here is the listing for program "Two" again:

module Program_Two;
extern function;
declaration;
function;

module
→ module identifier; globals
→ module Program_Two; globals
→ module Program_Two; extern global globals
→ module Program_Two; extern function globals
→ module Program_Two; extern function global globals
→ module Program_Two; extern function declaration globals
→ module Program_Two; extern function declaration global globals
→ module Program_Two; extern function declaration function globals
→ module Program_Two; extern function declaration function

And with the last replacement, we have produced the source code for program "Two", exactly the same as in the previous example.
□
BNF is a somewhat rigid notation; it only allows the writer to make explicit the order in which nonterminals and terminals occur, and one must create additional BNF rules to capture repetition and selection. For instance, the syntax diagram for module shows that zero or more data declarations or functions may appear in a program. In BNF, we show this by introducing a production rule called globals, which calls itself (is recursive). We also needed to create another production rule called global, which has two alternatives (function and declaration) to offer a choice. Note that globals has three alternatives. One alternative is needed to end the repetition of functions and declarations (this is denoted with an ε, meaning empty), and one alternative is used to include the keyword extern, which is optional.

There is a more convenient notation called Extended Backus-Naur Form (EBNF), which allows the syntax diagram for module to be written like this:
(    )    [    ]
!    -    +    ~
*    &    *    /
%    +    -    >>
<<   <    <=   >
>=   ==   !=   &
^    |    &&   ||
?    :    =    ,
;    ->   {    }

bool      break     case      char
continue  default   do        else
extern    false     float     goto considered harmful
if        int       label     module
return    start     switch    true
untyped   while

Table 3.1: Inger vocabulary
module: module identifier ; { [ extern ] ( function | declaration ) } .
In EBNF, we can use vertical bars (|) to indicate a choice, and brackets ([ and ]) to indicate an optional part. These symbols are called metasymbols; they are not part of the syntax being defined. We can also use the metasymbols ( and ) to enclose terminals and nonterminals so they may be used as a group. Braces ({ and }) are used to denote repetition zero or more times. In this book, we will use both EBNF and BNF. EBNF is short and clear, but BNF has some advantages which will become clear in chapter 5, Grammar.
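As a brief preview of how such a production can be turned into code (chapter 6 treats parsing properly), here is a minimal sketch in C of a recursive-descent recognizer for the module production above. The token names and the next_token interface are invented for this example, and the nonterminals function and declaration are reduced to single tokens for brevity:

#include <stdbool.h>

typedef enum { TOK_MODULE, TOK_IDENTIFIER, TOK_SEMICOLON,
               TOK_EXTERN, TOK_FUNCTION, TOK_DECLARATION, TOK_EOF } Token;

static Token lookahead;            /* primed by the caller with the first token */
extern Token next_token(void);     /* supplied by a lexer (assumed to exist) */

static bool expect(Token t)
{
    if (lookahead != t)
        return false;              /* syntax error */
    lookahead = next_token();      /* consume the token */
    return true;
}

bool parse_module(void)
{
    /* module identifier ; */
    if (!expect(TOK_MODULE) || !expect(TOK_IDENTIFIER) || !expect(TOK_SEMICOLON))
        return false;

    /* { [ extern ] ( function | declaration ) } */
    while (lookahead == TOK_EXTERN || lookahead == TOK_FUNCTION
           || lookahead == TOK_DECLARATION) {
        if (lookahead == TOK_EXTERN)
            expect(TOK_EXTERN);                 /* the optional extern */
        if (lookahead == TOK_FUNCTION)
            expect(TOK_FUNCTION);
        else if (!expect(TOK_DECLARATION))
            return false;
    }
    return lookahead == TOK_EOF;
}

Each EBNF metasymbol maps directly onto a control structure: the repetition braces become a while loop, the optional brackets an if statement, and the choice a cascade of token tests.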
3.3 Notation

Like all programming languages, Inger has a number of reserved words, operators and delimiters (table 3.1). These words cannot be used for anything other than their intended purpose, which will be discussed in the following sections.

One place where the reserved words may be used freely, along with any other words, is inside a comment. A comment is input text that is meant for the programmer, not the compiler, which skips comments entirely. Comments are delimited by the special character combinations /* and */ and may span multiple lines. Listing 3.3 contains some examples of legal comments.

The last comment in listing 3.3 starts with // and ends at the end of the line. This is a special form of comment called a single-line comment.

Functions, constants and variables may be given arbitrary names, or identifiers, by the programmer, provided reserved words are not used for this purpose. An identifier must begin with a letter or an underscore (_) to discern it from a number, and there is no limit to the identifier length (except physical memory). As a rule of thumb, 30 characters is a useful limit for the length of identifiers. Although an Inger compiler supports names much longer than that, more than 30 characters will make for confusing names which are too long to read. All
/* This is a comment. */

/* This is also a comment,
   spanning multiple
   lines. */

/*
 * This comment is decorated
 * with extra asterisks to
 * make it stand out.
 */

// This is a single-line comment.

Listing 3.3: Legal Comments
identifiers must be different, except when they reside in different scopes. Scopes will be discussed in greater detail later. We give a syntax diagram for identifiers in figure 3.2 and EBNF production rules for comparison:

Figure 3.2: Syntax diagram for identifier

identifier: ( _ | letter ) { letter | digit | _ }.
letter: A | ... | Z | a | ... | z.
digit: 0 | ... | 9.
Example 3.3 (Identifiers)

Valid identifiers include:

x
_GrandMastaD_
HeLlO_wOrLd

Some examples of invalid identifiers:

2day
bool
@a
2+2
□
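As a small illustration, the identifier syntax above can be checked with a few lines of C. This is a sketch only; it deliberately ignores the additional rule that reserved words (such as bool in the example) may not be used as identifiers:

#include <ctype.h>
#include <stdbool.h>

bool is_identifier(const char *s)
{
    if (!s || (*s != '_' && !isalpha((unsigned char)*s)))
        return false;                 /* must start with a letter or _ */
    for (++s; *s != '\0'; ++s)
        if (*s != '_' && !isalnum((unsigned char)*s))
            return false;             /* rest: letters, digits, _ */
    return true;
}

With this function, is_identifier("_GrandMastaD_") yields true, while is_identifier("2day") and is_identifier("@a") yield false.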
Of course, the programmer is free to choose wonderful names such as _ or x234. Even though the language allows this, these names are not very descriptive, and the programmer is encouraged to choose better names that describe the purpose of variables.

Inger supports two types of numbers: integer numbers (x ∈ N) and floating point numbers (x ∈ R). Integer numbers consist of only digits and are 32 bits wide. They have a very simple syntax diagram, shown in figure 3.3. Integer numbers also include hexadecimal numbers, which are numbers with radix 16. Hexadecimal numbers are written using 0 through 9 and A through F as digits. The case of the letters is unimportant. Hexadecimal numbers must be prefixed with 0x to set them apart from ordinary integers. Inger can also work with binary numbers (numbers with radix 2). These numbers are written using only the digits 0 and 1. Binary numbers must be postfixed with B or b to set them apart from other integers.

Figure 3.3: Syntax diagram for integer
Floating point numbers include a decimal separator (a dot) and an optional fractional part. They can be denoted using scientific notation (e.g. 12e-3). This makes their syntax diagram (figure 3.4) more involved than the syntax diagram for integers. Note that Inger always uses the dot as the decimal separator, and not the comma, as is customary in some locations.

Figure 3.4: Syntax diagram for float

Example 3.4 (Integers and Floating Point Numbers)

Some examples of valid integer numbers:

3   0x2e   1101b

Some examples of invalid integer numbers (note that some of these may be perfectly valid floating point numbers):

1a   0.2   2.0e8

Some examples of valid floating point numbers:

0.2   2.0e8   .34e-2

Some examples of invalid floating point numbers:

2e-2   2a
□
Alphanumeric information can be encoded in either single characters or
strings. A single character must be enclosed within apostrophes ( ' ) and a string
must begin and end with double quotes ( " ). Any and all characters may be
used within a string or as a single character, as long as they are printable (control
characters cannot be typed). If a control character must be used, Inger offers a
way to escape ordinary characters to generate control characters, analogous to
C. This is also the only way to include double quotes in a string, since they are
normally used to start or terminate a string (and would therefore confuse the
compiler if not treated specially). See table 3.2 for a list of escape sequences and
the special characters they produce.
A string may not span multiple lines. Note that while whitespace
such as spaces, tabs and end of line is normally used only to separate
symbols and is otherwise ignored, whitespace within a string remains unchanged by
the compiler.
  938. Example 3.5 (Sample Strings and Characters)
  939. Here are some sample single characters:
'b' '&' '7' '"' '''
  941. Valid strings include:
  942. "hello, world" "123"
  943. "\r\n" "\"hi!\""
  945. Escape Sequence Special character
\" "
\' '
  948. \\ \
  949. \a Audible bell
  950. \b Backspace
  951. \Bnnnnnnnn Convert binary value to character
  952. \f Form feed
  953. \n Line feed
  954. \onnn Convert octal value to character
  955. \r Carriage return
  956. \t Horizontal tab
  957. \v Vertical tab
  958. \xnn Convert hexadecimal value to character
  959. Table 3.2: Escape Sequences
  960. ?
This concludes the introduction to the notational conventions to which valid
Inger programs must adhere. In the next section, we will discuss the concept of
data (variables) and how data is defined in Inger.
  964. 3.4 Data
Almost all computer programs operate on data, by which we mean numbers
or text strings. At the lowest level, computers deal with data in the form of bits
(binary digits, which have a value of either 0 or 1 ), which are difficult to manipulate.
Inger programs can work at a higher level and offer several data abstractions
that provide a more convenient way to handle data than through raw bits.
The data abstractions in Inger are bool, char, float, int and untyped. All of
these except untyped are scalar types, i.e. their values form a subset of R. The untyped
data abstraction is a very different phenomenon. Each of the data abstractions
will be discussed in turn.
  974. 3.4.1 bool
Inger supports so-called boolean 1 values and the means to work with them.
Boolean values are truth values, either true or false. Variables of the boolean
data type (keyword bool ) can only be assigned to using the keywords true or
false , not 0 or 1 as other languages may allow.
1 In 1854, the mathematician George Boole (1815–1864) published An Investigation into the
Laws of Thought, on Which are Founded the Mathematical Theories of Logic and Probabilities.
Boole approached logic in a new way, reducing it to a simple algebra and incorporating logic
into mathematics. He pointed out the analogy between algebraic symbols and those that
represent logical forms. His work began the algebra of logic, called Boolean algebra, which now has
wide applications in telephone switching and the design of modern computers. Boole's work
has to be seen as a fundamental step in today's computer revolution.
  987. There is a special set of operators that work only with boolean values: see
  988. table 3.3. The result value of applying one of these operators is also a boolean
  989. value.
Operator   Operation
&&         Logical conjunction (and)
||         Logical disjunction (or)
!          Logical negation (not)

A B | A && B | A || B
F F |   F    |   F
F T |   F    |   T
T F |   F    |   T
T T |   T    |   T

A | !A
F |  T
T |  F
  1007. Table 3.3: Boolean Operations and Their Truth Tables
  1008. Some of the relational operators can be applied to boolean values, and all
  1009. yield boolean return values. In table 3.4, we list the relational operators and
  1010. their effect. Note that == and != can be applied to other types as well (not
  1011. just boolean values), but will always yield a boolean result. The assignment
  1012. operator = can be applied to many types as well. It will only yield a boolean
  1013. result when used to assign a boolean value to a boolean variable.
  1014. Operator Operation
  1015. == Equivalence
  1016. != Inequivalence
  1017. = Assignment
  1018. Table 3.4: Boolean relations
  1019. 3.4.2 int
Inger supports only one integral type, i.e. int . A variable of type int can store
any n ∈ Z, as long as n is within the range the computer can store using its
  1022. maximum word size. In table 3.5, we show the size of integers that can be stored
  1023. using given maximum word sizes.
  1024. Word size Integer Range
  1025. 8 bits -128..127
16 bits -32768..32767
  1027. 32 bits -2147483648..2147483647
  1028. Table 3.5: Integer Range by Word Size
  1029. Inger supports only signed integers, hence the negative ranges in the table.
Many operators can be used with integer types (see table 3.6), and most return a
value of type int as well. Most of these operators are polymorphic: their return
type corresponds to the type of their operands (which must be of the same
type).
  1035. Operator Operation
  1036. - unary minus
  1037. + unary plus
  1038. ~ bitwise complement
  1039. * multiplication
  1040. / division
  1041. % modulus
  1042. + addition
  1043. - subtraction
  1044. >> bitwise shift right
  1045. << bitwise shift left
  1046. < less than
  1047. <= less than or equal
  1048. > greater than
  1049. >= greater than or equal
  1050. == equality
  1051. != inequality
  1052. & bitwise and
  1053. ^ bitwise xor
  1054. | bitwise or
  1055. = assignment
  1056. Table 3.6: Operations on Integers
Of these operators, the unary minus ( - ), unary plus ( + ) and (unary) bitwise
complement ( ~ ) associate to the right (since they are unary) and the rest
associate to the left, except assignment ( = ), which also associates to the right.
The relational operators == , != , < , <= , >= and > have a boolean result value,
  1061. even though they have operands of type int . Some operations, such as additions
  1062. and multiplications, can overflow when their result value exceeds the maximum
  1063. range of the int type. Consult table 3.5 for the maximum ranges. If a and b are
  1064. integer expressions, the operation
  1065. a op b
  1066. will not overflow if (N is the integer range of a given system):
  1067. 1. a op b ∈ N
  1068. 2. a ∈ N
  1069. 3. b ∈ N
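To make these conditions concrete, here is a small sketch in C (not Inger) that
checks them for 32-bit integers before performing an addition. The helper name
safe_add is ours, invented for this illustration:

#include <limits.h>
#include <stdio.h>

/* Returns 1 and stores a + b in *result if the sum stays within the
 * integer range; returns 0 if the addition would overflow. The tests
 * are phrased so that they can never overflow themselves. */
int safe_add( int a, int b, int *result )
{
    if( ( b > 0 && a > INT_MAX - b ) || ( b < 0 && a < INT_MIN - b ) )
    {
        return 0; /* a + b would fall outside the integer range */
    }
    *result = a + b;
    return 1;
}

int main( void )
{
    int sum;
    if( safe_add( INT_MAX, 1, &sum ) == 0 )
    {
        printf( "overflow\n" ); /* this branch is taken */
    }
    return 0;
}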
  1070. 3.4.3 float
  1071. The float type is used to represent an element of R, although only a small part
  1072. of R is supported, using 8 bytes. A subset of the operators that can be used
  1074. with operands of type int can also be used with operands of type float (see table
  1075. 3.7).
  1076. Operator Operation
  1077. - unary minus
  1078. + unary plus
  1079. * multiplication
  1080. / division
  1081. + addition
  1082. - subtraction
  1083. < less than
  1084. <= less than or equal
  1085. > greater than
  1086. >= greater than or equal
  1087. == equality
  1088. != inequality
  1089. = assignment
  1090. Table 3.7: Operations on Floats
  1091. Some of these operations yield a result value of type float , while others (the
  1092. relational operators) yield a value of type bool . Note that Inger supports only
  1093. floating point values of 8 bytes, while other languages also support 4-byte so-
  1094. called float values (while 8-byte types are called double).
  1095. 3.4.4 char
  1096. Variables of type char may be used to store single unsigned bytes (8 bits) or
  1097. single characters. All operations that can be performed on variables of type int
  1098. may also be applied to operands of type char . Variables of type char may be
  1099. initialized with actual characters, like so:
char c = 'a';
  1101. All escape sequences from table 3.2 may be used to initialize a variable of
  1102. type char , although only one at a time, since a char represents only a single
  1103. character.
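Because char is treated as a small unsigned integer, ordinary arithmetic can be
applied to characters. The following sketch is in C, which shares this property
with Inger:

#include <stdio.h>

int main( void )
{
    char c = 'a';
    char d = c + 1;                  /* 'b': one code point further on */
    char e = c - ( 'a' - 'A' );      /* 'A': classic case-conversion trick */
    printf( "%c %c %c\n", c, d, e ); /* prints: a b A */
    return 0;
}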
  1104. 3.4.5 untyped
  1105. In contrast to all the types discussed so far, the untyped type does not have a
  1106. fixed size. untyped is a polymorphic type, which can be used to represent any
  1107. other type. There is one catch: untyped must be used as a pointer.
  1108. Example 3.6 (Use of Untyped)
  1109. The following code is legal:
untyped *a; untyped **b;
  1111. but this code is not:
  1113. untyped p;
  1114. ?
This example introduces the new concept of a pointer. Any type may have
one or more levels of indirection, which is denoted using one or more asterisks
( * ). For an in-depth discussion on pointers, consult C Programming Language[1]
  1118. by Kernighan and Ritchie.
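Readers who know C may think of untyped as a rough analogue of C's void *:
a pointer that can refer to data of any type, but which cannot be declared as a
plain (non-pointer) value. A minimal C sketch of the same idea:

#include <stdio.h>

int main( void )
{
    int    n = 42;
    double x = 3.14;
    void  *p; /* one pointer, able to refer to either variable */

    p = &n;
    printf( "%d\n", *(int *) p );    /* cast back before dereferencing */
    p = &x;
    printf( "%f\n", *(double *) p );
    return 0;
}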
  1119. 3.5 Declarations
  1120. All data and functions in a program must have a name, so that the programmer
  1121. can refer to them. No module may contain or refer to more than one function
  1122. with the same name; every function name must be unique. Giving a variable
  1123. or a function in the program a type (in case of a function: input types and
  1124. an output type) and a name is called declaring the variable or function. All
  1125. variables must be declared before they can be used, but functions may be used
  1126. before they are defined.
  1127. An Inger program consists of a number of declarations of either global vari-
  1128. ables or functions. The variables are called global because they are declared at
  1129. the outermost scope of the program. Functions can have their own variables,
  1130. which are then called local variables and reside within the scope of the function.
In listing 3.4, three global variables are declared and accessed from within the
functions f and g . This code demonstrates that global variables can be accessed
from within any function.
  1134. Local variables can only be accessed from within the function in which they
  1135. are declared. Listing 3.5 shows a faulty program, in which variable i is accessed
  1136. from a scope in which it cannot be seen.
Variables are declared by naming their type ( bool , char , float , int or untyped ),
their level of indirection, their name and finally their array size. This structure
is shown in a syntax diagram in figure 3.5, and in the BNF production rules in
listing 3.6.
  1141. The syntax diagram and BNF productions show that it is possible to declare
  1142. multiple variables using one declaration statement, and that variables can be
  1143. initialized in a declaration. Consult the following example to get a feel for
  1144. declarations:
  1145. Example 3.7 (Examples of Declarations)
char *a, b = 'Q', *c = 0x0;
  1147. int number = 0;
  1148. bool completed = false, found = true;
Here the variable a is a pointer to char , which is not explicitly initialized; if no
initialization value is given, Inger will initialize a variable to 0. b is of type char and
is initialized to 'Q' , and c is a pointer to data of type char . This pointer is
initialized to 0x0 ( null ) – note that this does not initialize the char data to which
c points to null . The variable number is of type int and is initialized to 0. The
  1154. variables completed and found are of type bool and are initialized to false and true
  1155. respectively.
  1156. ?
/*
 * globvar.i - demonstration
 * of global variables.
 */
module globvar;

int i;
bool b;
char c;

g: void → void
{
    i = 0;
    b = false;
    c = 'b';
}

start f: void → void
{
    i = 1;
    b = true;
    c = 'a';
}
  1179. Listing 3.4: Global Variables
/*
 * locvar.i - demonstration
 * of local variables.
 */
module locvar;

g: void → void
{
    i = 1; /* will not compile */
}

start f: void → void
{
    int i = 0;
}
  1193. Listing 3.5: Local Variables
  1195. Figure 3.5: Declaration Syntax Diagram
declarationblock: type declaration { "," declaration } .
declaration: { "*" } identifier { "[" intliteral "]" } [ "=" expression ] .
type: bool | char | float | int | untyped .
  1213. Listing 3.6: BNF for Declaration
  1215. 3.6 Action
A computer program is not worth much if it does not contain some instructions
(statements) that execute actions which operate on the data that the program
declares. Actions come in two categories: simple statements and compound
statements.
  1220. 3.6.1 Simple Statements
  1221. There are many possible action statements, but there is only one statement
  1222. that actually has a side effect, i.e. manipulates data: this is the assignment
  1223. statement, which stores a value in a variable. The form of an assignment is:
  1224. <variable> = <expression>
  1225. = is the assignment operator. The variable to which an expression value
  1226. is assigned is called the left hand side or lvalue, and the expression which is
  1227. assigned to the variable is called the right hand side or rvalue.
  1228. The expression on the right hand side can consist of any (valid) combination
  1229. of constant values (numbers), variables, function calls and operators. The ex-
  1230. pression is evaluated to obtain a value to assign to the variable on the left hand
  1231. side of the assignment. Expression evaluation is done using the well-known rules
  1232. of mathematics, with regard to operator precedence and associativity. Consult
  1233. table 3.8 for all operators and their priority and associativity.
  1234. Example 3.8 (Expressions)
2 * 3 - 4 * 5 = (2 * 3) - (4 * 5) = -14
15 / 4 * 4 = (15 / 4) * 4 = 12
80 / 5 / 3 = (80 / 5) / 3 = 5
4 / 2 * 3 = (4 / 2) * 3 = 6
9.0 * 3 / 2 = (9.0 * 3) / 2 = 13.5
  1240. ?
  1241. The examples show that a division of two integers results in an integer type
  1242. (rounded down), while if either one (or both) of the operands to a division is of
  1243. type float , the result will be float .
  1244. Any type of variable can be assigned to, so long as the expression type
  1245. and the variable type are equivalent. Assignments may also be chained, with
  1246. multiple variables being assigned the same expression with one statement. The
  1247. following example shows some valid assignments:
  1248. Example 3.9 (Expressions)
int a, b;
int c = a = b = 2 + 1;
int my_sum = a * b + c; /* 12 */
  1252. ?
  1253. All statements must be terminated with a semicolon ( ; ).
Operator Priority Associativity Description
  1256. () 1 L function application
  1257. [] 1 L array indexing
  1258. ! 2 R logical negation
  1259. - 2 R unary minus
  1260. + 2 R unary plus
  1261. ~ 3 R bitwise complement
  1262. * 3 R indirection
  1263. & 3 R referencing
  1264. * 4 L multiplication
  1265. / 4 L division
  1266. % 4 L modulus
  1267. + 5 L addition
  1268. - 5 L subtraction
  1269. >> 6 L bitwise shift right
  1270. << 6 L bitwise shift left
  1271. < 7 L less than
  1272. <= 7 L less than or equal
  1273. > 7 L greater than
  1274. >= 7 L greater than or equal
  1275. == 8 L equality
  1276. != 8 L inequality
  1277. & 9 L bitwise and
  1278. ^ 10 L bitwise xor
  1279. | 11 L bitwise or
  1280. && 12 L logical and
  1281. || 12 L logical or
  1282. ?: 13 R ternary if
  1283. = 14 R assignment
  1284. Table 3.8: Operator Precedence and Associativity
  1286. 3.6.2 Compound Statements
A compound statement is a group of zero or more statements contained within
braces ( { and } ). These statements are executed as a group, in the sequence in
which they are written. Compound statements are used in many places in Inger,
including the body of a function, the action associated with an if -statement and
a while -statement. The form of a compound statement is:
block: "{" code "}" .
code: ε .
code: block code .
code: statement code .
  1300. Figure 3.6: Syntax diagram for block
The BNF productions show that a compound statement, or block, may
contain zero (empty), one or more statements, and may contain other blocks
  1303. as well. In the following example, the function f has a block of its own (the
  1304. function body), which contains another block, which finally contains a single
  1305. statement (a declaration).
  1306. Example 3.10 (Compound Statement)
module compound;

start f: void → void
{
    {
        int a = 1;
    }
}
  1314. ?
  1315. 3.6.3 Repetitive Statements
Compound statements (including compound statements with only one statement
in their body) can be wrapped inside a repetitive statement to cause them to be
  1319. executed multiple times. Some programming languages come with multiple
  1320. flavors of repetitive statements; Inger has only one: the while statement.
  1321. The while statement has the following BNF productions (also consult figure
  1322. 3.7 for the accompanying syntax diagram):
statement: while "(" expression ")" do block .
  1326. Figure 3.7: Syntax diagram for while
  1327. The expression between the parentheses must be of type bool . Before exe-
  1328. cuting the compound statement contained in the block , the repetitive statement
  1329. checks that expression evaluates to true. After the code contained in block has
  1330. executed, the repetitive statement evaluates expression again and so on until the
  1331. value of expression is false. If the expression is initially false, the compound
  1332. statement is executed zero times.
  1333. Since the expression between parentheses is evaluated each time the repeti-
  1334. tive statement (or loop) is executed, it is advised to keep the expression simple
  1335. so as not to consume too much processing time, especially in longer loops.
The demonstration program in listing 3.7 was taken from the section The
while statement in Wirth's PASCAL User Manual ([7]) and translated
to Inger.
The printint function and the #import directive will be discussed in a later
section. The output of this program is 2.9287 , printed on the console. It should be
noted that the compound statement controlled by the while statement must be contained
in braces; it cannot be specified by itself (as it can be in the C programming
language).
  1344. Inger provides some additional control statements, that may be used in con-
  1345. junction with while : break and continue . The keyword break may be used to
  1346. prematurely leave a while -loop. It is often used from within the body of an if
  1347. statement, as shown in listings 3.8 and 3.9.
  1348. The continue statement is used to abort the current iteration of a loop and
  1349. continue from the top. Its use is analogous to break : see listings 3.10 and 3.11.
  1350. The use of break and continue is discouraged, since they tend to make a
  1351. program less readable.
  1352. 3.6.4 Conditional Statements
  1353. Not every statement must be executed. The choice of statements to execute
  1354. can be made using conditional statements. Inger provides the if and switch
  1355. statements.
  1356. The if statement
  1357. An if statement consists of a boolean expression, and one or two compound
  1358. statements. If the boolean expression is true, the first compound statement is
/*
 * Compute h(n) = 1 + 1/2 + 1/3 + ... + 1/n
 * for a known n.
 */
module while_demo;

#import "printint.ih"

start main: void → void
{
    int n = 10;
    float h = 0;

    while( n > 0 ) do
    {
        h = h + 1.0 / n;
        n = n - 1;
    }
    printint( h );
}
  1377. Listing 3.7: The While Statement
int a = 10;

while( a > 0 ) do
{
    if ( a == 5 )
    {
        break;
    }
    printint( a );
    a = a - 1;
}
  1387. Listing 3.8: The Break Statement
10
9
8
7
6
  1393. Listing 3.9: The Break Statement (output)
int a = 10;

while( a > 0 ) do
{
    if ( a % 2 != 0 )
    {
        a = a - 1;
        continue;
    }
    printint( a );
    a = a - 1;
}
  1404. Listing 3.10: The Continue Statement
10
8
6
4
2
  1410. Listing 3.11: The Continue Statement (output)
  1411. executed. If the boolean expression evaluates to false, the second compound
  1412. statement (if any) is executed. Remember that compound statements need
  1413. not contain multiple statements; they can contain a single statement or no
  1414. statements at all.
The above definition of the if conditional statement has the following BNF
production associated with it (also consult figure 3.8 for the equivalent syntax
diagram):
statement: if "(" expression ")" block elseblock .
elseblock: ε .
elseblock: else block .
  1425. Figure 3.8: Syntax diagram for if
  1426. The productions for the elseblock show that the if statement may contain a
  1427. second compound statement (which is executed if the boolean expression argu-
  1429. ment evaluates to false) or no second statement at all. If there is a second block,
  1430. it must be prefixed with the keyword else .
  1431. As with the while statement, it is not possible to have the if statement execute
  1432. single statements, only blocks contained within braces. This approach solves the
  1433. dangling else problem from which the Pascal programming language suffers.
The “roman numerals” program (listing 3.12, copied from [7] and translated
to Inger) illustrates the use of the if and while statements.
  1436. The case statement
  1437. The if statement only allows selection from two alternatives. If more alterna-
  1438. tives are required, the else blocks must contain secondary if statements up to
  1439. the required depth (see listing 3.14 for an example). Inger also provides the
switch statement, which consists of an expression (the selector) and a list of
alternative cases. The cases are labelled with numbers (integers); the switch
  1442. statement evaluates the selector expression (which must evaluate to type inte-
  1443. ger) and executes the alternative whose label matches the result. If no case
  1444. has a matching label, switch executes the default case (which is required to be
  1445. present). The following BNF defines the switch statement more precisely:
statement: switch "(" expression ")" "{" cases defaultblock "}" .
defaultblock: default block .
cases: ε .
cases: case intliteral block cases .
  1453. This is also shown in the syntax diagram in figure 3.9.
  1454. Figure 3.9: Syntax diagram for switch
It should be clear that the use of the switch statement in listing 3.15 is much
clearer than the multiway if statement from listing 3.14.
  1457. There cannot be duplicate case labels in a case statement, because the com-
  1458. piler would not know which label to jump to. Also, the order of the case labels
  1459. is of no concern.
  1460. 3.6.5 Flow Control Statements
Flow control statements are statements that cause the execution of a program
to stop, move to another location in the program, and continue. Inger offers
one such statement: the goto_considered_harmful statement. The name of this
/* Write roman numerals for the powers of 2. */
module roman_numerals;

#import "stdio.ih"

start main: void → void
{
    int x, y = 1;

    while( y <= 5000 ) do
    {
        x = y;
        printint( x );
        while( x >= 1000 ) do
        {
            printstr( "M" );
            x = x - 1000;
        }
        if( x >= 500 )
        {
            printstr( "D" );
            x = x - 500;
        }
        while( x >= 100 ) do
        {
            printstr( "C" );
            x = x - 100;
        }
        if( x >= 50 )
        {
            printstr( "L" );
            x = x - 50;
        }
        while( x >= 10 ) do
        {
            printstr( "X" );
            x = x - 10;
        }
        if( x >= 5 )
        {
            printstr( "V" );
            x = x - 5;
        }
        while( x >= 1 ) do
        {
            printstr( "I" );
            x = x - 1;
        }
        printstr( "\n" );
        y = 2 * y;
    }
}
  1515. Listing 3.12: Roman Numerals
Output:

1 I
2 II
4 IIII
8 VIII
16 XVI
32 XXXII
64 LXIIII
128 CXXVIII
256 CCLVI
512 DXII
1024 MXXIIII
2048 MMXXXXVIII
4096 MMMMLXXXXVI
  1531. Listing 3.13: Roman Numerals Output
if ( a == 0 )
{
    printstr( "Case 0\n" );
}
else
{
    if ( a == 1 )
    {
        printstr( "Case 1\n" );
    }
    else
    {
        if ( a == 2 )
        {
            printstr( "Case 2\n" );
        }
        else
        {
            printstr( "Case >2\n" );
        }
    }
}
  1554. Listing 3.14: Multiple If Alternatives
switch( a )
{
    case 0
    {
        printstr( "Case 0\n" );
    }
    case 1
    {
        printstr( "Case 1\n" );
    }
    case 2
    {
        printstr( "Case 2\n" );
    }
    default
    {
        printstr( "Case >2\n" );
    }
}
  1575. Listing 3.15: The Switch Statement
int n = 10;
label here;

printint( n );
n = n - 1;
if ( n > 0 )
{
    goto_considered_harmful here;
}
  1584. Listing 3.16: The Goto Statement
statement (instead of the more common goto ) is a tribute to the Dutch computer
scientist Edsger W. Dijkstra. 2
The goto_considered_harmful statement causes control to jump to a specified
(textual) label, which the programmer must provide using the label keyword.
There may not be any duplicate labels throughout the entire program, regardless
of scope level. For an example of the goto statement, see listing 3.16.
The goto_considered_harmful statement is provided for convenience, but its use
is strongly discouraged (as the name suggests), since it is detrimental to the
structure of a program.
2 Edsger W. Dijkstra (1930–2002) studied mathematics and physics in Leiden, The Netherlands.
He obtained his PhD degree with a thesis on computer communications, and has
since been a pioneer in computer science; he was awarded the ACM Turing Award in
1972. Dijkstra is best known for his theories about structured programming, including a
famous article titled Goto Considered Harmful. Dijkstra's scientific work may be found at
http://www.cs.utexas.edu/users/EWD.
  1601. 3.7 Array
  1602. Beyond the simple types bool , char , float , int and untyped discussed earlier, Inger
  1603. supports the advanced data type array. An array contains a predetermined
  1604. number of elements, all of the same type. Examples are an array of elements of
  1605. type int , or an array whose elements are of type bool . Types cannot be mixed.
  1606. The elements of an array are laid out in memory in a sequential manner.
  1607. Since the number and size of the elements is fixed, the location of any element
  1608. in memory can be calculated, so that all elements can be accessed equally fast.
  1609. Arrays are called random access structures for this reason. In the section on
  1610. declarations, BNF productions and a syntax diagram were shown which included
  1611. array brackets ( [ and ] ). We will illustrate their use here with an example:
  1612. int a [5];
  1613. declares an array of five elements of type int . The individual elements can
  1614. be accessed using the [] indexing operator, where the index is zero-based: a[0]
  1615. accesses the first element in the array, and a[4] accesses the last element in the
  1616. array. Indexed array elements may be used wherever a variable of the array’s
  1617. type is allowed. As an example, we translate another example program from N.
  1618. Wirth’s Pascal User Manual ([7]), in listing 3.17.
  1619. Arrays (matrices) may have more than one dimension. In declarations, this
  1620. is specified thus:
  1621. int a [4][6];
which declares a to be a 4 × 6 matrix. Element access is similar: a[2][2]
accesses the element of a at row 2, column 2. There is no limit to the number
of dimensions used in an array.
  1625. Inger has no way to initialize an array, with the exception of character
  1626. strings. An array of characters may be initialized with a string constant, as
  1627. shown in the code below:
char a[20] = "hello, world!";
In this code, the first 13 elements of array a are initialized with the corresponding
characters from the string constant "hello, world!" . a[13] is initialized with zero,
to indicate the end of the string, and the remaining characters are uninitialized.
This example also shows that Inger works with zero-terminated strings, just like
  1633. the C programming language. However, one could say that Inger has no concept
  1634. of string; a string is just an array of characters, like any other array. The fact
  1635. that strings are zero-terminated (so-called ASCIIZ-strings) is only relevant to
  1636. the system support libraries, which provide string manipulation functions.
  1637. It is not possible to assign an array to another array. This must be done
  1638. on an element-by-element basis. In fact, if any operator except the indexing
  1639. operator ( [] ) is used with an array, the array is treated like a typed pointer.
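The constant-time access mentioned above follows from simple address arithmetic.
The C sketch below assumes a row-major layout (as C uses; the text does not
spell out Inger's exact layout, but the same arithmetic applies to any fixed,
sequential layout) and recomputes the address of an element of a 4 × 6 matrix
by hand:

#include <stdio.h>

int main( void )
{
    int a[4][6];
    int i = 2, j = 2;

    /* address of a[i][j] = base address + (i * 6 + j) elements */
    int *computed = (int *) a + ( i * 6 + j );
    printf( "%d\n", computed == &a[i][j] ); /* prints 1: same address */
    return 0;
}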
  1640. 3.8 Pointers
  1641. Any declaration may include some level of indirection, making the variable a
  1642. pointer. Pointers contain addresses; they are not normally used for storage
minmax: int a[], n → void
{
    int min, max, i, u, v;

    min = a[0]; max = min; i = 1;
    while( i < n-1 ) do
    {
        u = a[i]; v = a[i+1];
        if ( u > v )
        {
            if ( u > max ) { max = u; }
            if ( v < min ) { min = v; }
        }
        else
        {
            if ( v > max ) { max = v; }
            if ( u < min ) { min = u; }
        }
        i = i + 2;
    }
    if ( i == n-1 )
    {
        if ( a[n-1] > max )
        {
            max = a[n-1];
        }
        else if ( a[n-1] < min )
        {
            min = a[n-1];
        }
    }
    printint( min );
    printint( max );
}
  1677. Listing 3.17: An Array Example
  1679. themselves, but to point to other variables (hence the name). Pointers are a
  1680. convenient mechanism to pass large data structures between functions or mod-
  1681. ules. Instead of copying the entire data structure to the receiver, the receiver is
  1682. told where it can access the data structure (given the address).
  1683. The & operator can be used to retrieve the address of any variable, so it can
  1684. be assigned to a pointer, and the * operator is used to access the variable at a
  1685. given address. Examine the following example code to see how this works:
int a;
int *b = &a;
*b = 2;
printint( a ); /* 2 */
  1690. The variable b is assigned the address of variable a . Then, the value 2 is
  1691. assigned to the variable to which b points ( a ), using the dereferencing operator
  1692. ( * ). After this, a contains the value 2 .
  1693. Pointers need not always refer to non-pointer variables; it is perfectly possible
  1694. for a pointer to refer to another pointer. Pointers can also hold multiple levels
  1695. of indirection, and can be dereferenced multiple times:
int a;
int *b = &a;
int **c = &b;
**c = 2;
printint( a ); /* 2 */
Pointers have another use: they can contain the address of a dynamic variable.
While ordinary variables declared using the declaration statements discussed
earlier are called static variables and reside on the stack, dynamic variables
live on the heap. The only way to create them is by using operating system
  1705. functions to allocate memory for them, and storing their address in a pointer,
  1706. which must be used to access them for all subsequent operations until the oper-
  1707. ating system is told to release the memory that the dynamic variable occupies.
  1708. The allocation and deallocation of memory for dynamic variables is beyond the
  1709. scope of this text.
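Although Inger's own allocation functions are outside the scope of this text, the
pattern described above can be sketched in C, where malloc and free play the
role of the operating system functions:

#include <stdio.h>
#include <stdlib.h>

int main( void )
{
    int *p = malloc( sizeof *p ); /* the dynamic variable lives on the heap */
    if( p == NULL )
    {
        return 1; /* allocation can fail */
    }
    *p = 2;               /* all access goes through the pointer */
    printf( "%d\n", *p );
    free( p );            /* tell the system to release the memory */
    return 0;
}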
  1710. 3.9 Functions
Most of the examples thus far contained a single function, prefixed with the
keyword start and often postfixed with something like void → void . In this section,
we discuss how to write additional functions, which are an essential element
of Inger if one wants to write larger programs.
  1715. The purpose of a function is to encapsulate part of a program and associate
  1716. it with a name or identifier. Any Inger program consists of at least one function:
  1717. the start function, which is marked with the keyword start . To become familiar
  1718. with the structure of a function, let us examine the syntax diagram for a function
  1719. (figure 3.10 and 3.11). The associated BNF is a bit lengthy, so we will not print
  1720. it here.
  1722. Figure 3.10: Syntax diagram for function
  1723. Figure 3.11: Syntax diagram for formal parameter block
  1724. A function must be declared before it can be used. The declaration does
  1725. not necessarily have to precede the actual use of the function, but it must take
  1726. place at some point. The declaration of a function couples an identifier (the
  1727. function name) to a set of function parameters (which may be empty), a return
  1728. value (which may be none), and a function body. An example of a function
  1729. declaration may be found in listing 3.17 (the minmax function).
  1730. Function parameters are values given to the function which may influence the
  1731. way it executes. Compare this to mathematical function definitions: they take
  1732. an input variable (usually x) and produce a result. The function declarations
  1733. in Inger are in fact modelled after the style in which mathematical functions
  1734. are defined. Function parameters must always have names so that the code in
  1735. the function can refer to them. The return value of a function does not have a
  1736. name. We will illustrate the declaration of functions with some examples.
  1737. Example 3.11 (Function Declarations)
  1738. The function f takes no arguments and produces no result. Although such a
  1739. function may seem useless, it is still possible for it to have a side effect, i.e. an
  1740. influence besides returning a value:
  1741. f : void → void
  1742. The function g takes an int and a bool parameter, and returns an int value:
  1743. g: int a; bool b → int
  1744. Finally, the function h takes a two-dimensional array of char as an argument,
  1745. and returns a pointer to an int :
h: char str[][] → int *
  1748. ?
  1749. In the previous example, several sample function headers were given. Apart
  1750. from a header, a function must also have a body, which is simply a block of code
  1751. (contained within braces). From within the function body, the programmer may
  1752. refer to the function parameters as if they were local variables.
  1753. Example 3.12 (Function Definition)
  1754. Here is sample definition for the function g from the previous example:
g: int a; bool b → int
{
    if ( b == true )
    {
        return( a );
    }
    else
    {
        return( -a );
    }
}
  1766. ?
  1767. The last example illustrates the use of the return keyword to return from a
  1768. function call, while at the same time setting the return value. All functions
  1769. (except functions which return void ) must have a return statement somewhere in
  1770. their code, or their return value may never be set.
Some functions take no parameters at all. This class of functions is called
void , and we use the keyword void to identify them. It is also possible that a
function has no return value. Again, we use the keyword void to indicate this.
There are functions that take no parameters and return nothing: double void.
  1775. Now that functions have been defined, they need to be invoked, since that’s
  1776. the reason they exist. The () operator applies a function. It must be supplied
  1777. to call a function, even if that function takes no parameters ( void ).
  1778. Example 3.13 (Function Invocation)
  1779. The function f from example 3.11 has no parameters. It is invoked like this:
  1780. f ();
  1781. Note the use of () , even for a void function. The function g from the same
  1782. example might be invoked with the following parameters:
  1783. int result = g( 3, false ); /* -3 */
  1784. The programmer is free to choose completely different values for the param-
  1785. eters. In this example, constants have been supplied, but it is legal to fill in
  1786. variables or even complete expressions which can in turn contain function calls:
  1788. int result = g( g( 3, false ), false ); /* 3 */
  1789. ?
  1790. Parameters are always passed by value, which means that their value is
  1791. copied to the target function. If that function changes the value of the param-
  1792. eter, the value of the original variable remains unchanged:
  1793. Example 3.14 (By Value vs. By Reference)
  1794. Suppose we have the function f , which is defined so:
f: int a → void
{
    a = 2;
}
  1799. To illustrate invocation by value, we do this:
  1800. int i = 1;
  1801. f(i );
  1802. printint (i ); /* 1 */
  1803. It is impossible to change the value of the input variable i , unless we redefine
  1804. the function f to accept a pointer:
f: int *a → void
{
    *a = 2;
}
Now, the address of i is passed by value, but still points to the actual memory
where i is stored. Thus i can be changed:
int i = 1;
f( &i );
printint( i ); /* 2 */
  1814. ?
  1815. 3.10 Modules
  1816. Not all code for a program has to reside within the same module. A program may
  1817. consist of multiple modules, one of which is the main module, which contains
  1818. one (and only one) function marked with the keyword start . This is the function
  1819. that will be executed when the program starts. A start function must always
  1820. be void → void , because there is no code that provides it with parameters and
  1821. no code to receive a return value. There can be only one module with a start
/*
 * printint.c
 *
 * Implementation of printint()
 */
#include <stdio.h>

void printint( int x )
{
    printf( "%d\n", x );
}
  1832. Listing 3.18: C-implementation of printint Function
/*
 * printint.ih
 *
 * Header file for printint.c
 */
extern printint: int x → void;
  1839. Listing 3.19: Inger Header File for printint Function
  1840. function. The start function may be called by other functions like any other
  1841. function.
Data and functions may be shared between modules using the extern keyword.
If a variable int a is declared in one module, it can be imported by another
module with the statement extern int a . The same goes for functions. The extern
statements are usually placed in a header file, with the .ih extension. Such files
can be referenced from Inger source code with the #import directive.
In listing 3.18, a C function called printint is defined. We wish to use this
function in an Inger program, so we write a header file called printint.ih which
contains an extern statement to import the C function (listing 3.19). Finally,
the Inger program in listing 3.20 can access the C function by importing the
header file with the #import directive.
  1852. 3.11 Libraries
Unlike other popular programming languages, Inger has no builtin functions
(e.g. read , write , sin , cos etc.). The programmer has to write all required functions
himself, or import them from a library. Inger code can be linked into a static
or dynamic library using the linker. A library consists of one or more code modules,
none of which may contain a start function (if one or more of them do, the linker
will complain). The compiler does not check the existence or nonexistence of start
functions, except for printing an error when there is more than one start function
in the same module.
Auxiliary functions need not be in an Inger module; they can also be implemented
in the C programming language.
/*
 * printint.i
 *
 * Uses C-implementation of
 * printint()
 */
module program;

#import "printint.ih"

int a, b;

start main: void → void
{
    a = b = 1;
    printint( a + b );
}
  1878. Listing 3.20: Inger Program Using printint
In order to use such functions in
  1880. an Inger program, an Inger header file ( .ih ) must be provided for the C library,
  1881. which contains extern function declarations for all the functions used in the Inger
  1882. program. A good example is the stdio.ih header file supplied with the Inger com-
  1883. piler. This header file is an interface to the ANSI C stdio library.
  1884. 3.12 Conclusion
  1885. This concludes the introduction to the Inger language. Please refer to the ap-
  1886. pendices, in particular appendices C and D for detailed tables on operator prece-
  1887. dence and the BNF productions for the entire language.
  1889. Bibliography
[1] B. Kernighan and D. Ritchie: C Programming Language (2nd Edition),
Prentice Hall, 1988.
[2] A. C. Hartmann: A Concurrent Pascal Compiler for Minicomputers, Lecture
Notes in Computer Science, Springer-Verlag, Berlin, 1977.
[3] M. Marcotty and H. Ledgard: The World of Programming Languages,
Springer-Verlag, Berlin, 1986, pages 41 and following.
[4] American National Standards Institute: ANSI X3.159-1989. American National
Standard for Information Systems - Programming Language C, ANSI,
New York, USA, 1989.
[5] R. S. Scowen: Extended BNF - a Generic Base Standard, Final Report,
SEG C1 N10 (DITC Software Engineering Group), National Physical Laboratory,
Teddington, Middlesex, UK, 1993.
[6] W. Waite: ANSI C Specification,
http://www.cs.colorado.edu/~eliuser/c_html/c.html
[7] N. Wirth and K. Jensen: PASCAL User Manual and Report, Lecture Notes
in Computer Science, Springer-Verlag, Berlin, 1975.
  1905. in computer science, Springer-Verlag, Berlin 1975.
  1907. Part II
  1908. Syntax
  1910. Humans can understand a sentence, spoken in a language, when they hear
  1911. it (provided they are familiar with the language being spoken). The brain is
  1912. trained to process the incoming string of words and give meaning to the sentence.
  1913. This process can only take place if the sentence under consideration obeys the
  1914. grammatical rules of the language, or else it would be gibberish. This set of rules
  1915. is called the syntax of a language and is denoted using a grammar. This part of
  1916. the book, syntax analysis, gives an introduction to formal grammars (notation
  1917. and manipulation) and how they are used to read (parse) actual sentences in
  1918. a language. It also discusses ways to vizualize the information gleaned from a
  1919. sentence in a tree structure (a parse tree). Apart from theoretical aspects, the
  1920. text treats practical matters such as lexical analysis (breaking a line of text up
  1921. into individual words and recognizing language keywords among them) and tree
  1922. traversal.
  1924. Chapter 4
  1925. Lexical Analyzer
  1926. 4.1 Introduction
  1927. The first step in the compiling process involves reading source code, so that
  1928. the compiler can check that source code for errors before translating it to, for
  1929. example, assembly language. All programming languages provide an array of
  1930. keywords, like IF , WHILE , SWITCH and so on. A compiler is not usually interested
  1931. in the individual characters that make up these keywords; these keywords are
  1932. said to be atomic. However, in some cases the compiler does care about the
  1933. individual characters that make up a word: an integer number (e.g. 12345 ),
  1934. a string (e.g. ”hello, world” ) and a floating point number (e.g. 12e-09 ) are all
  1935. considered to be words, but the individual characters that make them up are
  1936. significant.
  1937. This distinction requires special processing of the input text, and this special
  1938. processing is usually moved out of the parser and placed in a module called the
  1939. lexical analyzer, or lexer or scanner for short. It is the lexer’s responsibility to
  1940. divide the input stream into tokens (atomic words). The parser (the module
  1941. that deals with groups of tokens, checking that their order is valid) requests a
  1942. token from the lexer, which reads characters from the input stream until it has
  1943. accumulated enough characters to form a complete token, which it returns to
  1944. the parser.
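The interaction just described is a simple pull model: the parser repeatedly asks
the lexer for the next complete token. A minimal C sketch of that interface is
shown below; all names (Token, next_token, TOKEN_EOF) are ours, invented
purely for illustration:

#include <stdio.h>

typedef enum { TOKEN_WORD, TOKEN_EOF } TokenType;

typedef struct
{
    TokenType type;
} Token;

/* Stub lexer: a real one would read characters from the input
 * stream until a complete token has been accumulated. */
static Token next_token( void )
{
    static int calls = 0;
    Token t;
    t.type = ( calls++ < 3 ) ? TOKEN_WORD : TOKEN_EOF;
    return t;
}

int main( void )
{
    Token t;
    /* The parser's point of view: pull tokens until the input ends. */
    while( ( t = next_token() ).type != TOKEN_EOF )
    {
        printf( "got a token\n" );
    }
    return 0;
}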
  1945. Example 4.1 (Tokenizing)
  1946. Given the input
  1947. the quick brown fox jumps over the lazy dog
  1949. a lexer will split this into the tokens
  1950. the , quick , brown , fox , jumps , over , the , lazy and dog
  1951. ?
  1952. The process of splitting input into tokens is called tokenizing or scanning.
  1953. Apart from tokenizing a sentence, a lexer can also divide tokens up into classes.
This process is called screening. Consider the following example:
  1955. Example 4.2 (Token Classes)
  1956. Given the input
  1957. the sum of 2 + 2 = 4.
  1958. a lexer will split this into the following tokens, with classes:
  1959. Word: the
  1960. Word: sum
  1961. Word: of
  1962. Number: 2
  1963. Plus: +
  1964. Number: 2
  1965. Equals: =
  1966. Number: 4
  1967. Dot: .
  1968. ?
  1969. Some token classes are very narrow (containing only one token), while others
  1970. are broad. For example, the token class Word is used to represent the , sum and
  1971. of , while the token class Dot can only be used when a . is read. Incidentally,
  1972. the lexical analyzer must know how to separate individual tokens. In program
  1973. source text, keywords are usually separated by whitespace (spaces, tabs and line
  1974. feeds). However, this is not always the case. Consider the following input:
  1975. Example 4.3 (Token Separation)
  1976. Given the input
  1977. sum=(2+2)*3;
  1978. a lexer will split this into the following tokens:
  1979. sum, =, (, 2, +, 2, ), *, 3 and ;
  1980. ?
  1981. Those familiar with popular programming languages like C or Pascal may
  1982. know that mathematical tokens like numbers, =, + and * are not required to
  1983. be separated from each other by whitespace. The lexer must have some way to
  1984. know when a token ends and the next token (if any) begins. In the next section,
  1986. we will discuss the theory of regular languages to further clarify this point and
  1987. how lexers deal with it.
Lexers have an additional interesting property: they can be used to filter
out input that is not important to the parser, so that the parser has fewer
different token types to deal with. Block comments and line comments are examples of
uninteresting input.
A token class may represent a (large) collection of values. The token class
OP_MULTIPLY , representing the multiplication operator * , contains only one
token (*), but the token class LITERAL_INTEGER can represent the collection of
all integers. We say that 2 is an integer, and so are 256 , 381 and so on. A compiler
is not only interested in the fact that a token is a literal integer, but also in
the value of that literal integer. This is why tokens are often accompanied by a
token value. In the case of the number 2 , the token could be LITERAL_INTEGER
and the token value could be 2 .
  2000. Token values can be of many types: an integer number token has a token
  2001. value of type integer, a floating point number token has a token value of type
  2002. float or double , and a string token has a token value of type char * . Lexical
  2003. analyzers therefore often store token values using a union (a C construct that
  2004. allows a data structure to map fields of different type on the same memory,
  2005. provided that only one of these fields is used at the same time).
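A minimal C sketch of such a token structure is given below; the type and field
names are ours, chosen for illustration only:

#include <stdio.h>

typedef enum { LITERAL_INTEGER, LITERAL_FLOAT, LITERAL_STRING } TokenClass;

typedef struct
{
    TokenClass kind; /* which token class this token belongs to */
    union            /* the fields below share the same memory; */
    {                /* only the field matching 'kind' may be used */
        int    intvalue;
        double floatvalue;
        char  *strvalue;
    } value;
} Token;

int main( void )
{
    Token t;
    t.kind = LITERAL_INTEGER;
    t.value.intvalue = 2;
    printf( "integer token, value %d\n", t.value.intvalue );
    return 0;
}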
  2006. 4.2 Regular Language Theory
  2007. The lexical analyzer is a submodule of the parser. While the parser deals with
  2008. context-free grammars (a higher level of abstraction), the lexer deals with in-
  2009. dividual characters which form tokens (words). Some tokens are simple ( IF or
  2010. WHILE ) while others are complex. All integer numbers, for example, are repre-
  2011. sented using the same token ( INTEGER ), which covers many cases ( 1 , 100 , 5845
  2012. and so on). This requires some notation to match all integer numbers, so they
  2013. are treated the same.
  2014. The answer lies in realizing that the collection of integer numbers is really a
small language, with very strict rules (it is a so-called regular language). Before
we can show what regular languages are, we must discuss some preliminary
definitions first.
  2018. A language is a set of rules that says which sentences can be generated by
  2019. stringing together elements of a certain alphabet. An alphabet is a collection
  2020. of symbols (or entire words) denoted as Σ. The set of all strings that can be
  2021. generated from an alphabet is denoted Σ ∗ . A language over an alphabet Σ is a
  2022. subset of Σ ∗ .
  2023. We now define, without proof, several operations that may be performed
  2024. on languages. The first operation on languages that we present is the binary
  2025. concatenation operation.
  2026. Definition 4.1 (Concatenation operation)
  2027. Let X and Y be two languages. Then XY is the concatenation of these lan-
  2028. guages, so that:
  2029. XY = {uv | u ∈ X ∧ v ∈ Y }.
  2031. ?
Concatenation of a language with itself is also possible and is denoted X^2.
Such concatenation can be performed any number of times, e.g. X^7. We
will illustrate the definition of concatenation with an example.
  2035. Example 4.4 (Concatenation operation)
  2036. Let Σ be the alphabet {a,b,c}.
  2037. Let X be the language over Σ with X = {aa,bb}.
  2038. Let Y be the language over Σ with Y = {ca,b}.
  2039. Then XY is the language {aaca,aab,bbca,bbb}.
  2040. ?
  2041. The second operation that will need to define regular languages is the binary
  2042. union operation.
  2043. Definition 4.2 (Union operation)
  2044. Let X and Y be two languages. Then X ∪ Y is the union of these languages
  2045. with
  2046. X ∪ Y = {u | u ∈ X ∨ u ∈ Y }.
  2047. ?
  2048. Note that the priority of concatenation is higher than the priority of union.
  2049. Here is an example that shows how the union operation works:
  2050. Example 4.5 (Union operation)
  2051. Let Σ be the alphabet {a,b,c}.
Let X be the language over Σ with X = {aa,bb}.
Let Y be the language over Σ with Y = {ca,b}.
Then X ∪ Y is the language over Σ with X ∪ Y = {aa,bb,ca,b}.
  2055. ?
  2056. The final operation that we need to define is the unary Kleene star opera-
  2057. tion. 1
  2058. Definition 4.3 (Kleene star)
  2059. Let X be a language. Then
X ∗ = X^0 ∪ X^1 ∪ X^2 ∪ ... = ⋃_{i=0}^{∞} X^i (4.1)
  2064. 1 The mathematician Stephen Cole Kleene was born in 1909 in Hartford, Connecticut.
  2065. His research was on the theory of algorithms and recursive functions. According to Robert
  2066. Soare, “From the 1930’s on Kleene more than any other mathematician developed the notions
  2067. of computability and effective process in all their forms both abstract and concrete, both
  2068. mathematical and philosophical. He tended to lay the foundations for an area and then move
  2069. on to the next, as each successive one blossomed into a major research area in his wake.”
  2070. Kleene died in 1994.
  2072. ?
  2073. Or, in words: X ∗ means that you can take 0 or more sentences from X and
  2074. concatenate them. The Kleene star operation is best clarified with an example.
  2075. Example 4.6 (Kleene star)
  2076. Let Σ be the alphabet {a,b}.
Let X be the language over Σ with X = {aa,bb}.
Then X ∗ is the language {λ,aa,bb,aaaa,aabb,bbaa,bbbb,...}.
  2079. ?
  2080. There is also an extension to the Kleene star. XX ∗ may be written X + ,
  2081. meaning that at least one string from X must be taken (whereas X ∗ allows the
  2082. empty string λ).
  2083. With these definitions, we can now give a definition for a regular language.
  2084. Definition 4.4 (Regular languages)
  2085. 1. Basis: ∅, {λ} and {a} are regular languages.
  2086. 2. Recursive step: Let X and Y be regular languages. Then
  2087. X ∪ Y is a regular language
  2088. XY is a regular language
  2089. X ∗ is a regular language
  2090. ?
  2091. Now that we have established what regular languages are, it is important to
  2092. note that lexical analyzer generators (software tools that will be discussed below)
  2093. use regular expressions to denote regular languages. Regular expressions are
  2094. merely another way of writing down regular languages. In regular expressions,
  2095. it is customary to write the language consisting of a single string composed of
  2096. one word, {a}, as a.
  2097. Definition 4.5 (Regular expressions)
  2098. Using recursion, we can define regular expressions as follows:
  2099. 1. Basis: ∅, λ and a are regular expressions.
  2100. 2. Recursive step: Let X and Y be regular expressions. Then
  2101. X ∪ Y is a regular expression
  2102. XY is a regular expression
  2103. X ∗ is a regular expression
  2104. ?
  2105. As you can see, the definition of regular expressions differs from the definition
of regular languages only by a notational convenience (fewer braces to write).
So any language that can be composed of other regular languages or expressions
using concatenation, union, and the Kleene star, is also a regular language
or expression. Note that of these three operations, the Kleene star has the
highest priority, followed by concatenation; union has the lowest priority.
  2112. Example 4.7 (Regular Expression)
Let a and b be regular expressions by definition 4.5(1). Then ab is a regular
expression by definition 4.5(2) through concatenation. (ab ∪ b) is a regular
expression by definition 4.5(2) through union. (ab ∪ b) ∗ is a regular expression
by definition 4.5(2) through the Kleene star. The sentences that can be generated by
(ab ∪ b) ∗ are {λ,ab,b,abb,bab,babab,...}.
  2118. ?
  2119. While context-free grammars are normally denoted using production rules,
for regular languages it is sufficient to use the easy-to-read regular expressions.
  2121. 4.3 Sample Regular Expressions
  2122. In this section, we present a number of sample regular expressions to illustrate
the theory presented in the previous section. From now on, we will no longer
use bold a to denote {a}, since we will soon move to UNIX regular expressions
  2125. which do not use bold either.
Regular expression   Sentences generated
q                    the sentence q
qqq                  the sentence qqq
q∗                   all sentences of 0 or more q’s
q+                   all sentences of 1 or more q’s
q ∪ λ                the empty sentence or q; often denoted
                     as q? (see UNIX regular expressions)
b∗(b+a ∪ λ)b∗        the collection of sentences that begin
                     with 0 or more b’s, followed by either
                     one or more b’s followed by an a, or
                     nothing, followed by 0 or more b’s.
  2137. These examples show that through repeated application of definition 4.5,
  2138. complex sequences can be defined. This feature of regular expressions is used
  2139. for constructing lexical analyzers.
  2140. 4.4 UNIX Regular Expressions
  2141. Under UNIX, several extensions to regular expressions have been implemented
that we can use. A UNIX regular expression[3] is commonly called a regex
(plural: regexes).
  2144. There is no union operator on UNIX. Instead, we supply a list of alternatives
  2145. contained within square brackets.
  2146. [abc] ≡ (a ∪ b ∪ c)
  2147. To avoid having to type in all the individual letters when we want to match
  2148. all lowercase letters, the following syntax is allowed:
  2149. [a-z] ≡ [abcdefghijklmnopqrstuvwxyz]
  2151. UNIX does not have a λ either. Here is the alternative syntax:
  2152. a? ≡ a ∪ λ
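Combining these notations, the regex [_A-Za-z][_A-Za-z0-9]* describes identifiers as found in many programming languages (this exact expression returns in the Inger specification in section 4.8), and [0-9]+ describes integer numbers.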
  2153. Lexical analyzer generators allow the user to directly specify these regular
  2154. expressions in order to identify lexical tokens (atomic words that string together
  2155. to make sentences). We will discuss such a generator program shortly.
  2156. 4.5 States
  2157. With the theory of regular languages, we can now find out how a lexical analyzer
  2158. works. More specifically, we can see how the scanner can divide the input
  2159. (34+12) into separate tokens.
  2160. Suppose the programming language for which we wish to write a scanner
  2161. consists only of sentences of the form ( number + number ) . Then we require the
  2162. following regular expressions to define the tokens.
  2163. Token Regular expression
  2164. ( (
  2165. ) )
  2166. + +
  2167. number [0-9]+
  2168. A lexer uses states to determine which characters it can expect, and which
  2169. may not occur in a certain situation. For simple tokens ( ( , ) and + ) this is easy:
  2170. either one of these characters is read or it is not. For the number token, states
  2171. are required.
  2172. As soon as the first digit of a number is read, the lexer enters a state in
  2173. which it expects more digits, and nothing else. If another digit is read, the lexer
remains in this state and adds the digit to the token read so far. If something
else (not a digit) is read, the lexer knows the number token is finished and leaves
  2176. the number state, returning the token to the caller (usually the parser). After
  2177. that, it tries to match the unexpected character (maybe a + ) to another token.
  2178. Example 4.8 (States)
  2179. Let the input be (34+12) . The lexer starts out in the base state. For every
  2180. character read from the input, the following table shows the state that the lexer
  2181. is currently in and the action it performs.
□
Character read State Action taken
  2184. ( base Return ( to caller
  2185. 3 base Save 3 , enter number state
  2186. 4 number Save 4
  2187. + number + not expected. Leave number
  2188. state and return 34 to caller
  2189. + base Return + to caller
  2190. 1 base Save 1 , enter number state
  2191. 2 number Save 2
  2192. ) number ) unexpected. Leave number
  2193. state and return 12 to caller
) base Return ) to caller
  2196. This example did not include whitespace (spaces, line feeds and tabs) on pur-
  2197. pose, since it tends to be confusing. Most scanners ignore spacing by matching
  2198. it with a special regular expression and doing nothing.
  2199. There is another rule of thumb used by lexical analyzer generators (see the
  2200. discussion of this software below): they always try to return the longest token
  2201. possible.
  2202. Example 4.9 (Token Length)
  2203. = and == are both tokens. Now if = was read and the next character is also =
  2204. then == will be returned instead of two times = .
□
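A hand-written lexer typically implements this rule with a single character of lookahead. The following C sketch is our own illustration (the token codes mirror the OP_ASSIGN and OP_EQUAL identifiers used in section 4.8); it assumes the initial = has already been read:

#include <stdio.h>

enum { OP_ASSIGN = 1, OP_EQUAL };

/* Called after an initial = has been read: peek at the next
   character to decide between = and ==. */
int scanEquals( FILE *in )
{
    int next = getc( in );
    if( next == '=' )
        return OP_EQUAL;    /* ==: the longest token wins */
    ungetc( next, in );     /* not ours; it starts the next token */
    return OP_ASSIGN;       /* plain = */
}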
In summary, a lexer determines which characters are valid in the input at
any given time through a set of states, one of which is the active state. Different
states have different valid characters in the input stream. Some characters cause
the lexer to shift from its current state into another state.
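To make this concrete, here is a minimal hand-written scanner in C for the ( number + number ) language used above. It is only a sketch under our own naming conventions, not the way a generated lexer is organized: the do/while loop is the number state, and ungetc plays the role of leaving that state when an unexpected character appears.

#include <ctype.h>
#include <stdio.h>

enum { TOK_EOF, TOK_LPAREN, TOK_RPAREN, TOK_PLUS, TOK_NUMBER };

static int lastNumber;  /* value of the most recent NUMBER token */

int nextToken( void )
{
    int c = getchar();
    switch( c )
    {
    case EOF: return TOK_EOF;
    case '(': return TOK_LPAREN;
    case ')': return TOK_RPAREN;
    case '+': return TOK_PLUS;
    default:
        if( isdigit( c ) )
        {
            /* number state: stay here as long as digits appear */
            lastNumber = 0;
            do
            {
                lastNumber = lastNumber * 10 + ( c - '0' );
                c = getchar();
            }
            while( isdigit( c ) );
            ungetc( c, stdin );  /* unexpected character: leave the state */
            return TOK_NUMBER;
        }
        return nextToken();      /* skip whitespace and anything else */
    }
}

int main( void )
{
    int token;
    while( ( token = nextToken() ) != TOK_EOF )
        printf( "token class %d\n", token );
    return 0;
}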
  2210. 4.6 Common Regular Expressions
  2211. This section discusses some commonly used regular expressions for interesting
  2212. tokens, such as strings and comments.
  2213. Integer numbers
  2214. An integer number consists of only digits. It ends when a non-digit character is
  2215. encountered. The scanner must watch out for an overflow, e.g. 12345678901234
  2216. does not fit in most programming languages’ type systems and should cause the
  2217. scanner to generate an overflow error.
  2218. The regular expression for integer numbers is
  2219. [0-9]+
  2220. This regular expression generates the collection of strings containing at least
  2221. one digit, and nothing but digits.
  2222. Practical advice 4.1 (Lexer Overflow)
  2223. If the scanner generates an overflow or similar error, parsing of the source code
  2224. can continue (but no target code can be generated). The scanner can just
replace the faulty value with a correct one, e.g. “1”.
□
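One way to implement this check is to test, before each digit is accumulated, whether the new value would exceed the maximum. A minimal sketch in C (the function name is our own; LONG_MAX comes from the standard C library):

#include <limits.h>

/* Add the next digit d (0-9) to *value; returns 0 if this would
   overflow, so the caller can report an error and substitute 1. */
int addDigit( long *value, int d )
{
    if( *value > ( LONG_MAX - d ) / 10 )
        return 0;
    *value = *value * 10 + d;
    return 1;
}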
  2227. Floating point numbers
  2228. Floating point numbers have a slightly more complex syntax than integer num-
  2229. bers. Here are some examples of floating point numbers:
  2230. 1.0 , .001 , 1e-09 , .2e+5
  2231. The regular expression for floating point numbers is:
  2233. [0-9]* . [0-9]+ ( e [+-] [0-9]+ )?
  2234. Spaces were added for readability. These are not part of the generated
  2235. strings. The scanner should check each of the subparts of the regular expression
  2236. containing digits for possible overflow.
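In C, one way to perform this check is to let the standard library do the conversion and inspect errno afterwards; a sketch (strtod sets errno to ERANGE on overflow or underflow, and the recovery value follows practical advice 4.1):

#include <errno.h>
#include <stdlib.h>

/* Convert matched text (e.g. flex’s yytext) to a double.
   Returns 0 on overflow or underflow, substituting the value 1.0. */
int toFloat( const char *text, double *result )
{
    errno = 0;
    *result = strtod( text, NULL );
    if( errno == ERANGE )
    {
        *result = 1.0;
        return 0;
    }
    return 1;
}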
  2237. Practical advice 4.2 (Long Regular Expressions)
  2238. If a regular expression becomes long or too complex, it is possible to split it up
  2239. into multiple regular expressions. The lexical analyzer’s internal state machine
  2240. will still work.
□
  2242. Strings
  2243. Strings are a token type that requires some special processing by the lexer. This
  2244. should become clear when we consider the following sample input:
  2245. "3+4"
Even though this input contains numbers and the + operator, each of which
has a regular expression of its own, the entire string should be returned
to the caller since it is contained within double quotes. The trick to do this is to
  2249. introduce another state to the lexical analyzer, called an exclusive state. When
  2250. in this state, the lexer will process only regular expressions marked with this
  2251. state. The resulting regular expressions are these:
  2252. Regular expression Action
  2253. " Enter string state
  2254. string . Store character. A dot (.) means any-
  2255. thing. This regular expression is only
  2256. considered when the lexer is in the
  2257. string state.
  2258. string " Return to previous state. Return string
  2259. contents to caller. This regular expres-
  2260. sion is only considered when the lexer
  2261. is in the string state.
  2262. Practical advice 4.3 (Exclusive States)
  2263. You can write code for exclusive states yourself (when writing a lexical analyzer
  2264. from scratch), but AT&T lex and GNU flex can do it for you.
□
  2266. The regular expressions proposed above for strings do not heed line feeds.
You may want to disallow line feeds within strings, though. Then you must add
another regular expression that matches the line feed character (\n in some
  2269. languages) and generates an error when it is encountered within a string.
  2270. The lexer writer must also be wary of a buffer overflow; if the program source
  2271. code consists of a " and hundreds of thousands of letters (at least, not another
  2272. "), a compiler that does not check for buffer overflow conditions will eventually
  2273. crash for lack of memory. Note that you could match strings using a single
  2274. regular expression:
  2276. "(.)*"
  2277. but the state approach makes it much easier to check for buffer overflow
  2278. conditions since you can decide at any time whether the current character must
  2279. be stored or not.
  2280. Practical advice 4.4 (String Limits)
  2281. To avoid a buffer overflow, limit the string length to about 64 KB and generate
  2282. an error if more characters are read. Skip all the offending characters until
  2283. another " is read (or end of file).
□
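In code, this limit amounts to a simple bounds check in the action that stores each string character. A sketch in C (the buffer size and names are our own choice):

#define MAX_STRING 65536

static char stringBuf[ MAX_STRING ];
static int stringLen;

/* Append one character to the string under construction; returns 0
   once the string is too long, so the caller can generate an error
   and skip input until the next " is read. */
int storeChar( char c )
{
    if( stringLen >= MAX_STRING - 1 )
        return 0;
    stringBuf[ stringLen++ ] = c;
    stringBuf[ stringLen ] = '\0';
    return 1;
}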
  2285. Comments
  2286. Most compilers place the job of filtering comments out of the source code with
  2287. the lexical analyzer. We can therefore create some regular expressions that do
  2288. just that. This once again requires the use of an exclusive state. In programming
  2289. languages, the beginning and end of comments are usually clearly marked:
  2290. Language Comment style
  2291. C /* comment */
  2292. C++ // comment (line feed)
  2293. Pascal { comment }
  2294. BASIC REM comment :
  2295. We can build our regular expressions around these delimiters. Let’s build
  2296. sample expressions using the C comment delimiters:
  2297. Regular expression Action
  2298. /* Enter comment state
  2299. comment . Ignore character. A dot (.) means any-
  2300. thing. This regular expression is only
  2301. considered when the lexer is in the com-
  2302. ment state.
  2303. comment */ Return to previous state. Do not re-
  2304. turn to caller but read next token, effec-
  2305. tively ignoring the comment. This reg-
  2306. ular expression is only considered when
  2307. the lexer is in the comment state.
  2308. Using a minor modification, we can also allow nested comments. To do this,
  2309. we must have the lexer keep track of the comment nesting level. Only when the
  2310. nesting level reaches 0 after leaving the final comment should the lexer leave the
  2311. comment state. Note that you could handle comments using a single regular
  2312. expression:
  2313. /* (.)* */
  2314. But this approach does not support nested comments. The treatment of line
  2315. comments is slightly easier. Only one regular expression is needed:
  2316. //(.)*\n
  2318. 4.7 Lexical Analyzer Generators
  2319. Although it is certainly possible to write a lexical analyzer by hand, this task
  2320. becomes increasingly complex as your input language gets richer. It is therefore
more practical to use a lexical analyzer generator. The code generated by such a
generator program is usually faster and more efficient than any code you might
  2323. write by hand[2].
  2324. Here are several candidates you could use:
  2325. AT&T lex Not free, ancient, UNIX and Linux im-
  2326. plementations
  2327. GNU flex Free, modern, Linux implementation
  2328. Bumblebee lex Free, modern, Windows implementa-
  2329. tion
  2330. The Inger compiler was constructed using GNU flex; in the next sections we
  2331. will briefly discuss its syntax (since flex takes lexical analyzer specifications as
  2332. its input) and how to use the output flex generates.
  2333. Practical advice 4.5 (Lex)
  2334. We heard that some people think that a lexical analyzer must be written in lex
  2335. or flex in order to be called a lexer. Of course, this is blatant nonsense (it is the
  2336. other way around).
□
  2338. Flex syntax
  2339. The layout of a flex input file (extension .l) is, in pseudocode:
  2340. %{
  2341. Any preliminary C code (inclusions, defines) that
  2342. will be pasted in the resulting .C file
  2343. %}
  2344. Any flex definitions
  2345. %%
  2346. Regular expressions
  2347. %%
  2348. Any C code that will be appended to
  2349. the resulting .C file
  2350. When a regular expression matches some input text, the lexical analyzer
  2351. must execute an action. This usually involves informing the caller (the parser)
  2352. of the token class found. With an action included, the regular expressions take
  2353. the following form:
  2354. [0-9]+ {
  2355. intValue_g = atoi( yytext );
  2356. return( INTEGER );
  2357. }
Using return( INTEGER ), the lexer informs the caller (the parser) that
it has found an integer. It can only return one item (the token class) so the
  2361. actual value of the integer is passed to the parser through the global variable
  2362. intValue_g. Flex automatically stores the characters that make up the current
  2363. token in the global string yytext.
  2364. Sample flex input file
  2365. Here is a sample flex input file for the language that consists of sentences of
  2366. the form (number+number), and that allows spacing anywhere (except within
  2367. tokens).
  2368. %{
  2369. #define NUMBER 1000
  2370. int intValue_g;
  2371. %}
  2372. %%
"(" { return( '(' ); }
")" { return( ')' ); }
"+" { return( '+' ); }
  2376. [0-9]+ {
  2377. intValue_g = atoi( yytext );
  2378. return( NUMBER );
  2379. }
  2380. %%
  2381. int main()
  2382. {
  2383. int result;
  2384. while( ( result = yylex() ) != 0 )
  2385. {
  2386. printf( "Token class found: %d\n", result );
  2387. }
  2388. return( 0 );
  2389. }
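To experiment with this specification, save it in a file (say, sample.l; the name is arbitrary), let flex generate the scanner and compile the result with a C compiler:

flex sample.l
cc lex.yy.c -o scanner -lfl

By default, flex writes its output to lex.yy.c; linking with -lfl supplies the default yywrap function that the generated scanner calls when it reaches the end of its input.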
  2390. For many more examples, consult J. Levine’s Lex and yacc [2].
  2391. 4.8 Inger Lexical Analyzer Specification
  2392. As a practical example, we will now discuss the token categories in the Inger
  2393. language, and all regular expressions used for complex tokens. The full source
  2394. for the Inger lexer is included in appendix F.
  2395. Inger discerns several token categories: keywords ( IF , WHILE and so on),
  2396. operators (+, % and more), complex tokens (integer numbers, floating point
  2397. numbers, and strings), delimiters (parentheses, brackets) and whitespace.
We will list the tokens in each category and show which regular expression
is used to match them.
  2401. Keywords
  2402. Inger expects all keywords (sometimes called reserved words) to be written in
  2403. lowercase, allowing the literal keyword to be used to match the keyword itself.
  2404. The following table illustrates this:
  2405. Token Regular Expression Token identifier
  2406. break break KW_BREAK
  2407. case case KW_CASE
  2408. continue continue KW_CONTINUE
  2409. default default KW_DEFAULT
  2410. do do KW_DO
  2411. else else KW_ELSE
  2412. false false KW_FALSE
goto considered harmful goto_considered_harmful KW_GOTO
  2415. if if KW_IF
  2416. label label KW_LABEL
  2417. module module KW_MODULE
  2418. return return KW_RETURN
  2419. start start KW_START
  2420. switch switch KW_SWITCH
  2421. true true KW_TRUE
  2422. while while KW_WHILE
  2423. Types
  2424. Type names are also tokens. They are invariable and can therefore be matched
  2425. using their full name.
  2426. Token Regular Expression Token identifier
  2427. bool bool KW_BOOL
  2428. char char KW_CHAR
  2429. float float KW_FLOAT
  2430. int int KW_INT
  2431. untyped untyped KW_UNTYPED
  2432. Note that the untyped type is equivalent to void in the C language; it is a
  2433. polymorphic type. One or more reference symbols (*) must be added after the
  2434. untyped keyword. For instance, the declaration
  2435. untyped ** a;
  2436. declares a to be a double polymorphic pointer.
  2437. Complex tokens
Inger’s complex tokens are variable identifiers, integer literals, floating point literals
  2439. and character literals.
  2441. Token Regular Expression Token identifier
  2442. integer literal [0-9]+ INT
  2443. identifier [_A-Za-z][_A-Za-z0-9]* IDENTIFIER
  2444. float [0-9]*\.[0-9]+([eE][\+-][0-9]+)? FLOAT
char \'.\' CHAR
  2446. Strings
In Inger, strings cannot span multiple lines. Strings are read using an exclusive
  2448. lexer string state. This is best illustrated by some flex code:
  2449. \" { BEGIN STATE_STRING; }
  2450. <STATE_STRING>\" { BEGIN 0; return( STRING ); }
  2451. <STATE_STRING>\n { ERROR( "unterminated string" ); }
  2452. <STATE_STRING>. { (store a character) }
  2453. <STATE_STRING>\\\" { (add " to string) }
  2454. If a linefeed is encountered while reading a string, the lexer displays an error
  2455. message, since strings may not span lines. Every character that is read while in
  2456. the string state is added to the string, except ", which terminates a string and
  2457. causes the lexer to leave the exclusive string state. Using the \" control code,
the programmer can actually add the " (double quote) character to a string.
  2459. Comments
  2460. Inger supports two types of comments: line comments (which are terminated
  2461. by a line feed) and block comments (which must be explicitly terminated).
  2462. Line comments can be read (and subsequently skipped) using a single regular
  2463. expression:
  2464. "//"[^\n]*
  2465. whereas block comments need an exclusive lexer state (since they can also
  2466. be nested). We illustrate this again using some flex code:
"/*" { BEGIN STATE_COMMENTS;
  2468. ++commentlevel; }
  2469. <STATE_COMMENTS>"/*" { ++commentlevel; }
  2470. <STATE_COMMENTS>. { }
  2471. <STATE_COMMENTS>\n { }
  2472. <STATE_COMMENTS>"*/" { if( --commentlevel == 0 )
  2473. BEGIN 0; }
  2474. Once a comment is started using /*, the lexer sets the comment level to 1
  2475. and enters the comment state. The comment level is increased every time a
  2476. /* is encountered, and decreased every time a */ is read. While in comment
  2477. state, all characters but the comment start and end delimiters are discarded.
  2478. The lexer leaves the comment state after the last comment block terminates.
  2480. Operators
  2481. Inger provides a large selection of operators, of varying priority. They are listed
  2482. here in alphabetic order of the token identifiers. This list includes only atomic
  2483. operators, not operators that delimit their argument on both sides, like function
  2484. application.
  2485. funcname ( expr[,expr...] )
  2486. or array indexing
  2487. arrayname [ index ].
  2488. In the next section, we will present a list of all operators (including function
  2489. application and array indexing) sorted by priority.
Some operators consist of multiple characters. The lexer can discern between
such operators by looking one character ahead in the input stream and switching
states (as explained in section 4.5).
  2493. Token Regular Expression Token identifier
  2494. addition + OP_ADD
  2495. assignment = OP_ASSIGN
  2496. bitwise and & OP_BITWISE_AND
  2497. bitwise complement ~ OP_BITWISE_COMPLEMENT
  2498. bitwise left shift << OP_BITWISE_LSHIFT
  2499. bitwise or | OP_BITWISE_OR
  2500. bitwise right shift >> OP_BITWISE_RSHIFT
  2501. bitwise xor ^ OP_BITWISE_XOR
  2502. division / OP_DIVIDE
  2503. equality == OP_EQUAL
  2504. greater than > OP_GREATER
  2505. greater or equal >= OP_GREATEREQUAL
  2506. less than < OP_LESS
  2507. less or equal <= OP_LESSEQUAL
  2508. logical and && OP_LOGICAL_AND
  2509. logical or || OP_LOGICAL_OR
  2510. modulus % OP_MODULUS
  2511. multiplication * OP_MULTIPLY
  2512. logical negation ! OP_NOT
  2513. inequality != OP_NOTEQUAL
  2514. subtract - OP_SUBTRACT
  2515. ternary if ? OP_TERNARY_IF
  2516. Note that the * operator is also used for dereferencing (in unary form) besides
  2517. multiplication, and the & operator is also used for indirection besides bitwise
  2518. and.
  2519. Delimiters
Inger has a number of delimiters. They are listed here by their function
description.
  2523. Token Regexp Token identifier
  2524. precedes function return type -> ARROW
  2525. start code block { LBRACE
  2526. end code block } RBRACE
  2527. begin array index [ LBRACKET
  2528. end array index ] RBRACKET
  2529. start function parameter list : COLON
  2530. function argument separation , COMMA
  2531. expression priority, function application ( LPAREN
  2532. expression priority, function application ) RPAREN
  2533. statement terminator ; SEMICOLON
  2534. The full source to the Inger lexical analyzer is included in appendix F.
  2536. Bibliography
[1] G. Goos, J. Hartmanis: Compiler Construction - An Advanced Course,
Lecture Notes in Computer Science, Springer-Verlag, Berlin, 1974.
[2] J. Levine: Lex and Yacc, O’Reilly & Associates, 2000.
[3] H. Spencer: POSIX 1003.2 regular expressions, UNIX man page regex(7),
1994.
  2543. Chapter 5
  2544. Grammar
  2545. 5.1 Introduction
  2546. This chapter will introduce the concepts of language and grammar in both
  2547. informal and formal terms. After we have established exactly what a grammar
  2548. is, we offer several example grammars with documentation.
  2549. This introductory section discusses the value of the material that follows
  2550. in writing a compiler. A compiler can be thought of as a sequence of actions,
  2551. performed on some code (formulated in the source language) that transform
  2552. that code into the desired output. For example, a Pascal compiler transforms
  2553. Pascal code to assembly code, and a Java compiler transforms Java code to its
  2554. corresponding Java bytecode.
  2555. If you have used a compiler in the past, you may be familiar with “syntax
  2556. errors”. These occur when the input code does not conform to a set of rules set
  2557. by the language specification. You may have forgotten to terminate a statement
  2558. with a semicolon, or you may have used the THEN keyword in a C program (the
  2559. C language defines no THEN keyword).
One of the things that a compiler does when transforming source code to
target code is check the structure of the source code. This is a required step
before the compiler can move on to the later phases of translation.
The first thing we must do when writing a compiler is write a grammar
for the source language. This chapter explains what a grammar is and how to
create one. Furthermore, it introduces several common ways of writing down a
grammar.
  2568. 5.2 Languages
  2569. In this section we will try to formalize the concept of a language. When thinking
  2570. of languages, the first languages that usually come to mind are natural languages
  2571. like English or French. This is a class of languages that we will only consider in
  2572. passing here, since they are very difficult to understand by a computer. There
  2573. is another class of languages, the computer or formal languages, that are far
easier to parse since they obey a rigid set of rules. This is in contrast with
natural languages, whose lenient rules allow the speaker a great deal of freedom
in expressing himself.
  2577. Computers have been and are actively used to translate natural languages,
  2578. both for professional purposes (for example, voice-operated computers or Mi-
crosoft SQL Server’s English Query) and in games. The first so-called adventure
game¹ was written as early as 1975 and it was played by typing in English
commands.
  2582. All languages draw the words that they allow to be used from a pool of
  2583. words, called the alphabet. This is rather confusing, because we tend to think
  2584. of the alphabet as the 26 latin letters, A through Z. However, the definition of
  2585. a language is not concerned with how its most basic elements, the words, are
  2586. constructed from individual letters, but how these words are strung together.
  2587. In definitions, an alphabet is denoted as Σ.
  2588. A language is a collection of sentences or strings. From all the words that a
  2589. language allows, many sentences can be built but only some of these sentences
  2590. are valid for the language under consideration. All the sentences that may be
  2591. constructed from an alphabet Σ are denoted Σ ∗ . Also, there exists a special
  2592. sentence: the sentence with no words in it. This sentence is denoted λ.
  2593. In definitions, we refer to words using lowercase letters at the beginning of
  2594. our alphabet (a,b,c...), while we refer to sentences using letters near the end of
  2595. our alphabet (u,v,w,x...). We will now define how sentences may be built from
  2596. words.
  2597. Definition 5.1 (Alphabet)
  2598. Let Σ be an alphabet. Σ ∗ , the set of strings over Σ, is defined recursively as
  2599. follows:
1. Basis: λ ∈ Σ∗.
2. Recursive step: If w ∈ Σ∗ and a ∈ Σ, then wa ∈ Σ∗.
  2602. 3. Closure: w ∈ Σ ∗ only if it can be obtained from λ by a finite number of
  2603. applications of the recursive step.
□
  2605. 1 In early 1977, Adventure swept the ARPAnet. Willie Crowther was the original author,
  2606. but Don Woods greatly expanded the game and unleashed it on an unsuspecting network.
  2607. When Adventure arrived at MIT, the reaction was typical: after everybody spent a lot of time
  2608. doing nothing but solving the game (it’s estimated that Adventure set the entire computer
  2609. industry back two weeks), the true lunatics began to think about how they could do it better
  2610. [proceeding to write Zork] (Tim Anderson, “The History of Zork – First in a Series” New Zork
  2611. Times; Winter 1985)
This definition may need some explanation. It is stated using induction;
what this means will become clear in a moment.
In the basis (line 1 of the definition), we state that the empty string (λ) is
a sentence over Σ. This is a statement, not a proof. We just state that for any
alphabet Σ, the empty string λ is among the sentences that may be constructed
from it.
  2619. In the recursive step (line 2 of the definition), we state that given a string w
  2620. that is part of Σ ∗ , the string wa is also part of Σ ∗ . Note that w denotes a string,
  2621. and a denotes a single word. Therefore what we mean is that given a string
  2622. generated from the alphabet, we may append any word from that alphabet to
  2623. it and the resulting string will still be part of the set of strings that can be
  2624. generated from the alphabet.
Finally, in the closure (line 3 of the definition), we add that all the strings
that can be built using the basis and recursive step are part of the set Σ∗ of
strings over Σ, and all the other strings are not. You can think of this as a sort
of safeguard for the definition. In most inductive definitions, we will leave the
closure line out.
  2630. Is Σ ∗ , then, a language? The answer is no. Σ ∗ is the set of all possible
  2631. strings that may be built using the alphabet Σ. Only some of these strings are
  2632. actually valid for a language. Therefore a language over an alphabet Σ is a
  2633. subset of Σ ∗ .
  2634. As an example, consider a small part of the English language, with the
alphabet { ’dog’, ’bone’, ’the’, ’eats’ } (we cannot consider the actual English
  2636. language, as it has far too many words to list here). From this alphabet, we can
  2637. derive strings using definition 5.1:
  2638. λ
  2639. dog
  2640. dog dog dog
  2641. bone dog the
  2642. the dog eats the bone
  2643. the bone eats the dog
  2644. Many more strings are possible, but we can at least see that most of the
  2645. strings above are not valid for the English language: their structure does not
  2646. obey the rules of English grammar. Thus we may conclude that a language over
  2647. an alphabet Σ is a subset of Σ ∗ that follows certain grammar rules.
  2648. If you are wondering how all this relates to compiler construction, you should
  2649. realize that one of the things that a compiler does is check the structure of its
  2650. input by applying grammar rules. If the structure is off, the compiler prints a
  2651. syntax error.
  2652. 5.3 Syntax and Semantics
  2653. To illustrate the concept of grammar, let us examine the following line of text:
  2654. jumps the fox the dog over
  2655. Since it obviously does not obey the rules of English grammar, this sentence
  2656. is meaningless. It is said to be syntactically incorrect. The syntax of a sentence is
its form or structure. Every sentence in a language must obey that language’s
syntax for it to have meaning.
  2660. Here is another example of a sentence, whose meaning is unclear:
  2661. the fox drinks the color red
  2662. Though possibly considered a wondrous statement in Vogon poetry, this
  2663. statement has no meaning in the English language. We know that the color
  2664. red cannot be drunk, so that although the sentence is syntactically correct, it
  2665. conveys no useful information, and is therefore considered incorrect. Sentences
  2666. whose structure conforms to a language’s syntax but whose meaning cannot be
  2667. understood are said to be semantically incorrect.
  2668. The purpose of a grammar is to give a set of rules which all the sentences of
  2669. a certain language must follow. It should be noted that speakers of a natural
  2670. language (like English or French) will generally understand sentences that differ
  2671. from these rules, but this is not so for formal languages used by computers. All
  2672. sentences are required to adhere to the grammar rules without deviation for a
  2673. computer to be able to understand them.
  2674. In Compilers: Principles, Techniques and Tools ([1]), Aho defines a grammar
  2675. more formally:
  2676. A grammar is a formal device for specifying a potentially infinite
  2677. language (set of strings) in a finite way.
  2678. Because language semantics are hard to express in a set of rules (although
we will show a way to deal with semantics in part III), grammars deal with syntax
  2680. only: a grammar defines the structure of sentences in a language.
  2681. 5.4 Production Rules
  2682. In a grammar, we are not usually interested in the individual letters that make
  2683. up a word, but in the words themselves. We can give these words names so that
  2684. we can refer to them in a grammar. For example, there are very many words
  2685. that can be the subject or object of an English sentence (’fox’, ’dog’, ’chair’ and
  2686. so on) and it would not be feasible to list them all. Therefore we simply refer
  2687. to them as ’noun’. In the same way we can give all words that precede nouns to
  2688. add extra information (like ’brown’, ’lazy’ and ’small’) a name too: ’adjective’.
  2689. We call the set of all articles (’the’, ’a’, ’an’) ’article’. Finally, we call all verbs
’verb’. Each of these names represents a set of many words, with the exception
  2691. of ’article’, which has all its members already listed.
  2692. Armed with this new terminology, we are now in a position to describe the
  2693. form of a very simple English sentence:
  2694. sentence: article adjective noun verb adjective noun.
  2695. From this lone production rule, we can generate (produce) English sentences.
  2696. We can replace every set name to the right of the colon with one of its elements.
  2697. For example, we can replace article with ’the’, adjective with ’quick’, noun with
  2698. ’fox’ and so on. This way we can build sentences such as
  2700. the quick fox eats a delicious banana
  2701. the delicious banana thinks the quick fox
  2702. a quick banana outruns a delicious fox
  2703. The structure of these sentences matches the preceding rule, which means
  2704. that they conform to the syntax we specified. Incidentally, some of these sen-
  2705. tences have no real meaning, thus illustrating that semantic rules are not in-
  2706. cluded in the grammar rules we discuss here.
  2707. We have just defined a grammar, even though it contains only one rule that
  2708. allows only one type of sentence. Note that our grammar is a so-called abstract
  2709. grammar, since it does not specify the actual words that we may use to replace
  2710. the word classes (article, noun, verb) that we introduced.
  2711. So far we have given names to classes of individual words. We can also assign
  2712. names to common combinations of words. This requires multiple rules, making
  2713. the individual rules simpler:
  2714. sentence: object verb object.
  2715. object: article adjective noun.
  2716. This grammar generates the same sentences as the previous one, but is some-
  2717. what easier to read. Now we will also limit the choices that we can make when
  2718. replacing word classes by introducing some more rules:
  2719. noun: fox.
  2720. noun: banana.
  2721. verb: eats.
  2722. verb: thinks.
  2723. verb: outruns.
  2724. article : a.
  2725. article : the.
  2726. adjective: quick.
  2727. adjective : delicious.
  2728. Our grammar is now extended so that it is no longer an abstract grammar.
  2729. The rules above dictate how nonterminals (abstract grammar elements like ob-
  2730. ject or article ) may be replaced with concrete elements of the language’s alphabet.
The alphabet consists of all the terminal symbols or terminals in a language
(the actual words). In the production rules listed above, terminal symbols are
  2733. printed in bold.
  2734. Nonterminal symbols are sometimes called auxiliary symbols, because they
  2735. must be removed from any sentential form in order to create a concrete sentence
  2736. in the language.
  2737. Production rules are called production rules for a reason: they are used
  2738. to produce concrete sentences from the topmost nonterminal, or start symbol.
  2739. A concrete sentence may be derived from the start symbol by systematically
  2740. selecting nonterminals and replacing them with the right hand side of a suitable
  2741. production rule. In the listing below, we present the production rules for the
  2742. grammar we have constructed so far during this chapter. Consult the following
  2743. example, in which we use this grammar to derive a sentence.
  2745. sentence: object verb object.
  2746. object: article adjective noun.
  2747. noun: fox.
  2748. noun: banana.
  2749. verb: eats.
  2750. verb: thinks.
  2751. verb: outruns.
  2752. article : a.
  2753. article : the.
  2754. adjective: quick.
  2755. adjective : delicious.
Derivation                                       Rule Applied
’sentence’
=⇒ ’object’ verb object                          1
=⇒ ’article’ adjective noun verb object          2
=⇒ the ’adjective’ noun verb object              9
=⇒ the quick ’noun’ verb object                  10
=⇒ the quick fox ’verb’ object                   3
=⇒ the quick fox eats ’object’                   5
=⇒ the quick fox eats ’article’ adjective noun   2
=⇒ the quick fox eats a ’adjective’ noun         8
=⇒ the quick fox eats a delicious ’noun’         11
=⇒ the quick fox eats a delicious banana         4
The symbol =⇒ indicates the application of a production rule, with the
number of the rule applied shown in the right column. The set of all sentences
  2770. which can be derived by repeated application of the production rules (deriving)
  2771. is the language defined by these production rules.
  2772. The string of terminals and nonterminals in each step is called a sentential
  2773. form. The last string, which contains only terminals, is the actual sentence.
  2774. This means that the process of derivation ends once all nonterminals have been
  2775. replaced with terminals.
You may have noticed that in every step, we consistently replaced the
leftmost nonterminal in the sentential form with one of its productions. This
is why the derivation we have performed is called a leftmost derivation. It is
also correct to perform a rightmost derivation by consistently replacing the
rightmost nonterminal in each sentential form, or any derivation in between.
  2781. Our current grammar states that every noun is preceded by precisely one
  2782. adjective. We now want to modify our grammar so that it allows us to specify
  2783. zero, one or more adjectives before each noun. This can be done by introducing
  2784. recursion, where a production rule for a nonterminal may again contain that
  2785. nonterminal:
  2786. object: article adjectivelist noun.
  2787. adjectivelist : adjective adjectivelist .
adjectivelist : ε.
  2789. The rule for the nonterminal object has been altered to include adjectivelist
instead of simply adjective. An adjective list can either be empty (nothing,
indicated by ε), or an adjective, followed by another adjective list and so on.
  2793. The following sentences may now be derived:
  2794. sentence
  2795. =⇒ object verb object
  2796. =⇒ article adjectivelist noun verb object
  2797. =⇒ the adjectivelist noun verb object
=⇒ the noun verb object
=⇒ the banana verb object
=⇒ the banana outruns object
=⇒ the banana outruns article adjectivelist noun
=⇒ the banana outruns the adjectivelist noun
=⇒ the banana outruns the adjective adjectivelist noun
=⇒ the banana outruns the quick adjectivelist noun
=⇒ the banana outruns the quick adjective adjectivelist noun
=⇒ the banana outruns the quick delicious adjectivelist noun
=⇒ the banana outruns the quick delicious noun
=⇒ the banana outruns the quick delicious fox
  2809. 5.5 Context-free Grammars
  2810. After the introductory examples of sentence derivations, it is time to deal with
  2811. some formalisms. All the grammars we will work with in this book are context-
  2812. free grammars:
  2813. Definition 5.2 (Context-Free Grammar)
  2814. A context-free grammar is a quadruple (V,Σ,P,S) where V is a finite set of vari-
  2815. ables (nonterminals), Σ is a finite set of terminals, P is a finite set of production
  2816. rules and S ∈ V is an element of V designated as the start symbol.
□
  2818. The grammar listings you have seen so far were context-free grammars, con-
  2819. sisting of a single nonterminal on the left-hand side of each production rule
  2820. (the mark of a context-free grammar). In fact, a symbol is a nonterminal only
  2821. when it acts as the left-hand side of a production rule. The right side of every
production rule may either be empty (denoted using an epsilon, ε), or contain
any combination of terminals and nonterminals. This notation is called the
Backus-Naur form², after its inventors, John Backus and Peter Naur.
  2825. 2 John Backus and Peter Naur introduced for the first time a formal notation to describe
  2826. the syntax of a given language (this was for the description of the ALGOL 60 programming
  2827. language). To be precise, most of BNF was introduced by Backus in a report presented at an
  2828. earlier UNESCO conference on ALGOL 58. Few read the report, but when Peter Naur read it
  2829. he was surprised at some of the differences he found between his and Backus’s interpretation
of ALGOL 58. He decided that the syntax of the successor to ALGOL (in which all participants
of the first design had come to recognize some weaknesses) should be given in a similar form,
so that all participants would be aware of what they were agreeing to. He made a few
modifications that are almost universally used and drew up on his own the BNF for ALGOL 60
at the meeting where it was designed. Depending on how you attribute presenting it to the
world, it was either by Backus in 59 or Naur in 60. (For more details on this period of
programming languages history, see the introduction to Backus’s Turing award article in
Communications of the ACM, Vol. 21, No. 8, August 1978. This note was suggested by
William B. Clodius from Los Alamos Natl. Lab.)
  2840. expression: expression + expression.
  2841. expression: expression − expression.
  2842. expression: expression ∗ expression.
  2843. expression: expression / expression.
  2844. expression: expression ˆ expression.
  2845. expression: number.
expression: ( expression ).
  2851. number: 0.
  2852. number: 1.
  2853. number: 2.
  2854. number: 3.
  2855. number: 4.
  2856. number: 5.
  2857. number: 6.
  2858. number: 7.
  2859. number: 8.
  2860. number: 9.
  2861. Listing 5.1: Sample Expression Language
  2862. The process of deriving a valid sentence from the start symbol (in our pre-
  2863. vious examples, this was sentence ), is executed by repeatedly replacing a non-
  2864. terminal by the right-hand side of any one of the production rules of which it
  2865. acts as the left-hand side, until no nonterminals are left in the sentential form.
  2866. Nonterminals are always abstract names, while terminals are often expressed
using their actual (real-world) representations, often between quotes (e.g. "+",
"while", "true") or printed bold (like we do in this book).
  2869. The left-hand side of a production rule is separated from the right-hand side
by a colon, and every production rule is terminated by a period. This punctuation
does not affect the meaning of the production rules at all, but it is considered
good style and is part of the specification of the Backus-Naur form (BNF). Other
notations are in use as well.
  2874. As a running example, we will work with a simple language for mathematical
  2875. expressions, analogous to the language discussed in the introduction to this
  2876. book. The language is capable of expressing the following types of sentences:
  2877. 1 + 2 * 3 + 4
  2878. 2 ^ 3 ^ 2
  2879. 2 * (1 + 3)
  2880. In listing 5.1, we give a sample context-free grammar for this language.
  2881. Note the periods that terminate each production rule. You can see that
there are only two nonterminals, each of which has a number of alternative
production rules associated with it. We now state that expression will be
the distinguished nonterminal that will act as the start symbol, and we can use
these production rules to derive the sentence 1 + 2 * 3 (see table 5.1).
  2887. expression
  2888. =⇒ expression ∗ expression
=⇒ expression + expression ∗ expression
=⇒ number + expression ∗ expression
=⇒ 1 + expression ∗ expression
=⇒ 1 + number ∗ expression
=⇒ 1 + 2 ∗ expression
=⇒ 1 + 2 ∗ number
=⇒ 1 + 2 ∗ 3
  2896. Table 5.1: Derivation scheme for 1 + 2 * 3
  2897. The grammar in listing 5.1 has all its keywords (the operators and digits) de-
  2898. fined in it as terminals. One could ask how this grammar deals with whitespace,
  2899. which consists of spaces, tabs and (possibly) newlines. We would naturally like
to allow an arbitrary amount of whitespace to occur between two tokens (digits,
operators, or parentheses), but the term whitespace occurs nowhere in the
grammar. The answer is that whitespace is not usually included in a grammar,
although it could be. The lexical analyzer uses whitespace to see where a word
ends and where a new word begins, but otherwise discards it (unless the
whitespace occurs within comments or strings, in which case it is significant). In our
  2906. language, whitespace does not have any significance at all so we assume that it
  2907. is discarded.
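(In a flex-based scanner, discarding whitespace typically takes the form of a single rule such as [ \t\n]+ with an empty action: the spacing is matched and then silently dropped.)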
  2908. We would now like to extend definition 5.2 a little further, because we have
  2909. not clearly stated what a production rule is.
  2910. Definition 5.3 (Production Rule)
In the quadruple (V,Σ,S,P) that defines a context-free grammar, P ⊆ V ×
(V ∪ Σ)∗ is a finite set of production rules.
□
  2914. Here, (V ∪Σ) is the union of the set of nonterminals and the set of terminals,
  2915. yielding the set of all symbols. (V ∪ Σ) ∗ denotes the set of finite strings of
elements from (V ∪ Σ). In other words, P is a set of 2-tuples with a nonterminal
on the left-hand side, and on the right-hand side a string constructed from
items from V and Σ. It should now be clear that the following are examples of
  2919. production rules:
  2920. expression: expression ∗ expression.
  2921. number: 3.
  2922. We have already shown that production rules are used to derive valid sen-
  2923. tences from the start symbol (sentences that may occur in the language under
consideration). The formal method used to derive such sentences is as follows
(see also Languages and Machines by Thomas Sudkamp [8]):
  2926. Definition 5.4 (String Derivation)
  2928. Let G = (V,Σ,S,P) be a context-free grammar and v ∈ (V ∪ Σ) ∗ . The set of
  2929. strings derivable from v is recursively defined as follows:
  2930. 1. Basis: v is derivable from v.
  2931. 2. Recursive step: If u = xAy is derivable from v and A −→ w ∈ P, then
  2932. xwy is derivable from v.
  2933. 3. Closure: Precisely those strings constructed from v by finitely many ap-
  2934. plications of the recursive step are derivable from v.
□
  2936. This definition illustrates how we use lowercase latin letters to represent
  2937. strings of zero or more terminal symbols, and uppercase latin letters to represent
  2938. a nonterminal symbol. Furthermore, we use lowercase greek letters to denote
  2939. strings of terminal and nonterminal symbols.
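As a small worked instance of definition 5.4, take the grammar of listing 5.1 and let v = expression. By the basis, expression is derivable from expression; since expression −→ number ∈ P, number is derivable from expression; and since number −→ 3 ∈ P, the terminal string 3 is derivable from expression.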
A closed formula may be given for all the sentences derivable from a given
grammar, simultaneously introducing a new operator:
{s ∈ Σ∗ : S =⇒∗ s}    (5.1)
  2943. We have already discussed the operator =⇒, which denotes the derivation
  2944. of a sentential form from another sentential form by applying a production rule.
  2945. The =⇒ relation is defined as
{(σAτ, σατ) ∈ (V ∪ Σ)∗ × (V ∪ Σ)∗ : σ, τ ∈ (V ∪ Σ)∗ ∧ (A, α) ∈ P}
which means, in words: =⇒ is a collection of 2-tuples, and is therefore a
relation which binds the left element of each 2-tuple to the right element, thereby
  2949. defining the possible replacements (productions) which may be performed. In
  2950. the tuples, the capital latin letter A represents a nonterminal symbol which gets
  2951. replaced by a string of nonterminal and terminal symbols, denoted using the
  2952. greek lowercase letter α. σ and τ remain unchanged and serve to illustrate that
  2953. a replacement (production) is context-insensitive or context-free. Whatever the
  2954. actual value of the strings σ and τ, the replacement can take place. We will
  2955. encounter other types of grammars which include context-sensitive productions
  2956. later.
The =⇒∗ relation is the reflexive and transitive closure of the relation =⇒.
=⇒∗ is used to indicate that zero or more production rules have been applied in succession to
  2959. achieve the result stated on the right-hand side. The formula α =⇒ ∗ β denotes
  2960. an arbitrary derivation starting with α and ending with β. It is perfectly valid to
rewrite the derivation scheme we presented in table 5.1 using the new operator
(see table 5.2). We can use this approach to leave out derivations that are
  2963. obvious, analogous to the way one leaves out trivial steps in a mathematical
  2964. proof.
The concept of recursion is illustrated by the production rule expression:
"(" expression ")". When deriving a sentence, the nonterminal expression may be
  2967. replaced by itself (but between parentheses). This recursion may continue in-
  2968. definitely, termination being reached when another production rule for expression
  2969. is applied (in particular, expression : number).
  2971. expression
=⇒∗ expression + expression ∗ expression
=⇒∗ 1 + number ∗ expression
=⇒∗ 1 + 2 ∗ 3
  2975. Table 5.2: Compact derivation scheme for 1 + 2 * 3
  2976. Recursion is considered left recursion when the nonterminal on the left-hand
  2977. side of a production also occurs as the first symbol in the right-hand side of the
  2978. production. This applies to most of the production rules for expression , for
  2979. example:
  2980. expression: expression + expression.
  2981. While this property does not affect our ability to derive sentences from the
  2982. grammar, it does prohibit a machine from automatically parsing an input text
deterministically. This will be discussed shortly. Left recursion can be obvious
(as it is in this example), but it can also be buried deeply in a grammar. In some
cases, it takes a keen eye to spot and remove left recursion. Consider the follow-
  2986. ing example of indirect recursion (in this example, we use capital latin letters
  2987. to indicate nonterminal symbols and lowercase latin letters to indicate strings
  2988. of terminal symbols, as is customary in the compiler construction literature):
  2989. Example 5.1 (Indirect Recursion)
  2990. A: Bx
  2991. B: Cy
  2992. C: Az
  2993. C: x
  2994. A may be replaced by Bx , thus removing an instance of A from a sentential
  2995. form, and B may be replaced by Cy . C may be replaced by Az , which reintroduces
  2996. A in the sentential form: indirect recursion.
  2997. This example was taken from [2].
□
  2999. 5.6 The Chomsky Hierarchy
In an unrestricted rewriting system[1] (grammar), the collection of production
rules is:
P ⊆ (V ∪ Σ)∗ × (V ∪ Σ)∗
This means that the most lenient form of grammar allows multiple symbols,
both terminals and nonterminals, on the left-hand side of a production rule.
  3005. Such a production rule is often denoted
  3007. (α,ω)
  3008. since greek lowercase letters stand for a finite string of terminal and non-
  3009. terminal symbols, i.e. (V ∪ Σ) ∗ . The unrestricted grammar generates a type 0
  3010. language according to the Chomsky hierarchy. Noam Chomsky has defined four
levels of grammars, with successively more severe restrictions on the form of
production rules, which result in interesting classes of grammars.
  3013. A type 1 grammar or context-sensitive grammar is one in which each pro-
  3014. duction α −→ β is such that | β | ≥ | α |. Alternatively, a context-sensitive
  3015. grammar is sometimes defined as having productions of the form
  3016. γAρ −→ γωρ
where ω cannot be the empty string (ε). This is, of course, the same defini-
  3018. tion. A type 1 grammar generates a type 1 language.
  3019. A type 2 grammar or context-free grammar is one in which each production
  3020. is of the form
  3021. A −→ ω
where ω can be the empty string (ε). A context-free grammar generates a
  3023. type 2 language.
  3024. A type 3 grammar or regular grammar is either right linear, with each pro-
  3025. duction of the form
  3026. A −→ a or A −→ aB
  3027. or left-linear, in which each production is of the form:
  3028. A −→ a or A −→ Ba
  3029. A regular grammar generates a type 3 language. Regular grammars are very
  3030. easy to parse (analyze the structure of a text written using such a grammar)
  3031. but are not very powerful at the same time. They are often used to write
  3032. lexical analyzers and were discussed in some detail in the previous chapter.
  3033. Grammars for most actual programming languages are context free, since this
  3034. type of grammar turns out to be easy to parse and yet powerful. The higher
  3035. classes (0 and 1) are not often used.
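As a small example of the type 3 form: the right-linear grammar with productions A −→ aA and A −→ b generates exactly the sentences described by the regular expression a∗b, i.e. any number of a’s followed by a single b.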
  3036. As it happens, the class of context-free languages (type 2) is not very large. It
  3037. turns out that there are almost no interesting languages that are context-free.
  3038. But this problem is easily solved by first defining a superset of the language
  3039. that is being designed, in order to formalize the context-free aspects of this
  3040. language. After that, the remaining restrictions are defined using other means
  3041. (i.e. semantic analysis).
  3042. As an example of a context-sensitive aspect (from Meijer [6]), consider the
  3043. fact that in many programming languages, variables must be declared before
  3044. they may be used. More formally, in sentences of the form αXβXγ, in which
  3045. the number of possible productions for X and the length of the production for β
  3046. are not limited, both instances of X must always have the same production. Of
course, this cannot be expressed in a context-free manner.³ This is an immediate
  3049. consequence of the fact that the productions are context-free: every nonterminal
  3050. may be replaced by one of its right-hand sides regardless of its context. Context-
  3051. free grammars can therefore be spotted by the property that the left-hand side
  3052. of their production rules consist of precisely one nonterminal.
  3053. 5.7 Additional Notation
  3054. In the previous section, we have shown how a grammar can be written for
  3055. a simple language using the Backus-Naur form (BNF). Because BNF can be
  3056. unwieldy for languages which contain many alternative production rules for each
  3057. nonterminal or involve many recursive rules (rules that refer to themselves), we
  3058. also have the option to use the extended Backus-Naur form (EBNF). EBNF
  3059. introduces some meta-operators (which are only significant in EBNF and have
  3060. no function in the language being defined) which make the life of the grammar
  3061. writer a little easier. The operators are:
  3062. Operator Function
  3063. ( and ) Group symbols together so that other meta-
  3064. operators may be applied to them as a group.
  3065. [ and ] Symbols (or groups of symbols) contained within
  3066. square brackets are optional.
  3067. { and } Symbols (or groups of symbols) between braces
  3068. may be repeated zero or more times.
  3069. | Indicates a choice between two symbols (usually
  3070. grouped with parentheses).
  3071. Our sample grammar can now easily be rephrased using EBNF (see listing
5.2). Note how we are now able to combine multiple production rules for the
  3073. same nonterminal into one production rule, but be aware that the alternatives
  3074. specified between pipes ( | ) still constitute multiple production rules. EBNF is
  3075. the syntax description language that is most often used in the compiler con-
  3076. struction literature.
  3077. Yet another, very intuitive way of describing syntax that we have already
  3078. used extensively in the Inger language specification in chapter 3, is the syntax
  3079. diagram. The production rules from listing 5.2 have been converted into two
  3080. syntax diagrams in figure 5.1.
  3081. Syntax diagrams consist of terminals (in boxes with rounded corners) and
  3082. nonterminals (in boxes with sharp corners) connected by lines. In order to pro-
  3083. duce valid sentences, the user begins with the syntax diagram designated as the
  3084. top-level diagram. In our case, this is the syntax diagram for expression , since ex-
  3085. pression is the start symbol in our grammar. The user then traces the line leading
  3086. into the diagram, evaluating the boxes he encounters on the way. While tracing
  3087. lines, the user may follow only rounded corners, never sharp ones, and may not
  3088. reverse direction. When a box with a terminal is encountered, that terminal is
  3089. placed in the sentence that is written. When a box containing a nonterminal
³ That is, unless there were a (low) limit on the number of possible productions for X
and/or the length of β were fixed and small. In that case, the total number of possibilities is
limited and one could write a separate production rule for each possibility, thereby regaining
the freedom of context.
  3095. expression: expression + expression
  3096. | expression − expression
  3097. | expression ∗ expression
  3098. | expression / expression
  3099. | expression ˆ expression
  3100. | number
| ( expression ).
  3106. number: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9.
  3107. Listing 5.2: Sample Expression Language in EBNF
  3108. Figure 5.1: Syntax Diagrams for Mathematical Expressions
  3110. is encountered, the user switches to the syntax diagram for that nonterminal.
  3111. In our case, there is only one nonterminal besides expression ( number ) and thus
  3112. there are only two syntax diagrams. In a grammar for a more complete lan-
  3113. guage, there may be many more syntax diagrams (consult appendix E for the
  3114. syntax diagrams of the Inger language).
  3115. Example 5.2 (Tracing a Syntax Diagram)
  3116. Let’s trace the syntax diagram in figure 5.1 to generate the sentence
  3117. 1 + 2 * 3 - 4
  3118. We start with the expression diagram, since expression is the start symbol.
  3119. Entering the diagram, we face a selection: we can either move to a box con-
taining expression , move to a box containing the terminal ( , or move to a box
  3121. containing number . Since there are no parentheses in the sentence that we want
  3122. to generate, the second alternative is eliminated. Also, if we were to move to
  3123. number now, the sentence generation would end after we generate only one digit,
  3124. because after the number box, the line we are tracing ends. Therefore we are
  3125. left with only one alternative: move to the expression box.
  3126. The expression box is a nonterminal box, so we must restart tracing the
  3127. expression syntax diagram. This time, we move to the number box. This is
  3128. also a nonterminal box, so we must pause our current trace and start tracing
the number syntax diagram. The number diagram is simple: it only offers us
  3130. one choice (pick a digit). We trace through 1 and leave the number diagram,
  3131. picking up where we left off in the expression diagram. After the number box,
  3132. the expression diagram also ends so we continue our first trace of the expression
  3133. diagram, which was paused after we entered an expression box. We must now
  3134. choose an operator. We need a + , so we trace through the corresponding box.
  3135. Following the line from + brings us to a second expression box. We must once
  3136. again pause our progress and reenter the expression diagram. In the following
iterations, we pick 2 , * , 3 , - and 4 . Completing the trace is left as an exercise
  3138. to the reader.
  3139. ?
  3140. Fast readers may have observed that converting (E)BNF production rules to
  3141. syntax diagrams does not yield very efficient syntax diagrams. For instance, the
  3142. syntax diagrams in figure 5.2 for our sample expression grammar are simpler
  3143. than the original ones, because we were able to remove most of the recursion in
  3144. the expression diagram.
  3145. At a later stage, we will have more to say about syntax diagrams. For now,
  3146. we will direct our attention back to the sentence generation process.
  3147. 5.8 Syntax Trees
The previous sections included some examples of sentence generation from a
given grammar, in which the generation process was visualized using a derivation
scheme (such as table 5.5 on page 86). Much more insight is gained from drawing
  3151. a so-called parse tree or syntax tree for the derivation.
  3152. We return to our sample expression grammar (listing 5.3, printed here again
  3153. for easy reference) and generate the sentence
  3155. Figure 5.2: Improved Syntax Diagrams for Mathematical Expressions
  3156. expression: expression + expression
  3157. | expression − expression
  3158. | expression ∗ expression
  3159. | expression / expression
  3160. | expression ˆ expression
  3161. | number
| ( expression ).
  3167. number: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9.
  3168. Listing 5.3: Sample Expression Language in EBNF
  3169. 1 + 2 * 3
  3170. We will derive this sentence using leftmost derivation as shown in the deriva-
tion scheme in table 5.3.
  3172. The resulting parse tree is in figure 5.3. Every nonterminal encountered
  3173. in the derivation has become a node in the tree, and the terminals (the digits
  3174. and operators themselves) are the leaf nodes. We can now easily imagine how a
  3175. machine would calculate the value of the expression 1 + 2 * 3 : every nonterminal
  3176. node retrieves the value of its children and performs an operation on them
  3177. (addition, subtraction, division, multiplication), and stores the result inside
  3178. itself. This process occurs recursively, so that eventually the topmost node of
  3179. the tree, known as the root node , contains the final value of the expression.
  3180. Not all nonterminal nodes perform an operation on the values of their children;
  3181. the number node does not change the value of its child, but merely serves as
  3182. a placeholder. When a parent node queries the number node for its value, it
  3183. merely passes the value of its child up to its parent. The following recursive
  3184. definition states this approach more formally:
  3186. expression
=⇒ expression ∗ expression
=⇒ expression + expression ∗ expression
=⇒ number + expression ∗ expression
=⇒ 1 + expression ∗ expression
=⇒ 1 + number ∗ expression
=⇒ 1 + 2 ∗ expression
=⇒ 1 + 2 ∗ number
=⇒ 1 + 2 ∗ 3
  3195. Table 5.3: Leftmost derivation scheme for 1 + 2 * 3
  3196. Definition 5.5 (Tree Evaluation)
  3197. The following algorithm may be used to evaluate the final value of an expression
  3198. stored in a tree.
  3199. Let n be the root node of the tree.
• If n is a leaf node (i.e. if n has no children), the final value of n is its current
value.
• If n is not a leaf node, the value of n is determined by retrieving the values
  3203. of its children, from left to right. If one of the children is an operator, it
  3204. is applied to the other children and the result is the final result of n.
  3205. ?
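To make this evaluation algorithm concrete, here is a minimal sketch in C. The node layout (operators stored in internal nodes, digit values in leaves) is an assumption made for this sketch only; it is not code from the Inger compiler.

#include <stdio.h>

/* A hypothetical parse tree node: internal nodes carry an operator,
   leaf nodes carry a digit value. */
typedef struct node
{
    char op;                    /* '+', '-', '*' or '/'; 0 for a leaf */
    int value;                  /* digit value, used only in leaves   */
    struct node *left, *right;
} node;

/* Recursively evaluate a (sub)tree: a leaf yields its current value;
   an internal node applies its operator to the values of its children. */
int evaluate( node *n )
{
    if( n->op == 0 )
        return( n->value );
    int l = evaluate( n->left );
    int r = evaluate( n->right );
    switch( n->op )
    {
        case '+': return( l + r );
        case '-': return( l - r );
        case '*': return( l * r );
        default : return( l / r );
    }
}

int main( void )
{
    /* The tree for 1 + 2 * 3 in which * binds more tightly than +. */
    node three = { 0, 3, NULL, NULL };
    node two   = { 0, 2, NULL, NULL };
    node one   = { 0, 1, NULL, NULL };
    node mul   = { '*', 0, &two, &three };
    node plus  = { '+', 0, &one, &mul };
    printf( "%d\n", evaluate( &plus ) );   /* prints 7 */
    return( 0 );
}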
The tree we have just created is not unique. In fact, there are multiple valid
trees for the expression 1 + 2 * 3 . In figure 5.4, we show the parse tree for the
rightmost derivation of our sample expression. This tree differs slightly (but
significantly) from our original tree. Apparently our grammar is ambiguous: it
  3210. can generate multiple trees for the same expression.
  3211. Figure 5.3: Parse Tree for Leftmost Derivation of 1 + 2 * 3
  3212. The existence of multiple trees is not altogether a blessing, since it turns out
  3213. that different trees produce different expression results.
  3214. Example 5.3 (Tree Evaluation)
  3216. expression
=⇒ expression + expression
=⇒ expression + expression ∗ expression
=⇒ expression + expression ∗ number
=⇒ expression + expression ∗ 3
=⇒ expression + number ∗ 3
=⇒ expression + 2 ∗ 3
=⇒ number + 2 ∗ 3
=⇒ 1 + 2 ∗ 3
  3225. Table 5.4: Rightmost derivation scheme for 1 + 2 * 3
  3226. Figure 5.4: Parse Tree for Rightmost Derivation of 1 + 2 * 3
  3227. In this example, we will calculate the value of the expression 1 + 2 * 3 using
  3228. the parse tree in figure 5.4. We start with the root node, and query the values
of its three child nodes. The value of the left child node is 1 , since it
  3230. has only one child ( number ) and its value is 1 . The value of the right expression
  3231. node is determined recursively by retrieving the values of its two expression child
  3232. nodes. These nodes evaluate to 2 and 3 respectively, and we apply the middle
  3233. child node, which is the multiplication ( * ) operator. This yields the value 6
  3234. which we store in the expression node.
  3235. The values of the left and right child nodes of the root expression node are now
  3236. known and we can calculate the final expression value. We do so by retrieving
  3237. the value of the root node’s middle child node (the + operator) and applying it
  3238. to the values of the left and right child nodes ( 1 and 6 respectively). The result,
7 , is stored in the root node. Incidentally, it is also the correct answer.
  3240. At the end of the evaluation, the expression result is known and resides
  3241. inside the root node.
  3242. ?
  3243. In this example, we have seen that the value 7 is found by evaluating the
  3244. tree corresponding to the rightmost derivation of the expression 1 + 2 * 3 . This
  3245. is illustrated by the annotated parse tree, which is shown in figure 5.5.
  3246. We can now apply the same technique to calculate the final value of the
  3247. parse tree corresponding to the leftmost derivation of the expression 1 + 2 * 3 ,
  3248. shown in figure 5.6. We find that the answer ( 9 ) is incorrect, which is caused
  3249. by the order in which the nodes are evaluated.
  3251. Figure 5.5: Annotated Parse Tree for Rightmost Derivation of 1 + 2 * 3
  3252. Figure 5.6: Annotated Parse Tree for Leftmost Derivation of 1 + 2 * 3
  3253. The nodes in a parse tree must reflect the precedence of the operators used in
  3254. the expression in the parse tree. In case of the tree for the rightmost derivation
  3255. of 1 + 2 * 3 , the precedence was correct: the value of 2 * 3 was evaluated before
  3256. the 1 was added to the result. In the parse tree for the leftmost derivation, the
  3257. value of 1 + 2 was calculated before the result was multiplied by 3 , yielding an
  3258. incorrect result. Should we, then, always use rightmost derivations? The answer
is no: it is mere coincidence that the rightmost derivation happens to yield the
correct result – it is the grammar that is flawed. With a correct grammar, any
derivation order will yield the same result and only one parse tree corresponds
  3262. to a given expression.
  3263. 5.9 Precedence
The problem of ambiguity in the grammar of the previous section is solved in
large part by introducing new nonterminals, which will serve as placeholders
to introduce operator precedence levels. We know that multiplication ( * ) and
division ( / ) bind more strongly than addition ( + ) and subtraction ( - ), but we
need a means to visualize this concept in the parse tree. The solution lies in
adding the nonterminal term (see the new grammar in listing 5.4), which will
deal with multiplications and divisions. The original expression nonterminal is
now only used for additions and subtractions. The result is that whenever
a multiplication or division is encountered, the parse tree will contain a term
node in which all multiplications and divisions are resolved until an addition or
  3275. expression: term + expression
  3276. | term − expression
  3277. | term.
  3278. term: factor ∗ term
  3279. | factor / term
  3280. | factor ˆ term
  3281. | factor.
factor: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
| ( expression ).
  3288. Listing 5.4: Unambiguous Expression Language in EBNF
  3289. expression
=⇒ term + expression
=⇒ factor + expression
=⇒ 1 + expression
=⇒ 1 + term
=⇒ 1 + factor ∗ term
=⇒ 1 + 2 ∗ term
=⇒ 1 + 2 ∗ factor
=⇒ 1 + 2 ∗ 3
  3298. Table 5.5: Leftmost derivation scheme for 1 + 2 * 3
  3299. subtraction arrives.
  3300. We also introduce the nonterminal factor to replace number , and to deal with
  3301. parentheses, which have the highest precedence. It should now become obvious
  3302. that the lower you get in the grammar, the higher the priority of the operators
dealt with. Tables 5.5 and 5.6 show the leftmost and rightmost derivation of 1
+ 2 * 3 . Careful study shows that they lead to the same structure. In fact, the corresponding
parse trees are exactly identical (shown in figure 5.7). The parse tree is already
annotated for convenience and yields the correct result for the expression it holds.
  3307. It should be noted that in some cases, an instance of, for example, term
  3308. actually adds an operator ( * or / ) and sometimes it is merely included as a
  3309. placeholder that holds an instance of factor . Such nodes have no function in
  3310. a syntax tree and can be safely left out (which we will do when we generate
abstract syntax trees).
  3312. There is an amazing (and amusing) trick that was used in the first FOR-
  3313. TRAN compilers to solve the problem of operator precedence. An excerpt from
  3314. a paper by Donald Knuth (1962):
  3315. An ingenious idea used in the first FORTRAN compiler was to sur-
  3316. round binary operators with peculiar-looking parentheses:
  3317. + and − were replaced by ))) + ((( and ))) − (((
  3318. ∗ and / were replaced by )) ∗ (( and ))/((
  3319. ∗∗ was replaced by ) ∗ ∗(
  3321. expression
=⇒ term + expression
=⇒ term + term
=⇒ term + factor ∗ term
=⇒ term + factor ∗ factor
=⇒ term + factor ∗ 3
=⇒ term + 2 ∗ 3
=⇒ factor + 2 ∗ 3
=⇒ 1 + 2 ∗ 3
  3331. Table 5.6: Rightmost derivation scheme for 1 + 2 * 3
  3332. Figure 5.7: Annotated Parse Tree for Arbitrary Derivation of 1 + 2 * 3
  3334. and then an extra “(((” at the left and “)))” at the right were tacked
  3335. on. For example, if we consider “(X + Y ) + W/Z,” we obtain
((((X))) + (((Y)))) + (((W))/((Z)))
  3337. This is admittedly highly redundant, but extra parentheses need not
  3338. affect the resulting machine language code.
  3339. Another approach to solve the precedence problem was invented by the Pol-
ish scientist J. Lukasiewicz in the late 1920s. Today frequently called prefix no-
tation, the parenthesis-free or Polish notation was a perfect notation for the
  3342. output of a compiler, and thus a step towards the actual mechanization and
  3343. formulation of the compilation process.
  3344. Example 5.4 (Prefix notation)
  3345. 1 + 2 * 3 becomes + 1 * 2 3
  3346. 1 / 2 - 3 becomes - / 1 2 3
  3347. ?
  3348. 5.10 Associativity
  3349. When we write down the syntax tree for the expression 2 - 1 - 1 according
  3350. to our example grammar, we discover that our grammar is still not correct
  3351. (see figure 5.8). The parse tree yields the result 2 while the correct result
is 0 , even though we have taken care of operator precedence. It turns out that
apart from precedence, operator associativity is also important. The subtraction
operator - associates to the left, so that in a (sub)expression which consists
only of operators of equal precedence, the order in which the operators must be
evaluated is still fixed. In the case of subtraction, the order is from left to right.
In the case of ˆ (power), the order is from right to left. After all,
2 ˆ (3 ˆ 2) = 2 ˆ 9 = 512 ≠ 64 = (2 ˆ 3) ˆ 2.
  3361. Figure 5.8: Annotated Parse Tree for 2 - 1 - 1
  3363. expression: expression + term
  3364. | expression − term
  3365. | term.
  3366. term: factor ∗ term
  3367. | factor / term
  3368. | factor ˆ term
  3369. | factor.
factor: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
| ( expression ).
  3376. Listing 5.5: Expression Grammar Modified for Associativity
It turns out that our grammar works only for right-associative operators (or
for operators like addition and multiplication, whose grouping does not affect
the result, so that they may be treated like right-associative operators), because
its production rules are right-recursive. Consider the following excerpt:
  3381. expression: term + expression
  3382. | term − expression
  3383. | term.
  3384. The nonterminal expression acts as the left-hand side of these three production
  3385. rules, and in two of them also occurs on the far right. This causes right recursion
  3386. which can be spotted in the parse tree in figure 5.8: the right child node of every
  3387. expression node is again an expression node. Left recursion can be recognized the
  3388. same way. The solution, then, to the associativity problem is to introduce
  3389. left-recursion in the grammar. The grammar in listing 5.5 can deal with left-
  3390. associativity and right-associativity, because expression is left-recursive, causing
  3391. + and - to be treated as left-associative operators, and term is right-recursive,
  3392. causing * , / and ˆ to be treated as right-associative operators.
And presto: the expressions 2 - 1 - 1 and 2 ˆ 3 ˆ 2 now have correct parse
  3394. trees (figures 5.9 and 5.10). We will see in the next chapter that we are not
  3395. quite out of the woods yet, but never fear, the worst is behind us.
  3396. 5.11 A Logic Language
  3397. As a final and bigger example, we present a complete little language that handles
  3398. propositional logic notation (proposition, implication, conjunction, disjunction
  3399. and negation). This language has operator precedence and associativity, but
very few terminals (an interpreter can therefore be completely implemented
as an exercise; we will do so in the next chapter). Consult the following sample
  3402. program:
  3403. Example 5.5 (Proposition Logic Program)
  3405. Figure 5.9: Correct Annotated Parse Tree for 2 - 1 - 1
  3406. Figure 5.10: Correct Annotated Parse Tree for 2 ˆ 3 ˆ 2
  3408. A = 1
  3409. B = 0
  3410. C = (˜A) | B
  3411. RESULT = C −> A
  3412. ?
  3413. The language allows the free declaration of variables, for which capital letters
  3414. are used (giving a range of 26 variables maximum). In the example, the variable
  3415. A is declared and set to true (1), and B is set to false (0). The variable C is
declared and set to (˜A) | B, which is false (0). Incidentally, the parentheses
are not required because ˜ has higher priority than |. Finally, the program is
terminated with an instruction that prints the value of C −> A, which is true
  3421. (1). Termination of a program with such an instruction is required.
  3422. Since our language is a proposition logic language, we must define truth
  3423. tables for each operator (see table 5.7). You may already be familiar with all
  3424. the operators. Pay special attention to the operator precedence relation:
  3425. Operator Priority Operation
  3426. ~ 1 Negation (not)
  3427. & 2 Conjunction (and)
  3428. | 2 Disjunction (or)
  3429. -> 3 Right Implication
  3430. <- 3 Left Implication
  3431. <-> 3 Double Implication
  3432. A B A & B
  3433. F F F
  3434. F T F
  3435. T F F
  3436. T T T
  3437. A B A | B
  3438. F F F
  3439. F T T
  3440. T F T
  3441. T T T
  3442. A ~A
  3443. F T
  3444. T F
  3445. A B A -> B
  3446. F F T
  3447. F T T
  3448. T F F
  3449. T T T
  3450. A B A <- B
  3451. F F T
  3452. F T F
  3453. T F T
  3454. T T T
  3455. A B A <-> B
  3456. F F T
  3457. F T F
  3458. T F F
  3459. T T T
  3460. Table 5.7: Proposition Logic Operations and Their Truth Tables
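Because these truth tables map directly onto C's logical operators, they are easy to encode for experimentation. The following sketch is our own illustration (not part of the Inger project); it evaluates the sample program from example 5.5:

#include <stdio.h>

/* The operators of table 5.7 as C functions over 0 (false) and 1 (true). */
static int lnot( int a )        { return( !a ); }
static int land( int a, int b ) { return( a && b ); }
static int lor( int a, int b )  { return( a || b ); }
static int limp( int a, int b ) { return( !a || b ); }  /* A -> B */
static int liff( int a, int b ) { return( a == b ); }   /* A <-> B; A <- B is limp( b, a ) */

int main( void )
{
    int A = 1, B = 0;
    int C = lor( lnot( A ), B );       /* C = (~A) | B = 0         */
    printf( "%d\n", limp( C, A ) );    /* RESULT = C -> A prints 1 */
    return( 0 );
}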
  3461. Now that we are familiar with the language and with the operator precedence
  3462. relation, we can write a grammar in BNF. Incidentally, all operators are non-
  3463. associative, and we will treat them as if they associated to the right (which is
  3464. easiest for parsing by a machine, in the next chapter). The BNF grammar is
  3465. in listing 5.6. For good measure, we have also written the grammar in EBNF
  3466. (listing 5.7).
  3467. You may be wondering why we have built our BNF grammar using complex
constructions with empty production rules (ε) while our running example, the
  3470. program: statementlist RESULT = implication.
statementlist : ε.
  3472. statementlist : statement statementlist.
  3473. statement: identifier = implication ;.
  3474. implication: conjunction restimplication.
restimplication : ε.
  3476. restimplication : −> conjunction restimplication.
  3477. restimplication : <− conjunction restimplication.
  3478. restimplication : <−> conjunction restimplication.
  3479. conjunction: negation restconjunction.
restconjunction: ε.
  3481. restconjunction: & negation restconjunction.
  3482. restconjunction: | negation restconjunction.
  3483. negation: ˜ negation.
  3484. negation: factor.
factor : ( implication ).
  3490. factor : identifier .
  3491. factor : 1.
  3492. factor : 0.
  3493. identifier : A.
  3494. ...
  3495. identifier : Z.
  3496. Listing 5.6: BNF for Logic Language
program: { statement ; } RESULT = implication.
statement: identifier = implication.
implication: conjunction [ ( −> | <− | <−> ) implication ].
conjunction: negation [ ( & | | ) conjunction ].
negation: { ˜ } factor.
factor: ( implication )
| identifier
| 1
| 0.
identifier : A | ... | Z.
Listing 5.7: EBNF for Logic Language
mathematical expression grammar, was so much easier. The reason is that in
our expression grammar, multiple individual production rules with the same
nonterminal on the left-hand side (e.g. expression ) also start with that nonterminal.
It turns out that this property of a grammar makes it difficult to handle in
an automatic parser (which we will build in the next chapter). This is why we
must go out of our way to create a more complex grammar.
  3538. 5.12 Common Pitfalls
  3539. We conclude our chapter on grammar with some practical advice. Grammars
  3540. are not the solution to everything. They can describe the basic structure of a
  3541. language, but fail to capture the details. You can easily spend much time trying
  3542. to formulate grammars that contain the intricate details of some shadowy corner
  3543. of your language, only to find out that it would have been far easier to handle
  3544. those details in the semantic analysis phase. Often, you will find that some
  3545. things just cannot be done with a context free grammar.
  3546. Also, if you try to capture a high level of detail in a grammar, your grammar
will rapidly grow and become unreadable. Extended Backus-Naur form
  3548. may cut you some notational slack, but in the end you will be moving towards
  3549. attribute grammars or affix grammars (discussed in Meijer, [6]).
  3550. Visualizing grammars using syntax diagrams can be a big help, because
  3551. Backus-Naur form can lure you into recursion without termination. Try to
  3552. formulate your entire grammar in syntax diagrams before moving to BNF (even
  3553. though you will have to invest more time). Refer to the syntax diagrams for the
  3554. Inger language in appendix E for an extensive example, especially compared to
  3555. the BNF notation in appendix D.
  3557. Bibliography
[1] A.V. Aho, R. Sethi, J.D. Ullman: Compilers: Principles, Techniques and Tools, Addison-Wesley, 1986.
[2] F.H.J. Feldbrugge: Dictaat Vertalerbouw, Hogeschool van Arnhem en Nijmegen, edition 1.0, 2002.
[3] J.D. Fokker, H. Zantema, S.D. Swierstra: Programmeren en correctheid, Academic Service, Schoonhoven, 1991.
[4] A.C. Hartmann: A Concurrent Pascal Compiler for Minicomputers, Lecture notes in computer science, Springer-Verlag, Berlin, 1977.
[5] J. Levine: Lex and Yacc, O'Reilly & Sons, 2000.
[6] H. Meijer: Inleiding Vertalerbouw, University of Nijmegen, Subfaculty of Computer Science, 2002.
[7] M.J. Scott: Programming Language Pragmatics, Morgan Kaufmann Publishers, 2000.
[8] T.H. Sudkamp: Languages & Machines, Addison-Wesley, 2nd edition, 1998.
[9] N. Wirth: Compilerbouw, Academic Service, 1987.
[10] N. Wirth and K. Jensen: PASCAL User Manual and Report, Lecture notes in computer science, Springer-Verlag, Berlin, 1975.
  3577. Chapter 6
  3578. Parsing
  3579. 6.1 Introduction
  3580. In the previous chapter, we have devised grammars for formal languages. In
  3581. order to generate valid sentences in these languages, we have written derivation
schemes and drawn syntax trees. However, a compiler does not work by generat-
ing sentences in some language, but by recognizing (parsing) them and then
  3584. translating them to another language (usually assembly language or machine
  3585. language).
  3586. In this chapter, we discuss how one writes a program that does exactly that:
  3587. parse sentences according to a grammar. Such a program is called a parser.
  3588. Some parsers build up a syntax tree for sentences as they recognize them. These
  3589. syntax trees are identical to the ones presented in the previous chapter, but they
  3590. are generated inversely: from the concrete sentence instead of from a derivation
  3591. scheme. In short, a parser is a program that will read an input text and tell
  3592. you if it obeys the rules of a grammar (and if not, why – if the parser is worth
  3593. anything). Another way of saying it would be that a parser determines if a
  3594. sentence can be generated from a grammar. The latter description states more
  3595. precisely what a parser does.
  3596. Only the more elaborate compilers build up a syntax tree in memory, but
  3597. we will do so explicitly because it is very enlightening. We will also discuss
  3598. a technique to simplify the syntax tree, thus creating an abstract syntax tree,
  3599. which is more compact than the original tree. The abstract syntax tree is very
  3600. important: it is the basis for the remaining compilation phases of semantic
  3601. analysis and code generation.
  3603. Parsing techniques come in many flavors and we do not presume to be able to
  3604. discuss them all in detail here. We will only fully cover LL(1) parsing (recursive
  3605. descent parsing), and touch on LR(k) parsing. No other methods are discussed.
  3606. 6.2 Prefix code
The grammar chapter briefly touched on the subject of prefix notation, or Polish
notation as it is sometimes known. Prefix notation was invented by the Polish
scientist J. Lukasiewicz in the late 1920s. This parenthesis-free notation was a perfect
  3610. notation for the output of a compiler:
  3611. Example 6.1 (Prefix notation)
  3612. 1 + 2 * 3 becomes + 1 * 2 3
  3613. 1 / 2 - 3 becomes - / 1 2 3
  3614. ?
  3615. In the previous chapter, we found that operator precedence and associativity
  3616. can and should be handled in a grammar. The later phases (semantic analysis
  3617. and code generation) should not be bothered with these operator properties
  3618. anymore–the parser should convert the input text to an intermediate format
  3619. that implies the operator priority and associativity. An unambiguous syntax
  3620. tree is one such structure, and prefix notation is another.
Prefix notation may not seem very powerful, but consider the fact that it
can easily be used to denote complex constructions like if...then and while...do
with which you are no doubt familiar (if not, consult chapter 3):
if (var > 0) { a + b } else { b − a } becomes ?>var'0'+a'b'-b'a'
while (n > 0) { n = n − 1 } becomes W>n'0'=n'-'n'1'
Apostrophes ( ' ) are often used as monadic operators that delimit a variable
name, so that two variables are not actually read as one. As you can deduce from
  3628. this example, prefix notation is actually a flattened tree. As long as the number
  3629. of operands that each operator takes is known, the tree can easily be traversed
  3630. using a recursive function. In fact, a very simple compiler can be constructed
  3631. that translates the mathematical expression language that we devised into prefix
  3632. code. A second program, the evaluator, could then interpret the prefix code and
  3633. calculate the result.
  3634. Figure 6.1: Syntax Tree for If-Prefixcode
  3635. To illustrate these facts as clearly as possible, we have placed the prefix
  3636. expressions for the if and while examples in syntax trees (figures 6.1 and 6.2
  3637. respectively).
  3639. Figure 6.2: Syntax Tree for While-Prefixcode
  3640. Notice that the original expressions may be regained by walking the tree in
  3641. a pre-order fashion. Conversely, try walking the tree in-order or post-order, and
  3642. examine the result as an interesting exercise.
  3643. The benefits of prefix notation do not end there: it is also an excellent means
  3644. to eliminate unnecessary syntactic sugar like whitespace and comments, without
  3645. loss of meaning.
  3646. The evaluator program is a recursive affair: it starts reading the prefix string
  3647. from left to right, and for every operator it encounters, it calls itself to retrieve
  3648. the operands. The recursion terminates when a constant (a variable name or
  3649. a literal value) is found. Compare this to the method we discussed in the
  3650. introduction to this book. We said that we needed a stack to place (shift) values
  3651. and operators on that could not yet be evaluated (reduce). The evaluator works
  3652. by this principle, and uses the recursive function as a stack.
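The following minimal C program sketches this principle for prefix expressions over single digits; it is an illustration of the idea only, not the actual translator-evaluator pair discussed here. The C call stack takes the place of an explicit shift stack.

#include <ctype.h>
#include <stdio.h>

static const char *input;   /* cursor into the prefix string */

/* Read one prefix (sub)expression and return its value. For every
   operator, eval() calls itself to retrieve the two operands; the
   recursion terminates when a constant (here: a digit) is found. */
static int eval( void )
{
    char c = *input++;
    if( isdigit( c ) )
        return( c - '0' );
    int left = eval();
    int right = eval();
    switch( c )
    {
        case '+': return( left + right );
        case '-': return( left - right );
        case '*': return( left * right );
        default : return( left / right );
    }
}

int main( void )
{
    input = "+1*23";            /* prefix form of 1 + 2 * 3 */
    printf( "%d\n", eval() );   /* prints 7 */
    return( 0 );
}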
  3653. The translator-evaluator construction we have discussed so far may seem
rather artificial to you. But real compilers, although more complex, work the
  3655. same way. The big difference is that the evaluator is the computer processor
  3656. (CPU) - it cannot be changed, and the code that your compiler outputs must
  3657. obey the processor’s rules. In fact, the machine code used by a real machine like
  3658. the Intel x86 processor is a language unto itself, with a real grammar (consult
  3659. the Intel instruction set manual [3] for details).
  3660. There is one more property of the prefix code and the associated trees:
  3661. operators are no longer leaf nodes in the trees, but have become internal nodes.
  3662. We could have used nodes like expression and term as we have done before, but
  3663. these nodes would then be void of content. By making the operators nodes
  3664. themselves, we save valuable space in the tree.
  3665. 6.3 Parsing Theory
  3666. We have discussed prefix notations and associated syntax trees (or parse trees),
  3667. but how is such a tree constructed from the original input (it is called a parse
  3668. tree, after all)? In this section we present some of the theory that underlies
  3669. parsing. Note that in a book of limited size, we do not presume to be able to
  3670. treat all of the theory. In fact, we will limit our discussion to LL(1) grammars
  3671. and mention LR parsing in a couple of places.
  3672. Parsing can occur in two basic fashions: top-down and bottom-up. With top-
  3673. down, you start with a grammar’s start symbol and work toward the concrete
  3674. sentence under consideration by repeatedly applying production rules (replacing
  3676. nonterminals with one of their right-hand sides) until there are no nonterminals
  3677. left. This method is by far the easiest method, but also places the most restric-
  3678. tions on the grammar.
  3679. Bottom-up parsing starts with the sentence to be parsed (the string of termi-
  3680. nals), and repeatedly applies production rules inversely, i.e. replaces substrings
of terminals and nonterminals with the left-hand side of production rules. This
  3682. method is more powerful than top-down parsing, but much harder to write
  3683. by hand. Tools that construct bottom-up parsers from a grammar (compiler-
  3684. compilers) exist for this purpose.
  3685. 6.4 Top-down Parsing
  3686. Top-down parsing relies on a grammar’s determinism property to work. A
  3687. top down or recursive descent parser always takes the leftmost nonterminal in a
  3688. sentential form (or the rightmost nonterminal, depending on the flavor of parser
  3689. you use) and replaces it with one of its right-hand sides. Which one, depends on
  3690. the next terminal character the parser reads from the input stream. Because the
  3691. parser must constantly make choices, it must be able to do so without having
  3692. to retrace its steps after making a wrong choice. There exist parsers that work
  3693. this way, but these are obviously not very efficient and we will not give them
  3694. any further thought.
  3695. If all goes well, eventually all nonterminals will have been replaced with
  3696. terminals and the input sentence should be readable. If it is not, something
  3697. went wrong along the way and the input text did not obey the rules of the
  3698. grammar; it is said to be syntactically incorrect–there were syntax errors. We
  3699. will later see ways to pinpoint the location of syntax errors precisely.
  3700. Incidentally, there is no real reason why we always replace the leftmost (or
  3701. rightmost) nonterminal. Since our grammar is context-free, it does not mat-
  3702. ter which nonterminal gets replaced since there is no dependency between the
  3703. nonterminals (context-insensitive). It is simply tradition to pick the leftmost
  3704. nonterminal, and hence the name of the collection of recursive descent parsers:
  3705. LL, which means “recognizable while reading from left to right, and rewriting
  3706. leftmost nonterminals.” We can also define RL right away, which means “recog-
  3707. nizable while reading from right to left, and rewriting leftmost nonterminals.”
  3708. – this type of recursive descent parsers would be used in countries where text is
  3709. read from right to left.
  3710. As an example of top-down parsing, consider the BNF grammar in listing
  3711. 6.1. This is a simplified version of the mathematical expression grammar, made
  3712. suitable for LL parsing (the details of that will follow shortly).
  3713. Example 6.2 (Top-down Parsing by Hand)
  3714. We will now parse the sentence 1 + 2 + 3 by hand, using the top-down approach.
  3715. A top-down parser always starts with the start symbol, which in this case is
  3716. expression . It then reads the first character from the input stream, which happens
  3717. to be 1 , and determines which production rule to apply. Since there is only one
production rule that can replace expression (it only acts as the left-hand side of
  3719. one rule), we replace expression with factor restexpression :
expression =⇒_L factor restexpression
  3722. expression: factor restexpression.
restexpression: ε.
  3724. restexpression: + factor restexpression.
  3725. restexpression: − factor restexpression.
  3726. factor: 0.
  3727. factor: 1.
factor: 2.
  3729. factor: 3.
  3730. factor: 4.
  3731. factor: 5.
  3732. factor: 6.
  3733. factor: 7.
  3734. factor: 8.
  3735. factor: 9.
  3736. Listing 6.1: Expression Grammar for LL Parser
  3737. In LL parsing, we always replace the leftmost nonterminal (here, it is factor ).
factor has ten alternative production rules, but we know exactly which one to pick,
  3739. since we have the character 1 in memory and there is only one production rule
  3740. whose right-hand side starts with 1 :
expression =⇒_L factor restexpression
=⇒_L 1 restexpression
  3743. We have just eliminated one terminal from the input stream, so we read the
next one, which is "+". The leftmost nonterminal which we need to replace is
  3745. restexpression , which has only one alternative that starts with + :
expression =⇒_L factor restexpression
=⇒_L 1 restexpression
=⇒_L 1 + factor restexpression
  3749. We continue this process until we run out of terminal tokens. The situation
  3750. at that point is:
expression =⇒_L factor restexpression
=⇒_L 1 restexpression
=⇒_L 1 + factor restexpression
=⇒_L 1 + 2 restexpression
=⇒_L 1 + 2 + factor restexpression
=⇒_L 1 + 2 + 3 restexpression
  3758. The current terminal symbol under consideration is empty, but we could also
  3759. use end of line or end of file. In that case, we see that of the three alternatives
  3760. for restexpression , the ones that start with + and - are invalid. So we pick the
  3761. production rule with the empty right-hand side, effectively removing restexpres-
  3762. sion from the sentential form. We are now left with the input sentence, having
eliminated all the nonterminal symbols. Parsing was successful.
  3765. If we were to parse 1 + 2 * 3 using this grammar, parsing will not be suc-
  3766. cessful. Parsing will fail as soon as the terminal symbol * is encountered. If the
  3767. lexical analyzer cannot handle this token, parsing will end for that reason. If it
  3768. can (which we will assume here), the parser is in the following situation:
expression =⇒∗_L 1 + 2 restexpression
  3772. The parser must now find a production rule starting with * . There is none,
  3773. so it replaces restexpression with the empty alternative. After that, there are no
  3774. more nonterminals to replace, but there are still terminal symbols on the input
  3775. stream, thus the sentence cannot be completely recognized.
  3776. ?
  3777. There are a couple of caveats with the LL approach. Consider what happens
  3778. if a nonterminal is replaced by a collection of other nonterminals, and so on,
  3779. until at some point this collection of nonterminals is replaced by the original
  3780. nonterminal, while no new terminals have been processed along the way. This
  3781. process will then continue indefinitely, because there is no termination condition.
  3782. Some grammars cause this behaviour to occur. Such grammars are called left-
  3783. recursive.
  3784. Definition 6.1 (Left Recursion)
A context-free grammar (V, Σ, S, P) is left-recursive if

∃X ∈ V, α, β ∈ (V ∪ Σ)∗ : S =⇒∗ Xα =⇒∗_L Xβ

in which =⇒∗_L is the reflexive transitive closure of =⇒_L, defined as:

=⇒_L = { (uAτ, uατ) ∈ (V ∪ Σ)∗ × (V ∪ Σ)∗ : u ∈ Σ∗, τ ∈ (V ∪ Σ)∗ ∧ (A, α) ∈ P }

The difference between =⇒ and =⇒_L is that the former allows arbitrary
strings of terminals and nonterminals to precede the nonterminal that is going
to be replaced (A), while the latter insists that only terminals occur before A
(thus making A the leftmost nonterminal).
Equivalently, we may as well define =⇒_R:

=⇒_R = { (τAu, ταu) ∈ (V ∪ Σ)∗ × (V ∪ Σ)∗ : u ∈ Σ∗, τ ∈ (V ∪ Σ)∗ ∧ (A, α) ∈ P }

=⇒_R insists that A be the rightmost nonterminal, followed only by terminal
symbols. Using =⇒_R, we are also in a position to define right-recursion (which
is similar to left-recursion):

∃X ∈ V, α, β ∈ (V ∪ Σ)∗ : S =⇒∗ αX =⇒∗_R βX

Parsers for grammars which contain left-recursion are not guaranteed to terminate, al-
though they may. Because this may introduce hard-to-find bugs, it is important
to weed out left-recursion from the outset if at all possible.
  3806. ?
  3808. Removing left-recursion can be done using left-factorisation. Consider the
  3809. following excerpt from a grammar (which may be familiar from the previous
  3810. chapter):
  3811. expression: expression + term.
  3812. expression: expression − term.
  3813. expression: term.
  3814. Obviously, this grammar is left-recursive: the first two production rules both
  3815. start with expression , which also acts as their left-hand side. So expression may
be replaced with expression without processing a terminal along the way. Let
  3817. it be clear that there is nothing wrong with this grammar (it will generate valid
  3818. sentences in the mathematical expression language just fine), it just cannot be
  3819. recognized by a top-down parser.
  3820. Left-factorisation means recognizing that the nonterminal expression occurs
  3821. multiple times as the leftmost symbol in a production rule, and should therefore
  3822. be in a production rule on its own. Firstly, we swap the order in which term and
  3823. expression occur:
  3824. expression: term + expression.
  3825. expression: term − expression.
  3826. expression: term.
  3827. The + and - operators are now treated as if they were right-associative,
  3828. which - is definitely not. We will deal with this problem later. For now, assume
  3829. that associativity is not an issue. This grammar is no longer left-recursive, and
  3830. it obviously produces the same sentences as the original grammar. However, we
  3831. are not out of the woods yet.
  3832. It turns out that when multiple production rule alternatives start with the
  3833. same terminal or nonterminal symbol, it is impossible for the parser to choose an
  3834. alternative based on the token it currently has in memory. This is the situation
  3835. we have now; three production rules which all start with term . This is where we
  3836. apply the left-factorisation: term may be removed from the beginning of each
  3837. production rule and placed in a rule by itself. This is called “factoring out” a
  3838. nonterminal on the left side, hence left-factorisation. The result:
  3839. expression: term restexpression.
  3840. restexpression: + term restexpression.
  3841. restexpression: − term restexpression.
  3842. restexpression: ?.
  3843. Careful study will show that this grammar produces exactly the same sen-
  3844. tences as the original one. We have had to introduce a new nonterminal
  3845. ( restexpression ) with an empty alternative to solve the left-recursion, in addi-
  3846. tion to wrong associativity for the - operator, so we were not kidding when
we said that top-down parsing imposes some restrictions on the grammar. On the
  3848. flipside, writing a parser for such a grammar is a snap.
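To show just how much of a snap it is, here is a minimal, self-contained sketch of a recursive descent parser for this left-factored grammar. For brevity, term is simplified to a single digit; nothing below is taken from the actual Inger sources.

#include <stdio.h>

static const char *input = "1+2+3";
static int token;

static void nexttoken( void )
{
    token = *input ? *input++ : -1;
}

static int term( void )             /* simplified: a term is one digit */
{
    if( token < '0' || token > '9' )
        return( 0 );                /* syntax error */
    nexttoken();
    return( 1 );
}

static int restexpression( void )
{
    if( token == '+' || token == '-' )
    {
        nexttoken();                /* consume the operator */
        return( term() && restexpression() );
    }
    return( 1 );                    /* the empty alternative: do nothing */
}

static int expression( void )       /* expression: term restexpression. */
{
    return( term() && restexpression() );
}

int main( void )
{
    nexttoken();
    puts( expression() && token == -1 ? "parse OK" : "syntax error" );
    return( 0 );
}

Note how each nonterminal has become a function, and how restexpression chooses between its alternatives by inspecting the single token it has in memory.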
  3849. So far, we have assumed that the parser selects the production rule to apply
based on one terminal symbol, which it has in memory. There are also parsers
  3852. that work with more than one token at a time. A recursive descent parser which
  3853. works with 3 tokens is an LL(3) parser. More generally, an LL(k) parser is a
  3854. top-down parser with a k tokens lookahead.
  3855. Practical advice 6.1 (One Token Lookahead)
  3856. Do not be tempted to write a parser that uses a lookahead of more than one
  3857. token. The complexity of such a parser is much greater than the one-token
  3858. lookahead LL(1) parser, and it will not really be necessary. Most, if not all,
  3859. language constructs can be parsed using an LL(1) parser.
  3860. ?
  3861. We have now found that grammars, suitable for recursive descent parsing,
  3862. must obey the following two rules:
1. There must not be left-recursion in the grammar.
  3864. 2. Each alternative production rule with the same left-hand side must start
  3865. with a distinct terminal symbol. If it starts with a nonterminal symbol,
  3866. examine the production rules for that nonterminal symbol and so on.
  3867. We will repeat these definitions more formally shortly, after we have dis-
  3868. cussed bottom-up parsing and compared it to recursive descent parsing.
  3869. 6.5 Bottom-up Parsing
  3870. Bottom-up parsers are the inverse of top-down parsers: they start with the
  3871. full input sentence (string of terminals) and work by replacing substrings of
  3872. terminals and nonterminals in the sentential form by nonterminals, effectively
  3873. reversely applying the production rules.
In the remainder of this chapter, we will focus on top-down parsing only, but
  3875. we will illustrate the concept of bottom-up parsing (also known as LR) with an
  3876. example. Consider the grammar in listing 6.2, which is not LL.
  3877. Example 6.3 (Bottom-up Parsing by Hand)
  3878. We will now parse the sentence 1 + 2 + 3 by hand, using the bottom-up approach.
  3879. A bottom-up parser begins with the entire sentence to parse, and replaces groups
  3880. of terminals and nonterminals with the left-hand side of production rules. In
  3881. the initial situation, the parser sees the first terminal symbol, 1 , and decides
  3882. to replace it with factor (which is the only possibility). Such a replacement is
  3883. called a reduction.
1 + 2 + 3 =⇒ factor + 2 + 3
  3885. Starting again from the left, the parser sees the nonterminal factor and de-
  3886. cides to replace it with expression (which is, once again, the only possibility):
1 + 2 + 3 =⇒ factor + 2 + 3
=⇒ expression + 2 + 3
  3890. expression: expression + expression.
  3891. expression: expression − expression.
  3892. expression: factor.
  3893. factor: 0.
  3894. factor: 1.
factor: 2.
  3896. factor: 3.
  3897. factor: 4.
  3898. factor: 5.
  3899. factor: 6.
  3900. factor: 7.
  3901. factor: 8.
  3902. factor: 9.
  3903. Listing 6.2: Expression Grammar for LR Parser
  3904. There is now no longer a suitable production rule that has a lone expression on
  3905. the right-hand side, so the parser reads another symbol from the input stream
  3906. ( + ). Still, there is no production rule that matches the current input. The
  3907. tokens expression and + are stored on a stack (shifted) for later reference. The
  3908. parser reads another symbol from the input, which happens to be 2 , which it
can replace with factor , which can in turn be replaced by expression :
1 + 2 + 3 =⇒ factor + 2 + 3
=⇒ expression + 2 + 3
=⇒ expression + factor + 3
=⇒ expression + expression + 3
  3914. All of a sudden, the first three tokens in the sentential form ( expression +
  3915. expression ), two of which were stored on the stack, form the right hand side of a
  3916. production rule:
  3917. expression: expression + expression.
  3918. The parser replaces the three tokens with expression and continues the process
  3919. until the situation is thus:
1 + 2 + 3 =⇒ factor + 2 + 3
=⇒ expression + 2 + 3
=⇒ expression + factor + 3
=⇒ expression + expression + 3
=⇒ expression + 3
=⇒ expression + factor
=⇒ expression + expression
=⇒ expression
  3928. In the final situation, the parser has reduced the entire original sentence
  3929. to the start symbol of the grammar, which is a sign that the input text was
  3930. syntactically correct.
  3932. ?
Formally put, the shift-reduce method constructs a rightmost derivation
S =⇒∗_R s, but in reverse order. This example shows that bottom-up parsers can deal with
left-recursion (in fact, left-recursive grammars make for more efficient bottom-up
parsers), which helps keep grammars simple. However, we stick with top-down
parsers since they are by far the easiest to write by hand.
  3940. 6.6 Direction Sets
  3941. So far, we have only informally defined which restrictions are placed on a gram-
  3942. mar for it to be LL(k). We will now present these limitations more precisely.
  3943. We must start with several auxiliary definitions.
  3944. Definition 6.2 (FIRST-Set of a Production)
  3945. The FIRST set of a production for a nonterminal A is the set of all terminal
  3946. symbols, with which the strings generated from A can start.
  3947. ?
  3948. Note that for an LL(k) grammar, the first k terminal symbols with which
  3949. a production starts are included in the FIRST set, as a string. Also note that
  3950. this definition relies on the use of BNF, not EBNF. It is important to realize
  3951. that the following grammar excerpt:
  3952. factor: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9.
  3953. actually consists of 10 different production rules (all of which happen to
  3954. share the same left-hand side). The FIRST set of a production is often denoted
  3955. PFIRST, as a reminder of the fact that it is the FIRST set of a single Production.
  3956. Definition 6.3 (FIRST-Set of a Nonterminal)
  3957. The FIRST set of a nonterminal A is the set of all terminal symbols, with which
  3958. the strings generated from A can start.
  3959. If the nonterminal X has n productions in which it acts as the left-hand
  3960. side, then
FIRST(X) := PFIRST(X_1) ∪ PFIRST(X_2) ∪ ... ∪ PFIRST(X_n)
  3966. ?
  3967. The LL(1) FIRST set of factor in the previous example is {0, 1, 2, 3, 4, 5, 6,
  3968. 7, 8, 9}. Its individual PFIRST sets (per production) are {0} through {9}. We
  3969. will deal only with LL(1) FIRST sets in this book.
  3970. We also define the FOLLOW set of a nonterminal. FOLLOW sets are de-
  3971. termined only for entire nonterminals, not for productions:
  3972. Definition 6.4 (FOLLOW-Set of a Nonterminal)
  3974. The FOLLOW set of a nonterminal A is the set of all terminal symbols, that
  3975. may follow directly after A.
  3976. ?
  3977. To illustrate FOLLOW-sets, we need a bigger grammar:
  3978. Example 6.4 (FOLLOW-Sets)
  3979. expression: factor restexpression.
restexpression: ε
| + factor restexpression
| − factor restexpression.
factor: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9.
FOLLOW(expression) = {⊥}¹
FOLLOW(restexpression) = {⊥}
FOLLOW(factor) = {⊥,+,−}
  3987. ?
  3988. We are now in a position to formalize the property of unambiguity for LL(k)
  3989. grammars:
  3990. Definition 6.5 (Unambiguity of LL(k) Grammars)
  3991. A grammar is unambiguous when
1. If a nonterminal acts as the left-hand side of multiple productions, then
the PFIRST sets of these productions must be disjoint.
2. If a nonterminal can produce the empty string (ε), then its FIRST set
must be disjoint from its FOLLOW set.
  3996. ?
How does this work in practice? The first condition is easy. Whenever an
  3998. LL parser reads a terminal, it must decide which production rule to apply. It
  3999. does this by looking at the first k terminal symbols that each production rule
  4000. can produce (its PFIRST set). In order for the parser to be able to make the
  4001. choice, these sets must not have any overlap. If there is no overlap, the grammar
  4002. is said to be deterministic.
  4003. If a nonterminal can be replaced with the empty string, the parser must check
  4004. whether it is valid to do so. Inserting the empty string is an option when no
  4005. other rule can be applied, and the nonterminals that come after the nonterminal
  4006. that will produce the empty string are able to produce the terminal that the
  4007. parser is currently considering. Hence, to make the decision, the FIRST set of
  4008. the nonterminal must not have any overlap with its FOLLOW set.
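For example, in the expression grammar of listing 6.1, restexpression can produce the empty string; its FIRST set {+, −} is disjoint from its FOLLOW set {⊥}, so the parser can always decide deterministically whether the empty alternative applies.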
¹ We use ⊥ to denote end of file.
Production PFIRST
program: statementlist RESULT = implication. {A...Z}
statementlist : ε. ∅
statementlist : statement statementlist. {A...Z}
statement: identifier = implication ;. {A...Z}
implication: conjunction restimplication. {∼,(,0,1,A...Z}
restimplication : ε. ∅
restimplication : −> conjunction restimplication. {->}
restimplication : <− conjunction restimplication. {<-}
restimplication : <−> conjunction restimplication. {<->}
conjunction: negation restconjunction. {∼,(,0,1,A...Z}
restconjunction: ε. ∅
restconjunction: & negation restconjunction. {&}
restconjunction: | negation restconjunction. {|}
negation: ˜ negation. {∼}
negation: factor. {(,0,1,A...Z}
factor : ( implication ). {(}
factor : identifier . {A...Z}
factor : 1. {1}
factor : 0. {0}
identifier : A. {A}
...
identifier : Z. {Z}
Table 6.1: PFIRST Sets for Logic Language
  4038. 6.7 Parser Code
A wondrous and most useful property of LL(k) grammars (henceforth referred
  4040. to as LL since we will only be working with LL(1) anyway) is that a parser can
  4041. be written for them in a very straightforward fashion (as long as the grammar
  4042. is truly LL).
A top-down parser needs a stack to place its nonterminals on. It is easiest to
  4044. use the stack offered by the C compiler (or whatever language you work with)
  4045. for this purpose. Now, for every nonterminal, we produce a function. This
  4046. function checks that the terminal symbol currently under consideration is an
  4047. element of the FIRST set of the nonterminal that the function represents, or
  4048. else it reports a syntax error.
  4049. After a syntax error, the parser may recover from the error using a synchro-
  4050. nization approach (see chapter 8 on error recovery for details) and continue if
  4051. possible, in order to find more errors.
  4052. The body of the function reads any terminals that are specified in the pro-
  4053. duction rules for the nonterminal, and calls other functions (that represent other
  4054. nonterminals) in turn, thus putting frames on the stack. In the next section,
we will show that this approach is ideal for constructing a syntax tree.
  4056. Writing parser code is best illustrated with an (elaborate) example. Please
  4057. refer to the grammar for the logic language (section 5.11), for which we will write
a parser. In table 6.1, we show the PFIRST set for every individual production,
while in table 6.2, we show the FIRST and FOLLOW sets for every nonterminal.
  4061. Nonterminal FIRST FOLLOW
  4062. program {A...Z} {⊥}
  4063. statementlist {A...Z} {RESULT}
  4064. statement {A...Z} {∼,(,0,1,A...Z,RESULT}
  4065. implication {∼,(,0,1,A...Z} {;,⊥,)}
  4066. restimplication { ->, <-, <-> } {;,⊥,)}
  4067. conjunction {∼,(,0,1,A...Z} {->,<-,<->,;,⊥,)}
  4068. restconjunction {&,|} {->,<-,<->,;,⊥,)}
  4069. negation {∼,(,0,1,A...Z} {&,|,->,<-,<->,;,⊥,)}
factor {(,0,1,A...Z} {&,|,->,<-,<->,;,⊥,)}
identifier {A...Z} {=,&,|,->,<-,<->,;,⊥,)}
  4072. Table 6.2: FIRST and FOLLOW Sets for Logic Language
  4073. With this information, we can now build the parser. Refer to appendix G for
  4074. the complete source code (including a lexical analyzer built with flex ). We will
discuss the C-function for the nonterminal conjunction here (shown in listing 6.3).
  4076. The conjunction function first checks that the current terminal input symbol
  4077. (stored in the global variable token ) is an element of FIRST(conjunction) (lines
  4078. 3–6). If not, conjunction returns an error.
If token is an element of the FIRST set, conjunction calls negation, which is
the first symbol in the production rule for conjunction (lines 11-14):
  4081. conjunction: negation restconjunction.
  4082. If negation returns without errors, conjunction must now decide whether to
  4083. call restconjunction (which may produce the empty string). It does so by looking
  4084. at the current terminal symbol under consideration. If it is a & or a | (both
part of FIRST(restconjunction)), it calls restconjunction (lines 16-19). If not, it
  4086. skips restconjunction , assuming it produces the empty string.
  4087. The other functions in the parser are constructed using a similar approach.
  4088. Note that the parser presented here only performs a syntax check; the parser in
  4089. appendix G also interprets its input (it is an interpreter), which makes for more
  4090. interesting reading.
  4091. 6.8 Conclusion
  4092. Our discussion of parser construction is now complete. The results of parsing
are placed in a syntax tree and passed on to the next phase, semantic analysis.
  4095. 1 int conjunction()
  4096. 2 {
3 if ( token != '~' && token != '('
  4098. 4 && token != IDENTIFIER
  4099. 5 && token != TRUE
  4100. 6 && token != FALSE )
  4101. 7 {
  4102. 8 return( ERROR );
  4103. 9 }
  4104. 10
  4105. 11 if ( negation() == ERROR )
  4106. 12 {
  4107. 13 return( ERROR );
  4108. 14 }
  4109. 15
  4110. 16 if ( token == ’&’ || token == ’|’ )
  4111. 17 {
  4112. 18 return( restconjunction () );
  4113. 19 }
  4114. 20 else
  4115. 21 {
  4116. 22 return( OK );
  4117. 23 }
  4118. 24 }
  4119. Listing 6.3: Conjunction Nonterminal Function
  4121. Bibliography
  4122. [1] A.V. Aho, R. Sethi, J.D. Ullman: Compilers: Principles, Techniques and
  4123. Tools, Addison-Wesley, 1986.
  4124. [2] F.H.J. Feldbrugge: Dictaat Vertalerbouw, Hogeschool van Arnhem en Ni-
  4125. jmegen, edition 1.0, 2002.
  4126. [3] Intel: IA-32 Intel Architecture - Software Developer’s Manual - Volume 2:
  4127. Instruction Set, Intel Corporation, Mt. Prospect, 2001.
[4] J. Levine: Lex and Yacc, O'Reilly & Associates, 2000.
  4129. [5] H. Meijer: Inleiding Vertalerbouw, University of Nijmegen, Subfaculty of
  4130. Computer Science, 2002.
  4131. [6] M.J. Scott: Programming Language Pragmatics, Morgan Kaufmann Pub-
  4132. lishers, 2000.
  4133. [7] T. H. Sudkamp: Languages & Machines, Addison-Wesley, 2nd edition,
  4134. 1998.
  4135. [8] N. Wirth: Compilerbouw, Academic Service, 1987.
  4137. Chapter 7
  4138. Preprocessor
  4139. 7.1 What is a preprocessor?
  4140. A preprocessor is a tool used by a compiler to transform a program before actual
  4141. compilation. The facilities a preprocessor provides may vary, but the four most
  4142. common functions a preprocessor could provide are:
  4143. • header file inclusion
  4144. • conditional compilation
  4145. • macro expansion
  4146. • line control
  4147. Header file inclusion is the substitution of files for include declarations (in
  4148. the C preprocessor this is the #include directive). Conditional compilation
  4149. provides a mechanism to include and exclude parts of a program based on
various conditions (in the C preprocessor this is done with directives such as
#if, #ifdef and #endif). Macro expansion is probably the most powerful feature of a preprocessor.
  4152. Macros are short abbreviations of longer program constructions. The prepro-
  4153. cessor replaces these macros with their definition throughout the program (in
  4154. the C preprocessor a macro is specified with #define). Line control is used
  4155. to inform the compiler where a source line originally came from when different
  4156. source files are combined into an intermediate file. Some preprocessors also re-
  4157. move comments from the source file, though it is also perfectly acceptable to do
  4158. this in the lexical analyzer.
  4159. 7.2 Features of the Inger preprocessor
  4160. The preprocessor in Inger only supports header file inclusion for now. In the near
  4161. future other preprocessor facilities may be added, but due to time constraints
  4162. header file inclusion is the only feature. The preprocessor directives in Inger
  4163. always start at the beginning of a line with a #, just like the C preprocessor.
  4164. The directive for header inclusion is #import followed by the name of the file
  4165. to include between quotes.
  4167. 7.2.1 Multiple file inclusion
Multiple inclusion of the same header file might give some problems. In C we
prevent this through conditional compilation with a #define guard, or with a
#pragma once directive. The Inger preprocessor automatically prevents multiple inclusion
by keeping a list of files that have already been included for this source file.
  4172. Example 7.1 (Multiple inclusion)
  4173. Multiple inclusion – this should be perfectly acceptable for a programmer so
  4174. no warning is shown, though hdrfile3 is included only once.
  4175. ?
Forcing the user not to include files more than once is not an option, since
sometimes multiple header files just need the same other header file. This could
be solved by introducing conditional compilation into the preprocessor and having
the programmers solve it themselves, but it is nicer if it happens automatically,
so the Inger preprocessor keeps track of included files to prevent it.
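A minimal sketch of this bookkeeping in C; the fixed-size table, the strdup call and the function name are illustrative assumptions, not the actual Inger preprocessor source.

#include <string.h>

#define MAX_INCLUDES 256

/* Headers already included for the current source file. */
static char *included[MAX_INCLUDES];
static int includedCount = 0;

/* Returns 1 if [filename] was seen before (so the #import can be
 * skipped silently), 0 otherwise; unseen files are recorded. */
static int AlreadyIncluded( const char *filename )
{
    int i;
    for( i = 0; i < includedCount; i++ )
    {
        if( strcmp( included[i], filename ) == 0 )
        {
            return 1;
        }
    }
    if( includedCount < MAX_INCLUDES )
    {
        included[includedCount++] = strdup( filename );
    }
    return 0;
}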
  4182. 7.2.2 Circular References
  4183. Another problem that arises from header files including other header files is
  4184. the problem of circular references. Again unlike the C preprocessor, the Inger
  4185. preprocessor detects circular references and shows a warning while ignoring the
  4186. circular include.
  4187. Example 7.2 (Circular References)
  4188. Circular inclusion – this always means that there is an error in the source so
  4189. the preprocessor gives a warning and the second inclusion of hdrfile2 is ignored.
  4190. ?
This is realized by building a tree structure of includes. Every time a new
file is to be included, the tree is checked upwards to the root node, to see if
this file has already been included. When a file has already been included, the
preprocessor shows a warning and the import directive is ignored. Because every
include creates a new child node in the tree, the preprocessor is able to distinguish
between a multiple inclusion and a circular inclusion by only going up in the
tree.
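The upward tree walk could be implemented along these lines (a sketch; the node layout and function names are assumptions):

#include <string.h>

typedef struct IncludeNode
{
    const char *filename;
    struct IncludeNode *parent;   /* NULL for the root source file */
} IncludeNode;

/* Walk from the current include node up to the root. If [filename]
 * occurs on this path, including it again would be circular, so the
 * preprocessor warns and ignores the #import directive. */
static int IsCircular( const IncludeNode *current, const char *filename )
{
    const IncludeNode *node;
    for( node = current; node != NULL; node = node->parent )
    {
        if( strcmp( node->filename, filename ) == 0 )
        {
            return 1;   /* circular inclusion */
        }
    }
    return 0;   /* at worst a multiple inclusion: not circular */
}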
  4199. Example 7.3 (Include tree)
  4200. Include tree structure – for every inclusion, a new child node is added. This
  4201. example shows how the circular inclusion for header 2 is detected by going
  4202. upwards in the tree, while the multiple inclusion of header 3 is not seen as a
  4203. circular inclusion because it is in a different branch.
  4204. ?
  4206. Chapter 8
  4207. Error Recovery
“As soon as we started programming, we found to our surprise that it wasn't
as easy to get programs right as we had thought. Debugging had to be discovered.
I can remember the exact instant when I realized that a large part of my life from
then on was going to be spent in finding mistakes in my own programmes.” -
Maurice Wilkes discovers debugging, 1949.
  4213. 8.1 Introduction
  4214. Almost no programs are ever written from scratch that contain no errors at all.
  4215. Since programming languages, as opposed to natural languages, have very rigid
  4216. syntax rules, it is very hard to write error-free code for a complex algorithm on
  4217. the first attempt. This is why compilers must have excellent features for error
  4218. handling. Most programs will require several runs of the compiler before they
  4219. are free of errors.
  4220. Error detection is a very important aspect of the compiler; it is the outward
  4221. face of the compiler and the most important bit that the user will become rapidly
  4222. familiar with. It is therefore imperative that error messages be clear, correct
  4223. and, above all, useful. The user should not have to look up additional error
information in a dusty manual; rather, the nature of the error should be clear from
  4225. the information that the compiler gives.
This chapter discusses the different natures of errors, and shows ways of
detecting, reporting and recovering from errors.
  4229. 8.2 Error handling
  4230. Parsing is all about detecting syntax errors and displaying them in the most
  4231. useful manner possible. For every compiler run, we want the parser to detect
  4232. and display as many syntax errors as it can find, to alleviate the need for the
  4233. user to run the compiler over and over, correcting the syntax errors one by one.
  4234. There are three stages in error handling:
  4235. • Detection
  4236. • Reporting
  4237. • Recovery
  4238. The detection of errors will happen during compilation or during execution.
  4239. Compile-time errors are detected by the compiler during translation. Runtime
  4240. errors are detected by the operating system in conjunction with the hardware
  4241. (such as a division by zero error). Compile-time errors are the only errors that
  4242. the compiler should have to worry about, although it should make an effort to
  4243. detect and warn about all the potential runtime errors that it can. Once an
  4244. error is detected, it must be reported both to the user and to a function which
  4245. will process the error. The user must be informed about the nature of the error
  4246. and its location (its line number, and possibly its character number).
  4247. The last and also most difficult task that the compiler faces is recovering
  4248. from the error. Recovery means returning the compiler to a position in which
  4249. it is able to resume parsing normally, so that subsequent errors do not result
  4250. from the original error.
  4251. 8.3 Error detection
The first stage of error handling is detection, which is divided into compile-time
detection and runtime detection. Runtime detection is not part of the compiler
and therefore will not be discussed here. Compile-time detection, however, is divided into
four stages, which are discussed below. These stages are:
  4256. • Detecting lexical errors
  4257. • Detecting syntactic errors
  4258. • Detecting semantic errors
  4259. • Detecting compiler errors
  4260. - Detecting lexical errors Normally the scanner reports a lexical error when
  4261. it encounters an input character that cannot be the first character of any lexical
  4262. token. In other words, an error is signalled when the scanner is unfamiliar
  4263. with a token found in the input stream. Sometimes, however, it is appropriate
  4264. to recognize a specific sequence of input characters as an invalid token. An
example of an error detected by the scanner is an unterminated comment. The
  4266. scanner must remove all comments from the source code. It is not correct, of
  4267. course, to begin a comment but never terminate it. The scanner will reach
  4268. the end of the source file before it encounters the end of the comment. Another
  4269. (similar) example is when the scanner is unable to determine the end of a string.
  4271. Lexical errors also include the error class known as overflow errors. Most
  4272. languages include the integer type, which accepts integer numbers of a certain
  4273. bit length (32 bits on Intel x86 machines). Integer numbers that exceed the
  4274. maximum bit length generate a lexical error. These errors cannot be detected
  4275. using regular expressions, since regular expressions cannot interpret the value of
  4276. a token, but only calculate its length. The lexer can rule that no integer number
  4277. can be longer than 10 digits, but that would mean that 000000000000001 is not
  4278. a valid integer number (although it is!). Rather, the lexer must verify that
  4279. the literal value of the integer number does not exceed the maximum bit length
  4280. using a so-called lexical action. When the lexer matches the complete token and
  4281. is about to return it to the parser, it verifies that the token does not overflow.
  4282. If it does, the lexer reports a lexical error and returns zero (as a placeholder
  4283. value). Parsing may continue as normal.
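Such a lexical action might look as follows in C; the ReportLexError routine is a hypothetical stand-in for whatever error interface the lexer actually uses.

#include <errno.h>
#include <limits.h>
#include <stdlib.h>

extern void ReportLexError( const char *message );   /* hypothetical */

/* Lexical action for an integer literal: convert the matched text
 * and verify that the value fits. On overflow, report a lexical
 * error and return 0 as a placeholder so parsing can continue. */
static long IntegerAction( const char *text )
{
    long value;

    errno = 0;
    value = strtol( text, NULL, 10 );
    if( errno == ERANGE || value > INT_MAX || value < INT_MIN )
    {
        ReportLexError( "integer value overflow" );
        return 0;   /* neutral placeholder value */
    }
    return value;
}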
  4284. - Detecting syntactic errors The parser uses the grammar’s production rules
  4285. to determine which tokens it expects the lexer to pass to it. Every nonterminal
has a FIRST set, which is the set of all the terminal tokens with which strings
derived from the nonterminal can begin, and a FOLLOW set, which is the set of all the terminal tokens
  4288. that may appear after the nonterminal. After receiving a terminal token from
  4289. the lexical analyzer, the parser must check that it matches the FIRST set of the
  4290. nonterminal it is currently evaluating. If so, then it continues with its normal
  4291. processing, otherwise the normal routine of the parser is interrupted and an
  4292. error processing function is called.
  4293. - Detecting semantic errors Semantic errors are detected by the action rou-
  4294. tines called within the parser. For example, when a variable is encountered it
  4295. must have an entry in the symbol table. Or when the value of variable "a" is
  4296. assigned to variable "b" they must be of the same type.
  4297. - Detecting compiler errors The last category of compile-time errors deals
with malfunctions within the compiler itself. A correct program could be incorrectly
compiled because of a bug in the compiler. The only thing the user can
  4300. do is report the error to the system staff. To make the compiler as error-free as
  4301. possible, it contains extensive self-tests.
  4302. 8.4 Error reporting
  4303. Once an error is detected, it must be reported to the user and to the error
handling function. Typically, the user receives one or more messages that report
the error. Error messages displayed to the user must obey a few style rules, so
that they may be clear and easy to understand.
  4308. 1. The message should be specific, pinpointing the place in the program
  4309. where the error was detected as closely as possible. Some compilers include
  4310. only the line number in the source file on which the error occurred, while
  4311. others are able to highlight the character position in the line containing
  4312. the error, making the error easier to find.
  4313. 2. The messages should be written in clear and complete English sentences,
  4314. never in cryptic terms. Never list just a message code number such as
  4315. "error number 33" forcing the user to refer to a manual.
3. The message should not be redundant. For example, when a variable is not
declared, it is not necessary to print that fact each time the variable is
referenced.
4. The messages should indicate the nature of the error discovered. For
example, if a colon were expected but not found, then the message should
say just that, and not merely "syntax error" or "missing symbol".
  4324. 5. It must be clear that the given error is actually an error (so that the
  4325. compiler did not generate an executable), or that the message is a warning
  4326. (and an executable may still be generated).
  4327. Example 8.1 (Error Reporting)
  4328. Error
  4329. The source code contains an overflowing integer value
(e.g. 12345678901234567890).
  4331. Response
  4332. This error may be treated as a warning, since compilation can still
  4333. take place. The offending overflowing value will be replaced with
  4334. some neutral value (say, zero) and this fact should be reported:
  4335. test.i (43): warning: integer value overflow
  4336. (12345678901234567890). Replaced with 0.
  4337. Error
  4338. The source code is missing the keyword THEN where it is expected
  4339. according to the grammar.
  4340. Response
  4341. This error cannot be treated as a warning, since an essential piece
of code cannot be compiled. The location of the error must be
  4343. pinpointed so that the user can easily correct it:
  4344. test.i (54): error: THEN expected after IF condition.
  4345. Note that the compiler must now recover from the error; obviously
  4346. an important part of the IF statement is missing and it must be
  4347. skipped somehow. More information on error recovery will follow
  4348. below.
  4349. ?
  4351. 8.5 Error recovery
  4352. There are three ways to perform error recovery:
  4353. 1. When an error is found, the parser stops and does not attempt to find
  4354. other errors.
  4355. 2. When an error is found, the parser reports the error and continues parsing.
  4356. No attempt is made at error correction (recovery), so the next errors may
  4357. be irrelevant because they are caused by the first error.
  4358. 3. When an error is found, the parser reports it and recovers from the error,
  4359. so that subsequent errors do not result from the original error. This is the
  4360. method discussed below.
  4361. Any of these three approaches may be used (and have been), but it should be
  4362. obvious that approach 3 is most useful to the programmer using the compiler.
  4363. Compiling a large source program may take a long time, so it is advantageous
  4364. to have the compiler report multiple errors at once. The user may then correct
  4365. all errors at his leisure.
  4366. 8.6 Synchronization
  4367. Error recovery uses so-called synchronization points that the parser looks for
  4368. after an error has been detected. A synchronization point is a location in the
  4369. source code from which the parser can safely continue parsing without printing
  4370. further errors resulting from the original error.
  4371. Error recovery uses two sets of terminal tokens, the so-called direction sets:
1. The FIRST set: the set of all terminal symbols with which the strings
generated by all the productions for this nonterminal can begin.
2. The FOLLOW set: the set of all terminal symbols that can be generated
by the grammar directly after the current nonterminal.
  4376. As an example for direction sets, we will consider the following very simple
  4377. grammar and show how the FIRST and FOLLOW sets may be constructed for
  4378. it.
number: digit morenumber.
morenumber: digit morenumber.
morenumber: ε.
digit: 0.
digit: 1.
  4384. Any nonterminal has at least one, but frequently more than one production
  4385. rule. Every production rule has its own FIRST set, which we will call PFIRST.
  4386. The PFIRST set for a production rule contains all the leftmost terminal tokens
  4387. that the production rule may eventually produce. The FIRST set of any non-
  4388. terminal is the union of all its PFIRST sets. We will now construct the FIRST
  4389. and PFIRST sets for our sample grammar.
PFIRST sets for every production:

number: digit morenumber.     PFIRST = {0, 1}
morenumber: digit morenumber. PFIRST = {0, 1}
morenumber: ε.                PFIRST = {ε}
digit: 0.                     PFIRST = {0}
digit: 1.                     PFIRST = {1}

FIRST sets for every nonterminal:

FIRST(number)     = {0, 1}
FIRST(morenumber) = {0, 1} ∪ {ε} = {0, 1, ε}
FIRST(digit)      = {0} ∪ {1} = {0, 1}
  4437. Practical advice 8.1 (Construction of PFIRST sets)
  4438. PFIRST sets may be most easily constructed by working from bottom to top:
  4439. find the PFIRST sets for ’digit’ first (these are easy since the production rules
  4440. for digit contain only terminal tokens). When finding the PFIRST set for a
  4441. production rule higher up (such as number), combine the FIRST sets of the
  4442. nonterminals it uses (in the case of number, that is digit). These make up the
  4443. PFIRST set.
  4444. ?
  4445. Every nonterminal must also have a FOLLOW set. A FOLLOW set con-
  4446. tains all the terminal tokens that the grammar accepts after the nonterminal to
  4447. which the FOLLOW set belongs. To illustrate this, we will now determine the
  4448. FOLLOW sets for our sample grammar.
number: digit morenumber.
morenumber: digit morenumber.
morenumber: ε.
digit: 0.
digit: 1.
  4454. FOLLOW sets for every nonterminal:
FOLLOW(number)     = {EOF}
FOLLOW(morenumber) = {EOF}
FOLLOW(digit)      = {EOF, 0, 1}
  4467. The terminal tokens in these two sets are the synchronization points. After
  4468. the parser detects and displays an error, it must synchronize (recover from the
  4469. error). The parser does this by ignoring all further tokens until it reads a token
  4470. that occurs in a synchronization point set, after which parsing is resumed. This
point is best illustrated by an example, describing a Sync routine. Please refer to
listing 8.1.
/* Forward declarations. */

/* If current token is not in FIRST set, display
 * specified error.
 * Skip tokens until current token is in FIRST
 * or in FOLLOW set.
 * Return TRUE if token is in FIRST set, FALSE
 * if it is in FOLLOW set.
 */
BOOL Sync( int first[], int follow[], char *error )
{
    if( !Element( token, first ) )
    {
        AddPosError( error, lineCount, charPos );
    }

    while( !Element( token, first ) && !Element( token, follow ) )
    {
        GetToken();
        /* If EOF reached, stop requesting tokens and just
         * exit, claiming that the current token is not
         * in the FIRST set. */
        if( token == 0 )
        {
            return( FALSE );
        }
    }

    /* Return TRUE if token in FIRST set, FALSE
     * if token in FOLLOW set.
     */
    return( Element( token, first ) );
}
  4505. Listing 8.1: Sync routine
/* Call this when an unexpected token occurs halfway through
 * a nonterminal function. It prints an error, then
 * skips tokens until it reaches an element of the
 * current nonterminal's FOLLOW set. */
void SyncOut( int follow[] )
{
    /* Skip tokens until current token is in FOLLOW set. */
    while( !Element( token, follow ) )
    {
        GetToken();
        /* If EOF is reached, stop requesting tokens and
         * exit. */
        if( token == 0 ) return;
    }
}
  4522. Listing 8.2: SyncOut routine
  4523. Tokens are requested from the lexer and discarded until a token occurs in
  4524. one of the synchronization point lists.
  4525. At the beginning of each production function in the parser the FIRST and
  4526. FOLLOW sets are filled. Then the function Sync should be called to check if the
  4527. token given by the lexer is available in the FIRST or FOLLOW set. If not then
  4528. the compiler must display the error and search for a token that is part of the
  4529. FIRST or FOLLOW set of the current production. This is the synchronization
  4530. point. From here on we can start checking for other errors.
It is possible that an unexpected token is encountered halfway through a nonterminal
function. When this happens, it is necessary to synchronize until a token of the
FOLLOW set is found. The function SyncOut provides this functionality (see
  4534. listing 8.2).
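As an illustration, a nonterminal function might combine Sync and SyncOut as sketched below. The direction set contents are made up for the example, and Element() is assumed to treat 0 as the end-of-list marker; this is not actual Inger source.

/* Sketch of a nonterminal function using Sync and SyncOut. */
int statement( void )
{
    int first[]  = { IDENTIFIER, 0 };   /* FIRST(statement), illustrative  */
    int follow[] = { ';', 0 };          /* FOLLOW(statement), illustrative */

    /* Synchronize on entry; if the token is in the FOLLOW set,
     * give up on this statement so parsing can continue. */
    if( !Sync( first, follow, "statement expected" ) )
    {
        return( ERROR );
    }

    /* ... parse the statement proper here ... */

    if( token != ';' )
    {
        /* Unexpected token halfway through: skip to FOLLOW set. */
        AddPosError( "; expected", lineCount, charPos );
        SyncOut( follow );
        return( ERROR );
    }
    GetToken();   /* consume the ; */
    return( OK );
}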
  4535. Morgan (1970) claims that up to 80% of the spelling errors occurring in
  4536. student programs may be corrected in this fashion.
  4538. Part III
  4539. Semantics
  4541. We are now in a position to continue to the next level and take a look at
the shady side of compiler construction: semantics. This part of the book will
provide answers to questions like: What are semantics good for? What is the
difference between syntax and semantics? Which checks are performed? What
is typechecking? And what is a symbol table? In other words, this part will
unravel the riddles of semantic analysis. Firstly, it is important to know the dif-
ference between syntax and semantics. Syntax is the grammatical arrangement
  4548. of words or tokens in a language which establishes their necessary relations.
  4549. Hence, syntax analysis checks the correctness of the relation between elements
  4550. of a sentence. Let’s explain this with an example using a natural language. The
  4551. sentence
  4552. Loud purple flowers talk
  4553. is incorrect according to the English grammar, hence the syntax of the sen-
  4554. tence is flawed. This means that the relation between the words is incorrect due
  4555. to its bad syntactical construction which results in a meaningless sentence.
  4556. Whereas syntax is about the relation between elements of a sentence, se-
  4557. mantics is concerned with the meaning of the production. The relation of the
  4558. elements in a sentence can be right, while the construction as a whole has no
  4559. meaning at all. The sentence
  4560. Purple flowers talk loud
  4561. is correct according to the English grammar, but the meaning of the sentence
  4562. is not flawless at all since purple flowers cannot talk! At least, not yet. The
  4563. semantic analysis checks for meaningless constructions, erroneous productions
which could have multiple meanings, and generates error and warning messages.
  4565. When we apply this theory on programming languages we see that syntax
  4566. analysis finds syntax errors such as typos and invalid constructions such as illegal
  4567. variable names. However, it is possible to write programs that are syntactically
  4568. correct, but still violate the rules of the language. For example, the following
  4569. sample code conforms to the Inger syntax, but is invalid nonetheless: we cannot
  4570. assign a value to a function.
  4571. myFunc() = 6 ;
  4572. The semantic analysis is of great importance since code with assignments like
  4573. this may act strange when executing the program after successful compilation.
  4574. If the program above does not crash with a segmentation fault on execution and
  4575. apparently executes the way it should, there is a chance that something fishy is
going on: is a new address assigned to the function myFunc(), or not? We do
not assume¹ that everything will work the way we think it will work.
Some things are too complex for syntax analysis; this is where semantic
analysis comes in. Type checking is necessary because we cannot force correct
  4583. use of types in the syntax because too much additional information is needed.
  4584. This additional information, like (return) types, will be available to the type
  4585. checker stored in the AST and symbol table.
Let us begin with the symbol table.
¹ Tip 27 from The Pragmatic Programmer [1]: Don't Assume It - Prove It. Prove your
assumptions in the actual environment, with real data and boundary conditions.
  4590. Chapter 9
  4591. Symbol table
  4592. 9.1 Introduction to symbol identification
  4593. At compile time we need to keep track of the different symbols (functions and
variables) declared in the program source code. This is a requirement because
every symbol must be identifiable by a unique name, so we can look the symbols
up later.
  4597. Example 9.1 (An Incorrect Program)
module flawed;

start main: void → void
{
    2 * 4;              // result is lost
    myfunction( 3 );    // function does not exist
}
  4604. ?
  4605. Without symbol identification it is impossible to make assignments, or define
  4606. functions. After all, where do we assign the value to? And how can we call a
  4607. function if it has no name? We do not think we would make many program-
  4608. mers happy if they could only reference values and functions through memory
addresses. In the example the mathematical production yields 8, but the result
is not assigned to a uniquely identified symbol. The call to myfunction yields a
compiler error as myfunction is not declared anywhere in the program source, so
  4613. there is no way for the compiler to know what code the programmer actually
  4614. wishes to call. This explains only why we need symbol identification but does
  4615. not yet tell anything practical about the subject.
  4616. 9.2 Scoping
  4617. We would first like to introduce scoping. What actually is scoping? In Webster’s
  4618. Revised Unabridged Dictionary (1913) a scope is defined as:
  4619. Room or opportunity for free outlook or aim; space for action; am-
  4620. plitude of opportunity; free course or vent; liberty; range of view,
  4621. intent, or action.
  4622. When discussing scoping in the context of a programming language the de-
  4623. scription comes closest to Webster’s range of view. A scope limits the view a
  4624. statement or expression has when it comes to other symbols. Let us illustrate
  4625. this with an example in which every block, delimited by { and }, results in a
  4626. new scope.
  4627. Example 9.2 (A simple scope example)
module example;             // begin scope (global)

int a = 4;
int b = 3;

start main: void → void
{                           // begin scope (main)
    float a = 0;

    {                       // begin scope (free block)
        char a = 'a';
        int x;
        print( a );
    }                       // end of scope (free block)

    x = 1;                  // x is not in range of view!
    print( b );
}                           // end of scope (main)
                            // end of scope (global)
  4644. ?
  4645. Example 9.2 contains an Inger program with 3 nested scopes containing
variable declarations. Note that there are 3 declarations for variables named a,
which is perfectly legal as each declaration is made in a scope that did not yet
contain a symbol called 'a' of its own. Inger only allows referencing of symbols
in the local scope and (grand)parent scopes. The expression x = 1; is illegal
since x was declared in a scope that is best described as a nephew scope.
Using identification in scopes enables us to declare a symbol name multiple
times in the same program, but in different scopes. This implies that a symbol
is unique in its own scope, though not necessarily in the program; this way we do
not run out of useful variable names within a program.
  4656. Now that we know what scoping is, it is probably best to continue with some
  4657. theory on how to store information about these scopes and their symbols during
  4658. compile time.
  4659. 9.3 The Symbol Table
  4660. Symbols collected during parsing must be stored for later reference during se-
mantic analysis. In order to have access to the symbols in a later stage,
  4662. it is important to define a clear data structure in which we store the necessary
  4663. information about the symbols and the scopes they are in. This data structure
  4664. is called a Symbol Table and can be implemented in a variety of ways (e.g. ar-
  4665. rays, linked lists, hash tables, binary search trees & n-ary search trees). Later
  4666. on in this chapter we will discuss what data structure we considered best for
  4667. our symbol table implementation.
  4668. 9.3.1 Dynamic vs. Static
  4669. There are two possible types of symbol tables, dynamic or static symbol tables.
  4670. What exactly are the differences between the two types, and why is one bet-
  4671. ter than the other? A dynamic symbol table can only be used when both the
  4672. gathering of symbol information and usage of this information happen in one
pass. It works in a stack-like manner: a symbol is pushed on the stack when it
  4674. is encountered. When a scope is left, we pop all symbols which belong to that
  4675. scope from the stack. A static table is built once and can be walked as many
  4676. times as required. It is only deconstructed when the compiler is finished.
  4677. Example 9.3 (Example of a dynamic vs. static symbol table)
 1 module example;
 2
 3 int v1, v2;
 4
 5 f: int v1, v2 → int
 6 {
 7     return (v1 + v2);
 8 }
 9
10 start g: int v3 → int
11 {
12     return (v1 + v3);
13 }
The following example illustrates how a dynamic table grows and
shrinks over time.
  4691. After line 3 the symbol table is a set of symbols
  4692. T = {v1,v2}
  4693. After line 5 the symbol table also contains the function f and the local variables
  4694. v1 and v2
  4695. T = {v1,v2,f,v1,v2}
  4696. At line 9 the symbol table is back to its global form
  4697. T = {v1,v2}
  4698. After line 10 the symbol table is expanded with the function g and the local
  4699. variable v3
  4700. T = {v1,v2,g,v3}
Next we illustrate how a static table's set of symbols only grows.
  4702. After line 3 the symbol table is a set of symbols
  4703. T = {v1,v2}
  4704. After line 5 the symbol table also contains the function f and the local variables
  4705. v1 and v2
  4706. T = {v1,v2,f,v1,v2}
  4707. After line 10 the symbol table is expanded with the function g and the local
  4708. variable v3
  4709. T = {v1,v2,f,v1,v2,g,v3}
  4710. ?
In earlier stages of our research we assumed local symbols should only be
present in the symbol table while the scope they are declared in is being processed
(this includes, of course, children of that scope). After all, storing a symbol
that is no longer accessible seems pointless. This assumption originated when
  4715. we were working on the idea of a single-pass compiler (tokenizing, parsing, se-
  4716. mantics and code generation all in one single run) so for a while we headed
  4717. in the direction of a dynamic symbol table. Later on we decided to make the
  4718. compiler multi-pass which resulted in the need to store symbols over a longer
  4719. timespan. Using multiple passes, the local symbols should remain available for
  4720. as long as the compiler lives as they might be needed every pass around, thus
  4721. we decided to switch to a static symbol table.
  4722. When using a static symbol table, the table will not shrink but only grow.
  4723. Instead of building the symbol table during pass 1 which happens when using a
  4724. dynamic table, we will construct the symbol table from the AST. The AST will
  4725. be available after parsing the source code.
  4727. 9.4 Data structure selection
  4728. 9.4.1 Criteria
  4729. In order to choose the right data structure for implementing the symbol table
  4730. we look at its primary goal and what the criteria are. Its primary goal is storing
symbol information and providing a fast lookup facility, for easy access to all the
stored symbol information.
Since we had only a short period of time to develop our language Inger and
the compiler, the only criterion we had in choosing a suitable data structure was
that it be easy to use and implement.
  4736. 9.4.2 Data structures compared
  4737. One of the first possible data structures which comes to mind when thinking of
  4738. how to store a list of symbols is perhaps an array. Although an array is very
  4739. convenient to store symbols, it has quite a few practical limitations. Arrays are
  4740. defined with a static size, so chances are you define a symbol table with array
  4741. size 256, and end up with 257 symbols causing a buffer overflow (an internal
  4742. compiler error). Writing beyond an array limit will produce unpredictable re-
  4743. sults. Needless to say, this situation is undesirable. Searching an array is not
  4744. efficient at all due to its linear searching algorithm. A binary search algorithm
  4745. could be applied, but this would require the array to be sorted at all times.
  4746. For sorted arrays, searching is a fairly straightforward operation and easy to
implement. It is also notable that an array would probably only be usable
either with a dynamic table or if no scoping were allowed.
  4749. Figure 9.1: Array
  4750. If the table would be implemented as a stack, finding a symbol is a simple
  4751. matter of searching for the desired symbol from the top of the stack. The first
  4752. variable found is automatically the last variable added and thus the variable in
  4753. the nearest scope. This implementation makes it easy to use multi-level scop-
  4754. ing, but is a heavy burden on performance as with every search the stack has
  4755. to be deconstructed and stored on a second stack (until the first occurrence
  4756. is found) and reconstructed, a very expensive operation. This implementation
  4757. would probably be the best for a dynamic table if it were not for its expensive
  4758. search operations.
Figure 9.2: Stack

Another dynamic implementation would be a linked list of symbols. If implemented
as an unsorted doubly linked list, it would be possible to use it just
like the stack (append to the back and search from the back) without the dis-
  4764. advantage of a lot of push and pop operations. A search still takes place in
  4765. a linear time frame but the operations themselves are much cheaper than the
  4766. stack implementation.
  4767. Figure 9.3: Linked List
  4768. Binary search trees improve search time massively, but only in sorted form
  4769. (an unsorted tree after all, is not a tree at all). This results in the loss of an
  4770. advantage the stack and double linked list offered: easy scoping. Now the first
  4771. symbol found is not per definition the latest definition of that symbol name.
In fact, it is probably the first occurrence. This means that the search is not
  4773. complete until it is impossible to find another occurrence. This also means that
  4774. we have to include some sort of scope field with every symbol to separate the
  4775. symbols: (a,1) and (a,2) are symbols of the same name, but a is in a higher
  4776. scope and therefore the correct symbol. Another big disadvantage is that when
  4777. a function is processed we need to rid the tree of all symbols in that function’s
  4778. scope. This requires a complete search and rebalancing of the tree. Since the
tree is sorted by string value, every operation (insert, search, etc.) is quite
expensive. These operations could be made more time efficient by using a hash
algorithm, as explained in the next paragraph.
  4782. String comparisons are relatively heavy compared to comparison of simple
types such as integers or bytes, so using a hash algorithm to convert symbol
names to simple types would speed up all operations on the tree considerably.
The last option we discuss is the n-ary tree. Every node has n children, each
of which implies a new scope as a child of its parent scope. Every node is a
  4787. scope and all symbols in that scope are stored inside that node. When the AST
  4788. is walked, all the code has to do is make sure that the symbol table walks along.
  4789. Then when information about a symbol is requested, we only have to search the
  4790. current scope and its (grand) parents. This seems in our opinion to be the only
  4791. valid static symbol table implementation.
  4793. Figure 9.4: Binary Tree
  4794. Figure 9.5: N-ary Tree
  4795. 9.4.3 Data structure selection
We think an n-ary tree combined with linked lists is a suitable
solution for us. Initially we thought using just one linked list was a good idea.
Each list node, representing a scope, would contain the root node of a binary
tree. The major advantage of this approach is that adding and removing a scope
is easy and fast. This advantage was based on the idea that the symbol table
would grow and shrink during the first pass of compilation, which means that
the symbol table is not available anymore after the first pass. This is not what
we want, since the symbol table must be available at all times after the first
pass; we therefore favour a new data structure that is less efficient in removing
scopes (we do not remove scopes anyway) but faster in looking up symbols in
the symbol table.
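In C, such a structure might be sketched as follows; the names and node layout are illustrative, not the actual Inger sources.

#include <string.h>

/* One symbol in a scope; Type is the structure of section 9.5. */
typedef struct Symbol
{
    char          *name;
    struct Type   *type;
    struct Symbol *next;     /* next symbol in the same scope */
} Symbol;

/* One node of the n-ary scope tree. */
typedef struct Scope
{
    Symbol       *symbols;   /* linked list of local symbols    */
    struct Scope *parent;    /* enclosing scope; NULL = global  */
    struct Scope *children;  /* first child scope               */
    struct Scope *sibling;   /* next child of the parent scope  */
} Scope;

/* Look a name up in the current scope and its (grand)parents.
 * Nephew scopes are never searched, as required by Inger. */
Symbol *FindSymbol( Scope *scope, const char *name )
{
    Scope  *s;
    Symbol *sym;

    for( s = scope; s != NULL; s = s->parent )
    {
        for( sym = s->symbols; sym != NULL; sym = sym->next )
        {
            if( strcmp( sym->name, name ) == 0 )
            {
                return sym;
            }
        }
    }
    return NULL;   /* not in range of view */
}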
  4807. 9.5 Types
The symbol table data structure is not enough; it is just a tree and should be
decorated with symbol information like (return) types and modifiers. To store
this information correctly we designed several logical structures for symbols and
types. It basically comes down to a set of functions which wrap a Type structure.
These functions are, for example, CreateType(), AddSimpleType(), AddDimension(),
AddModifier(), etc. There is a similar set of accessor functions.
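Based purely on the function names mentioned above, building the type for a declaration such as extern int a[10] might look roughly like this; the exact signatures and the constant names are assumptions, since they are not shown in this chapter.

/* Illustrative use of the Type wrapper functions (assumed API). */
Type *type = CreateType();               /* fresh, empty type record */
AddSimpleType( type, TYPE_INT );         /* base type: int           */
AddDimension( type, 10 );                /* array dimension [10]     */
AddModifier( type, MODIFIER_EXTERN );    /* storage modifier         */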
  4814. 9.6 An Example
  4815. To illustrate how the symbol table is filled from the Abstract Syntax Tree we
  4816. show which steps have to be taken to fill the symbol table.
1. Start walking at the root node of the AST in a pre-order fashion.
2. For each block we encounter, we add a new child to the current scope and
make this child the new current scope.
3. For each variable declaration found, we extract:
- Variable name
- Variable type ¹
4. For each function found, we extract:
- Function name
- Function types, starting with the return type ¹
5. After the end of a block is encountered, we move back to the parent scope.
A sketch of this walk is given below.
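The sketch below outlines the walk in C; the AST node kinds and the AddChildScope/AddSymbol helpers are assumptions for illustration (the Scope structure is the one sketched in section 9.4.3).

/* Sketch of filling the symbol table from the AST. */
void BuildSymbolTable( AstNode *node, Scope *scope )
{
    AstNode *child;

    switch( node->kind )
    {
    case NODE_BLOCK:
        /* Step 2: a new child scope becomes the current scope. */
        scope = AddChildScope( scope );
        break;
    case NODE_DECLARATION:
        /* Step 3: record variable name and type. */
        AddSymbol( scope, node->name, node->type );
        break;
    case NODE_FUNCTION:
        /* Step 4: record function name and its types. */
        AddSymbol( scope, node->name, node->type );
        break;
    default:
        break;
    }

    /* Pre-order: visit children after their parent. */
    for( child = node->firstChild; child != NULL; child = child->sibling )
    {
        BuildSymbolTable( child, scope );
    }
    /* Step 5: leaving the block restores the parent scope; this
     * happens automatically because [scope] is a local copy. */
}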
  4828. To conclude we will show a simple example.
  4829. Example 9.4 (Simple program to test expressions)
module test_module;

int z = 0;

inc: int a → int
{
    return( a + 1 );
}

start main: void → void
{
    int i;
    i = (z * 5) / 10 + 20 - 5 * (132 + inc( 3 ));
}
  4841. ?
We can distinguish the following steps in parsing the example source code.
1. found z, add symbol to current scope (global)
2. found inc, add symbol to current scope (global)
3. enter a new scope level as we now parse the function inc
4. found the parameter a, add this symbol to the current scope (inc)
5. as no new symbols are encountered, leave this scope
6. found main, add symbol to current scope (global)
7. enter a new scope level as we now parse the function main
8. found i, add symbol to current scope (main)
9. as no new symbols are encountered, leave this scope
After these steps our symbol table will look like this.
¹ for every type we also store optional information such as modifiers (start, extern, etc.)
and dimensions (for pointers and arrays)
  4856. Figure 9.6: Conclusion
  4858. Chapter 10
  4859. Type Checking
  4860. 10.1 Introduction
Type checking is part of the semantic analysis. The purpose of type checking
is to evaluate each operator, application and return statement in the AST (Abstract
Syntax Tree) and search for its operands or arguments. The operands
or arguments must both be of compatible types and form a valid combination
with the operator. For instance: when the operator is +, the left operand is an
integer and the right operand is a char pointer, it is not a valid addition. You
cannot add a char pointer to an integer without explicit coercion.
  4868. The type checker evaluates all nodes where types are used in an expression
  4869. and produces an error when it cannot find a decent solution (through coercion)
  4870. to a type conflict.
Type checking is one of the last steps to detect semantic errors in the
source code. Afterwards, a few semantic checks remain before code generation
can commence.
This chapter discusses the process of type checking: how to modify the AST
by including type info and how to produce proper error messages when necessary.
  4876. 10.2 Implementation
The process of type checking consists of the following steps:
  4878. • Decorate the AST with types for literals.
  4879. • Propagate these types up the tree taking into account:
  4881. – Type correctness for operators
  4882. – Type correctness for function arguments
  4883. – Type correctness for return statements
– If types do not match in their simple form (int, float, etc.), try to
coerce these types.
• Perform a last type check to make sure indirection levels are correct (e.g.
assigning an int to a pointer variable).
  4888. 10.2.1 Decorate the AST with types
  4889. To decorate the AST with types it is advisable to walk post-order through the
  4890. AST and search for all literal identifiers or values. When a literal is found in
  4891. the AST the type must be located in the symbol table. Therefore it is necessary
  4892. when walking through the AST to keep track of the current scope level. The
  4893. symbol table provides the information of the literal (all we are interested in is
  4894. the type) and this will be stored in the AST.
  4895. The second step is to move up in the AST and evaluate types for unary,
  4896. binary and application nodes.
Example 10.1 illustrates the process of expanding the tree. It shows the AST
decoration process for the expression a = b + 1;. The variables a and b are both
declared as integers.
  4900. Example 10.1 (Decorating the AST with types.)
  4901. Figure 10.1: AST type expanding
The nodes a, b and 1 are the literals. These are the first nodes we encounter
when walking post-order through the tree. The second part is to determine the
types of node + and node =. After we have passed the literals b and 1 we arrive
at node +. Because we have already determined the type of its left and right
child, we can evaluate its type. In this case the outcome (further referred to in
the text as the result type) is easy. Because node b and 1 are both of the type
int, node + will also become an int.
  4909. Because we are still walking post-order through the AST we finally arrive
  4910. at node = . The right and left child are also both integers so this node will also
  4911. become an integer.
  4912. ?
The advantage of walking post-order through the AST is that all the type
checking can be done in one pass. If you were to walk pre-order through the
AST, it would be advisable to decorate the AST with types in two passes. The
first pass should walk pre-order through the AST and decorate only the literal
nodes, and the second pass, which also walks pre-order through the AST, evaluates
the parent nodes of the literals. This cannot be done in one pass because
  4920. the first time walking pre-order through the AST you will first encounter the =
  4921. node. When you try to evaluate its type you will find that the children do not
  4922. have a type.
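In C, the post-order decoration pass could be sketched like this; the node kinds, field names and the EvaluateType helper are illustrative assumptions, not the actual Inger sources.

/* Sketch of post-order type decoration of the AST. */
void DecorateTypes( AstNode *node, Scope *scope )
{
    AstNode *child;

    /* Post-order: decorate all children first. */
    for( child = node->firstChild; child != NULL; child = child->sibling )
    {
        DecorateTypes( child, scope );
    }

    if( node->kind == NODE_LITERAL_IDENTIFIER )
    {
        /* Look the literal's type up in the symbol table. */
        Symbol *sym = FindSymbol( scope, node->name );
        node->type = sym->type;
    }
    else if( node->kind == NODE_BINARY_OP )
    {
        /* Both children have a type by now; combine them using
         * the conversion priority table (table 10.1). */
        node->type = EvaluateType( node->op,
                                   node->firstChild->type,
                                   node->firstChild->sibling->type );
    }
}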
  4923. The above example was easy; all the literals were integers so the result type
will also be an integer. But what would happen if one of the literals were a float
and the others were all integers?
One way of dealing with this problem is to create a table with conversion
priorities. When, for example, a float and an int are encountered, the entry with
the highest priority wins. These priorities can be found in the table for each
operator. For an example of this table, see table 10.1. In this table the binary
operators assign (=) and add (+) are included. The final version of this table
has all binary operators implemented. The same goes for all unary operators,
like the not (!) operator.
Node             Type
NODE ASSIGN      FLOAT
NODE ASSIGN      INT
NODE ASSIGN      CHAR
NODE BINARY ADD  FLOAT
NODE BINARY ADD  INT

Table 10.1: Conversion priorities
  4939. The highest priority is on top for each operator.
  4940. The second part is to make a list of types which can be converted for each
  4941. operator. In the Inger language it is possible to convert an integer to a float
  4942. but the conversion from integer to string is not possible. This table is called the
  4943. coercion table. For an example see table 10.2.
From type  To type  New node
INT        FLOAT    NODE INT TO FLOAT
CHAR       INT      NODE CHAR TO INT
CHAR       FLOAT    NODE CHAR TO FLOAT

Table 10.2: Coercion table
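One possible C encoding of these two tables, as a sketch; the actual representation in the Inger compiler is not shown in this chapter, so all names here are illustrative.

#include <stddef.h>

typedef enum { TYPE_CHAR, TYPE_INT, TYPE_FLOAT } SimpleType;

/* Conversion priorities for the add (+) operator, highest first,
 * as in table 10.1. */
static const SimpleType addPriority[] = { TYPE_FLOAT, TYPE_INT };

/* One row of the coercion table (table 10.2). */
typedef struct
{
    SimpleType  from, to;
    const char *coercionNode;   /* AST node inserted on coercion */
} Coercion;

static const Coercion coercions[] =
{
    { TYPE_INT,  TYPE_FLOAT, "NODE_INT_TO_FLOAT"  },
    { TYPE_CHAR, TYPE_INT,   "NODE_CHAR_TO_INT"   },
    { TYPE_CHAR, TYPE_FLOAT, "NODE_CHAR_TO_FLOAT" },
};

/* Returns the coercion node name for a [from] -> [to] conversion,
 * or NULL when the conversion is not allowed (e.g. int to string). */
static const char *FindCoercion( SimpleType from, SimpleType to )
{
    size_t i;
    for( i = 0; i < sizeof coercions / sizeof coercions[0]; i++ )
    {
        if( coercions[i].from == from && coercions[i].to == to )
        {
            return coercions[i].coercionNode;
        }
    }
    return NULL;
}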
A concrete example is given in example 10.2. It shows the AST for the
expression a = b + 1.0;. The variable a is declared as a float and b is
declared as an integer. The literal 1.0 is also a float.
  4952. Example 10.2 (float versus int.)
The literals a, b and 1.0 are all looked up in the symbol table. The variable
a and the literal 1.0 are both floats. The variable b is an integer.

Figure 10.2: Float versus int

Because we are walking post-order through the AST, the first operator we encounter is the
+. Operator + has an integer as its left child, and its right child is a float. Now
it is time to use the lookup table to find out what type the + operator must
be. It appears that the first entry for the operator + in the lookup table is of
the type float. This type has the highest priority. Because one of the two types
is also a float, the result type for the operator + will be a float.
It is still necessary to check whether the other child can be converted to a float.
If not, an error message should appear on the screen.
The second operator is the operator =. This will be exactly the same process
as for the + operator. The left child (a) is of type float and the right child + is of
type float, so operator = will also become a float.
However, what would happen if the left child of the assignment operator =
were an integer? Normally the result type would be looked up in table 10.1,
but in case of an assignment there is an exception: for the assignment operator
=, its left child determines the result. So if the left child is an integer, the
assignment operator will also become an integer. When you declare a variable
  4973. as an integer and an assignment takes place of which the right child differs from
  4974. the original declared type, an error must occur. It is not possible to change the
  4975. original declaration type of any variable. This is the only operator exception
  4976. you should take care of.
  4977. We just illustrated an example of what would happen if two different types
  4978. are encountered, belonging to the same operator. After the complete pass the
AST is decorated with types and finally looks like figure 10.3.
  4980. Figure 10.3: Float versus int result
  4981. ?
  4983. 10.2.2 Coercion
After the decoration of the AST is complete and all the checks are executed, the
main goal of the typechecker module is achieved. At this point it is necessary to
make a choice. There are two ways to continue: the first way is to start with the
code generation. The type checking module is in this case completely finished.
The second way is to prepare the AST for the code generation module.
  4989. In the first approach, the type checker’s responsibility is now finished, and
  4990. it is up to the code generation module to perform the necessary conversions. In
  4991. the sample source line
  4992. int b;
  4993. float a = b + 1.0;
  4994. Listing 10.1: Coercion
  4995. the code generation module finds that since a is a float, the result of b + 1.0
  4996. must also be a float. This implies that the value of b must be converted to float
in order to add 1.0 to it and return the sum. To determine that variable b must
be converted to a float, it is necessary to evaluate the expression just like the way
it is done in the typechecker module.
  5000. In the second approach, the typechecking module takes the responsibility
to convert the variable b to a float. Because the typechecker module already
decorates the AST with all types, and therefore knows which conversions must be
made, it can easily apply the conversion so the code generation module does not
have to repeat the evaluation process.
To prepare the AST for the above problem we have to apply the coercion
technique. Coercion means the conversion from one type to another. However,
it is not possible to convert any given type to any other given type. Since all
natural numbers (integers) are elements of the set N and all real numbers (floats)
are elements of the set R, the following formula applies:
N ⊂ R
To apply this theory in practice, it is necessary to modify the AST by
adding new nodes. These new nodes are the so-called coercion nodes. The best
  5013. way to explain this is by a practical example. For this example, refer to the
  5014. source of listing 10.1.
  5015. Example 10.3 (coercion)
In the first approach, where we let the code generation module take care of the
coercion technique, the AST would end up looking like figure 10.3. In the second
approach, where the typechecker module takes responsibility for the coercion
technique, the AST will have the structure shown in figure 10.4.
Notice that the position of node b is replaced by node IntToFloat and node b
has become a child of node IntToFloat. The node IntToFloat is called the coercion
node. When we arrive during the typechecker pass at node +, the left and right
child are both evaluated. Because the right child is a float and the left child
an integer, the outcome must be a float. This is determined by the type lookup
table 10.1. Since we now know the result type for node +, we can apply the
coercion technique to its children. This is only required for the child whose
type differs from its parent (node +).
  5029. Figure 10.4: AST coercion
When we find a child whose type differs from its parent's, we use the coercion
table 10.2 to check if it is possible to convert the type of the child node (node b)
to its parent's type. If this is not possible, an error message must be produced and
the compilation process will stop. When it is possible to apply the conversion,
it is required to insert a new node in the AST. This node will replace node b
and its type becomes float. Node b will be its child.
  5036. ?
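Splicing such a coercion node into the AST might be implemented along these lines; this is a sketch building on the illustrative AstNode layout used earlier, and NewAstNode is a hypothetical constructor.

/* Replace [child] of [parent] by a coercion node (for instance an
 * IntToFloat node) and make [child] its only child. */
AstNode *InsertCoercion( AstNode *parent, AstNode *child, int coercionKind )
{
    AstNode *coerce = NewAstNode( coercionKind );   /* hypothetical ctor */

    coerce->type       = parent->type;   /* e.g. float for IntToFloat */
    coerce->sibling    = child->sibling;
    coerce->firstChild = child;
    child->sibling     = NULL;

    /* Hook the coercion node into the parent's child list at the
     * position previously occupied by [child]. */
    if( parent->firstChild == child )
    {
        parent->firstChild = coerce;
    }
    else
    {
        AstNode *prev = parent->firstChild;
        while( prev->sibling != child )
        {
            prev = prev->sibling;
        }
        prev->sibling = coerce;
    }
    return coerce;
}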
10.3 Overview
Now all the steps for the typechecker module are completed. The AST is
decorated with types and prepared for the code generation module. Example 10.4
gives a complete display of the AST before and after the type checking pass.
Example 10.4 (AST decoration)
Consult the sample Inger program in listing 10.2. The AST before decoration
is shown in figure 10.5; notice that all types are unknown (no type). The AST
after the decoration is shown in figure 10.6.
  5045. ?
  5046. 10.3.1 Conclusion
Typechecking is the most important part of the semantic analysis. When the
typechecking is completed there could still be some errors in the source, for
example:
• unreachable code: statements are located after a return keyword. These
statements will never be executed;
module example;

start f: void → float
{
    float a = 0;
    int b = 0;
    a = b + 1.0;
    return (a);
}
  5061. Listing 10.2: Sample program listing
• when the function header is declared with a return type other than void,
the return keyword must exist in the function body. It will not be checked
whether the return type is valid; this already took place in the typechecker pass;
• check for double case labels in a switch;
• lvalue check: when an assignment = is located in the AST, its left child
cannot be a function. This is a rule we applied for the Inger language; other
languages may allow this;
• when a goto statement is encountered, the label which the goto points at
must exist;
• function parameter count: when a function is declared with two parameters
(return type excluded), the call to the function must also have two
parameters.
  5074. All these small checks are also part of the semantic analysis and will be discussed
in the next chapter. After these checks are performed, the code generation can
finally take place.
  5078. Figure 10.5: AST before decoration
  5080. Figure 10.6: AST after decoration
  5082. Bibliography
  5083. [1] A.B. Pyster: Compiler Design and Construction, Van Nostrand Reinhold
  5084. Company, 1980
  5085. [2] G. Goos, J. Hartmanis: Compiler Construction - An Advanced Course,
  5086. Springer-Verlag, Berlin, 1974
  5088. Chapter 11
Miscellaneous Semantic Checks
  5091. 11.1 Left Hand Values
An lvalue, short for left hand value, is an expression or identifier reference that
can be placed on the left hand side of an assignment. The lvalue check is one of
the necessary checks in the semantic stage. An lvalue check makes sure that no
invalid assignments occur in the source code. Examples 11.1 and 11.2 show
which lvalues are valid in the Inger compiler and which are not.
  5097. Example 11.1 (Invalid Lvalues)
  5098. function() = 6;
  5099. 2 = 2;
  5100. ”somestring” = ”somevalue”;
  5101. ?
What makes a valid lvalue? An lvalue must be a modifiable entity. One could
define the invalid lvalues and check for them; in our case it is better to check
for the lvalues that are valid, because this list is much shorter.
  5105. Example 11.2 (Valid Lvalues)
  5107. int a = 6;
  5108. name = ”janwillem”;
  5109. ?
  5110. 11.1.1 Check Algorithm
To check the validity of the lvalues we need a filled AST (Abstract Syntax Tree)
in order to have access to all elements of the source code. To get a better
grasp of the checking algorithm, have a look at the pseudo code in example 11.3.
This algorithm results in a list of error messages, if any.
  5115. Example 11.3 (Check Algorithm)
Start at the root of the AST
for each node found in the AST do
    if the node is an ’=’ operator then
        check its leftmost child in the AST,
        which is the lvalue, and see if this is
        a valid one.
        if invalid report an error
        else go to the next node
    else go to the next node
  5126. ?
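As an illustration only, this C fragment sketches how such a traversal might look. The node layout, and the simplification that only identifiers are valid lvalues, are assumptions made for the sketch; they are not the actual Inger compiler interface.

#include <stdio.h>
#include <string.h>

typedef struct Node
{
    const char *kind;       /* e.g. "=", "identifier", "call"    */
    int childCount;
    struct Node **child;
} Node;

/* In this sketch, only identifiers are modifiable entities.     */
static int IsValidLvalue( const Node *n )
{
    return strcmp( n->kind, "identifier" ) == 0;
}

/* Walk the AST; for every '=' node, validate its leftmost child. */
void CheckLvalues( const Node *node )
{
    if( node == NULL ) return;
    if( strcmp( node->kind, "=" ) == 0
        && !IsValidLvalue( node->child[0] ) )
        fprintf( stderr, "error: invalid lvalue in assignment\n" );
    for( int i = 0; i < node->childCount; i++ )
        CheckLvalues( node->child[i] );
}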
Not all lvalues are as straightforward as they seem. A valid but bizarre
example of a semantically correct assignment is:
  5129. Example 11.4 (Bizarre Assignment)
  5130. int a[20];
  5131. int b = 4;
  5132. a = a ∗ b;
  5133. ?
We chose to make this a valid assignment in Inger, to provide some
address arithmetic possibilities. The code in example 11.4 multiplies the base
address of the array by the value of identifier b. Let’s say that the base address of
array a was initially 0x2FFFFFA0 ; then the new base address of a will be 0xBFFFFE80 .
  5139. 11.2 Function Parameters
This section covers argument count checking. Amongst other things, function
parameters must be checked before we actually start generating code. Apart
from checking the use of a correct number of function arguments in function
calls and the occurrence of multiple definitions of the main function, we also
check whether the passed arguments are of the correct type. Argument type
checking is explained in chapter 10.
The idea of checking the number of arguments passed to a function is pretty
straightforward. The check consists of two steps: firstly, we collect all the
function header nodes from the AST and store them in a list. Secondly, we
compare the number of arguments used in each function call to the number of
arguments required by each function and check that the numbers match.
To build a list of all nodes that are function headers we make a pass through
the AST and collect all nodes that are of type NODE_FUNCTIONHEADER , and
put them in a list structure provided by the generic list module. It is faster to
go through the AST once and build a list of the nodes we need, than to make a
pass through the AST to look for a node each time we need it. After building
the list for the example program 11.5 it will contain the header nodes for the
functions main and AddOne .
  5158. Example 11.5 (Program With Function Headers)
module example;

start main : void → void
{
    AddOne( 2 );
}

AddOne : int a → int
{
    return( a + 1 );
}
  5168. ?
The next step is to do a second pass through the AST and look for nodes
of type NODE_APPLICATION , which represent a function call in the source code.
When such a node is found, we first retrieve the actual number of arguments
passed in the function application with the helper function GetArgumentCount-
FromApplication . Secondly, we get the number of arguments as defined in the
function declaration; to do this we use the function GetArgumentCount . Then
it is just a matter of comparing the number of arguments we expect to the
number of arguments we found. We print an error message when a function
was called with too many or too few arguments.
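The comparison itself could look like the C sketch below. The two GetArgumentCount functions are the helpers named in the text above (declared here on the assumption of simple integer-returning signatures); the FindHeaderForApplication helper and the opaque Node type are hypothetical glue added only for the sketch.

#include <stdio.h>

typedef struct Node Node;

/* Helpers described in the text; signatures assumed here.       */
extern int GetArgumentCountFromApplication( Node *application );
extern int GetArgumentCount( Node *header );
/* Hypothetical: find the matching header in the collected list. */
extern Node *FindHeaderForApplication( Node *application );

/* Compare the argument count of one function call against the
   count required by the function's declaration.                 */
void CheckArgumentCount( Node *application )
{
    Node *header  = FindHeaderForApplication( application );
    int found     = GetArgumentCountFromApplication( application );
    int expected  = GetArgumentCount( header );
    if( found != expected )
        fprintf( stderr,
                 "error: function called with %d argument(s), "
                 "expected %d\n", found, expected );
}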
  5179. 11.3 Return Keywords
The typechecking mechanism of the Inger compiler checks whether a function returns
the right type when a function return value is assigned to a variable.
  5182. Example 11.6 (Correct Variable Assignment)
  5183. int a;
  5184. a = myfunction();
  5185. ?
The source code in example 11.6 is correct and implies that the function
myfunction returns a value. As in most programming languages, we introduced
a return keyword in our language, and we define the following semantic rules:
unreachable code and non-void function returns (definitions 11.1 and 11.2).
  5190. 11.3.1 Unreachable Code
  5191. Definition 11.1 (Unreachable code)
  5192. Code after a return keyword in Inger source will not be executed. A
  5193. warning for this unreachable code will be generated.
  5194. ?
For this check we run over the AST pre-order and check each code block for
the return keyword. If the child node containing the return keyword is not the
last child node in the code block, the remaining statements are unreachable.
An example of unreachable code can be found in example 11.7, in which function
print takes an integer as parameter and prints it to the screen. The statement
‘ print( 2 ) ’ will never be executed, since the function main returns before the
second print statement is reached.
  5202. Example 11.7 (Unreachable Code)
start main : void → int
{
    int a = 8;
    if ( a == 8 )
    {
        print ( 1 );
        return( a );
        print ( 2 );
    }
}
  5213. ?
Unreachable code is useless but not a problem, and the compilation
process can continue; therefore only a warning message is printed.
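A minimal sketch of this check in C, assuming a hypothetical node layout in which a code block stores its statements as children:

#include <stdio.h>
#include <string.h>

typedef struct Node
{
    const char *kind;       /* e.g. "block", "return", "="       */
    int childCount;
    struct Node **child;
} Node;

/* Warn if a return is followed by more statements in the same
   code block; the whole AST is traversed pre-order.             */
void CheckUnreachableCode( const Node *node )
{
    if( node == NULL ) return;
    if( strcmp( node->kind, "block" ) == 0 )
        for( int i = 0; i < node->childCount - 1; i++ )
            if( strcmp( node->child[i]->kind, "return" ) == 0 )
                fprintf( stderr,
                         "warning: unreachable code after return\n" );
    for( int i = 0; i < node->childCount; i++ )
        CheckUnreachableCode( node->child[i] );
}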
  5217. 11.3.2 Non-void Function Returns
  5218. Definition 11.2 (Non-void function returns)
  5219. The last statement in a non-void function should be the keyword
  5220. ’return’ in order to return a value. If the last statement in a non-
  5221. void function is not ’return’ we generate a warning ‘control reaches
  5222. end of non-void function’.
  5223. ?
It is nice that unreachable code is detected, but it is not essential to the next
phase of the compilation process. Non-void function returns, on the contrary,
have a greater impact. Functions that should return a value but never do can
result in an erroneous program. In example 11.8, variable a is assigned the
result value of function myfunction , but the function myfunction never returns a
value.
  5230. Example 11.8 (Non-void Function Returns)
module functionreturns;

start main : void → void
{
    int a;
    a = myfunction();
}

myfunction : void → int
{
    int b = 2;
}
  5241. ?
To make sure all non-void functions return, we check for the return keyword,
which should be in the function code block. As with most semantic checks,
we go through the AST pre-order and search all function code blocks for the
return keyword. When a function has a return statement in an if-then-else
statement, both the then and the else block should contain the return keyword,
because the code blocks are executed conditionally. The same goes for a switch
block: all case blocks should contain a return statement. All non-void functions
without a return keyword generate a warning.
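The conditional cases make this check naturally recursive. The sketch below (again with a hypothetical node layout) returns nonzero only when every execution path through a statement ends in a return:

#include <string.h>

typedef struct Node
{
    const char *kind;       /* "block", "return", "if", ...      */
    int childCount;
    struct Node **child;    /* for "if": condition, then, else   */
} Node;

/* Does this statement guarantee a return on every path?         */
int AlwaysReturns( const Node *n )
{
    if( n == NULL ) return 0;
    if( strcmp( n->kind, "return" ) == 0 )
        return 1;
    if( strcmp( n->kind, "if" ) == 0 && n->childCount == 3 )
        return AlwaysReturns( n->child[1] )   /* then block      */
            && AlwaysReturns( n->child[2] );  /* else block      */
    if( strcmp( n->kind, "block" ) == 0 )
    {
        for( int i = 0; i < n->childCount; i++ )
            if( AlwaysReturns( n->child[i] ) )
                return 1;
    }
    return 0;
}

A non-void function whose body does not satisfy AlwaysReturns would then receive the warning from definition 11.2.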
  5250. 11.4 Duplicate Cases
Generally a switch statement has one or more case blocks. It is syntactically
correct to define multiple code blocks with the same case value, so-called duplicate
case values. If duplicate case values occur, it might not be clear which code
block is executed; this is a choice you must make as a compiler builder.
The semantic check in Inger generates a warning when a duplicate case value is
found, and code is generated for the first case code block. We chose to generate a
warning instead of an error message because the multi-value construction still
allows us to go to the next phase in compilation: code generation. Example
program 11.9 will have the output

This is the first case block

because we chose to generate code for the first code block definition for
duplicate case value 0 .
  5264. Example 11.9 (Duplicate Case Values)
/* Duplicate cases
 * A program with duplicate case values
 */
module duplicate_cases;

start main : void → void
{
    int a = 0;

    switch( a )
    {
        case 0
        {
            printf ( ”This is the first case block” );
        }
        case 0
        {
            printf ( ”This is the second case block” );
        }
        default
        {
            printf ( ”This is the default case” );
        }
    }
}
  5292. ?
The algorithm that checks for duplicate case values is pretty simple and
works recursively down the AST. It starts at the root node of the AST and
searches for NODE_SWITCH nodes. For each switch node found, we search for
duplicate children in the cases block. If any duplicates are found, a proper
warning is generated; otherwise we continue until the complete AST has been
searched. In the end this check will detect all duplicate values and report them.
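For a single switch node, the duplicate search can be as simple as the pairwise comparison below; the Case structure is a stand-in for however the real AST stores the case values.

#include <stdio.h>

typedef struct Case
{
    int value;              /* the case label value              */
} Case;

/* Compare every pair of case values in one switch statement and
   warn about each duplicate found.                              */
void CheckDuplicateCases( const Case *cases, int count )
{
    for( int i = 0; i < count; i++ )
        for( int j = i + 1; j < count; j++ )
            if( cases[i].value == cases[j].value )
                fprintf( stderr,
                         "warning: duplicate case value %d\n",
                         cases[i].value );
}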
  5300. 11.5 Goto Labels
In the Inger language we implemented the goto statement, although the use of this
statement is often considered harmful. Why exactly is goto considered harmful?
As the late Edsger Dijkstra ([3]) stated:
  5304. The go to statement as it stands is just too primitive; it is too much
  5305. an invitation to make a mess of one’s program
  5306. Despite its possible harmfulness we decided to implement it. Why? Because
  5307. it is a very cool feature. For the unaware Inger programmer we added a subtle
  5308. reminder to the keyword goto and implemented it as goto considered harmful .
As with variables, goto labels should be declared before they are used.
Since this precondition cannot be enforced using grammar rules (syntax), it must
be checked in the semantic stage. Due to a lack of time we did not implement
this semantic check, and therefore programmers and users of the Inger compiler
should be aware that jumping to undeclared goto labels may result in inexplicable
and possibly undesired program behaviour. Example code 11.10 shows the
correct way to use the goto keyword.
  5316. Example 11.10 (Goto Usage)
int n = 10;
label here;
printstr ( n );
n = n − 1;
if ( n > 0 )
{
    goto considered harmful here;
}
  5325. ?
A good implementation for this check would be to store the label declarations
in the symbol table, and then walk through the AST searching for goto
statements. The identifier in a goto statement like

goto considered harmful labelJumpHere

would be looked up in the symbol table. If the goto label is not found, an error
message is generated. Although goto is a very cool feature, be careful when using
it.
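A sketch of this proposed (but, as noted, unimplemented) check; the node layout and the symbol table lookup are hypothetical interfaces assumed for the sketch:

#include <stdio.h>
#include <string.h>

typedef struct Node
{
    const char *kind;       /* "goto", "label", ...              */
    const char *name;       /* label identifier, if any          */
    int childCount;
    struct Node **child;
} Node;

/* Hypothetical: true if a label declaration was stored earlier. */
extern int SymbolTableContainsLabel( const char *name );

/* For every goto node, verify that its target label exists.     */
void CheckGotoLabels( const Node *node )
{
    if( node == NULL ) return;
    if( strcmp( node->kind, "goto" ) == 0
        && !SymbolTableContainsLabel( node->name ) )
        fprintf( stderr, "error: undeclared goto label '%s'\n",
                 node->name );
    for( int i = 0; i < node->childCount; i++ )
        CheckGotoLabels( node->child[i] );
}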
  5334. Bibliography
  5335. [1] A. Hunt, D. Thomas: The Pragmatic Programmer, Addison Wesley, 2002
  5336. [2] Thomas A. Sudkamp: Languages And Machines Second Edition, Addison
  5337. Wesley, 1997
  5338. [3] Edsger W. Dijkstra: Go To Statement Considered Harmful
  5339. http://www.acm.org/classics/oct95/
  5341. Part IV
  5342. Code Generation
  5344. Code generation is the final step in building a compiler. After the semantic
  5345. analysis there should be no more errors in the source code. If there are still
  5346. errors then the code generation will almost certainly fail.
  5347. This part of the book contains descriptions of how the assembly output will
  5348. be generated from the Inger source code. The subjects covered in this part
  5349. include implementation (assembly code) of every operator supported by Inger,
  5350. storage of data types, calculation of array offsets and function calls, with regard
  5351. to stack frames and return values.
In the next chapter, code generation is explained at an abstract level. In the
chapter after that, code templates, we present assembly code templates
for each operation in Inger. Using templates, we can guarantee that operations
can be chained together in any order desired by the programmer, including
orders we did not expect.
  5358. Chapter 12
  5359. Code Generation
  5360. 12.1 Introduction
Code generation is the least discussed and therefore the most mystical aspect
of compiler construction in the literature. It is also not extremely difficult, but
requires great attention to detail. The approach used in the Inger compiler is to
write a template for each operation. For instance, there is a template for addition
(the + operation), a template for multiplication, dereferencing, function
calls and array indexing. All of these templates may be chained together in any
order. We can assume that the order is valid, since if the compiler gets to code
generation, the input code has passed the syntax analysis and semantic analysis
phases. Let’s take a look at a small example of using templates.
  5370. int a = ∗(b + 0x20);
Generating code for this line of Inger code involves the use of four templates.
The order of the templates required is determined by the order in which the
expression is evaluated, i.e. the order in which the tree nodes in the abstract
syntax tree are linked together. By traversing the tree post-order, the first
template applied is the template for addition, since the result of b + 0x20 must
be known before anything else can be evaluated. This leads to the following
ordering of templates:
1. Addition: calculate the result of b + 0x20 .
2. Dereferencing: find the memory location that the number between parentheses
points to. This number was, of course, calculated by the previous (inner)
template.
3. Declaration: the variable a is declared as an integer, either on the stack
(if it is a local variable) or on the heap (if it is a global variable).
4. Assignment: the value delivered by the dereferencing template is stored
in the location returned by the declaration template.
If the templates are written carefully enough (and tested well enough), we
can create a compiler that supports any ordering of templates. The question,
then, is how templates can be linked together. The answer lies in assigning one
register (in this case, eax ) as the result register. Every template stores its result in
eax , whether it is a value or a pointer. The meaning of the value stored in eax
is determined by the template that stored the value.
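To illustrate the chaining, here is a hypothetical C sketch of two template functions. The Node type, the Emit helper and the GenerateCode dispatcher (whose body is omitted) are not the actual Inger compiler interface, but the instructions they emit match the addition and dereferencing templates presented in chapter 13.

#include <stdio.h>

typedef struct Node
{
    const char *op;          /* "+", "deref", "const", ...       */
    struct Node *left, *right;
} Node;

static void Emit( const char *line ) { printf( "\t%s\n", line ); }

/* Dispatches on n->op to the template functions (body omitted). */
void GenerateCode( const Node *n );

/* Addition template: each operand leaves its result in eax.     */
void GenerateAddition( const Node *n )
{
    GenerateCode( n->left );        /* left operand -> eax       */
    Emit( "movl %eax, %ebx" );
    GenerateCode( n->right );       /* right operand -> eax      */
    Emit( "addl %ebx, %eax" );      /* eax = left + right        */
}

/* Dereferencing template: eax holds an address; load its value. */
void GenerateDereference( const Node *n )
{
    GenerateCode( n->left );        /* address -> eax            */
    Emit( "movl (%eax), %eax" );    /* value at address -> eax   */
}

Because every template both expects and delivers its operands in eax, GenerateDereference can be applied directly to the output of GenerateAddition, which is exactly what the expression above requires.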
  5393. 12.2 Boilerplate Code
Since the Inger compiler generates assembly code, it is necessary to wrap up this
code in a format that the assembler expects. We use the GNU AT&T assembler,
which uses the AT&T assembly language syntax (a syntax very similar to Intel
assembly, but with some peculiar quirks). Take a look at the following assembly
instruction, first in Intel assembly syntax:
  5399. MOV EAX, EBX
  5400. This instruction copies the value stored in the EBX register into the EAX
  5401. register. In GNU AT&T syntax:
  5402. movl %ebx, %eax
  5403. We note several differences:
  5404. 1. Register names are written lowercase, and prefixed with a percent ( % ) sign
  5405. to indicate that they are registers, not global variable names;
  5406. 2. The order of the operands is reversed. This is a most irritating property
  5407. of the AT&T assembly language syntax which is a major source of errors.
  5408. You have been warned.
3. The instruction mnemonic mov is suffixed with the size of its operands (4
bytes, long). This is similar to Intel’s BYTE PTR , WORD PTR and DWORD
PTR keywords.
  5412. There are other differences, some more subtle than others, regarding deref-
  5413. erencing and indexing. For complete details, please refer to the GNU As Man-
  5414. ual[10].
The GNU Assembler specifies a default syntax for the assembly files, at file
level. Every file has at least one data segment (designated with .data ), and
one code segment (designated with .text ). The data segment contains global
variables and string constants, while the code segment holds the actual code.
The code segment may never be written to, while the data segment is modifiable.
Global variables are declared by specifying their size, and optionally a type and
alignment. Global variables are always of type @object (as opposed to type
@function for functions). The code in listing 12.1 declares the variable a .

.data
.globl a
.align 4
.type a,@object
.size a,4
a:
    .long 0

Listing 12.1: Global Variable Declaration

It is also required to declare at least one function (the main function) as
a global label. This function is used as the program entry point. Its type is
always @function .
  5435. 12.3 Globals
  5436. The assembly code for an Inger program is generated by traversing the tree
  5437. multiple times. The first pass is necessary to find all global declarations. As
  5438. the tree is traversed, the code generation module checks for declaration nodes.
  5439. When it finds a declaration node, the symbol that belongs to the declaration is
  5440. retrieved from the symbol table. If this symbol is a global, the type information
  5441. is retrieved and the assembly code to declare this global variable is generated
  5442. (see listing 12.1). Local variables and function parameters are skipped during
  5443. this pass.
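A sketch of this first pass in C, with a hypothetical node and symbol layout; the directives printed follow listing 12.1.

#include <stdio.h>
#include <string.h>

typedef struct Symbol
{
    const char *name;
    int isGlobal;
    int size;               /* size of the type, in bytes        */
} Symbol;

typedef struct Node
{
    const char *kind;       /* e.g. "declaration"                */
    Symbol *symbol;
    int childCount;
    struct Node **child;
} Node;

/* First pass: emit a data segment declaration for each global.  */
void GenerateGlobals( const Node *node )
{
    if( node == NULL ) return;
    if( strcmp( node->kind, "declaration" ) == 0
        && node->symbol->isGlobal )
    {
        printf( ".globl %s\n", node->symbol->name );
        printf( ".align 4\n" );
        printf( ".type %s,@object\n", node->symbol->name );
        printf( ".size %s,%d\n", node->symbol->name,
                node->symbol->size );
        printf( "%s:\n\t.long 0\n", node->symbol->name );
    }
    for( int i = 0; i < node->childCount; i++ )
        GenerateGlobals( node->child[i] );
}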
  5444. 12.4 Resource Calculation
During the second pass, when the real code is generated, the implementations
for functions are also created. Before the code of a function can be generated,
the code generation module must know the location of all function parameters
and local variables on the stack. This is done by quickly scanning the body of
the function for local declarations. Whenever a declaration is found, its position
on the stack is determined and stored in the symbol itself, in the symbol
table. This way references to local variables and parameters can easily be
  5452. converted to stack locations when generating code for the function implemen-
  5453. tation. The size and location of each symbol play an important role in creating
  5454. the layout for function stack frames, later on.
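The offsets follow the stack frame layout described in section 12.6: parameters sit at positive offsets from ebp (the first at ebp+8), while locals sit below ebp in the space subtracted from esp. A sketch, assuming a hypothetical symbol structure and that every symbol occupies 4 bytes:

typedef struct Sym { const char *name; int offset; } Sym;

/* Assign ebp-relative offsets to parameters and locals.         */
void AssignStackOffsets( Sym *params, int nParams,
                         Sym *locals, int nLocals )
{
    /* first parameter at [ebp+8], next at [ebp+12], ...         */
    for( int i = 0; i < nParams; i++ )
        params[i].offset = 8 + 4 * i;

    /* first local at [ebp-4], next at [ebp-8], ...              */
    for( int i = 0; i < nLocals; i++ )
        locals[i].offset = -4 * ( i + 1 );
}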
  5456. 12.5 Intermediate Results of Expressions
  5457. The code generation module in Inger is implemented in a very simple and
  5458. straightforward way. There is no real register allocation involved, all inter-
  5459. mediate values and results of expressions are stored in the EAX register. Even
  5460. though this will lead to extremely unoptimized code – both in speed and size –
  5461. it is also very easy to write. Consider the following simple program:
/*
 * simple.i
 * Simple example program to demonstrate code generation.
 */
module simple;

extern printInt : int i → void;

int a, b;

start main : void → void
{
    a = 16;
    b = 32;
    printInt ( a ∗ b );
}
This little program translates to the following x86 assembly code, which shows
how the intermediate values and results of expressions are kept in the EAX
register:
.data
.globl a
.align 4
.type a,@object
.size a,4
a:
    .long 0
.globl b
.align 4
.type b,@object
.size b,4
b:
    .long 0
.text
.align 4
.globl main
.type main,@function
main:
    pushl %ebp
    movl %esp, %ebp
    subl $0, %esp
    movl $16, %eax
    movl %eax, a
    movl $32, %eax
    movl %eax, b
    movl a, %eax
    movl %eax, %ebx
    movl b, %eax
    imul %ebx
    pushl %eax
    call printInt
    addl $4, %esp
    leave
    ret
  5514. The eax register may contain either values or references, depending on the
  5515. code template that placed a value in eax . If the code template was, for instance,
  5516. addition , eax will contain a numeric value (either floating point or integer). If
  5517. the code template was the template for the address operator ( & ), then eax will
  5518. contain an address (a pointer).
  5519. Since the code generation module can assume that the input code is both
  5520. syntactically and semantically correct, the meaning of the value in eax does not
really matter. All the code generation module needs to do is make sure that the
  5522. value in eax is passed between templates correctly, and if possible, efficiently.
  5523. 12.6 Function calls
Function calls are executed using the Intel assembly call statement. Since the
GNU assembler is a reasonably high-level assembler, it is sufficient to supply the
name of the function being called; the linker ( ld ) will take care of the job of
filling in the correct address, assuming the function exists – but once again, we
can assume that the input Inger code is semantically and syntactically correct.
If a function really does not exist, and the linker complains about it, it is because
there is an error in a header file. The syntax of a basic function call is:
  5531. call printInt
Of course, most interesting functions take parameters. Parameters to functions
are always passed to the function using the stack. For this, Inger uses
the same paradigm that the C language uses: the caller is responsible both for
placing parameters on the stack and for removing them after the function call
has completed. The reason that Inger is so compatible with C is a practical one:
this way Inger can call C functions in operating system libraries, and we do not
need to supply wrapper libraries that call these functions. This makes the life
of Inger programmers and compiler writers a little bit easier.
Apart from parameters, functions also have local variables, and these live on
the stack too. All in all the stack is rapidly becoming complex, and we call the
order in which parameters and local variables are placed on the stack the stack
frame. As stated earlier, Inger adheres to the calling convention popularized
by the C programming language, and therefore the stack frames of the two
languages are identical.
  5547. The function being called uses the ESP register to point to the top of the
  5548. stack. The EBP register is the base pointer to the stack frame. As in C,
  5549. parameters are pushed on the stack from right to left (the last argument is
  5550. pushed first). Return values of 4 bytes or less are stored in the EAX register. For
  5551. return values with more than 4 bytes, the caller passes an extra first argument
  5552. to the callee (the function being called). This extra argument is the address
  5553. of the location where the return value must be stored (this extra argument is
  5554. the first argument, so it is the last argument to be pushed on the stack). To
  5555. illustrate this point, we give an example in C:
  5556. Example 12.1 (Stack Frame)
/* vec3 is a structure of
 * 3 floats (12 bytes). */
struct vec3
{
    float x, y, z;
};

/* f is a function that returns
 * a vec3 struct: */
struct vec3 f( int a, int b, int c );
  5566. Since the return value of the function f is more than 4 bytes, an extra
  5567. first argument must be placed on the stack, containing the address of the vec3
  5568. structure that the function returns. This means the call:
  5569. v = f ( 1, 0, 3 );
  5570. is transformed into:
  5571. f( &v , 1, 0, 3 );
  5572. ?
  5573. It should be noted that Inger does not support structures at this time, and
  5574. all data types can be handled using either return values of 4 bytes or less, which
  5575. fit in eax , or using pointers (which are also 4 bytes and therefore fit in eax ). For
  5576. future extensions of Inger, we have decided to support the extra return value
  5577. function argument.
  5578. Since functions have a stack frame of their own, the contents of the stack
  5579. frame occupied by the caller are quite safe. However, the registers used by the
  5580. caller will be overwritten by the callee, so the caller must take care to push any
  5581. values it needs later onto the stack. If the caller wants to save the eax , ecx and
  5582. edx registers, it has to push them on the stack first. After that, it pushes the
  5583. arguments (from right to left), and when the call instruction is called, the eip
  5584. register is pushed onto the stack too (implicitly, by the call instruction), which
  5585. means the return address is on top of the stack.
Although the caller does most of the work creating the stack frame (pushing
parameters on the stack), the callee still has to do several things. The stack
frame is not yet finished, because the callee must create space for local variables
(and set them to their initial values, if any). Furthermore, the callee must
save the contents of ebx , esi and edi as needed, and set esp and ebp to point to the
top and bottom of the stack, respectively. Initially, the EBP register points to
a location in the caller’s stack frame. This value must be preserved, so it must
be pushed onto the stack. The contents of esp (the bottom of the current stack
frame) are then copied into ebp , so that esp is free to do other things and to
allow arguments to be referenced as an offset from ebp . This gives us the stack
frame depicted in figure 12.1.
  5598. Figure 12.1: Stack Frame Without Local Variables
  5599. To allocate space for local variables and temporary storage, the callee just
  5600. subtracts the number of bytes required for the allocation from esp . Finally, it
  5601. pushes ebx , esi and edi on the stack, if the function overwrites them. Of course,
  5602. this depends on the templates used in the function, so for every template, its
  5603. effects on ebx , esi and edi must be known.
  5604. The stack frame now has the form shown in figure 12.2.
During the execution of the function, the stack pointer esp might go up
and down, but the ebp register is fixed, so the function can always refer to the
first argument as [ebp+8] . The second argument is located at [ebp+12] (decimal
offset), the third argument is at [ebp+16] , and so on, assuming all arguments are
4 bytes in size.
The callee is not done yet, because when execution of the function body
is complete, it must perform some cleanup operations. Of course, the caller is
responsible for cleaning up function parameters it pushed onto the stack (just
like in C), but the remainder of the cleanup is the callee’s job. The callee must:
• Store the return value in eax , or in the extra parameter;
• Restore the ebx , esi and edi registers as needed.
Restoration of the values of the ebx , esi and edi registers is performed by
popping them from the stack, where they had been stored for safekeeping earlier.
  5619. Figure 12.2: Stack Frame With Local Variables
Of course, it is important to only pop the registers that were pushed onto the
stack in the first place: some functions save ebx , esi and edi , while others do not.
The last thing to do is taking down the stack frame. This is done by moving
the contents from ebp to esp (thus effectively discarding the stack frame) and
popping the original ebp from the stack. (The i386 instruction set has an
instruction, leave , which does this exact thing.) The return ( ret ) instruction can
now be executed, which pops the return address off the stack and places it in the
eip register.
Since the stack is now exactly the same as it was before making the function
call, the arguments (and return value when larger than 4 bytes) are still on the
stack. The esp register can be restored by adding the number of bytes the
arguments use to esp .
Finally, if there were any saved registers ( eax , ecx and edx ), they must be
popped from the stack as well.
  5633. 12.7 Control Flow Structures
The code generation module handles if/then/else structures by generating comparison
code and conditional jumps. The jumps go to the labels that are generated
before the then and else blocks.
Loops are also implemented in a very straightforward manner. First, a label
is generated to jump back to at every iteration. After that, the comparison
code is generated, in exactly the same way as for if expressions. After this
the code block of the loop is generated, followed by a jump to the label right
before the comparison code. The loop is concluded with a final label where the
comparison code can jump to if the result of the expression is false.
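A sketch of the loop generation in C; NewLabel and the GenerateCode dispatcher are hypothetical helpers assumed for the sketch, and the instructions printed match the while template in the next chapter.

#include <stdio.h>

typedef struct Node Node;
extern void GenerateCode( const Node *n );  /* result in eax     */

static int labelCount = 0;
static int NewLabel( void ) { return labelCount++; }

/* Emit code for: while( condition ) do { body }                 */
void GenerateWhile( const Node *condition, const Node *body )
{
    int start = NewLabel(), end = NewLabel();

    printf( ".LABEL%d:\n", start );
    GenerateCode( condition );            /* condition -> eax    */
    printf( "\tcmpl $0, %%eax\n" );
    printf( "\tje .LABEL%d\n", end );     /* false: exit loop    */
    GenerateCode( body );
    printf( "\tjmp .LABEL%d\n", start );  /* test again          */
    printf( ".LABEL%d:\n", end );
}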
  5647. 12.8 Conclusion
  5648. This concludes the description of the inner workings of the code generation
  5649. module for the Inger language.
  5651. Bibliography
[1] O. Andrew and S. Talbott: Managing Projects with Make, O’Reilly & Associates,
Inc., December 1991.
  5654. [2] B. Brey: 8086/8088, 80286, 80386, and 80486 Assembly Language Pro-
  5655. gramming, Macmillan Publishing Company, 1994.
  5656. [3] G. Chapell: DOS Internals, Addison-Wesley, 1994.
  5657. [4] J. Duntemann: Assembly Language Step-by-Step, John Wiley & Sons, Inc.,
  5658. 1992.
  5659. [5] T. Hogan: The Programmers PC Sourcebook: Charts and Tables for the
  5660. IBM PC Compatibles, and the MS-DOS Operating System, including the
  5661. new IBM Personal System/2 computers, Microsoft Press, Redmond, Wash-
  5662. ington, 1988.
  5663. [6] K. Irvine: Assembly Language for Intel-based Computers, Prentice-Hall,
  5664. Upper Saddle River, NJ, 1999.
[7] M. L. Scott: Programming Language Pragmatics, Morgan Kaufmann Publishers,
2000.
  5667. [8] I. Sommerville: Software Engineering (sixth edition), Addison-Wesley,
  5668. 2001.
  5669. [9] W. Stallings: Operating Systems: achtergronden, werking en ontwerp, Aca-
  5670. demic Service, Schoonhoven, 1999.
  5671. [10] R. Stallman: GNU As Manual,
  5672. http://www.cs.utah.edu/dept/old/texinfo/as/as.html
  5674. Chapter 13
  5675. Code Templates
This chapter serves as a repository of code templates. These templates
are used by the compiler to generate code for common (sub)expressions.
Every template has a name and is treated in a section of its own below.
  5681. Addition
  5682. Inger
  5683. expr + expr
  5684. Example
  5685. 3 + 5
  5686. Assembler
  5687. 1. The left expression is evaluated and stored in eax .
  5688. 2. movl %eax, %ebx
  5689. 3. The right expression is evaluated and stored in eax .
  5690. 4. addl %ebx, %eax
  5691. Description
  5692. The result of the left expression is added to the result of the right
  5693. expression and the result of the addition is stored in eax .
  5695. Subtraction
  5696. Inger
  5697. expr − expr
  5698. Example
  5699. 8 − 3
  5700. Assembler
1. Left side of expression is evaluated and stored in eax .
2. movl %eax, %ebx
3. Right side of expression is evaluated and stored in eax .
4. xchgl %eax, %ebx
5. subl %ebx, %eax
  5705. Description
  5706. The result of the right expression is subtracted from the result of
  5707. the left expression and the result of the subtraction is stored in eax .
  5709. Multiplication
  5710. Inger
  5711. expr ∗ expr
  5712. Example
12 ∗ 4
  5714. Assembler
  5715. 1. Left side of expression is evaluated and stored in eax .
  5716. 2. movl %eax, %ebx
  5717. 3. Right side of expression is evaluated and stored in eax .
  5718. 4. imul %ebx
  5719. Description
  5720. The result of the left expression is multiplied with the result of the
  5721. right expression and the result of the multiplication is stored in eax .
  5723. Division
  5724. Inger
  5725. expr / expr
  5726. Example
  5727. 32 / 8
  5728. Assembler
  5729. 1. Left side of expression is evaluated and stored in eax .
  5730. 2. movl %eax, %ebx
  5731. 3. Right side of expression is evaluated and stored in eax .
  5732. 4. xchgl %eax, %ebx
5. cltd
  5734. 6. idiv %ebx
  5735. Description
  5736. The result of the left expression is divided by the result of the right
  5737. expression and the result of the division is stored in eax .
  5739. Modulus
  5740. Inger
  5741. expr % expr
  5742. Example
  5743. 14 % 3
  5744. Assembler
  5745. 1. Left side of expression is evaluated and stored in eax .
  5746. 2. movl %eax, %ebx
  5747. 3. Right side of expression is evaluated and stored in eax .
  5748. 4. xchgl %eax, %ebx
5. cltd
  5750. 6. idiv %ebx
  5751. 7. movl %edx, %eax
  5752. Description
  5753. The result of the left expression is divided by the result of the right
  5754. expression and the remainder of the division is stored in eax .
  5756. Negation
  5757. Inger
  5758. −expr
  5759. Example
  5760. −10
  5761. Assembler
  5762. 1. Expression is evaluated and stored in eax .
  5763. 2. neg %eax
  5764. Description
  5765. The result of the expression is negated and stored in the eax register.
  5767. Left Bitshift
  5768. Inger
  5769. expr << expr
  5770. Example
  5771. 256 << 2
  5772. Assembler
  5773. 1. Left side of expression is evaluated and stored in eax .
  5774. 2. movl %eax, %ecx
  5775. 3. Right side of expression is evaluated and stored in eax .
  5776. 4. xchgl %eax, %ecx
  5777. 5. sall %cl, %eax
  5778. Description
  5779. The result of the left expression is shifted n bits to the left, where n
  5780. is the result of the right expression. The result is stored in the eax
  5781. register.
  5783. Right Bitshift
  5784. Inger
  5785. expr >> expr
  5786. Example
  5787. 16 >> 2
  5788. Assembler
  5789. 1. Left side of expression is evaluated and stored in eax .
  5790. 2. movl %eax, %ecx
  5791. 3. Right side of expression is evaluated and stored in eax .
  5792. 4. xchgl %eax, %ecx
  5793. 5. sarl %cl, %eax
  5794. Description
  5795. The result of the left expression is shifted n bits to the right, where
  5796. n is the result of the right expression. The result is stored in the eax
  5797. register.
  5799. Bitwise And
  5800. Inger
  5801. expr & expr
  5802. Example
  5803. 255 & 15
  5804. Assembler
  5805. 1. Left side of expression is evaluated and stored in eax .
  5806. 2. movl %eax, %ebx
  5807. 3. Right side of expression is evaluated and stored in eax .
  5808. 4. andl %ebx, %eax
  5809. Description
  5810. The result of an expression is subject to a bitwise and operation with
  5811. the result of another expression and this is stored in the eax register.
  5813. Bitwise Or
  5814. Inger
  5815. expr | expr
  5816. Example
  5817. 13 | 22
  5818. Assembler
  5819. 1. Left side of expression is evaluated and stored in eax .
  5820. 2. movl %eax, %ebx
  5821. 3. Right side of expression is evaluated and stored in eax .
  5822. 4. orl %ebx, %eax
  5823. Description
  5824. The result of an expression is subject to a bitwise or operation with
  5825. the result of another expression and this is stored in the eax register.
  5827. Bitwise Xor
  5828. Inger
  5829. expr ˆ expr
  5830. Example
  5831. 63 ˆ 31
  5832. Assembler
  5833. 1. Left side of expression is evaluated and stored in eax .
  5834. 2. movl %eax, %ebx
  5835. 3. Right side of expression is evaluated and stored in eax .
  5836. 4. andl %ebx, %eax
  5837. Description
  5838. The result of an expression is subject to a bitwise xor operation with
  5839. the result of another expression and this is stored in the eax register.
  5841. If-Then-Else
  5842. Inger
if ( expr )
{
    // Code block
}
// The following part is optional
else
{
    // Code block
}
  5852. Example
int a = 2;
if ( a == 1 )
{
    a = 5;
}
else
{
    a = a − 1;
}
  5862. Assembler
  5863. When there is only a then block:
  5864. 1. Expression is evaluated and stored in eax .
  5865. 2. cmpl $0, %eax
  5866. 3. je .LABEL0
  5867. 4. Then code block is generated.
  5868. 5. .LABEL0:
  5869. When there is an else block:
  5870. 1. Expression is evaluated and stored in eax .
  5871. 2. cmpl $0, %eax
  5872. 3. je .LABEL0
  5873. 4. Then code block is generated.
  5874. 5. jmp .LABEL1
  5875. 6. .LABEL0:
  5876. 7. Else code block is generated.
  5877. 8. .LABEL1:
  5879. Description
  5880. This template describes an if-then-else construction. The condi-
  5881. tional code execution is realized with conditional jumps to labels.
  5882. Different templates are used for if-then and if-then-else construc-
  5883. tions.
  5885. While Loop
  5886. Inger
  5887. while( expr ) do
  5888. {
  5889. // Code block
  5890. }
  5891. Example
int i = 5;
while( i > 0 ) do
{
    i = i − 1;
}
  5897. Assembler
1. .LABEL0:
2. Expression is evaluated and stored in eax .
3. cmpl $0, %eax
4. je .LABEL1
5. Code block is generated.
6. jmp .LABEL0
7. .LABEL1:
  5904. Description
  5905. This template describes a while loop. The expression is evaluated
  5906. and while the result of the expression is true the code block is exe-
  5907. cuted.
  5909. Function Application
  5910. Inger
  5911. func( arg1, arg2, argN );
  5912. Example
  5913. printInt ( 4 );
  5914. Assembler
  5915. 1. The expression of each argument is evaluated, stored in eax ,
  5916. and pushed on the stack.
  5917. 2. movl %ebp, %ecx
  5918. 3. The location on the stack is determined.
  5919. 4. call printInt (in this example the function name is printInt)
  5920. 5. The number of bytes used for the arguments is calculated.
  5921. 6. addl $4, %esp (in this example the number of bytes is 4)
  5922. Description
  5923. This template describes the application of a function. The argu-
  5924. ments are pushed on the stack according to the C style function call
  5925. convention.
  5927. Function Implementation
  5928. Inger
  5929. func: type ident1 , type ident2 , type identN → returntype
  5930. {
  5931. // Implementation
  5932. }
  5933. Example
  5934. square: int i → int
  5935. {
  5936. return( i ∗ i );
  5937. }
  5938. Assembler
  5939. 1. .globl square (in this example the function name is square)
  5940. 2. .type square, @function
  5941. 3. square:
  5942. 4. pushl %ebp
  5943. 5. movl %esp, %ebp
6. The number of bytes needed for the parameters is counted here.
  5946. 7. subl $4, %esp (in this example the number of bytes needed
  5947. is 4)
  5948. 8. The implementation code is generated here.
  5949. 9. leave
  5950. 10. ret
  5951. Description
  5952. This template describes implementation of a function. The number
  5953. of bytes needed for the parameters is calculated and subtracted from
  5954. the esp register to allocate space on the stack.
  5956. Identifier
  5957. Inger
  5958. identifier
  5959. Example
  5960. i
  5961. Assembler
For a global variable:
1. movl i, %eax (in this example the name of the identifier is i)
For a local variable:
1. movl %ebp, %ecx
2. The location on the stack is determined.
3. addl $4, %ecx (in this example the stack offset is 4)
4. movl (%ecx), %eax
  5970. Description
  5971. This template describes the use of a variable. When a global variable
  5972. is used it is easy to generate the assembly because we can just use
  5973. the name of the identifier. For locals its position on the stack has to
  5974. be determined.
  5976. Assignment
  5977. Inger
  5978. identifier = expr;
  5979. Example
  5980. i = 12;
  5981. Assembler
  5982. For a global variable:
  5983. 1. The expression is evaluated and stored in eax .
  5984. 2. movl %eax, i (in this example the name of the identifier is i)
  5985. For a local variable:
  5986. 1. The expression is evaluated and stored in eax .
  5987. 2. The location on the stack is determined
  5988. 3. movl %eax, 4(%ebp) (in this example the offset on the stack is
  5989. 4)
  5990. Description
  5991. This template describes an assignment of a variable. Global and
  5992. local variables must be handled differently.
  5994. Global Variable Declaration
  5995. Inger
  5996. type identifier = initializer ;
  5997. Example
  5998. int i = 5;
  5999. Assembler
For a global variable:
1. .data
2. .globl i (in this example the name of the identifier is i)
3. .type i,@object
4. .size i,4 (in this example the type is 4 bytes in size)
5. i:
6. .long 5 (in this example the initializer is 5)
  6007. Description
  6008. This template describes the declaration of a global variable. When
  6009. no initializer is specified, the variable is initialized to zero.
  6011. Equal
  6012. Inger
  6013. expr == expr
  6014. Example
  6015. i == 3
  6016. Assembler
  6017. 1. The left expression is evaluated and stored in eax .
  6018. 2. movl %eax, %ebx
  6019. 3. The right expression is evaluated and stored in eax .
  6020. 4. cmpl %eax, %ebx
  6021. 5. movl $0, %ebx
  6022. 6. movl $1, %ecx
  6023. 7. cmovne %ebx, %eax
  6024. 8. cmove %ecx, %eax
  6025. Description
  6026. This template describes the == operator. The two expressions are
  6027. evaluated and the results are compared. When the results are the
  6028. same, 1 is loaded in eax . When the results are not the same, 0 is
  6029. loaded in eax .
  6031. Not Equal
  6032. Inger
  6033. expr != expr
  6034. Example
  6035. i != 5
  6036. Assembler
  6037. 1. The left expression is evaluated and stored in eax .
  6038. 2. movl %eax, %ebx
  6039. 3. The right expression is evaluated and stored in eax .
  6040. 4. cmpl %eax, %ebx
  6041. 5. movl $0, %ebx
  6042. 6. movl $1, %ecx
  6043. 7. cmove %ebx, %eax
  6044. 8. cmovne %ecx, %eax
  6045. Description
This template describes the ≠ operator. The two expressions are
  6047. evaluated and the results are compared. When the results are not
  6048. the same, 1 is loaded in eax . When the results are the same, 0 is
  6049. loaded in eax .
  6051. Less
  6052. Inger
  6053. expr < expr
  6054. Example
  6055. i < 18
  6056. Assembler
  6057. 1. The left expression is evaluated and stored in eax .
  6058. 2. movl %eax, %ebx
  6059. 3. The right expression is evaluated and stored in eax .
  6060. 4. cmpl %eax, %ebx
  6061. 5. movl $0, %ebx
  6062. 6. movl $1, %ecx
  6063. 7. cmovnl %ebx, %eax
  6064. 8. cmovl %ecx, %eax
  6065. Description
  6066. This template describes the < operator. The two expressions are
  6067. evaluated and the results are compared. When the left result is less
  6068. than the right result, 1 is loaded in eax . When the left result is not
  6069. smaller than the right result, 0 is loaded in eax .
  6071. Less Or Equal
  6072. Inger
  6073. expr <= expr
  6074. Example
  6075. i <= 44
  6076. Assembler
  6077. 1. The left expression is evaluated and stored in eax .
  6078. 2. movl %eax, %ebx
  6079. 3. The right expression is evaluated and stored in eax .
  6080. 4. cmpl %eax, %ebx
  6081. 5. movl $0, %ebx
  6082. 6. movl $1, %ecx
  6083. 7. cmovnle %ebx, %eax
  6084. 8. cmovle %ecx, %eax
  6085. Description
  6086. This template describes the ≤ operator. The two expressions are
  6087. evaluated and the results are compared. When the left result is less
  6088. than or equals the right result, 1 is loaded in eax . When the left
  6089. result is not smaller than and does not equal the right result, 0 is
  6090. loaded in eax .
  6092. Greater
  6093. Inger
  6094. expr > expr
  6095. Example
  6096. i > 57
  6097. Assembler
  6098. 1. The left expression is evaluated and stored in eax .
  6099. 2. movl %eax, %ebx
  6100. 3. The right expression is evaluated and stored in eax .
  6101. 4. cmpl %eax, %ebx
  6102. 5. movl $0, %ebx
  6103. 6. movl $1, %ecx
  6104. 7. cmovng %ebx, %eax
  6105. 8. cmovg %ecx, %eax
  6106. Description
  6107. This template describes the > operator. The two expressions are
  6108. evaluated and the results are compared. When the left result is
  6109. greater than the right result, 1 is loaded in eax . When the left result
  6110. is not greater than the right result, 0 is loaded in eax .
  6112. Greater Or Equal
  6113. Inger
  6114. expr >= expr
  6115. Example
  6116. i >= 26
  6117. Assembler
  6118. 1. The left expression is evaluated and stored in eax .
  6119. 2. movl %eax, %ebx
  6120. 3. The right expression is evaluated and stored in eax .
  6121. 4. cmpl %eax, %ebx
  6122. 5. movl $0, %ebx
  6123. 6. movl $1, %ecx
  6124. 7. cmovnge %ebx, %eax
  6125. 8. cmovge %ecx, %eax
  6126. Description
  6127. This template describes the ≥ operator. The two expressions are
  6128. evaluated and the results are compared. When the left result is
  6129. greater than or equals the right result, 1 is loaded in eax . When the
  6130. left result is not greater than and does not equal the right result, 0
  6131. is loaded in eax .
  6133. Chapter 14
  6134. Bootstrapping
A subject which has not been discussed so far is bootstrapping. Bootstrapping
means building a compiler in its own language. So for the Inger language it
would mean that we build the compiler in the Inger language as well. A discussion
of the practical application of this theory is beyond the scope of this book; below
we explain it in theory.
Developing a new language is mostly done to improve some aspects compared
to other, existing languages. What we would prefer is to compile the compiler
in its own language, but how can that be done when there is no compiler to
compile the compiler? To visualize this problem we use so-called T-diagrams
to illustrate the process of bootstrapping. To get familiar with T-diagrams we
present a few examples.
  6146. Example 14.1 (T-Diagrams)
  6147. Figure 14.1: Program P can work on machine with language M . I = input, O =
  6148. output.
  6149. Figure 14.2: Interpreter for language T , is able to work on a machine with
  6150. language M .
  6151. ?
  6153. Figure 14.3: Compiler for language T1 to language T2 , is able to work on a
  6154. machine with language M .
  6155. Figure 14.4: Machine for language M .
The bootstrapping problem can be resolved in the following way:
1. Build two versions of the compiler. One version is the optimal-compiler
and the other version is the bootstrap-compiler. The optimal-compiler
for the new language T , complete with all optimisations, is written
in language T itself. The bootstrap-compiler is written in an existing
language m . Because this compiler is not optimized and therefore slower
in use, the m is written in lowercase instead of uppercase.
2. Translate the optimal-compiler with the bootstrap-compiler. The result is
the optimal-compiler, which can run on the target machine M . However,
this version of the compiler is not optimized yet (it is slow, and uses a lot
of memory). We call this the temporary-compiler.
3. Now we compile the optimal-compiler again, only this time we
use the temporary-compiler to compile it with. The result will be the
final, optimal compiler able to run on machine M . This compiler will be
fast and produce optimized output.
4. The result: the final-compiler (see figure 14.9).
It is a long way before you have a bootstrap compiler, but remember: this
is the ultimate compiler!
  6174. Figure 14.5: Program P runs on machine M .
  6176. Figure 14.6: The two compilers.
  6177. Figure 14.7: temporary-compiler.
  6178. Figure 14.8: compile process.
  6179. Figure 14.9: final-compiler.
  6181. Chapter 15
  6182. Conclusion
All parts of how to build a compiler, from the setup of a language to the code
generation, have now been discussed. Using this book as a reference, it should
be possible to build your own compiler.
The compiler we build in this book is not innovative. Many compilers of this
type (for imperative languages) already exist; only the language
differs. Examples include compilers for C or Pascal .
Because the Inger compiler is a low-level compiler, it is extremely suitable for
system programming (building operating systems). The same applies to game
programming and programming command line applications (such as UNIX filter
tools).
We hope you will be able to put the theory and practical examples we described
in this book to use, in order to build your own compiler. It is up to you
now!
  6197. Appendix A
  6198. Requirements
  6199. A.1 Introduction
  6200. This chapter specifies the software necessary to either use (run) Inger or to
  6201. develop for Inger. The version numbers supplied in this text are the version
  6202. numbers of software packages we used. You may well be able to use older
  6203. versions of this software, but it is not guaranteed that this will work. You can
  6204. always (except in rare cases) use newer versions of the software discussed below.
  6205. A.2 Running Inger
  6206. Inger was designed for the Linux operating system. It is perfectly possible to run
  6207. Inger on other platforms like Windows, and even do development work for Inger
  6208. on non-Linux platforms. However, this section discusses the software required
  6209. to run Inger on Linux.
  6210. The Linux distribution we used is RedHat, 1 but other advanced Linux dis-
  6211. tributions like SuSE 2 or Mandrake 3 will do fine.
  6212. The most elementary package you need to run Inger is naturally Inger itself.
  6213. It can be downloaded from its repository at Source Forge. 4 There are two pack-
  6214. ages available: the user package, which contains only the compiler binary and
  6215. the user manual, and the development package, which contains the compiler
  6216. source as well. Be sure to download the user version, not the developer version.
  6217. As the Inger compiler compiles to GNU AT&T assembly code, the GNU
  6218. assembler as is required to convert the assembly code to object code. The GNU
  6219. assembler is part of the binutils package. 5 You may use any other assembler,
  6220. provided it supports the AT&T assembly syntax; as is the only assembler that
  6221. supports it that we are currently aware of. A linker is also required — we use
  6222. the GNU linker (which is also part of the binutils package).
  6223. Some of the code for Inger is generated using scripts written in the Perl
  6224. scripting language. You will need the perl 6 interpreter to execute these scripts.
  6225. 1 RedHat Linux 7.2, http://www.redhat.com/apps/download/
  6226. 2 SuSE Linux 8.0, http://www.suse.com/us/private/download/index.html
  6227. 3 Mandrake Linux 8.0, http://www.mandrake.com
  6228. 4 Inger 0.x, http://www.sourceforge.net/projects/inger
  6229. 5 Binutils 2.11.90.0.8, http://www.gnu.org/directory/GNU/binutils.html
  6230. 6 Perl 6, http://www.perl.com
  6232. If you use a Windows port of Inger, you can also use the GNU ports of as
  6233. and ld that come with DJGPP. 7 DJGPP is a free port of (most of) the GNU
  6234. tools.
  6235. It can be advantageous to be able to view this documentation in digital form
  6236. (as a Portable Document File), which is possible with the Acrobat Reader. 8 The
  6237. Inger website may also offer this documentation in other forms, such as HTML.
  6238. Editing Inger source code in Linux can be done with the free editor vi , which
  6239. is included with virtually all Linux distributions. You can use any editor you
  6240. want, though. An Inger syntax highlighting template for Ultra Edit is available
  6241. from the Inger archive at Source Forge.
  6242. If a Windows binary of the Inger compiler is not available or not usable, and
  6243. you need to run Inger on a Windows platform, you may be able to use the Linux
  6244. emulator for the Windows platform, Cygwin , 9 to execute the Linux binary.
  6245. A.3 Inger Development
  6246. For development of the Inger language, rather than development with the Inger
  6247. language, some additional software packages are required. For development
  6248. purposes we strongly recommend that you work on a Linux platform, since we
  6249. cannot guarantee that all development tools are available on Windows platforms
  6250. or work the same way as they do on Linux.
  6251. The Inger binary is built from source using automake 10 and autoconf 11 , both
  6252. of which are free GNU software. These packages allow the developer to generate
  6253. makefiles that target the user platform, i.e. use available C compiler and lexical
  6254. scanner generator versions, and warn if no suitable software is available. To
execute the generated makefiles, GNU make (a separate GNU package) is also
required. Most Linux installations should have this software already
  6257. installed.
C sources are compiled using the GNU Compiler Collection ( gcc ). 12 We used
the lexical analyzer generator GNU flex 13 to generate a lexical scanner.
All Inger code is stored in a Concurrent Versioning System repository on
a server at Source Forge, which may be accessed using the cvs package. 14
Note that you must be registered as an Inger developer to be able to change
the contents of the CVS repository. Registration can be done through Source
Forge.
All documentation was written using the LaTeX2e typesetting package, 15
which is also available for Windows as the MikTeX 16 system. Editors that come
in handy when working with TeX sources are Ultra Edit, 17 which supports TeX
syntax highlighting, and TexnicCenter, 18 which is a full-fledged TeX editor with
many options (although no direct visual feedback; it is a what-you-see-is-what-you-mean
(WYSIWYM) tool).
7 DJGPP 2.03, http://www.delorie.com or http://www.simtel.net/pub/djgpp
8 Adobe Acrobat 5.0, http://www.adobe.com/products/acrobat/readstep.html
9 Cygwin 1.11.1p1, http://www.cygwin.com
10 Automake 1.4-p5, http://www.gnu.org/software/automake
11 Autoconf 2.13, http://www.gnu.org/software/autoconf/autoconf.html
12 GCC 2.96, http://www.gnu.org/software/gcc/gcc.html
13 Flex 2.5.4, http://www.gnu.org/software/flex
14 CVS 1.11.1p1, http://www.cvshome.org
15 LaTeX2e, http://www.latex-project.org
16 MikTeX 2.2, http://www.miktex.org
17 Ultra Edit 9.20, http://www.ultraedit.com
18 TexnicCenter, http://www.toolscenter.org/products/texniccenter
The Inger development package comes with a project definition file for
KDevelop, an open source clone of Microsoft Visual Studio. If you have a
Linux distribution that has the X window system with KDE (K Desktop
Environment) installed, then you can do development work for Inger in a
graphical environment.
  6301. A.4 Required Development Skills
  6302. Development on Inger requires the following skills:
  6303. - A working knowledge of the C programming language;
- A basic knowledge of the Perl scripting language;
- Experience with the GNU assembler (specifically, the AT&T assembly syntax).
The rest of the skills needed, including working with the lexical analyzer
generator flex and writing tree data structures, can be acquired from this book.
Use the bibliography at the end of this appendix to find additional literature
that will help you master all the tools discussed in the preceding sections.
  6312. Bibliography
[1] M. Bar: Open Source Development with CVS, Coriolis Group, 2nd edition, 2001
[2] D. Elsner: Using As: The GNU Assembler, iUniverse.com, 2001
[3] M. Goossens: The LaTeX Companion, Addison-Wesley Publishing, 1993
[4] A. Griffith: GCC: The Complete Reference, McGraw-Hill Osborne Media, 1st edition, 2002
[5] E. Harlow: Developing Linux Applications, New Riders Publishing, 1999
[6] J. Levine: Lex & Yacc, O'Reilly & Associates, 1992
[7] M. Kosta Loukides: Programming with GNU Software, O'Reilly & Associates, 1996
[8] C. Negus: Red Hat Linux 8 Bible, John Wiley & Sons, 2002
[9] T. Oetiker: The Not So Short Introduction to LaTeX2e, version 3.16, 2000
[10] A. Oram: Managing Projects with Make, O'Reilly & Associates, 2nd edition, 1991
[11] G. Purdy: CVS Pocket Reference, O'Reilly & Associates, 2000
[12] R. Stallman: Debugging with GDB: The GNU Source-Level Debugger, Free Software Foundation, 2002
[13] G.V. Vaughan: GNU Autoconf, Automake, and Libtool, New Riders Publishing, 1st edition, 2000
[14] L. Wall: Programming Perl, O'Reilly & Associates, 3rd edition, 2000
[15] M. Welsh: Running Linux, O'Reilly & Associates, 3rd edition, 1999
  6336. Appendix B
  6337. Software Packages
This appendix lists the locations of software packages that are required or
recommended in order to use Inger or to do development work on Inger. Note
that the locations (URLs) of these packages are subject to change and may no
longer be correct.
Package                      Description and location
RedHat Linux 7.2             Operating system
                             http://www.redhat.com
SuSE Linux 8.0               Operating system
                             http://www.suse.com
Mandrake 8.0                 Operating system
                             http://www.mandrake.com
GNU Assembler 2.11.90.0.8    AT&T syntax assembler
                             http://www.gnu.org/directory/GNU/binutils.html
GNU Linker 2.11.90.0.8       COFF file linker
                             http://www.gnu.org/directory/GNU/binutils.html
DJGPP 2.03                   GNU tools port
                             http://www.delorie.com
Cygwin 1.2                   GNU/Linux emulator for Windows
                             http://www.cygwin.com
CVS 1.11.1p1                 Concurrent Versioning System
                             http://www.cvshome.org
Automake 1.4-p5              Makefile generator
                             http://www.gnu.org/software/automake
Autoconf 2.13                Makefile generator support
                             http://www.gnu.org/software/autoconf/autoconf.html
Make 2.11.90.0.8             Makefile processor
                             http://www.gnu.org/software/make/make.html
Flex 2.5.4                   Lexical analyzer generator
                             http://www.gnu.org/software/flex
LaTeX2e                      Typesetting system
                             http://www.latex-project.org
MikTeX 2.2                   Typesetting system
                             http://www.miktex.org
TexnicCenter                 TeX editor
                             http://www.toolscenter.org/products/texniccenter
UltraEdit 9.20               TeX editor
                             http://www.ultraedit.com
Perl 6                       Scripting language
                             http://www.perl.com
  6382. Appendix C
  6383. Summary of Operations
  6384. C.1 Operator Precedence Table
Operator Priority Associativity Description
  6386. () 1 L function application
  6387. [] 1 L array indexing
  6388. ! 2 R logical negation
  6389. - 2 R unary minus
  6390. + 2 R unary plus
  6391. ~ 3 R bitwise complement
  6392. * 3 R indirection
  6393. & 3 R referencing
  6394. * 4 L multiplication
  6395. / 4 L division
  6396. % 4 L modulus
  6397. + 5 L addition
  6398. - 5 L subtraction
  6399. >> 6 L bitwise shift right
  6400. << 6 L bitwise shift left
  6401. < 7 L less than
  6402. <= 7 L less than or equal
  6403. > 7 L greater than
  6404. >= 7 L greater than or equal
  6405. == 8 L equality
  6406. != 8 L inequality
  6407. & 9 L bitwise and
  6408. ^ 10 L bitwise xor
  6409. | 11 L bitwise or
  6410. && 12 L logical and
  6411. || 12 L logical or
  6412. ?: 13 R ternary if
  6413. = 14 R assignment
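As an illustration of how the priority and associativity columns determine a
parse, consider the expression b + c * d == e. Multiplication (priority 4) binds
more strongly than addition (priority 5), which in turn binds more strongly than
equality (priority 8), so the expression reads as ((b + (c * d)) == e). Since Inger
borrows these operators and priorities from C, the reading can be verified with
any C compiler; the sketch below (with arbitrarily chosen variable names) is not
part of the Inger distribution:

/* precedence.c - demonstrates operator priority:
 * b + c * d == e parses as ((b + (c * d)) == e). */
#include <stdio.h>

int main( void )
{
    int b = 2, c = 3, d = 4, e = 14;
    printf( "%d\n", b + c * d == e );       /* prints 1 */
    printf( "%d\n", (b + (c * d)) == e );   /* same parse: prints 1 */
    printf( "%d\n", ((b + c) * d) == e );   /* different grouping: prints 0 */
    return( 0 );
}

The R entries work analogously: assignment (priority 14) associates to the
right, so a chain such as a = b = 0 first assigns 0 to b and then assigns the
result to a.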
  6415. C.2 Operand and Result Types
  6416. Operator Operation Operands Result
  6417. () function application any any
  6418. [] array indexing int none
  6419. ! logical negation bool bool
  6420. - unary minus int int
  6421. + unary plus int int
  6422. ~ bitwise complement int , char int , char
  6423. * indirection any any pointer
  6424. & referencing any pointer any
  6425. * multiplication int , float int , float
  6426. / division int , float int , float
  6427. % modulus int , char int , char
  6428. + addition int , float , char int , float , char
  6429. - subtraction int , float , char int , float , char
  6430. >> bitwise shift right int , char int , char
  6431. << bitwise shift left int , char int , char
  6432. < less than int , float , char int , float , char
  6433. <= less than or equal int , float , char int , float , char
  6434. > greater than int , float , char int , float , char
  6435. >= greater than or equal int , float , char int , float , char
  6436. == equality int , float , char int , float , char
  6437. != inequality int , float , char int , float , char
  6438. & bitwise and int , char int , char
  6439. ^ bitwise xor int , char int , char
  6440. | bitwise or int , char int , char
  6441. && logical and bool bool
  6442. || logical or bool bool
  6443. ?: ternary if bool (2x) any
  6444. = assignment any any
  6446. Appendix D
  6447. Backus-Naur Form
module: module <identifier> ; globals.

globals : ε.
globals : global globals.
globals : extern global globals.

global: function.
global: declaration.

function: functionheader functionrest.
functionheader: modifiers <identifier> : paramlist -> returntype.
functionrest : ;.
functionrest : block.

modifiers: ε.
modifiers: start.

paramlist: void.
paramlist: paramblock moreparamblocks.
moreparamblocks: ε.
moreparamblocks: ; paramblock moreparamblocks.
paramblock: type param moreparams.
moreparams: ε.
moreparams: , param moreparams.
param: reference <identifier> dimensionblock.

returntype: type reference dimensionblock.
reference : ε.
reference : * reference.
dimensionblock: ε.
dimensionblock: [ ] dimensionblock.

block: { code }.
code: ε.
code: block code.
code: statement code.

statement: label <identifier> ;.
statement: ;.
statement: break ;.
statement: continue ;.
statement: expression ;.
statement: declarationblock ;.
statement: if ( expression ) block elseblock.
statement: goto <identifier> ;.
statement: while ( expression ) do block.
statement: do block while ( expression ) ;.
statement: switch ( expression ) { switchcases default block }.
statement: return ( expression ) ;.

elseblock : ε.
elseblock : else block.
switchcases: ε.
switchcases: case <intliteral> block switchcases.

declarationblock: type declaration restdeclarations.
restdeclarations : ε.
restdeclarations : , declaration restdeclarations.
declaration : reference <identifier> indexblock initializer.
indexblock: ε.
indexblock: [ <intliteral> ] indexblock.
initializer : ε.
initializer : = expression.

expression: logicalor restexpression.
restexpression : ε.
restexpression : = logicalor restexpression.

logicalor : logicaland restlogicalor.
restlogicalor : ε.
restlogicalor : || logicaland restlogicalor.

logicaland: bitwiseor restlogicaland.
restlogicaland : ε.
restlogicaland : && bitwiseor restlogicaland.

bitwiseor : bitwisexor restbitwiseor.
restbitwiseor : ε.
restbitwiseor : | bitwisexor restbitwiseor.

bitwisexor: bitwiseand restbitwisexor.
restbitwisexor : ε.
restbitwisexor : ^ bitwiseand restbitwisexor.

bitwiseand: equality restbitwiseand.
restbitwiseand: ε.
restbitwiseand: & equality restbitwiseand.

equality : relation restequality.
restequality : ε.
restequality : equalityoperator relation restequality.
equalityoperator: ==.
equalityoperator: !=.

relation : shift restrelation.
restrelation : ε.
restrelation : relationoperator shift restrelation.
relationoperator : <.
relationoperator : <=.
relationoperator : >.
relationoperator : >=.

shift : addition restshift.
restshift : ε.
restshift : shiftoperator addition restshift.
shiftoperator : <<.
shiftoperator : >>.

addition: multiplication restaddition.
restaddition : ε.
restaddition : additionoperator multiplication restaddition.
additionoperator: +.
additionoperator: -.

multiplication : unary3 restmultiplication.
restmultiplication : ε.
restmultiplication : multiplicationoperator unary3 restmultiplication.
multiplicationoperator : *.
multiplicationoperator : /.
multiplicationoperator : %.

unary3: unary2.
unary3: unary3operator unary3.
unary3operator: &.
unary3operator: *.
unary3operator: ~.

unary2: factor.
unary2: unary2operator unary2.
unary2operator: +.
unary2operator: -.
unary2operator: !.

factor : <identifier> application.
factor : immediate.
factor : ( expression ).

application : ε.
application : [ expression ] application.
application : ( expression moreexpressions ).
moreexpressions: ε.
moreexpressions: , expression moreexpressions.

type: bool.
type: char.
type: float.
type: int.
type: untyped.

immediate: <booleanliteral>.
immediate: <charliteral>.
immediate: <floatliteral>.
immediate: <intliteral>.
immediate: <stringliteral>.
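To illustrate how the rest- productions in this grammar replace left recursion,
here is a sketch of a derivation of the expression 2 * 3 * 4 using the
multiplication rules above (ε denotes the empty alternative; each unary3
derives an <intliteral> through unary2, factor and immediate):

multiplication
  => unary3 restmultiplication
  => unary3 * unary3 restmultiplication
  => unary3 * unary3 * unary3 restmultiplication
  => unary3 * unary3 * unary3 ε

The operands are collected one at a time, so a parser that evaluates as it goes
groups the result as ((2 * 3) * 4), matching the left associativity listed in
appendix C.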
  6628. Appendix E
  6629. Syntax Diagrams
[The syntax diagrams themselves are images and are not reproduced in this
text version; the original figure captions are listed below.]
Figure E.1: Module
Figure E.2: Function
Figure E.3: Formal function parameters
Figure E.4: Data declaration
Figure E.5: Code block
Figure E.6: Statement
Figure E.7: Switch cases
Figure E.8: Assignment, Logical OR Operators
Figure E.9: Ternary IF
Figure E.10: Logical AND and Bitwise OR Operators
Figure E.11: Bitwise XOR and Bitwise AND Operators
Figure E.12: Equality Operators
Figure E.13: Relational Operators
Figure E.14: Bitwise Shift, Addition and Subtraction Operators
Figure E.15: Multiplication and Division Operators
Figure E.16: Unary Operators
Figure E.17: Factor (Variable, Immediate or Expression)
Figure E.18: Immediate (Literal) Value
Figure E.19: Literal identifier
Figure E.20: Literal integer number
Figure E.21: Literal float number
  6658. Appendix F
Inger Lexical Analyzer Source
  6661. F.1 tokens.h
#ifndef TOKENS_H
#define TOKENS_H

#include "defs.h"
/* #include "type.h" */
#include "tokenvalue.h"
#include "ast.h"

/*
 *  MACROS
 */

/* Define where a line starts (at position 1). */
#define LINECOUNTBASE 1

/* Define the position of a first character of a line. */
#define CHARPOSBASE 1

/* Define the block size with which strings are allocated. */
#define STRING_BLOCK 100

/*
 *  TYPES
 */

/* This enum contains all the keywords and operators
 * used in the language. Values start at 1000 so they cannot
 * collide with literal character values, which the lexer
 * returns as-is for unmatched characters.
 */
enum
{
    /* Keywords */
    KW_BREAK = 1000,        /* "break" keyword */
    KW_CASE,                /* "case" keyword */
    KW_CONTINUE,            /* "continue" keyword */
    KW_DEFAULT,             /* "default" keyword */
    KW_DO,                  /* "do" keyword */
    KW_ELSE,                /* "else" keyword */
    KW_EXTERN,              /* "extern" keyword */
    KW_GOTO,                /* "goto" keyword */
    KW_IF,                  /* "if" keyword */
    KW_LABEL,               /* "label" keyword */
    KW_MODULE,              /* "module" keyword */
    KW_RETURN,              /* "return" keyword */
    KW_START,               /* "start" keyword */
    KW_SWITCH,              /* "switch" keyword */
    KW_WHILE,               /* "while" keyword */

    /* Type identifiers */
    KW_BOOL,                /* "bool" identifier */
    KW_CHAR,                /* "char" identifier */
    KW_FLOAT,               /* "float" identifier */
    KW_INT,                 /* "int" identifier */
    KW_UNTYPED,             /* "untyped" identifier */
    KW_VOID,                /* "void" identifier */

    /* Variable lexer tokens */
    LIT_BOOL,               /* bool constant */
    LIT_CHAR,               /* character constant */
    LIT_FLOAT,              /* floating point constant */
    LIT_INT,                /* integer constant */
    LIT_STRING,             /* string constant */
    IDENTIFIER,             /* identifier */

    /* Operators */
    OP_ADD,                 /* "+" */
    OP_ASSIGN,              /* "=" */
    OP_BITWISE_AND,         /* "&" */
    OP_BITWISE_COMPLEMENT,  /* "~" */
    OP_BITWISE_LSHIFT,      /* "<<" */
    OP_BITWISE_OR,          /* "|" */
    OP_BITWISE_RSHIFT,      /* ">>" */
    OP_BITWISE_XOR,         /* "^" */
    OP_DIVIDE,              /* "/" */
    OP_EQUAL,               /* "==" */
    OP_GREATER,             /* ">" */
    OP_GREATEREQUAL,        /* ">=" */
    OP_LESS,                /* "<" */
    OP_LESSEQUAL,           /* "<=" */
    OP_LOGICAL_AND,         /* "&&" */
    OP_LOGICAL_OR,          /* "||" */
    OP_MODULUS,             /* "%" */
    OP_MULTIPLY,            /* "*" */
    OP_NOT,                 /* "!" */
    OP_NOTEQUAL,            /* "!=" */
    OP_SUBTRACT,            /* "-" */
    OP_TERNARY_IF,          /* "?" */

    /* Delimiters */
    ARROW,                  /* "->" */
    LBRACE,                 /* "{" */
    RBRACE,                 /* "}" */
    LBRACKET,               /* "[" */
    RBRACKET,               /* "]" */
    COLON,                  /* ":" */
    COMMA,                  /* "," */
    LPAREN,                 /* "(" */
    RPAREN,                 /* ")" */
    SEMICOLON               /* ";" */
}
tokens;

/*
 *  FUNCTION DECLARATIONS
 */

TreeNode *Parse();

/*
 *  GLOBALS
 */

extern Tokenvalue tokenvalue;

#endif
  6776. F.2 lexer.l
%{
/* Include stdlib for string to number conversion routines. */
#include <stdlib.h>
/* Include errno for errno system variable. */
#include <errno.h>
/* Include string.h to use strtoul(). */
#include <string.h>
/* Include assert.h for the assert macro. */
#include <assert.h>
/* Include global definitions. */
#include "defs.h"
/* The token #defines are defined in tokens.h. */
#include "tokens.h"
/* Include error/warning reporting module. */
#include "errors.h"
/* Include options.h to access command line options. */
#include "options.h"

/*
 *  MACROS
 */

#define INCPOS charPos += yyleng;

/*
 *  FORWARD DECLARATIONS
 */

char SlashToChar( char str[] );
void AddToString( char c );

/*
 *  GLOBALS
 */

/*
 * Tokenvalue (declared in tokens.h) is used to pass
 * literal token values to the parser.
 */
Tokenvalue tokenvalue;

/*
 * lineCount keeps track of the current line number
 * in the source input file.
 */
int lineCount;

/*
 * charPos keeps track of the current character
 * position on the current source input line.
 */
int charPos;

/*
 * Counters used for string reading.
 */
static int stringSize, stringPos;

/*
 * commentsLevel keeps track of the current
 * comment nesting level, in order to ignore nested
 * comments properly.
 */
static int commentsLevel = 0;
%}

/*
 *  LEXER STATES
 */

/* Exclusive state in which the lexer ignores all input
   until a nested comment ends. */
%x STATE_COMMENTS
/* Exclusive state in which the lexer returns all input
   until a string terminates with a double quote. */
%x STATE_STRING

%pointer

/*
 *  REGULAR EXPRESSIONS
 */

%%

 /*
  *  KEYWORDS
  */

start           { INCPOS; return KW_START; }

bool            { INCPOS; return KW_BOOL; }
char            { INCPOS; return KW_CHAR; }
float           { INCPOS; return KW_FLOAT; }
int             { INCPOS; return KW_INT; }
untyped         { INCPOS; return KW_UNTYPED; }
void            { INCPOS; return KW_VOID; }

break           { INCPOS; return KW_BREAK; }
case            { INCPOS; return KW_CASE; }
default         { INCPOS; return KW_DEFAULT; }
do              { INCPOS; return KW_DO; }
else            { INCPOS; return KW_ELSE; }
extern          { INCPOS; return KW_EXTERN; }
goto_considered_harmful { INCPOS; return KW_GOTO; }
if              { INCPOS; return KW_IF; }
label           { INCPOS; return KW_LABEL; }
module          { INCPOS; return KW_MODULE; }
return          { INCPOS; return KW_RETURN; }
switch          { INCPOS; return KW_SWITCH; }
while           { INCPOS; return KW_WHILE; }

 /*
  *  OPERATORS
  */

"->"            { INCPOS; return ARROW; }
"=="            { INCPOS; return OP_EQUAL; }
"!="            { INCPOS; return OP_NOTEQUAL; }
"&&"            { INCPOS; return OP_LOGICAL_AND; }
"||"            { INCPOS; return OP_LOGICAL_OR; }
">="            { INCPOS; return OP_GREATEREQUAL; }
"<="            { INCPOS; return OP_LESSEQUAL; }
"<<"            { INCPOS; return OP_BITWISE_LSHIFT; }
">>"            { INCPOS; return OP_BITWISE_RSHIFT; }
"+"             { INCPOS; return OP_ADD; }
"-"             { INCPOS; return OP_SUBTRACT; }
"*"             { INCPOS; return OP_MULTIPLY; }
"/"             { INCPOS; return OP_DIVIDE; }
"!"             { INCPOS; return OP_NOT; }
"~"             { INCPOS; return OP_BITWISE_COMPLEMENT; }
"%"             { INCPOS; return OP_MODULUS; }
"="             { INCPOS; return OP_ASSIGN; }
">"             { INCPOS; return OP_GREATER; }
"<"             { INCPOS; return OP_LESS; }
"&"             { INCPOS; return OP_BITWISE_AND; }
"|"             { INCPOS; return OP_BITWISE_OR; }
"^"             { INCPOS; return OP_BITWISE_XOR; }
"?"             { INCPOS; return OP_TERNARY_IF; }

 /*
  *  DELIMITERS
  */

"("             { INCPOS; return LPAREN; }
")"             { INCPOS; return RPAREN; }
"["             { INCPOS; return LBRACKET; }
"]"             { INCPOS; return RBRACKET; }
":"             { INCPOS; return COLON; }
";"             { INCPOS; return SEMICOLON; }
"{"             { INCPOS; return LBRACE; }
"}"             { INCPOS; return RBRACE; }
","             { INCPOS; return COMMA; }

 /*
  *  VALUE TOKENS
  */

true            { /* boolean constant */
                  INCPOS;
                  tokenvalue.boolvalue = TRUE;
                  return( LIT_BOOL );
                }

false           { /* boolean constant */
                  INCPOS;
                  tokenvalue.boolvalue = FALSE;
                  return( LIT_BOOL );
                }

[0-9]+          { /* decimal integer constant */
                  INCPOS;
                  tokenvalue.uintvalue = strtoul( yytext, NULL, 10 );
                  if( tokenvalue.uintvalue == -1 )
                  {
                      tokenvalue.uintvalue = 0;
                      AddPosWarning( "integer literal value "
                                     "too large. Zero used",
                                     lineCount, charPos );
                  }
                  return( LIT_INT );
                }

"0x"[0-9A-Fa-f]+ {
                  /* hexadecimal integer constant */
                  INCPOS;
                  tokenvalue.uintvalue = strtoul( yytext, NULL, 16 );
                  if( tokenvalue.uintvalue == -1 )
                  {
                      tokenvalue.uintvalue = 0;
                      AddPosWarning( "hexadecimal integer literal value "
                                     "too large. Zero used",
                                     lineCount, charPos );
                  }
                  return( LIT_INT );
                }

[0-1]+[Bb]      { /* binary integer constant */
                  INCPOS;
                  tokenvalue.uintvalue = strtoul( yytext, NULL, 2 );
                  if( tokenvalue.uintvalue == -1 )
                  {
                      tokenvalue.uintvalue = 0;
                      AddPosWarning( "binary integer literal value too "
                                     "large. Zero used",
                                     lineCount, charPos );
                  }
                  return( LIT_INT );
                }

[_A-Za-z]+[_A-Za-z0-9]* {
                  /* identifier */
                  INCPOS;
                  tokenvalue.identifier = strdup( yytext );
                  return( IDENTIFIER );
                }

[0-9]*\.[0-9]+([Ee][+-]?[0-9]+)? {
                  /* floating point number */
                  INCPOS;
                  if( sscanf( yytext, "%f",
                      &tokenvalue.floatvalue ) == 0 )
                  {
                      tokenvalue.floatvalue = 0;
                      AddPosWarning( "floating point literal value too "
                                     "large. Zero used",
                                     lineCount, charPos );
                  }
                  return( LIT_FLOAT );
                }

 /*
  *  CHARACTERS
  */

\'\\[\'\"abfnrtv]\' {
                  INCPOS;
                  yytext[strlen(yytext)-1] = '\0';
                  tokenvalue.charvalue =
                      SlashToChar( yytext+1 );
                  return( LIT_CHAR );
                }

\'\\B[0-1][0-1][0-1][0-1][0-1][0-1][0-1][0-1]\' {
                  /* \B escape sequence. */
                  INCPOS;
                  yytext[strlen(yytext)-1] = '\0';
                  tokenvalue.charvalue =
                      SlashToChar( yytext+1 );
                  return( LIT_CHAR );
                }

\'\\o[0-7][0-7][0-7]\' {
                  /* \o escape sequence. */
                  INCPOS;
                  yytext[strlen(yytext)-1] = '\0';
                  tokenvalue.charvalue =
                      SlashToChar( yytext+1 );
                  return( LIT_CHAR );
                }

\'\\x[0-9A-Fa-f][0-9A-Fa-f]\' {
                  /* \x escape sequence. */
                  INCPOS;
                  yytext[strlen(yytext)-1] = '\0';
                  tokenvalue.charvalue =
                      SlashToChar( yytext+1 );
                  return( LIT_CHAR );
                }

\'.\'           {
                  /* Single character. */
                  INCPOS;
                  tokenvalue.charvalue = yytext[1];
                  return( LIT_CHAR );
                }

 /*
  *  STRINGS
  */

\"              { INCPOS;
                  tokenvalue.stringvalue =
                      (char*) malloc( STRING_BLOCK );
                  memset( tokenvalue.stringvalue,
                          0, STRING_BLOCK );
                  stringSize = STRING_BLOCK;
                  stringPos = 0;
                  BEGIN STATE_STRING;   /* begin of string */
                }

<STATE_STRING>\" {
                  INCPOS;
                  BEGIN 0;
                  /* Do not include terminating " in string */
                  return( LIT_STRING ); /* end of string */
                }

<STATE_STRING>\n {
                  INCPOS;
                  AddPosWarning( "strings cannot span multiple "
                                 "lines", lineCount, charPos );
                  AddToString( '\n' );
                }

<STATE_STRING>\\[\'\"abfnrtv] {
                  /* Escape sequences in string. */
                  INCPOS;
                  AddToString( SlashToChar( yytext ) );
                }

<STATE_STRING>\\B[0-1][0-1][0-1][0-1][0-1][0-1][0-1][0-1] {
                  /* \B escape sequence. */
                  INCPOS;
                  AddToString( SlashToChar( yytext ) );
                }

<STATE_STRING>\\o[0-7][0-7][0-7] {
                  /* \o escape sequence. */
                  INCPOS;
                  AddToString( SlashToChar( yytext ) );
                }

<STATE_STRING>\\x[0-9A-Fa-f][0-9A-Fa-f] {
                  /* \x escape sequence. */
                  INCPOS;
                  AddToString( SlashToChar( yytext ) );
                }

<STATE_STRING>. {
                  /* Any other character */
                  INCPOS;
                  AddToString( yytext[0] );
                }

 /*
  *  LINE COMMENTS
  */

"//"[^\n]*      { ++lineCount; /* ignore comment lines */ }

 /*
  *  BLOCK COMMENTS
  */

"/*"            { INCPOS;
                  ++commentsLevel;
                  BEGIN STATE_COMMENTS;
                  /* start of comments */
                }

<STATE_COMMENTS>"/*" {
                  INCPOS;
                  ++commentsLevel;
                  /* begin of deeper nested comments */
                }

<STATE_COMMENTS>. { INCPOS; /* ignore all characters */ }

<STATE_COMMENTS>\n {
                  charPos = 0;
                  ++lineCount; /* ignore newlines */
                }

<STATE_COMMENTS>"*/" {
                  INCPOS;
                  if( --commentsLevel == 0 )
                      BEGIN 0; /* end of comments */
                }

 /*
  *  WHITESPACE
  */

[\t ]           { ++charPos; /* ignore whitespace */ }

\n              { ++lineCount;
                  charPos = 0; /* ignore newlines */
                }

 /* unmatched character */
.               { INCPOS; return yytext[0]; }

%%

/*
 *  ADDITIONAL VERBATIM C CODE
 */

/*
 * Convert a slashed character (e.g. \n, \r etc.) to a
 * char value.
 * The argument is a string that starts with a backslash,
 * e.g. \x2e, \o056, \n, \B11011101
 * Pre: (for \x, \B and \o): strlen(str) is large
 *      enough. The regexps in the lexer take care
 *      of this.
 */
char SlashToChar( char str[] )
{
    static char strPart[20];

    memset( strPart, 0, 20 );
    switch( str[1] )
    {
    case '\\':
        return( '\\' );
    case '\"':
        return( '\"' );
    case '\'':
        return( '\'' );
    case 'a':
        return( '\a' );
    case 'b':
        return( '\b' );
    case 'B':
        strncpy( strPart, str+2, 8 );
        /* Convert the copied binary digits (strPart),
         * analogous to the \o and \x cases below. */
        return( strtoul( strPart, NULL, 2 ) );
    case 'f':
        return( '\f' );
    case 'n':
        return( '\n' );
    case 'o':
        strncpy( strPart, str+2, 3 );
        return( strtoul( strPart, NULL, 8 ) );
    case 't':
        return( '\t' );
    case 'r':
        return( '\r' );
    case 'v':
        return( '\v' );
    case 'x':
        strncpy( strPart, str+2, 2 );
        return( strtoul( strPart, NULL, 16 ) );
    default:
        /* Control should never get here! */
        assert( 0 );
    }
}

/*
 * For string reading (which happens on a
 * character-by-character basis), add a character to
 * the global lexer string 'tokenvalue.stringvalue'.
 */
void AddToString( char c )
{
    if( tokenvalue.stringvalue == NULL )
    {
        /* Some previous realloc() already went wrong.
         * Silently abort.
         */
        return;
    }

    if( stringPos >= stringSize - 1 )
    {
        stringSize += STRING_BLOCK;
        DEBUG( "resizing string memory +%d, now %d bytes",
               STRING_BLOCK, stringSize );
        tokenvalue.stringvalue =
            (char*) realloc( tokenvalue.stringvalue,
                             stringSize );
        if( tokenvalue.stringvalue == NULL )
        {
            AddPosWarning( "Unable to claim enough memory "
                           "for string storage",
                           lineCount, charPos );
            return;
        }
        memset( tokenvalue.stringvalue + stringSize
                - STRING_BLOCK, 0, STRING_BLOCK );
    }

    tokenvalue.stringvalue[stringPos] = c;
    stringPos++;
}
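The scanner that flex generates from lexer.l can be exercised on its own. The
driver below is a hypothetical sketch, not part of the Inger sources: it assumes
the generated scanner (lex.yy.c) and its supporting Inger modules are linked in,
and prints each token code that yylex() returns until the input runs out (a
flex-generated yylex() returns 0 at end of input).

/* tokendump.c - hypothetical test driver for the generated scanner.
 * Build sketch: flex lexer.l && gcc tokendump.c lex.yy.c ... -lfl */
#include <stdio.h>
#include "tokens.h"

extern int yylex( void );  /* generated by flex from lexer.l */
extern char *yytext;       /* text of the last matched token */

int main( void )
{
    int token;
    while( ( token = yylex() ) != 0 )
    {
        printf( "%4d '%s'\n", token, yytext );
    }
    return( 0 );
}

Token codes of 1000 and up correspond to the enum in tokens.h; smaller values
are literal characters returned by the unmatched-character rule at the end of
lexer.l.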
  7240. Appendix G
Logic Language Parser Source
  7243. G.1 Lexical Analyzer
%{
#include "lexer.h"

unsigned int nr = 0;
%}

%%

\n              { nr = 0; }

[ \t]+          { nr += yyleng; }

"->"            { nr += yyleng;
                  return (RIMPL);
                }

"<-"            { nr += yyleng;
                  return (LIMPL);
                }

"<->"           { nr += yyleng;
                  return (EQUIV);
                }

[A-Z]{1}        { nr += yyleng;
                  return (IDENT);
                }

"RESULT"        { nr += yyleng;
                  return (RESULT);
                }

"PRINT"         { nr += yyleng;
                  return (PRINT);
                }

.               { nr += yyleng;
                  return (yytext[0]);
                }

%%
  7286. G.2 Parser Header
#ifndef LEXER_H
#define LEXER_H 1

enum
{
    LIMPL = 300,
    RIMPL,
    EQUIV,
    RESULT,
    PRINT,
    IDENT
};

#endif
  7299. G.3 Parser Source
#include <stdio.h>
#include <stdlib.h>

#include "lexer.h"

#ifdef DEBUG
#  define debug(args...) printf(args)
#else
#  define debug(...)
#endif

extern char *yytext;
extern unsigned int nr;
extern int yylex (void);

unsigned int token;

// who needs complex datastructures anyway?
unsigned int acVariables[26];

void gettoken (void)
{
    token = yylex();
    debug("new token: %s\n", yytext);
}

void error (char *e)
{
    fprintf (stderr, "ERROR(%d:%c): %s\n",
             nr, yytext[0], e);
    exit (1);
}

void statement (void);
int negation (void);
int restnegation (int);
int conjunction (void);
int restconjunction (int);
int implication (void);
int restimplication (int);
int factor (void);

void statement (void)
{
    int res = 0, i, var = 0;
    debug("statement()\n");

    if (token == IDENT)
    {
        var = yytext[0] - 65;   /* map 'A'..'Z' to 0..25 */
        gettoken ();
        if (token == '=')
        {
            gettoken();
            res = implication ();
            acVariables[var] = res;
            if (token != ';')
                error ("; expected");
            gettoken();
        } else {
            error ("= expected");
        }
    } else {
        error ("This shouldn't have happened.");
    }

    for (i = 0; i < 26; i++)
        debug ("%d", acVariables[i]);
    debug ("\n");
}

int implication (void)
{
    int res = 0;
    debug("implication()\n");
    res = conjunction();
    res = restimplication (res);
    return (res);
}

int restimplication (int val)
{
    int res = val;
    int operator;
    debug("restimplication()\n");

    if (token == EQUIV || token == RIMPL || token == LIMPL)
    {
        operator = token;
        gettoken();
        res = conjunction();
        switch (operator)
        {
        case RIMPL:
            res = (val == 0) || (res == 1) ? 1 : 0;
            break;
        case LIMPL:
            res = (val == 1) || (res == 0) ? 1 : 0;
            break;
        case EQUIV:
            res = (res == val) ? 1 : 0;
            break;
        }
        res = restimplication (res);
    }
    return (res);
}

int conjunction (void)
{
    int res = 0;
    debug("conjunction()\n");
    res = negation();
    res = restconjunction(res);
    return (res);
}

int restconjunction (int val)
{
    int res = val, operator;
    debug("restconjunction()\n");
    if (token == '&' || token == '|')
    {
        operator = token;
        gettoken();
        res = negation();
        if (operator == '&')
        {
            res = ((res == 1) && (val == 1)) ? 1 : 0;
        } else { /* '|' */
            res = ((res == 1) || (val == 1)) ? 1 : 0;
        }
        res = restconjunction(res);
    }
    return (res);
}

int negation (void)
{
    int res = 0;
    debug("negation()\n");
    if (token == '~')
    {
        gettoken();
        res = negation() == 0 ? 1 : 0;
    } else {
        res = factor ();
    }
    return (res);
}

int factor (void)
{
    int res = 0;
    debug("factor()\n");
    switch (token)
    {
    case '(':
        gettoken();
        res = implication ();
        if (token != ')')
            error ("missing ')'");
        break;
    case '1':
        res = 1;
        break;
    case '0':
        res = 0;
        break;
    case IDENT:
        debug("'%s' processed\n", yytext);
        res = acVariables[yytext[0] - 65];
        break;
    default:
        error ("(, 1, 0 or identifier expected");
    }
    debug ("factor is returning %d\n", res);
    gettoken();
    return (res);
}

void program (void)
{
    while (token == IDENT || token == PRINT)
    {
        if (token == IDENT)
        {
            statement();
        }
        else if (token == PRINT)
        {
            gettoken();
            printf ("%d\n", implication());
            if (token != ';')
                error ("; expected");
            gettoken();
        }
    }
}

int main (void)
{
    int i = 0;
    for (i = 0; i < 26; i++)
        acVariables[i] = 0;

    /* start off */
    gettoken();
    program ();
    return (0);
}
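To see the parser in action, feed it a small logic program on standard input.
The following session is a hypothetical example (variables are single
upper-case letters; 1 and 0 are the truth constants):

A = 1;
B = ~A | 0;
PRINT A -> B;

Given A = 1, the second statement yields B = 0, and the implication A -> B is
then false, so the interpreter prints 0. Note that the RESULT token is declared
in the header but never used by this parser.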
  7502. Listings
  7503. 3.1 Inger Factorial Program . . . . . . . . . . . . . . . . . . . . . . . 25
  7504. 3.2 Backus-Naur Form for module . . . . . . . . . . . . . . . . . . . 27
  7505. 3.3 Legal Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
  7506. 3.4 Global Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
  7507. 3.5 Local Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
  7508. 3.6 BNF for Declaration . . . . . . . . . . . . . . . . . . . . . . . . . 39
  7509. 3.7 The While Statement . . . . . . . . . . . . . . . . . . . . . . . . 44
  7510. 3.8 The Break Statement . . . . . . . . . . . . . . . . . . . . . . . . 44
  7511. 3.9 The Break Statement (output) . . . . . . . . . . . . . . . . . . . 44
  7512. 3.10 The Continue Statement . . . . . . . . . . . . . . . . . . . . . . . 45
  7513. 3.11 The Continue Statement (output) . . . . . . . . . . . . . . . . . . 45
  7514. 3.12 Roman Numerals . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
  7515. 3.13 Roman Numerals Output . . . . . . . . . . . . . . . . . . . . . . 48
  7516. 3.14 Multiple If Alternatives . . . . . . . . . . . . . . . . . . . . . . . 48
  7517. 3.15 The Switch Statement . . . . . . . . . . . . . . . . . . . . . . . . 49
  7518. 3.16 The Goto Statement . . . . . . . . . . . . . . . . . . . . . . . . . 49
  7519. 3.17 An Array Example . . . . . . . . . . . . . . . . . . . . . . . . . . 51
  7520. 3.18 C-implementation of printint Function . . . . . . . . . . . . . . . . 56
  7521. 3.19 Inger Header File for printint Function . . . . . . . . . . . . . . . . 56
  7522. 3.20 Inger Program Using printint . . . . . . . . . . . . . . . . . . . . . 57
  7523. 5.1 Sample Expression Language . . . . . . . . . . . . . . . . . . . . 85
  7524. 5.2 Sample Expression Language in EBNF . . . . . . . . . . . . . . . 91
  7525. 5.3 Sample Expression Language in EBNF . . . . . . . . . . . . . . . 93
  7526. 5.4 Unambiguous Expression Language in EBNF . . . . . . . . . . . 97
  7527. 5.5 Expression Grammar Modified for Associativity . . . . . . . . . . 100
  7528. 5.6 BNF for Logic Language . . . . . . . . . . . . . . . . . . . . . . . 103
  7529. 5.7 EBNF for Logic Language . . . . . . . . . . . . . . . . . . . . . . 103
  7530. 6.1 Expression Grammar for LL Parser . . . . . . . . . . . . . . . . . 110
  7531. 6.2 Expression Grammar for LR Parser . . . . . . . . . . . . . . . . . 114
  7532. 6.3 Conjunction Nonterminal Function . . . . . . . . . . . . . . . . . 119
  7533. 8.1 Sync routine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
  7534. 8.2 SyncOut routine . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
  7535. 10.1 Coercion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
  7536. 10.2 Sample program listing . . . . . . . . . . . . . . . . . . . . . . . . 149
  7537. 12.1 Global Variable Declaration . . . . . . . . . . . . . . . . . . . . . 165
  7539. Index
  7540. abstract grammar, 82
  7541. Abstract Syntax Tree, 140
  7542. abstract syntax tree, 97, 106
  7543. Ada, 17
  7544. address, 52
  7545. adventure game, 79
  7546. Algol 60, 16
  7547. Algol 68, 16
  7548. algorithm, 24
  7549. alphabet, 63, 79, 82
  7550. ambiguous, 94
  7551. annotated parse tree, 95
  7552. annotated syntax tree, 95
  7553. array, 50
  7554. arrays, 136
  7555. assignment chaining, 40
  7556. assignment statement, 40
  7557. associativity, 13, 99
  7558. AST, 133, 157–159
  7559. auxiliary symbol, 82
  7560. Backus-Naur Form, 26
  7561. Backus-Naur form, 84
  7562. basis, 80
  7563. binary number, 31
  7564. binary search trees, 136
  7565. block, 42
  7566. bool, 33
  7567. boolean value, 33
  7568. bootstrap-compiler, 200
  7569. bootstrapping, 199
  7570. bottom-up, 108
  7571. break, 43
  7572. by value, 55
  7573. C, 17
  7574. C++, 18
  7575. callee, 168
  7576. calling convention, 167
  7577. case, 46, 158
  7578. case block, 157
  7579. case blocks, 157
  7580. case value, 157
  7581. char, 36
  7582. character, 32, 36
  7583. child node, 156
  7584. children, 158
  7585. Chomsky hierarchy, 89
  7586. closure, 80
  7587. CLU, 17
  7588. COBOL, 16
  7589. code, 11
  7590. code block, 156–158
  7591. code blocks, 157
  7592. code generation, 158
  7593. coercion, 143
  7594. comment, 29
  7595. common language specification, 21
  7596. compiler, 10
  7597. compiler-compiler, 109
  7598. compound statement, 40
  7599. computer language, 79
  7600. conditional statement, 43
  7601. context-free, 87
  7602. context-free grammar, 63, 84, 89
  7603. context-insensitive, 87
  7604. context-sensitive grammar, 89
  7605. continue, 43
  7606. dangling else problem, 46
  7607. data, 24
  7608. decimal separator, 31
  7609. declaration, 24
  7610. default, 46
  7611. definition, 24
  7612. delimiter, 29
  7613. derivation, 28
  7614. derivation scheme, 92
  7615. determinism, 88, 109
  7616. deterministic grammar, 116
  7617. dimension, 50
  7618. double, 36
  7620. duplicate case value, 157, 158
  7621. duplicate case values, 157
  7622. duplicate values, 158
  7623. duplicates, 158
  7624. dynamic variable, 52
  7625. Eiffel, 18
  7626. encapsulation, 17
  7627. end of file, 110
  7628. error, 133, 158
  7629. error message, 159
  7630. error recovery, 117
  7631. escape, 32
  7632. evaluator, 107
  7633. exclusive state, 69
  7634. expression evaluation, 40
  7635. Extended Backus-Naur Form, 28
  7636. extended Backus-Naur form, 90
  7637. extern, 56
  7638. FIRST, 128
  7639. float, 35, 36
  7640. floating point number, 31
  7641. flow control statement, 46
  7642. FOLLOW, 128
  7643. formal language, 79
  7644. FORTRAN, 16
  7645. fractional part, 31
  7646. function, 24, 52
  7647. function body, 42, 54
  7648. function header, 54
  7649. functional programming language, 16
  7650. global variable, 24, 37
  7651. goto, 159
  7652. goto label, 159
  7653. goto labels, 159
  7654. goto statement, 159
  7655. grammar, 60, 78
  7656. hash tables, 136
  7657. header file, 56
  7658. heap, 52
  7659. hexadecimal number, 31
  7660. identifier, 29, 52
  7661. if, 43
  7662. imperative programming, 16
imperative programming language, 16
  7665. indirect recursion, 88
  7666. indirection, 37
  7667. induction, 80
  7668. information hiding, 17
  7669. inheritance, 17
  7670. int, 34
  7671. integer number, 31
Intel assembly language, 10
  7673. interpreter, 9
  7674. Java, 18
  7675. Kevo, 18
  7676. Kleene star, 64
  7677. label, 49, 159
  7678. language, 78
  7679. left hand side, 40
  7680. left recursion, 88
  7681. left-factorisation, 112
  7682. left-linear, 89
  7683. left-recursive, 111
  7684. leftmost derivation, 83, 93
  7685. lexer, 61
  7686. lexical analyzer, 61, 86
  7687. library, 56
  7688. linked list, 136
  7689. linker, 56
  7690. LISP, 17
  7691. LL, 109
  7692. local variable, 37
  7693. lookahead, 113
  7694. loop, 43
  7695. lvalue, 40
  7696. lvalue check, 153
  7697. metasymbol, 29
  7698. Modula2, 17
  7699. module, 55
  7700. n-ary search trees, 136
  7701. natural language, 79, 133
  7702. Non-void function returns, 157
  7703. non-void function returns, 156
  7704. non-void functions, 157
  7705. nonterminal, 26, 82
object-oriented programming language, 16
  7708. operator, 29
  7709. optimal-compiler, 200
  7711. parse tree, 92
  7712. parser, 106
  7713. Pascal, 16
  7714. PL/I, 16
  7715. pointer, 36
  7716. polish notation, 99, 107
  7717. polymorphic type, 36
  7718. pop, 13
  7719. prefix notation, 99, 107
  7720. priority, 13
  7721. priority list, 13
  7722. procedural, 16
  7723. procedural programming, 16
procedural programming language, 16
  7726. production rule, 27, 81
  7727. push, 13
  7728. random access structure, 50
  7729. read pointer, 11
  7730. recursion, 83, 87
  7731. recursive descent, 109
  7732. recursive step, 80
  7733. reduce, 12, 13, 108
  7734. reduction, 12, 113
  7735. regex, 66
  7736. regular expression, 65
  7737. regular grammar, 89
  7738. regular language, 63
  7739. reserved word, 29, 73
  7740. return, 156, 157
  7741. return statement, 157
  7742. right hand side, 40
  7743. right linear, 89
  7744. rightmost derivation, 94
  7745. root node, 158
  7746. rvalue, 40
  7747. SASL, 17
  7748. scanner, 61
  7749. scanning, 62
  7750. Scheme, 17
  7751. scientific notation, 31
  7752. scope, 30, 37, 136
  7753. scoping, 135
  7754. screening, 62
  7755. selector, 46
  7756. semantic, 10
  7757. semantic analysis, 133, 136
  7758. semantic check, 158
  7759. semantics, 81, 133
  7760. sentential form, 83, 85, 109
  7761. shift, 12, 13, 108, 114
  7762. shift-reduce method, 115
  7763. side effect, 40, 53
  7764. signed integers, 34
  7765. simple statement, 40
  7766. Simula, 17
  7767. single-line comment, 29
  7768. SmallTalk, 17
  7769. SML, 17
  7770. stack, 11, 52, 108
  7771. stack frame, 167
  7772. start, 52, 55
  7773. start function, 52
  7774. start symbol, 27, 82, 84, 90, 108
  7775. statement, 24, 40, 156
  7776. static variable, 52
  7777. string, 32
  7778. strings, 79
  7779. switch block, 157
  7780. switch node, 158
  7781. switch statement, 157
  7782. symbol, 134, 136
  7783. symbol identification, 134, 135
  7784. Symbol Table, 136
  7785. symbol table, 133, 159
  7786. syntactic sugar, 108
  7787. syntax, 60, 80, 133
  7788. syntax analysis, 133
  7789. syntax diagram, 24, 90
  7790. syntax error, 109
  7791. syntax tree, 92
  7792. T-diagram, 199
  7793. template, 163
  7794. terminal, 26, 82
  7795. terminal symbol, 82
  7796. token, 61, 114
  7797. token value, 63
  7798. tokenizing, 62
  7799. top-down, 108
  7800. translator, 9
  7801. Turing Machine, 17
  7802. type 0 language, 89
  7803. type 1 grammar, 89
  7804. type 1 language, 89
  7805. type 2 grammar, 89
  7806. type 2 language, 89
  7807. type 3 grammar, 89
  7809. type 3 language, 89
  7810. type checker, 133
  7811. Type checking, 133
  7812. typed pointer, 50
  7813. types, 133
  7814. union, 63, 64
  7815. unique, 136
  7816. Unreachable code, 156
  7817. unreachable code, 156, 157
  7818. unsigned byte, 36
  7819. untyped, 36, 73
  7820. warning, 133, 156, 158
  7821. whitespace, 86
  7822. zero-terminated string, 50