Compiler Construction: A Practical Approach

F.J.F. Benders
J.W. Haaring
T.H. Janssen
D. Meffert
A.C. van Oostenrijk

January 11, 2003
Abstract

Plenty of literature is available to learn about compiler construction, but most of it is either too easy, covering only the very basics, or too difficult and accessible only to academics. We find that, most notably, literature about code generation is lacking and it is in this area that this book attempts to fill in the gaps.

In this book, we design a new language (Inger), and explain how to write a compiler for it that compiles to Intel assembly language. We discuss lexical analysis (scanning), LL(1) grammars, recursive descent parsing, syntax error recovery, identification, type checking and code generation using templates, and give practical advice on tackling each part of the compiler.
Acknowledgements

The authors would like to extend their gratitude to a number of people who were invaluable in the conception and writing of this book. The compiler construction project of which this book is the result was started with the help of Frits Feldbrugge and Robert Holwerda. The Inger language was named after Inger Vermeir (in the good tradition of naming languages after people, like Ada). The project team was coordinated by Marco Devillers, who proved to be a valuable source of advice.

We would also like to thank Carola Doumen for her help in structuring the project and coordinating presentations given about the project. Cees Haaring helped us get a number of copies of the book printed.

Furthermore, we thank Mike Wertman for letting us study his source of a Pascal compiler written in Java, and Hans Meijer of the University of Nijmegen for his invaluable compiler construction course. Finally, we would like to thank the University of Arnhem and Nijmegen for letting us use a project room and computer equipment for as long as we wanted.
Contents

1 Introduction
  1.1 Translation and Interpretation
  1.2 Roadmap
  1.3 A Sample Interpreter

2 Compiler History
  2.1 Procedural Programming
  2.2 Functional Programming
  2.3 Object Oriented Programming
  2.4 Timeline

I Inger

3 Language Specification
  3.1 Introduction
  3.2 Program Structure
  3.3 Notation
  3.4 Data
    3.4.1 bool
    3.4.2 int
    3.4.3 float
    3.4.4 char
    3.4.5 untyped
  3.5 Declarations
  3.6 Action
    3.6.1 Simple Statements
    3.6.2 Compound Statements
    3.6.3 Repetitive Statements
    3.6.4 Conditional Statements
    3.6.5 Flow Control Statements
  3.7 Array
  3.8 Pointers
  3.9 Functions
  3.10 Modules
  3.11 Libraries
  3.12 Conclusion

II Syntax

4 Lexical Analyzer
  4.1 Introduction
  4.2 Regular Language Theory
  4.3 Sample Regular Expressions
  4.4 UNIX Regular Expressions
  4.5 States
  4.6 Common Regular Expressions
  4.7 Lexical Analyzer Generators
  4.8 Inger Lexical Analyzer Specification

5 Grammar
  5.1 Introduction
  5.2 Languages
  5.3 Syntax and Semantics
  5.4 Production Rules
  5.5 Context-free Grammars
  5.6 The Chomsky Hierarchy
  5.7 Additional Notation
  5.8 Syntax Trees
  5.9 Precedence
  5.10 Associativity
  5.11 A Logic Language
  5.12 Common Pitfalls

6 Parsing
  6.1 Introduction
  6.2 Prefix code
  6.3 Parsing Theory
  6.4 Top-down Parsing
  6.5 Bottom-up Parsing
  6.6 Direction Sets
  6.7 Parser Code
  6.8 Conclusion

7 Preprocessor
  7.1 What is a preprocessor?
  7.2 Features of the Inger preprocessor
    7.2.1 Multiple file inclusion
    7.2.2 Circular References

8 Error Recovery
  8.1 Introduction
  8.2 Error handling
  8.3 Error detection
  8.4 Error reporting
  8.5 Error recovery
  8.6 Synchronization

III Semantics

9 Symbol table
  9.1 Introduction to symbol identification
  9.2 Scoping
  9.3 The Symbol Table
    9.3.1 Dynamic vs. Static
  9.4 Data structure selection
    9.4.1 Criteria
    9.4.2 Data structures compared
    9.4.3 Data structure selection
  9.5 Types
  9.6 An Example

10 Type Checking
  10.1 Introduction
  10.2 Implementation
    10.2.1 Decorate the AST with types
    10.2.2 Coercion
  10.3 Overview
    10.3.1 Conclusion

11 Miscellaneous Semantic Checks
  11.1 Left Hand Values
    11.1.1 Check Algorithm
  11.2 Function Parameters
  11.3 Return Keywords
    11.3.1 Unreachable Code
    11.3.2 Non-void Function Returns
  11.4 Duplicate Cases
  11.5 Goto Labels

IV Code Generation

12 Code Generation
  12.1 Introduction
  12.2 Boilerplate Code
  12.3 Globals
  12.4 Resource Calculation
  12.5 Intermediate Results of Expressions
  12.6 Function calls
  12.7 Control Flow Structures
  12.8 Conclusion

13 Code Templates

14 Bootstrapping

15 Conclusion

A Requirements
  A.1 Introduction
  A.2 Running Inger
  A.3 Inger Development
  A.4 Required Development Skills

B Software Packages

C Summary of Operations
  C.1 Operator Precedence Table
  C.2 Operand and Result Types

D Backus-Naur Form

E Syntax Diagrams

F Inger Lexical Analyzer Source
  F.1 tokens.h
  F.2 lexer.l

G Logic Language Parser Source
  G.1 Lexical Analyzer
  G.2 Parser Header
  G.3 Parser Source
Chapter 1

Introduction

This book is about constructing a compiler. But what, precisely, is a compiler? We must give a clear and complete answer to this question before we can begin building our own compiler.

In this chapter, we will introduce the concept of a translator, and more specifically, a compiler. It serves as an introduction to the rest of the book and presents some basic definitions that we will assume to be clear throughout the remainder of the book.
1.1 Translation and Interpretation

A compiler is a special form of a translator:

Definition 1.1 (Translator)
A translator is a program, or a system, that converts an input text in some language to a text in another language, with the same meaning.

One translator could translate English text to Dutch text, and another could translate a Pascal program to assembly or machine code. Yet another translator might translate chess notation to an actual representation of a chess board, or translate a web page description in HTML to the actual web page. The latter two examples are in fact not so much translators as they are interpreters:

Definition 1.2 (Interpreter)
An interpreter is a translator that converts an input text to its meaning, as defined by its semantics.
A BASIC interpreter like GW-BASIC is a classic and familiar example of an interpreter. Conversely, a translator translates the expression 2 + 3 to, for example, machine code that evaluates to 5. It does not translate directly to 5. The processor (CPU) that executes the machine code is the actual interpreter, delivering the final result. These observations lead to the following relationship:

    interpreters ⊂ translators

Sometimes the difference between the translation of an input text and its meaning is not immediately clear, and it can be difficult to decide whether a certain translator is an interpreter or not.

A compiler is a translator that converts program source code to some target code, such as Pascal to assembly code, C to machine code, and so on. Such translators differ from translators for, for example, natural languages because their input is expected to follow very strict rules of form (syntax) and the meaning of an input text must always be clear, i.e. follow a set of semantic rules.
Many programs can be considered translators, not just the ones that deal with text. Other types of input and output can also be viewed as structured text (SQL queries, vector graphics, XML) which adheres to a certain syntax, and can therefore be treated the same way. Many conversion tools (conversion between graphics formats, or HTML to LaTeX) are in fact translators. In order to think of some process as a translator, one must find out which alphabet is used (the set of allowed words) and which sentences are spoken. An interesting exercise is writing a program that converts chess notation to a chess board diagram.

Meijer [1] presents a set of definitions that clarify the distinction between translation and interpretation. If the input text to a translator is a program, then that program can have its own input stream. Such a program can be translated without knowledge of the contents of the input stream, but it cannot be interpreted.
Let p be the program that must be translated, in programming language P, and let i be the input. Then the interpreter is a function v_P, and the result of running p on input i is denoted as:

    v_P(p, i)

If c is a translator, the same result is obtained by applying the translation c(p), in a programming language M, to the input stream i:

    v_M(c(p), i)
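This equivalence can be sketched in a few lines of Python. This is our own illustration, not code from the book: the "interpreter" evaluates the program text directly on the input, while the "compiler" first translates it into an executable object (here simply a Python function), which is then applied to the input.

```python
def interpret(p, i):
    # v_P: evaluate the program text p (an expression over the
    # variable x) directly on the input i.
    return eval(p, {"x": i})

def compile_program(p):
    # c: translate p once into "machine language M" -- here a Python
    # code object -- without looking at any input stream.
    code = compile(p, "<program>", "eval")
    return lambda i: eval(code, {"x": i})

program = "x * 2 + 1"
# Both routes produce the same result: v_P(p, i) == v_M(c(p), i).
assert interpret(program, 20) == compile_program(program)(20) == 41
```

Note that compile_program never needs to know the input i; this is exactly the point of the distinction above.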
Interpreters are quite common. Many popular programming languages cannot be compiled but must be interpreted, such as (early forms of) BASIC, Smalltalk, Ruby, Perl and PHP. Other programming languages provide both the option of compilation and the option of interpretation.

The rest of this book will focus on compilers, which translate program input texts to some target language. We are specifically interested in translating program input text in a programming language (in particular, Inger) to Intel assembly language.
1.2 Roadmap

Constructing a compiler involves specifying the programming language for which you wish to build a compiler, and then writing a grammar for it. The compiler then reads source programs written in the new programming language and checks that they are syntactically valid (well-formed). After that, the compiler verifies that the meaning of the program is correct, i.e. it checks the program's semantics. The final step in the compilation is generating code in the target language.

To help you visualize where you are in the compiler construction process, every chapter begins with a copy of the roadmap: one of the squares on the map will be highlighted.
1.3 A Sample Interpreter

In this section, we will discuss a very simple sample interpreter that calculates the result of simple mathematical expressions, using the operators + (addition), - (subtraction), * (multiplication) and / (division). We will work only with numbers consisting of one digit (0 through 9).

We will now devise a systematic approach to calculate the result of the expression

    1 + 2 * 3 - 4

This is traditionally done by reading the input string on a character by character basis. Initially, the read pointer is set at the beginning of the string, just before the number 1:

    1 + 2 * 3 - 4
    ^

We now proceed by reading the first character (or code), which happens to be 1. This is not an operator so we cannot calculate anything yet. We must store the 1 we just read away for later use, and we do so by creating a stack (a last-in-first-out queue abstraction) and placing 1 on it. We illustrate this by drawing a vertical line between the items on the stack (on the left) and the items on the input stream (on the right):

    1 | + 2 * 3 - 4
        ^

The read pointer is now at the + operator. This operator needs two operands, only one of which is known at this time. So all we can do is store the + on the stack and move the read pointer forward one position.

    1 + | 2 * 3 - 4
          ^
The next character read is 2. We must now resist the temptation to combine this new operand with the operator and operand already on the stack and evaluate 1 + 2, since the rules of precedence dictate that we must evaluate 2 * 3 first, and then add the result to 1. Therefore, we place (shift) the value 2 on the stack:

    1 + 2 | * 3 - 4
            ^

We now read another operator (*) which needs two operands. We shift it onto the stack because the second operand is not yet known. The read pointer is once again moved to the right and we read the number 3. This number is also placed on the stack and the read pointer now points to the operator -:

    1 + 2 * 3 | - 4
                ^

We are now in a position to fold up (reduce) some of the contents of the stack. The operator - is of lower priority than the operator *. According to the rules of precedence, we may now calculate 2 * 3, which happens to be the topmost three items on the stack (which, as you will remember, is a last-in-first-out data structure). We pop the last three items off the stack and calculate the result, which is shifted back onto the stack. This is the process of reduction.

    1 + 6 | - 4
            ^
We now compare the priority of the operator - with the priority of the operator + and find that, according to the rules of precedence, they have equal priority. This means we can either evaluate the current stack contents or continue shifting items onto the stack. In order to keep the contents of the stack to a minimum (consider what would happen if an endless number of + and - operators were encountered in succession) we reduce the contents of the stack first, by calculating 1 + 6:

    7 | - 4
        ^

The stack can be simplified no further, so we direct our attention to the next operator in the input stream (-). This operator needs two operands, so we must shift the read pointer still further to the right:

    7 - 4 |
          ^

We have now reached the end of the stream but are able to reduce the contents of the stack to a final result. The expression 7 - 4 is evaluated, yielding 3. Evaluation of the entire expression 1 + 2 * 3 - 4 is now complete and the algorithm used in the process is simple. There are a couple of interesting points:
1. Since the tokens already read from the input stream are placed on a stack to wait for evaluation, the operations shift and reduce are in fact equivalent to the stack operations push and pop.

2. The relative precedence of the operators encountered in the input stream determines the order in which the contents of the stack are evaluated.
Operators not only have priority, but also associativity. Consider the expression

    1 - 2 - 3

The order in which the two operators are evaluated is significant, as the following two possible orders show:

    (1 - 2) - 3 = -4
    1 - (2 - 3) = 2

Of course, the correct answer is -4 and we may conclude that the - operator associates to the left. There are also (but fewer) operators that associate to the right, like the "to the power of" operator (^):

    (2^3)^2 = 8^2 = 64    (incorrect)
    2^(3^2) = 2^9 = 512   (correct)

A final class of operators is nonassociative, like +:

    (1 + 4) + 3 = 5 + 3 = 8
    1 + (4 + 3) = 1 + 7 = 8

Such operators may be evaluated either to the left or to the right; it does not really matter. In compiler construction, non-associative operators are often treated as left-associative operators for simplicity.
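The two evaluation orders above correspond to a left fold and a right fold over the list of operands. A short Python sketch (our own illustration; functools.reduce supplies the left fold):

```python
from functools import reduce

xs = [1, 2, 3]

# Left-associative evaluation: ((1 - 2) - 3)
left = reduce(lambda a, b: a - b, xs)

# Right-associative evaluation: (1 - (2 - 3)), written out by hand
right = xs[0] - (xs[1] - xs[2])

assert left == -4   # the conventional answer for subtraction
assert right == 2
```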
The importance of priority and associativity in the evaluation of mathematical expressions leads to the observation that an operator priority list is required by the interpreter. The following table could be used:

    operator  priority  associativity
    ^         1         right
    *         2         left
    /         2         left
    +         3         left
    -         3         left
The parentheses ( and ) can also be considered operators, with the highest priority (and could therefore be added to the priority list). At this point, the priority relation is still incomplete. We also need invisible markers to indicate the beginning and end of an expression. The begin-marker [ should be of the lowest priority (in order to cause every other operator that gets shifted onto an otherwise empty stack not to evaluate). The end-marker ] should be of even lower priority (just lower than [) for the same reasons. The new, full priority relation is then:

    { [, ] } < { +, - } < { *, / } < { ^ }
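The complete shift-reduce procedure described above can be sketched in a few dozen lines of Python. This is our own minimal illustration, not code from the Inger compiler: the function names and the numeric priority values are invented, and for simplicity both markers share the lowest priority here.

```python
# Minimal shift-reduce evaluator for one-digit expressions
# (illustrative sketch only).

PRIO = {'[': 0, ']': 0, '+': 1, '-': 1, '*': 2, '/': 2, '^': 3}
RIGHT_ASSOC = {'^'}

APPLY = {
    '+': lambda a, b: a + b,
    '-': lambda a, b: a - b,
    '*': lambda a, b: a * b,
    '/': lambda a, b: a / b,
    '^': lambda a, b: a ** b,
}

def evaluate(expr):
    # Append the invisible end marker; the begin marker starts the stack.
    tokens = [c for c in expr if not c.isspace()] + [']']
    stack = ['[']
    for tok in tokens:
        if tok.isdigit():
            stack.append(int(tok))      # shift an operand
            continue
        # Reduce while the operator already on the stack binds at least
        # as tightly as the incoming one (strictly tighter for the
        # right-associative ^).
        while len(stack) >= 3 and stack[-2] in APPLY:
            top = stack[-2]
            if PRIO[top] > PRIO[tok] or (PRIO[top] == PRIO[tok]
                                         and top not in RIGHT_ASSOC):
                b = stack.pop(); op = stack.pop(); a = stack.pop()
                stack.append(APPLY[op](a, b))   # reduce
            else:
                break
        if tok == ']':                  # end marker: reduction complete
            break
        stack.append(tok)               # shift the operator
    return stack[-1]
```

Running evaluate("1 + 2 * 3 - 4") reproduces the walkthrough above and yields 3; evaluate("2 ^ 3 ^ 2") yields 512, since ^ associates to the right.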
The language discussed in our example supports only one-digit numbers. In order to support numbers of arbitrary length while still reading one digit at a time and working with the stack-based shift-reduce approach, we could introduce a new implicit concatenation operator (written here as .):

    1 . 2 = 1 * 10 + 2 = 12
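A sketch of this concatenation rule in Python (our own illustration): reading one digit at a time, each new digit d turns the value read so far, v, into v * 10 + d.

```python
def read_number(digits):
    # Build up a multi-digit number one character at a time.
    value = 0
    for d in digits:
        value = value * 10 + int(d)  # implicit concatenation: v . d
    return value

assert read_number("12") == 12
assert read_number("2003") == 2003
```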
Numerous other problems accompany the introduction of numbers of arbitrary length, which will not be discussed here (but most certainly in the rest of this book). This concludes the simple interpreter which we have crafted by hand. In the remaining chapters, you will learn how an actual compiler may be built using standard methods and techniques, which you can apply to your own programming language.
Bibliography

[1] H. Meijer: Inleiding Vertalerbouw, University of Nijmegen, Subfaculty of Computer Science, 2002.

[2] M.J. Scott: Programming Language Pragmatics, Morgan Kaufmann Publishers, 2000.
Chapter 2

Compiler History

This chapter gives an overview of compiler history. Programming languages can roughly be divided into three classes: procedural or imperative programming languages, functional programming languages and object-oriented programming languages. Compilers exist for all three classes and each class has its own quirks and specialties.
2.1 Procedural Programming

The first programming languages evolved from machine code. Instead of writing numbers to specify addresses and instructions, you could write symbolic names. The computer executes this sequence of instructions when the user runs the program. This style of programming is known as procedural or imperative programming.

Some of the first procedural languages include:

• FORTRAN, created in 1957 by IBM to produce programs to solve mathematical problems. FORTRAN is short for formula translation.

• Algol 60, created in the late fifties with the goal of providing a "universal programming language". Even though the language was not widely used, its syntax became the standard language for describing algorithms.

• COBOL, a "data processing" language developed by Sammett that introduced many new data types and implicit type conversion.

• PL/I, a programming language that would combine the best features of FORTRAN, Algol and COBOL.

• Algol 68, the successor to Algol 60. Also not widely used, though the ideas it introduced have been widely imitated.

• Pascal, created by Wirth to demonstrate that a powerful programming language can be simple too, as opposed to the complex Algol 68, PL/I and others.
• Modula-2, also created by Wirth, as an improvement to Pascal with modules as its most important new feature.

• C, designed by Ritchie as a low-level language mainly for the task of system programming. C became very popular because UNIX was very popular and heavily depended on it.

• Ada, a large and complex language created by Whitaker and one of the latest attempts at designing a procedural language.
2.2 Functional Programming

Functional programming is based on the abstract model of computation that Turing introduced, known as the Turing Machine. The theory of recursive functions that Kleene and Church introduced also plays an important role in functional programming. The big difference from procedural programming is the insight that everything can be done with expressions, as opposed to commands.

Some important functional programming languages:

• LISP, the language that introduced functional programming. Developed by John McCarthy in 1958.

• Scheme, a language with a syntax and semantics very similar to LISP, but simpler and more consistent. Designed by Guy L. Steele Jr. and Gerald Jay Sussman.

• SASL, short for St. Andrew's Symbolic Language. It was created by David Turner and has an Algol-like syntax.

• SML, designed by Milner, Tofte and Harper as a "metalanguage".
2.3 Object Oriented Programming

Object oriented programming is entirely focused on objects, not functions. It has some major advantages over procedural programming. Well-written code consists of objects that keep their data private and accessible only through certain methods (the interface); this concept is known as encapsulation. Another important object oriented programming concept is inheritance: a mechanism to have objects inherit state and behaviour from their superclass.

Important object oriented programming languages:

• Simula, developed in Norway by Kristen Nygaard and Ole-Johan Dahl. A language to model systems, which are collections of interacting processes (objects), which in turn are represented by multiple procedures.

• Smalltalk, originated with Alan Kay's ideas on computers and programming. It is influenced by LISP and Simula.

• CLU, a language introduced by Liskov and Zilles to support the idea of information hiding (the interface of a module should be public, while the implementation remains private).
• C++, created at Bell Labs by Bjarne Stroustrup as a programming language to replace C. It is a hybrid language: it supports both imperative and object oriented programming.

• Eiffel, a strictly object oriented language which is strongly focused on software engineering.

• Java, an object oriented programming language with a syntax which looks like C++, but much simpler. Java is compiled to byte code which makes it portable. This is also the reason it became very popular for web development.

• Kevo, a language based on prototypes instead of classes.
2.4 Timeline

In this section, we give a compact overview of the timeline of compiler construction. As described in the overview article [4], the conception of the first computer language goes back as far as 1946. In this year (or thereabouts), Konrad Zuse, a German engineer working alone while hiding out in the Bavarian Alps, develops Plankalkul. He applies the language to, among other things, chess. Not long after that, the first compiled language appears: Short Code, which is the first computer language actually used on an electronic computing device. It is, however, a "hand-compiled" language.
In 1951, Grace Hopper, working for Remington Rand, begins design work on the first widely known compiler, named A-0. When the language is released by Rand in 1957, it is called MATH-MATIC. Less well known is the fact that almost simultaneously, a rudimentary compiler was developed at a much less professional level: Alick E. Glennie, in his spare time at the University of Manchester, devises a compiler called AUTOCODE.
A few years after that, in 1957, the world-famous programming language FORTRAN (FORmula TRANslation) is conceived. John Backus (responsible for his Backus-Naur Form for syntax specification) leads the development of FORTRAN and later on works on the ALGOL programming language. The publication of FORTRAN was quickly followed by FORTRAN II (1958), which supported subroutines (a major innovation at the time, giving birth to the concept of modular programming).

Also in 1958, John McCarthy at M.I.T. begins work on LISP (LISt Processing), the precursor of (almost) all functional programming languages we know today. This is also the year in which the ALGOL programming language appears (at least, the specification). The specification of ALGOL does not describe how data will be input or output; that is left to the individual implementations.

1959 was another year of much innovation. LISP 1.5 appears and the functional programming paradigm is settled. Also, COBOL is created by the Conference on Data Systems and Languages (CODASYL). In the next year, the first actual implementation of ALGOL appears (ALGOL 60). It is the root of the family tree that will ultimately produce the likes of Pascal by Niklaus Wirth. ALGOL goes on to become the most popular language in Europe in the mid- to late 1960s.
Sometime in the early 1960s, Kenneth Iverson begins work on the language that will become APL (A Programming Language). It uses a specialized character set that, for proper use, requires APL-compatible I/O devices. In 1962, Iverson publishes a book on his new language (titled, aptly, A Programming Language). 1962 is also the year in which FORTRAN IV appears, as well as SNOBOL (StriNg-Oriented symBOlic Language) and associated compilers.

In 1963, the new language PL/I is conceived. This language will later form the basis for many other languages. In the year after, APL/360 is implemented and at Dartmouth University, professors John G. Kemeny and Thomas E. Kurtz invent BASIC. The first implementation is a compiler. The first BASIC program runs at about 4:00 a.m. on May 1, 1964.
Languages start appearing rapidly now: 1965, SNOBOL3; 1966, FORTRAN 66 and LISP 2. Work begins on LOGO at Bolt, Beranek, & Newman. The team is headed by Wally Fuerzeig and includes Seymour Papert. LOGO is best known for its "turtle graphics". Lest we forget: 1967, SNOBOL4.

In 1968, the aptly named ALGOL 68 appears. This new language is not altogether a success, and some members of the specifications committee (including C.A.R. Hoare and Niklaus Wirth) protest its approval. ALGOL 68 proves difficult to implement. Wirth begins work on his new language Pascal in this year, which also sees the birth of ALTRAN, a FORTRAN variant, and the official definition of COBOL by the American National Standards Institute (ANSI). Compiler construction attracts a lot of interest: in 1969, 500 people attend an APL conference at IBM's headquarters in Armonk, New York. The demands for APL's distribution are so great that the event is later referred to as "The March on Armonk".
Sometime in the early 1970s, Charles Moore writes the first significant programs in his new language, Forth. Work on Prolog begins about this time. Also sometime in the early 1970s, work on Smalltalk begins at Xerox PARC, led by Alan Kay. Early versions will include Smalltalk-72, Smalltalk-74, and Smalltalk-76. An implementation of Pascal appears on a CDC 6000-series computer. Icon, a descendant of SNOBOL4, appears.
- Remember 1946? In 1972, the manuscript for Konrad Zuse’s Plankalkul (see
- 1946) is finally published. In the same year, Dennis Ritchie and Brian Kernighan
- produce C. The definitive reference manual for it will not appear until 1974.
- The first implementation of Prolog – by Alain Colmerauer and Phillip Rous-
- sel – appears. Three years later, in 1975, Tiny BASIC by Bob Albrecht and
- Dennis Allison (implementation by Dick Whipple and John Arnold) runs on a
- microcomputer in 2 KB of RAM. A 4 KB machine is sizable, leaving 2 KB
- available for the program. Bill Gates and Paul Allen write a version of BA-
- SIC that they sell to MITS (Micro Instrumentation and Telemetry Systems)
- on a per-copy royalty basis. MITS is producing the Altair, an 8080-based mi-
- crocomputer. Also in 1975, Scheme, a LISP dialect by G.L. Steele and G.J.
- Sussman, appears. Pascal User Manual and Report by Jensen and Wirth (also
- extensively used in the conception of Inger) is published.
- B.W. Kernighan describes RATFOR – RATional FORTRAN. It is a pre-
- processor that allows C-like control structures in FORTRAN. RATFOR is used
- in Kernighan and Plauger’s “Software Tools,” which appears in 1976. In that
- same year, the Design System Language, a precursor to PostScript (which was
- not developed until much later), appears.
- In 1977, ANSI defines a standard for MUMPS: the Massachusetts General
- Hospital Utility Multi-Programming System. Used originally to handle medical
- records, MUMPS recognizes only a string data type; it is later renamed M. The
- design competition (ordered by the Department of Defense) that will produce
- Ada begins. A team led by Jean Ichbiah will win the competition. Also,
- sometime in the late 1970s, Kenneth Bowles produces UCSD Pascal, which
- makes Pascal available on PDP-11 and Z80-based (remember the ZX Spectrum)
- computers and thus for “home use”. Niklaus Wirth begins work on Modula,
- forerunner of Modula-2 and successor to Pascal.
- The text-processing language AWK (after the designers: Aho, Weinberger
- and Kernighan) becomes available in 1978. So does the ANSI standard for
- FORTRAN 77. Two years later, the first “real” implementation of Smalltalk
- (Smalltalk-80) appears. So does Modula-2. Bjarne Stroustrup develops “C With
- Classes”, which will eventually become C++.
- In 1981, design begins on Common LISP, a version of LISP that must unify
- the many different dialects in use at the time. Japan begins the “Fifth Gener-
- ation Computer System” project. The primary language is Prolog. In the next
- year, the International Standards Organisation (ISO) publishes the Pascal standard.
- PostScript is published (after DSL).
- The famous book on Smalltalk: Smalltalk-80: The Language and Its Imple-
- mentation by Adele Goldberg is published. Ada appears, the language named
- after Lady Augusta Ada Byron, Countess of Lovelace and daughter of the En-
- glish poet Byron. She has been called the first computer programmer because
- of her work on Charles Babbage’s analytical engine. In 1983, the Department
- of Defense (DoD) directs that all new “mission-critical” applications be written
- in Ada.
- In late 1983 and early 1984, Microsoft and Digital Research both release the
- first C compilers for microcomputers. The use of compilers by back-bedroom
- programmers becomes almost feasible. In July, the first implementation of
- C++ appears. It is in 1984 that Borland produces its famous Turbo Pascal. A
- reference manual for APL2 appears, an extension of APL that permits nested
- arrays.
- An important year for computer languages is 1985. It is the year in which
- Forth controls the submersible sled that locates the wreck of the Titanic.
- Methods, a line-oriented Smalltalk for personal computers, is introduced. Also, in
- 1986, Smalltalk/V appears – the first widely available version of Smalltalk
- for microcomputers. Apple releases Object Pascal for the Mac, greatly popular-
- izing the Pascal language. Borland extends its “Turbo” product line with Turbo
- Prolog.
- Charles Duff releases Actor, an object-oriented language for developing Mi-
- crosoft Windows applications. Eiffel, an object-oriented language, appears. So
- does C++. Borland produces the fourth incarnation of Turbo Pascal (1987). In
- 1988, the specification of CLOS (Common LISP Object System) is finally pub-
- lished. Wirth finishes Oberon, his follow-up to Modula-2, his third language so
- far.
- In 1989, the ANSI specification for C is published, leveraging the already
- popular language even further. C++ 2.0 arrives in the form of a draft refer-
- ence manual. The 2.0 version adds features such as multiple inheritance (not
- approved by everyone) and pointers to members. A year later, the Annotated
- C++ Reference Manual by Bjarne Stroustrup is published, adding templates
- and exception-handling features. FORTRAN 90 includes such new elements as
- case statements and derived types. Kenneth Iverson and Roger Hui present J
- at the APL90 conference.
- Dylan – named for Dylan Thomas – an object-oriented language resembling
- Scheme, is released by Apple in 1992. A year later, ANSI releases the X3J4.1
- technical report – the first-draft proposal for object-oriented COBOL. The stan-
- dard is expected to be finalized in 1997.
- In 1994, Microsoft incorporates Visual Basic for Applications into Excel and
- in 1995, ISO accepts the 1995 revision of the Ada language. Called Ada 95, it
- includes OOP features and support for real-time systems.
- This concludes the compact timeline of the evolution of programming lan-
- guages. Of course, in the present day, another revolution is taking place, in the
- form of the Microsoft .NET platform. This platform is worth an entire book
- unto itself, and much literature is in fact already available. We will not discuss
- the .NET platform and the common language specification any further in this
- book. It is now time to move on to the first part of building our own compiler.
- Bibliography
- [1] T. Dodd: An Advanced Logic Programming Language - Prolog-2 Encyclo-
- pedia, Blackwell Scientific Publications Ltd., 1990.
- [2] M. Looijen: Grepen uit de Geschiedenis van de Automatisering, Kluwer
- Bedrijfswetenschappen, Deventer, 1992.
- [3] G. Moody: Rebel Code - Inside Linux and the Open Source Revolution,
- Perseus Publishing, 2001.
- [4] N.N.: A Brief History of Programming Languages, BYTE Magazine, 20th
- anniversary issue, 1995.
- [5] P.H. Salus: Handbook of Programming Languages, Volume I: Object-
- oriented Programming Languages, Macmillan Technical Publishing, 1998.
- [6] P.H. Salus: Handbook of Programming Languages, Volume II: Imperative
- Programming Languages, Macmillan Technical Publishing, 1998.
- [7] P.H. Salus: Handbook of Programming Languages, Volume III: Little Lan-
- guages and Tools, Macmillan Technical Publishing, 1998.
- [8] P.H. Salus: Handbook of Programming Languages, Volume IV: Functional
- and Logic Programming Languages, Macmillan Technical Publishing, 1998.
- Part I
- Inger
- Chapter 3
- Language Specification
- 3.1 Introduction
- This chapter gives a detailed introduction to the Inger language. The reader is
- assumed to have some familiarity with the concept of a programming language,
- and some experience with mathematics.
- To give the reader an introduction to programming in general, we cite a
- short fragment of the introduction to the PASCAL User Manual and Report by
- Niklaus Wirth [7]:
- An algorithm or computer program consists of two essential parts, a
- description of the actions which are to be performed, and a descrip-
- tion of the data, which are manipulated by these actions. Actions are
- described by so-called statements, and data are described by so-called
- declarations and definitions.
- Inger provides language constructs (declarations) to define the data a pro-
- gram requires, and numerous ways to manipulate that data. In the next sections,
- we will explore Inger in some detail.
- 3.2 Program Structure
- A program consists of one or more named modules, all of which contribute data
- or actions to the final program. Every module resides in its own source file. The
- best way to get to know Inger is to examine a small module source file. The
- program “factorial” (listing 3.1) calculates the factorial of the number 6. The
- output is 6! = 720.
- All modules begin with the module name, which can be any name the pro-
- grammer desires. A module then contains zero or more functions, which en-
- capsulate (parts of) algorithms, and zero or more global variables (global data).
- The functions and data declarations can occur in any order, which is made
- clear in a syntax diagram (figure 3.1). By starting from the left, one can trace
- /* factor.i - test program.
- Contains a function that calculates
- the factorial of the number 6.
- This program tests the while loop. */
- module test_module;
- factor : int n → int
- {
- int factor = 1;
- int i = 1;
- while( i <= n ) do
- {
- factor = factor ∗ i;
- i = i + 1;
- }
- return( factor );
- }
- start main: void → void
- {
- int f;
- f = factor ( 6 );
- }
- Listing 3.1: Inger Factorial Program
- the lines leading through boxes and rounded enclosures. Boxes represent ad-
- ditional syntax diagrams, while rounded enclosures contain terminal symbols
- (those actually written in an Inger program). A syntactically valid program is
- constructed by following the lines and always taking smooth turns, never sharp
- turns. Note that dotted lines are used to break a syntax diagram in half that is
- too wide to fit on the page.
- Figure 3.1: Syntax diagram for module
- Example 3.1 (Tracing a Syntax Diagram)
- As an example, we will show two valid programs that are generated by tracing
- the syntax diagram for module. These are not complete programs; they still
- contain the names of the additional syntax diagrams function and declaration
- that must be traced.
- module Program_One;
- Program One is a correct program that contains no functions or declarations.
- The syntax diagram for module allows this, because the loop leading through
- either function or declaration is taken zero times.
- module Program_Two;
- extern function;
- declaration ;
- function;
- Program Two is also correct. It contains two functions and one declaration.
- One of the functions is marked extern ; the keyword extern is optional, as the
- syntax diagram for module shows.
- □
- Syntax diagrams are a very descriptive way of writing down language syntax,
- but not very compact. We may also use Backus-Naur Form (BNF) to denote
- the syntax for the program structure, as shown in listing 3.2.
- In BNF, each syntax diagram is denoted using one or more lines. The
- line begins with the name of the syntax diagram (a nonterminal), followed
- by a colon. The contents of the syntax diagram are written after the colon:
- nonterminals, (which have their own syntax diagrams), and terminals, which
- module: module identifier ; globals.
- globals : ε.
- globals : global globals.
- globals : extern global globals.
- global: function.
- global: declaration.
- Listing 3.2: Backus-Naur Form for module
- are printed in bold. Since nonterminals may have syntax diagrams of their
- own, a single syntax diagram may be expressed using multiple lines of BNF.
- A line of BNF is also called a production rule. It provides information on how
- to “produce” actual code from a nonterminal. In the following example, we
- produce the programs “one” and “two” from the previous example using the
- BNF productions.
- Example 3.2 (BNF Derivations)
- Here is the listing from program “one” again:
- module Program_One;
- To derive this program, we start with the topmost BNF nonterminal,
- module . This is called the start symbol. There is only one production for this
- nonterminal:
- module: module identifier ; globals.
- We now replace the nonterminal module with the right hand side of this
- production:
- module −→ module identifier; globals
- Note that we have underlined the nonterminal to be replaced. In the new
- string we now have, there is a new nonterminal to replace: globals . There are
- multiple production rules for globals :
- globals : ε.
- globals : global globals.
- globals : extern global globals.
- Program “One” does not have any globals (declarations or functions), so we
- replace the nonterminal globals with the empty string (?). Finally, we replace
- the nonterminal identifier . We provide no BNF rule for this, but it suffices to
- say that we may replace identifier with any word consisting of letters, digits and
- underscores and starting with either a letter or an underscore:
- module
- −→ module identifier; globals
- −→ module Program_One; globals
- −→ module Program_One;
- And we have created a valid program! The above list of production rule ap-
- plications is called a derivation. A derivation is the application of production
- rules until there are no nonterminals left to replace. We now create a derivation
- for program “Two”, which contains two functions (one of which is extern , more
- on that later) and a declaration. We will not derive further than the function
- and declaration level, because these language structures will be explained in a
- subsequent section. Here is the listing for program “Two” again:
- module Program_Two;
- extern function;
- declaration ;
- function;
- module
- −→ module identifier; globals
- −→ module Program_Two; globals
- −→ module Program_Two; extern global globals
- −→ module Program_Two; extern function globals
- −→ module Program_Two; extern function global globals
- −→ module Program_Two; extern function declaration globals
- −→ module Program_Two; extern function declaration global globals
- −→ module Program_Two; extern function declaration function globals
- −→ module Program_Two; extern function declaration function
- And with the last replacement, we have produced the source code for program
- “Two”, exactly the same as in the previous example.
- □
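- The derivation steps above are mechanical enough to automate. The following Python sketch (our own illustration, not part of the Inger toolchain; the production indices and identifier spelling are assumptions) encodes the BNF productions for module and performs a leftmost derivation:

```python
# Leftmost derivation with the BNF productions for "module".
# Uppercase symbols are nonterminals; everything else is terminal text.
# This encoding is our own illustration, not part of the Inger compiler.
PRODUCTIONS = {
    "MODULE":  [["module", "IDENTIFIER", ";", "GLOBALS"]],
    "GLOBALS": [[], ["GLOBAL", "GLOBALS"], ["extern", "GLOBAL", "GLOBALS"]],
    "GLOBAL":  [["function"], ["declaration"]],
}

def derive(start, choices):
    """Rewrite the leftmost nonterminal once per step.

    An integer in `choices` selects a production alternative; a string
    fills in the IDENTIFIER nonterminal with a concrete name."""
    sentential = list(start)
    for choice in choices:
        # find the leftmost nonterminal (all-uppercase symbol)
        i = next(k for k, sym in enumerate(sentential) if sym.isupper())
        if isinstance(choice, str):
            body = [choice]
        else:
            body = PRODUCTIONS[sentential[i]][choice]
        sentential = sentential[:i] + body + sentential[i + 1:]
    return " ".join(sentential)

# Program "One": no globals at all (the empty alternative ends the loop).
program_one = derive(["MODULE"], [0, "Program_One", 0])
# Program "Two": extern function, then a declaration, then a function.
program_two = derive(["MODULE"], [0, "Program_Two", 2, 0, 1, 1, 1, 0, 0])
```

Each step replaces only the leftmost nonterminal, exactly as in the hand-written derivations above.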
- BNF is a somewhat rigid notation; it only allows the writer to make explicit
- the order in which nonterminals and terminals occur, but he must create addi-
- tional BNF rules to capture repetition and selection. For instance, the syntax
- diagram for module shows that zero or more data declarations or functions may
- appear in a program. In BNF, we show this by introducing a production rule
- called globals, which calls itself (is recursive). We also needed to create another
- production rule called global, which has two alternatives (function and declara-
- tion) to offer a choice. Note that globals has three alternatives. One alternative
- is needed to end the repetition of functions and declarations (this is denoted
- with an ε, meaning empty), and one alternative is used to include the keyword
- extern , which is optional.
- There is a more convenient notation called Extended Backus-Naur Form
- (EBNF), which allows the syntax diagram for module to be written like this:
- ( ) [ ]
- ! - + ~
- * & * /
- % + - >>
- << < <= >
- >= == != &
- ^ | && ||
- ? : = ,
- ; -> { }
- bool break case char
- continue default do else
- extern false float goto_considered_harmful
- if int label module
- return start switch true
- untyped while
- Table 3.1: Inger vocabulary
- module: module identifier ; { [ extern ] ( function | declaration ) }.
- In EBNF, we can use vertical bars (|) to indicate a choice, and brackets
- ([ and ]) to indicate an optional part. These symbols are called metasymbols;
- they are not part of the syntax being defined. We can also use the metasymbols
- ( and ) to enclose terminals and nonterminals so they may be used as a group.
- Braces ({ and }) are used to denote repetition zero or more times. In this book,
- we will use both EBNF and BNF. EBNF is short and clear, but BNF has some
- advantages which will become clear in chapter 5, Grammar.
- 3.3 Notation
- Like all programming languages, Inger has a number of reserved words, operators
- and delimiters (table 3.1). These words cannot be used for anything else than
- their intended purpose, which will be discussed in the following sections.
- One place where the reserved words may be used freely, along with any
- other words, is inside a comment. A comment is input text that is meant
- for the programmer, not for the compiler, which skips comments entirely. Comments
- are delimited by the special character combinations /* and */ and may span
- multiple lines. Listing 3.3 contains some examples of legal comments.
- The last comment in the example above starts with // and ends at the end
- of the line. This is a special form of comment called a single-line comment.
- Functions, constants and variables may be given arbitrary names, or identi-
- fiers by the programmer, provided reserved words are not used for this purpose.
- An identifier must begin with a letter or an underscore (_) to discern it from a
- number, and there is no limit to the identifier length (except physical memory).
- As a rule of thumb, 30 characters is a useful limit for the length of identifiers.
- Although an Inger compiler supports names much longer than that, more than
- 30 characters will make for confusing names which are too long to read. All
- /* This is a comment. */
- /* This is also a comment,
- spanning multiple
- lines. */
- /*
- * This comment is decorated
- * with extra asterisks to
- * make it stand out.
- */
- // This is a single-line comment.
- Listing 3.3: Legal Comments
- identifiers must be different, except when they reside in different scopes. Scopes
- will be discussed in greater detail later. We give a syntax diagram for identifiers
- in figure 3.2 and EBNF production rules for comparison:
- Figure 3.2: Syntax diagram for identifier
- identifier : ( _ | letter ) { letter | digit | _ }.
- letter : A | ... | Z | a | ... | z
- digit : 0 | ... | 9
- Example 3.3 (Identifiers)
- Valid identifiers include:
- x
- _GrandMastaD_
- HeLlO_wOrLd
- Some examples of invalid identifiers:
- 2day
- bool
- @a
- 2+2
- □
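- The identifier rules above can be sketched as a small recognizer. The regular expression below is our own encoding of the EBNF for identifier ; note that it does not reject reserved words such as bool , which the real lexer would handle separately:

```python
import re

# Recognizer for Inger identifiers, following the EBNF in this section:
# an identifier starts with a letter or underscore and continues with
# letters, digits or underscores. Reserved words are not filtered here.
IDENTIFIER_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def is_identifier(text):
    """Return True if `text` is lexically a valid Inger identifier."""
    return IDENTIFIER_RE.match(text) is not None
```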
- Of course, the programmer is free to choose wonderful names such as
- x234 . Even though the language allows this, such names are not very descriptive
- and the programmer is encouraged to choose better names that describe the
- purpose of variables.
- Inger supports two types of numbers: integer numbers (x ∈ Z) and floating
- point numbers (x ∈ R). Integer numbers consist of only digits, and are 32 bits
- wide. They have a very simple syntax diagram shown in figure 3.3. Integer
- numbers also include hexadecimal numbers, which are numbers with radix 16.
- Hexadecimal numbers are written using 0 through 9 and A through F as digits.
- The case of the letters is unimportant. Hexadecimal numbers must be prefixed
- with 0x to set them apart from ordinary integers. Inger can also work with
- binary numbers (numbers with radix 2). These numbers are written using only
- the digits 0 and 1 . Binary numbers must be postfixed with B or b to set them
- apart from other integers.
- Figure 3.3: Syntax diagram for integer
- Floating point numbers include a decimal separator (a dot) and an optional
- fractional part. They can be denoted using scientific notation (e.g. 12e-3 ). This
- makes their syntax diagram (figure 3.4) more involved than the syntax diagram
- for integers. Note that Inger always uses the dot as the decimal separator, and
- not the comma, as is customary in some locations.
- Example 3.4 (Integers and Floating Point Numbers)
- Some examples of valid integer numbers:
- 3 0x2e 1101b
- Some examples of invalid integer numbers (note that these numbers may be
- perfectly valid floating point numbers):
- 1a 0.2 2.0e8
- Figure 3.4: Syntax diagram for float
- Some examples of valid floating point numbers:
- 0.2 2.0e8 .34e-2
- Some examples of invalid floating point numbers:
- 2e-2 2a
- □
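- A minimal sketch of the integer literal rules described above, assuming exactly the three forms mentioned (plain decimal, 0x-prefixed hexadecimal, B/b-postfixed binary). This is an illustration, not the actual Inger lexer:

```python
import re

# Classifier for Inger integer literals: decimal digits, hexadecimal
# with an 0x prefix, or binary with a B/b postfix. A sketch of the
# lexical rules in this section, not the real Inger lexer.
DEC = re.compile(r"^[0-9]+$")
HEX = re.compile(r"^0x[0-9A-Fa-f]+$")
BIN = re.compile(r"^[01]+[Bb]$")

def is_integer_literal(text):
    """Return True if `text` matches one of the three integer forms."""
    return bool(DEC.match(text) or HEX.match(text) or BIN.match(text))
```

The floating point forms (which require a decimal dot) fall through all three patterns, matching the book's examples of invalid integers.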
- Alphanumeric information can be encoded in either single characters or
- strings. A single character must be enclosed within apostrophes ( ’ ) and a string
- must begin and end with double quotes ( ” ). Any and all characters may be
- used within a string or as a single character, as long as they are printable (control
- characters cannot be typed). If a control character must be used, Inger offers a
- way to escape ordinary characters to generate control characters, analogous to
- C. This is also the only way to include double quotes in a string, since they are
- normally used to start or terminate a string (and therefore confuse the compiler
- if not treated specially). See table 3.2 for a list of escape sequences and the
- special characters they produce.
- A string may not span multiple lines. Note that while whitespace such as
- spaces, tabs and end-of-line characters is normally used only to separate
- symbols and is otherwise ignored, whitespace within a string remains unchanged
- by the compiler.
- Example 3.5 (Sample Strings and Characters)
- Here are some sample single characters:
- ’b’ ’&’ ’7’ ’"’ ’’’
- Valid strings include:
- "hello, world" "123"
- "\r\n" "\"hi!\""
- Escape Sequence Special character
- \" "
- \' '
- \\ \
- \a Audible bell
- \b Backspace
- \Bnnnnnnnn Convert binary value to character
- \f Form feed
- \n Line feed
- \onnn Convert octal value to character
- \r Carriage return
- \t Horizontal tab
- \v Vertical tab
- \xnn Convert hexadecimal value to character
- Table 3.2: Escape Sequences
- □
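- The escape sequences of table 3.2 can be decoded mechanically. The Python sketch below is our own reading of the table (the Inger compiler's real implementation may differ); it handles the fixed-width \xnn, \onnn and \Bnnnnnnnn forms plus the single-character escapes:

```python
# Decoder for the escape sequences of table 3.2. Our own sketch of the
# semantics, not the Inger compiler's implementation.
SIMPLE = {'"': '"', "'": "'", "\\": "\\", "a": "\a", "b": "\b",
          "f": "\f", "n": "\n", "r": "\r", "t": "\t", "v": "\v"}

def unescape(text):
    """Expand the table 3.2 escape sequences found in `text`."""
    out, i = [], 0
    while i < len(text):
        if text[i] != "\\":
            out.append(text[i]); i += 1
        elif text[i + 1] == "x":                       # \xnn: hexadecimal
            out.append(chr(int(text[i + 2:i + 4], 16))); i += 4
        elif text[i + 1] == "o":                       # \onnn: octal
            out.append(chr(int(text[i + 2:i + 5], 8))); i += 5
        elif text[i + 1] == "B":                       # \Bnnnnnnnn: binary
            out.append(chr(int(text[i + 2:i + 10], 2))); i += 10
        else:                                          # single-char escapes
            out.append(SIMPLE[text[i + 1]]); i += 2
    return "".join(out)
```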
- This concludes the introduction to the notational conventions to which valid
- Inger programs must adhere. In the next section, we will discuss the concept of
- data (variables) and how data is defined in Inger.
- 3.4 Data
- Almost all computer programs operate on data, by which we mean numbers
- or text strings. At the lowest level, computers deal with data in the form of bits
- (binary digits, which have a value of either 0 or 1 ), which are difficult to manipulate.
- Inger programs can work at a higher level and offer several data abstractions
- that provide a more convenient way to handle data than through raw bits.
- The data abstractions in Inger are bool, char, float, int and untyped. All of
- these except untyped are scalar types, i.e. they are a subset of R. The untyped
- data abstraction is a very different phenomenon. Each of the data abstractions
- will be discussed in turn.
- 3.4.1 bool
- Inger supports so-called boolean 1 values and the means to work with them.
- Boolean values are truth values, either true or false . Variables of the boolean
- data type (keyword bool ) can only be assigned to using the keywords true or
- false , not 0 or 1 as other languages may allow.
- 1 In 1854, the mathematician George Boole (1815–1864) published An investigation into the
- Laws of Thought, on Which are founded the Mathematical Theories of Logic and Probabilities.
- Boole approached logic in a new way reducing it to a simple algebra, incorporating logic
- into mathematics. He pointed out the analogy between algebraic symbols and those that
- represent logical forms. It began the algebra of logic called Boolean algebra which now has
- wide applications in telephone switching and the design of modern computers. Boole’s work
- has to be seen as a fundamental step in today’s computer revolution.
- There is a special set of operators that work only with boolean values: see
- table 3.3. The result value of applying one of these operators is also a boolean
- value.
- Operator Operation
- && Logical conjunction (and)
- || Logical disjunction (or)
- ! Logical negation (not)
- A B | A && B
- F F | F
- F T | F
- T F | F
- T T | T
- A B | A || B
- F F | F
- F T | T
- T F | T
- T T | T
- A | !A
- F | T
- T | F
- Table 3.3: Boolean Operations and Their Truth Tables
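- The truth tables can be checked exhaustively with a few lines of Python, whose and , or and not behave like Inger's && , || and ! on boolean operands:

```python
# The boolean operators of table 3.3, checked over all operand values.
# Python's `and`, `or` and `not` match Inger's &&, || and ! on booleans.
def AND(a, b): return a and b   # &&
def OR(a, b):  return a or b    # ||
def NOT(a):    return not a     # !

# each row: (A, B, expected A && B, expected A || B)
ROWS = [
    (False, False, False, False),
    (False, True,  False, True),
    (True,  False, False, True),
    (True,  True,  True,  True),
]
all_rows_match = all(AND(a, b) == c and OR(a, b) == d for a, b, c, d in ROWS)
```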
- Some of the relational operators can be applied to boolean values, and all
- yield boolean return values. In table 3.4, we list the relational operators and
- their effect. Note that == and != can be applied to other types as well (not
- just boolean values), but will always yield a boolean result. The assignment
- operator = can be applied to many types as well. It will only yield a boolean
- result when used to assign a boolean value to a boolean variable.
- Operator Operation
- == Equivalence
- != Inequivalence
- = Assignment
- Table 3.4: Boolean relations
- 3.4.2 int
- Inger supports only one integral type, i.e. int . A variable of type int can store
- any n ∈ Z, as long as n is within the range the computer can store using its
- maximum word size. In table 3.5, we show the size of integers that can be stored
- using given maximum word sizes.
- Word size Integer Range
- 8 bits -128..127
- 16 bits -32768..32767
- 32 bits -2147483648..2147483647
- Table 3.5: Integer Range by Word Size
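- The ranges in table 3.5 follow from the two's-complement representation, which we assume here: an n-bit word stores the values -2^(n-1) through 2^(n-1) - 1.

```python
# Signed integer range for a given word size in bits, assuming the
# usual two's-complement representation: -2^(n-1) .. 2^(n-1) - 1.
def int_range(word_size_bits):
    low = -(2 ** (word_size_bits - 1))
    high = 2 ** (word_size_bits - 1) - 1
    return (low, high)
```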
- Inger supports only signed integers, hence the negative ranges in the table.
- Many operators can be used with integer types (see table 3.6), and all return a
- value of type int as well. Most of these operators are polymorphic : their return
- type corresponds to the type of their operands (which must be of the same
- type).
- Operator Operation
- - unary minus
- + unary plus
- ~ bitwise complement
- * multiplication
- / division
- % modulus
- + addition
- - subtraction
- >> bitwise shift right
- << bitwise shift left
- < less than
- <= less than or equal
- > greater than
- >= greater than or equal
- == equality
- != inequality
- & bitwise and
- ^ bitwise xor
- | bitwise or
- = assignment
- Table 3.6: Operations on Integers
- Of these operators, the unary minus ( - ), unary plus ( + ) and (unary) bit-
- wise complement ( ~ ) associate to the right (since they are unary) and the rest
- associates to the left, except assignment ( = ).
- The relational operators == , != , < , <= , >= and > have a boolean result value,
- even though they have operands of type int . Some operations, such as additions
- and multiplications, can overflow when their result value exceeds the maximum
- range of the int type. Consult table 3.5 for the maximum ranges. If a and b are
- integer expressions, the operation
- a op b
- will not overflow if (N is the integer range of a given system):
- 1. a op b ∈ N
- 2. a ∈ N
- 3. b ∈ N
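- These three conditions can be checked directly in a language with unbounded integers. The sketch below assumes a 32-bit int (see table 3.5); the helper names are our own:

```python
# Checking the three overflow conditions above for 32-bit Inger ints.
# Python integers are unbounded, so "a op b" can be evaluated exactly
# and each value tested against N, the 32-bit integer range.
INT_MIN, INT_MAX = -2**31, 2**31 - 1

def in_range(x):
    """True if x lies in N, the 32-bit signed integer range."""
    return INT_MIN <= x <= INT_MAX

def overflows(a, op, b):
    """True if a, b or a op b falls outside the 32-bit range."""
    return not (in_range(a) and in_range(b) and in_range(op(a, b)))
```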
- 3.4.3 float
- The float type is used to represent an element of R, although only a small part
- of R is supported, using 8 bytes. A subset of the operators that can be used
- with operands of type int can also be used with operands of type float (see table
- 3.7).
- Operator Operation
- - unary minus
- + unary plus
- * multiplication
- / division
- + addition
- - subtraction
- < less than
- <= less than or equal
- > greater than
- >= greater than or equal
- == equality
- != inequality
- = assignment
- Table 3.7: Operations on Floats
- Some of these operations yield a result value of type float , while others (the
- relational operators) yield a value of type bool . Note that Inger supports only
- floating point values of 8 bytes, while other languages also support 4-byte so-
- called float values (while 8-byte types are called double).
- 3.4.4 char
- Variables of type char may be used to store single unsigned bytes (8 bits) or
- single characters. All operations that can be performed on variables of type int
- may also be applied to operands of type char . Variables of type char may be
- initialized with actual characters, like so:
- char c = ’a’;
- All escape sequences from table 3.2 may be used to initialize a variable of
- type char , although only one at a time, since a char represents only a single
- character.
- 3.4.5 untyped
- In contrast to all the types discussed so far, the untyped type does not have a
- fixed size. untyped is a polymorphic type, which can be used to represent any
- other type. There is one catch: untyped must be used as a pointer.
- Example 3.6 (Use of Untyped)
- The following code is legal:
- untyped ∗a; untyped ∗∗b;
- but this code is not:
- untyped p;
- □
- This example introduces the new concept of a pointer. Any type may have
- one or more levels of indirection, which is denoted using one or more asterisks
- ( * ). For an in-depth discussion on pointers, consult The C Programming
- Language [1] by Kernighan and Ritchie.
- 3.5 Declarations
- All data and functions in a program must have a name, so that the programmer
- can refer to them. No module may contain or refer to more than one function
- with the same name; every function name must be unique. Giving a variable
- or a function in the program a type (in case of a function: input types and
- an output type) and a name is called declaring the variable or function. All
- variables must be declared before they can be used, but functions may be used
- before they are defined.
- An Inger program consists of a number of declarations of either global vari-
- ables or functions. The variables are called global because they are declared at
- the outermost scope of the program. Functions can have their own variables,
- which are then called local variables and reside within the scope of the function.
- In listing 3.4, three global variables are declared and accessed from within the
- functions f and g . This code demonstrates that global variables can be accessed
- from within any function.
- Local variables can only be accessed from within the function in which they
- are declared. Listing 3.5 shows a faulty program, in which variable i is accessed
- from a scope in which it cannot be seen.
- Variables are declared by naming their type ( bool , char , float , int or untyped ),
- their level of indirection, their name and finally their array size. This structure
- is shown in a syntax diagram in figure 3.5, and in the BNF production rules in
- listing 3.6.
- The syntax diagram and BNF productions show that it is possible to declare
- multiple variables using one declaration statement, and that variables can be
- initialized in a declaration. Consult the following example to get a feel for
- declarations:
- Example 3.7 (Examples of Declarations)
- char ∗a, b = ’Q’, ∗c = 0x0;
- int number = 0;
- bool completed = false, found = true;
- Here the variable a is a pointer to char , which is not initialized. If no initial-
- ization value is given, Inger will initialize a variable to 0. b is of type char and
- is initialized to ’Q’ , and c is a pointer to data of type char . This pointer is ini-
- tialized to 0x0 ( null ) – note that this does not initialize the char data to which
- c points. The variable number is of type int and is initialized to 0. The
- variables completed and found are of type bool and are initialized to false and true
- respectively.
- □
- /*
- * globvar.i - demonstration
- * of global variables.
- */
- module globvar;
- int i;
- bool b;
- char c;
- g: void → void
- {
- i = 0;
- b = false;
- c = ’b’;
- }
- start f : void → void
- {
- i = 1;
- b = true;
- c = ’a’;
- }
- Listing 3.4: Global Variables
- /*
- * locvar.i - demonstration
- * of local variables.
- */
- module locvar;
- g: void → void
- {
- i = 1; /* will not compile */
- }
- start f : void → void
- {
- int i = 0;
- }
- Listing 3.5: Local Variables
- Figure 3.5: Declaration Syntax Diagram
- declarationblock: type declaration { , declaration }.
- declaration : { ∗ } identifier { [ intliteral ] } [ = expression ].
- type: bool | char | float | int | untyped.
- Listing 3.6: BNF for Declaration
- 3.6 Action
- A computer program is not worth much if it does not contain some instruc-
- tions (statements) that execute actions operating on the data that the program
- declares. Actions come in two categories: simple statements and compound
- statements.
- 3.6.1 Simple Statements
- There are many possible action statements, but there is only one statement
- that actually has a side effect, i.e. manipulates data: this is the assignment
- statement, which stores a value in a variable. The form of an assignment is:
- <variable> = <expression>
- = is the assignment operator. The variable to which an expression value
- is assigned is called the left hand side or lvalue, and the expression which is
- assigned to the variable is called the right hand side or rvalue.
- The expression on the right hand side can consist of any (valid) combination
- of constant values (numbers), variables, function calls and operators. The ex-
- pression is evaluated to obtain a value to assign to the variable on the left hand
- side of the assignment. Expression evaluation is done using the well-known rules
- of mathematics, with regard to operator precedence and associativity. Consult
- table 3.8 for all operators and their priority and associativity.
- Example 3.8 (Expressions)
- 2 ∗ 3 − 4 ∗ 5 = (2 ∗ 3) − (4 ∗ 5) = −14
- 15 / 4 ∗ 4 = (15 / 4) ∗ 4 = 12
- 80 / 5 / 3 = (80 / 5) / 3 = 5
- 4 / 2 ∗ 3 = (4 / 2) ∗ 3 = 6
- 9.0 ∗ 3 / 2 = (9.0 ∗ 3) / 2 = 13.5
- ?
- The examples show that a division of two integers results in an integer type
- (rounded down), while if either one (or both) of the operands to a division is of
- type float , the result will be float .
- Any type of variable can be assigned to, so long as the expression type
- and the variable type are equivalent. Assignments may also be chained, with
- multiple variables being assigned the same expression with one statement. The
- following example shows some valid assignments:
- Example 3.9 (Expressions)
- int a, b;
- int c = a = b = 2 + 1;
- int my_sum = a ∗ b + c; /* 12 */
- ?
- All statements must be terminated with a semicolon ( ; ).
- Operator Priority Associativity Description
- () 1 L function application
- [] 1 L array indexing
- ! 2 R logical negation
- - 2 R unary minus
- + 2 R unary plus
- ~ 3 R bitwise complement
- * 3 R indirection
- & 3 R referencing
- * 4 L multiplication
- / 4 L division
- % 4 L modulus
- + 5 L addition
- - 5 L subtraction
- >> 6 L bitwise shift right
- << 6 L bitwise shift left
- < 7 L less than
- <= 7 L less than or equal
- > 7 L greater than
- >= 7 L greater than or equal
- == 8 L equality
- != 8 L inequality
- & 9 L bitwise and
- ^ 10 L bitwise xor
- | 11 L bitwise or
- && 12 L logical and
- || 12 L logical or
- ?: 13 R ternary if
- = 14 R assignment
- Table 3.8: Operator Precedence and Associativity
- 3.6.2 Compound Statements
- A compound statement is a group of zero or more statements contained within
- braces ( { and } ). These statements are executed as a group, in the sequence in
- which they are written. Compound statements are used in many places in Inger
- including the body of a function, the action associated with an if -statement and
- a while -statement. The form of a compound statement is:
- block: { code } .
- code: ε.
- code: block code.
- code: statement code.
- Figure 3.6: Syntax diagram for block
- The BNF productions show that a compound statement, or block, may
- contain zero (empty), one or more statements, and may contain other blocks
- as well. In the following example, the function f has a block of its own (the
- function body), which contains another block, which finally contains a single
- statement (a declaration).
- Example 3.10 (Compound Statement)
- module compound;
- start f : void → void
- {
- {
- int a = 1;
- }
- }
- ?
- 3.6.3 Repetitive Statements
- Compound statements (including compound statements with only one statement
- in their body) can be wrapped inside a repetitive statement to cause them to be
- executed multiple times. Some programming languages come with multiple
- flavors of repetitive statements; Inger has only one: the while statement.
- The while statement has the following BNF productions (also consult figure
- 3.7 for the accompanying syntax diagram):
- statement: while ( expression ) do block.
- Figure 3.7: Syntax diagram for while
- The expression between the parentheses must be of type bool . Before exe-
- cuting the compound statement contained in the block , the repetitive statement
- checks that expression evaluates to true. After the code contained in block has
- executed, the repetitive statement evaluates expression again and so on until the
- value of expression is false. If the expression is initially false, the compound
- statement is executed zero times.
- Since the expression between parentheses is evaluated each time the repeti-
- tive statement (or loop) is executed, it is advised to keep the expression simple
- so as not to consume too much processing time, especially in longer loops.
- The demonstration program in listing 3.7 was taken from the analogous The
- while statement section of Wirth’s PASCAL User Manual ([7]) and translated
- to Inger.
- The printint function and the #import directive will be discussed in a later
- section. The output of this program is 2.9287 , printed on the console. It should
- be noted that the compound statement that the while statement repeats must be
- contained in braces; it cannot be specified by itself (as it can be in the C
- programming language).
- Inger provides some additional control statements that may be used in con-
- junction with while : break and continue . The keyword break may be used to
- prematurely leave a while -loop. It is often used from within the body of an if
- statement, as shown in listings 3.8 and 3.9.
- The continue statement is used to abort the current iteration of a loop and
- continue from the top. Its use is analogous to break : see listings 3.10 and 3.11.
- The use of break and continue is discouraged, since they tend to make a
- program less readable.
- 3.6.4 Conditional Statements
- Not every statement must be executed. The choice of statements to execute
- can be made using conditional statements. Inger provides the if and switch
- statements.
- The if statement
- An if statement consists of a boolean expression, and one or two compound
- statements. If the boolean expression is true, the first compound statement is
- /*
- * Compute h(n) = 1 + 1/2 + 1/3 + ... + 1/n
- * for a known n.
- */
- module while_demo;
- #import ”printint.ih”
- start main: void → void
- {
- int n = 10;
- float h = 0;
- while( n > 0 ) do
- {
- h = h + 1.0 / n;
- n = n − 1;
- }
- printint ( h );
- }
- Listing 3.7: The While Statement
- int a = 10;
- while( a > 0 )
- {
- if ( a == 5 )
- {
- break;
- }
- printint ( a );
- a = a − 1;
- }
- Listing 3.8: The Break Statement
- 10
- 9
- 8
- 7
- 6
- Listing 3.9: The Break Statement (output)
- int a = 10;
- while( a > 0 )
- {
- if ( a % 2 == 1 )
- {
- a = a − 1;
- continue;
- }
- printint ( a );
- a = a − 1;
- }
- Listing 3.10: The Continue Statement
- 10
- 8
- 6
- 4
- 2
- Listing 3.11: The Continue Statement (output)
- executed. If the boolean expression evaluates to false, the second compound
- statement (if any) is executed. Remember that compound statements need
- not contain multiple statements; they can contain a single statement or no
- statements at all.
- The above definition of the if conditional statement has the following BNF
- production associated with it (also consult figure 3.8 for the equivalent syntax
- diagram):
- statement: if ( expression ) block elseblock.
- elseblock: ε.
- elseblock: else block.
- Figure 3.8: Syntax diagram for if
- The productions for the elseblock show that the if statement may contain a
- second compound statement (which is executed if the boolean expression
- argument evaluates to false) or no second statement at all. If there is a second
- block, it must be prefixed with the keyword else .
- As with the while statement, it is not possible to have the if statement execute
- single statements, only blocks contained within braces. This approach solves the
- dangling else problem from which the Pascal programming language suffers.
- The “roman numerals” program (listing 3.12, copied from [7] and translated
- to Inger) illustrates the use of the if and while statements.
- The case statement
- The if statement only allows selection from two alternatives. If more alterna-
- tives are required, the else blocks must contain secondary if statements up to
- the required depth (see listing 3.14 for an example). Inger also provides the
- switch statement, which consists of an expression (the selector) and a list of
- alternative cases. The cases are labelled with numbers (integers); the switch
- statement evaluates the selector expression (which must evaluate to type
- integer) and executes the alternative whose label matches the result. If no case
- has a matching label, switch executes the default case (which is required to be
- present). The following BNF defines the switch statement more precisely:
- statement: switch ( expression ) { cases defaultblock } .
- defaultblock: default block.
- cases: ε.
- cases: case intliteral block cases.
- This is also shown in the syntax diagram in figure 3.9.
- Figure 3.9: Syntax diagram for switch
- It should be clear that the use of the switch statement in listing 3.15 is much
- clearer than the multiway if statement from listing 3.14.
- There cannot be duplicate case labels in a switch statement, because the
- compiler would not know which label to jump to. Also, the order of the case labels
- is of no concern.
- 3.6.5 Flow Control Statements
- Flow control statements are statements that cause the execution of a program
- to stop, move to another location in the program, and continue. Inger offers
- one such statement: the goto_considered_harmful statement. The name of this
- /* Write roman numerals for the powers of 2. */
- module roman_numerals;
- #import ”stdio.ih”
- start main: void → void
- {
- int x, y = 1;
- while( y <= 5000 ) do
- {
- x = y;
- printint ( x );
- while( x >= 1000 ) do
- {
- printstr ( ”M” );
- x = x − 1000;
- }
- if ( x >= 500 )
- {
- printstr ( ”D” );
- x = x − 500;
- }
- while( x >= 100 ) do
- {
- printstr ( ”C” );
- x = x − 100;
- }
- if ( x >= 50 )
- {
- printstr ( ”L” );
- x = x − 50;
- }
- while( x >= 10 ) do
- {
- printstr ( ”X” );
- x = x − 10;
- }
- if ( x >= 5 )
- {
- printstr ( ”V” );
- x = x − 5;
- }
- while( x >= 1 ) do
- {
- printstr ( ”I” );
- x = x − 1;
- }
- printstr ( ”\n” );
- y = 2 ∗ y;
- }
- }
- Listing 3.12: Roman Numerals
- Output:
- 1 I
- 2 II
- 4 IIII
- 8 VIII
- 16 XVI
- 32 XXXII
- 64 LXIIII
- 128 CXXVIII
- 256 CCLVI
- 512 DXII
- 1024 MXXIIII
- 2048 MMXXXXVIII
- 4096 MMMMLXXXXVI
- Listing 3.13: Roman Numerals Output
- if ( a == 0 )
- {
- printstr ( ”Case 0\n” );
- }
- else
- {
- if ( a == 1 )
- {
- printstr ( ”Case 1\n” );
- }
- else
- {
- if ( a == 2 )
- {
- printstr ( ”Case 2\n” );
- }
- else
- {
- printstr ( ”Case >2\n” );
- }
- }
- }
- Listing 3.14: Multiple If Alternatives
- switch( a )
- {
- case 0
- {
- printstr ( ”Case 0\n” );
- }
- case 1
- {
- printstr ( ”Case 1\n” );
- }
- case 2
- {
- printstr ( ”Case 2\n” );
- }
- default
- {
- printstr ( ”Case >2\n” );
- }
- }
- Listing 3.15: The Switch Statement
- int n = 10;
- label here;
- printint ( n );
- n = n − 1;
- if ( n > 0 )
- {
- goto_considered_harmful here;
- }
- Listing 3.16: The Goto Statement
- statement (instead of the more common goto ) is a tribute to the Dutch computer
- scientist Edsger W. Dijkstra. 2
- The goto_considered_harmful statement causes control to jump to a specified
- (textual) label, which the programmer must provide using the label keyword.
- There may not be any duplicate labels throughout the entire program, regardless
- of scope level. For an example of the goto statement, see listing 3.16.
- The goto_considered_harmful statement is provided for convenience, but its use
- is strongly discouraged (like the name suggests), since it is detrimental to the
- structure of a program.
- 2 Edsger W. Dijkstra (1930-2002) studied mathematics and physics in Leiden, The Nether-
- lands. He obtained his PhD degree with a thesis on computer communications, and has
- since been a pioneer in computer science, and was awarded the ACM Turing Award in
- 1972. Dijkstra is best known for his theories about structured programming, including a
- famous article titled Goto Considered Harmful. Dijkstra’s scientific work may be found at
- http://www.cs.utexas.edu/users/EWD.
- 3.7 Array
- Beyond the simple types bool , char , float , int and untyped discussed earlier, Inger
- supports the advanced data type array. An array contains a predetermined
- number of elements, all of the same type. Examples are an array of elements of
- type int , or an array whose elements are of type bool . Types cannot be mixed.
- The elements of an array are laid out in memory in a sequential manner.
- Since the number and size of the elements is fixed, the location of any element
- in memory can be calculated, so that all elements can be accessed equally fast.
- Arrays are called random access structures for this reason. In the section on
- declarations, BNF productions and a syntax diagram were shown which included
- array brackets ( [ and ] ). We will illustrate their use here with an example:
- int a [5];
- declares an array of five elements of type int . The individual elements can
- be accessed using the [] indexing operator, where the index is zero-based: a[0]
- accesses the first element in the array, and a[4] accesses the last element in the
- array. Indexed array elements may be used wherever a variable of the array’s
- type is allowed. As an example, we translate another example program from N.
- Wirth’s Pascal User Manual ([7]), in listing 3.17.
- Arrays (matrices) may have more than one dimension. In declarations, this
- is specified thus:
- int a [4][6];
- which declares a to be a 4 × 6 matrix. Element access is similar: a [2][2]
- accesses the element of a at row 2, column 2. There is no limit to the number
- of dimensions used in an array.
- Inger has no way to initialize an array, with the exception of character
- strings. An array of characters may be initialized with a string constant, as
- shown in the code below:
- char a[20] = ”hello, world!”;
- In this code, the first 13 elements of array a are initialized with corresponding
- characters from the string constant ”hello, world!” . a[13] is initialized with zero,
- to indicate the end of the string, and the remaining characters are uninitialized.
- This example also shows that Inger works with zero-terminated strings, just like
- the C programming language. However, one could say that Inger has no concept
- of string; a string is just an array of characters, like any other array. The fact
- that strings are zero-terminated (so-called ASCIIZ-strings) is only relevant to
- the system support libraries, which provide string manipulation functions.
- It is not possible to assign an array to another array. This must be done
- on an element-by-element basis. In fact, if any operator except the indexing
- operator ( [] ) is used with an array, the array is treated like a typed pointer.
- 3.8 Pointers
- Any declaration may include some level of indirection, making the variable a
- pointer. Pointers contain addresses; they are not normally used for storage
- minmax: int a [], n → void
- {
- int min, max, i , u, v;
- min = a[0]; max = min; i = 2;
- while( i < n−1 ) do
- {
- u = a[i ]; v = a[i+1];
- if ( u > v )
- {
- if ( u > max ) { max = u; }
- if ( v < min ) { min = v; }
- }
- else
- {
- if ( v > max ) { max = v; }
- if ( u < min ) { min = u; }
- }
- i = i + 2;
- }
- if ( i == n )
- {
- if ( a[n] > max )
- {
- max = a[n];
- }
- else if ( a[n] < min )
- {
- min = a[n];
- }
- }
- printint ( min );
- printint ( max );
- }
- Listing 3.17: An Array Example
- themselves, but to point to other variables (hence the name). Pointers are a
- convenient mechanism to pass large data structures between functions or mod-
- ules. Instead of copying the entire data structure to the receiver, the receiver is
- told where it can access the data structure (given the address).
- The & operator can be used to retrieve the address of any variable, so it can
- be assigned to a pointer, and the * operator is used to access the variable at a
- given address. Examine the following example code to see how this works:
- int a;
- int ∗b = &a;
- ∗b = 2;
- printint ( a ); /* 2 */
- The variable b is assigned the address of variable a . Then, the value 2 is
- assigned to the variable to which b points ( a ), using the dereferencing operator
- ( * ). After this, a contains the value 2 .
- Pointers need not always refer to non-pointer variables; it is perfectly possible
- for a pointer to refer to another pointer. Pointers can also hold multiple levels
- of indirection, and can be dereferenced multiple times:
- int a;
- int ∗b = &a;
- int ∗∗c = &b;
- ∗∗c = 2;
- printint ( a ); /* 2 */
- Pointers have another use: they can contain the address of a dynamic vari-
- able. While ordinary variables declared using the declaration statements dis-
- cussed earlier are called static variables and reside on the stack, dynamic vari-
- bles live on the heap. The only way to create them is by using operating system
- functions to allocate memory for them, and storing their address in a pointer,
- which must be used to access them for all subsequent operations until the oper-
- ating system is told to release the memory that the dynamic variable occupies.
- The allocation and deallocation of memory for dynamic variables is beyond the
- scope of this text.
- 3.9 Functions
- Most of the examples thus far contained a single function, prefixed with the
- keyword start and often postfixed with something like void → void . In this
- section, we discuss how to write additional functions, which are an essential
- element of Inger if one wants to write larger programs.
- The purpose of a function is to encapsulate part of a program and associate
- it with a name or identifier. Any Inger program consists of at least one function:
- the start function, which is marked with the keyword start . To become familiar
- with the structure of a function, let us examine the syntax diagram for a function
- (figure 3.10 and 3.11). The associated BNF is a bit lengthy, so we will not print
- it here.
- Figure 3.10: Syntax diagram for function
- Figure 3.11: Syntax diagram for formal parameter block
- A function must be declared before it can be used. The declaration does
- not necessarily have to precede the actual use of the function, but it must take
- place at some point. The declaration of a function couples an identifier (the
- function name) to a set of function parameters (which may be empty), a return
- value (which may be none), and a function body. An example of a function
- declaration may be found in listing 3.17 (the minmax function).
- Function parameters are values given to the function which may influence the
- way it executes. Compare this to mathematical function definitions: they take
- an input variable (usually x) and produce a result. The function declarations
- in Inger are in fact modelled after the style in which mathematical functions
- are defined. Function parameters must always have names so that the code in
- the function can refer to them. The return value of a function does not have a
- name. We will illustrate the declaration of functions with some examples.
- Example 3.11 (Function Declarations)
- The function f takes no arguments and produces no result. Although such a
- function may seem useless, it is still possible for it to have a side effect, i.e. an
- influence besides returning a value:
- f : void → void
- The function g takes an int and a bool parameter, and returns an int value:
- g: int a; bool b → int
- Finally, the function h takes a two-dimensional array of char as an argument,
- and returns a pointer to an int :
- h: char str [ ][ ] → int ∗
- ?
- In the previous example, several sample function headers were given. Apart
- from a header, a function must also have a body, which is simply a block of code
- (contained within braces). From within the function body, the programmer may
- refer to the function parameters as if they were local variables.
- Example 3.12 (Function Definition)
- Here is a sample definition for the function g from the previous example:
- g: int a; bool b → int
- {
- if ( b == true )
- {
- return( a );
- }
- else
- {
- return( −a );
- }
- }
- ?
- The last example illustrates the use of the return keyword to return from a
- function call, while at the same time setting the return value. All functions
- (except functions which return void ) must have a return statement somewhere in
- their code, or their return value may never be set.
- Some functions take no parameters at all. This class of functions is called
- void , and we use the keyword void to identify them. It is also possible that a
- function has no return value. Again, we use the keyword void to indicate this.
- There are functions that take no parameters and return nothing: double void.
- Now that functions have been defined, they need to be invoked, since that’s
- the reason they exist. The () operator applies a function. It must be supplied
- to call a function, even if that function takes no parameters ( void ).
- Example 3.13 (Function Invocation)
- The function f from example 3.11 has no parameters. It is invoked like this:
- f ();
- Note the use of () , even for a void function. The function g from the same
- example might be invoked with the following parameters:
- int result = g( 3, false ); /* -3 */
- The programmer is free to choose completely different values for the param-
- eters. In this example, constants have been supplied, but it is legal to fill in
- variables or even complete expressions which can in turn contain function calls:
- int result = g( g( 3, false ), false ); /* 3 */
- ?
- Parameters are always passed by value, which means that their value is
- copied to the target function. If that function changes the value of the param-
- eter, the value of the original variable remains unchanged:
- Example 3.14 (By Value vs. By Reference)
- Suppose we have the function f , which is defined so:
- f : int a → void
- {
- a = 2;
- }
- To illustrate invocation by value, we do this:
- int i = 1;
- f(i );
- printint (i ); /* 1 */
- It is impossible to change the value of the input variable i , unless we redefine
- the function f to accept a pointer:
- f : int ∗a → void
- {
- ∗a = 2;
- }
- Now, the address of i is passed by value, but still points to the actual memory
- where i is stored. Thus i can be changed:
- int i = 1;
- f(&i);
- printint (i ); /* 2 */
- ?
- 3.10 Modules
- Not all code for a program has to reside within the same module. A program may
- consist of multiple modules, one of which is the main module, which contains
- one (and only one) function marked with the keyword start . This is the function
- that will be executed when the program starts. A start function must always
- be void → void , because there is no code that provides it with parameters and
- no code to receive a return value. There can be only one module with a start
- /*
- * printint.c
- *
- * Implementation of printint()
- */
- void printint ( int x )
- {
- printf ( ”%d\n”, x );
- }
- Listing 3.18: C-implementation of printint Function
- /*
- * printint.ih
- *
- * Header file for printint.c
- */
- extern printint : int x → void;
- Listing 3.19: Inger Header File for printint Function
- function. The start function may be called by other functions like any other
- function.
- Data and functions may be shared between modules using the extern keyword.
- If a variable int a is declared in one module, it can be imported by another
- module with the statement extern int a . The same goes for functions. The extern
- statements are usually placed in a header file, with the .ih extension. Such files
- can be referenced from Inger source code with the #import directive.
- In listing 3.18, a C function called printint is defined. We wish to use this
- function in an Inger program, so we write a header file called printint.ih which
- contains an extern statement to import the C function (listing 3.19). Finally,
- the Inger program in listing 3.20 can access the C function by importing the
- header file with the #import directive.
- 3.11 Libraries
- Unlike other popular programming languages, Inger has no builtin functions
- (e.g. read , write , sin , cos etc.). The programmer has to write all required functions
- himself, or import them from a library. Inger code can be linked into a static
- or dynamic library using the linker. A library consists of one or more code
- modules, none of which contains a start function (if one or more of them do, the
- linker will complain). The compiler does not check the existence or nonexistence
- of start functions, except for printing an error when there is more than one start
- function in the same module.
- Auxiliary functions need not be in an Inger module; they can also be im-
- /*
- * printint.i
- *
- * Uses C-implementation of
- * printint()
- */
- module program;
- #import ”printint.ih”
- int a,b;
- start main: void → void
- {
- a = b = 1;
- printint ( a + b );
- }
- Listing 3.20: Inger Program Using printint
- plemented in the C programming language. In order to use such functions in
- an Inger program, an Inger header file ( .ih ) must be provided for the C library,
- which contains extern function declarations for all the functions used in the Inger
- program. A good example is the stdio.ih header file supplied with the Inger com-
- piler. This header file is an interface to the ANSI C stdio library.
- 3.12 Conclusion
- This concludes the introduction to the Inger language. Please refer to the ap-
- pendices, in particular appendices C and D for detailed tables on operator prece-
- dence and the BNF productions for the entire language.
- Bibliography
- [1] B. Kernighan and D. Ritchie: C Programming Language (2nd Edition),
- Prentice Hall, 1998.
- [2] A. C. Hartmann: A Concurrent Pascal Compiler for Minicomputers, Lec-
- ture notes in computer science, Springer-Verlag, Berlin 1977.
- [3] M. Marcotty and H. Ledgard: The World of Programming Languages,
- Springer-Verlag, Berlin 1986., pages 41 and following.
- [4] American National Standards Institute: ANSI X3.159-1989. American Na-
- tional Standard for information systems - Programming Language C, ANSI,
- New York, USA 1989.
- [5] R. S. Scowen: Extended BNF a Generic Base Standard, Final Report,
- SEG C1 N10 (DITC Software Engineering Group), National Physical Lab-
- oratory, Teddington, Middlesex, UK 1993.
- [6] W. Waite: ANSI C Specification,
- http://www.cs.colorado.edu/∼eliuser/c html/c.html
- [7] N. Wirth and K. Jensen: PASCAL User Manual and Report, Lecture notes
- in computer science, Springer-Verlag, Berlin 1975.
- Part II
- Syntax
- Humans can understand a sentence, spoken in a language, when they hear
- it (provided they are familiar with the language being spoken). The brain is
- trained to process the incoming string of words and give meaning to the sentence.
- This process can only take place if the sentence under consideration obeys the
- grammatical rules of the language, or else it would be gibberish. This set of rules
- is called the syntax of a language and is denoted using a grammar. This part of
- the book, syntax analysis, gives an introduction to formal grammars (notation
- and manipulation) and how they are used to read (parse) actual sentences in
- a language. It also discusses ways to visualize the information gleaned from a
- sentence in a tree structure (a parse tree). Apart from theoretical aspects, the
- text treats practical matters such as lexical analysis (breaking a line of text up
- into individual words and recognizing language keywords among them) and tree
- traversal.
- Chapter 4
- Lexical Analyzer
- 4.1 Introduction
- The first step in the compiling process involves reading source code, so that
- the compiler can check that source code for errors before translating it to, for
- example, assembly language. All programming languages provide an array of
- keywords, like IF , WHILE , SWITCH and so on. A compiler is not usually interested
- in the individual characters that make up these keywords; these keywords are
- said to be atomic. However, in some cases the compiler does care about the
- individual characters that make up a word: an integer number (e.g. 12345 ),
- a string (e.g. ”hello, world” ) and a floating point number (e.g. 12e-09 ) are all
- considered to be words, but the individual characters that make them up are
- significant.
- This distinction requires special processing of the input text, and this special
- processing is usually moved out of the parser and placed in a module called the
- lexical analyzer, or lexer or scanner for short. It is the lexer’s responsibility to
- divide the input stream into tokens (atomic words). The parser (the module
- that deals with groups of tokens, checking that their order is valid) requests a
- token from the lexer, which reads characters from the input stream until it has
- accumulated enough characters to form a complete token, which it returns to
- the parser.
- Example 4.1 (Tokenizing)
- Given the input
- the quick brown fox jumps over the lazy dog
- a lexer will split this into the tokens
- the , quick , brown , fox , jumps , over , the , lazy and dog
- ?
- The process of splitting input into tokens is called tokenizing or scanning.
- Apart from tokenizing a sentence, a lexer can also divide tokens up into classes.
- This process is called screening. Consider the following example:
- Example 4.2 (Token Classes)
- Given the input
- the sum of 2 + 2 = 4.
- a lexer will split this into the following tokens, with classes:
- Word: the
- Word: sum
- Word: of
- Number: 2
- Plus: +
- Number: 2
- Equals: =
- Number: 4
- Dot: .
- ?
- Some token classes are very narrow (containing only one token), while others
- are broad. For example, the token class Word is used to represent the , sum and
- of , while the token class Dot can only be used when a . is read. Incidentally,
- the lexical analyzer must know how to separate individual tokens. In program
- source text, keywords are usually separated by whitespace (spaces, tabs and line
- feeds). However, this is not always the case. Consider the following input:
- Example 4.3 (Token Separation)
- Given the input
- sum=(2+2)*3;
- a lexer will split this into the following tokens:
- sum, =, (, 2, +, 2, ), *, 3 and ;
- ?
- Those familiar with popular programming languages like C or Pascal may
- know that mathematical tokens like numbers, =, + and * are not required to
- be separated from each other by whitespace. The lexer must have some way to
- know when a token ends and the next token (if any) begins. In the next section,
- we will discuss the theory of regular languages to further clarify this point and
- how lexers deal with it.
- Lexers have an additional interesting property: they can be used to filter
- out input that is not important to the parser, so that the parser has fewer
- different tokens to deal with. Block comments and line comments are examples of
- uninteresting input.
- A token class may represent a (large) collection of values. The token class
- OP_MULTIPLY , representing the multiplication operator * contains only one to-
- ken (*), but the token class LITERAL_INTEGER can represent the collection of
- all integers. We say that 2 is an integer, and so is 256 , 381 and so on. A compiler
- is not only interested in the fact that a token is a literal integer, but also in
- the value of that literal integer. This is why tokens are often accompanied by a
- token value. In the case of the number 2 , the token could be LITERAL_INTEGER
- and the token value could be 2 .
- Token values can be of many types: an integer number token has a token
- value of type integer, a floating point number token has a token value of type
- float or double , and a string token has a token value of type char * . Lexical
- analyzers therefore often store token values using a union (a C construct that
- allows a data structure to map fields of different type on the same memory,
- provided that only one of these fields is used at the same time).
- 4.2 Regular Language Theory
- The lexical analyzer is a submodule of the parser. While the parser deals with
- context-free grammars (a higher level of abstraction), the lexer deals with in-
- dividual characters which form tokens (words). Some tokens are simple ( IF or
- WHILE ) while others are complex. All integer numbers, for example, are repre-
- sented using the same token ( INTEGER ), which covers many cases ( 1 , 100 , 5845
- and so on). This requires some notation to match all integer numbers, so they
- are treated the same.
- The answer lies in realizing that the collection of integer numbers is really a
- small language, with very strict rules (it is a so-called regular language). Before
- we can show what regular languages are, we must first discuss some preliminary
- definitions.
- A language is a set of rules that says which sentences can be generated by
- stringing together elements of a certain alphabet. An alphabet is a collection
- of symbols (or entire words) denoted as Σ. The set of all strings that can be
- generated from an alphabet is denoted Σ ∗ . A language over an alphabet Σ is a
- subset of Σ ∗ .
- We now define, without proof, several operations that may be performed
- on languages. The first operation on languages that we present is the binary
- concatenation operation.
- Definition 4.1 (Concatenation operation)
- Let X and Y be two languages. Then XY is the concatenation of these lan-
- guages, so that:
- XY = {uv | u ∈ X ∧ v ∈ Y }.
- □
- Concatenation of a language with itself is also possible and is denoted X 2 .
- Concatenation of a language with itself can also be performed multiple times, e.g. X 7 . We
- will illustrate the definition of concatenation with an example.
- Example 4.4 (Concatenation operation)
- Let Σ be the alphabet {a,b,c}.
- Let X be the language over Σ with X = {aa,bb}.
- Let Y be the language over Σ with Y = {ca,b}.
- Then XY is the language {aaca,aab,bbca,bbb}.
- □
- The second operation that we will need to define regular languages is the binary
- union operation.
- Definition 4.2 (Union operation)
- Let X and Y be two languages. Then X ∪ Y is the union of these languages
- with
- X ∪ Y = {u | u ∈ X ∨ u ∈ Y }.
- □
- Note that the priority of concatenation is higher than the priority of union.
- Here is an example that shows how the union operation works:
- Example 4.5 (Union operation)
- Let Σ be the alphabet {a,b,c}.
- Let X be the language over Σ with X = {aa,bb}.
- Let Y be the language over Σ with Y = {ca,b}.
- Then X ∪ Y is the language over Σ with X ∪ Y = {aa,bb,ca,b}.
- □
- The final operation that we need to define is the unary Kleene star opera-
- tion. 1
- Definition 4.3 (Kleene star)
- Let X be a language. Then
- X ∗ = ⋃∞i=0 X i (4.1)
- 1 The mathematician Stephen Cole Kleene was born in 1909 in Hartford, Connecticut.
- His research was on the theory of algorithms and recursive functions. According to Robert
- Soare, “From the 1930’s on Kleene more than any other mathematician developed the notions
- of computability and effective process in all their forms both abstract and concrete, both
- mathematical and philosophical. He tended to lay the foundations for an area and then move
- on to the next, as each successive one blossomed into a major research area in his wake.”
- Kleene died in 1994.
- □
- Or, in words: X ∗ means that you can take 0 or more sentences from X and
- concatenate them. The Kleene star operation is best clarified with an example.
- Example 4.6 (Kleene star)
- Let Σ be the alphabet {a,b}.
- Let X be the language over Σ{aa,bb}.
- Then X ∗ is the language {λ,aa,bb,aaaa,aabb,bbaa,bbbb,...}.
- □
- There is also an extension to the Kleene star. XX ∗ may be written X + ,
- meaning that at least one string from X must be taken (whereas X ∗ allows the
- empty string λ).
- With these definitions, we can now give a definition for a regular language.
- Definition 4.4 (Regular languages)
- 1. Basis: ∅, {λ} and {a} are regular languages.
- 2. Recursive step: Let X and Y be regular languages. Then
- X ∪ Y is a regular language
- XY is a regular language
- X ∗ is a regular language
- □
- Now that we have established what regular languages are, it is important to
- note that lexical analyzer generators (software tools that will be discussed below)
- use regular expressions to denote regular languages. Regular expressions are
- merely another way of writing down regular languages. In regular expressions,
- it is customary to write the language consisting of a single string composed of
- one word, {a}, as a.
- Definition 4.5 (Regular expressions)
- Using recursion, we can define regular expressions as follows:
- 1. Basis: ∅, λ and a are regular expressions.
- 2. Recursive step: Let X and Y be regular expressions. Then
- X ∪ Y is a regular expression
- XY is a regular expression
- X ∗ is a regular expression
- □
- As you can see, the definition of regular expressions differs from the definition
- of regular languages only by a notational convenience (fewer braces to write).
- So any language that can be composed of other regular languages or expres-
- sions using concatenation, union, and the Kleene star, is also a regular language
- or expression. Note that the Kleene star, concatenation and union operations
- are listed here from highest to lowest priority.
- Example 4.7 (Regular Expression)
- Let a and b be regular expressions by definition 4.5(1). Then ab is a regular
- expression by definition 4.5(2) through concatenation. (ab ∪ b) is a regular
- expression by definition 4.5(2) through union. (ab ∪ b) ∗ is a regular expression
- by definition 4.5(2) through the Kleene star. The sentences that can be gener-
- ated by (ab ∪ b) ∗ are {λ,ab,b,abb,bab,babab,...}.
- □
- While context-free grammars are normally denoted using production rules,
- for regular languages it is sufficient to use the easy to read regular expressions.
- 4.3 Sample Regular Expressions
- In this section, we present a number of sample regular expressions to illustrate
- the theory presented in the previous section. From now on, we will no longer
- use bold a to denote {a}, since we will soon move to UNIX regular expressions
- which do not use bold either.
- Regular expression Sentences generated
- q the sentence q
- qqq the sentence qqq
- q ∗ all sentences of 0 or more q’s
- q + all sentences of 1 or more q’s
- q ∪ λ the empty sentence or q. Often denoted
- as q? (see UNIX regular expressions).
- b ∗ (b + a ∪ λ)b ∗ the collection of sentences that begin
- with 0 or more b’s, followed by either
- one or more b’s followed by an a, or
- nothing, followed by 0 or more b’s.
- These examples show that through repeated application of definition 4.5,
- complex sequences can be defined. This feature of regular expressions is used
- for constructing lexical analyzers.
- 4.4 UNIX Regular Expressions
- Under UNIX, several extensions to regular expressions have been implemented
- that we can use. A UNIX regular expression[3] is commonly called a regex
- (multiple: regexes).
- There is no union operator on UNIX. Instead, we supply a list of alternatives
- contained within square brackets.
- [abc] ≡ (a ∪ b ∪ c)
- To avoid having to type in all the individual letters when we want to match
- all lowercase letters, the following syntax is allowed:
- [a-z] ≡ [abcdefghijklmnopqrstuvwxyz]
- UNIX does not have a λ either. Here is the alternative syntax:
- a? ≡ a ∪ λ
- Lexical analyzer generators allow the user to directly specify these regular
- expressions in order to identify lexical tokens (atomic words that string together
- to make sentences). We will discuss such a generator program shortly.
- 4.5 States
- With the theory of regular languages, we can now find out how a lexical analyzer
- works. More specifically, we can see how the scanner can divide the input
- (34+12) into separate tokens.
- Suppose the programming language for which we wish to write a scanner
- consists only of sentences of the form ( number + number ) . Then we require the
- following regular expressions to define the tokens.
- Token Regular expression
- ( (
- ) )
- + +
- number [0-9]+
- A lexer uses states to determine which characters it can expect, and which
- may not occur in a certain situation. For simple tokens ( ( , ) and + ) this is easy:
- either one of these characters is read or it is not. For the number token, states
- are required.
- As soon as the first digit of a number is read, the lexer enters a state in
- which it expects more digits, and nothing else. If another digit is read, the lexer
- remains in this state and adds the digit to the token read so far. If something
- else (not a digit) is read, the lexer knows the number token is finished and leaves
- the number state, returning the token to the caller (usually the parser). After
- that, it tries to match the unexpected character (maybe a + ) to another token.
- Example 4.8 (States)
- Let the input be (34+12) . The lexer starts out in the base state. For every
- character read from the input, the following table shows the state that the lexer
- is currently in and the action it performs.
- □
- Token read State Action taken
- ( base Return ( to caller
- 3 base Save 3 , enter number state
- 4 number Save 4
- + number + not expected. Leave number
- state and return 34 to caller
- + base Return + to caller
- 1 base Save 1 , enter number state
- 2 number Save 2
- ) number ) unexpected. Leave number
- state and return 12 to caller
- ) base return ) to caller
- This example did not include whitespace (spaces, line feeds and tabs) on pur-
- pose, since it tends to be confusing. Most scanners ignore spacing by matching
- it with a special regular expression and doing nothing.
- There is another rule of thumb used by lexical analyzer generators (see the
- discussion of this software below): they always try to return the longest token
- possible.
- Example 4.9 (Token Length)
- = and == are both tokens. If = is read and the next character is also = ,
- then == will be returned instead of two separate = tokens.
- □
- In summary, a lexer determines which characters are valid in the input at
- any given time through a set of states, one of which is the active state. Different
- states have different valid characters in the input stream. Some characters cause
- the lexer to shift from its current state into another state.
- 4.6 Common Regular Expressions
- This section discusses some commonly used regular expressions for interesting
- tokens, such as strings and comments.
- Integer numbers
- An integer number consists of only digits. It ends when a non-digit character is
- encountered. The scanner must watch out for an overflow, e.g. 12345678901234
- does not fit in most programming languages’ type systems and should cause the
- scanner to generate an overflow error.
- The regular expression for integer numbers is
- [0-9]+
- This regular expression generates the collection of strings containing at least
- one digit, and nothing but digits.
- Practical advice 4.1 (Lexer Overflow)
- If the scanner generates an overflow or similar error, parsing of the source code
- can continue (but no target code can be generated). The scanner can just
- replace the faulty value with a correct one, e.g. “ 1 ”.
- □
- Floating point numbers
- Floating point numbers have a slightly more complex syntax than integer num-
- bers. Here are some examples of floating point numbers:
- 1.0 , .001 , 1e-09 , .2e+5
- The regular expression for floating point numbers is:
- [0-9]* . [0-9]+ ( e [+-] [0-9]+ )?
- Spaces were added for readability. These are not part of the generated
- strings. The scanner should check each of the subparts of the regular expression
- containing digits for possible overflow.
- Practical advice 4.2 (Long Regular Expressions)
- If a regular expression becomes long or too complex, it is possible to split it up
- into multiple regular expressions. The lexical analyzer’s internal state machine
- will still work.
- □
- Strings
- Strings are a token type that requires some special processing by the lexer. This
- should become clear when we consider the following sample input:
- "3+4"
- Even though this input consists of numbers and the + operator, which may
- have regular expressions of their own, the entire expression should be returned
- to the caller since it is contained within double quotes. The trick to do this is to
- introduce another state to the lexical analyzer, called an exclusive state. When
- in this state, the lexer will process only regular expressions marked with this
- state. The resulting regular expressions are these:
- Regular expression Action
- " Enter string state
- string . Store character. A dot (.) means any-
- thing. This regular expression is only
- considered when the lexer is in the
- string state.
- string " Return to previous state. Return string
- contents to caller. This regular expres-
- sion is only considered when the lexer
- is in the string state.
- Practical advice 4.3 (Exclusive States)
- You can write code for exclusive states yourself (when writing a lexical analyzer
- from scratch), but AT&T lex and GNU flex can do it for you.
- □
- The regular expressions proposed above for strings do not heed line feeds.
- You may want to disallow line feeds within strings, though. Then you must add
- another regular expression that matches the line feed character (\n in some
- languages) and generates an error when it is encountered within a string.
- The lexer writer must also be wary of a buffer overflow; if the program source
- code consists of a " and hundreds of thousands of letters (at least, not another
- "), a compiler that does not check for buffer overflow conditions will eventually
- crash for lack of memory. Note that you could match strings using a single
- regular expression:
- "(.)*"
- but the state approach makes it much easier to check for buffer overflow
- conditions since you can decide at any time whether the current character must
- be stored or not.
- Practical advice 4.4 (String Limits)
- To avoid a buffer overflow, limit the string length to about 64 KB and generate
- an error if more characters are read. Skip all the offending characters until
- another " is read (or end of file).
- □
- Comments
- Most compilers place the job of filtering comments out of the source code with
- the lexical analyzer. We can therefore create some regular expressions that do
- just that. This once again requires the use of an exclusive state. In programming
- languages, the beginning and end of comments are usually clearly marked:
- Language Comment style
- C /* comment */
- C++ // comment (line feed)
- Pascal { comment }
- BASIC REM comment :
- We can build our regular expressions around these delimiters. Let’s build
- sample expressions using the C comment delimiters:
- Regular expression Action
- /* Enter comment state
- comment . Ignore character. A dot (.) means any-
- thing. This regular expression is only
- considered when the lexer is in the com-
- ment state.
- comment */ Return to previous state. Do not re-
- turn to caller but read next token, effec-
- tively ignoring the comment. This reg-
- ular expression is only considered when
- the lexer is in the comment state.
- Using a minor modification, we can also allow nested comments. To do this,
- we must have the lexer keep track of the comment nesting level. Only when the
- nesting level reaches 0 after leaving the final comment should the lexer leave the
- comment state. Note that you could handle comments using a single regular
- expression:
- /* (.)* */
- But this approach does not support nested comments. The treatment of line
- comments is slightly easier. Only one regular expression is needed:
- //(.)*\n
- 4.7 Lexical Analyzer Generators
- Although it is certainly possible to write a lexical analyzer by hand, this task
- becomes increasingly complex as your input language gets richer. It is therefore
- more practical to use a lexical analyzer generator. The code generated by such a
- generator program is usually faster and more efficient than any code you might
- write by hand[2].
- Here are several candidates you could use:
- AT&T lex Not free, ancient, UNIX and Linux im-
- plementations
- GNU flex Free, modern, Linux implementation
- Bumblebee lex Free, modern, Windows implementa-
- tion
- The Inger compiler was constructed using GNU flex; in the next sections we
- will briefly discuss its syntax (since flex takes lexical analyzer specifications as
- its input) and how to use the output flex generates.
- Practical advice 4.5 (Lex)
- We heard that some people think that a lexical analyzer must be written in lex
- or flex in order to be called a lexer. Of course, this is blatant nonsense (it is the
- other way around).
- □
- Flex syntax
- The layout of a flex input file (extension .l) is, in pseudocode:
- %{
- Any preliminary C code (inclusions, defines) that
- will be pasted in the resulting .C file
- %}
- Any flex definitions
- %%
- Regular expressions
- %%
- Any C code that will be appended to
- the resulting .C file
- When a regular expression matches some input text, the lexical analyzer
- must execute an action. This usually involves informing the caller (the parser)
- of the token class found. With an action included, the regular expressions take
- the following form:
- [0-9]+ {
- intValue_g = atoi( yytext );
- return( INTEGER );
- }
- Using return( INTEGER ), the lexer informs the caller (the parser) that
- it has found an integer. It can only return one item (the token class) so the
- actual value of the integer is passed to the parser through the global variable
- intValue_g. Flex automatically stores the characters that make up the current
- token in the global string yytext.
- Sample flex input file
- Here is a sample flex input file for the language that consists of sentences of
- the form (number+number), and that allows spacing anywhere (except within
- tokens).
- %{
- #define NUMBER 1000
- int intValue_g;
- %}
- %%
- "(" { return( '(' ); }
- ")" { return( ')' ); }
- "+" { return( '+' ); }
- [0-9]+ {
- intValue_g = atoi( yytext );
- return( NUMBER );
- }
- [ \t\n] { /* ignore whitespace */ }
- %%
- int main()
- {
- int result;
- while( ( result = yylex() ) != 0 )
- {
- printf( "Token class found: %d\n", result );
- }
- return( 0 );
- }
- For many more examples, consult J. Levine’s Lex and yacc [2].
- 4.8 Inger Lexical Analyzer Specification
- As a practical example, we will now discuss the token categories in the Inger
- language, and all regular expressions used for complex tokens. The full source
- for the Inger lexer is included in appendix F.
- Inger discerns several token categories: keywords ( IF , WHILE and so on),
- operators (+, % and more), complex tokens (integer numbers, floating point
- numbers, and strings), delimiters (parentheses, brackets) and whitespace.
- We will list the tokens in each category and show which regular expressions
- is used to match them.
- Keywords
- Inger expects all keywords (sometimes called reserved words) to be written in
- lowercase, allowing the literal keyword to be used to match the keyword itself.
- The following table illustrates this:
- Token Regular Expression Token identifier
- break break KW_BREAK
- case case KW_CASE
- continue continue KW_CONTINUE
- default default KW_DEFAULT
- do do KW_DO
- else else KW_ELSE
- false false KW_FALSE
- goto considered harmful goto_considered_harmful KW_GOTO
- if if KW_IF
- label label KW_LABEL
- module module KW_MODULE
- return return KW_RETURN
- start start KW_START
- switch switch KW_SWITCH
- true true KW_TRUE
- while while KW_WHILE
- Types
- Type names are also tokens. They are invariable and can therefore be matched
- using their full name.
- Token Regular Expression Token identifier
- bool bool KW_BOOL
- char char KW_CHAR
- float float KW_FLOAT
- int int KW_INT
- untyped untyped KW_UNTYPED
- Note that the untyped type is equivalent to void in the C language; it is a
- polymorphic type. One or more reference symbols (*) must be added after the
- untyped keyword. For instance, the declaration
- untyped ** a;
- declares a to be a double polymorphic pointer.
- Complex tokens
- Inger’s complex tokens are variable identifiers, integer literals, floating point literals
- and character literals.
- Token Regular Expression Token identifier
- integer literal [0-9]+ INT
- identifier [_A-Za-z][_A-Za-z0-9]* IDENTIFIER
- float [0-9]*\.[0-9]+([eE][\+-][0-9]+)? FLOAT
- char \’.\’ CHAR
- Strings
- In Inger, strings cannot span multiple lines. Strings are read using an exclusive
- lexer string state. This is best illustrated by some flex code:
- \" { BEGIN STATE_STRING; }
- <STATE_STRING>\" { BEGIN 0; return( STRING ); }
- <STATE_STRING>\n { ERROR( "unterminated string" ); }
- <STATE_STRING>. { (store a character) }
- <STATE_STRING>\\\" { (add " to string) }
- If a linefeed is encountered while reading a string, the lexer displays an error
- message, since strings may not span lines. Every character that is read while in
- the string state is added to the string, except ", which terminates a string and
- causes the lexer to leave the exclusive string state. Using the \" control code,
- the programmer can actually add the " (double quote) character to a string.
- Comments
- Inger supports two types of comments: line comments (which are terminated
- by a line feed) and block comments (which must be explicitly terminated).
- Line comments can be read (and subsequently skipped) using a single regular
- expression:
- "//"[^\n]*
- whereas block comments need an exclusive lexer state (since they can also
- be nested). We illustrate this again using some flex code:
- "/*" { BEGIN STATE_COMMENTS;
- ++commentlevel; }
- <STATE_COMMENTS>"/*" { ++commentlevel; }
- <STATE_COMMENTS>. { }
- <STATE_COMMENTS>\n { }
- <STATE_COMMENTS>"*/" { if( --commentlevel == 0 )
- BEGIN 0; }
- Once a comment is started using /*, the lexer sets the comment level to 1
- and enters the comment state. The comment level is increased every time a
- /* is encountered, and decreased every time a */ is read. While in comment
- state, all characters but the comment start and end delimiters are discarded.
- The lexer leaves the comment state after the last comment block terminates.
- Operators
- Inger provides a large selection of operators, of varying priority. They are listed
- here in alphabetic order of the token identifiers. This list includes only atomic
- operators, not operators that delimit their argument on both sides, like function
- application.
- funcname ( expr[,expr...] )
- or array indexing
- arrayname [ index ].
- In the next section, we will present a list of all operators (including function
- application and array indexing) sorted by priority.
- Some operators consist of multiple characters. The lexer can discern between
- them by looking one character ahead in the input stream and switching states
- (as explained in section 4.5).
- Token Regular Expression Token identifier
- addition + OP_ADD
- assignment = OP_ASSIGN
- bitwise and & OP_BITWISE_AND
- bitwise complement ~ OP_BITWISE_COMPLEMENT
- bitwise left shift << OP_BITWISE_LSHIFT
- bitwise or | OP_BITWISE_OR
- bitwise right shift >> OP_BITWISE_RSHIFT
- bitwise xor ^ OP_BITWISE_XOR
- division / OP_DIVIDE
- equality == OP_EQUAL
- greater than > OP_GREATER
- greater or equal >= OP_GREATEREQUAL
- less than < OP_LESS
- less or equal <= OP_LESSEQUAL
- logical and && OP_LOGICAL_AND
- logical or || OP_LOGICAL_OR
- modulus % OP_MODULUS
- multiplication * OP_MULTIPLY
- logical negation ! OP_NOT
- inequality != OP_NOTEQUAL
- subtract - OP_SUBTRACT
- ternary if ? OP_TERNARY_IF
- Note that the * operator is also used for dereferencing (in unary form) besides
- multiplication, and the & operator is also used for referencing (taking an
- address) besides bitwise and.
- Delimiters
- Inger has a number of delimiters. They are listed here by their function
- description.
- Token Regexp Token identifier
- precedes function return type -> ARROW
- start code block { LBRACE
- end code block } RBRACE
- begin array index [ LBRACKET
- end array index ] RBRACKET
- start function parameter list : COLON
- function argument separation , COMMA
- expression priority, function application ( LPAREN
- expression priority, function application ) RPAREN
- statement terminator ; SEMICOLON
- The full source to the Inger lexical analyzer is included in appendix F.
- Bibliography
- [1] G. Goos, J. Hartmanis: Compiler Construction - An Advanced Course,
- Lecture Notes in Computer Science, Springer-Verlag, Berlin, 1974.
- [2] J. Levine: Lex and Yacc, O’Reilly & Associates, 2000
- [3] H. Spencer: POSIX 1003.2 regular expressions, UNIX man page regex(7),
- 1994
- Chapter 5
- Grammar
- 5.1 Introduction
- This chapter will introduce the concepts of language and grammar in both
- informal and formal terms. After we have established exactly what a grammar
- is, we offer several example grammars with documentation.
- This introductory section discusses the value of the material that follows
- in writing a compiler. A compiler can be thought of as a sequence of actions,
- performed on some code (formulated in the source language) that transform
- that code into the desired output. For example, a Pascal compiler transforms
- Pascal code to assembly code, and a Java compiler transforms Java code to its
- corresponding Java bytecode.
- If you have used a compiler in the past, you may be familiar with “syntax
- errors”. These occur when the input code does not conform to a set of rules set
- by the language specification. You may have forgotten to terminate a statement
- with a semicolon, or you may have used the THEN keyword in a C program (the
- C language defines no THEN keyword).
- One of the things that a compiler does when transforming source code to
- target code is check the structure of the source code. This is a required step
- before the compiler can move on to something else.
- The first thing we must do when writing a compiler is write a grammar
- for the source language. This chapter explains what a grammar is and how to
- create one. Furthermore, it introduces several common ways of writing down a
- grammar.
- 5.2 Languages
- In this section we will try to formalize the concept of a language. When thinking
- of languages, the first languages that usually come to mind are natural languages
- like English or French. This is a class of languages that we will only consider in
- passing here, since they are very difficult to understand by a computer. There
- is another class of languages, the computer or formal languages, that are far
- easier to parse since they obey a rigid set of rules. This is in contrast with
- natural languages, whose lenient rules allow the speaker a great deal of freedom
- in expressing himself.
- Computers have been and are actively used to translate natural languages,
- both for professional purposes (for example, voice-operated computers or Mi-
- crosoft SQL Server’s English Query) and in games. The first so-called adventure
- game 1 was written as early as 1975 and it was played by typing in English com-
- mands.
- All languages draw the words that they allow to be used from a pool of
- words, called the alphabet. This is rather confusing, because we tend to think
- of the alphabet as the 26 latin letters, A through Z. However, the definition of
- a language is not concerned with how its most basic elements, the words, are
- constructed from individual letters, but how these words are strung together.
- In definitions, an alphabet is denoted as Σ.
- A language is a collection of sentences or strings. From all the words that a
- language allows, many sentences can be built but only some of these sentences
- are valid for the language under consideration. All the sentences that may be
- constructed from an alphabet Σ are denoted Σ ∗ . Also, there exists a special
- sentence: the sentence with no words in it. This sentence is denoted λ.
- In definitions, we refer to words using lowercase letters at the beginning of
- our alphabet (a,b,c...), while we refer to sentences using letters near the end of
- our alphabet (u,v,w,x...). We will now define how sentences may be built from
- words.
- Definition 5.1 (Alphabet)
- Let Σ be an alphabet. Σ ∗ , the set of strings over Σ, is defined recursively as
- follows:
- 1. Basis: λ ∈ Σ ∗ .
- 2. Recursive step: If w ∈ Σ ∗ , then wa ∈ Σ ∗ .
- 3. Closure: w ∈ Σ ∗ only if it can be obtained from λ by a finite number of
- applications of the recursive step.
- □
- 1 In early 1977, Adventure swept the ARPAnet. Willie Crowther was the original author,
- but Don Woods greatly expanded the game and unleashed it on an unsuspecting network.
- When Adventure arrived at MIT, the reaction was typical: after everybody spent a lot of time
- doing nothing but solving the game (it’s estimated that Adventure set the entire computer
- industry back two weeks), the true lunatics began to think about how they could do it better
- [proceeding to write Zork] (Tim Anderson, “The History of Zork – First in a Series” New Zork
- Times; Winter 1985)
- This definition may need some explanation. It is phrased using induction. What
- this means will become clear in a moment.
- In the basis (line 1 of the definition), we state that the empty string (λ) is
- a sentence over Σ. This is a statement, not proof. We just state that for any
- alphabet Σ, the empty string λ is among the sentences that may be constructed
- from it.
- In the recursive step (line 2 of the definition), we state that given a string w
- that is part of Σ ∗ , the string wa is also part of Σ ∗ . Note that w denotes a string,
- and a denotes a single word. Therefore what we mean is that given a string
- generated from the alphabet, we may append any word from that alphabet to
- it and the resulting string will still be part of the set of strings that can be
- generated from the alphabet.
- Finally, in the closure (line 3 of the definition), we add that all the strings
- that can be built using the basis and recursive step are part of the set of strings
- over Σ ∗ , and all the other strings are not. You can think of this as a sort
- of safeguard for the definition. In most inductive definitions, we will leave the
- closure line out.
- Is Σ ∗ , then, a language? The answer is no. Σ ∗ is the set of all possible
- strings that may be built using the alphabet Σ. Only some of these strings are
- actually valid for a language. Therefore a language over an alphabet Σ is a
- subset of Σ ∗ .
- As an example, consider a small part of the English language, with the
- alphabet { ’dog’, ’bone’, ’the’, ’eats’ } (we cannot consider the actual English
- language, as it has far too many words to list here). From this alphabet, we can
- derive strings using definition 5.1:
- λ
- dog
- dog dog dog
- bone dog the
- the dog eats the bone
- the bone eats the dog
- Many more strings are possible, but we can at least see that most of the
- strings above are not valid for the English language: their structure does not
- obey the rules of English grammar. Thus we may conclude that a language over
- an alphabet Σ is a subset of Σ ∗ that follows certain grammar rules.
- If you are wondering how all this relates to compiler construction, you should
- realize that one of the things that a compiler does is check the structure of its
- input by applying grammar rules. If the structure is off, the compiler prints a
- syntax error.
- 5.3 Syntax and Semantics
- To illustrate the concept of grammar, let us examine the following line of text:
- jumps the fox the dog over
- Since it obviously does not obey the rules of English grammar, this sentence
- is meaningless. It is said to be syntactically incorrect. The syntax of a sentence is
- its form or structure. Every sentence in a language must obey that language’s
- syntax for it to have meaning.
- Here is another example of a sentence, whose meaning is unclear:
- the fox drinks the color red
- Though possibly considered a wondrous statement in Vogon poetry, this
- statement has no meaning in the English language. We know that the color
- red cannot be drunk, so that although the sentence is syntactically correct, it
- conveys no useful information, and is therefore considered incorrect. Sentences
- whose structure conforms to a language’s syntax but whose meaning cannot be
- understood are said to be semantically incorrect.
- The purpose of a grammar is to give a set of rules which all the sentences of
- a certain language must follow. It should be noted that speakers of a natural
- language (like English or French) will generally understand sentences that differ
- from these rules, but this is not so for formal languages used by computers. All
- sentences are required to adhere to the grammar rules without deviation for a
- computer to be able to understand them.
- In Compilers: Principles, Techniques and Tools ([1]), Aho defines a grammar
- more formally:
- A grammar is a formal device for specifying a potentially infinite
- language (set of strings) in a finite way.
- Because language semantics are hard to express in a set of rules (although
- we will show a way to deal with semantics in part III), grammars deal with syntax
- only: a grammar defines the structure of sentences in a language.
- 5.4 Production Rules
- In a grammar, we are not usually interested in the individual letters that make
- up a word, but in the words themselves. We can give these words names so that
- we can refer to them in a grammar. For example, there are very many words
- that can be the subject or object of an English sentence (’fox’, ’dog’, ’chair’ and
- so on) and it would not be feasible to list them all. Therefore we simply refer
- to them as ’noun’. In the same way we can give all words that precede nouns to
- add extra information (like ’brown’, ’lazy’ and ’small’) a name too: ’adjective’.
- We call the set of all articles (’the’, ’a’, ’an’) ’article’. Finally, we call all verbs
- ’verb’. Each of these names represents a set of many words, with the exception
- of ’article’, which has all its members already listed.
- Armed with this new terminology, we are now in a position to describe the
- form of a very simple English sentence:
- sentence: article adjective noun verb adjective noun.
- From this lone production rule, we can generate (produce) English sentences.
- We can replace every set name to the right of the colon with one of its elements.
- For example, we can replace article with ’the’, adjective with ’quick’, noun with
- ’fox’ and so on. This way we can build sentences such as
- 81
- the quick fox eats a delicious banana
- the delicious banana thinks the quick fox
- a quick banana outruns a delicious fox
- The structure of these sentences matches the preceding rule, which means
- that they conform to the syntax we specified. Incidentally, some of these sen-
- tences have no real meaning, thus illustrating that semantic rules are not in-
- cluded in the grammar rules we discuss here.
- We have just defined a grammar, even though it contains only one rule that
- allows only one type of sentence. Note that our grammar is a so-called abstract
- grammar, since it does not specify the actual words that we may use to replace
- the word classes (article, noun, verb) that we introduced.
- So far we have given names to classes of individual words. We can also assign
- names to common combinations of words. This requires multiple rules, making
- the individual rules simpler:
- sentence: object verb object.
- object: article adjective noun.
- This grammar generates the same sentences as the previous one, but is some-
- what easier to read. Now we will also limit the choices that we can make when
- replacing word classes by introducing some more rules:
- noun: fox.
- noun: banana.
- verb: eats.
- verb: thinks.
- verb: outruns.
- article: a.
- article: the.
- adjective: quick.
- adjective: delicious.
- Our grammar is now extended so that it is no longer an abstract grammar.
- The rules above dictate how nonterminals (abstract grammar elements like ob-
- ject or article ) may be replaced with concrete elements of the language’s alphabet.
- The alphabet consists of all the terminal symbols or terminals in a language
- (the actual words). In the production rules listed above, terminal symbols are
- printed in bold.
- Nonterminal symbols are sometimes called auxiliary symbols, because they
- must be removed from any sentential form in order to create a concrete sentence
- in the language.
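- Since this grammar contains no recursion, the complete language it generates can be enumerated mechanically. The following Python sketch (our own illustration; the dictionary holding the production rules is an assumption of convenience) expands every alternative of every nonterminal:

```python
from itertools import product

# The example grammar: one dictionary entry per nonterminal, with each
# alternative production rule listed explicitly.
grammar = {
    "sentence": [["object", "verb", "object"]],
    "object": [["article", "adjective", "noun"]],
    "noun": [["fox"], ["banana"]],
    "verb": [["eats"], ["thinks"], ["outruns"]],
    "article": [["a"], ["the"]],
    "adjective": [["quick"], ["delicious"]],
}

def expand(symbol):
    """Return every terminal string derivable from symbol (this terminates
    only because the grammar has no recursion)."""
    if symbol not in grammar:            # terminal symbol
        return [[symbol]]
    results = []
    for rhs in grammar[symbol]:
        # combine every expansion of every right-hand side symbol
        for parts in product(*(expand(s) for s in rhs)):
            results.append([w for part in parts for w in part])
    return results

sentences = expand("sentence")
```

- The grammar generates exactly 192 sentences (2 articles × 2 adjectives × 2 nouns per object, squared, times 3 verbs), including semantically dubious ones such as ’the delicious banana thinks the quick fox’.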
- Production rules are called production rules for a reason: they are used
- to produce concrete sentences from the topmost nonterminal, or start symbol.
- A concrete sentence may be derived from the start symbol by systematically
- selecting nonterminals and replacing them with the right hand side of a suitable
- production rule. In the listing below, we present the production rules for the
- grammar we have constructed so far during this chapter. Consult the following
- example, in which we use this grammar to derive a sentence.
- 82
- sentence: object verb object.
- object: article adjective noun.
- noun: fox.
- noun: banana.
- verb: eats.
- verb: thinks.
- verb: outruns.
- article: a.
- article: the.
- adjective: quick.
- adjective: delicious.
- Derivation                                        Rule Applied
- ’sentence’
- =⇒ ’object’ verb object                           1
- =⇒ ’article’ adjective noun verb object           2
- =⇒ the ’adjective’ noun verb object               9
- =⇒ the quick ’noun’ verb object                   10
- =⇒ the quick fox ’verb’ object                    3
- =⇒ the quick fox eats ’object’                    5
- =⇒ the quick fox eats ’article’ adjective noun    2
- =⇒ the quick fox eats a ’adjective’ noun          8
- =⇒ the quick fox eats a delicious ’noun’          11
- =⇒ the quick fox eats a delicious banana          4
- The symbol =⇒ indicates the application of a production rule; the number of
- the rule applied is shown in the right column. The set of all sentences
- which can be derived by repeated application of the production rules (deriving)
- is the language defined by these production rules.
- The string of terminals and nonterminals in each step is called a sentential
- form. The last string, which contains only terminals, is the actual sentence.
- This means that the process of derivation ends once all nonterminals have been
- replaced with terminals.
- You may have noticed that in every step, we consistently replaced the
- leftmost nonterminal in the sentential form with one of its productions. This
- is why the derivation we have performed is called a leftmost derivation. It is
- also correct to perform a rightmost derivation by consistently replacing the
- rightmost nonterminal in each sentential form, or any derivation in between.
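- The leftmost derivation procedure described above can be sketched in a few lines of Python (a toy illustration; the rule-selection indices are chosen by hand here, whereas a parser would deduce them from the input text):

```python
# The example grammar, one entry per nonterminal; each alternative
# right-hand side is listed in the order of the production rules above.
grammar = {
    "sentence": [["object", "verb", "object"]],
    "object": [["article", "adjective", "noun"]],
    "noun": [["fox"], ["banana"]],
    "verb": [["eats"], ["thinks"], ["outruns"]],
    "article": [["a"], ["the"]],
    "adjective": [["quick"], ["delicious"]],
}

def leftmost_step(form, choice):
    """Replace the leftmost nonterminal in the sentential form `form`
    using its alternative number `choice`; return the new form."""
    for i, sym in enumerate(form):
        if sym in grammar:  # first (leftmost) nonterminal found
            return form[:i] + grammar[sym][choice] + form[i + 1:]
    return form  # no nonterminals left: `form` is already a sentence

# Derive "the quick fox eats a delicious banana", one step at a time.
form = ["sentence"]
for choice in [0, 0, 1, 0, 0, 0, 0, 0, 1, 1]:
    form = leftmost_step(form, choice)
```

- Replacing the rightmost nonterminal instead (scan `form` from the end) would implement a rightmost derivation; for this unambiguous toy grammar both produce the same sentence.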
- Our current grammar states that every noun is preceded by precisely one
- adjective. We now want to modify our grammar so that it allows us to specify
- zero, one or more adjectives before each noun. This can be done by introducing
- recursion, where a production rule for a nonterminal may again contain that
- nonterminal:
- object: article adjectivelist noun.
- adjectivelist: adjective adjectivelist.
- adjectivelist: ε.
- The rule for the nonterminal object has been altered to include adjectivelist
- instead of simply adjective . An adjective list can either be empty (nothing,
- indicated by ε), or an adjective, followed by another adjective list and so on.
- The following sentences may now be derived:
- sentence
- =⇒ object verb object
- =⇒ article adjectivelist noun verb object
- =⇒ the adjectivelist noun verb object
- =⇒ the noun verb object
- =⇒ the banana verb object
- =⇒ the banana outruns object
- =⇒ the banana outruns article adjectivelist noun
- =⇒ the banana outruns the adjectivelist noun
- =⇒ the banana outruns the adjective adjectivelist noun
- =⇒ the banana outruns the quick adjectivelist noun
- =⇒ the banana outruns the quick adjective adjectivelist noun
- =⇒ the banana outruns the quick delicious adjectivelist noun
- =⇒ the banana outruns the quick delicious noun
- =⇒ the banana outruns the quick delicious fox
- 5.5 Context-free Grammars
- After the introductory examples of sentence derivations, it is time to deal with
- some formalisms. All the grammars we will work with in this book are context-
- free grammars:
- Definition 5.2 (Context-Free Grammar)
- A context-free grammar is a quadruple (V,Σ,P,S) where V is a finite set of vari-
- ables (nonterminals), Σ is a finite set of terminals, P is a finite set of production
- rules and S ∈ V is an element of V designated as the start symbol.
- ?
- The grammar listings you have seen so far were context-free grammars, con-
- sisting of a single nonterminal on the left-hand side of each production rule
- (the mark of a context-free grammar). In fact, a symbol is a nonterminal only
- when it acts as the left-hand side of a production rule. The right side of every
- production rule may either be empty (denoted using an epsilon, ε), or contain
- any combination of terminals and nonterminals. This notation is called the
- Backus-Naur form 2 , after its inventors, John Backus and Peter Naur.
- 2 John Backus and Peter Naur introduced for the first time a formal notation to describe
- the syntax of a given language (this was for the description of the ALGOL 60 programming
- language). To be precise, most of BNF was introduced by Backus in a report presented at an
- earlier UNESCO conference on ALGOL 58. Few read the report, but when Peter Naur read it
- he was surprised at some of the differences he found between his and Backus’s interpretation
- of ALGOL 58. He decided that the successor to ALGOL, in which all participants of the first
- design had come to recognize some weaknesses, should be given in a similar form so that all
- participants would be aware of what they were agreeing to. He made a few modifications
- that are almost universally used and drew up on his own the BNF for ALGOL 60 at the
- meeting where it was designed. Depending on how you attribute presenting it to the world,
- it was either by Backus in 59 or Naur in 60. (For more details on this period of programming
- languages history, see the introduction to Backus’s Turing award article in Communications
- of the ACM, Vol. 21, No. 8, August 1978. This note was suggested by William B. Clodius
- from Los Alamos Natl. Lab).
- 84
- expression: expression + expression.
- expression: expression − expression.
- expression: expression ∗ expression.
- expression: expression / expression.
- expression: expression ˆ expression.
- expression: number.
- expression: ( expression ).
- number: 0.
- number: 1.
- number: 2.
- number: 3.
- number: 4.
- number: 5.
- number: 6.
- number: 7.
- number: 8.
- number: 9.
- Listing 5.1: Sample Expression Language
- The process of deriving a valid sentence from the start symbol (in our pre-
- vious examples, this was sentence ), is executed by repeatedly replacing a non-
- terminal by the right-hand side of any one of the production rules of which it
- acts as the left-hand side, until no nonterminals are left in the sentential form.
- Nonterminals are always abstract names, while terminals are often expressed
- using their actual (real-world) representations, often between quotes (e.g. ”+” ,
- ”while” , ”true” ) or printed bold (like we do in this book).
- The left-hand side of a production rule is separated from the right-hand side
- by a colon, and every production rule is terminated by a period. This punctuation
- does not affect the meaning of the production rules at all, but it is considered
- good style and is part of the specification of Backus-Naur form (BNF). Other
- notations are also in use.
- As a running example, we will work with a simple language for mathematical
- expressions, analogous to the language discussed in the introduction to this
- book. The language is capable of expressing the following types of sentences:
- 1 + 2 * 3 + 4
- 2 ^ 3 ^ 2
- 2 * (1 + 3)
- In listing 5.1, we give a sample context-free grammar for this language.
- Note the periods that terminate each production rule. You can see that
- there are only two nonterminals, each of which has a number of alternative
- production rules associated with it. We now state that expression will be
- the distinguished nonterminal that will act as the start symbol, and we can use
- these production rules to derive the sentence 1 + 2 * 3 (see table 5.1).
- 85
- expression
- =⇒ expression ∗ expression
- =⇒ expression + expression ∗ expression
- =⇒ number + expression ∗ expression
- =⇒ 1 + expression ∗ expression
- =⇒ 1 + number ∗ expression
- =⇒ 1 + 2 ∗ expression
- =⇒ 1 + 2 ∗ number
- =⇒ 1 + 2 ∗ 3
- Table 5.1: Derivation scheme for 1 + 2 * 3
- The grammar in listing 5.1 has all its keywords (the operators and digits) de-
- fined in it as terminals. One could ask how this grammar deals with whitespace,
- which consists of spaces, tabs and (possibly) newlines. We would naturally like
- to allow an arbitrary amount of whitespace to occur between two tokens (dig-
- its, operators, or parentheses), but the term whitespace occurs nowhere in the
- grammar. The answer is that whitespace is not usually included in a grammar,
- although it could be. The lexical analyzer uses whitespace to see where a word
- ends and where a new word begins, but otherwise discards it (unless the whites-
- pace occurs within comments or strings, in which case it is significant). In our
- language, whitespace does not have any significance at all, so we assume that it
- is discarded.
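- A minimal lexical analyzer for this expression language might look as follows (a Python sketch of the behavior just described, not the scanner actually used in the Inger compiler): whitespace merely separates tokens and is then thrown away.

```python
# Minimal lexer sketch: every token in the expression grammar is a single
# character, and whitespace is skipped (but still ends any token).
def tokenize(text):
    tokens = []
    i = 0
    while i < len(text):
        c = text[i]
        if c in " \t\n":                 # whitespace: discard
            i += 1
        elif c in "+-*/^()" or c.isdigit():
            tokens.append(c)             # operator, parenthesis or digit
            i += 1
        else:
            raise ValueError(f"unexpected character {c!r}")
    return tokens
```

- Note that `tokenize("1 + 2 * 3")` and `tokenize("1+2*3")` yield the same token stream, which is exactly why the grammar itself never needs to mention whitespace.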
- We would now like to extend definition 5.2 a little further, because we have
- not clearly stated what a production rule is.
- Definition 5.3 (Production Rule)
- In the quadruple (V,Σ,P,S) that defines a context-free grammar, P ⊆ V ×
- (V ∪ Σ) ∗ is a finite set of production rules.
- ?
- Here, (V ∪Σ) is the union of the set of nonterminals and the set of terminals,
- yielding the set of all symbols. (V ∪ Σ) ∗ denotes the set of finite strings of
- elements from (V ∪ Σ). In other words, P is a set of 2-tuples with on the left-
- hand side a nonterminal, and on the right-hand side a string constructed from
- items from V and Σ. It should now be clear that the following are examples of
- production rules:
- expression: expression ∗ expression.
- number: 3.
- We have already shown that production rules are used to derive valid sen-
- tences from the start symbol (sentences that may occur in the language under
- consideration). The formal method used to derive such sentences is given by the
- following definition (see also Languages and Machines by Thomas Sudkamp [8]):
- Definition 5.4 (String Derivation)
- 86
- Let G = (V,Σ,P,S) be a context-free grammar and v ∈ (V ∪ Σ) ∗ . The set of
- strings derivable from v is recursively defined as follows:
- 1. Basis: v is derivable from v.
- 2. Recursive step: If u = xAy is derivable from v and A −→ w ∈ P, then
- xwy is derivable from v.
- 3. Closure: Precisely those strings constructed from v by finitely many ap-
- plications of the recursive step are derivable from v.
- ?
- This definition illustrates how we use lowercase Latin letters to represent
- strings of zero or more terminal symbols, and uppercase Latin letters to represent
- a nonterminal symbol. Furthermore, we use lowercase Greek letters to denote
- strings of terminal and nonterminal symbols.
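- Definition 5.4 can be turned into an exhaustive, depth-bounded Python sketch. The tiny grammar below is our own toy example (nonterminals E and N), chosen so that the search space stays small:

```python
# A toy grammar: E -> E + E | N, N -> 1 | 2. Sentential forms are tuples
# of symbols; strings not appearing as keys are terminals.
grammar = {
    "E": [("E", "+", "E"), ("N",)],
    "N": [("1",), ("2",)],
}

def derivable(v, depth):
    """Compute the set of sentential forms derivable from v by at most
    `depth` applications of the recursive step of definition 5.4."""
    found = {v}          # basis: v is derivable from v
    frontier = {v}
    for _ in range(depth):
        nxt = set()
        for form in frontier:
            for i, sym in enumerate(form):
                for rhs in grammar.get(sym, []):
                    # replace one occurrence of the nonterminal sym
                    nxt.add(form[:i] + rhs + form[i + 1:])
        frontier = nxt - found
        found |= frontier
    return found
```

- Exhaustive enumeration is only feasible because we bound the number of derivation steps; the full set of derivable strings is infinite for any recursive grammar.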
- A closed formula may be given for all the sentences derivable from a given
- grammar, simultaneously introducing a new operator:
- {s ∈ Σ ∗ : S =⇒ ∗ s} (5.1)
- We have already discussed the operator =⇒, which denotes the derivation
- of a sentential form from another sentential form by applying a production rule.
- The =⇒ relation is defined as
- {(σAτ,σατ) ∈ (V ∪ Σ) ∗ × (V ∪ Σ) ∗ : σ,τ ∈ (V ∪ Σ) ∗ ∧ (A,α) ∈ P}
- which means, in words: =⇒ is a collection of 2-tuples, and is therefore a
- relation which binds the left element of each 2-tuple to the right element, thereby
- defining the possible replacements (productions) which may be performed. In
- the tuples, the capital Latin letter A represents a nonterminal symbol which gets
- replaced by a string of nonterminal and terminal symbols, denoted using the
- lowercase Greek letter α. σ and τ remain unchanged and serve to illustrate that
- a replacement (production) is context-insensitive or context-free. Whatever the
- actual value of the strings σ and τ, the replacement can take place. We will
- encounter other types of grammars which include context-sensitive productions
- later.
- The =⇒ ∗ relation is the reflexive and transitive closure of the relation =⇒. =⇒ ∗ is used
- to indicate that multiple production rules have been applied in succession to
- achieve the result stated on the right-hand side. The formula α =⇒ ∗ β denotes
- an arbitrary derivation starting with α and ending with β. It is perfectly valid to
- rewrite the derivation scheme we presented in table 5.1 using the new operator
- (see table 5.2). We can use this approach to leave out derivations that are
- obvious, analogous to the way one leaves out trivial steps in a mathematical
- proof.
- The concept of recursion is illustrated by the production rule expression :
- ”(”expression”)”. When deriving a sentence, the nonterminal expression may be
- replaced by itself (but between parentheses). This recursion may continue in-
- definitely, termination being reached when another production rule for expression
- is applied (in particular, expression : number).
- 87
- expression
- =⇒ ∗ expression + expression ∗ expression
- =⇒ ∗ 1 + number ∗ expression
- =⇒ ∗ 1 + 2 ∗ 3
- Table 5.2: Compact derivation scheme for 1 + 2 * 3
- Recursion is considered left recursion when the nonterminal on the left-hand
- side of a production also occurs as the first symbol in the right-hand side of the
- production. This applies to most of the production rules for expression , for
- example:
- expression: expression + expression.
- While this property does not affect our ability to derive sentences from the
- grammar, it does prohibit a machine from automatically parsing an input text
- deterministically. This will be discussed shortly. Left recursion can be obvious
- (as it is in this example), but it can also be buried deeply in a grammar. In some
- cases, it takes a keen eye to spot and remove left recursion. Consider the follow-
- ing example of indirect recursion (in this example, we use capital Latin letters
- to indicate nonterminal symbols and lowercase Latin letters to indicate strings
- of terminal symbols, as is customary in the compiler construction literature):
- Example 5.1 (Indirect Recursion)
- A: Bx
- B: Cy
- C: Az
- C: x
- A may be replaced by Bx , thus removing an instance of A from a sentential
- form, and B may be replaced by Cy . C may be replaced by Az , which reintroduces
- A in the sentential form: indirect recursion.
- This example was taken from [2].
- ?
- 5.6 The Chomsky Hierarchy
- In an unrestricted rewriting system [1] (grammar), the collection of production
- rules is:
- P ⊆ (V ∪ Σ) ∗ × (V ∪ Σ) ∗
- This means that the most lenient form of grammar allows multiple symbols,
- both terminals and nonterminals on the left hand side of a production rule.
- Such a production rule is often denoted
- 88
- (α,ω)
- since Greek lowercase letters stand for a finite string of terminal and non-
- terminal symbols, i.e. (V ∪ Σ) ∗ . The unrestricted grammar generates a type 0
- language according to the Chomsky hierarchy. Noam Chomsky has defined four
- levels of grammars, in which successively more severe restrictions on the form of
- production rules result in interesting classes of grammars.
- A type 1 grammar or context-sensitive grammar is one in which each pro-
- duction α −→ β is such that | β | ≥ | α |. Alternatively, a context-sensitive
- grammar is sometimes defined as having productions of the form
- γAρ −→ γωρ
- where ω cannot be the empty string (ε). This is, of course, the same defini-
- tion. A type 1 grammar generates a type 1 language.
- A type 2 grammar or context-free grammar is one in which each production
- is of the form
- A −→ ω
- where ω can be the empty string (ε). A context-free grammar generates a
- type 2 language.
- A type 3 grammar or regular grammar is either right linear, with each pro-
- duction of the form
- A −→ a or A −→ aB
- or left-linear, in which each production is of the form:
- A −→ a or A −→ Ba
- A regular grammar generates a type 3 language. Regular grammars are very
- easy to parse (analyze the structure of a text written using such a grammar)
- but are not very powerful at the same time. They are often used to write
- lexical analyzers and were discussed in some detail in the previous chapter.
- Grammars for most actual programming languages are context free, since this
- type of grammar turns out to be easy to parse and yet powerful. The higher
- classes (0 and 1) are not often used.
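- The restrictions that separate type 2 from type 3 grammars can be checked mechanically. The Python sketch below (our own illustration; it assumes the convention that nonterminals are single uppercase letters) classifies a list of productions:

```python
# Classify productions by the Chomsky-hierarchy restrictions described
# above. Convention (our assumption): a nonterminal is a single
# uppercase letter; anything else is a terminal.
def is_nonterminal(s):
    return len(s) == 1 and s.isupper()

def classify(productions):
    """productions: list of (lhs, rhs) pairs, rhs a tuple of symbols."""
    if not all(is_nonterminal(lhs) for lhs, _ in productions):
        return "type 0/1"                # left-hand side not a single nonterminal
    def right_linear(rhs):
        # A -> a  or  A -> aB: terminals, optionally followed by
        # exactly one trailing nonterminal
        return (all(not is_nonterminal(s) for s in rhs) or
                (all(not is_nonterminal(s) for s in rhs[:-1])
                 and is_nonterminal(rhs[-1])))
    if all(right_linear(rhs) for _, rhs in productions):
        return "type 3 (regular)"
    return "type 2 (context-free)"
```

- A symmetrical check for left-linear rules (one leading nonterminal) would classify the other flavor of regular grammar; mixing the two forms in one grammar does not, in general, yield a regular language.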
- As it happens, the class of context-free languages (type 2) is not very large. It
- turns out that there are almost no interesting languages that are context-free.
- But this problem is easily solved by first defining a superset of the language
- that is being designed, in order to formalize the context-free aspects of this
- language. After that, the remaining restrictions are defined using other means
- (i.e. semantic analysis).
- As an example of a context-sensitive aspect (from Meijer [6]), consider the
- fact that in many programming languages, variables must be declared before
- they may be used. More formally, in sentences of the form αXβXγ, in which
- the number of possible productions for X and the length of the production for β
- are not limited, both instances of X must always have the same production. Of
- 89
- course, this cannot be expressed in a context-free manner. 3 This is an immediate
- consequence of the fact that the productions are context-free: every nonterminal
- may be replaced by one of its right-hand sides regardless of its context. Context-
- free grammars can therefore be spotted by the property that the left-hand side
- of their production rules consist of precisely one nonterminal.
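- To illustrate how such a context-sensitive aspect is handled outside the grammar, consider this Python sketch of a declare-before-use check (the statement format is invented purely for illustration; a real compiler would walk a syntax tree and a symbol table):

```python
# Semantic check performed after parsing: every name must be declared
# before it is used. Input format (our invention): a list of
# ('declare', name) and ('use', name) tuples in program order.
def check_declared_before_use(statements):
    """Return the names that are used before being declared."""
    declared = set()
    errors = []
    for kind, name in statements:
        if kind == "declare":
            declared.add(name)
        elif kind == "use" and name not in declared:
            errors.append(name)
    return errors
```

- The check is trivial to express as a linear scan, yet impossible to express with context-free production rules: it is precisely the kind of restriction deferred to semantic analysis.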
- 5.7 Additional Notation
- In the previous section, we have shown how a grammar can be written for
- a simple language using the Backus-Naur form (BNF). Because BNF can be
- unwieldy for languages which contain many alternative production rules for each
- nonterminal or involve many recursive rules (rules that refer to themselves), we
- also have the option to use the extended Backus-Naur form (EBNF). EBNF
- introduces some meta-operators (which are only significant in EBNF and have
- no function in the language being defined) which make the life of the grammar
- writer a little easier. The operators are:
- Operator   Function
- ( and )    Group symbols together so that other meta-operators may be
-            applied to them as a group.
- [ and ]    Symbols (or groups of symbols) contained within square brackets
-            are optional.
- { and }    Symbols (or groups of symbols) between braces may be repeated
-            zero or more times.
- |          Indicates a choice between two symbols (usually grouped with
-            parentheses).
- Our sample grammar can now easily be rephrased using EBNF (see listing
- 5.2). Note how we are now able to combine multiple production rules for the
- same nonterminal into one production rule, but be aware that the alternatives
- specified between pipes ( | ) still constitute multiple production rules. EBNF is
- the syntax description language that is most often used in the compiler con-
- struction literature.
- Yet another, very intuitive way of describing syntax that we have already
- used extensively in the Inger language specification in chapter 3, is the syntax
- diagram. The production rules from listing 5.2 have been converted into two
- syntax diagrams in figure 5.1.
- Syntax diagrams consist of terminals (in boxes with rounded corners) and
- nonterminals (in boxes with sharp corners) connected by lines. In order to pro-
- duce valid sentences, the user begins with the syntax diagram designated as the
- top-level diagram. In our case, this is the syntax diagram for expression , since ex-
- pression is the start symbol in our grammar. The user then traces the line leading
- into the diagram, evaluating the boxes he encounters on the way. While tracing
- lines, the user may follow only rounded corners, never sharp ones, and may not
- reverse direction. When a box with a terminal is encountered, that terminal is
- placed in the sentence that is written. When a box containing a nonterminal
- 3 That is, unless there were a (low) limit on the number of possible productions for X
- and/or the length of β were fixed and small. In that case, the total number of possibilities is
- limited and one could write a separate production rule for each possibility, thereby regaining
- the freedom of context.
- 90
- expression: expression + expression
- | expression − expression
- | expression ∗ expression
- | expression / expression
- | expression ˆ expression
- | number
- |
- ?
- expression
- ?
- .
- number: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9.
- Listing 5.2: Sample Expression Language in EBNF
- Figure 5.1: Syntax Diagrams for Mathematical Expressions
- 91
- is encountered, the user switches to the syntax diagram for that nonterminal.
- In our case, there is only one nonterminal besides expression ( number ) and thus
- there are only two syntax diagrams. In a grammar for a more complete lan-
- guage, there may be many more syntax diagrams (consult appendix E for the
- syntax diagrams of the Inger language).
- Example 5.2 (Tracing a Syntax Diagram)
- Let’s trace the syntax diagram in figure 5.1 to generate the sentence
- 1 + 2 * 3 - 4
- We start with the expression diagram, since expression is the start symbol.
- Entering the diagram, we face a selection: we can either move to a box con-
- taining expression , move to a box containing the terminal ( or move to a box
- containing number . Since there are no parentheses in the sentence that we want
- to generate, the second alternative is eliminated. Also, if we were to move to
- number now, the sentence generation would end after we generate only one digit,
- because after the number box, the line we are tracing ends. Therefore we are
- left with only one alternative: move to the expression box.
- The expression box is a nonterminal box, so we must restart tracing the
- expression syntax diagram. This time, we move to the number box. This is
- also a nonterminal box, so we must pause our current trace and start tracing
- the number syntax diagram. The number diagram is simple: it only offers us
- one choice (pick a digit). We trace through 1 and leave the number diagram,
- picking up where we left off in the expression diagram. After the number box,
- the expression diagram also ends so we continue our first trace of the expression
- diagram, which was paused after we entered an expression box. We must now
- choose an operator. We need a + , so we trace through the corresponding box.
- Following the line from + brings us to a second expression box. We must once
- again pause our progress and reenter the expression diagram. In the following
- iterations, we pick 2 , * , 3 , - and 4 . Completing the trace is left as an exercise
- to the reader.
- ?
- Fast readers may have observed that converting (E)BNF production rules to
- syntax diagrams does not yield very efficient syntax diagrams. For instance, the
- syntax diagrams in figure 5.2 for our sample expression grammar are simpler
- than the original ones, because we were able to remove most of the recursion in
- the expression diagram.
- At a later stage, we will have more to say about syntax diagrams. For now,
- we will direct our attention back to the sentence generation process.
- 5.8 Syntax Trees
- The previous sections included some examples of sentence generation from a
- given grammar, in which the generation process was visualized using a derivation
- scheme (such as table 5.1 on page 86). Much more insight is gained from drawing
- a so-called parse tree or syntax tree for the derivation.
- We return to our sample expression grammar (listing 5.3, printed here again
- for easy reference) and generate the sentence
- 92
- Figure 5.2: Improved Syntax Diagrams for Mathematical Expressions
- expression: expression + expression
- | expression − expression
- | expression ∗ expression
- | expression / expression
- | expression ˆ expression
- | number
- |
- ?
- expression
- ?
- .
- number: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9.
- Listing 5.3: Sample Expression Language in EBNF
- 1 + 2 * 3
- We will derive this sentence using leftmost derivation, as shown in the deriva-
- tion scheme in table 5.3.
- The resulting parse tree is in figure 5.3. Every nonterminal encountered
- in the derivation has become a node in the tree, and the terminals (the digits
- and operators themselves) are the leaf nodes. We can now easily imagine how a
- machine would calculate the value of the expression 1 + 2 * 3 : every nonterminal
- node retrieves the value of its children and performs an operation on them
- (addition, subtraction, division, multiplication), and stores the result inside
- itself. This process occurs recursively, so that eventually the topmost node of
- the tree, known as the root node , contains the final value of the expression.
- Not all nonterminal nodes perform an operation on the values of their children;
- the number node does not change the value of its child, but merely serves as
- a placeholder. When a parent node queries the number node for its value, it
- merely passes the value of its child up to its parent. The following recursive
- definition states this approach more formally:
- 93
- expression
- =⇒ expression ∗ expression
- =⇒ expression + expression ∗ expression
- =⇒ number + expression ∗ expression
- =⇒ 1 + expression ∗ expression
- =⇒ 1 + number ∗ expression
- =⇒ 1 + 2 ∗ expression
- =⇒ 1 + 2 ∗ number
- =⇒ 1 + 2 ∗ 3
- Table 5.3: Leftmost derivation scheme for 1 + 2 * 3
- Definition 5.5 (Tree Evaluation)
- The following algorithm may be used to evaluate the final value of an expression
- stored in a tree.
- Let n be the root node of the tree.
- • If n is a leaf node (i.e. if n has no children), the final value of n is its current
- value.
- • If n is not a leaf node, the value of n is determined by retrieving the values
- of its children, from left to right. If one of the children is an operator, it
- is applied to the other children and the result is the final result of n.
- ?
- The tree we have just created is not unique. In fact, there are multiple valid
- trees for the expression 1 + 2 * 3 . In figure 5.4, we show the parse tree for the
- rightmost derivation of our sample expression. This tree differs slightly (but
- significantly) from our original tree. Apparently our grammar is ambiguous: it
- can generate multiple trees for the same expression.
- Figure 5.3: Parse Tree for Leftmost Derivation of 1 + 2 * 3
- The existence of multiple trees is not altogether a blessing, since it turns out
- that different trees produce different expression results.
- Example 5.3 (Tree Evaluation)
- 94
- expression
- =⇒ expression + expression
- =⇒ expression + expression ∗ expression
- =⇒ expression + expression ∗ number
- =⇒ expression + expression ∗ 3
- =⇒ expression + number ∗ 3
- =⇒ expression + 2 ∗ 3
- =⇒ number + 2 ∗ 3
- =⇒ 1 + 2 ∗ 3
- Table 5.4: Rightmost derivation scheme for 1 + 2 * 3
- Figure 5.4: Parse Tree for Rightmost Derivation of 1 + 2 * 3
- In this example, we will calculate the value of the expression 1 + 2 * 3 using
- the parse tree in figure 5.4. We start with the root node, and query the values
- of its three child nodes. The value of the left child node is 1 , since it
- has only one child ( number ) and its value is 1 . The value of the right expression
- node is determined recursively by retrieving the values of its two expression child
- nodes. These nodes evaluate to 2 and 3 respectively, and we apply the middle
- child node, which is the multiplication ( * ) operator. This yields the value 6
- which we store in the expression node.
- The values of the left and right child nodes of the root expression node are now
- known and we can calculate the final expression value. We do so by retrieving
- the value of the root node’s middle child node (the + operator) and applying it
- to the values of the left and right child nodes ( 1 and 6 respectively). The result,
- 7 , is stored in the root node. Incidentally, it is also the correct answer.
- At the end of the evaluation, the expression result is known and resides
- inside the root node.
- ?
- In this example, we have seen that the value 7 is found by evaluating the
- tree corresponding to the rightmost derivation of the expression 1 + 2 * 3 . This
- is illustrated by the annotated parse tree, which is shown in figure 5.5.
- We can now apply the same technique to calculate the final value of the
- parse tree corresponding to the leftmost derivation of the expression 1 + 2 * 3 ,
- shown in figure 5.6. We find that the answer ( 9 ) is incorrect, which is caused
- by the order in which the nodes are evaluated.
- 95
- Figure 5.5: Annotated Parse Tree for Rightmost Derivation of 1 + 2 * 3
- Figure 5.6: Annotated Parse Tree for Leftmost Derivation of 1 + 2 * 3
- The nodes in a parse tree must reflect the precedence of the operators used in
- the expression in the parse tree. In case of the tree for the rightmost derivation
- of 1 + 2 * 3 , the precedence was correct: the value of 2 * 3 was evaluated before
- the 1 was added to the result. In the parse tree for the leftmost derivation, the
- value of 1 + 2 was calculated before the result was multiplied by 3 , yielding an
- incorrect result. Should we, then, always use rightmost derivations? The answer
- is no: it is mere coincidence that the rightmost derivation happens to yield the
- correct result – it is the grammar that is flawed. With a correct grammar, any
- derivation order will yield the same result and only one parse tree corresponds
- to a given expression.
- 5.9 Precedence
- The problem of ambiguity in the grammar of the previous section is largely
- solved by introducing new nonterminals, which will serve as placeholders
- to introduce operator precedence levels. We know that multiplication ( * ) and
- division ( / ) bind more strongly than addition ( + ) and subtraction ( - ), but we
- need a means to visualize this concept in the parse tree. The solution lies in
- adding the nonterminal term (see the new grammar in listing 5.4), which will
- deal with multiplications and divisions. The original expression nonterminal is
- now only used for additions and subtractions. The result is that whenever
- a multiplication or division is encountered, the parse tree will contain a term
- node in which all multiplications and divisions are resolved until an addition or
- expression: term + expression
- | term − expression
- | term.
- term: factor ∗ term
- | factor / term
- | factor ˆ term
- | factor.
- factor: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
- | ( expression ).
- Listing 5.4: Unambiguous Expression Language in EBNF
- expression
- =⇒ term + expression
- =⇒ factor + expression
- =⇒ 1 + expression
- =⇒ 1 + term
- =⇒ 1 + factor ∗ term
- =⇒ 1 + 2 ∗ term
- =⇒ 1 + 2 ∗ factor
- =⇒ 1 + 2 ∗ 3
- Table 5.5: Leftmost derivation scheme for 1 + 2 * 3
- subtraction arrives.
- We also introduce the nonterminal factor to replace number , and to deal with
- parentheses, which have the highest precedence. It should now become obvious
- that the lower you get in the grammar, the higher the priority of the operators
- dealt with. Tables 5.5 and 5.6 show the leftmost and rightmost derivation of 1
- + 2 * 3 . Careful study shows that they are the same. In fact, the corresponding
- parse trees are identical (shown in figure 5.7). The parse tree is already
- annotated for convenience and yields the correct result for the expression it holds.
- It should be noted that in some cases, an instance of, for example, term
- actually adds an operator ( * or / ) and sometimes it is merely included as a
- placeholder that holds an instance of factor . Such nodes have no function in
- a syntax tree and can be safely left out (which we will do when we generate
- abstract syntax trees).
- There is an amazing (and amusing) trick that was used in the first FOR-
- TRAN compilers to solve the problem of operator precedence. An excerpt from
- a paper by Donald Knuth (1962):
- An ingenious idea used in the first FORTRAN compiler was to sur-
- round binary operators with peculiar-looking parentheses:
- + and − were replaced by ))) + ((( and ))) − (((
- ∗ and / were replaced by )) ∗ (( and ))/((
- ∗∗ was replaced by ) ∗ ∗(
- expression
- =⇒ term + expression
- =⇒ term + term
- =⇒ term + factor ∗ term
- =⇒ term + factor ∗ factor
- =⇒ term + factor ∗ 3
- =⇒ term + 2 ∗ 3
- =⇒ factor + 2 ∗ 3
- =⇒ 1 + 2 ∗ 3
- Table 5.6: Rightmost derivation scheme for 1 + 2 * 3
- Figure 5.7: Annotated Parse Tree for Arbitrary Derivation of 1 + 2 * 3
- and then an extra “(((” at the left and “)))” at the right were tacked
- on. For example, if we consider “(X + Y ) + W/Z,” we obtain
- ((((X))) + (((Y )))) + (((W))/((Z)))
- This is admittedly highly redundant, but extra parentheses need not
- affect the resulting machine language code.
- Another approach to solve the precedence problem was invented by the Pol-
- ish scientist J. Lukasiewicz in the late 1920s. Today frequently called prefix no-
- tation, the parenthesis-free or Polish notation was a perfect notation for the
- output of a compiler, and thus a step towards the actual mechanization and
- formulation of the compilation process.
- Example 5.4 (Prefix notation)
- 1 + 2 * 3 becomes + 1 * 2 3
- 1 / 2 - 3 becomes - / 1 2 3
- ?
- 5.10 Associativity
- When we write down the syntax tree for the expression 2 - 1 - 1 according
- to our example grammar, we discover that our grammar is still not correct
- (see figure 5.8). The parse tree yields the result 2 while the correct result
- is 0 , even though we have taken care of operator predence. It turns out that
- apart from precedence, operator associativity is also important. The subtraction
- operator - associates to the left, so that in a (sub) expression which consists
- only of operators of equal precedence, the order in which the operators must be
- evaluated is still fixed. In the case of subtraction, the order is from left to right.
- In the case of ˆ (power), the order is from right to left. After all,
- 2 ˆ (3 ˆ 2) = 512 ≠ 64 = (2 ˆ 3) ˆ 2.
- Figure 5.8: Annotated Parse Tree for 2 - 1 - 1
- expression: expression + term
- | expression − term
- | term.
- term: factor ∗ term
- | factor / term
- | factor ˆ term
- | factor.
- factor: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
- | ( expression ).
- Listing 5.5: Expression Grammar Modified for Associativity
- It turns out that our grammar works only for right-associative operators (or
- for operators such as addition and multiplication, whose result does not depend
- on the evaluation order and which may therefore be treated as right-associative),
- because its production rules are right-recursive. Consider the following excerpt:
- expression: term + expression
- | term − expression
- | term.
- The nonterminal expression acts as the left-hand side of these three production
- rules, and in two of them also occurs on the far right. This causes right recursion
- which can be spotted in the parse tree in figure 5.8: the right child node of every
- expression node is again an expression node. Left recursion can be recognized the
- same way. The solution, then, to the associativity problem is to introduce
- left-recursion in the grammar. The grammar in listing 5.5 can deal with left-
- associativity and right-associativity, because expression is left-recursive, causing
- + and - to be treated as left-associative operators, and term is right-recursive,
- causing * , / and ˆ to be treated as right-associative operators.
- And presto: the expressions 2 - 1 - 1 and 2 ˆ 3 ˆ 2 now have correct parse
- trees (figures 5.9 and 5.10). We will see in the next chapter that we are not
- quite out of the woods yet, but never fear, the worst is behind us.
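- The effect of associativity is easy to verify by direct computation. The fold
- functions below are a small sketch of our own: they apply a binary operator
- over a list of operands, grouping from the left and from the right respectively.

```python
from functools import reduce

def fold_left(op, xs):
    """Group from the left: ((x1 op x2) op x3) ..."""
    return reduce(op, xs)

def fold_right(op, xs):
    """Group from the right: x1 op (x2 op (x3 ...))"""
    return xs[0] if len(xs) == 1 else op(xs[0], fold_right(op, xs[1:]))

sub = lambda a, b: a - b
power = lambda a, b: a ** b

# 2 - 1 - 1: subtraction associates to the left.
print(fold_left(sub, [2, 1, 1]))     # prints 0, the correct result
print(fold_right(sub, [2, 1, 1]))    # prints 2, what a right-recursive rule yields

# 2 ^ 3 ^ 2: exponentiation associates to the right.
print(fold_right(power, [2, 3, 2]))  # prints 512, the correct result
print(fold_left(power, [2, 3, 2]))   # prints 64
```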
- 5.11 A Logic Language
- As a final and bigger example, we present a complete little language that handles
- propositional logic notation (proposition, implication, conjunction, disjunction
- and negation). This language has operator precedence and associativity, but
- very few terminals (an interpreter can therefore be implemented completely
- as an exercise; we will do so in the next chapter). Consult the following sample
- program:
- Example 5.5 (Proposition Logic Program)
- Figure 5.9: Correct Annotated Parse Tree for 2 - 1 - 1
- Figure 5.10: Correct Annotated Parse Tree for 2 ˆ 3 ˆ 2
- A = 1
- B = 0
- C = (˜A) | B
- RESULT = C −> A
- ?
- The language allows the free declaration of variables, for which capital letters
- are used (giving a range of 26 variables maximum). In the example, the variable
- A is declared and set to true (1), and B is set to false (0). The variable C is
- declared and set to (˜A) | B, which is false (0). Incidentally, the parentheses
- are not required because ˜ has higher priority than |. Finally, the program is
- terminated with an instruction that prints the value of C −> A, which is true
- (1). Termination of a program with such an instruction is required.
- Since our language is a proposition logic language, we must define truth
- tables for each operator (see table 5.7). You may already be familiar with all
- the operators. Pay special attention to the operator precedence relation:
- Operator Priority Operation
- ~ 1 Negation (not)
- & 2 Conjunction (and)
- | 2 Disjunction (or)
- -> 3 Right Implication
- <- 3 Left Implication
- <-> 3 Double Implication
- A B A & B
- F F F
- F T F
- T F F
- T T T
- A B A | B
- F F F
- F T T
- T F T
- T T T
- A ~A
- F T
- T F
- A B A -> B
- F F T
- F T T
- T F F
- T T T
- A B A <- B
- F F T
- F T F
- T F T
- T T T
- A B A <-> B
- F F T
- F T F
- T F F
- T T T
- Table 5.7: Proposition Logic Operations and Their Truth Tables
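- The truth tables of table 5.7 translate directly into executable definitions. The
- sketch below (the function names are our own) encodes each operator over
- Python booleans and runs the sample program from example 5.5.

```python
# Proposition logic operators from table 5.7.
NOT   = lambda a: not a
AND   = lambda a, b: a and b
OR    = lambda a, b: a or b
IMPL  = lambda a, b: (not a) or b    # A -> B: false only when A true, B false
LIMPL = lambda a, b: a or (not b)    # A <- B
EQUIV = lambda a, b: a == b          # A <-> B

# The sample program: A = 1, B = 0, C = (~A) | B, RESULT = C -> A
A, B = True, False
C = OR(NOT(A), B)
RESULT = IMPL(C, A)
print(int(C), int(RESULT))           # prints: 0 1
```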
- Now that we are familiar with the language and with the operator precedence
- relation, we can write a grammar in BNF. Incidentally, all operators are non-
- associative, and we will treat them as if they associated to the right (which is
- easiest for parsing by a machine, in the next chapter). The BNF grammar is
- in listing 5.6. For good measure, we have also written the grammar in EBNF
- (listing 5.7).
- You may be wondering why we have built our BNF grammar using complex
- constructions with empty production rules (ε) while our running example, the
- program: statementlist RESULT = implication.
- statementlist : ε.
- statementlist : statement statementlist.
- statement: identifier = implication ;.
- implication: conjunction restimplication.
- restimplication : ε.
- restimplication : −> conjunction restimplication.
- restimplication : <− conjunction restimplication.
- restimplication : <−> conjunction restimplication.
- conjunction: negation restconjunction.
- restconjunction: ε.
- restconjunction: & negation restconjunction.
- restconjunction: | negation restconjunction.
- negation: ˜ negation.
- negation: factor.
- factor : ( implication ).
- factor : identifier .
- factor : 1.
- factor : 0.
- identifier : A.
- ...
- identifier : Z.
- Listing 5.6: BNF for Logic Language
- program: { statement ; } RESULT = implication.
- statement: identifier = implication.
- implication: conjunction [ ( −> | <− | <−> ) implication ].
- conjunction: negation [ ( & | | ) conjunction ].
- negation: { ˜ } factor.
- factor : "(" implication ")"
- | identifier
- | 1
- | 0.
- identifier : A | ... | Z.
- Listing 5.7: EBNF for Logic Language
- mathematical expression grammar, was so much easier. The reason is that in
- our expression grammar, multiple individual production rules with the same
- nonterminal on the left-hand side (e.g. factor ), also start with that nonterminal.
- It turns out that this property of a grammar makes it difficult to implement in
- an automatic parser (which we will do in the next chapter). This is why we
- must go out of our way to create a more complex grammar.
- 5.12 Common Pitfalls
- We conclude our chapter on grammar with some practical advice. Grammars
- are not the solution to everything. They can describe the basic structure of a
- language, but fail to capture the details. You can easily spend much time trying
- to formulate grammars that contain the intricate details of some shadowy corner
- of your language, only to find out that it would have been far easier to handle
- those details in the semantic analysis phase. Often, you will find that some
- things just cannot be done with a context free grammar.
- Also, if you try to capture a high level of detail in a grammar, your grammar
- will rapidly grow and become unreadable. Extended Backus-Naur form
- may cut you some notational slack, but in the end you will be moving towards
- attribute grammars or affix grammars (discussed in Meijer, [6]).
- Visualizing grammars using syntax diagrams can be a big help, because
- Backus-Naur form can lure you into recursion without termination. Try to
- formulate your entire grammar in syntax diagrams before moving to BNF (even
- though you will have to invest more time). Refer to the syntax diagrams for the
- Inger language in appendix E for an extensive example, especially compared to
- the BNF notation in appendix D.
- Bibliography
- [1] A.V. Aho, R. Sethi, J.D. Ullman: Compilers: Principles, Techniques and
- Tools, Addison-Wesley, 1986.
- [2] F.H.J. Feldbrugge: Dictaat Vertalerbouw, Hogeschool van Arnhem en Ni-
- jmegen, edition 1.0, 2002.
- [3] J.D. Fokker, H. Zantema, S.D. Swierstra: Programmeren en correctheid,
- Academic Service, Schoonhoven, 1991.
- [4] A. C. Hartmann: A Concurrent Pascal Compiler for Minicomputers, Lec-
- ture notes in computer science, Springer-Verlag, Berlin 1977.
- [5] J. Levine: Lex and Yacc, O’Reilly & Associates, 2000.
- [6] H. Meijer: Inleiding Vertalerbouw, University of Nijmegen, Subfaculty of
- Computer Science, 2002.
- [7] M.J. Scott: Programming Language Pragmatics, Morgan Kaufmann Pub-
- lishers, 2000.
- [8] T. H. Sudkamp: Languages & Machines, Addison-Wesley, 2nd edition,
- 1998.
- [9] N. Wirth: Compilerbouw, Academic Service, 1987.
- [10] N. Wirth and K. Jensen: PASCAL User Manual and Report, Lecture notes
- in computer science, Springer-Verlag, Berlin 1975.
- Chapter 6
- Parsing
- 6.1 Introduction
- In the previous chapter, we have devised grammars for formal languages. In
- order to generate valid sentences in these languages, we have written derivation
- schemes and drawn syntax trees. However, a compiler does not work by generat-
- ing sentences in some language, but by recognizing (parsing) them and then
- translating them to another language (usually assembly language or machine
- language).
- In this chapter, we discuss how one writes a program that does exactly that:
- parse sentences according to a grammar. Such a program is called a parser.
- Some parsers build up a syntax tree for sentences as they recognize them. These
- syntax trees are identical to the ones presented in the previous chapter, but they
- are generated inversely: from the concrete sentence instead of from a derivation
- scheme. In short, a parser is a program that will read an input text and tell
- you if it obeys the rules of a grammar (and if not, why – if the parser is worth
- anything). Another way of saying it would be that a parser determines if a
- sentence can be generated from a grammar. The latter description states more
- precisely what a parser does.
- Only the more elaborate compilers build up a syntax tree in memory, but
- we will do so explicitly because it is very enlightening. We will also discuss
- a technique to simplify the syntax tree, thus creating an abstract syntax tree,
- which is more compact than the original tree. The abstract syntax tree is very
- important: it is the basis for the remaining compilation phases of semantic
- analysis and code generation.
- Parsing techniques come in many flavors and we do not presume to be able to
- discuss them all in detail here. We will only fully cover LL(1) parsing (recursive
- descent parsing), and touch on LR(k) parsing. No other methods are discussed.
- 6.2 Prefix code
- The grammar chapter briefly touched on the subject of prefix notation, or Polish
- notation as it is sometimes known. Prefix notation was invented by the Polish
- scientist J. Lukasiewicz in the late 1920s. This parenthesis-free notation was a
- perfect notation for the output of a compiler:
- Example 6.1 (Prefix notation)
- 1 + 2 * 3 becomes + 1 * 2 3
- 1 / 2 - 3 becomes - / 1 2 3
- ?
- In the previous chapter, we found that operator precedence and associativity
- can and should be handled in a grammar. The later phases (semantic analysis
- and code generation) should not be bothered with these operator properties
- anymore–the parser should convert the input text to an intermediate format
- that implies the operator priority and associativity. An unambiguous syntax
- tree is one such structure, and prefix notation is another.
- Prefix notation may not seem very powerful, but consider the fact that it
- can easily be used to denote complex constructions like if ...then and while .. do
- with which you are no doubt familiar (if not, consult chapter 3):
- if (var > 0) { a + b } else { b − a } becomes ?>var’0’+a’b’-b’a’
- while (n > 0) { n = n − 1 } becomes W>n’0’=n’-’n’1’
- Apostrophes ( ’ ) are often used as monadic operators that delimit a variable
- name, so that two variables are not actually read as one. As you can deduce from
- this example, prefix notation is actually a flattened tree. As long as the number
- of operands that each operator takes is known, the tree can easily be traversed
- using a recursive function. In fact, a very simple compiler can be constructed
- that translates the mathematical expression language that we devised into prefix
- code. A second program, the evaluator, could then interpret the prefix code and
- calculate the result.
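- Such a translator is a small recursive affair. The sketch below is our own (it
- anticipates the recursive descent technique treated later in this chapter): it
- follows the unambiguous expression grammar of listing 5.4 and emits prefix
- code. Like that grammar, it treats all operators as right-associative.

```python
def translate(tokens):
    """Translate an infix token list to prefix (Polish) notation."""
    pos = [0]  # current position, boxed so the nested functions can update it

    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None

    def eat():
        t = tokens[pos[0]]
        pos[0] += 1
        return t

    def expression():              # expression: term [ (+|-) expression ]
        left = term()
        if peek() in ("+", "-"):
            return eat() + " " + left + " " + expression()
        return left

    def term():                    # term: factor [ (*|/) term ]
        left = factor()
        if peek() in ("*", "/"):
            return eat() + " " + left + " " + term()
        return left

    def factor():                  # factor: a single digit
        return eat()

    return expression()

print(translate("1 + 2 * 3".split()))  # prints: + 1 * 2 3
print(translate("1 / 2 - 3".split()))  # prints: - / 1 2 3
```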
- Figure 6.1: Syntax Tree for If-Prefixcode
- To illustrate these facts as clearly as possible, we have placed the prefix
- expressions for the if and while examples in syntax trees (figures 6.1 and 6.2
- respectively).
- Figure 6.2: Syntax Tree for While-Prefixcode
- Notice that the original expressions may be regained by walking the tree in
- a pre-order fashion. Conversely, try walking the tree in-order or post-order, and
- examine the result as an interesting exercise.
- The benefits of prefix notation do not end there: it is also an excellent means
- to eliminate unnecessary syntactic sugar like whitespace and comments, without
- loss of meaning.
- The evaluator program is a recursive affair: it starts reading the prefix string
- from left to right, and for every operator it encounters, it calls itself to retrieve
- the operands. The recursion terminates when a constant (a variable name or
- a literal value) is found. Compare this to the method we discussed in the
- introduction to this book. We said that we needed a stack to place (shift) values
- and operators on that could not yet be evaluated (reduce). The evaluator works
- by this principle, and uses the recursive function as a stack.
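- A minimal sketch of the evaluator (our own, covering the four binary operators
- only): it consumes tokens from left to right, and for every operator it recurses
- twice to obtain the operands, so the call stack takes the place of an explicit
- shift-reduce stack.

```python
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: a / b}

def evaluate(tokens):
    """Evaluate a prefix expression, given as an iterator of tokens."""
    t = next(tokens)
    if t in OPS:                 # operator: recurse for its two operands
        a = evaluate(tokens)
        b = evaluate(tokens)
        return OPS[t](a, b)
    return int(t)                # constant: the recursion terminates here

print(evaluate(iter("+ 1 * 2 3".split())))  # prints 7
print(evaluate(iter("- / 1 2 3".split())))  # prints -2.5
```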
- The translator-evaluator construction we have discussed so far may seem
- rather artificial to you. But real compilers, although more complex, work the
- same way. The big difference is that the evaluator is the computer processor
- (CPU) - it cannot be changed, and the code that your compiler outputs must
- obey the processor’s rules. In fact, the machine code used by a real machine like
- the Intel x86 processor is a language unto itself, with a real grammar (consult
- the Intel instruction set manual [3] for details).
- There is one more property of the prefix code and the associated trees:
- operators are no longer leaf nodes in the trees, but have become internal nodes.
- We could have used nodes like expression and term as we have done before, but
- these nodes would then be void of content. By making the operators nodes
- themselves, we save valuable space in the tree.
- 6.3 Parsing Theory
- We have discussed prefix notations and associated syntax trees (or parse trees),
- but how is such a tree constructed from the original input (it is called a parse
- tree, after all)? In this section we present some of the theory that underlies
- parsing. Note that in a book of limited size, we do not presume to be able to
- treat all of the theory. In fact, we will limit our discussion to LL(1) grammars
- and mention LR parsing in a couple of places.
- Parsing can occur in two basic fashions: top-down and bottom-up. With top-
- down, you start with a grammar’s start symbol and work toward the concrete
- sentence under consideration by repeatedly applying production rules (replacing
- nonterminals with one of their right-hand sides) until there are no nonterminals
- left. This method is by far the easiest method, but also places the most restric-
- tions on the grammar.
- Bottom-up parsing starts with the sentence to be parsed (the string of termi-
- nals), and repeatedly applies production rules inversely, i.e. replaces substrings
- of terminals and nonterminals with the left-hand side of production rules. This
- method is more powerful than top-down parsing, but much harder to write
- by hand. Tools that construct bottom-up parsers from a grammar (compiler-
- compilers) exist for this purpose.
- 6.4 Top-down Parsing
- Top-down parsing relies on a grammar’s determinism property to work. A
- top down or recursive descent parser always takes the leftmost nonterminal in a
- sentential form (or the rightmost nonterminal, depending on the flavor of parser
- you use) and replaces it with one of its right-hand sides. Which one, depends on
- the next terminal character the parser reads from the input stream. Because the
- parser must constantly make choices, it must be able to do so without having
- to retrace its steps after making a wrong choice. There exist parsers that work
- this way, but these are obviously not very efficient and we will not give them
- any further thought.
- If all goes well, eventually all nonterminals will have been replaced with
- terminals and the input sentence should be readable. If it is not, something
- went wrong along the way and the input text did not obey the rules of the
- grammar; it is said to be syntactically incorrect–there were syntax errors. We
- will later see ways to pinpoint the location of syntax errors precisely.
- Incidentally, there is no real reason why we always replace the leftmost (or
- rightmost) nonterminal. Since our grammar is context-free, it does not mat-
- ter which nonterminal gets replaced since there is no dependency between the
- nonterminals (context-insensitive). It is simply tradition to pick the leftmost
- nonterminal, and hence the name of the collection of recursive descent parsers:
- LL, which means “recognizable while reading from left to right, and rewriting
- leftmost nonterminals.” We can also define RL right away, which means “recog-
- nizable while reading from right to left, and rewriting leftmost nonterminals.”
- – this type of recursive descent parsers would be used in countries where text is
- read from right to left.
- As an example of top-down parsing, consider the BNF grammar in listing
- 6.1. This is a simplified version of the mathematical expression grammar, made
- suitable for LL parsing (the details of that will follow shortly).
- Example 6.2 (Top-down Parsing by Hand)
- We will now parse the sentence 1 + 2 + 3 by hand, using the top-down approach.
- A top-down parser always starts with the start symbol, which in this case is
- expression . It then reads the first character from the input stream, which happens
- to be 1 , and determines which production rule to apply. Since there is only one
- production rule that can replace expression (it only acts as the left-hand side of
- one rule), we replace expression with factor restexpression :
- expression =⇒ L factor restexpression
- expression: factor restexpression.
- restexpression: ?.
- restexpression: + factor restexpression.
- restexpression: − factor restexpression.
- factor: 0.
- factor: 1.
- factor: 2.
- factor: 3.
- factor: 4.
- factor: 5.
- factor: 6.
- factor: 7.
- factor: 8.
- factor: 9.
- Listing 6.1: Expression Grammar for LL Parser
- In LL parsing, we always replace the leftmost nonterminal (here, it is factor ).
- factor has ten alternative production rules, but we know exactly which one to pick,
- since we have the character 1 in memory and there is only one production rule
- whose right-hand side starts with 1 :
- expression =⇒ L factor restexpression
- =⇒ L 1 restexpression
- We have just eliminated one terminal from the input stream, so we read the
- next one, which is + . The leftmost nonterminal which we need to replace is
- restexpression , which has only one alternative that starts with + :
- expression =⇒ L factor restexpression
- =⇒ L 1 restexpression
- =⇒ L 1 + factor restexpression
- We continue this process until we run out of terminal tokens. The situation
- at that point is:
- expression =⇒ L factor restexpression
- =⇒ L 1 restexpression
- =⇒ L 1 + factor restexpression
- =⇒ L 1 + 2 restexpression
- =⇒ L 1 + 2 + factor restexpression
- =⇒ L 1 + 2 + 3 restexpression
- The current terminal symbol under consideration is empty, but we could also
- use end of line or end of file. In that case, we see that of the three alternatives
- for restexpression , the ones that start with + and - are invalid. So we pick the
- production rule with the empty right-hand side, effectively removing restexpres-
- sion from the sentential form. We are now left with exactly the input sentence,
- having consumed all terminal symbols and replaced all nonterminals. Parsing
- was successful.
- If we were to parse 1 + 2 * 3 using this grammar, parsing will not be suc-
- cessful. Parsing will fail as soon as the terminal symbol * is encountered. If the
- lexical analyzer cannot handle this token, parsing will end for that reason. If it
- can (which we will assume here), the parser is in the following situation:
- expression =⇒∗L 1 + 2 restexpression
- The parser must now find a production rule starting with * . There is none,
- so it replaces restexpression with the empty alternative. After that, there are no
- more nonterminals to replace, but there are still terminal symbols on the input
- stream, thus the sentence cannot be completely recognized.
- ?
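- The hand simulation above translates almost mechanically into code. The
- recursive descent parser below is our own sketch of the grammar in listing 6.1:
- one function per nonterminal, and a single token of lookahead to select the
- production rule.

```python
class Parser:
    """LL(1) recursive descent parser for the grammar of listing 6.1."""
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):              # the one-token lookahead
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expression(self):        # expression: factor restexpression.
        self.factor()
        self.restexpression()

    def restexpression(self):    # restexpression: (+|-) factor restexpression | empty
        if self.peek() in ("+", "-"):
            self.pos += 1
            self.factor()
            self.restexpression()
        # otherwise apply the empty alternative: consume nothing

    def factor(self):            # factor: 0 | 1 | ... | 9.
        if self.peek() not in tuple("0123456789"):
            raise SyntaxError("digit expected at token %d" % self.pos)
        self.pos += 1

def parse(text):
    """Return True if text obeys the rules of the grammar."""
    parser = Parser(text.split())
    try:
        parser.expression()
        return parser.peek() is None   # all input must have been consumed
    except SyntaxError:
        return False

print(parse("1 + 2 + 3"))   # prints True
print(parse("1 + 2 * 3"))   # prints False: * is not in the grammar
```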
- There are a couple of caveats with the LL approach. Consider what happens
- if a nonterminal is replaced by a collection of other nonterminals, and so on,
- until at some point this collection of nonterminals is replaced by the original
- nonterminal, while no new terminals have been processed along the way. This
- process will then continue indefinitely, because there is no termination condition.
- Some grammars cause this behaviour to occur. Such grammars are called left-
- recursive.
- Definition 6.1 (Left Recursion)
- A context-free grammar (V,Σ,S,P) is left-recursive if
- ∃X ∈ V, α,β ∈ (V ∪ Σ)∗ : S =⇒∗ Xα =⇒∗L Xβ
- in which =⇒∗L is the reflexive transitive closure of =⇒L , defined as:
- {(uAτ, uατ) ∈ (V ∪ Σ)∗ × (V ∪ Σ)∗ : u ∈ Σ∗ , τ ∈ (V ∪ Σ)∗ ∧ (A,α) ∈ P}
- The difference between =⇒ and =⇒ L is that the former allows arbitrary
- strings of terminals and nonterminals to precede the nonterminal that is going
- to be replaced (A), while the latter insists that only terminals occur before A
- (thus making A the leftmost nonterminal).
- Equivalently, we may as well define =⇒R :
- {(τAu, ταu) ∈ (V ∪ Σ)∗ × (V ∪ Σ)∗ : u ∈ Σ∗ , τ ∈ (V ∪ Σ)∗ ∧ (A,α) ∈ P}
- =⇒ R insists that A be the rightmost nonterminal, followed only by terminal
- symbols. Using =⇒ R , we are also in a position to define right-recursion (which
- is similar to left-recursion):
- ∃X ∈ V, α,β ∈ (V ∪ Σ)∗ : S =⇒∗ αX =⇒∗R βX
- Parsers for grammars which contain left-recursion are not guaranteed to termi-
- nate, although they may. Because this may introduce hard to find bugs, it is
- important to weed out left-recursion from the outset if at all possible.
- ?
- Removing left-recursion can be done using left-factorisation. Consider the
- following excerpt from a grammar (which may be familiar from the previous
- chapter):
- expression: expression + term.
- expression: expression − term.
- expression: term.
- Obviously, this grammar is left-recursive: the first two production rules both
- start with expression , which also acts as their left-hand side. So expression may
- be replaced with expression without processing a nonterminal along the way. Let
- it be clear that there is nothing wrong with this grammar (it will generate valid
- sentences in the mathematical expression language just fine), it just cannot be
- recognized by a top-down parser.
- Left-factorisation means recognizing that the nonterminal expression occurs
- multiple times as the leftmost symbol in a production rule, and should therefore
- be in a production rule on its own. Firstly, we swap the order in which term and
- expression occur:
- expression: term + expression.
- expression: term − expression.
- expression: term.
- The + and - operators are now treated as if they were right-associative,
- which - is definitely not. We will deal with this problem later. For now, assume
- that associativity is not an issue. This grammar is no longer left-recursive, and
- it obviously produces the same sentences as the original grammar. However, we
- are not out of the woods yet.
- It turns out that when multiple production rule alternatives start with the
- same terminal or nonterminal symbol, it is impossible for the parser to choose an
- alternative based on the token it currently has in memory. This is the situation
- we have now; three production rules which all start with term . This is where we
- apply the left-factorisation: term may be removed from the beginning of each
- production rule and placed in a rule by itself. This is called “factoring out” a
- nonterminal on the left side, hence left-factorisation. The result:
- expression: term restexpression.
- restexpression: + term restexpression.
- restexpression: − term restexpression.
- restexpression: ?.
- Careful study will show that this grammar produces exactly the same sen-
- tences as the original one. We have had to introduce a new nonterminal
- ( restexpression ) with an empty alternative to solve the left-recursion, in addi-
- tion to wrong associativity for the - operator, so we were not kidding when
- we said that top-down parsing imposes some restrictions on grammar. On the
- flipside, writing a parser for such a grammar is a snap.
- So far, we have assumed that the parser selects the production rule to apply
- based on one terminal symbol, which it has in memory. There are also parsers
- that work with more than one token at a time. A recursive descent parser which
- works with 3 tokens is an LL(3) parser. More generally, an LL(k) parser is a
- top-down parser with a k tokens lookahead.
- Practical advice 6.1 (One Token Lookahead)
- Do not be tempted to write a parser that uses a lookahead of more than one
- token. The complexity of such a parser is much greater than the one-token
- lookahead LL(1) parser, and it will not really be necessary. Most, if not all,
- language constructs can be parsed using an LL(1) parser.
- □
- We have now found that grammars, suitable for recursive descent parsing,
- must obey the following two rules:
- 1. There must not be left-recursion in the grammar.
- 2. Each alternative production rule with the same left-hand side must start
- with a distinct terminal symbol. If it starts with a nonterminal symbol,
- examine the production rules for that nonterminal symbol and so on.
- We will repeat these definitions more formally shortly, after we have dis-
- cussed bottom-up parsing and compared it to recursive descent parsing.
- 6.5 Bottom-up Parsing
- Bottom-up parsers are the inverse of top-down parsers: they start with the
- full input sentence (string of terminals) and work by replacing substrings of
- terminals and nonterminals in the sentential form by nonterminals, effectively
- reversely applying the production rules.
- In the remainder of this chapter, we will focus on top-down parsing only, but
- we will illustrate the concept of bottom-up parsing (also known as LR) with an
- example. Consider the grammar in listing 6.2, which is not LL.
- Example 6.3 (Bottom-up Parsing by Hand)
- We will now parse the sentence 1 + 2 + 3 by hand, using the bottom-up approach.
- A bottom-up parser begins with the entire sentence to parse, and replaces groups
- of terminals and nonterminals with the left-hand side of production rules. In
- the initial situation, the parser sees the first terminal symbol, 1 , and decides
- to replace it with factor (which is the only possibility). Such a replacement is
- called a reduction.
- 1 + 2 + 3 =⇒ factor + 2 + 3
- Starting again from the left, the parser sees the nonterminal factor and de-
- cides to replace it with expression (which is, once again, the only possibility):
- 1 + 2 + 3 =⇒ factor + 2 + 3
- =⇒ expression + 2 + 3
- expression: expression + expression.
- expression: expression − expression.
- expression: factor.
- factor: 0.
- factor: 1.
- factor: 2.
- factor: 3.
- factor: 4.
- factor: 5.
- factor: 6.
- factor: 7.
- factor: 8.
- factor: 9.
- Listing 6.2: Expression Grammar for LR Parser
- There is now no longer a suitable production rule that has a lone expression on
- the right-hand side, so the parser reads another symbol from the input stream
- ( + ). Still, there is no production rule that matches the current input. The
- tokens expression and + are stored on a stack (shifted) for later reference. The
- parser reads another symbol from the input, which happens to be 2, which it
- can replace with factor, which can in turn be replaced by expression:
- 1 + 2 + 3 =⇒ factor + 2 + 3
- =⇒ expression + 2 + 3
- =⇒ expression + factor + 3
- =⇒ expression + expression + 3
- All of a sudden, the first three tokens in the sentential form ( expression +
- expression ), two of which were stored on the stack, form the right hand side of a
- production rule:
- expression: expression + expression.
- The parser replaces the three tokens with expression and continues the process
- until the situation is thus:
- 1 + 2 + 3 =⇒ factor + 2 + 3
- =⇒ expression + 2 + 3
- =⇒ expression + factor + 3
- =⇒ expression + expression + 3
- =⇒ expression + 3
- =⇒ expression + factor
- =⇒ expression + expression
- =⇒ expression
- In the final situation, the parser has reduced the entire original sentence
- to the start symbol of the grammar, which is a sign that the input text was
- syntactically correct.
- □
- Formally put, the shift-reduce method constructs a rightmost derivation S =⇒∗_R s,
- but in reverse order. This example shows that bottom-up parsers can deal with
- left-recursion (in fact, left-recursive grammars make for more efficient bottom-up
- parsers), which helps keep grammars simple. However, we stick with top-down
- parsers, since they are by far the easiest to write by hand.
- 6.6 Direction Sets
- So far, we have only informally defined which restrictions are placed on a gram-
- mar for it to be LL(k). We will now present these limitations more precisely.
- We must start with several auxiliary definitions.
- Definition 6.2 (FIRST-Set of a Production)
- The FIRST set of a production for a nonterminal A is the set of all terminal
- symbols, with which the strings generated from A can start.
- □
- Note that for an LL(k) grammar, the first k terminal symbols with which
- a production starts are included in the FIRST set, as a string. Also note that
- this definition relies on the use of BNF, not EBNF. It is important to realize
- that the following grammar excerpt:
- factor: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9.
- actually consists of 10 different production rules (all of which happen to
- share the same left-hand side). The FIRST set of a production is often denoted
- PFIRST, as a reminder of the fact that it is the FIRST set of a single Production.
- Definition 6.3 (FIRST-Set of a Nonterminal)
- The FIRST set of a nonterminal A is the set of all terminal symbols, with which
- the strings generated from A can start.
- If the nonterminal X has n productions in which it acts as the left-hand
- side, then
- FIRST(X) := ⋃_{i=1}^{n} PFIRST(X_i)
- □
- The LL(1) FIRST set of factor in the previous example is {0, 1, 2, 3, 4, 5, 6,
- 7, 8, 9}. Its individual PFIRST sets (per production) are {0} through {9}. We
- will deal only with LL(1) FIRST sets in this book.
- We also define the FOLLOW set of a nonterminal. FOLLOW sets are de-
- termined only for entire nonterminals, not for productions:
- Definition 6.4 (FOLLOW-Set of a Nonterminal)
- The FOLLOW set of a nonterminal A is the set of all terminal symbols, that
- may follow directly after A.
- □
- To illustrate FOLLOW-sets, we need a bigger grammar:
- Example 6.4 (FOLLOW-Sets)
- expression: factor restexpression.
- restexpression: ε
- | + factor restexpression
- | − factor restexpression.
- factor: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9.
- FOLLOW(expression) = {⊥} 1
- FOLLOW(restexpression) = {⊥}
- FOLLOW(factor) = {⊥,+,−}
- □
- We are now in a position to formalize the property of unambiguity for LL(k)
- grammars:
- Definition 6.5 (Unambiguity of LL(k) Grammars)
- A grammar is unambiguous when
- 1. If a nonterminal acts as the left-hand side of multiple productions, then
- the PFIRST sets of these productions must be disjoint.
- 2. If a nonterminal can produce the empty string (ε), then its FIRST set
- must be disjoint with its FOLLOW set.
- □
- How does this work in practice? The first condition is easy. Whenever an
- LL parser reads a terminal, it must decide which production rule to apply. It
- does this by looking at the first k terminal symbols that each production rule
- can produce (its PFIRST set). In order for the parser to be able to make the
- choice, these sets must not have any overlap. If there is no overlap, the grammar
- is said to be deterministic.
- If a nonterminal can be replaced with the empty string, the parser must check
- whether it is valid to do so. Inserting the empty string is an option when no
- other rule can be applied, and the nonterminals that come after the nonterminal
- that will produce the empty string are able to produce the terminal that the
- parser is currently considering. Hence, to make the decision, the FIRST set of
- the nonterminal must not have any overlap with its FOLLOW set.
- 1 We use ⊥ to denote end of file.
- Production PFIRST
- program: statementlist RESULT = implication. {A...Z}
- statementlist: ε. ∅
- statementlist: statement statementlist. {A...Z}
- statement: identifier = implication ;. {A...Z}
- implication: conjunction restimplication. {∼,(,0,1,A...Z}
- restimplication: ε. ∅
- restimplication: -> conjunction restimplication. {->}
- restimplication: <- conjunction restimplication. {<-}
- restimplication: <-> conjunction restimplication. {<->}
- conjunction: negation restconjunction. {∼,(,0,1,A...Z}
- restconjunction: ε. ∅
- restconjunction: & negation restconjunction. {&}
- restconjunction: | negation restconjunction. {|}
- negation: ∼ negation. {∼}
- negation: factor. {(,0,1,A...Z}
- factor: ( implication ). {(}
- factor: identifier. {A...Z}
- factor: 1. {1}
- factor: 0. {0}
- identifier: A. {A}
- identifier: Z. {Z}
- Table 6.1: PFIRST Sets for Logic Language
- 6.7 Parser Code
- A wondrous and most useful property of LL(k) grammars (henceforth referred
- to as LL since we will only be working with LL(1) anyway) is that a parser can
- be written for them in a very straightforward fashion (as long as the grammar
- is truly LL).
- A top-down parser needs a stack to place its nonterminals on. It is easiest to
- use the stack offered by the C compiler (or whatever language you work with)
- for this purpose. Now, for every nonterminal, we produce a function. This
- function checks that the terminal symbol currently under consideration is an
- element of the FIRST set of the nonterminal that the function represents, or
- else it reports a syntax error.
- After a syntax error, the parser may recover from the error using a synchro-
- nization approach (see chapter 8 on error recovery for details) and continue if
- possible, in order to find more errors.
- The body of the function reads any terminals that are specified in the pro-
- duction rules for the nonterminal, and calls other functions (that represent other
- nonterminals) in turn, thus putting frames on the stack. In the next section,
- we will show that this approach is ideal for constructing a syntax tree.
- Writing parser code is best illustrated with an (elaborate) example. Please
- refer to the grammar for the logic language (section 5.11), for which we will write
- a parser. In table 6.1, we show the PFIRST set for every individual production,
- while in table 6.2, we show the FIRST and FOLLOW sets for every nonterminal.
- Nonterminal FIRST FOLLOW
- program {A...Z} {⊥}
- statementlist {A...Z} {RESULT}
- statement {A...Z} {∼,(,0,1,A...Z,RESULT}
- implication {∼,(,0,1,A...Z} {;,⊥,)}
- restimplication { ->, <-, <-> } {;,⊥,)}
- conjunction {∼,(,0,1,A...Z} {->,<-,<->,;,⊥,)}
- restconjunction {&,|} {->,<-,<->,;,⊥,)}
- negation {∼,(,0,1,A...Z} {&,|,->,<-,<->,;,⊥,)}
- factor {(,0,1,A...Z} {&,|,->,<-,<->,;,⊥,)}
- identifier {A...Z} {=,&,|,->,<-,<->,;,⊥,)}
- Table 6.2: FIRST and FOLLOW Sets for Logic Language
- With this information, we can now build the parser. Refer to appendix G for
- the complete source code (including a lexical analyzer built with flex ). We will
- discuss the C function for the nonterminal conjunction here (shown in listing 6.3).
- The conjunction function first checks that the current terminal input symbol
- (stored in the global variable token ) is an element of FIRST(conjunction) (lines
- 3–6). If not, conjunction returns an error.
- If token is an element of the FIRST set, conjunction calls negation, which is
- the first symbol in the production rule for conjunction (lines 11-14):
- conjunction: negation restconjunction.
- If negation returns without errors, conjunction must now decide whether to
- call restconjunction (which may produce the empty string). It does so by looking
- at the current terminal symbol under consideration. If it is a & or a | (both
- part of FIRST(restconjunction)), it calls restconjunction (lines 16-19). If not, it
- skips restconjunction, assuming it produces the empty string.
- The other functions in the parser are constructed using a similar approach.
- Note that the parser presented here only performs a syntax check; the parser in
- appendix G also interprets its input (it is an interpreter), which makes for more
- interesting reading.
- 6.8 Conclusion
- Our discussion of parser construction is now complete. The results of parsing
- are placed in a syntax tree and passed on to the next phase, semantic analysis.
- 1 int conjunction()
- 2 {
- 3 if ( token != ’˜’ && token != ’(’
- 4 && token != IDENTIFIER
- 5 && token != TRUE
- 6 && token != FALSE )
- 7 {
- 8 return( ERROR );
- 9 }
- 10
- 11 if ( negation() == ERROR )
- 12 {
- 13 return( ERROR );
- 14 }
- 15
- 16 if ( token == ’&’ || token == ’|’ )
- 17 {
- 18 return( restconjunction () );
- 19 }
- 20 else
- 21 {
- 22 return( OK );
- 23 }
- 24 }
- Listing 6.3: Conjunction Nonterminal Function
- Bibliography
- [1] A.V. Aho, R. Sethi, J.D. Ullman: Compilers: Principles, Techniques and
- Tools, Addison-Wesley, 1986.
- [2] F.H.J. Feldbrugge: Dictaat Vertalerbouw, Hogeschool van Arnhem en Ni-
- jmegen, edition 1.0, 2002.
- [3] Intel: IA-32 Intel Architecture - Software Developer’s Manual - Volume 2:
- Instruction Set, Intel Corporation, Mt. Prospect, 2001.
- [4] J. Levine: Lex and Yacc, O’Reilly & Sons, 2000.
- [5] H. Meijer: Inleiding Vertalerbouw, University of Nijmegen, Subfaculty of
- Computer Science, 2002.
- [6] M.J. Scott: Programming Language Pragmatics, Morgan Kaufmann Pub-
- lishers, 2000.
- [7] T. H. Sudkamp: Languages & Machines, Addison-Wesley, 2nd edition,
- 1998.
- [8] N. Wirth: Compilerbouw, Academic Service, 1987.
- Chapter 7
- Preprocessor
- 7.1 What is a preprocessor?
- A preprocessor is a tool used by a compiler to transform a program before actual
- compilation. The facilities a preprocessor provides may vary, but the four most
- common functions a preprocessor could provide are:
- • header file inclusion
- • conditional compilation
- • macro expansion
- • line control
- Header file inclusion is the substitution of files for include declarations (in
- the C preprocessor this is the #include directive). Conditional compilation
- provides a mechanism to include and exclude parts of a program based on
- various conditions (in the C preprocessor this can be done with #define direc-
- tives). Macro expansion is probably the most powerful feature of a preprocessor.
- Macros are short abbreviations of longer program constructions. The prepro-
- cessor replaces these macros with their definition throughout the program (in
- the C preprocessor a macro is specified with #define). Line control is used
- to inform the compiler where a source line originally came from when different
- source files are combined into an intermediate file. Some preprocessors also re-
- move comments from the source file, though it is also perfectly acceptable to do
- this in the lexical analyzer.
- 7.2 Features of the Inger preprocessor
- The preprocessor in Inger only supports header file inclusion for now. In the near
- future other preprocessor facilities may be added, but due to time constraints
- header file inclusion is the only feature. The preprocessor directives in Inger
- always start at the beginning of a line with a #, just like the C preprocessor.
- The directive for header inclusion is #import followed by the name of the file
- to include between quotes.
- 7.2.1 Multiple file inclusion
- Multiple inclusion of the same header might give some problems. In C we
- prevent this through conditional compiling with a #define or with a #pragma
- once directive. The Inger preprocessor automatically prevents multiple inclusion
- by keeping a list of files that are already included for this source file.
- Example 7.1 (Multiple inclusion)
- Multiple inclusion – this should be perfectly acceptable for a programmer so
- no warning is shown, though hdrfile3 is included only once.
- □
- Forcing the user not to include files more than once is not an option since
- sometimes multiple header files just need the same other header file. This could
- be solved by introducing conditional compilation into the preprocessor and having
- the programmers solve it themselves, but it is more convenient if it happens
- automatically, so the Inger preprocessor keeps track of included files to prevent
- multiple inclusion itself.
- 7.2.2 Circular References
- Another problem that arises from header files including other header files is
- the problem of circular references. Again unlike the C preprocessor, the Inger
- preprocessor detects circular references and shows a warning while ignoring the
- circular include.
- Example 7.2 (Circular References)
- Circular inclusion – this always means that there is an error in the source so
- the preprocessor gives a warning and the second inclusion of hdrfile2 is ignored.
- □
- This is realized by building a tree structure of includes. Every time a new
- file is to be included, the tree is checked upwards to the root node, to see if
- this file has already been included. When a file has already been included, the
- preprocessor shows a warning and the import directive is ignored. Because every
- include creates a new child node in the tree, the preprocessor is able to distinguish
- between a multiple inclusion and a circular inclusion by only going up in the
- tree.
- Example 7.3 (Include tree)
- Include tree structure – for every inclusion, a new child node is added. This
- example shows how the circular inclusion for header 2 is detected by going
- upwards in the tree, while the multiple inclusion of header 3 is not seen as a
- circular inclusion because it is in a different branch.
- □
- Chapter 8
- Error Recovery
- “As soon as we started programming, we found to our surprise that it wasn’t
- as easy to get programs right as we had thought. Debugging had to be discovered.
- I can remember the exact instant when I realized that a large part of my life from
- then on was going to be spent in finding mistakes in my own programmes.” -
- Maurice Wilkes discovers debugging, 1949.
- 8.1 Introduction
- Almost no programs are ever written from scratch that contain no errors at all.
- Since programming languages, as opposed to natural languages, have very rigid
- syntax rules, it is very hard to write error-free code for a complex algorithm on
- the first attempt. This is why compilers must have excellent features for error
- handling. Most programs will require several runs of the compiler before they
- are free of errors.
- Error detection is a very important aspect of the compiler; it is the outward
- face of the compiler and the most important bit that the user will become rapidly
- familiar with. It is therefore imperative that error messages be clear, correct
- and, above all, useful. The user should not have to look up additional error
- information in a dusty manual, but the nature of the error should be clear from
- the information that the compiler gives.
- This chapter discusses the different natures of errors, and shows ways of de-
- tecting, reporting and recovering from errors.
- 8.2 Error handling
- Parsing is all about detecting syntax errors and displaying them in the most
- useful manner possible. For every compiler run, we want the parser to detect
- and display as many syntax errors as it can find, to alleviate the need for the
- user to run the compiler over and over, correcting the syntax errors one by one.
- There are three stages in error handling:
- • Detection
- • Reporting
- • Recovery
- The detection of errors will happen during compilation or during execution.
- Compile-time errors are detected by the compiler during translation. Runtime
- errors are detected by the operating system in conjunction with the hardware
- (such as a division by zero error). Compile-time errors are the only errors that
- the compiler should have to worry about, although it should make an effort to
- detect and warn about all the potential runtime errors that it can. Once an
- error is detected, it must be reported both to the user and to a function which
- will process the error. The user must be informed about the nature of the error
- and its location (its line number, and possibly its character number).
- The last and also most difficult task that the compiler faces is recovering
- from the error. Recovery means returning the compiler to a position in which
- it is able to resume parsing normally, so that subsequent errors do not result
- from the original error.
- 8.3 Error detection
- The first stage of error handling is detection which is divided into compile-time
- detection and runtime detection. Runtime detection is not part of the compiler
- so therefore it will not be discussed here. Compile-time however is divided into
- four stages which will be discussed here. These stages are:
- • Detecting lexical errors
- • Detecting syntactic errors
- • Detecting semantic errors
- • Detecting compiler errors
- Detecting lexical errors. Normally the scanner reports a lexical error when
- it encounters an input character that cannot be the first character of any lexical
- token. In other words, an error is signalled when the scanner is unfamiliar
- with a token found in the input stream. Sometimes, however, it is appropriate
- to recognize a specific sequence of input characters as an invalid token. An
- example of a error detected by the scanner is an unterminated comment. The
- scanner must remove all comments from the source code. It is not correct, of
- course, to begin a comment but never terminate it. The scanner will reach
- the end of the source file before it encounters the end of the comment. Another
- (similar) example is when the scanner is unable to determine the end of a string.
- Lexical errors also include the error class known as overflow errors. Most
- languages include the integer type, which accepts integer numbers of a certain
- bit length (32 bits on Intel x86 machines). Integer numbers that exceed the
- maximum bit length generate a lexical error. These errors cannot be detected
- using regular expressions, since regular expressions cannot interpret the value of
- a token, but only calculate its length. The lexer can rule that no integer number
- can be longer than 10 digits, but that would mean that 000000000000001 is not
- a valid integer number (although it is!). Rather, the lexer must verify that
- the literal value of the integer number does not exceed the maximum bit length
- using a so-called lexical action. When the lexer matches the complete token and
- is about to return it to the parser, it verifies that the token does not overflow.
- If it does, the lexer reports a lexical error and returns zero (as a placeholder
- value). Parsing may continue as normal.
- Detecting syntactic errors. The parser uses the grammar’s production rules
- to determine which tokens it expects the lexer to pass to it. Every nonterminal
- has a FIRST set, which is the set of all the terminal tokens with which the
- nonterminal can begin, and a FOLLOW set, which is the set of all the terminal tokens
- that may appear after the nonterminal. After receiving a terminal token from
- the lexical analyzer, the parser must check that it matches the FIRST set of the
- nonterminal it is currently evaluating. If so, then it continues with its normal
- processing, otherwise the normal routine of the parser is interrupted and an
- error processing function is called.
- Detecting semantic errors. Semantic errors are detected by the action rou-
- tines called within the parser. For example, when a variable is encountered it
- must have an entry in the symbol table. Or when the value of variable "a" is
- assigned to variable "b" they must be of the same type.
- Detecting compiler errors. The last category of compile-time errors deals
- with malfunctions within the compiler itself. A correct program could be incor-
- rectly compiled because of a bug in the compiler. The only thing the user can
- do is report the error to the system staff. To make the compiler as error-free as
- possible, it contains extensive self-tests.
- 8.4 Error reporting
- Once an error is detected, it must be reported to the user and to the error
- handling function. Typically, the user receives one or more messages that report
- the error. Error messages displayed to the user must obey a few style rules, so
- that they may be clear and easy to understand.
- 1. The message should be specific, pinpointing the place in the program
- where the error was detected as closely as possible. Some compilers include
- only the line number in the source file on which the error occurred, while
- others are able to highlight the character position in the line containing
- the error, making the error easier to find.
- 2. The messages should be written in clear and complete English sentences,
- never in cryptic terms. Never list just a message code number such as
- "error number 33" forcing the user to refer to a manual.
- 3. The message should not be redundant. For example, when a variable is not
- declared, it is not necessary to print that fact each time the variable is
- referenced.
- 4. The messages should indicate the nature of the error discovered. For
- example, if a colon were expected but not found, then the message should
- say just that, and not just "syntax error" or "missing symbol".
- 5. It must be clear that the given error is actually an error (so that the
- compiler did not generate an executable), or that the message is a warning
- (and an executable may still be generated).
- Example 8.1 (Error Reporting)
- Error
- The source code contains an overflowing integer value
- (e.g. 1234578901234567890).
- Response
- This error may be treated as a warning, since compilation can still
- take place. The offending overflowing value will be replaced with
- some neutral value (say, zero) and this fact should be reported:
- test.i (43): warning: integer value overflow
- (12345678901234567890). Replaced with 0.
- Error
- The source code is missing the keyword THEN where it is expected
- according to the grammar.
- Response
- This error cannot be treated as a warning, since an essential piece
- of code can not be compiled. The location of the error must be
- pinpointed so that the user can easily correct it:
- test.i (54): error: THEN expected after IF condition.
- Note that the compiler must now recover from the error; obviously
- an important part of the IF statement is missing and it must be
- skipped somehow. More information on error recovery will follow
- below.
- □
- 8.5 Error recovery
- There are three ways to perform error recovery:
- 1. When an error is found, the parser stops and does not attempt to find
- other errors.
- 2. When an error is found, the parser reports the error and continues parsing.
- No attempt is made at error correction (recovery), so the next errors may
- be irrelevant because they are caused by the first error.
- 3. When an error is found, the parser reports it and recovers from the error,
- so that subsequent errors do not result from the original error. This is the
- method discussed below.
- Any of these three approaches may be used (and have been), but it should be
- obvious that approach 3 is most useful to the programmer using the compiler.
- Compiling a large source program may take a long time, so it is advantageous
- to have the compiler report multiple errors at once. The user may then correct
- all errors at his leisure.
- 8.6 Synchronization
- Error recovery uses so-called synchronization points that the parser looks for
- after an error has been detected. A synchronization point is a location in the
- source code from which the parser can safely continue parsing without printing
- further errors resulting from the original error.
- Error recovery uses two sets of terminal tokens, the so-called direction sets:
- 1. The FIRST set is the set of all terminal symbols with which the strings,
- generated by all the productions for this nonterminal, can begin.
- 2. The FOLLOW set is the set of all terminal symbols that can be generated
- by the grammar directly after the current nonterminal.
- As an example for direction sets, we will consider the following very simple
- grammar and show how the FIRST and FOLLOW sets may be constructed for
- it.
- number: digit morenumber.
- morenumber: digit morenumber.
- morenumber: ?.
- digit : 0.
- digit : 1.
- Any nonterminal has at least one, but frequently more than one production
- rule. Every production rule has its own FIRST set, which we will call PFIRST.
- The PFIRST set for a production rule contains all the leftmost terminal tokens
- that the production rule may eventually produce. The FIRST set of any non-
- terminal is the union of all its PFIRST sets. We will now construct the FIRST
- and PFIRST sets for our sample grammar.
- PFIRST sets for every production:
- number: digit morenumber. PFIRST = {0, 1}
- morenumber: digit morenumber. PFIRST = {0, 1}
- morenumber: ε. PFIRST = {}
- digit: 0. PFIRST = {0}
- digit: 1. PFIRST = {1}
- FIRST sets for every nonterminal:
- FIRST(number) = {0, 1}
- FIRST(morenumber) = {0, 1} ∪ {} = {0, 1}
- FIRST(digit) = {0} ∪ {1} = {0, 1}
- Practical advice 8.1 (Construction of PFIRST sets)
- PFIRST sets may be most easily constructed by working from bottom to top:
- find the PFIRST sets for ’digit’ first (these are easy since the production rules
- for digit contain only terminal tokens). When finding the PFIRST set for a
- production rule higher up (such as number), combine the FIRST sets of the
- nonterminals it uses (in the case of number, that is digit). These make up the
- PFIRST set.
- □
- Every nonterminal must also have a FOLLOW set. A FOLLOW set con-
- tains all the terminal tokens that the grammar accepts after the nonterminal to
- which the FOLLOW set belongs. To illustrate this, we will now determine the
- FOLLOW sets for our sample grammar.
- number: digit morenumber.
- morenumber: digit morenumber.
- morenumber: ?.
- digit : 0.
- digit : 1.
- FOLLOW sets for every nonterminal:
- FOLLOW(number) = {EOF}
- FOLLOW(morenumber) = {EOF}
- FOLLOW(digit) = {EOF, 0, 1}
- The terminal tokens in these two sets are the synchronization points. After
- the parser detects and displays an error, it must synchronize (recover from the
- error). The parser does this by ignoring all further tokens until it reads a token
- that occurs in a synchronization point set, after which parsing is resumed. This
- point is best illustrated by an example, describing a Sync routine. Please refer to
- listing 8.1.
- /* Forward declarations. */
- /* If current token is not in FIRST set, display
-  * specified error.
-  * Skip tokens until current token is in FIRST
-  * or in FOLLOW set.
-  * Return TRUE if token is in FIRST set, FALSE
-  * if it is in FOLLOW set.
-  */
- BOOL Sync( int first [], int follow [], char *error )
- {
-     if ( !Element( token, first ) )
-     {
-         AddPosError( error, lineCount, charPos );
-     }
-     while( !Element( token, first ) && !Element( token, follow ) )
-     {
-         GetToken();
-         /* If EOF reached, stop requesting tokens and just
-          * exit, claiming that the current token is not
-          * in the FIRST set. */
-         if ( token == 0 )
-         {
-             return( FALSE );
-         }
-     }
-     /* Return TRUE if token in FIRST set, FALSE
-      * if token in FOLLOW set.
-      */
-     return( Element( token, first ) );
- }
- Listing 8.1: Sync routine
- /* Call this when an unexpected token occurs halfway through a
-  * nonterminal function. It prints an error, then
-  * skips tokens until it reaches an element of the
-  * current nonterminal’s FOLLOW set. */
- void SyncOut( int follow [] )
- {
-     /* Skip tokens until current token is in FOLLOW set. */
-     while( !Element( token, follow ) )
-     {
-         GetToken();
-         /* If EOF is reached, stop requesting tokens and
-          * exit. */
-         if ( token == 0 ) return;
-     }
- }
- Listing 8.2: SyncOut routine
- Tokens are requested from the lexer and discarded until a token occurs in
- one of the synchronization point lists.
- At the beginning of each production function in the parser the FIRST and
- FOLLOW sets are filled. Then the function Sync should be called to check if the
- token given by the lexer is available in the FIRST or FOLLOW set. If not, then
- the compiler must display the error and search for a token that is part of the
- FIRST or FOLLOW set of the current production. This is the synchronization
- point. From here on we can start checking for other errors.
- It is possible that an unexpected token is encountered halfway through a nonterminal
- function. When this happens, it is necessary to synchronize until a token of the
- FOLLOW set is found. The function SyncOut provides this functionality (see
- listing 8.2).
- Morgan (1970) claims that up to 80% of the spelling errors occurring in
- student programs may be corrected in this fashion.
- Part III
- Semantics
- We are now in a position to continue to the next level and take a look at
- the shady side of compiler construction: semantics. This part of the book will
- provide answers to questions like: What are semantics good for? What is the
- difference between syntax and semantics? Which checks are performed? What
- is type checking? And what is a symbol table? In other words, this part will
- unleash the riddles of semantic analysis. First it is important to know the
- difference between syntax and semantics. Syntax is the grammatical arrangement
- of words or tokens in a language, which establishes their necessary relations.
- Hence, syntax analysis checks the correctness of the relation between elements
- of a sentence. Let’s explain this with an example using a natural language. The
- sentence
- Loud purple flowers talk
- is incorrect according to the English grammar; hence the syntax of the
- sentence is flawed. The relation between the words is incorrect due to the
- bad syntactic construction, which results in a meaningless sentence.
- Whereas syntax is about the relation between elements of a sentence, se-
- mantics is concerned with the meaning of the production. The relation of the
- elements in a sentence can be right, while the construction as a whole has no
- meaning at all. The sentence
- Purple flowers talk loud
- is correct according to the English grammar, but the meaning of the sentence
- is not flawless at all, since purple flowers cannot talk! At least, not yet.
- The semantic analysis checks for meaningless constructions and erroneous
- productions which could have multiple meanings, and generates error and
- warning messages.
- When we apply this theory to programming languages, we see that syntax
- analysis finds syntax errors such as typos and invalid constructions such as
- illegal variable names. However, it is possible to write programs that are
- syntactically correct but still violate the rules of the language. For example,
- the following sample code conforms to the Inger syntax, but is invalid
- nonetheless: we cannot assign a value to a function.
myFunc() = 6;
- The semantic analysis is of great importance since code with assignments like
- this may act strange when executing the program after successful compilation.
- If the program above does not crash with a segmentation fault on execution and
- apparently executes the way it should, there is a chance that something fishy
- is going on: is a new address being assigned to the function myFunc() ? We do
- not assume 1 that everything will work the way we think it will work.
- Some things are too complex for syntax analysis; this is where semantic
- analysis comes in. Type checking is necessary because we cannot enforce correct
- use of types in the syntax: too much additional information would be needed.
- This additional information, like (return) types, is available to the type
- checker, stored in the AST and the symbol table.
- Let us begin with the symbol table.
- 1 Tip 27 from The Pragmatic Programmer [1]: Don't Assume It - Prove It. Prove your
- assumptions in the actual environment - with real data and boundary conditions.
- Chapter 9
- Symbol table
- 9.1 Introduction to symbol identification
- At compile time we need to keep track of the different symbols (functions and
- variables) declared in the program source code. This is a requirement because
- every symbol must be identifiable by a unique name, so we can look the symbols
- up again later.
- Example 9.1 (An Incorrect Program)
module flawed;

start main: void → void
{
    2 * 4;            // result is lost
    myfunction( 3 );  // function does not exist
}
- ?
- Without symbol identification it is impossible to make assignments or define
- functions. After all, where do we assign the value to? And how can we call a
- function if it has no name? We do not think we would make many programmers
- happy if they could only reference values and functions through memory
- addresses. In the example, the mathematical production yields 8 , but the
- result is not assigned to a uniquely identified symbol. The call to myfunction
- yields a
- compiler error, as myfunction is not declared anywhere in the program source,
- so there is no way for the compiler to know what code the programmer actually
- wishes to call. This explains why we need symbol identification, but does not
- yet tell us anything practical about the subject.
- 9.2 Scoping
- We would first like to introduce scoping. What actually is scoping? In Webster’s
- Revised Unabridged Dictionary (1913) a scope is defined as:
- Room or opportunity for free outlook or aim; space for action; am-
- plitude of opportunity; free course or vent; liberty; range of view,
- intent, or action.
- When discussing scoping in the context of a programming language, the
- description that comes closest is Webster's range of view. A scope limits the
- view a statement or expression has of other symbols. Let us illustrate this
- with an example, in which every block, delimited by { and } , results in a
- new scope.
- Example 9.2 (A simple scope example)
module example;        // begin scope (global)

int a = 4;
int b = 3;

start main: void → void
{                      // begin scope (main)
    float a = 0;

    {                  // begin scope (free block)
        char a = 'a';
        int x;
        print( a );
    }                  // end scope (free block)

    x = 1;             // x is not in range of view!
    print( b );
}                      // end scope (main)
                       // end scope (global)
- ?
- Example 9.2 contains an Inger program with 3 nested scopes containing
- variable declarations. Note that there are 3 declarations for variables named a ,
- which is perfectly legal as each declaration is made in a scope that did not yet
- contain a symbol called ‘ a ’ of its own. Inger only allows referencing of symbols
- in the local scope and (grand)parent scopes. The expression x = 1; is illegal
- since x was declared in a scope that could be best described as a nephew scope.
- Using identification in scopes enables us to declare a symbol name multiple
- times in the same program, as long as each declaration is in a different scope.
- This implies that a symbol is unique in its own scope, not necessarily in the
- whole program; this way we do not run out of useful variable names within a
- program.
- Now that we know what scoping is, it is probably best to continue with some
- theory on how to store information about these scopes and their symbols during
- compile time.
- 9.3 The Symbol Table
- Symbols collected during parsing must be stored for later reference during
- semantic analysis. In order to have access to the symbols during a later stage
- it is important to define a clear data structure in which we store the
- necessary information about the symbols and the scopes they are in. This data
- structure is called a Symbol Table and can be implemented in a variety of ways
- (e.g. arrays, linked lists, hash tables, binary search trees and n-ary search
- trees). Later on in this chapter we will discuss which data structure we
- considered best for our symbol table implementation.
- 9.3.1 Dynamic vs. Static
- There are two possible types of symbol tables, dynamic or static symbol tables.
- What exactly are the differences between the two types, and why is one better
- than the other? A dynamic symbol table can only be used when both the
- gathering of symbol information and the usage of this information happen in
- one pass. It works in a stack-like manner: a symbol is pushed on the stack
- when it is encountered. When a scope is left, we pop all symbols belonging to
- that scope from the stack. A static table is built once and can be walked as
- many times as required. It is only deconstructed when the compiler is finished.
- Example 9.3 (Example of a dynamic vs. static symbol table)
- module example;
- int v1, v2;
- 5 f : int v1, v2 → int
- {
- return (v1 + v2);
- }
- 10 start g: int v3 → int
- {
- return (v1 + v3);
- }
- The following steps illustrate how a dynamic table grows and
- shrinks over time.
- After line 3 the symbol table is a set of symbols
- T = {v1,v2}
- After line 5 the symbol table also contains the function f and the local variables
- v1 and v2
- T = {v1,v2,f,v1,v2}
- At line 9 the symbol table is back to its global form
- T = {v1,v2}
- After line 10 the symbol table is expanded with the function g and the local
- variable v3
- T = {v1,v2,g,v3}
- Next we illustrate how a static table’s set of symbols only grows.
- After line 3 the symbol table is a set of symbols
- T = {v1,v2}
- After line 5 the symbol table also contains the function f and the local variables
- v1 and v2
- T = {v1,v2,f,v1,v2}
- After line 10 the symbol table is expanded with the function g and the local
- variable v3
- T = {v1,v2,f,v1,v2,g,v3}
- ?
- In earlier stages of our research we assumed local symbols should only be
- present in the symbol table while the scope they are declared in is being
- processed (this includes, of course, children of that scope). After all,
- storing a symbol that is no longer accessible seems pointless. This assumption
- originated when we were working on the idea of a single-pass compiler
- (tokenizing, parsing, semantics and code generation all in one single run), so
- for a while we headed in the direction of a dynamic symbol table. Later on we
- decided to make the compiler multi-pass, which resulted in the need to store
- symbols over a longer timespan. Using multiple passes, the local symbols must
- remain available for as long as the compiler lives, as they might be needed on
- every pass, so we decided to switch to a static symbol table.
- When using a static symbol table, the table will not shrink, but only grow.
- Instead of building the symbol table during the first pass, which happens when
- using a dynamic table, we construct the symbol table from the AST, which is
- available after parsing the source code.
- 9.4 Data structure selection
- 9.4.1 Criteria
- In order to choose the right data structure for implementing the symbol table
- we look at its primary goal and the selection criteria. Its primary goal is
- storing symbol information and providing a fast lookup facility for easy
- access to all the stored symbol information.
- Since we had only a short period of time to develop our language Inger and
- the compiler, the only criterion in choosing a suitable data structure was
- that it had to be easy to use and implement.
- 9.4.2 Data structures compared
- One of the first possible data structures which comes to mind when thinking of
- how to store a list of symbols is perhaps an array. Although an array is very
- convenient for storing symbols, it has quite a few practical limitations.
- Arrays are defined with a static size, so chances are you define a symbol
- table with array size 256 and end up with 257 symbols, causing a buffer
- overflow (an internal compiler error). Writing beyond an array limit produces
- unpredictable results. Needless to say, this situation is undesirable.
- Searching an array is not efficient at all due to its linear search algorithm.
- A binary search algorithm could be applied, but this would require the array
- to be sorted at all times. For sorted arrays, searching is a fairly
- straightforward operation and easy to implement. It is also notable that an
- array would probably only be usable with a dynamic table, or if no scoping
- were allowed.
- Figure 9.1: Array
- If the table were implemented as a stack, finding a symbol would be a simple
- matter of searching for the desired symbol from the top of the stack. The
- first variable found is automatically the last variable added, and thus the
- variable in the nearest scope. This implementation makes it easy to use
- multi-level scoping, but is a heavy burden on performance: with every search
- the stack has to be deconstructed and stored on a second stack (until the
- first occurrence is found) and then reconstructed, a very expensive operation.
- This implementation would probably be the best for a dynamic table if it were
- not for its expensive search operations.
- Another dynamic implementation would be a linked list of symbols. If
- implemented as an unsorted doubly linked list it would be possible to use it just
- Figure 9.2: Stack
- like the stack (append at the back and search from the back) without the
- disadvantage of a lot of push and pop operations. A search still takes linear
- time, but the operations themselves are much cheaper than in the stack
- implementation.
- Figure 9.3: Linked List
- Binary search trees improve search time massively, but only in sorted form
- (an unsorted binary tree, after all, is not a search tree at all). This
- results in the loss of an advantage the stack and doubly linked list offered:
- easy scoping. Now the first symbol found is not by definition the latest
- definition of that symbol name; it is in fact probably the first occurrence.
- This means that the search is not complete until it is impossible to find
- another occurrence. It also means that we have to include some sort of scope
- field with every symbol to separate the symbols: (a,1) and (a,2) are symbols
- of the same name, but a is in a higher scope and therefore the correct symbol.
- Another big disadvantage is that when a function has been processed we need
- to rid the tree of all symbols in that function’s scope. This requires a
- complete search and rebalancing of the tree. Since the tree is sorted by
- string value, every operation (insert, search, etc.) is quite expensive.
- These operations could be made more time efficient by using a hash algorithm,
- as explained in the next paragraph.
- String comparisons are relatively expensive compared to comparisons of simple
- types such as integers or bytes, so using a hash algorithm to convert symbol
- names to simple types would speed up all operations on the tree considerably.
- The last option we discuss is the n-ary tree. Every node has n children, each
- of which implies a new scope as a child of its parent scope. Every node is a
- scope, and all symbols in that scope are stored inside that node. When the
- AST is walked, all the code has to do is make sure that the symbol table
- walks along. Then, when information about a symbol is requested, we only have
- to search the current scope and its (grand)parents. In our opinion this is
- the only valid static symbol table implementation.
- Figure 9.4: Binary Tree
- Figure 9.5: N-ary Tree
- 9.4.3 Data structure selection
- We think a combination of an n-ary tree and linked lists is a suitable
- solution for us. Initially we thought using just one linked list was a good
- idea. Each list node, representing a scope, would contain the root node of a
- binary tree. The major advantage of this approach is that adding and removing
- a scope is easy and fast. This advantage was based on the idea that the
- symbol table would grow and shrink during the first pass of compilation,
- which means that the symbol table would not be available anymore after the
- first pass. This is not what we want, since the symbol table must be
- available at all times after the first pass. We therefore favour a data
- structure that is less efficient in removing scopes (we do not remove scopes
- anyway) but faster in looking up symbols in the symbol table.
- 9.5 Types
- The symbol table data structure alone is not enough; it is just a tree and
- must be decorated with symbol information like (return) types and modifiers.
- To store this information correctly we designed several logical structures
- for symbols and types. It basically comes down to a set of functions which
- wrap a Type structure. These functions include CreateType() , AddSimpleType() ,
- AddDimension() , AddModifier() , and so on. There is a similar set of accessor
- functions.
- 9.6 An Example
- To illustrate how the symbol table is filled from the Abstract Syntax Tree,
- we show which steps have to be taken.
- 1. Start walking at the root node of the AST in a pre-order fashion.
- 2. For each block we encounter we add a new child to the current scope and
- make this child the new current scope.
- 3. For each variable declaration found, we extract:
- - Variable name
- - Variable type 1
- 4. For each function found, we extract:
- - Function name
- - Function types, starting with the return type 1
- 5. After the end of a block is encountered we move back to the parent scope.
- To conclude we will show a simple example.
- Example 9.4 (Simple program to test expressions)
module test_module;

int z = 0;

inc: int a → int
{
    return( a + 1 );
}

start main: void → void
{
    int i;
    i = (z * 5) / 10 + 20 - 5 * (132 + inc( 3 ));
}
- ?
- We can distinguish the following steps in parsing the example source code.
- 1. found z , add symbol to current scope ( global )
- 2. found inc , add symbol to current scope ( global )
- 3. enter a new scope level as we now parse the function inc
- 4. found the parameter a , add this symbol to the current scope ( inc )
- 5. as no new symbols are encountered, leave this scope
- 6. found main , add symbol to current scope ( global )
- 1 For every type we also store optional information such as modifiers (start, extern, etc.)
- and dimensions (for pointers and arrays).
- 7. enter a new scope level as we now parse the function main
- 8. found i , add symbol to current scope ( main )
- 9. as no new symbols are encountered, leave this scope
- After these steps our symbol table will look like this.
- Figure 9.6: Conclusion
- Chapter 10
- Type Checking
- 10.1 Introduction
- Type checking is part of the semantic analysis. The purpose of type checking
- is to evaluate each operator, application and return statement in the AST
- (Abstract Syntax Tree) and check its operands or arguments. The operands or
- arguments must be of compatible types and form a valid combination with the
- operator. For instance: when the operator is + , the left operand is an
- integer and the right operand is a char pointer, it is not a valid addition.
- You cannot add a char pointer to an integer without explicit coercion.
- The type checker evaluates all nodes where types are used in an expression
- and produces an error when it cannot find a decent solution (through coercion)
- to a type conflict.
- Type checking is one of the last steps to detect semantic errors in the
- source code. After it, only a few semantic checks remain before code
- generation can commence.
- This chapter discusses the process of type checking: how to modify the AST
- by including type information and how to produce proper error messages when
- necessary.
- 10.2 Implementation
- The process of type checking consists of two parts:
- • Decorate the AST with types for literals.
- • Propagate these types up the tree taking into account:
- – Type correctness for operators
- – Type correctness for function arguments
- – Type correctness for return statements
- – If types do not match in their simple form ( int , float , etc.), try
- to coerce these types.
- • Perform a last type check to make sure indirection levels are correct (e.g.
- assigning an int to a pointer variable).
- 10.2.1 Decorate the AST with types
- To decorate the AST with types it is advisable to walk post-order through the
- AST and search for all literal identifiers and values. When a literal is found
- in the AST, its type must be located in the symbol table; therefore it is
- necessary to keep track of the current scope level while walking through the
- AST. The symbol table provides the information about the literal (all we are
- interested in is the type), which is then stored in the AST.
- The second step is to move up in the AST and evaluate types for unary,
- binary and application nodes.
- Figure 10.1 illustrates the process of expanding the tree. It shows the AST
- decoration process for the expression a = b + 1; . The variables a and b are
- both declared as integers.
- Example 10.1 (Decorating the AST with types.)
- Figure 10.1: AST type expanding
- The nodes a , b and 1 are the literals. These are the first nodes we
- encounter when walking post-order through the tree. The second part is to
- determine the types of node + and node = . After we have passed the literals
- b and 1 we arrive at node + . Because we have already determined the types of
- its left and right child, we can evaluate its type. In this case the outcome
- (further referred to in the text as the result type) is easy: because nodes b
- and 1 are both of type int , node + will also become an int .
- Because we are still walking post-order through the AST, we finally arrive
- at node = . The right and left child are both integers, so this node will
- also become an integer.
- ?
- The advantage of walking post-order through the AST is that all the type
- checking can be done in one pass. If you were to walk pre-order through the
- AST it would be advisable to decorate the AST with types in two passes. The
- first pass should walk pre-order through the AST and decorate only the
- literal nodes; the second pass, which also walks pre-order through the AST,
- evaluates the parent nodes of the literals. This cannot be done in one pass
- because the first time you walk pre-order through the AST you will first
- encounter the = node. When you try to evaluate its type you will find that
- its children do not have a type yet.
- The above example was easy; all the literals were integers, so the result
- type will also be an integer. But what would happen if one of the literals
- were a float and the others were all integers?
- One way of dealing with this problem is to create a table with conversion
- priorities. When, for example, a float and an int are encountered, the type
- with the highest priority wins. These priorities can be found in the table
- for each operator. For an example of this table, see table 10.1. In this
- table the binary operators assign = and add + are listed. The final version
- of this table has all binary operators in it. The same goes for all unary
- operators, like the not ( ! ) operator.
Node               Type
NODE_ASSIGN        FLOAT
NODE_ASSIGN        INT
NODE_ASSIGN        CHAR
NODE_BINARY_ADD    FLOAT
NODE_BINARY_ADD    INT
- Table 10.1: Conversion priorities
- The highest priority is on top for each operator.
- The second part is to make a list of types which can be converted for each
- operator. In the Inger language it is possible to convert an integer to a float
- but the conversion from integer to string is not possible. This table is called the
- coercion table. For an example see table 10.2.
From type    To type    New node
INT          FLOAT      NODE_INT_TO_FLOAT
CHAR         INT        NODE_CHAR_TO_INT
CHAR         FLOAT      NODE_CHAR_TO_FLOAT
- Table 10.2: Coercion table
- A concrete example is shown in example 10.2. It presents the AST for the
- expression a = b + 1.0; . The variable a is declared as a float and b is
- declared as an integer. The literal 1.0 is also a float.
- Example 10.2 (float versus int.)
- The literals a , b and 1.0 are all looked up in the symbol table. The
- variable a and the literal 1.0 are both floats; the variable b is an
- integer. Because we
- Figure 10.2: Float versus int
- are walking post-order through the AST, the first operator we encounter is
- + . Operator + has an integer as its left child and a float as its right
- child. Now it is time to use the lookup table to find out what type the +
- operator must be. The first entry for the operator + in the lookup table is
- of type float; this type has the highest priority. Because one of the two
- children is also a float, the result type for the operator + will be a float.
- It is still necessary to check whether the other child can be converted to
- float. If not, an error message should appear on the screen.
- The second operator is the operator = . This follows exactly the same
- process as for the + operator. The left child ( a ) is of type float and the
- right child ( + ) is of type float, so operator = will also become a float.
- However, what would happen if the left child of the assignment operator =
- were an integer? Normally the result type would be looked up in table 10.1,
- but in the case of an assignment there is an exception: for the assignment
- operator = , its left child determines the result. So if the left child is an
- integer, the assignment operator will also become an integer. When you
- declare a variable as an integer and an assignment takes place in which the
- right child differs from the originally declared type, an error must occur.
- It is not possible to change the original declaration type of any variable.
- This is the only operator exception you should take care of.
- We just illustrated what happens if two different types belonging to the
- same operator are encountered. After the complete pass the AST is decorated
- with types and finally looks like figure 10.3.
- Figure 10.3: Float versus int result
- ?
- 10.2.2 Coercion
- After the decoration of the AST is complete and all the checks have been
- executed, the main goal of the type checker module is achieved. At this
- point it is necessary to make a choice. There are two ways to continue: the
- first is to start with code generation, in which case the type checking
- module is completely finished. The second is to prepare the AST for the code
- generation module.
- In the first approach, the type checker’s responsibility is now finished, and
- it is up to the code generation module to perform the necessary conversions. In
- the sample source line
- int b;
- float a = b + 1.0;
- Listing 10.1: Coercion
- the code generation module finds that since a is a float, the result of
- b + 1.0 must also be a float. This implies that the value of b must be
- converted to float in order to add 1.0 to it and return the sum. To
- determine that variable b must be converted to a float it is necessary to
- evaluate the expression just like it is done in the type checker module.
- In the second approach, the type checking module takes the responsibility
- for converting the variable b to a float. Because the type checker module
- already decorates the AST with all types, and therefore knows which
- conversions must be made, it can easily apply the conversions so that the
- code generation module does not have to repeat the evaluation process.
- To prepare the AST for the above problem we have to apply the coercion
- technique. Coercion means the conversion from one type to another. However,
- it is not possible to convert any given type to any other given type. Since
- all integers are elements of the set Z and all real numbers (floats) are in
- the set R, the following formula applies:
- Z ⊂ R
- To apply this in practice it is necessary to modify the AST by adding new
- nodes: the so-called coercion nodes. The best way to explain this is with a
- practical example. For this example, refer to the source of listing 10.1.
- Example 10.3 (coercion)
- In the first approach, where we let the code generation module take care of
- the coercion, the AST would end up looking like figure 10.3. In the second
- approach, where the type checking module takes responsibility for the
- coercion, the AST will have the structure shown in figure 10.4.
- Notice that node b has been replaced by node IntToFloat , and node b has
- become a child of node IntToFloat . The node IntToFloat is called the
- coercion node. When, during the type checker pass, we arrive at node + , the
- left and right child are both evaluated. Because the left child is an
- integer and the right child a float, the outcome must be a float. This is
- determined by the type lookup table 10.1. Since we now know the result type
- for node + we can apply
- coercion to its children. This is only required for the child whose type
- differs from its parent (node + ).
- Figure 10.4: AST coercion
- When we find a child whose type differs from its parent’s, we use the
- coercion table 10.2 to check whether it is possible to convert the type of
- the child node (node b ) to its parent’s type. If this is not possible, an
- error message must be produced and the compilation process stops. When it is
- possible to apply the conversion, a new node is inserted into the AST. This
- node replaces node b , its type becomes float, and node b becomes its child.
- ?
- 10.3 Overview
- Now all the steps of the type checker module are completed. The AST is
- decorated with types and prepared for the code generation module. Example
- 10.4 gives a complete display of the AST before and after the type checking
- pass.
- Example 10.4 (AST decoration)
- Consult the sample Inger program in listing 10.2. The AST before decoration
- is shown in figure 10.5; notice that all types are unknown (no type). The
- AST after decoration is shown in figure 10.6.
- ?
- 10.3.1 Conclusion
- Type checking is the most important part of the semantic analysis. However,
- even when type checking is completed there can still be errors in the
- source. For example:
- • unreachable code: statements located after a return keyword will never
- be executed;
- module example;
- start f : void → float
- {
- float a = 0;
- int b = 0;
- a = b + 1.0;
- return (a);
- }
- Listing 10.2: Sample program listing
- • when a function header is declared with a return type other than void ,
- the return keyword must occur in the function body. This check does not
- verify that the returned type is valid; that already took place in the type
- checker pass;
- • check for duplicate case labels in a switch ;
- • lvalue check: when an assignment = is located in the AST, its left child
- cannot be a function. This is a rule we applied for the Inger language;
- other languages may allow it;
- • when a goto statement is encountered, the label the goto points to must
- exist;
- • function parameter count: when a function is declared with two parameters
- (return type excluded), a call to the function must also have two
- parameters.
- All these small checks are also part of the semantic analysis and will be
- discussed in the next chapter. After these checks are performed, code
- generation can finally take place.
- Figure 10.5: AST before decoration
- Figure 10.6: AST after decoration
- Bibliography
- [1] A.B. Pyster: Compiler Design and Construction, Van Nostrand Reinhold
- Company, 1980
- [2] G. Goos, J. Hartmanis: Compiler Construction - An Advanced Course,
- Springer-Verlag, Berlin, 1974
- Chapter 11
- Miscellaneous Semantic
- Checks
- 11.1 Left Hand Values
- An lvalue, short for left hand value, is an expression or identifier
- reference that can be placed on the left hand side of an assignment. The
- lvalue check is one of
- the necessary checks in the semantic stage. An lvalue check makes sure that no
- invalid assignments are done in the source code. Examples 11.1 and 11.2 show
- us what lvalues are valid in the Inger compiler and which are not.
- Example 11.1 (Invalid Lvalues)
function() = 6;
2 = 2;
"somestring" = "somevalue";
- ?
- What makes a valid lvalue? An lvalue must be a modifiable entity. One could
- enumerate the invalid lvalues and check for them, but in our case it is
- better to check for the lvalues that are valid, because this list is much
- shorter.
- Example 11.2 (Valid Lvalues)
int a = 6;
name = "janwillem";
- ?
- 11.1.1 Check Algorithm
- To check the validity of the lvalues we need a filled AST (Abstract Syntax
- Tree) in order to have access to all elements of the source code. To get a
- better grasp of the checking algorithm, have a look at the pseudocode in
- example 11.3. This algorithm results in a list of error messages, if any.
- Example 11.3 (Check Algorithm)
- Start at the root of the AST
- for each node found in the AST do
- if the node is an ’=’ operator then
- check its leftmost child in the AST,
- which is the lvalue, and see if it is
- a valid one.
- if invalid report an error
- else go to the next node
- else go to the next node
- ?
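- The algorithm above can be sketched in Python. This is a sketch only: the real check is implemented in C against the Inger AST, and the tuple-based node representation and node kinds used here are hypothetical.

```python
# Minimal AST sketch: a node is (kind, payload); for operator nodes the
# payload is a list of child nodes, for leaves it is a plain value.

def is_valid_lvalue(node):
    # Only identifier references are modifiable entities in this sketch.
    return node[0] == "identifier"

def check_lvalues(node, errors):
    """Walk the AST; for every '=' node, check its leftmost child."""
    kind, payload = node
    children = payload if isinstance(payload, list) else []
    if kind == "=" and children and not is_valid_lvalue(children[0]):
        errors.append("invalid lvalue: " + children[0][0])
    for child in children:
        check_lvalues(child, errors)
    return errors
```

- Checking the tree for 2 = 2 produces one error message, while a = 6 passes silently.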
- Not all lvalues are as straightforward as they seem. A valid but bizarre
- example of a semantically correct assignment is:
- Example 11.4 (Bizarre Assignment)
- int a[20];
- int b = 4;
- a = a ∗ b;
- ?
- We chose to make this a valid assignment in Inger, to provide some
- address arithmetic possibilities. The code in example 11.4 multiplies the base
- address of a by the value of identifier b. Let’s say that the base address of
- array a was initially 0x2FFFFFA0 ; then the new base address of a will be 0xBFFFFE80 .
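- The effect of this assignment can be verified with a quick arithmetic sketch (plain Python; the 32-bit mask mimics register wraparound and is an assumption of this sketch, not Inger semantics):

```python
def array_times_scalar(base_address, scalar):
    # The array name evaluates to its base address, so the assignment
    # a = a * b simply multiplies that address by the scalar value.
    return (base_address * scalar) & 0xFFFFFFFF  # 32-bit wraparound
```

- Multiplying 0x2FFFFFA0 by 4 indeed gives 0xBFFFFE80.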
- 11.2 Function Parameters
- This section covers argument count checking. Amongst other things, function
- parameters must be checked before we actually start generating code. Apart
- from checking the use of a correct number of function arguments in function
- calls and the occurrence of multiple definitions of the main function, we also
- check whether the passed arguments are of the correct type. Argument type
- checking is explained in chapter 10.
- The idea of checking the number of arguments passed to a function is pretty
- straightforward. The check consists of two steps: firstly, we collect all the
- function header nodes from the AST and store them in a list. Secondly, we
- compare the number of arguments used in each function call to the number of
- arguments required by each function and check that the numbers match.
- To build a list of all nodes that are function headers, we make a pass through
- the AST, collect all nodes that are of type NODE_FUNCTIONHEADER , and
- put them in a list structure provided by the generic list module. It is faster to
- go through the AST once and build a list of the nodes we need than to make a
- pass through the AST to look for a node each time we need it. After building
- the list for the example program 11.5, it will contain the header nodes for the
- functions main and AddOne .
- Example 11.5 (Program With Function Headers)
- module example;
- start main : void → void
- {
- AddOne( 2 );
- }
- AddOne : int a → int
- {
- return( a + 1 );
- }
- ?
- The next step is to do a second pass through the AST and look for nodes
- of type NODE_APPLICATION , which represent a function call in the source code.
- When such a node is found, we first retrieve the actual number of arguments
- passed in the function application with the helper function GetArgumentCountFromApplication .
- Secondly, we get the number of arguments as defined in the
- function declaration, using the function GetArgumentCount . Then
- it is just a matter of comparing the number of arguments we expect with the
- number of arguments we found. We print an error message when a function
- is called with too many or too few arguments.
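- The comparison step can be sketched as follows. The data shapes are hypothetical: in the real compiler the counts come from GetArgumentCount and GetArgumentCountFromApplication on AST nodes, not from Python dictionaries.

```python
def check_argument_counts(headers, calls):
    """headers: function name -> declared parameter count, collected in
    the first AST pass over the function header nodes.
    calls: (name, actual argument count) pairs collected in the second
    pass over the function application nodes."""
    errors = []
    for name, argc in calls:
        if name in headers and headers[name] != argc:
            errors.append("function %s expects %d argument(s), got %d"
                          % (name, headers[name], argc))
    return errors
```

- For example program 11.5, the call AddOne( 2 ) matches the single declared parameter, so no error is reported.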
- 11.3 Return Keywords
- The typechecking mechanism of the Inger compiler checks if a function returns
- the right type when assigning a function return value to a variable.
- Example 11.6 (Correct Variable Assignment)
- int a;
- a = myfunction();
- ?
- The source code in example 11.6 is correct and implies that the function
- myfunction returns a value. As in most programming languages, we introduced
- a return keyword in our language, and we define the following semantic rules:
- unreachable code and non-void function returns (definitions 11.1 and 11.2).
- 11.3.1 Unreachable Code
- Definition 11.1 (Unreachable code)
- Code after a return keyword in Inger source will not be executed. A
- warning for this unreachable code will be generated.
- ?
- For this check we traverse the AST pre-order and check each code block for
- the return keyword. If the child node containing the return keyword is not the
- last child node in the code block, the remaining statements are unreachable
- code. An example of unreachable code can be found in example
- 11.7, in which function print takes an integer as parameter and prints it to the
- screen. The statement ‘ print( 2 ) ’ will never be executed, since the function main
- returns before that print call is reached.
- Example 11.7 (Unreachable Code)
- start main : void → int
- {
- int a = 8;
- if ( a == 8 )
- {
- print ( 1 );
- return( a );
- print ( 2 );
- }
- }
- ?
- Unreachable code is, besides useless, not a problem, and the compilation
- process can continue; therefore a warning message is printed.
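- The pre-order check described above can be sketched like this (a sketch under assumed node shapes; the real check walks the C AST structures):

```python
def check_unreachable(block, warnings):
    """block: a list of statements; a statement is a tuple whose first
    element is its kind and whose optional second element is a nested
    block of statements."""
    for i, stmt in enumerate(block):
        # A return that is not the last statement makes the rest unreachable.
        if stmt[0] == "return" and i != len(block) - 1:
            warnings.append("warning: unreachable code after return")
        # Recurse into nested code blocks (if bodies, loop bodies, ...).
        if len(stmt) > 1 and isinstance(stmt[1], list):
            check_unreachable(stmt[1], warnings)
    return warnings
```

- Run on a block shaped like example 11.7, it warns once about the print statement after the return.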
- 11.3.2 Non-void Function Returns
- Definition 11.2 (Non-void function returns)
- The last statement in a non-void function should be the keyword
- ’return’ in order to return a value. If the last statement in a non-
- void function is not ’return’ we generate a warning ‘control reaches
- end of non-void function’.
- ?
- It is nice that unreachable code is detected, but it is not essential to the next
- phase of the compilation process. Non-void function returns, on the contrary,
- have a greater impact. Functions that should return a value but never do can
- result in an erroneous program. In example 11.8, variable a is assigned the
- result value of function myfunction , but the function myfunction never returns a
- value.
- Example 11.8 (Non-void Function Returns)
- module functionreturns;
- start main : void → void
- {
- int a;
- a = myfunction();
- }
- myfunction : void → int
- {
- int b = 2;
- }
- ?
- To make sure all non-void functions return, we check for the return keyword,
- which should be in the function code block. As with most semantic checks,
- we go through the AST pre-order and search all function code blocks for the
- return keyword. When a function has a return statement in an if-then-else
- statement, both the then and the else block should contain the return keyword,
- because the code blocks are executed conditionally. The same holds for a switch
- block: all case blocks should contain a return statement. Every non-void function
- without a return keyword generates a warning.
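- The branch rules above can be captured in a small recursive predicate (a sketch; node kinds are hypothetical stand-ins for the Inger AST node types):

```python
def block_always_returns(block):
    """True when the last statement of a code block guarantees a return:
    a plain return, an if-then-else whose branches both return, or a
    switch whose case blocks all return."""
    if not block:
        return False
    last = block[-1]
    if last[0] == "return":
        return True
    if last[0] == "if-then-else":
        # Both branches are conditional, so both must return.
        return block_always_returns(last[1]) and block_always_returns(last[2])
    if last[0] == "switch":
        # Every case block must return.
        return all(block_always_returns(case) for case in last[1])
    return False
```

- A non-void function whose body fails this predicate gets the ‘control reaches end of non-void function’ warning.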
- 11.4 Duplicate Cases
- Generally a switch statement has one or more case blocks. It is syntactically
- correct to define multiple code blocks with the same case value, so-called dupli-
- cate case values. If duplicate case values occur, it might not be clear which code
- block is executed; this is a choice which you should make as a compiler builder.
- The semantic check in Inger generates a warning when a duplicate case value is
- found, and generates code for the first case code block. We chose to generate a
- warning instead of an error message because the multi-value construction still
- allows us to go to the next phase in compilation: code generation. Example
- program 11.9 will have the output
- This is the first case block
- because we generate code for the first code block defined for
- duplicate case value 0 .
- Example 11.9 (Duplicate Case Values)
- /* Duplicate cases
- * A program with duplicate case values
- */
- module duplicate_cases;
- start main : void → void
- {
- int a = 0;
- switch( a )
- {
- case 0
- {
- printf ( "This is the first case block" );
- }
- case 0
- {
- printf ( "This is the second case block" );
- }
- default
- {
- printf ( "This is the default case" );
- }
- }
- }
- ?
- The algorithm that checks for duplicate case values is pretty simple and
- works recursively down the AST. It starts at the root node of the AST and
- searches for NODE_SWITCH nodes. For each switch node found, we search for
- duplicate children in the cases block. If any duplicates are found, we generate a
- proper warning; otherwise we continue until the complete AST has been searched.
- In the end this check will detect all duplicate values and report them.
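- The per-switch duplicate search can be sketched as follows (assuming the case values of one switch node have already been collected into a list):

```python
def check_duplicate_cases(case_values):
    """case_values: the case labels of one switch node, in source order.
    One warning is issued per duplicate occurrence; code is generated
    only for the first block that carries the value."""
    seen = set()
    warnings = []
    for value in case_values:
        if value in seen:
            warnings.append("warning: duplicate case value %d" % value)
        else:
            seen.add(value)
    return warnings
```

- For example program 11.9 the case list is [0, 0], so exactly one warning is reported.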
- 11.5 Goto Labels
- In the Inger language we implemented the goto statement although use of this
- statement is often considered harmful. Why exactly is goto considered harmful?
- As the late Edsger Dijkstra ([3]) stated:
- The go to statement as it stands is just too primitive; it is too much
- an invitation to make a mess of one’s program
- Despite its possible harmfulness we decided to implement it. Why? Because
- it is a very cool feature. For the unaware Inger programmer we added a subtle
- reminder to the keyword goto and implemented it as goto considered harmful .
- As with variables, goto labels should be declared before they are used.
- Since this pre-condition cannot be enforced using grammar rules (syntax), it should
- be checked in the semantic stage. Due to a lack of time we did not implement
- this semantic check, and therefore programmers and users of the Inger compiler
- should be aware that jumping to undeclared goto labels may result in inexplica-
- ble and possibly undesired program behaviour. Example code 11.10 shows the
- correct way to use the goto keyword.
- Example 11.10 (Goto Usage)
- int n = 10;
- label here;
- printstr ( n );
- n = n − 1;
- if ( n > 0 )
- {
- goto considered harmful here;
- }
- ?
- A good implementation for this check would be to store the label decla-
- rations in the symbol table, then walk through the AST and search for goto
- statements. The identifier in a goto statement like
- goto considered harmful labelJumpHere
- is looked up in the symbol table. If the goto label is not found, an error
- message is generated. Although goto is a very cool feature, be careful using
- it.
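- The proposed (but unimplemented) check would boil down to a lookup per goto target, sketched here with plain Python sets standing in for the symbol table:

```python
def check_goto_labels(declared_labels, goto_targets):
    """declared_labels: label names stored in the symbol table while
    processing 'label' declarations.
    goto_targets: label names used by goto statements, collected while
    walking the AST."""
    errors = []
    for target in goto_targets:
        if target not in declared_labels:
            errors.append("error: goto to undeclared label " + target)
    return errors
```

- For example 11.10 the label here is declared before use, so the check passes.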
- Bibliography
- [1] A. Hunt, D. Thomas: The Pragmatic Programmer, Addison Wesley, 2002
- [2] Thomas A. Sudkamp: Languages And Machines Second Edition, Addison
- Wesley, 1997
- [3] Edsger W. Dijkstra: Go To Statement Considered Harmful
- http://www.acm.org/classics/oct95/
- Part IV
- Code Generation
- Code generation is the final step in building a compiler. After the semantic
- analysis there should be no more errors in the source code. If there are still
- errors then the code generation will almost certainly fail.
- This part of the book contains descriptions of how the assembly output will
- be generated from the Inger source code. The subjects covered in this part
- include implementation (assembly code) of every operator supported by Inger,
- storage of data types, calculation of array offsets and function calls, with regard
- to stack frames and return values.
- In the next chapter, code generation is explained at an abstract level. In the
- final chapter of this book, code templates, we present assembly code templates
- for each operation in Inger. Using templates, we can guarantee that operations
- can be chained together in any order desired by the programmer, including
- orders we did not expect.
- Chapter 12
- Code Generation
- 12.1 Introduction
- Code generation is the least discussed and therefore the most mystical aspect
- of compiler construction in the literature. It is also not extremely difficult, but
- requires great attention to detail. The approach used in the Inger compiler is to
- write a template for each operation. For instance, there is a template for addition
- (the + operation), a template for multiplication, dereferencing, function
- calls and array indexing. All of these templates may be chained together in any
- order. We can assume that the order is valid, since if the compiler gets to code
- generation, the input code has passed the syntax analysis and semantic analysis
- phases. Let’s take a look at a small example of using templates.
- int a = ∗(b + 0x20);
- Generating code for this line of Inger code involves the use of four templates.
- The order of the templates required is determined by the order in which the
- expression is evaluated, i.e. the order in which the tree nodes in the abstract
- syntax tree are linked together. By traversing the tree post-order, the first
- template applied is the template for addition, since the result of b + 0x20 must
- be known before anything else can be evaluated. This leads to the following
- ordering of templates:
- 1. Addition: calculate the result of b + 0x20
- 2. Dereferencing: find the memory location that the number between parentheses
- points to. This number was, of course, calculated by the previous (inner)
- template.
- 3. Declaration: the variable a is declared as an integer, either on the stack
- (if it is a local variable) or on the heap (if it is a global variable).
- 4. Assignment: the value delivered by the dereferencing template is stored
- in the location returned by the declaration template.
- If the templates are written carefully enough (and tested well enough), we
- can create a compiler that supports any ordering of templates. The question,
- then, is how templates can be linked together. The answer lies in assigning one
- register (in this case, eax ) as the result register. Every template stores its result in
- eax , whether it is a value or a pointer. The meaning of the value stored in eax
- is determined by the template that stored it.
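- The single-result-register idea can be mimicked by a post-order evaluator in which every ‘template’ leaves its result in one accumulator. This is an illustrative sketch, not the compiler’s code: the node kinds are hypothetical, and the comments show which template each branch corresponds to.

```python
def evaluate(node, memory, variables):
    """Post-order evaluation of expressions like *(b + 0x20); every
    branch plays the role of one code template, and its return value
    plays the role of the eax result register."""
    kind = node[0]
    if kind == "const":
        return node[1]                              # constant template
    if kind == "var":
        return variables[node[1]]                   # variable load template
    if kind == "+":
        left = evaluate(node[1], memory, variables)
        right = evaluate(node[2], memory, variables)
        return left + right                         # addition template
    if kind == "deref":
        address = evaluate(node[1], memory, variables)
        return memory[address]                      # dereference template
    raise ValueError("unknown node kind: " + kind)
```

- Whether the accumulator holds a value or an address depends solely on which template produced it, exactly as with eax.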
- 12.2 Boilerplate Code
- Since the Inger compiler generates assembly code, it is necessary to wrap up this
- code in a format that the assembler expects. We use the GNU AT&T assembler,
- which uses the AT&T assembly language syntax (a syntax very similar to Intel
- assembly, but with some peculiar quirks). Take a look at the following assembly
- instruction, first in Intel assembly syntax:
- MOV EAX, EBX
- This instruction copies the value stored in the EBX register into the EAX
- register. In GNU AT&T syntax:
- movl %ebx, %eax
- We note several differences:
- 1. Register names are written lowercase, and prefixed with a percent ( % ) sign
- to indicate that they are registers, not global variable names;
- 2. The order of the operands is reversed. This is a most irritating property
- of the AT&T assembly language syntax which is a major source of errors.
- You have been warned.
- 3. The instruction mnemonic mov is suffixed with the size of its operands (4
- bytes: l for long). This is similar to Intel’s BYTE PTR , WORD PTR and DWORD
- PTR keywords.
- There are other differences, some more subtle than others, regarding deref-
- erencing and indexing. For complete details, please refer to the GNU As Manual [10].
- The GNU Assembler specifies a default syntax for the assembly files, at file
- level. Every file has at least one data segment (designated with .data ), and
- one code segment (designated with .text ). The data segment contains global
- .data
- .globl a
- .align 4
- .type a,@object
- .size a,4
- a:
- .long 0
- Listing 12.1: Global Variable Declaration
- variables and string constants, while the code segment holds the actual code.
- The code segment may never be written to, while the data segment is modifiable.
- Global variables are declared by specifying their size, and optionally a type and
- alignment. Global variables are always of type @object (as opposed to type
- @function for functions). The code in listing 12.1 declares the variable a .
- It is also required to declare at least one function (the main function) as
- a global label. This function is used as the program entry point. Its type is
- always @function .
- 12.3 Globals
- The assembly code for an Inger program is generated by traversing the tree
- multiple times. The first pass is necessary to find all global declarations. As
- the tree is traversed, the code generation module checks for declaration nodes.
- When it finds a declaration node, the symbol that belongs to the declaration is
- retrieved from the symbol table. If this symbol is a global, the type information
- is retrieved and the assembly code to declare this global variable is generated
- (see listing 12.1). Local variables and function parameters are skipped during
- this pass.
- 12.4 Resource Calculation
- During the second pass when the real code is generated, the implementations
- for functions are also created. Before the code of a function can be generated,
- the code generation module must know the location of all function parameters
- and local variables on the stack. This is done by quickly scanning the body of
- the function for local declarations. Whenever a declaration is found its position
- on the stack is determined and this is stored in the symbol itself, in the sym-
- bol table. This way references to local variables and parameters can easily be
- converted to stack locations when generating code for the function implemen-
- tation. The size and location of each symbol play an important role in creating
- the layout for function stack frames, later on.
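- The offsets computed in this pass anticipate the stack frame layout described later in this chapter: parameters sit above the saved base pointer and return address, locals below it. A sketch (assuming, as the Inger compiler does, that every symbol occupies 4 bytes):

```python
def assign_stack_offsets(parameters, local_variables):
    """Map each symbol name to its offset from the base pointer.
    The first parameter lives at base+8 (above the saved base pointer
    and the return address, 4 bytes each); the first local at base-4."""
    offsets = {}
    offset = 8                    # skip saved base pointer + return address
    for name in parameters:
        offsets[name] = offset
        offset += 4
    offset = -4
    for name in local_variables:
        offsets[name] = offset
        offset -= 4
    return offsets
```

- Storing these offsets in the symbol table lets the code generator turn every variable reference into a fixed base-pointer-relative address.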
- 12.5 Intermediate Results of Expressions
- The code generation module in Inger is implemented in a very simple and
- straightforward way. There is no real register allocation involved; all inter-
- mediate values and results of expressions are stored in the EAX register. Even
- though this leads to extremely unoptimized code – both in speed and size –
- it is also very easy to write. Consider the following simple program:
- /*
- * simple.i
- * Simple example program to demonstrate code generation.
- */
- module simple;
- extern printInt : int i → void;
- int a, b;
- start main : void → void
- {
- a = 16;
- b = 32;
- printInt ( a ∗ b );
- }
- This little program translates to the following x86 assembly code, which shows
- how the intermediate values and results of expressions are kept in the EAX
- register:
- .data
- .globl a
- .align 4
- .type a,@object
- .size a,4
- a:
- .long 0
- .globl b
- .align 4
- .type b,@object
- .size b,4
- b:
- .long 0
- .text
- .align 4
- .globl main
- .type main,@function
- main:
- pushl %ebp
- movl %esp, %ebp
- subl $0, %esp
- movl $16, %eax
- movl %eax, a
- movl $32, %eax
- movl %eax, b
- movl a, %eax
- movl %eax, %ebx
- movl b, %eax
- imul %ebx
- pushl %eax
- call printInt
- addl $4, %esp
- leave
- ret
- The eax register may contain either values or references, depending on the
- code template that placed a value in eax . If the code template was, for instance,
- addition , eax will contain a numeric value (either floating point or integer). If
- the code template was the template for the address operator ( & ), then eax will
- contain an address (a pointer).
- Since the code generation module can assume that the input code is both
- syntactically and semantically correct, the meaning of the value in eax does not
- really matter. All the code generation module needs to do is make sure that the
- value in eax is passed between templates correctly and, if possible, efficiently.
- 12.6 Function Calls
- Function calls are executed using the Intel assembly call statement. Since the
- GNU assembler is a reasonably high-level assembler, it is sufficient to supply the
- name of the function being called; the linker ( ld ) will take care of the job of
- filling in the correct address, assuming the function exists – but once again, we
- can assume that the input Inger code is semantically and syntactically correct.
- If a function really does not exist, and the linker complains about it, it is because
- there is an error in a header file. The syntax of a basic function call is:
- call printInt
- Of course, most interesting functions take parameters. Parameters to func-
- tions are always passed to the function using the stack. For this, Inger uses
- the same paradigm that the C language uses: the caller is responsible both for
- placing parameters on the stack and for removing them after the function call
- has completed. The reason that Inger is so compatible with C is a practical one:
- this way Inger can call C functions in operating system libraries, and we do not
- need to supply wrapper libraries that call these functions. This makes the life
- of Inger programmers and compiler writers a little bit easier.
- Apart from parameters, functions also have local variables, and these live on
- the stack too. All in all the stack is rapidly becoming complex, and we call the
- order in which parameters and local variables are placed on the stack the stack
- frame. As stated earlier, Inger adheres to the calling convention popularized
- by the C programming language, and therefore the stack frames of the two
- languages are identical.
- The function being called uses the ESP register to point to the top of the
- stack. The EBP register is the base pointer to the stack frame. As in C,
- parameters are pushed on the stack from right to left (the last argument is
- pushed first). Return values of 4 bytes or less are stored in the EAX register. For
- return values with more than 4 bytes, the caller passes an extra first argument
- to the callee (the function being called). This extra argument is the address
- of the location where the return value must be stored (this extra argument is
- the first argument, so it is the last argument to be pushed on the stack). To
- illustrate this point, we give an example in C:
- Example 12.1 (Stack Frame)
- /* vec3 is a structure of
- * 3 integers (12 bytes). */
- struct vec3
- {
- int x, y, z;
- };
- /* f is a function that returns
- * a vec3 struct: */
- vec3 f( int a, int b, int c );
- Since the return value of the function f is more than 4 bytes, an extra
- first argument must be placed on the stack, containing the address of the vec3
- structure that the function returns. This means the call:
- v = f ( 1, 0, 3 );
- is transformed into:
- f( &v , 1, 0, 3 );
- ?
- It should be noted that Inger does not support structures at this time, and
- all data types can be handled using either return values of 4 bytes or less, which
- fit in eax , or using pointers (which are also 4 bytes and therefore fit in eax ). For
- future extensions of Inger, we have decided to support the extra return value
- function argument.
- Since functions have a stack frame of their own, the contents of the stack
- frame occupied by the caller are quite safe. However, the registers used by the
- caller will be overwritten by the callee, so the caller must take care to push any
- values it needs later onto the stack. If the caller wants to save the eax , ecx and
- edx registers, it has to push them on the stack first. After that, it pushes the
- arguments (from right to left), and when the call instruction is called, the eip
- register is pushed onto the stack too (implicitly, by the call instruction), which
- means the return address is on top of the stack.
- Although the caller does most of the work creating the stack frame (pushing
- parameters on the stack), the callee still has to do several things. The stack
- frame is not yet finished, because the callee must create space for local variables
- (and set them to their initial values, if any). Furthermore, the callee must
- save the contents of ebx , esi and edi as needed, and set esp and ebp to point to the
- top and bottom of the stack, respectively. Initially, the EBP register points to
- a location in the caller’s stack frame. This value must be preserved, so it must
- be pushed onto the stack. The contents of esp (the bottom of the current stack
- frame) are then copied into ebp , so that esp is free to do other things and to
- allow arguments to be referenced as an offset from ebp . This gives us the stack
- frame depicted in figure 12.1.
- Figure 12.1: Stack Frame Without Local Variables
- To allocate space for local variables and temporary storage, the callee just
- subtracts the number of bytes required for the allocation from esp . Finally, it
- pushes ebx , esi and edi on the stack, if the function overwrites them. Of course,
- this depends on the templates used in the function, so for every template, its
- effects on ebx , esi and edi must be known.
- The stack frame now has the form shown in figure 12.2.
- During the execution of the function, the stack pointer esp might go up
- and down, but the ebp register is fixed, so the function can always refer to the
- first argument as [ebp+8] . The second argument is located at [ebp+12] (decimal
- offset), the third argument is at [ebp+16] and so on, assuming all arguments are
- 4 bytes in size.
- The callee is not done yet, because when execution of the function body
- is complete, it must perform some cleanup operations. Of course, the caller is
- responsible for cleaning up function parameters it pushed onto the stack (just
- like in C), but the remainder of the cleanup is the callee’s job. The callee must:
- • Store the return value in eax , or in the extra parameter;
- • Restore the ebx , esi and edi registers as needed.
- Restoration of the values of the ebx , esi and edi registers is performed by
- popping them from the stack, where they had been stored for safekeeping earlier.
- Figure 12.2: Stack Frame With Local Variables
- Of course, it is important to only pop the registers that were pushed onto the
- stack in the first place: some functions save ebx , esi and edi , while others do not.
- The last thing to do is taking down the stack frame. This is done by moving
- the contents of ebp to esp (thus effectively discarding the stack frame) and
- popping the original ebp from the stack. 1 The return ( ret ) instruction can now
- be executed, which pops the return address off the stack and places it in the eip
- register.
- Since the stack is now exactly the same as it was before making the function
- call, the arguments (and return value when larger than 4 bytes) are still on the
- stack. The esp can be restored by adding the number of bytes the arguments
- use to esp .
- Finally, if there were any saved registers ( eax , ecx and edx ) they must be
- popped from the stack as well.
- 12.7 Control Flow Structures
- The code generation module handles if/then/else structures by generating com-
- parison code and conditional jumps. The jumps go to the labels that are gen-
- erated before the then and else blocks.
- Loops are also implemented in a very straightforward manner. First, a
- label is generated to jump back to on every iteration. After that, the comparison
- code is generated, in exactly the same way as for if expressions.
- After this, the code block of the loop is generated, followed by
- a jump to the label right before the comparison code. The loop is concluded
- with a final label where the comparison code can jump to if the result of the
- expression is false.
- 1 The i386 instruction set has an instruction leave which does this exact thing.
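- The label/compare/jump skeleton for loops can be sketched as a small emitter. This is an illustrative sketch, not the compiler’s code: the condition and body are assumed to be pre-generated instruction lists, and the label numbering scheme mirrors the .LABELn style used in the templates.

```python
def generate_while(condition_code, body_code, label_counter):
    """Wrap pre-generated condition and body instruction lists in the
    label/compare/jump skeleton of a while loop; label_counter keeps
    the generated labels unique across the program."""
    start = ".LABEL%d" % label_counter
    end = ".LABEL%d" % (label_counter + 1)
    return (["%s:" % start]
            + condition_code                   # leaves its result in eax
            + ["cmpl $0, %eax", "je %s" % end]
            + body_code
            + ["jmp %s" % start, "%s:" % end])
```

- The same skeleton minus the back-jump gives the if-then template; adding a second label gives if-then-else.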
- 12.8 Conclusion
- This concludes the description of the inner workings of the code generation
- module for the Inger language.
- Bibliography
- [1] O. Andrew and S. Talbott: Managing Projects with Make, O’Reilly & Associates, Inc., December 1991.
- [2] B. Brey: 8086/8088, 80286, 80386, and 80486 Assembly Language Pro-
- gramming, Macmillan Publishing Company, 1994.
- [3] G. Chapell: DOS Internals, Addison-Wesley, 1994.
- [4] J. Duntemann: Assembly Language Step-by-Step, John Wiley & Sons, Inc.,
- 1992.
- [5] T. Hogan: The Programmers PC Sourcebook: Charts and Tables for the
- IBM PC Compatibles, and the MS-DOS Operating System, including the
- new IBM Personal System/2 computers, Microsoft Press, Redmond, Wash-
- ington, 1988.
- [6] K. Irvine: Assembly Language for Intel-based Computers, Prentice-Hall,
- Upper Saddle River, NJ, 1999.
- [7] M. L. Scott: Programming Language Pragmatics, Morgan Kaufmann Publishers, 2000.
- [8] I. Sommerville: Software Engineering (sixth edition), Addison-Wesley,
- 2001.
- [9] W. Stallings: Operating Systems: achtergronden, werking en ontwerp, Aca-
- demic Service, Schoonhoven, 1999.
- [10] R. Stallman: GNU As Manual,
- http://www.cs.utah.edu/dept/old/texinfo/as/as.html
- Chapter 13
- Code Templates
- This final chapter of the book serves as a repository of code templates.
- These templates are used by the compiler to generate code for common
- (sub)expressions. Every template has a name and is treated in the pages ahead,
- each template on a page of its own.
- Addition
- Inger
- expr + expr
- Example
- 3 + 5
- Assembler
- 1. The left expression is evaluated and stored in eax .
- 2. movl %eax, %ebx
- 3. The right expression is evaluated and stored in eax .
- 4. addl %ebx, %eax
- Description
- The result of the left expression is added to the result of the right
- expression and the result of the addition is stored in eax .
- Subtraction
- Inger
- expr − expr
- Example
- 8 − 3
- Assembler
- 1. Left side of expression is evaluated and stored in eax .
- 2. movl %eax, %ebx
- 3. Right side of expression is evaluated and stored in eax.
- 4. subl %ebx, %eax
- Description
- The result of the right expression is subtracted from the result of
- the left expression and the result of the subtraction is stored in eax .
- Multiplication
- Inger
- expr ∗ expr
- Example
- 12 ∗ 4
- Assembler
- 1. Left side of expression is evaluated and stored in eax .
- 2. movl %eax, %ebx
- 3. Right side of expression is evaluated and stored in eax .
- 4. imul %ebx
- Description
- The result of the left expression is multiplied with the result of the
- right expression and the result of the multiplication is stored in eax .
- Division
- Inger
- expr / expr
- Example
- 32 / 8
- Assembler
- 1. Left side of expression is evaluated and stored in eax .
- 2. movl %eax, %ebx
- 3. Right side of expression is evaluated and stored in eax .
- 4. xchgl %eax, %ebx
- 5. xorl %edx, %edx
- 6. idiv %ebx
- Description
- The result of the left expression is divided by the result of the right
- expression and the result of the division is stored in eax .
- Modulus
- Inger
- expr % expr
- Example
- 14 % 3
- Assembler
- 1. Left side of expression is evaluated and stored in eax .
- 2. movl %eax, %ebx
- 3. Right side of expression is evaluated and stored in eax .
- 4. xchgl %eax, %ebx
- 5. xorl %edx, %edx
- 6. idiv %ebx
- 7. movl %edx, %eax
- Description
- The result of the left expression is divided by the result of the right
- expression and the remainder of the division is stored in eax .
- 178
- Negation
- Inger
- −expr
- Example
- −10
- Assembler
- 1. Expression is evaluated and stored in eax .
- 2. neg %eax
- Description
- The result of the expression is negated and stored in the eax register.
- Left Bitshift
- Inger
- expr << expr
- Example
- 256 << 2
- Assembler
- 1. Left side of expression is evaluated and stored in eax .
- 2. movl %eax, %ecx
- 3. Right side of expression is evaluated and stored in eax .
- 4. xchgl %eax, %ecx
- 5. sall %cl, %eax
- Description
- The result of the left expression is shifted n bits to the left, where n
- is the result of the right expression. The result is stored in the eax
- register.
- Right Bitshift
- Inger
- expr >> expr
- Example
- 16 >> 2
- Assembler
- 1. Left side of expression is evaluated and stored in eax .
- 2. movl %eax, %ecx
- 3. Right side of expression is evaluated and stored in eax .
- 4. xchgl %eax, %ecx
- 5. sarl %cl, %eax
- Description
- The result of the left expression is shifted n bits to the right, where
- n is the result of the right expression. The result is stored in the eax
- register.
- Bitwise And
- Inger
- expr & expr
- Example
- 255 & 15
- Assembler
- 1. Left side of expression is evaluated and stored in eax .
- 2. movl %eax, %ebx
- 3. Right side of expression is evaluated and stored in eax .
- 4. andl %ebx, %eax
- Description
- The result of an expression is subject to a bitwise and operation with
- the result of another expression and this is stored in the eax register.
- Bitwise Or
- Inger
- expr | expr
- Example
- 13 | 22
- Assembler
- 1. Left side of expression is evaluated and stored in eax .
- 2. movl %eax, %ebx
- 3. Right side of expression is evaluated and stored in eax .
- 4. orl %ebx, %eax
- Description
- The result of an expression is subject to a bitwise or operation with
- the result of another expression and this is stored in the eax register.
- Bitwise Xor
- Inger
- expr ˆ expr
- Example
- 63 ˆ 31
- Assembler
- 1. Left side of expression is evaluated and stored in eax .
- 2. movl %eax, %ebx
- 3. Right side of expression is evaluated and stored in eax .
- 4. xorl %ebx, %eax
- Description
- The result of an expression is subject to a bitwise xor operation with
- the result of another expression and this is stored in the eax register.
- If-Then-Else
- Inger
- if ( expr )
- {
- // Code block
- }
- // The following part is optional
- else
- {
- // Code block
- }
- Example
- int a = 2;
- if ( a == 1 )
- {
- a = 5;
- }
- else
- {
- a = a − 1;
- }
- Assembler
- When there is only a then block:
- 1. Expression is evaluated and stored in eax .
- 2. cmpl $0, %eax
- 3. je .LABEL0
- 4. Then code block is generated.
- 5. .LABEL0:
- When there is an else block:
- 1. Expression is evaluated and stored in eax .
- 2. cmpl $0, %eax
- 3. je .LABEL0
- 4. Then code block is generated.
- 5. jmp .LABEL1
- 6. .LABEL0:
- 7. Else code block is generated.
- 8. .LABEL1:
- Description
- This template describes an if-then-else construction. The condi-
- tional code execution is realized with conditional jumps to labels.
- Different templates are used for if-then and if-then-else construc-
- tions.
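- A generator for this template has to invent fresh label names for every if statement. The sketch below (our own illustration, not the actual Inger back-end) uses a simple global counter for that; the condition code is assumed to have left its result in eax already.

```c
#include <stdio.h>

static int labelCount = 0;  /* fresh-label counter */

/* Hypothetical sketch of the if-then-else template. The then- and
 * else-blocks are stand-in comments here; a real generator would
 * recurse into the AST at those points. */
static void emit_if_else(char *buf, size_t size)
{
    int l0 = labelCount++;   /* start of else block */
    int l1 = labelCount++;   /* end of the whole statement */
    snprintf(buf, size,
             "\tcmpl $0, %%eax\n"      /* condition false? */
             "\tje .LABEL%d\n"         /* then skip the then-block */
             "\t# ... then block ...\n"
             "\tjmp .LABEL%d\n"        /* skip the else-block */
             ".LABEL%d:\n"
             "\t# ... else block ...\n"
             ".LABEL%d:\n",
             l0, l1, l0, l1);
}
```

- The if-then variant simply drops the `jmp` and the second label.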
- While Loop
- Inger
- while( expr ) do
- {
- // Code block
- }
- Example
- int i = 5;
- while( i > 0 ) do
- {
- i = i − 1;
- }
- Assembler
- 1. .LABEL0:
- 2. Expression is evaluated and stored in eax .
- 3. cmpl $0, %eax
- 4. je .LABEL1
- 5. Code block is generated.
- 6. jmp .LABEL0
- 7. .LABEL1:
- Description
- This template describes a while loop. The expression is evaluated
- and while the result of the expression is true the code block is exe-
- cuted.
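- Note that the condition code must sit inside the loop, so that it is re-evaluated on every iteration. A hypothetical emitter for this template (our own sketch, taking the already-generated condition and body fragments as strings) could look like this:

```c
#include <stdio.h>

static int labelCount = 0;  /* fresh-label counter */

/* Hypothetical sketch of the while-loop template. `cond` is the
 * assembly that leaves the condition's result in eax; `body` is
 * the assembly for the loop body. */
static void emit_while(char *buf, size_t size,
                       const char *cond, const char *body)
{
    int l0 = labelCount++;   /* top of the loop */
    int l1 = labelCount++;   /* exit label */
    snprintf(buf, size,
             ".LABEL%d:\n"
             "%s"                  /* condition -> eax */
             "\tcmpl $0, %%eax\n"
             "\tje .LABEL%d\n"     /* leave the loop when false */
             "%s"                  /* loop body */
             "\tjmp .LABEL%d\n"    /* test the condition again */
             ".LABEL%d:\n",
             l0, cond, l1, body, l0, l1);
}
```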
- Function Application
- Inger
- func( arg1, arg2, argN );
- Example
- printInt ( 4 );
- Assembler
- 1. The expression of each argument is evaluated, stored in eax ,
- and pushed on the stack.
- 2. movl %ebp, %ecx
- 3. The location on the stack is determined.
- 4. call printInt (in this example the function name is printInt)
- 5. The number of bytes used for the arguments is calculated.
- 6. addl $4, %esp (in this example the number of bytes is 4)
- Description
- This template describes the application of a function. The argu-
- ments are pushed on the stack according to the C style function call
- convention.
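- The steps above can be sketched as a small emitter. This is our own illustration under the stated convention: each argument's value ends up in eax and is pushed (right to left, C style, 4 bytes per argument on IA-32), and the caller removes the arguments afterwards.

```c
#include <stdio.h>

/* Hypothetical sketch of the function application template. */
static int emit_call(char *buf, size_t size, const char *name, int nargs)
{
    int n = 0;
    for (int i = nargs - 1; i >= 0; i--)   /* push right to left */
        n += snprintf(buf + n, size - n,
                      "\t# ... evaluate argument %d into eax ...\n"
                      "\tpushl %%eax\n", i);
    n += snprintf(buf + n, size - n,
                  "\tcall %s\n"
                  "\taddl $%d, %%esp\n",   /* caller cleans the stack */
                  name, 4 * nargs);
    return n;
}
```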
- Function Implementation
- Inger
- func: type ident1 , type ident2 , type identN → returntype
- {
- // Implementation
- }
- Example
- square: int i → int
- {
- return( i ∗ i );
- }
- Assembler
- 1. .globl square (in this example the function name is square)
- 2. .type square, @function
- 3. square:
- 4. pushl %ebp
- 5. movl %esp, %ebp
- 6. The number of bytes needed for the parameters are counted
- here.
- 7. subl $4, %esp (in this example the number of bytes needed
- is 4)
- 8. The implementation code is generated here.
- 9. leave
- 10. ret
- Description
- This template describes implementation of a function. The number
- of bytes needed for the parameters is calculated and subtracted from
- the esp register to allocate space on the stack.
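- A hypothetical emitter for this prologue/epilogue pattern (again our own sketch, with the byte count supplied by the caller) might read:

```c
#include <stdio.h>

/* Hypothetical sketch of the function implementation template;
 * localBytes is the stack space needed, as counted by the
 * code generator. */
static void emit_function(char *buf, size_t size,
                          const char *name, int localBytes)
{
    snprintf(buf, size,
             ".globl %s\n"
             ".type %s, @function\n"
             "%s:\n"
             "\tpushl %%ebp\n"          /* save caller's frame pointer */
             "\tmovl %%esp, %%ebp\n"    /* set up our own frame */
             "\tsubl $%d, %%esp\n"      /* reserve stack space */
             "\t# ... implementation ...\n"
             "\tleave\n"                /* restore esp and ebp */
             "\tret\n",
             name, name, name, localBytes);
}
```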
- Identifier
- Inger
- identifier
- Example
- i
- Assembler
- For a global variable:
- 1. movl i, %eax (in this example the name of the identifier is i)
- For a local variable:
- 1. movl %ebp, %ecx
- 2. The location on the stack is determined
- 3. addl $4, %ecx (in this example the stack offset is 4)
- 4. movl (%ecx), %eax
- Description
- This template describes the use of a variable. When a global variable
- is used it is easy to generate the assembly because we can just use
- the name of the identifier. For locals its position on the stack has to
- be determined.
- Assignment
- Inger
- identifier = expr;
- Example
- i = 12;
- Assembler
- For a global variable:
- 1. The expression is evaluated and stored in eax .
- 2. movl %eax, i (in this example the name of the identifier is i)
- For a local variable:
- 1. The expression is evaluated and stored in eax .
- 2. The location on the stack is determined
- 3. movl %eax, 4(%ebp) (in this example the offset on the stack is
- 4)
- Description
- This template describes an assignment of a variable. Global and
- local variables must be handled differently.
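- The distinction is easy to capture in code. In this hypothetical sketch (our own, not the Inger back-end), the expression's value is assumed to be in eax; a global is addressed by name, a local by its ebp-relative offset:

```c
#include <stdio.h>

/* Hypothetical sketch of the assignment template. */
static void emit_assign(char *buf, size_t size,
                        int isGlobal, const char *name, int offset)
{
    if (isGlobal)
        snprintf(buf, size, "\tmovl %%eax, %s\n", name);       /* by name */
    else
        snprintf(buf, size, "\tmovl %%eax, %d(%%ebp)\n", offset); /* by offset */
}
```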
- Global Variable Declaration
- Inger
- type identifier = initializer ;
- Example
- int i = 5;
- Assembler
- For a global variable:
- 1. .data
- 2. .globl i (in this example the name of the identifier is i)
- 3. .type i, @object
- 4. .size i, 4 (in this example the type is 4 bytes in size)
- 5. i:
- 6. .long 5 (in this example the initializer is 5)
- Description
- This template describes the declaration of a global variable. When
- no initializer is specified, the variable is initialized to zero.
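- This template, too, is a straightforward string pattern. The sketch below is our own illustration; `size` is the type's size in bytes and `init` the initializer value (zero when none was given, as described above):

```c
#include <stdio.h>

/* Hypothetical sketch of the global variable declaration template. */
static void emit_global(char *buf, size_t bufsize,
                        const char *name, int size, long init)
{
    snprintf(buf, bufsize,
             ".data\n"
             ".globl %s\n"
             ".type %s, @object\n"
             ".size %s, %d\n"
             "%s:\n"
             "\t.long %ld\n",   /* the initializer value */
             name, name, name, size, name, init);
}
```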
- Equal
- Inger
- expr == expr
- Example
- i == 3
- Assembler
- 1. The left expression is evaluated and stored in eax .
- 2. movl %eax, %ebx
- 3. The right expression is evaluated and stored in eax .
- 4. cmpl %eax, %ebx
- 5. movl $0, %ebx
- 6. movl $1, %ecx
- 7. cmovne %ebx, %eax
- 8. cmove %ecx, %eax
- Description
- This template describes the == operator. The two expressions are
- evaluated and the results are compared. When the results are the
- same, 1 is loaded in eax . When the results are not the same, 0 is
- loaded in eax .
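- The same cmpl/cmov pattern returns for every comparison template that follows; only the two conditional-move suffixes change. A generator might therefore share one routine, as in this hypothetical sketch (our own, with `cc` being the condition-code suffix: "e" for ==, "ne" for !=, "l" for <, and so on):

```c
#include <stdio.h>

/* Hypothetical shared routine for the comparison templates. The
 * left result is assumed to be in ebx, the right result in eax. */
static void emit_compare(char *buf, size_t size, const char *cc)
{
    snprintf(buf, size,
             "\tcmpl %%eax, %%ebx\n"
             "\tmovl $0, %%ebx\n"        /* value when condition fails */
             "\tmovl $1, %%ecx\n"        /* value when condition holds */
             "\tcmovn%s %%ebx, %%eax\n"  /* e.g. cmovne for == */
             "\tcmov%s %%ecx, %%eax\n",  /* e.g. cmove for == */
             cc, cc);
}
```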
- Not Equal
- Inger
- expr != expr
- Example
- i != 5
- Assembler
- 1. The left expression is evaluated and stored in eax .
- 2. movl %eax, %ebx
- 3. The right expression is evaluated and stored in eax .
- 4. cmpl %eax, %ebx
- 5. movl $0, %ebx
- 6. movl $1, %ecx
- 7. cmove %ebx, %eax
- 8. cmovne %ecx, %eax
- Description
- This template describes the != operator. The two expressions are
- evaluated and the results are compared. When the results are not
- the same, 1 is loaded in eax . When the results are the same, 0 is
- loaded in eax .
- Less
- Inger
- expr < expr
- Example
- i < 18
- Assembler
- 1. The left expression is evaluated and stored in eax .
- 2. movl %eax, %ebx
- 3. The right expression is evaluated and stored in eax .
- 4. cmpl %eax, %ebx
- 5. movl $0, %ebx
- 6. movl $1, %ecx
- 7. cmovnl %ebx, %eax
- 8. cmovl %ecx, %eax
- Description
- This template describes the < operator. The two expressions are
- evaluated and the results are compared. When the left result is less
- than the right result, 1 is loaded in eax . When the left result is not
- smaller than the right result, 0 is loaded in eax .
- Less Or Equal
- Inger
- expr <= expr
- Example
- i <= 44
- Assembler
- 1. The left expression is evaluated and stored in eax .
- 2. movl %eax, %ebx
- 3. The right expression is evaluated and stored in eax .
- 4. cmpl %eax, %ebx
- 5. movl $0, %ebx
- 6. movl $1, %ecx
- 7. cmovnle %ebx, %eax
- 8. cmovle %ecx, %eax
- Description
- This template describes the ≤ operator. The two expressions are
- evaluated and the results are compared. When the left result is less
- than or equals the right result, 1 is loaded in eax . When the left
- result is not smaller than and does not equal the right result, 0 is
- loaded in eax .
- Greater
- Inger
- expr > expr
- Example
- i > 57
- Assembler
- 1. The left expression is evaluated and stored in eax .
- 2. movl %eax, %ebx
- 3. The right expression is evaluated and stored in eax .
- 4. cmpl %eax, %ebx
- 5. movl $0, %ebx
- 6. movl $1, %ecx
- 7. cmovng %ebx, %eax
- 8. cmovg %ecx, %eax
- Description
- This template describes the > operator. The two expressions are
- evaluated and the results are compared. When the left result is
- greater than the right result, 1 is loaded in eax . When the left result
- is not greater than the right result, 0 is loaded in eax .
- Greater Or Equal
- Inger
- expr >= expr
- Example
- i >= 26
- Assembler
- 1. The left expression is evaluated and stored in eax .
- 2. movl %eax, %ebx
- 3. The right expression is evaluated and stored in eax .
- 4. cmpl %eax, %ebx
- 5. movl $0, %ebx
- 6. movl $1, %ecx
- 7. cmovnge %ebx, %eax
- 8. cmovge %ecx, %eax
- Description
- This template describes the ≥ operator. The two expressions are
- evaluated and the results are compared. When the left result is
- greater than or equals the right result, 1 is loaded in eax . When the
- left result is not greater than and does not equal the right result, 0
- is loaded in eax .
- Chapter 14
- Bootstrapping
- A subject which has not been discussed so far is bootstrapping. Bootstrapping
- means building a compiler in its own language; for the Inger language, this
- would mean that we build the Inger compiler in Inger as well. A practical
- treatment is beyond the scope of this book, so below we explain the process
- in theory.
- Developing a new language is mostly done to improve some aspects compared
- to other, existing languages. What we would prefer is to compile the compiler
- in its own language, but how can that be done when there is no compiler to
- compile the compiler with? To visualize this problem, we use so-called
- T-diagrams to illustrate the process of bootstrapping. To get familiar with
- T-diagrams, we present a few examples.
- Example 14.1 (T-Diagrams)
- Figure 14.1: Program P can work on machine with language M . I = input, O =
- output.
- Figure 14.2: Interpreter for language T , is able to work on a machine with
- language M .
- Figure 14.3: Compiler for language T1 to language T2 , is able to work on a
- machine with language M .
- Figure 14.4: Machine for language M .
- The bootstrapping problem can be resolved in the following way:
- 1. Build two versions of the compiler. One version is the optimal-compiler
- and the other is the bootstrap-compiler. The optimal-compiler for the new
- language T, complete with all optimizations, is written in language T
- itself. The bootstrap-compiler is written in an existing language m.
- Because this compiler is not optimized and therefore slower in use, the
- m is written in lowercase instead of uppercase.
- 2. Translate the optimal-compiler with the bootstrap-compiler. The result is
- the optimal-compiler, which can run on the target machine M. However,
- this version of the compiler is not yet optimized (it is slow and uses a
- lot of memory). We call this the temporary-compiler.
- 3. Now we compile the optimal-compiler again, only this time we use the
- temporary-compiler to compile it with. The result is the final, optimal
- compiler, able to run on machine M. This compiler is fast and produces
- optimized output.
- 4. The result: it is a long way to a bootstrapped compiler, but remember,
- this is the ultimate compiler!
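- The three steps can also be modelled in code. In the toy model below (our own illustration, not part of the Inger sources), a compiler is a T-diagram triple of source, target, and implementation language, and compiling compiler A with compiler B re-implements A in B's target language:

```c
#include <assert.h>
#include <string.h>

/* A T-diagram as a triple: compiles language S to language T,
 * and is itself written in language I. */
struct tdiag { const char *S, *T, *I; };

/* Compiling compiler a with compiler b: b must accept a's
 * implementation language; the result is a re-implemented in
 * b's target language. */
static struct tdiag translate(struct tdiag a, struct tdiag b)
{
    assert(strcmp(a.I, b.S) == 0);     /* b must compile a's language */
    struct tdiag r = { a.S, a.T, b.T };
    return r;
}
```

- With the optimal-compiler {T, M, T} and a runnable bootstrap-compiler {T, M, M} (obtained by compiling the version written in m), one translation yields the temporary-compiler {T, M, M}; translating the optimal-compiler again with the temporary-compiler yields the final compiler, now produced by an optimizing compiler.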
- Figure 14.5: Program P runs on machine M .
- Figure 14.6: The two compilers.
- Figure 14.7: temporary-compiler.
- Figure 14.8: compile process.
- Figure 14.9: final-compiler.
- Chapter 15
- Conclusion
- All parts of building a compiler, from the setup of a language to code
- generation, have now been discussed. Using this book as a reference, it
- should be possible to build your own compiler.
- The compiler we build in this book is not innovative. Many compilers of
- this type (for imperative languages) already exist; only the language
- differs. Examples include compilers for C or Pascal.
- Because the Inger compiler is a low-level compiler, it is extremely suitable
- for systems programming (building operating systems). The same applies to
- game programming and programming command line applications (such as UNIX
- filter tools).
- We hope you will be able to put the theory and practical examples we
- described in this book to use, in order to build your own compiler. It is
- up to you now!
- Appendix A
- Requirements
- A.1 Introduction
- This chapter specifies the software necessary to either use (run) Inger or to
- develop for Inger. The version numbers supplied in this text are the version
- numbers of software packages we used. You may well be able to use older
- versions of this software, but it is not guaranteed that this will work. You can
- always (except in rare cases) use newer versions of the software discussed below.
- A.2 Running Inger
- Inger was designed for the Linux operating system. It is perfectly possible to run
- Inger on other platforms like Windows, and even do development work for Inger
- on non-Linux platforms. However, this section discusses the software required
- to run Inger on Linux.
- The Linux distribution we used is RedHat, 1 but other advanced Linux dis-
- tributions like SuSE 2 or Mandrake 3 will do fine.
- The most elementary package you need to run Inger is naturally Inger itself.
- It can be downloaded from its repository at Source Forge. 4 There are two pack-
- ages available: the user package, which contains only the compiler binary and
- the user manual, and the development package, which contains the compiler
- source as well. Be sure to download the user version, not the developer version.
- As the Inger compiler compiles to GNU AT&T assembly code, the GNU
- assembler as is required to convert the assembly code to object code. The GNU
- assembler is part of the binutils package. 5 You may use any other assembler,
- provided it supports the AT&T assembly syntax; as is the only assembler we
- are currently aware of that supports it. A linker is also required — we use
- the GNU linker (which is also part of the binutils package).
- Some of the code for Inger is generated using scripts written in the Perl
- scripting language. You will need the perl 6 interpreter to execute these scripts.
- 1 RedHat Linux 7.2, http://www.redhat.com/apps/download/
- 2 SuSE Linux 8.0, http://www.suse.com/us/private/download/index.html
- 3 Mandrake Linux 8.0, http://www.mandrake.com
- 4 Inger 0.x, http://www.sourceforge.net/projects/inger
- 5 Binutils 2.11.90.0.8, http://www.gnu.org/directory/GNU/binutils.html
- 6 Perl 6, http://www.perl.com
- If you use a Windows port of Inger, you can also use the GNU ports of as
- and ld that come with DJGPP. 7 DJGPP is a free port of (most of) the GNU
- tools.
- It can be advantageous to be able to view this documentation in digital form
- (as a Portable Document File), which is possible with the Acrobat Reader. 8 The
- Inger website may also offer this documentation in other forms, such as HTML.
- Editing Inger source code in Linux can be done with the free editor vi , which
- is included with virtually all Linux distributions. You can use any editor you
- want, though. An Inger syntax highlighting template for Ultra Edit is available
- from the Inger archive at Source Forge.
- If a Windows binary of the Inger compiler is not available or not usable, and
- you need to run Inger on a Windows platform, you may be able to use the Linux
- emulator for the Windows platform, Cygwin , 9 to execute the Linux binary.
- A.3 Inger Development
- For development of the Inger language, rather than development with the Inger
- language, some additional software packages are required. For development
- purposes we strongly recommend that you work on a Linux platform, since we
- cannot guarantee that all development tools are available on Windows platforms
- or work the same way as they do on Linux.
- The Inger binary is built from source using automake 10 and autoconf 11 , both
- of which are free GNU software. These packages allow the developer to generate
- makefiles that target the user platform, i.e. use available C compiler and lexical
- scanner generator versions, and warn if no suitable software is available. To
- execute the generated makefiles, GNU make is also required. Most Linux
- installations should have this software already installed.
- C sources are compiled using the GNU Compiler Collection (gcc). 12 We used
- the lexical analyzer generator GNU flex 13 to generate a lexical scanner.
- All Inger code is stored in a Concurrent Versioning System repository on
- a server at Source Forge, which may be accessed using the cvs package. 14
- Note that you must be registered as an Inger developer to be able to change
- the contents of the CVS repository. Registration can be done through Source
- Forge.
- All documentation was written using the LaTeX2e typesetting package, 15
- which is also available for Windows as the MikTeX 16 system. Editors that
- come in handy when working with TeX sources are Ultra Edit, 17 which supports
- TeX syntax highlighting, and TexnicCenter, 18 which is a full-fledged TeX
- editor with
- 7 DJGPP 2.03, http://www.delorie.com or http://www.simtel.net/pub/djgpp
- 8 Adobe Acrobat 5.0, http://www.adobe.com/products/acrobat/readstep.html
- 9 Cygwin 1.11.1p1, http://www.cygwin.com
- 10 Automake 1.4-p5, http://www.gnu.org/software/automake
- 11 Autoconf 2.13, http://www.gnu.org/software/autoconf/autoconf.html
- 12 GCC 2.96, http://www.gnu.org/software/gcc/gcc.html
- 13 Flex 2.5.4, http://www.gnu.org/software/flex
- 14 CVS 1.11.1p1 , http://www.cvshome.org
- 15 LaTeX2e, http://www.latex-project.org
- 16 MikTeX 2.2, http://www.miktex.org
- 17 Ultra Edit 9.20, http://www.ultraedit.com
- 18 TexnicCenter, http://www.toolscenter.org/products/texniccenter
- many options (although no direct visual feedback — it is a what you see is what
- you mean (WYSIWYM) tool).
- The Inger development package comes with a project definition file for
- KDevelop, an open source clone of Microsoft Visual Studio. If you have a
- Linux distribution that has the X window system with KDE (K Desktop Envi-
- ronment) installed, then you can do development work for Inger in a graphical
- environment.
- A.4 Required Development Skills
- Development on Inger requires the following skills:
- - A working knowledge of the C programming language;
- - A basic knowlege of the Perl scripting language;
- - Experience with GNU assembler (specifically, the AT&T assembly syn-
- tax).
- The rest of the skills needed, including working with the lexical analyzer
- generator flex and writing tree data structures can be acquired from this book.
- Use the bibliography at the end of this chapter to find additional literature that
- will help you master all the tools discussed in the preceding sections.
- Bibliography
- [1] M. Bar: Open Source Development with CVS, Coriolis Group, 2nd edition,
- 2001
- [2] D. Elsner: Using As: The Gnu Assembler, iUniverse.com, 2001
- [3] M. Goossens: The Latex Companion, Addison-Wesley Publishing, 1993
- [4] A. Griffith: GCC: the Complete Reference, McGraw-Hill Osborne Media,
- 1st edition, 2002
- [5] E. Harlow: Developing Linux Applications, New Riders Publishing, 1999
- [6] J. Levine: Lex & Yacc, O’Reilly & Associates, 1992
- [7] M. Kosta Loukides: Programming with GNU Software, O’Reilly & Associates,
- 1996
- [8] C. Negus: Red Hat Linux 8 Bible, John Wiley & Sons, 2002
- [9] T. Oetiker: The Not So Short Introduction to LaTeX2e, version 3.16, 2000
- [10] A. Oram: Managing Projects with Make, O’Reilly & Associates, 2nd edition,
- 1991
- [11] G. Purdy: CVS Pocket Reference, O’Reilly & Associates, 2000
- [12] R. Stallman: Debugging with GDB: The GNU Source-Level Debugger, Free
- Software Foundation, 2002
- [13] G.V. Vaughan: GNU Autoconf, Automake, and Libtool, New Riders
- Publishing, 1st edition, 2000
- [14] L. Wall: Programming Perl, O’Reilly & Associates, 3rd edition, 2000
- [15] M. Welsch: Running Linux, O’Reilly & Associates, 3rd edition, 1999
- Appendix B
- Software Packages
- This appendix lists the locations of software packages that are required or
- recommended in order to use Inger or do development work for Inger. Note that
- the locations (URLs) of these packages are subject to change and may not be
- correct.
- Package Description and location
- RedHat Linux 7.2 Operating system
- http://www.redhat.com
- SuSE Linux 8.0 Operating system
- http://www.suse.com
- Mandrake 8.0 Operating system
- http://www.mandrake.com
- GNU Assembler 2.11.90.0.8 AT&T syntax assembler
- http://www.gnu.org/directory/GNU/binutils.html
- GNU Linker 2.11.90.0.8 COFF file linker
- http://www.gnu.org/directory/GNU/binutils.html
- DJGPP 2.03 GNU tools port
- http://www.delorie.com
- Cygwin 1.2 GNU Linux emulator for Windows
- http://www.cygwin.com
- Package Description and location
- CVS 1.11.1p1 Concurrent Versioning System
- http://www.cvshome.org
- Automake 1.4-p5 Makefile generator
- http://www.gnu.org/software/automake
- Autoconf 2.13 Makefile generator support
- http://www.gnu.org/software/autoconf/autoconf.html
- Make 2.11.90.0.8 Makefile processor
- http://www.gnu.org/software/make/make.html
- Flex 2.5.4 Lexical analyzer generator
- http://www.gnu.org/software/flex
- LaTeX2e Typesetting system
- http://www.latex-project.org
- MikTeX 2.2 Typesetting system
- http://www.miktex.org
- TexnicCenter TeX editor
- http://www.toolscenter.org/products/texniccenter
- Ultra Edit 9.20 TeX editor
- http://www.ultraedit.com
- http://www.ultraedit.com
- Perl 6 Scripting language
- http://www.perl.com
- Appendix C
- Summary of Operations
- C.1 Operator Precedence Table
- Operator Priority Associativity Description
- () 1 L function application
- [] 1 L array indexing
- ! 2 R logical negation
- - 2 R unary minus
- + 2 R unary plus
- ~ 3 R bitwise complement
- * 3 R indirection
- & 3 R referencing
- * 4 L multiplication
- / 4 L division
- % 4 L modulus
- + 5 L addition
- - 5 L subtraction
- >> 6 L bitwise shift right
- << 6 L bitwise shift left
- < 7 L less than
- <= 7 L less than or equal
- > 7 L greater than
- >= 7 L greater than or equal
- == 8 L equality
- != 8 L inequality
- & 9 L bitwise and
- ^ 10 L bitwise xor
- | 11 L bitwise or
- && 12 L logical and
- || 12 L logical or
- ?: 13 R ternary if
- = 14 R assignment
- C.2 Operand and Result Types
- Operator Operation Operands Result
- () function application any any
- [] array indexing int none
- ! logical negation bool bool
- - unary minus int int
- + unary plus int int
- ~ bitwise complement int , char int , char
- * indirection any any pointer
- & referencing any pointer any
- * multiplication int , float int , float
- / division int , float int , float
- % modulus int , char int , char
- + addition int , float , char int , float , char
- - subtraction int , float , char int , float , char
- >> bitwise shift right int , char int , char
- << bitwise shift left int , char int , char
- < less than int , float , char int , float , char
- <= less than or equal int , float , char int , float , char
- > greater than int , float , char int , float , char
- >= greater than or equal int , float , char int , float , char
- == equality int , float , char int , float , char
- != inequality int , float , char int , float , char
- & bitwise and int , char int , char
- ^ bitwise xor int , char int , char
- | bitwise or int , char int , char
- && logical and bool bool
- || logical or bool bool
- ?: ternary if bool (2x) any
- = assignment any any
- Appendix D
- Backus-Naur Form
- module: module <identifier> ; globals.
- globals: ε.
- globals: global globals.
- globals: extern global globals.
- global: function.
- global: declaration.
- function: functionheader functionrest.
- functionheader: modifiers <identifier> : paramlist
- -> returntype.
- functionrest: ;.
- functionrest: block.
- modifiers: ε.
- modifiers: start.
- paramlist: void.
- paramlist: paramblock moreparamblocks.
- moreparamblocks: ε.
- moreparamblocks: ; paramblock moreparamblocks.
- paramblock: type param moreparams.
- moreparams: ε.
- moreparams: , param moreparams.
- param: reference <identifier> dimensionblock.
- returntype: type reference dimensionblock.
- reference: ε.
- reference: * reference.
- dimensionblock: ε.
- dimensionblock: [ ] dimensionblock.
- block: { code }.
- code: ε.
- code: block code.
- code: statement code.
- statement: label <identifier> ;.
- statement: ;.
- statement: break ;.
- statement: continue ;.
- statement: expression ;.
- statement: declarationblock ;.
- statement: if ( expression ) block elseblock.
- statement: goto <identifier> ;.
- statement: while ( expression ) do block.
- statement: do block while ( expression ) ;.
- statement: switch ( expression ) { switchcases default block }.
- statement: return ( expression ) ;.
- elseblock: ε.
- elseblock: else block.
- switchcases: ε.
- switchcases: case <intliteral> block switchcases.
- declarationblock: type declaration restdeclarations.
- restdeclarations: ε.
- restdeclarations: , declaration restdeclarations.
- declaration: reference <identifier> indexblock
- initializer.
- indexblock: ε.
- indexblock: [ <intliteral> ] indexblock.
- initializer: ε.
- initializer: = expression.
- expression: logicalor restexpression.
- restexpression: ε.
- restexpression: = logicalor restexpression.
- logicalor: logicaland restlogicalor.
- restlogicalor: ε.
- restlogicalor: || logicaland restlogicalor.
- logicaland: bitwiseor restlogicaland.
- restlogicaland: ε.
- restlogicaland: && bitwiseor restlogicaland.
- bitwiseor: bitwisexor restbitwiseor.
- restbitwiseor: ε.
- restbitwiseor: | bitwisexor restbitwiseor.
- bitwisexor: bitwiseand restbitwisexor.
- restbitwisexor: ε.
- restbitwisexor: ^ bitwiseand restbitwisexor.
- bitwiseand: equality restbitwiseand.
- restbitwiseand: ε.
- restbitwiseand: & equality restbitwiseand.
- equality: relation restequality.
- restequality: ε.
- restequality: equalityoperator relation
- restequality.
- equalityoperator: ==.
- equalityoperator: !=.
- relation: shift restrelation.
- restrelation: ε.
- restrelation: relationoperator shift restrelation.
- relationoperator: <.
- relationoperator: <=.
- relationoperator: >.
- relationoperator: >=.
- shift: addition restshift.
- restshift: ε.
- restshift: shiftoperator addition restshift.
- shiftoperator: <<.
- shiftoperator: >>.
- addition: multiplication restaddition.
- restaddition: ε.
- restaddition: additionoperator multiplication
- restaddition.
- additionoperator: +.
- additionoperator: -.
- multiplication: unary3 restmultiplication.
- restmultiplication: ε.
- restmultiplication: multiplicationoperator unary3
- restmultiplication.
- multiplicationoperator: *.
- multiplicationoperator: /.
- multiplicationoperator: %.
- unary3: unary2.
- unary3: unary3operator unary3.
- unary3operator: &.
- unary3operator: *.
- unary3operator: ~.
- unary2: factor.
- unary2: unary2operator unary2.
- unary2operator: +.
- unary2operator: -.
- unary2operator: !.
- factor: <identifier> application.
- factor: immediate.
- factor: ( expression ).
- application: ε.
- application: [ expression ] application.
- application: ( expression moreexpressions ).
- moreexpressions: ε.
- moreexpressions: , expression moreexpressions.
- type: bool.
- type: char.
- type: float.
- type: int.
- type: untyped.
- immediate: <booleanliteral>.
- immediate: <charliteral>.
- immediate: <floatliteral>.
- immediate: <intliteral>.
- immediate: <stringliteral>.
- Appendix E
- Syntax Diagrams
- Figure E.1: Module
- Figure E.2: Function
- Figure E.3: Formal function parameters
- Figure E.4: Data declaration
- Figure E.5: Code block
- Figure E.6: Statement
- Figure E.7: Switch cases
- Figure E.8: Assignment, Logical OR Operators
- Figure E.9: Ternary IF
- Figure E.10: Logical AND and Bitwise OR Operators
- Figure E.11: Bitwise XOR and Bitwise AND Operators
- Figure E.12: Equality Operators
- Figure E.13: Relational Operators
- Figure E.14: Bitwise Shift, Addition and Subtraction Operators
- Figure E.15: Multiplication and Division Operators
- Figure E.16: Unary Operators
- Figure E.17: Factor (Variable, Immediate or Expression)
- Figure E.18: Immediate (Literal) Value
- Figure E.19: Literal identifier
- Figure E.20: Literal integer number
- Figure E.21: Literal float number
- Appendix F
- Inger Lexical Analyzer
- Source
- F.1 tokens.h
- #ifndef TOKENS_H
- #define TOKENS_H
- 
- #include "defs.h"
- /* #include "type.h" */
- #include "tokenvalue.h"
- #include "ast.h"
- 
- /*
-  * MACROS
-  */
- 
- /* Define where a line starts (at position 1). */
- #define LINECOUNTBASE 1
- 
- /* Define the position of a first character of a line. */
- #define CHARPOSBASE 1
- 
- /* Define the block size with which strings are allocated. */
- #define STRING_BLOCK 100
- 
- /*
-  * TYPES
-  */
- 
- /* This enum contains all the keywords and operators
-  * used in the language.
-  */
- enum
- {
-     /* Keywords */
-     KW_BREAK = 1000,        /* "break" keyword */
-     KW_CASE,                /* "case" keyword */
-     KW_CONTINUE,            /* "continue" keyword */
-     KW_DEFAULT,             /* "default" keyword */
-     KW_DO,                  /* "do" keyword */
-     KW_ELSE,                /* "else" keyword */
-     KW_EXTERN,              /* "extern" keyword */
-     KW_GOTO,                /* "goto" keyword */
-     KW_IF,                  /* "if" keyword */
-     KW_LABEL,               /* "label" keyword */
-     KW_MODULE,              /* "module" keyword */
-     KW_RETURN,              /* "return" keyword */
-     KW_START,               /* "start" keyword */
-     KW_SWITCH,              /* "switch" keyword */
-     KW_WHILE,               /* "while" keyword */
- 
-     /* Type identifiers */
-     KW_BOOL,                /* "bool" identifier */
-     KW_CHAR,                /* "char" identifier */
-     KW_FLOAT,               /* "float" identifier */
-     KW_INT,                 /* "int" identifier */
-     KW_UNTYPED,             /* "untyped" identifier */
-     KW_VOID,                /* "void" identifier */
- 
-     /* Variable lexer tokens */
-     LIT_BOOL,               /* bool constant */
-     LIT_CHAR,               /* character constant */
-     LIT_FLOAT,              /* floating point constant */
-     LIT_INT,                /* integer constant */
-     LIT_STRING,             /* string constant */
-     IDENTIFIER,             /* identifier */
- 
-     /* Operators */
-     OP_ADD,                 /* "+" */
-     OP_ASSIGN,              /* "=" */
-     OP_BITWISE_AND,         /* "&" */
-     OP_BITWISE_COMPLEMENT,  /* "~" */
-     OP_BITWISE_LSHIFT,      /* "<<" */
-     OP_BITWISE_OR,          /* "|" */
-     OP_BITWISE_RSHIFT,      /* ">>" */
-     OP_BITWISE_XOR,         /* "^" */
-     OP_DIVIDE,              /* "/" */
-     OP_EQUAL,               /* "==" */
-     OP_GREATER,             /* ">" */
-     OP_GREATEREQUAL,        /* ">=" */
-     OP_LESS,                /* "<" */
-     OP_LESSEQUAL,           /* "<=" */
-     OP_LOGICAL_AND,         /* "&&" */
-     OP_LOGICAL_OR,          /* "||" */
-     OP_MODULUS,             /* "%" */
-     OP_MULTIPLY,            /* "*" */
-     OP_NOT,                 /* "!" */
-     OP_NOTEQUAL,            /* "!=" */
-     OP_SUBTRACT,            /* "-" */
-     OP_TERNARY_IF,          /* "?" */
- 
-     /* Delimiters */
-     ARROW,                  /* "->" */
-     LBRACE,                 /* "{" */
-     RBRACE,                 /* "}" */
-     LBRACKET,               /* "[" */
-     RBRACKET,               /* "]" */
-     COLON,                  /* ":" */
-     COMMA,                  /* "," */
-     LPAREN,                 /* "(" */
-     RPAREN,                 /* ")" */
-     SEMICOLON               /* ";" */
- }
- tokens;
- 
- /*
-  * FUNCTION DECLARATIONS
-  */
- 
- TreeNode *Parse();
- 
- /*
-  * GLOBALS
-  */
- 
- extern Tokenvalue tokenvalue;
- 
- #endif
- F.2 lexer.l
- %{
- /* Include stdlib for string to number conversion routines. */
- #include <stdlib.h>
- /* Include errno for errno system variable. */
- #include <errno.h>
- /* Include string.h to use strtoul(). */
- #include <string.h>
- /* Include assert.h for assert macro. */
- #include <assert.h>
- /* Include global definitions. */
- #include "defs.h"
- /* The token #defines are defined in tokens.h. */
- #include "tokens.h"
- /* Include error/warning reporting module. */
- #include "errors.h"
- /* Include options.h to access command line options. */
- #include "options.h"
- /*
- *
- * MACROS
- *
- */
- #define INCPOS charPos += yyleng;
- /*
- *
- * FORWARD DECLARATIONS
- *
- */
- char SlashToChar( char str[] );
- void AddToString( char c );
- /*
- *
- * GLOBALS
- *
- */
- /*
- * Tokenvalue (declared in tokens.h) is used to pass
- * literal token values to the parser.
- */
- Tokenvalue tokenvalue;
- /*
- * lineCount keeps track of the current line number
- * in the source input file.
- */
- int lineCount;
- /*
- * charPos keeps track of the current character
- * position on the current source input line.
- */
- int charPos;
- /*
- * Counters used for string reading.
- */
- static int stringSize, stringPos;
- /*
- * commentsLevel keeps track of the current
- * comment nesting level, in order to ignore nested
- * comments properly.
- */
- static int commentsLevel = 0;
- %}
- /*
- *
- * LEXER STATES
- *
- */
- /* Exclusive state in which the lexer ignores all input
- until a nested comment ends. */
- %x STATE_COMMENTS
- /* Exclusive state in which the lexer returns all input
- until a string terminates with a double quote. */
- %x STATE_STRING
- %pointer
- /*
- *
- * REGULAR EXPRESSIONS
- *
- */
- %%
- /*
- *
- * KEYWORDS
- *
- */
- start { INCPOS; return KW_START; }
- bool { INCPOS; return KW_BOOL; }
- char { INCPOS; return KW_CHAR; }
- float { INCPOS; return KW_FLOAT; }
- int { INCPOS; return KW_INT; }
- untyped { INCPOS; return KW_UNTYPED; }
- void { INCPOS; return KW_VOID; }
- break { INCPOS; return KW_BREAK; }
- case { INCPOS; return KW_CASE; }
- default { INCPOS; return KW_DEFAULT; }
- do { INCPOS; return KW_DO; }
- else { INCPOS; return KW_ELSE; }
- extern { INCPOS; return KW_EXTERN; }
- goto_considered_harmful { INCPOS; return KW_GOTO; }
- if { INCPOS; return KW_IF; }
- label { INCPOS; return KW_LABEL; }
- module { INCPOS; return KW_MODULE; }
- return { INCPOS; return KW_RETURN; }
- switch { INCPOS; return KW_SWITCH; }
- while { INCPOS; return KW_WHILE; }
- /*
- *
- * OPERATORS
- *
- */
- "->" { INCPOS; return ARROW; }
- "==" { INCPOS; return OP_EQUAL; }
- "!=" { INCPOS; return OP_NOTEQUAL; }
- "&&" { INCPOS; return OP_LOGICAL_AND; }
- "||" { INCPOS; return OP_LOGICAL_OR; }
- ">=" { INCPOS; return OP_GREATEREQUAL; }
- "<=" { INCPOS; return OP_LESSEQUAL; }
- "<<" { INCPOS; return OP_BITWISE_LSHIFT; }
- ">>" { INCPOS; return OP_BITWISE_RSHIFT; }
- "+" { INCPOS; return OP_ADD; }
- "-" { INCPOS; return OP_SUBTRACT; }
- "*" { INCPOS; return OP_MULTIPLY; }
- "/" { INCPOS; return OP_DIVIDE; }
- "!" { INCPOS; return OP_NOT; }
- "~" { INCPOS; return OP_BITWISE_COMPLEMENT; }
- "%" { INCPOS; return OP_MODULUS; }
- "=" { INCPOS; return OP_ASSIGN; }
- ">" { INCPOS; return OP_GREATER; }
- "<" { INCPOS; return OP_LESS; }
- "&" { INCPOS; return OP_BITWISE_AND; }
- "|" { INCPOS; return OP_BITWISE_OR; }
- "^" { INCPOS; return OP_BITWISE_XOR; }
- "?" { INCPOS; return OP_TERNARY_IF; }
- /*
- *
- * DELIMITERS
- *
- */
- "(" { INCPOS; return LPAREN; }
- ")" { INCPOS; return RPAREN; }
- "[" { INCPOS; return LBRACKET; }
- "]" { INCPOS; return RBRACKET; }
- ":" { INCPOS; return COLON; }
- ";" { INCPOS; return SEMICOLON; }
- "{" { INCPOS; return LBRACE; }
- "}" { INCPOS; return RBRACE; }
- "," { INCPOS; return COMMA; }
- /*
- *
- * VALUE TOKENS
- *
- */
- true { /* boolean constant */
- INCPOS;
- tokenvalue.boolvalue = TRUE;
- return( LIT_BOOL );
- }
- false { /* boolean constant */
- INCPOS;
- tokenvalue.boolvalue = FALSE;
- return( LIT_BOOL );
- }
- [0-9]+ { /* decimal integer constant */
- INCPOS;
- tokenvalue.uintvalue = strtoul( yytext, NULL, 10 );
- if( tokenvalue.uintvalue == -1 )
- {
- tokenvalue.uintvalue = 0;
- AddPosWarning( "integer literal value "
- "too large. Zero used",
- lineCount, charPos );
- }
- return( LIT_INT );
- }
- "0x"[0-9A-Fa-f]+ {
- /* hexadecimal integer constant */
- INCPOS;
- tokenvalue.uintvalue = strtoul( yytext, NULL, 16 );
- if( tokenvalue.uintvalue == -1 )
- {
- tokenvalue.uintvalue = 0;
- AddPosWarning( "hexadecimal integer literal value "
- "too large. Zero used",
- lineCount, charPos );
- }
- return( LIT_INT );
- }
- [0-1]+[Bb] { /* binary integer constant */
- INCPOS;
- tokenvalue.uintvalue = strtoul( yytext, NULL, 2 );
- if( tokenvalue.uintvalue == -1 )
- {
- tokenvalue.uintvalue = 0;
- AddPosWarning( "binary integer literal value too "
- "large. Zero used",
- lineCount, charPos );
- }
- return( LIT_INT );
- }
- [_A-Za-z]+[_A-Za-z0-9]* {
- /* identifier */
- INCPOS;
- tokenvalue.identifier = strdup( yytext );
- return( IDENTIFIER );
- }
- [0-9]*\.[0-9]+([Ee][+-]?[0-9]+)? {
- /* floating point number */
- INCPOS;
- if( sscanf( yytext, "%f",
- &tokenvalue.floatvalue ) == 0 )
- {
- tokenvalue.floatvalue = 0;
- AddPosWarning( "floating point literal value too "
- "large. Zero used",
- lineCount, charPos );
- }
- return( LIT_FLOAT );
- }
- /*
- *
- * CHARACTERS
- *
- */
- \'\\[\'\"abfnrtv]\' {
- INCPOS;
- yytext[ strlen( yytext ) - 1 ] = '\0';
- tokenvalue.charvalue =
- SlashToChar( yytext + 1 );
- return( LIT_CHAR );
- }
- \'\\B[0-1][0-1][0-1][0-1][0-1][0-1][0-1][0-1]\' {
- /* \B escape sequence. */
- INCPOS;
- yytext[ strlen( yytext ) - 1 ] = '\0';
- tokenvalue.charvalue =
- SlashToChar( yytext + 1 );
- return( LIT_CHAR );
- }
- \'\\o[0-7][0-7][0-7]\' {
- /* \o escape sequence. */
- INCPOS;
- yytext[ strlen( yytext ) - 1 ] = '\0';
- tokenvalue.charvalue =
- SlashToChar( yytext + 1 );
- return( LIT_CHAR );
- }
- \'\\x[0-9A-Fa-f][0-9A-Fa-f]\' {
- /* \x escape sequence. */
- INCPOS;
- yytext[ strlen( yytext ) - 1 ] = '\0';
- tokenvalue.charvalue =
- SlashToChar( yytext + 1 );
- return( LIT_CHAR );
- }
- \'.\' {
- /* Single character. */
- INCPOS;
- tokenvalue.charvalue = yytext[1];
- return( LIT_CHAR );
- }
- /*
- *
- * STRINGS
- *
- */
- \" { INCPOS;
- tokenvalue.stringvalue =
- (char*) malloc( STRING_BLOCK );
- memset( tokenvalue.stringvalue,
- 0, STRING_BLOCK );
- stringSize = STRING_BLOCK;
- stringPos = 0;
- BEGIN STATE_STRING; /* begin of string */
- }
- <STATE_STRING>\" {
- INCPOS;
- BEGIN 0;
- /* Do not include terminating " in string */
- return( LIT_STRING ); /* end of string */
- }
- <STATE_STRING>\n {
- INCPOS;
- AddPosWarning( "strings cannot span multiple "
- "lines", lineCount, charPos );
- AddToString( '\n' );
- }
- <STATE_STRING>\\[\'\"abfnrtv] {
- /* Escape sequences in string. */
- INCPOS;
- AddToString( SlashToChar( yytext ) );
- }
- <STATE_STRING>\\B[0-1][0-1][0-1][0-1][0-1][0-1][0-1][0-1] {
- /* \B escape sequence. */
- INCPOS;
- AddToString( SlashToChar( yytext ) );
- }
- <STATE_STRING>\\o[0-7][0-7][0-7] {
- /* \o escape sequence. */
- INCPOS;
- AddToString( SlashToChar( yytext ) );
- }
- <STATE_STRING>\\x[0-9A-Fa-f][0-9A-Fa-f] {
- /* \x escape sequence. */
- INCPOS;
- AddToString( SlashToChar( yytext ) );
- }
- <STATE_STRING>. {
- /* Any other character */
- INCPOS;
- AddToString( yytext[0] );
- }
- /*
- *
- * LINE COMMENTS
- *
- */
- "//"[^\n]* { ++lineCount; /* ignore comment lines */ }
- /*
- *
- * BLOCK COMMENTS
- *
- */
- "/*" { INCPOS;
- ++commentsLevel;
- BEGIN STATE_COMMENTS;
- /* start of comments */
- }
- <STATE_COMMENTS>"/*" {
- INCPOS;
- ++commentsLevel;
- /* begin of deeper nested
- comments */
- }
- <STATE_COMMENTS>. { INCPOS; /* ignore all characters */ }
- <STATE_COMMENTS>\n {
- charPos = 0;
- ++lineCount; /* ignore newlines */
- }
- <STATE_COMMENTS>"*/" {
- INCPOS;
- if( --commentsLevel == 0 )
- BEGIN 0; /* end of comments */
- }
- /*
- *
- * WHITESPACE
- *
- */
- [\t ] { ++charPos; /* ignore whitespace */ }
- \n { ++lineCount;
- charPos = 0; /* ignore newlines */
- }
- /* unmatched character */
- . { INCPOS; return yytext[0]; }
- %%
- /*
- *
- * ADDITIONAL VERBATIM C CODE
- *
- */
- /*
- * Convert slashed character (e.g. \n, \r etc.) to a
- * char value.
- * The argument is a string that starts with a backslash,
- * e.g. \x2e, \o056, \n, \B11011101
- *
- * Pre: (for \x, \B and \o): strlen(str) is large
- * enough. The regexps in the lexer take care
- * of this.
- */
- char SlashToChar( char str[] )
- {
- static char strPart[20];
- memset( strPart, 0, 20 );
- switch( str[1] )
- {
- case '\\':
- return( '\\' );
- case '\"':
- return( '\"' );
- case '\'':
- return( '\'' );
- case 'a':
- return( '\a' );
- case 'b':
- return( '\b' );
- case 'B':
- strncpy( strPart, str + 2, 8 );
- return( strtoul( strPart, NULL, 2 ) );
- case 'f':
- return( '\f' );
- case 'n':
- return( '\n' );
- case 'o':
- strncpy( strPart, str + 2, 3 );
- return( strtoul( strPart, NULL, 8 ) );
- case 't':
- return( '\t' );
- case 'r':
- return( '\r' );
- case 'v':
- return( '\v' );
- case 'x':
- strncpy( strPart, str + 2, 2 );
- return( strtoul( strPart, NULL, 16 ) );
- default:
- /* Control should never get here! */
- assert( 0 );
- }
- }
- /*
- * For string reading (which happens on a
- * character-by-character basis), add a character to
- * the global lexer string 'tokenvalue.stringvalue'.
- */
- void AddToString( char c )
- {
- if( tokenvalue.stringvalue == NULL )
- {
- /* Some previous realloc() already went wrong.
- * Silently abort.
- */
- return;
- }
- if( stringPos >= stringSize - 1 )
- {
- stringSize += STRING_BLOCK;
- DEBUG( "resizing string memory +%d, now %d bytes",
- STRING_BLOCK, stringSize );
- tokenvalue.stringvalue =
- (char*) realloc( tokenvalue.stringvalue,
- stringSize );
- if( tokenvalue.stringvalue == NULL )
- {
- AddPosWarning( "Unable to claim enough memory "
- "for string storage",
- lineCount, charPos );
- return;
- }
- memset( tokenvalue.stringvalue + stringSize
- - STRING_BLOCK, 0, STRING_BLOCK );
- }
- tokenvalue.stringvalue[stringPos] = c;
- stringPos++;
- }
- Appendix G
- Logic Language Parser Source
- G.1 Lexical Analyzer
- %{
- #include "lexer.h"
- unsigned int nr = 0;
- %}
- %%
- \n {
- nr = 0;
- }
- [ \t]+ {
- nr += yyleng;
- }
- "->" {
- nr += yyleng;
- return (RIMPL);
- }
- "<-" {
- nr += yyleng;
- return (LIMPL);
- }
- "<->" {
- nr += yyleng;
- return (EQUIV);
- }
- [A-Z]{1} {
- nr += yyleng;
- return (IDENT);
- }
- "RESULT" {
- nr += yyleng;
- return (RESULT);
- }
- "PRINT" {
- nr += yyleng;
- return (PRINT);
- }
- . {
- nr += yyleng;
- return (yytext[0]);
- }
- %%
- G.2 Parser Header
- #ifndef LEXER_H
- #define LEXER_H 1
- enum
- {
- LIMPL = 300,
- RIMPL,
- EQUIV,
- RESULT,
- PRINT,
- IDENT
- };
- #endif
- G.3 Parser Source
- #include <stdio.h>
- #include <stdlib.h>
- #include "lexer.h"
- #ifdef DEBUG
- # define debug(args...) printf(args)
- #else
- # define debug(...)
- #endif
- extern unsigned char* yytext;
- extern unsigned int nr;
- extern int yylex (void);
- unsigned int token;
- // who needs complex datastructures anyway?
- unsigned int acVariables[26];
- void gettoken (void)
- {
- token = yylex();
- debug("new token: %s\n", yytext);
- }
- void error (char *e)
- {
- fprintf( stderr, "ERROR(%d:%c): %s\n",
- nr, yytext[0], e );
- exit (1);
- }
- void statement (void);
- int negation (void);
- int restnegation (int);
- int conjunction (void);
- int restconjunction (int);
- int implication (void);
- int restimplication (int);
- int factor (void);
- void statement (void)
- {
- int res = 0, i, var = 0;
- debug("statement()\n");
- if (token == IDENT)
- {
- var = yytext[0] - 65;
- gettoken();
- if (token == '=')
- {
- gettoken();
- res = implication();
- acVariables[var] = res;
- if (token != ';')
- error("; expected");
- gettoken();
- } else {
- error("= expected");
- }
- } else {
- error("This shouldn't have happened.");
- }
- for (i = 0; i < 26; i++)
- debug("%d", acVariables[i]);
- debug("\n");
- }
- int implication (void)
- {
- int res = 0;
- debug("implication()\n");
- res = conjunction();
- res = restimplication(res);
- return (res);
- }
- int restimplication (int val)
- {
- int res = val;
- int operator;
- debug("restimplication()\n");
- if (token == EQUIV || token == RIMPL || token == LIMPL)
- {
- operator = token;
- gettoken();
- res = conjunction();
- switch (operator)
- {
- case RIMPL:
- res = (val == 0) || (res == 1) ? 1 : 0;
- break;
- case LIMPL:
- res = (val == 1) || (res == 0) ? 1 : 0;
- break;
- case EQUIV:
- res = (res == val) ? 1 : 0;
- break;
- }
- res = restimplication(res);
- }
- return (res);
- }
- int conjunction (void)
- {
- int res = 0;
- debug("conjunction()\n");
- res = negation();
- res = restconjunction(res);
- return (res);
- }
- int restconjunction (int val)
- {
- int res = val, operator;
- debug("restconjunction()\n");
- if (token == '&' || token == '|')
- {
- operator = token;
- gettoken();
- res = negation();
- if (operator == '&')
- {
- res = ((res == 1) && (val == 1)) ? 1 : 0;
- } else { /* '|' */
- res = ((res == 1) || (val == 1)) ? 1 : 0;
- }
- res = restconjunction(res);
- }
- return (res);
- }
- int negation (void)
- {
- int res = 0;
- debug("negation()\n");
- if (token == '~')
- {
- gettoken();
- res = negation() == 0 ? 1 : 0;
- } else {
- res = factor();
- }
- return (res);
- }
- int factor (void)
- {
- int res = 0;
- debug("factor()\n");
- switch (token)
- {
- case '(':
- gettoken();
- res = implication();
- if (token != ')')
- error("missing ')'");
- break;
- case '1':
- res = 1;
- break;
- case '0':
- res = 0;
- break;
- case IDENT:
- debug("'%s' processed\n", yytext);
- res = acVariables[yytext[0] - 65];
- break;
- default:
- error("(, 1, 0 or identifier expected");
- }
- debug("factor is returning %d\n", res);
- gettoken();
- return (res);
- }
- void program (void)
- {
- while (token == IDENT || token == PRINT)
- {
- if (token == IDENT)
- {
- statement();
- }
- else if (token == PRINT)
- {
- gettoken();
- printf("%d\n", implication());
- if (token != ';')
- error("; expected");
- gettoken();
- }
- }
- }
- int main (void)
- {
- int i = 0;
- for (i = 0; i < 26; i++)
- acVariables[i] = 0;
- /* start off */
- gettoken();
- program();
- return (0);
- }
- Listings
- 3.1 Inger Factorial Program, 25
- 3.2 Backus-Naur Form for module, 27
- 3.3 Legal Comments, 30
- 3.4 Global Variables, 38
- 3.5 Local Variables, 38
- 3.6 BNF for Declaration, 39
- 3.7 The While Statement, 44
- 3.8 The Break Statement, 44
- 3.9 The Break Statement (output), 44
- 3.10 The Continue Statement, 45
- 3.11 The Continue Statement (output), 45
- 3.12 Roman Numerals, 47
- 3.13 Roman Numerals Output, 48
- 3.14 Multiple If Alternatives, 48
- 3.15 The Switch Statement, 49
- 3.16 The Goto Statement, 49
- 3.17 An Array Example, 51
- 3.18 C-implementation of printint Function, 56
- 3.19 Inger Header File for printint Function, 56
- 3.20 Inger Program Using printint, 57
- 5.1 Sample Expression Language, 85
- 5.2 Sample Expression Language in EBNF, 91
- 5.3 Sample Expression Language in EBNF, 93
- 5.4 Unambiguous Expression Language in EBNF, 97
- 5.5 Expression Grammar Modified for Associativity, 100
- 5.6 BNF for Logic Language, 103
- 5.7 EBNF for Logic Language, 103
- 6.1 Expression Grammar for LL Parser, 110
- 6.2 Expression Grammar for LR Parser, 114
- 6.3 Conjunction Nonterminal Function, 119
- 8.1 Sync routine, 130
- 8.2 SyncOut routine, 131
- 10.1 Coercion, 147
- 10.2 Sample program listing, 149
- 12.1 Global Variable Declaration, 165
- Index
- abstract grammar, 82
- Abstract Syntax Tree, 140
- abstract syntax tree, 97, 106
- Ada, 17
- address, 52
- adventure game, 79
- Algol 60, 16
- Algol 68, 16
- algorithm, 24
- alphabet, 63, 79, 82
- ambiguous, 94
- annotated parse tree, 95
- annotated syntax tree, 95
- array, 50
- arrays, 136
- assignment chaining, 40
- assignment statement, 40
- associativity, 13, 99
- AST, 133, 157–159
- auxiliary symbol, 82
- Backus-Naur Form, 26
- Backus-Naur form, 84
- basis, 80
- binary number, 31
- binary search trees, 136
- block, 42
- bool, 33
- boolean value, 33
- bootstrap-compiler, 200
- bootstrapping, 199
- bottom-up, 108
- break, 43
- by value, 55
- C, 17
- C++, 18
- callee, 168
- calling convention, 167
- case, 46, 158
- case block, 157
- case blocks, 157
- case value, 157
- char, 36
- character, 32, 36
- child node, 156
- children, 158
- Chomsky hierarchy, 89
- closure, 80
- CLU, 17
- COBOL, 16
- code, 11
- code block, 156–158
- code blocks, 157
- code generation, 158
- coercion, 143
- comment, 29
- common language specification, 21
- compiler, 10
- compiler-compiler, 109
- compound statement, 40
- computer language, 79
- conditional statement, 43
- context-free, 87
- context-free grammar, 63, 84, 89
- context-insensitive, 87
- context-sensitive grammar, 89
- continue, 43
- dangling else problem, 46
- data, 24
- decimal separator, 31
- declaration, 24
- default, 46
- definition, 24
- delimiter, 29
- derivation, 28
- derivation scheme, 92
- determinism, 88, 109
- deterministic grammar, 116
- dimension, 50
- double, 36
- duplicate case value, 157, 158
- duplicate case values, 157
- duplicate values, 158
- duplicates, 158
- dynamic variable, 52
- Eiffel, 18
- encapsulation, 17
- end of file, 110
- error, 133, 158
- error message, 159
- error recovery, 117
- escape, 32
- evaluator, 107
- exclusive state, 69
- expression evaluation, 40
- Extended Backus-Naur Form, 28
- extended Backus-Naur form, 90
- extern, 56
- FIRST, 128
- float, 35, 36
- floating point number, 31
- flow control statement, 46
- FOLLOW, 128
- formal language, 79
- FORTRAN, 16
- fractional part, 31
- function, 24, 52
- function body, 42, 54
- function header, 54
- functional programming language, 16
- global variable, 24, 37
- goto, 159
- goto label, 159
- goto labels, 159
- goto statement, 159
- grammar, 60, 78
- hash tables, 136
- header file, 56
- heap, 52
- hexadecimal number, 31
- identifier, 29, 52
- if, 43
- imperative programming, 16
- imperative programming language, 16
- indirect recursion, 88
- indirection, 37
- induction, 80
- information hiding, 17
- inheritance, 17
- int, 34
- integer number, 31
- Intel assembly language, 10
- interpreter, 9
- Java, 18
- Kevo, 18
- Kleene star, 64
- label, 49, 159
- language, 78
- left hand side, 40
- left recursion, 88
- left-factorisation, 112
- left-linear, 89
- left-recursive, 111
- leftmost derivation, 83, 93
- lexer, 61
- lexical analyzer, 61, 86
- library, 56
- linked list, 136
- linker, 56
- LISP, 17
- LL, 109
- local variable, 37
- lookahead, 113
- loop, 43
- lvalue, 40
- lvalue check, 153
- metasymbol, 29
- Modula2, 17
- module, 55
- n-ary search trees, 136
- natural language, 79, 133
- Non-void function returns, 157
- non-void function returns, 156
- non-void functions, 157
- nonterminal, 26, 82
- object-oriented programming language, 16
- operator, 29
- optimal-compiler, 200
- parse tree, 92
- parser, 106
- Pascal, 16
- PL/I, 16
- pointer, 36
- polish notation, 99, 107
- polymorphic type, 36
- pop, 13
- prefix notation, 99, 107
- priority, 13
- priority list, 13
- procedural, 16
- procedural programming, 16
- procedural programming language, 16
- production rule, 27, 81
- push, 13
- random access structure, 50
- read pointer, 11
- recursion, 83, 87
- recursive descent, 109
- recursive step, 80
- reduce, 12, 13, 108
- reduction, 12, 113
- regex, 66
- regular expression, 65
- regular grammar, 89
- regular language, 63
- reserved word, 29, 73
- return, 156, 157
- return statement, 157
- right hand side, 40
- right linear, 89
- rightmost derivation, 94
- root node, 158
- rvalue, 40
- SASL, 17
- scanner, 61
- scanning, 62
- Scheme, 17
- scientific notation, 31
- scope, 30, 37, 136
- scoping, 135
- screening, 62
- selector, 46
- semantic, 10
- semantic analysis, 133, 136
- semantic check, 158
- semantics, 81, 133
- sentential form, 83, 85, 109
- shift, 12, 13, 108, 114
- shift-reduce method, 115
- side effect, 40, 53
- signed integers, 34
- simple statement, 40
- Simula, 17
- single-line comment, 29
- SmallTalk, 17
- SML, 17
- stack, 11, 52, 108
- stack frame, 167
- start, 52, 55
- start function, 52
- start symbol, 27, 82, 84, 90, 108
- statement, 24, 40, 156
- static variable, 52
- string, 32
- strings, 79
- switch block, 157
- switch node, 158
- switch statement, 157
- symbol, 134, 136
- symbol identification, 134, 135
- Symbol Table, 136
- symbol table, 133, 159
- syntactic sugar, 108
- syntax, 60, 80, 133
- syntax analysis, 133
- syntax diagram, 24, 90
- syntax error, 109
- syntax tree, 92
- T-diagram, 199
- template, 163
- terminal, 26, 82
- terminal symbol, 82
- token, 61, 114
- token value, 63
- tokenizing, 62
- top-down, 108
- translator, 9
- Turing Machine, 17
- type 0 language, 89
- type 1 grammar, 89
- type 1 language, 89
- type 2 grammar, 89
- type 2 language, 89
- type 3 grammar, 89
- type 3 language, 89
- type checker, 133
- Type checking, 133
- typed pointer, 50
- types, 133
- union, 63, 64
- unique, 136
- Unreachable code, 156
- unreachable code, 156, 157
- unsigned byte, 36
- untyped, 36, 73
- warning, 133, 156, 158
- whitespace, 86
- zero-terminated string, 50