
Compiler Design and Development

Project 4: Learn How to Write a 64-Bit Windows Compiler

This applied approach to designing and developing a 64-bit compiler focuses on practical skills rather than academic theory. Designed for the practicing programmer who wants to learn to write a compiler for a developmental language, this project teaches you what you need to know to get started. Excerpts in the accompanying project guides are taken from the book, Compiler Design and Development, Volume 1, First Edition.

A compiler is a program that converts the source code for a programming language into machine code for a target microprocessor. Compilers generally consist of a scanner that performs lexical analysis, a parser that performs syntax and semantic analysis, an intermediate code generator, a code optimizer, a low-level code generator, and a symbol table.

Learn How to Build a 64-Bit Compiler

If you have the desire to create a compiler, this publication is aimed at you. It does not matter whether you are a hobby programmer, student, or a professional developer. The approach is the same for everybody.

Overview

A compiler is a program that takes code written in a high-level source language and translates it into a low-level language, usually assembly or machine code. Generally, low-level code is targeted to a particular family of microprocessors. The compiler covered in this project is specifically designed to work with Intel and AMD x86-64 processors.

The front-end of a compiler consists of a scanner, parser, symbol table, and intermediate code generator. The scanner reads the source code character by character and constructs tokens out of the basic elements (lexemes) of the source input stream. This process is called lexical analysis and it begins the process of developing one or more symbol tables. It also begins one of the most important functions of a compiler, error detection and reporting.

The parser conducts syntax and semantic analyses on the tokens and produces a series of parse trees based on the set of grammar rules characterizing the correct form of sentences in the language. It also detects errors and reports them as they occur.

The symbol table contains information about identifiers such as type, location in memory, limitations, and other data needed by the parser. The symbol table and parse trees are used to produce intermediate code.

The back-end of a compiler consists of a code optimizer and code generator. The optimizer uses various strategies to improve the speed and space requirements of generated code. The code generator then converts the optimized intermediate code into assembly instructions that are passed to an assembler, or directly into machine-level instructions that can be read directly by the processor.

An integral part of specifying an operational compiler is coming to grips with its actual internal workings. This requires familiarity with the concepts covered in the sections that follow.

Conceptual Design

In 1984 Gehani wrote, "Most programmers adhere to the old-fashioned view that the purpose of programs is to instruct our machines, whereas modern programmers know that the purpose of our machines is to execute our programs." (Gehani, N., 1984, Comparing and Assessing Programming Languages, Englewood Cliffs, NJ: Prentice-Hall.)

Compiler. In my mind, a language is only as good as its compiler. Therefore, a compiler must be totally reliable, which means it always has to do what the programmer intended. This requires every program to be checked against every single rule of the language so that no incorrectly formed code is compiled without warning the programmer. It also means all instructions must be correctly translated into machine-level form so that no incorrectly translated code can cause a program to crash. There are other important qualities that a well-designed compiler should possess.

A compiler must:

  1. Check every program against every rule of the language
  2. Refuse to compile incorrectly formed code without warning the programmer
  3. Translate every instruction correctly into machine-level form

The compiler is composed of the following component parts: a scanner, a parser, a symbol table, an intermediate code generator, a code optimizer, a code generator, and an error handler.

Programming Language. The programming language described in this document is called Adagé. Adagé is a powerful and flexible language incorporating many of the best features found in modern renditions of languages like Ada, BASIC, C, Modula, Pascal, and Spark. It is a bi-level programming language designed primarily with readability and reliability in mind. Bi-level means that it contains both high-level programming constructs and low-level access to the microprocessor using standard assembly language mnemonics. Adagé is an imperative programming language focused on placing named data objects into memory locations and then performing operations on them using various procedures, functions, and methods.

The purpose of designing this new language is to help users learn how to design and develop a compiler based on a simple-to-understand, statically verifiable, standardized programming language. Secondary purposes include learning how to develop language and compiler specifications, create a robust user interface, and build support tools like component libraries and source code verification programs.

In my opinion, Ada is way too large a language and its compiler is very complex. BASIC is a good language, but lacks structure and abstraction. C, although very powerful, is terse and unsafe for many applications. Modula and Pascal contain good abstraction and safety features, but are not expressive enough in terms of hiding implementation details. Spark is a subset of Ada that introduces the concept of programming-by-contract, but remains overly complex and proprietary. I decided a fresh language was in order because embarking on this new frontier would prove both exciting and rewarding.

Some of the main features envisioned for Adagé include readability and reliability as primary design goals, static verifiability, high-level programming constructs, and low-level access to the microprocessor through standard assembly language mnemonics.

Note that Adagé is not explicitly designed for high integrity applications, embedded device drivers, or any specific single developmental domain. Unlike high integrity applications written in Ada and Spark, Adagé will halt program execution upon encountering an error, offer corrective actions, and then wait for user input before continuing. This is by design because Adagé applications are meant for domains that expect correct data results and enforce those expectations.

Scanner

Scanning (lexical analysis) is usually the first phase of a compiler program. Based on finite-state machines, the scanner reads one character at a time from a source file and converts the input into recognizable sequences of characters that form the tokens of the language. Scanners are often called lexers or tokenizers. In programming languages, scanners are generally combined with parsers to analyze the syntax of the language.

The role of the scanner (lexical analyzer) is to read the input source file character by character, group recognized patterns of characters into lexemes, and produce as output a token for each lexeme. The difference between a lexeme and a token is that a lexeme is a recognized string pattern, such as the reserved word PUT. The scanner turns the lexeme into a token by outputting the name of the identifier along with associated attribute information such as type, value, location, and scope. Thus, a lexeme represents an object and, when paired with its associated attributes, becomes a token.

Note also that the term object means a named entity that is assigned a memory location and a value at that memory location. In essence, an object is anything that is assigned memory space and then takes up that memory space with a value.
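To make the lexeme/token distinction concrete, the following sketch shows one way a scanner might represent a token in C. It is illustrative only; the field names and TokenKind values are assumptions, not part of the Adagé design.

/* A lexeme is the raw recognized string; the remaining fields are the
   attributes that, together with the lexeme, make up a token. */
typedef enum { TK_IDENTIFIER, TK_RESERVED, TK_NUMBER, TK_SYMBOL } TokenKind;

typedef struct {
    TokenKind kind;          /* classification decided by the scanner  */
    char      lexeme[64];    /* recognized string pattern, e.g. "PUT"  */
    int       line, column;  /* location, used for error reporting     */
    int       symbol_index;  /* entry in the symbol table, if any      */
} Token;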

Regular Expressions

Regular expressions are a convenient method of specifying certain simple sets of strings. They are used to specify the structure of the tokens used in a programming language. Typically, regular expressions are used in scanner generators.

Regular expressions, which describe the same class of languages as regular grammars, are a means of specifying certain string patterns called lexemes. We are interested in regular expressions because we will use them to specify the structure of tokens declared in our programming language. In fact, the set of regular expressions used in our scanner is expressly aimed at developing specific recognizers for the Adagé programming language.

The regular expressions used in developing the Adagé scanner have several desirable properties:

For example, Adagé string patterns are recognized by the scanner as identifiers when they start with a letter followed by any combination of letters, digits, or underlines. Figure 1 depicts the format of a regular expression denoting an identifier.

Figure 1. Regular Expression for an Identifier
IDENTIFIER ::= letter [letter | digit | underline {letter | digit}]
  letter ::= upper_case_letter | lower_case_letter
  upper_case_letter ::= A .. Z
  lower_case_letter ::= a .. z
  digit ::= 0 .. 9
  underline ::= _

What the symbology in Figure 1 shows is that an Adagé IDENTIFIER must start with a letter (a..z | A..Z). The first letter then can be followed by any combination of zero or more optional letters, digits, or underlines. In the case of an underline character, the underline must be followed by at least one letter or digit {letter | digit}. Brackets [ ] used in defining a regular expression signify that zero or more of the symbols contained within the brackets can be used. Alternatively, braces { } indicate that one or more characters contained within must be used while parentheses ( ) indicate that one of each character contained within must be used in the order listed. The ::= symbol means "is defined as". The vertical bar | represents an "or" operator.

Therefore, the following are legal and illegal identifiers in Adagé:

Figure 2. Legal and Illegal Identifiers
  LEGAL IDENTIFIERS       ILLEGAL IDENTIFIERS
  Add_Two_Numbers         Add__Two__Numbers    ~ Cannot use two adjacent underlines
  P25_Units               25_Units             ~ Cannot use a digit as the first character
  ReturnVal               RETURN               ~ Cannot use stand-alone reserved words
  My_3_Variables          My#3Variables:       ~ Cannot use symbols other than underlines
  x_1                     _                    ~ Cannot use single stand-alone underlines
  Div_by_Two              _Div_by_Two          ~ Cannot use an underline as the first character
  Big_Definition          Big_Definition_      ~ Cannot use an underline as the last character
  A315_Vectors            A315 Vectors         ~ Cannot use spaces within a name
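For readers who want to experiment, the identifier rules of Figures 1 and 2 can be captured in a short C function. This is a sketch of the rules only (reserved-word checking is omitted), not the actual Adagé scanner.

#include <ctype.h>
#include <stddef.h>

/* Returns 1 if s is a legal Adagé identifier: it must start with a
   letter, contain only letters, digits, and underlines, and every
   underline must be followed by a letter or digit (which also rules
   out doubled and trailing underlines). */
int is_legal_identifier(const char *s)
{
    if (!isalpha((unsigned char)s[0]))
        return 0;                            /* must begin with a letter   */
    for (size_t i = 1; s[i] != '\0'; i++) {
        if (s[i] == '_') {
            if (!isalnum((unsigned char)s[i + 1]))
                return 0;                    /* '_' needs a letter/digit   */
        } else if (!isalnum((unsigned char)s[i])) {
            return 0;                        /* spaces, #, etc. are illegal */
        }
    }
    return 1;
}

Called on the examples in Figure 2, this function accepts Add_Two_Numbers and x_1 while rejecting 25_Units, _Div_by_Two, Big_Definition_, and the stand-alone underline.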

Deterministic Finite Automata

A recognizer for a programming language takes as input a string of one or more characters and determines whether the string is a token of the language. Regular expressions are converted into recognizers by constructing generalized transition diagrams from the expressions. Deterministic Finite Automata (DFA) are transition diagrams that contain at most one path leaving a given state for any input character. A DFA is a theoretical machine that accepts regular languages.

Deterministic Finite Automata (DFA) are used to create the finite-state machine subroutines that deal with the tokens in our programming language. Finite-state machines (FSM) have one purpose: to accept or reject input strings as tokens. Finite automata are often represented as either transition tables or transition diagrams, which use circles and arrows to portray states and transitions, hence the name. A deterministic finite automaton always defines a unique next state as the individual characters making up a token are recognized and processed by the lexical analyzer.
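To illustrate the transition-table form, the identifier DFA described in Figure 3 could be encoded as a small lookup table in C. The state numbering follows Figure 3 and the character classes are simplified; this is a sketch, not the Adagé scanner's actual tables.

/* Rows are states 0..2 (state 3 is the accepting state); columns are
   character classes. A value of -1 means reject, 3 means accept. */
enum { CL_LETTER, CL_DIGIT, CL_UNDERLINE, CL_OTHER };

static const int next_state[3][4] = {
    /*             letter digit underline other                             */
    /* state 0 */ {  1,    -1,    -1,      -1 },  /* must begin with letter */
    /* state 1 */ {  1,     1,     2,       3 },  /* 'other' ends the token */
    /* state 2 */ {  1,     1,    -1,      -1 },  /* '_' needs letter/digit */
};

A driver loop simply maps each input character to its class and indexes the table until it reaches 3 (accept) or -1 (reject).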

Adagé's DFA transition diagrams are designed for use in developing the scanner program. The conversion of the regular expression in Figure 1 into a DFA diagram is shown in Figure 3.

Figure 3. Finite-State Transition Diagram

Note that the start state begins at the circle containing a zero. The start point represents the process of reading the first character of a token. If the initial character is a letter, then the scanner knows it is dealing with an identifier and the subroutine jumps to State 1. From State 1, if the next character read is an underline, then the subroutine jumps to State 2.

In order for the subroutine to leave State 2, the next character read from the source file must be either a letter or digit. If it is, then the subroutine transitions back to State 1. The subroutine will remain in State 1 as long as the next group of characters read from the source file are recognized as letters or digits. If an underline is read, then the subroutine jumps to State 2. When any other character is read, for instance a blank space, then the subroutine is done and transitions to the final state, State 3.

Note that in Figure 3 the end state is represented by the number 3 inside double circles. When the subroutine reaches an end state, it knows it has a fully formed lexeme. The arrows pointing to each circle represent transition conditions. The finite-state transition diagram depicted in Figure 3 can be converted to program code as shown in Listing 1.

~---------------------------------------------------------------------------------------------
~ Listing 1. Program Stub for a DFA Transition Diagram
~ Purpose is to tokenize the input stream from a program's source code
~---------------------------------------------------------------------------------------------
var token: string [1..64]           ~ Define a 64-byte buffer named token
var chr: char                       ~ Define chr as type char
subtype cnt is integer range 1..64  ~ User-defined subtype
var count: cnt = 1                  ~ Define count as the index into token
var pos: integer = 1                ~ Define pos as the position in the source line
constant whitespace: char = " "     ~ Define whitespace to be a blank character

State_0:
chr = get_ch(pos)                   ~ Get character from source
  while chr = whitespace            ~ Loop as long as character is whitespace
    pos = pos + 1
    chr = get_ch(pos)               ~ Get next character from source
  end while                         ~ Must have encountered a non-whitespace character
  count = 1                         ~ Reset token index

case chr is
  when letter or underline          ~ Possible start of an identifier
    Get_Identifier()                ~ If token is an identifier, get the rest of it
  ~ Statements to check for other kinds of tokens
  when others
    Raise_Exception (error_number)
end case

procedure Get_Identifier()
  if chr = letter then              ~ Identifiers must begin with a letter
    goto State_1
  else                              ~ Char must be an underline, which is illegal
    Raise_Exception (error_number)  ~ as the first character of an identifier
  end if

State_1:
  token(count) = chr                ~ Store char in token
  count = count + 1
  pos = pos + 1
  chr = get_ch(pos)                 ~ Get next character
  if (chr = letter) or (chr = digit) then
    goto State_1
  else if chr = underline then
    goto State_2                    ~ Deal with the underline
  else
    goto State_3
  end if

State_2:
  token(count) = chr                ~ Store underline character
  count = count + 1
  pos = pos + 1
  chr = get_ch(pos)                 ~ Get next character
  if (chr = letter) or (chr = digit) then
    goto State_1
  else                              ~ An underline must be followed by either a letter or
    Raise_Exception (error_number)  ~ digit. Also, two underlines cannot be placed together
  end if                            ~ and an identifier must end in either a letter or digit

State_3:
  ~ Routine to add identifier to symbol table
  ~ Routine to add identifier to syntax tree
  ~ Return to State_0 to retrieve next token
end procedure
~---------------------------------------------------------------------------------------------

In Listing 1, we assumed the first non-whitespace character retrieved from the source line happened to be a letter or underline, so the scanner treats it as the possible start of an identifier. Get_Identifier then verifies that the first character is a letter before passing program control to State 1; a leading underline raises an exception because identifiers cannot begin with an underline (see Figure 2). Notice how State 2 ensures that at least one letter or digit follows an underline and that two underlines cannot be grouped together.

Notice how easy it is to translate a DFA transition diagram into actual program code. Similar transition diagrams exist for other Adagé token types including numbers, symbols, end-of-file, end-of-line, comments, strings, and pragmas. Each has its own set of numbered states and transitions to those states.

Parsers

There are two general approaches to parsing: top-down and bottom-up. The productions applied by a top-down parsing technique trace a leftmost derivation, whereas the productions applied by a bottom-up parsing technique trace a rightmost derivation in reverse. When specifying a parsing technique, then, one must state whether a leftmost or rightmost parse will be produced.

The purpose of a parser is to collect string information into groupings as specified by production rules and output parse trees for the code generator. Parse trees graphically represent the grammar resulting from the choice for replacement order of the production rules.

Context-Free Grammars

There are many different ways to describe the syntax of a programming language. Most modern compilers employ context-free grammars (CFG). Grammars are considered context-free when their production rules can be applied regardless of the context of a nonterminal. Context-free grammar production rules are formulated using the Backus-Naur Form (BNF).

Every programming language has its own set of syntactical rules that specify how to form correct constructs. Adagé employs a Context-Free Grammar (CFG) to specify its syntax. A CFG is a formal methodology represented by a series of production rules containing terminal and nonterminal symbols. Context-free grammars, in the form of extended Backus-Naur Form (BNF) notation, are used by Adagé's parsing program. The parser uses the CFG to verify that the token stream output by the scanner satisfies the syntax rules of the language. CFGs provide a simple mathematical representation of the rules by which the output of the scanner is described.

Terminals. Terminals represent all the basic symbols of a language from which strings are formed. Terminals never change and cannot be broken down any further. Sometimes referred to as tokens, terminals are represented using their actual symbology. Examples of terminal symbols include reserved words, operators such as + and *, punctuation such as parentheses and commas, and literal values such as digits.

Nonterminals. Nonterminals represent syntactic variables in the grammar that denote sets of strings. Nonterminal symbols can always be expanded to one or more terminal symbols. Typically, nonterminal symbols are depicted as lowercase names, such as expression and term, or as names enclosed in angle brackets, such as <statement goal>.

Production Rules. The productions of a grammar specify the manner in which terminals and nonterminals may be combined to form strings. Generally, production rules define how a nonterminal symbol on the left-hand side of a production expands into a number of terminal and nonterminal symbols on the right-hand side. Expansion of all nonterminal symbols on the right side of a production must eventually lead to a sequence of terminal-only symbols. Figure 4 depicts the general format of production rules.

Figure 4. Context-Free Grammar Production Format
statement goal ::= finite sequence of terminals and nonterminals

Figure 4 shows that Adagé's context-free grammar format contains four elements:

  1. A set of terminal symbols, referred to as tokens, which represent the elementary symbols of the language
  2. A set of nonterminals, referred to as syntactic variables; each nonterminal represents a set of strings of terminals
  3. A set of productions, where each production consists of a nonterminal (the head, or left side, of the production), a ::= symbol, and a sequence of terminals and nonterminals (the right side of the production)
  4. A designation of one of the nonterminals as the <start symbol> or <statement goal>

Figure 5 shows a sample of Adagé's production rules using a set of classic expression grammars for arithmetic operations.

Figure 5. Adagé Arithmetic Production Rules
 1. expression ::= term
 2.              | term + term
 3.              | term - term
 4. term       ::= factor
 5.              | factor * factor
 6.              | factor / factor
 7.              | factor % factor
 8.              | factor rem factor
 9. factor     ::= constant
10.              | variable
11.              | (expression)
12. constant   ::= digit
13. variable   ::= identifier
14. digit      ::= (0..9)

The parser implements these production rules using the stack. In some cases, Reverse Polish Notation operations are implemented where speed and operator precedence are critical. Additionally, the productions in Figure 5 are not yet optimized. This is because Adagé's grammar is still evolving, which leaves plenty of opportunity to work on efficiency.

Note that the productions shown in Figure 5 can be portrayed as syntax diagrams, which makes them much easier to use. Adagé has syntax diagrams for all its language constructs. These are used to build the tables and subprograms that drive the parser. Figure 6 depicts the set of representative syntax diagrams for Adagé arithmetic productions.

Figure 6. Adagé Syntax Diagrams

Syntax Analysis. Parsers perform two functions: (1) syntax analysis and (2) semantic analysis. The syntax of a programming language concerns the structure of its constructs. The syntax analyzer identifies sequences of symbols and validates their legality. In top-down recursive descent parsers, the syntax analyzer creates a tree-like intermediate representation that depicts the grammatical structure of the token stream. In top-down push-down parsers, the syntax analyzer creates stack entries. Since context-free grammars have inherent shortcomings, most notably their inability to describe certain interdependencies between different units in a program, Adagé employs both methods.

One important Adagé requirement is that its grammar must form unambiguous constructs. That is, every string generated by the grammar can only have one parse tree. This is because ambiguous constructs have the potential of providing different solutions to a parsing description. Figure 7 points out the difference between ambiguous and unambiguous syntactic constructs.

Figure 7. Syntax Constructs
index = value + rate * 10    ~ Ambiguous statement raises an exception
index = value + (rate * 10)  ~ Unambiguous statement that uses parentheses to disambiguate

Figure 7 shows that in the case where value = 100 and rate = 1000, the ambiguous statement could give two different answers (either 11,000 or 10,100) depending on the order of the parse.

Because Adagé does not invoke operator precedence like other programming languages, the compiler avoids this ambiguity by requiring parentheses to indicate operator precedence; when they are missing, it raises the exception shown in Figure 7. Absent parentheses, Adagé defaults to performing arithmetic operations from left to right. Therefore, without the use of parentheses, Adagé would add rate to value and then multiply the result by 10, giving an answer of 11,000. If this is not the desired solution, then the operands must be reordered or parentheses must be used.

Derivations. The sequence of intermediary strings generated to expand the <start symbol> of the grammar to a desired string of terminals is called a derivation. The parser is responsible for representing derivations in the form of parse trees. Figure 8 shows the parse tree for the unambiguous statement index = value + (rate * 10).

Figure 8. Parse Tree for index = value + (rate * 10)

Another problem arises that can't be solved by Adagé's syntax analyzer. Figure 9 depicts a construct with perfectly formed syntax.

Figure 9. Acceptable Syntactic Definition
var AnyVal : integer = "This is a string"  ~ Perfectly formed syntax

Semantic Analysis. The semantics of a language specifies the meaning of its syntactic constructs. Figure 9 demonstrates a perfectly legal syntax, but it is not semantically correct. Although the assignment is correctly formed, a string of characters cannot be assigned to an integer variable. Identifying these kinds of errors is the job of the semantic analyzer.

Semantic analysis checks that the syntax portrays the correct meaning. In the previous example, the statement is syntactically well formed but not semantically correct because the declaration attempts to assign a string on the right side of the assignment symbol (i.e., =) to the integer variable "AnyVal" on the left side. A semantic error should be raised at this point. Semantic errors include type mismatches, anonymous types, undeclared variables, reserved word misuse, more than one declaration of a variable in the same scope, accessing out-of-scope variables, and actual-argument and formal-parameter mismatches.
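As a sketch of how a semantic analyzer might catch the error in Figure 9, consider the C fragment below. The type codes and reporting callback are hypothetical; they stand in for whatever representation the parser actually builds.

typedef enum { TY_INTEGER, TY_FLOAT, TY_STRING } TypeCode;

/* Check a declaration with an initializer: the type of the expression
   on the right of '=' must match the declared type on the left. */
int check_declaration(TypeCode declared, TypeCode initializer,
                      int line, void (*report)(int, const char *))
{
    if (declared != initializer) {
        /* e.g. var AnyVal : integer = "This is a string" */
        report(line, "type mismatch between declaration and initializer");
        return 0;
    }
    return 1;
}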

There are three generally recognized types of parsers for grammars: (1) universal, (2) top-down, and (3) bottom-up. Although universal parsers are known for their ability to parse almost any kind of grammar, they are typically very inefficient. This often makes them less useful in compiler design; therefore, I will not discuss universal parsers.

Top-Down Parsing

A parser is considered top-down if it builds a parse tree corresponding to a token sequence by starting at the top of the tree and then expanding it downward to the leaves. Top-down parsing techniques are predictive in nature because they always predict the production that is to be matched before matching actually begins. A recursive descent parsing technique is a popular top-down approach.

Top-down parsers form the derivation of the input source file from the start symbol of the grammar and construct the parse tree from the root at the top, proceeding downward through branches to the subordinate constructs called leaves. Two approaches for top-down parsing are recognized: (1) recursive descent parsing and (2) predictive parsing.

Recursive Descent Parsing. Recursive descent parsing is a kind of top-down parsing consisting of a set of recursive routines that may require backtracking to create a proper parse tree. When a context-free grammar contains several alternative productions for the next retrieved token, the parser has to take one branch in order to check the validity of its choice. If the incorrect branch is taken, the subroutine must backtrack up the tree and test another branch. This recursive backtracking is time consuming for a compiler.

Predictive Parsing. A predictive parser is a top-down recursive parser that avoids backtracking by examining a lookahead character before making a branch decision. This is called top-down parsing without backtracking, and it reduces execution time to a linear function of the length of the input being parsed.

Also, most top-down parsers scan the input from left to right and decide which rule to apply (i.e., which branch to take) based on the leftmost unmatched token. Such a parser is referred to as an LL parser (left-to-right scan; leftmost derivation). When one lookahead character is used to avoid backtracking, the number one is placed inside parentheses like this: LL(1). LL(1) thus means a top-down parser with left-to-right scanning using the leftmost derivation and one character of lookahead.
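To make this concrete, here is a compact C sketch of an LL(1) recursive-descent parser for the arithmetic grammar of Figure 5. It handles single-digit constants and parenthesized expressions only (variables and rem are omitted), and every name in it is illustrative rather than taken from the Adagé parser.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* One-character lookahead over the input string (the "1" in LL(1)). */
static const char *input;
static char peek(void)    { return *input; }
static void advance(void) { input++; }

static void error(const char *msg)
{
    fprintf(stderr, "parse error: %s at '%c'\n", msg, *input);
    exit(1);
}

static void expression(void);            /* forward declaration        */

/* factor ::= constant | (expression)                                  */
static void factor(void)
{
    if (isdigit((unsigned char)peek())) {
        advance();                       /* constant ::= digit         */
    } else if (peek() == '(') {
        advance();
        expression();
        if (peek() != ')') error("expected ')'");
        advance();
    } else {
        error("expected digit or '('");
    }
}

/* term ::= factor | factor * factor | factor / factor | ...           */
static void term(void)
{
    factor();
    if (peek() == '*' || peek() == '/' || peek() == '%') {
        advance();
        factor();
    }
}

/* expression ::= term | term + term | term - term                     */
static void expression(void)
{
    term();
    if (peek() == '+' || peek() == '-') {
        advance();
        term();
    }
}

int main(void)
{
    input = "3+(4*2)";                   /* sample sentence            */
    expression();
    if (peek() != '\0') error("trailing input");
    puts("parsed successfully");
    return 0;
}

Note how each nonterminal in Figure 5 becomes one C function, and how the single lookahead character (peek) is all the parser needs to choose a branch without backtracking.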

Note that the drawbacks associated with top-down parsers include: (1) the potential to enter infinite loops, (2) the potential to backtrack when no lookahead is employed, (3) the near-impossibility of determining where errors occur, and (4) the fact that not all grammars lend themselves to top-down parsing. However, top-down predictive parsers have the advantage of lending themselves to transition diagrams in a straightforward manner.

Bottom-Up Parsing

Bottom-up parsers build the structure of a parse tree by beginning at the bottom (leaves of the tree, which are terminal symbols) and determining the productions used to generate the leaves. The parser continues until it reaches the production used to expand the start symbol at the top of the tree.

Bottom-up parsing, also called shift-reduce parsing, is so called because the derivation tree is constructed from the bottom up. That is, the tree is constructed from the leaves and grows up to the root node. Almost all bottom-up parsers employ rightmost-derivation strategies, thus they are referred to as LR parsers. In similar style to LL parsers, they can employ zero or more lookahead characters.

So, an LR(1) parser is a bottom-up parser with left-to-right scanning using the rightmost derivation (traced in reverse) and one character of lookahead. Recall that a leftmost derivation is one in which the leftmost nonterminal is expanded at each step; in the same manner, a rightmost derivation is one in which the rightmost nonterminal is expanded at each step.
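As a brief example using the grammar of Figure 5, the string 3 + 4 can be derived both ways (the constant-to-digit steps are elided for brevity):

Leftmost:  expression => term + term => factor + term => 3 + term => 3 + factor => 3 + 4
Rightmost: expression => term + term => term + factor => term + 4 => factor + 4 => 3 + 4

A top-down parser traces the first sequence directly; a bottom-up parser discovers the second sequence in reverse as it reduces 3 and 4 back toward expression.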

There are three types of bottom-up parsers in general use today: (1) Simple LR (SLR), (2) Canonical LR (CLR), and (3) Lookahead LR (LALR).

Most LR parsers are extremely difficult to implement and need specially-formed grammars to work correctly. It is generally agreed that SLR(1) parsers are the simplest to implement, that LALR(1) parsers are harder to implement but more powerful than SLR(1) parsers, and that CLR(1) parsers are the most powerful and the most expensive to implement. Although LR parsers can detect syntactic errors a lot sooner in the parsing cycle and are generally faster than LL parsers, one of their principal drawbacks is that they are extremely difficult to handcraft. For this reason, specialized parser generators like Yacc and Bison have been developed.

Error Handler

When a compiler detects a syntax error, it usually attempts to continue with the parsing process so as to detect and record as many errors as possible.

Error handling is one of the most critical processes performed by a compiler. Generally, when the compiler detects a semantic error, it can still continue compiling a program. When a syntactic error is detected, a serious condition exists that may force the compiler to stop processing the source code. However, error recovery is not easily accomplished. Since syntax errors potentially put the compiler into a state where continued compiling is not possible, it would make sense if the compiler could perform either error recovery or error repair. These two options play a vital role in the overall operation of a compiler.

Error Recovery. With error recovery, the compiler attempts to record the error and reset the parser so that the remaining program can be parsed. The problem with syntax errors is that they tend to cascade throughout the remaining code. This makes recovery extremely difficult at best and impossible in the worst case scenario. Adagé has a built-in panic mode whereby the parser bails out when attempts at error recovery fail.

Error Repair. With error repair, sometimes called error correction, the compiler attempts to actually make repairs where it makes sense to do so. For instance, a simple typo or accidentally defining an improper type might be repairable whereas an incomplete sentence might not be repairable. One problem with self-repair is that a number of false corrections may be made and never detected by the programmer. When Adagé makes a repair, it is reported so that the programmer becomes aware of the correction.

The biggest drawback to error repair is the amount of exacting extra code that must be added to a compiler with no guarantee of success. For this reason, Adagé's grammar is designed to make errors easily recognizable at the earliest possible opportunity and to provide minimal algorithms to effect repair. Adagé also has a built-in optimizer that determines when a point of diminishing returns has been reached, after which any further effort at error recovery would not be worthwhile. At predetermined thresholds, the parser is designed to bail out with an appropriate error message. Figure 10 depicts a typical error report, which Adagé calls an exception.

Figure 10. Adagé Error Format
Syntax error 104 on Line 21: var index integer
                                      ^
                                Missing colon.

Adagé error reports provide line number, position, and error reference number. The error reference number allows programmers to cross reference particular errors in the Adagé Reference Manual. Note that exceptions incorporate more than just errors. Exceptions include warnings, cautions, and tips.

Intermediate Code

Most industrial-grade production compilers offered today produce intermediate code as part of the front-end output. Intermediate code is often stored in a tree-like format based on the needs of optimizers and code generators for various targeted processors. Other stored formats include three-address code. Intermediate code is also closer to the target machine than the source language, which makes it easier to generate actual code for the target processor and to optimize that code.

Intermediate Code (IC) represents an abstract form of machine code that models a generic processor. Most sources discuss two important properties of intermediate code. It is generally agreed that intermediate code (1) should be easy to produce and (2) easy to translate into assembly instructions for the target processor.

The purpose of intermediate code is to make a compiler more portable. That is, the intermediate code is easier to re-target for UNIX, Linux, and Windows operating systems, or even for different or newer processors. Using intermediate code allows the details of the target processor and associated machine language to be confined to the back-end of the compiler. In my opinion, intermediate code representations are also easier to optimize than actual assembly code.

;-------------------------------------------------------
; Listing 2. Intermediate Code Snippet
; Pseudo Code: var index: integer = 50
;              procedure Add_Int (x, y: int)
;              begin
;                x = x + y
;              end procedure
;
;              Add_Int (index, 100)
;-------------------------------------------------------
MEM index 8 BYTES  ; Set aside 8 bytes in memory
ST @ADDR index 50  ; Store 50 at index location
L1:                ; Begin procedure
LD R1, STK1        ; Load address from stack location 1
LD R2, [R1]        ; Load value located at address in R1
LD R3, STK2        ; Load value from stack location 2
ADD R3, R2         ; Add both registers
ST [R1], R3        ; Store results at address in R1
RET                ; Return to caller
;-------------------------------------------------------

Notice that the intermediate code in Listing 2 is quite different from the code produced by compiler-construction programs like Yacc. The intermediate code is based on the abstract syntax trees produced by the parser's syntax analyzer and validated by the semantic analyzer.

Code Optimization

Almost all modern compilers perform some optimization. It should be noted that code optimization, as effective as it may be, rarely results in perfectly optimized code. There is always a point of diminishing return when optimization is involved.

The purpose of the code optimizer is to make the resulting code (1) run faster, (2) work on different processors, and (3) work on different operating systems. One drawback to the optimizing process is that a compiler can spend considerable time performing optimization. Code optimization almost always involves trade-offs between the time spent optimizing the code and any resulting speed-ups that may be realized. I highly recommend you download a copy of Agner Fog's Optimizing Subroutines in Assembly Language. It is free and comes in PDF format.

The initial versions of the Adagé compiler employ a peephole optimization pass that runs after code generation. This optimization is performed on the actual assembly language code output by the code generator before it is submitted to the NASM assembler. Select assembly instructions are reviewed and optimized as necessary. This type of analysis is deliberately limited in scope because of the complexity of the x86-64 instruction set.
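To illustrate the peephole idea (this is a sketch, not the actual Adagé optimizer), a pass might slide a two-instruction window over the emitted code and mark recognizably redundant pairs, such as a store to a location followed immediately by a load of the same location back into the same register:

#include <stdio.h>
#include <string.h>

typedef struct { char text[64]; int dead; } Instr;

/* Mark as dead any "LD reg, loc" that directly follows "ST loc, reg". */
static void peephole(Instr *code, int n)
{
    char st_dst[32], st_src[32], ld_dst[32], ld_src[32];
    for (int i = 0; i + 1 < n; i++) {
        if (sscanf(code[i].text,     "ST %31[^,], %31s", st_dst, st_src) == 2 &&
            sscanf(code[i + 1].text, "LD %31[^,], %31s", ld_dst, ld_src) == 2 &&
            strcmp(st_src, ld_dst) == 0 && strcmp(st_dst, ld_src) == 0) {
            code[i + 1].dead = 1;   /* value is already in the register */
        }
    }
}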

;-------------------------------------------------------
; Listing 3. Optimized Intermediate Code Stub
; Pseudo Code: var index: integer = 50
;              index = index + 100
;-------------------------------------------------------
MEM index 8 BYTES  ; Set aside 8 bytes in memory
ST @ADDR index 50  ; Store 50 at index location
L1:                ; Begin procedure
LD R1, STK1        ; Load address from stack location 1
LD R2, STK2        ; Load value from stack location 2
ADD [R1], R2       ; Add value in R2 to address in R1
RET                ; Return to caller
;-------------------------------------------------------

Notice that the optimizer eliminated two instructions and streamlined two others. The value on the stack at location STK2 is loaded directly into R2. The next instruction then adds the value in R2 directly to the memory location represented by "index". The address for "index" is contained in R1.

Code Generator

Code generation is the final phase of the compilation process. The major task of the code generator is to ensure that correct code is produced for the target microprocessor. The code generator uses the symbol table and the intermediate code in deciding which x86-64 instructions to use.

Machine-level code generation is the final phase of compiler operations. It usually consists of developing assembly language instructions to be passed to an assembler like NASM. Code generators use the intermediate code as input and translate those instructions into actual assembly code. Just note that perfect, or even good, code generation is an extremely difficult endeavor. Building a code generator depends on intimate knowledge of the assembly language instructions used by the processor and the syntax requirements of the assembler. For this reason, we narrowed our compiler to using the NASM assembler and targeted the x86-64 architecture operating in 64-bit long mode on the Windows 64-bit operating system.

;-----------------------------------------------
; Listing 4. Generated Code Snippet
; Pseudo Code: var index = 50
;              index = index + 100
;-----------------------------------------------
segment .data       ; Define data section
  index dq 50

segment .text       ; Define code section
  L1:               ; Begin procedure
  mov rax, 100      ; Load the value to be added
  add [index], rax  ; Add it to index in memory
  xor rax, rax      ; Report no error condition
  ret               ; Return to caller
;-----------------------------------------------

Listing 4 shows the generated assembly language code that is fed directly to the Netwide Assembler program for conversion to machine code. The previous three examples are very simplistic views on how the intermediate code is optimized and then converted to assembly language code by the compiler. However, they do show the process used by the Adagé compiler. Intermediate code is used because it is easier to optimize and it allows different back ends to be applied for different processors and operating systems.

Symbol Table

The symbol table is a data structure used by compilers to store key information about identifiers recognized by the lexical analyzer during the scanning process. The information stored in the symbol table is used by the parser and code generator. The types of information stored include variables, procedures, functions, methods, tasks, constants, labels, structures, file identifiers, and other compiler generated data such as identifier attributes.

A symbol table is a construct for holding each identifier and its associated attributes. The purpose of the symbol table is to preserve data on named identifiers that may include storage allocation, data type, scope, value, and in the case of subroutines, their names and arguments.

The symbol table is also used to check for reserved words and to ensure in-scope names are used only once. Identifier names are entered into the data structure by the lexical analyzer. Later, during syntax analysis, associated attributes are entered. During the semantic analysis and code generation phases, the symbol table is traversed. The semantic analyzer uses the symbol table to check that names are used consistently with their declared types. The code generator uses the symbol table to determine how much memory space is needed and what kind of run-time storage must be allocated to each name.

Symbol tables are typically implemented using a list, binary search tree, or hash table. The kind of structure used often depends on the number of anticipated names to be stored and the desired performance needed from the compiler. Our compiler uses IWBasic's dictionary data structure. This data structure is a hash table construct and represents one of the fastest methods for building and searching symbol tables.
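Since a dictionary is essentially a hash table with chaining, the core mechanism can be sketched in C as follows. The structure, hash function, and field names are illustrative, not IWBasic's or the compiler's actual implementation.

#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 1024        /* number of hash buckets */

/* Each bucket chains the identifiers that hash to it. */
typedef struct Symbol {
    char          *name;
    int            type;       /* e.g. integer, float, procedure */
    int            scope;
    struct Symbol *next;
} Symbol;

static Symbol *table[TABLE_SIZE];

/* Simple multiplicative string hash (djb2). */
static unsigned long hash(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* Return the entry for name, or NULL if it has not been declared. */
Symbol *lookup(const char *name)
{
    for (Symbol *p = table[hash(name)]; p != NULL; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;
}

/* Insert a new identifier at the head of its bucket's chain. */
Symbol *insert(const char *name, int type, int scope)
{
    Symbol *p = malloc(sizeof *p);
    unsigned long h = hash(name);
    p->name = malloc(strlen(name) + 1);
    strcpy(p->name, name);
    p->type  = type;
    p->scope = scope;
    p->next  = table[h];
    table[h] = p;
    return p;
}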

Figure 11. Symbol Table Entries
NAME       TYPE       MODE    VALUE            VISIBLE  BYTES ARGUMENTS         RANGE
---------- ---------- ------- ---------------- -------- ----- ----------------- -----------------
$index     integer    in      50               yes      8     n/a               int.min, int.max
$Add_Int   procedure  -       -                yes      -     integer, integer
$value     float      in out  1.2345e+308      yes      8     n/a               flt.min, flt.max
$big_str   string     in      "Assembly is..." yes      128   n/a               128 bytes
$int_array array      in out  0                yes      800   float             0, 99

Note that the contents of symbol tables can vary from application to application, and more than one symbol table may be used. Adagé actually builds three symbol tables. Appropriate data are extracted from these three tables and placed into a matrix that is saved and loaded into the runtime program each time it is executed.

Compiler Tools

No compiler is complete without a robust set of support tools. Requisite tools include: (1) preprocessor, (2) symbolic debugger, (3) disassembler, (4) validation suite, (5) text editor, (6) scanning program, (7) parsing program, (8) assembler program, and (9) linker.

The Adagé compiler comes with a number of essential tools including a symbolic debugger, validation suite, standard library files, and text editor. We will build the symbolic debugger, validation suite, text editor, scanning program, and parsing program as separate, time-consuming projects.

The assembler and linker are commercial products and must be carefully chosen before initiating compiler development. This is because assembler programs and linkers have distinct nuances, which must be built into the compiler. Additionally, a good disassembler should be downloaded and used to improve the overall effectiveness of Adagé executable programs.

Integrated Development Environment

IDEs are text editors that help programmers write syntactically correct code and then run the development tools that create executable program files.

Adagé has an accompanying Integrated Development Environment (IDE), which is developed as a project discussed elsewhere on this website. The IDE is a software application that provides comprehensive facilities to programmers for writing and debugging executable programs. It consists of a source code editor, automated build tools, and a debugger. The Adagé IDE provides intelligent code completion and referential help. The Adagé IDE is designed to maximize programmer productivity by providing tightly-bound components with a similar user interface.

Features of Adagé's IDE include the following capabilities:

Figure 12. Adagé Integrated Development Environment

Preprocessor

A preprocessor is a program that processes source files and provides a file that is used as input to another program such as a scanner program. The amount and type of preprocessing done depends on the nature of the preprocessor. Some preprocessors are only capable of performing relatively simple textual substitutions and macro expansions, while others have the power of full-fledged programming languages like the C preprocessor.

The initial version of the Adagé compiler does not provide a preprocessor. Preprocessing was left out on purpose because the use of preprocessor macros is inherently unsafe and often leads to unstable programs. However, the underlying NASM assembler contains many single- and multi-line macros that can be used when inline assembly code is incorporated in the main code. NASM's preprocessor capabilities provide for string substitution, text inclusion, and conditional compilation. Unfortunately, inline assembly code breaches Adagé's strict typing requirements and validation checking, so a warning is raised stating that strict type checking and validation are not being enforced. However, contract constructs (discussed later) using aspect specifiers are implemented to mitigate the typical security issues associated with assembly language.

~------------------------------------------------------
~ Listing 5. NASM Preprocessor Macro Code
~ Package: standard.inc
~ Macro definitions for this library
~------------------------------------------------------
package standard
  asm
    %ifndef standard_inc
    %define standard_inc
    extern _printf, __imp__printf
  
    segment .data
      TEXT.type EQU 0
      ...
      SCHAR.type EQU 31
    
    segment .text
      %imacro SCHAR 2  ; Define 8-bit signed integer
        %1 db %2
      %endmacro

      %imacro BCD 2    ; Define ten-byte BCD value
        %1 dt %2
      %endmacro
      ...
  end asm
end package
~------------------------------------------------------
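With these macro definitions in place, an assembly line such as SCHAR count, 42 would expand to count db 42, defining a one-byte signed value named count; the BCD macro works the same way for ten-byte BCD values.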

Symbolic Debugger

A debugger is a software tool that lets programmers execute and validate their code statement by statement. Debuggers are generally tightly integrated into their associated compiler.

Adagé's debugger is a software tool built right into the compiler. It allows programs to be executed and validated instruction by instruction. The Adagé debugger is a symbolic debugger. This makes it an interactive, real-time tool that requires information from both the compiler and linker. For this reason, specific compiler options can be selected during program development that place debugging information in the compiled program to aid in debugging.

Debugging features include the ability to examine and modify processor registers, data, and program memory; pause or stop program execution at defined program locations using breakpoints; single-stepping one instruction at a time through the code; and tracing the history of executed code.

The Adagé debugger includes the following features:

Figure 13. Adagé Symbolic Debugger

Disassemblers

The purpose of disassembly tools is to facilitate the understanding and validation of program source code, especially when the source code is unavailable.

There are a number of useful disassemblers available for Adagé programs. I recommend the freely-available disassemblers offered by Ida Pro, OllyDbg, and Agner Fog. Hex-Rays charges for their latest software version of Ida Pro, but provides free 32-bit versions of previous releases. OllyDbg can be downloaded free of charge from Oleh Yuschuk's website. Also, Agner Fog offers a free converter called objconv. This program is less capable than Ida Pro, but is easy to use and provides adequate disassembled listing information. I use it to dump and disassemble object and executable files.

I use disassemblers to perform the following analyses:

Figure 14. Ida Pro Disassembler Program

Note that Ida Pro is a recursive descent disassembler. Although there are numerous shortcomings related to the recursive descent approach, Ida Pro employs a large number of heuristic techniques to recover code that the recursive-descent process may have missed. Ida Pro is intelligent enough to distinguish data from instructions. One of the more valuable features of Ida Pro is its ability to generate annotated comments about variables and subprograms. Ida Pro also provides a listing of machine-level code adjacent to each assembly instruction. This makes Ida Pro listings both powerful and easy to understand.

Validation and Verification Suites

Validation and Verification tools are specifically designed to support the development of software used in high integrity applications. For instance, the Inspector program, part of the Adagé compiler, allows programmers to formally verify properties of their code such as information flow, dynamic errors, syntax correctness, code security, and functional properties.

One of the main tools in the Adagé arsenal is Inspector. It is a stand-alone program envisioned to help programmers validate and verify (V&V or ValVer) that their source code conforms to standard Adagé rules. Inspector is a valuable component program that has three fundamental functions:

Currently, the Adagé Inspector program is run from the command line in a console window. Inspector can be invoked simply by typing Inspector followed by the name of the file using full directory notation. However, I intend to integrate the Inspector program into the Adagé IDE as a selectable build option. Eventually, when all the functions of Inspector have been tested and are working properly, they will be designed right into the compiler so that source code validation and verification take place during compilation.

Inspector is meant to reduce or eliminate the need for extensive debugging. Debugging tends to be extremely problematic and may not even help in discovering the errors for which a programmer is searching. Inspector is designed to find specific problems before they occur in a program so that debugging becomes unnecessary in terms of tracking down program errors.

The Adagé programming language with its core annotations ensures that a program does not contain certain errors related to the flow and type of static information. To check for dynamic error conditions, Inspector checks for proof annotations, which consist of pre- and post-conditions embedded in user comments. The advantage of Inspector is that it can conduct validation and verification on source code package specifications, even when the code is not ready for compilation.

Formal verification of Adagé code includes checks for possible divide by zero conditions, out of range variable assignments, incorrect type assignments, incorrect array indexing, arithmetic overflow situations, un-allowed aliasing, anonymous type usage, subprogram overloading, inappropriate pointer assignments, illegal data coercion and conversion, invalid return types, and infinite loop states.

Beyond its utility as a code checker, the aspect clauses inserted in the interface sections of packages provide programmers with additional information about how a package's subprograms are supposed to behave. For instance, consider the information given in Listing 6.

~----------------------------------------------------
~ Listing 6. Typical Adagé Interface
~ No annotations appear in this interface version
~----------------------------------------------------
interface Total_Cost
  procedure Compute_Cost (x: in float)
end interface
~----------------------------------------------------

The interface shown in Listing 6 tells us very little about the procedure. All we know is that it takes a float value whose formal name is "x". It says nothing about what the procedure actually does or what "x" represents. Now, consider what happens when more descriptive parameters are used and annotations are added as shown in Listing 7.

~----------------------------------------------------------------------------------
~ Listing 7. Annotated Interface for Package Total_Cost
~ Aspect clauses are used by Inspector to check source code
~----------------------------------------------------------------------------------
interface Total_Cost
  ~# global total: float                    ~ Make global variables visible
     
  procedure Compute_Cost (price: in float)  ~ Use more descriptive parameter name
    ~# global out total                     ~ Set total as write only
    ~# pre in price > 0.0                   ~ Ensure actual input is > 0.0
    ~# post out total > 0.0                 ~ Ensure actual output is > 0.0
end interface
~----------------------------------------------------------------------------------

As shown in Listing 7, we now have a whole lot of useful information. For instance, we now know the following about the package:

  1. The package makes a global variable named total, of type float, visible to callers
  2. Compute_Cost takes a price (not just some value x) as its input
  3. Compute_Cost writes to total but does not read it (global out)
  4. The actual argument passed to price must be greater than 0.0
  5. Upon return, total is guaranteed to be greater than 0.0

Not only do we know more about what's going on with the procedure, the aspect clauses guide Inspector in making the following checks in the package body:

  1. That total is assigned a value on every path through Compute_Cost (it is declared global out)
  2. That Compute_Cost reads or writes no global variables other than total
  3. That the actual argument bound to price satisfies the pre-condition price > 0.0
  4. That the computed total satisfies the post-condition total > 0.0

Note that formal verification of subprogram properties is based on contracts expressed as pre-conditions, post-conditions, type invariants, and so on. Adagé annotations take the form of Adagé comments using the "~#" symbol and are only found in package specifications (interfaces). There are no annotations in package bodies unless refinement is required. This form lets the compiler treat annotations as human-readable comments while keeping them recognizable as annotations to the Inspector program. Since annotations are treated as comments by the compiler, they do not add to the size of programs. There are a number of pre-defined aspect clauses in addition to global, pre, post, initializes, and derives that can be used to perform Adagé program code validation and verification.

Besides providing information to Inspector and increasing user insight, why bother with annotated comments at all? The answer is that annotations provide the aspect clauses that tell Inspector which checks are to be performed. Typical checks include data-flow analysis, initialization of variables, data dependencies of subprograms, and range and boundary conditions. All Inspector checks are performed statically. Listing 8 displays both the interface and package body for the Total_Cost package.

~-----------------------------------------------------------------------------
~ Listing 8. Annotated Interface and Package Body for Total_Cost
~ The body implements the contract stated in the annotated interface
~-----------------------------------------------------------------------------
interface Total_Cost                        ~ Interface to package body
  ~# global total: float                    ~ Make global variables visible
     
  procedure Compute_Cost (price: in float)
    ~# global out total                     ~ Set total as write only
    ~# pre in price > 0.0                   ~ Ensure actual input is > 0.0
    ~# post out total > 0.0                 ~ Ensure actual output is > 0.0
end interface
  
package Total_Cost                          ~ Package body
  var total: float

  procedure Compute_Cost (price: in float)
    var tax: float
  begin
    tax = price * 0.05
    total = price + tax
  end procedure
end package
~-----------------------------------------------------------------------------

In the previous listing, we included the package body. Package bodies normally are private, meaning other packages and subprograms that import their subprograms and variables only have visibility over the interface. From a user standpoint, annotations provide much-needed information about how those subprograms work and what data are being passed to them.

Although Adagé provides a contract-based programming environment, it does not offer aspect-related specifications similar to the SPARK 2014 programming language. The reason for this decision is that aspect-type contracts add a whole new level of executable program code that greatly increases the complexity and size of both the compiler and user programs. For this reason, Adagé will use only annotated comments for now.

Scanning Programs: Lex and Flex

Lex and Flex are general-purpose scanner generators designed to be re-targetable and machine-independent. Scanners, sometimes called tokenizers, are programs that recognize lexical patterns in text. Both produce tables that are used by driver routines to create complete scanner programs.

Lex and Flex are scanner generators, which perform automatic lexical analysis on input source files. Scanning is accomplished based on information about the language in the form of regular expressions input by the programmer. Lex and Flex, originally designed for UNIX, output C/C++ source files. These files are then combined with a parser generated by a program such as Yacc or Bison. Since our scanner is handcrafted, we do not use Lex or Flex in any of our projects.

The reasons we are not using either Lex or Flex are that: (1) in my opinion, both of these programs have huge learning curves, and (2) we want to understand how scanners work by building our own from the ground up. Just recognize that when you hear the terms Lex and Flex, we are talking about automated scanner generators.

Parser Programs: Yacc, Bison, and PCYacc


Yacc, developed by AT&T, is a general purpose program for generating LALR(1) parsers based on grammars written in Backus-Naur Form (BNF). Originally developed for Unix operating systems, current versions can be found for the Windows OS. GNU Bison and PCYacc are reimplementations of Yacc and purportedly offer improvements and additional features over Yacc.

Yacc, Bison, and PCYacc only build parsers based on context-free grammars input by programmers. Compilers still require lexical analyzers such as Lex or Flex to perform scanning functions. All these parser generators have unique learning curves. They all require a C/C++ compiler, or in some cases a Java compiler, to create the parsing program. Naturally, each program produces C/C++/Java-compatible output files.

Another reasonable offering is the GOLD Parser. According to its developers, GOLD generates parsers that use a Deterministic Finite Automaton (DFA) for the tokenizer and an LALR(1) state machine for parsing. Unlike other parser generators, GOLD does not require you to embed your grammar into your source code. It saves the parse tables into a separate file, which is loaded by the parser engine at run time.

Again, since we are hand crafting our compiler using BASIC and inline assembly language, we do not use automated parser generators in any of our projects. Our purpose for building a compiler in the first place is three-fold: (1) specify a context-free grammar for our language, (2) write the algorithms necessary to implement a working parser, and (3) create symbol table structures to hold names and attributes. We always have the option of employing third-party parser generators later on if we find we need more powerful tools.

Assembler and Linker: NASM and GoLink

When a source code file is created by the code generator of a compiler, it is assembled into an object file with an assembler program. Once the source file has been successfully assembled into an object file, it is linked using a linker program. When all the object files have been linked, an executable program file is created and ready to use.

Adagé employs the Netwide Assembler (NASM) to translate the human-readable assembly source generated by the compiler's code generator into machine-level instructions. These instructions are written to an object file that can be linked together with other object files.

The job of a linker is to combine object modules saved in different files into a single executable program. Linkers also take care of memory allocation by assigning data and instructions to memory addresses in such a way that different modules do not overlap. Adagé uses the GoLink linker program to link object files into a single executable program file.
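Under the stated assumptions (NASM emitting a Win64 COFF object and GoLink producing the executable), the build pipeline looks roughly like the following; the file names and entry-point label are placeholders, and the exact GoLink switches should be confirmed against its manual:

nasm -f win64 program.asm -o program.obj
golink /console /entry Start program.obj kernel32.dll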

Assembler Program

An assembler is a type of computer program that translates the output of the code generator from human-readable assembly language into the machine-level format that can be executed by a particular microprocessor.

An assembler is a software program that translates assembly language source files into machine-level language that can be executed by a computer. Assembler software allows program developers to access, operate, and manage computers at the hardware level.

Adagé uses the Netwide Assembler (NASM) as its underlying assembler program. NASM is an x86/x86-64 assembler designed for portability and modularity. It supports a range of Windows-compatible file formats including Win32 and Win64 object file formats. Its syntax is designed to be simple and easy to understand, similar to Intel's, but less complex. Most importantly, NASM supports all currently known x86 architectural extensions and provides powerful macro constructs.

Here is a list of NASM's key features:

Linker Program

GoLink is a free linker that takes COFF object and resource files and creates executable files (EXE) or dynamic-link libraries (DLL). GoLink runs under 32-bit and 64-bit Windows operating systems. This is a full-featured linker that keeps file size to a minimum. One notable feature of GoLink is that Lib files are not needed to identify which functions reside in linked DLLs. Instead, GoLink looks inside the DLLs themselves to find the appropriate functions. GoLink also reports redundant program data and code.

A linker is a software program that combines object modules to form an executable program. Since Adagé allows you to write and compile separate modules in order to break up large programs into smaller pieces of manageable code, a program is needed to combine all these modules into one executable program. This is the job of the linker.

Another job of the linker is to replace symbolic addresses in program code with actual addresses. This is necessary even if only one module is being linked. Although there are many exceptional linkers to be found on the internet, I chose Jeremy Gordon's GoLink linker because of its features and because it works with Windows 64-bit Common Object File Format (COFF) files, which are produced by the Adagé compiler.

Here is a short list of GoLink's salient features:

NOTICE: More discussion to come as the Adagé compiler design and development project progresses.

