Compiler Structure

This post is a follow-up to my previous post. Here I will talk about the parts of a compiler. A Compiler usually contains the following parts:

  • Lexical Analyser (Lexer)
  • Syntax Analyser (Parser)
  • Semantic Analyser
  • Symbol Table
  • Intermediate Code Generator
  • Code Generator
  • Code Optimizer

The Source Code goes through these phases and is finally converted to the Target Language (which can be Machine Language or Byte Code).

[Figure: source code flowing through the compiler phases]

Let us talk briefly about each of them.

Lexical Analysis: This is the phase when the source code is converted into a stream of Tokens. These tokens are meaningful character strings that have a predefined meaning in the syntax of a language. For example, in C “int” could be converted to a Token called “Integer Type”. This token is then used by the subsequent phases of the compiler, telling them that “int” denotes the “Integer Type”.
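
To make this concrete, here is a minimal sketch of a lexer in Python. The token names, the regular expressions, and the tiny input are invented for illustration; a real C lexer has far more rules:

```python
import re

# Token specification: a name and a regex for each token kind (illustrative only).
TOKEN_SPEC = [
    ("INT_TYPE", r"\bint\b"),      # the keyword "int" becomes an "Integer Type" token
    ("NUMBER",   r"\d+"),          # integer literals
    ("IDENT",    r"[A-Za-z_]\w*"), # identifiers such as variable names
    ("ASSIGN",   r"="),
    ("SEMI",     r";"),
    ("SKIP",     r"\s+"),          # whitespace is discarded, not turned into a token
]

MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def lex(source):
    """Convert source text into a stream of (kind, text) tokens."""
    tokens, pos = [], 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if not m:
            raise SyntaxError(f"unexpected character {source[pos]!r}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(lex("int x = 42;"))
# [('INT_TYPE', 'int'), ('IDENT', 'x'), ('ASSIGN', '='), ('NUMBER', '42'), ('SEMI', ';')]
```

Note that the keyword pattern is listed before the identifier pattern, so “int” is recognized as a keyword while “integer” still lexes as a plain identifier.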

Syntax Analysis: This is the phase when the source code is analysed and converted into its Parse Tree or Syntax Tree. If any error is found in this phase, it is reported to the user. This phase takes the stream of Tokens from the Lexical Analyser. A Syntax Analyser is essentially an automaton (typically a pushdown automaton, which is less powerful than a Turing Machine). It is created on the basis of the syntax of the language, so the Syntax Analysers for C and for C++ are different. Most Syntax Analysers create an Abstract Syntax Tree (AST), which depicts the structure of the source code.
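
As a small illustration, here is a hand-written parser for arithmetic expressions in Python. The grammar and the tuple-shaped AST are invented for this sketch, and it is a top-down (recursive-descent) parser simply because that is the shortest kind to write by hand:

```python
import re

def tokenize(src):
    """A throwaway lexer for this sketch: numbers and single-character operators."""
    return re.findall(r"\d+|[+\-*/()]", src)

def parse(tokens):
    """Recursive-descent parser producing an AST of nested tuples."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")  # report to the user
        pos += 1
        return tok

    def factor():                      # factor : NUMBER | '(' expr ')'
        if peek() == "(":
            take("(")
            node = expr()
            take(")")
            return node
        return ("num", int(take()))

    def term():                        # term : factor (('*'|'/') factor)*
        node = factor()
        while peek() in ("*", "/"):
            node = (take(), node, factor())
        return node

    def expr():                        # expr : term (('+'|'-') term)*
        node = term()
        while peek() in ("+", "-"):
            node = (take(), node, term())
        return node

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected trailing token {peek()!r}")
    return tree

print(parse(tokenize("1+2*3")))
# ('+', ('num', 1), ('*', ('num', 2), ('num', 3)))
```

The grammar's layering (expr over term over factor) is what encodes operator precedence: “1+2*3” groups the multiplication first.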

Semantic Analysis: In this phase the Compiler checks for semantic errors. Semantic errors mainly consist of type errors and uses of variables that have not been declared. This phase makes heavy use of the Symbol Table: for every variable it encounters, it looks up the variable’s type in the Symbol Table and checks it against the language rules. It also performs type conversions based on the language rules. For example, the expression “1.5 + 2” gives a float, but “2” is an integer; in this phase the compiler will convert “2” to a float and apply the floating-point addition operator.
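
A toy version of this check might look like the following Python sketch. The tuple-shaped AST and the int-to-float promotion rule are assumptions made for illustration:

```python
def check(node):
    """Return (typed_ast, type) for a tiny expression language of tuples."""
    kind = node[0]
    if kind == "num":
        ty = "float" if isinstance(node[1], float) else "int"
        return node, ty
    op, left, right = node
    left, lt = check(left)
    right, rt = check(right)
    if lt == rt:
        return (op, left, right), lt
    # Assumed language rule: mixing int and float promotes the int operand.
    if lt == "int":
        left = ("to_float", left)
    else:
        right = ("to_float", right)
    return (op, left, right), "float"

# "1.5 + 2": the integer operand gets an explicit conversion node.
typed, ty = check(("+", ("num", 1.5), ("num", 2)))
print(typed, ty)
# ('+', ('num', 1.5), ('to_float', ('num', 2))) float
```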

Symbol Table: This part of the compiler is responsible for storing the type information of all variables and functions. For example, if you have declared a variable named “foo” as an integer, the Symbol Table will record that. This part is used by all stages of the compiler. It is mainly built during the Syntax Analysis stage.
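
A common way to implement this is a stack of dictionaries, one per scope. The class and method names below are invented for this sketch:

```python
class SymbolTable:
    """Nested scopes as a stack of dictionaries mapping name -> type."""

    def __init__(self):
        self.scopes = [{}]          # start with the global scope

    def enter_scope(self):
        self.scopes.append({})      # e.g. on entering a function body or block

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name, type_):
        if name in self.scopes[-1]:
            raise NameError(f"'{name}' already declared in this scope")
        self.scopes[-1][name] = type_

    def lookup(self, name):
        # Search from the innermost scope outwards, as semantic analysis does.
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        raise NameError(f"'{name}' has not been declared")

table = SymbolTable()
table.declare("foo", "int")     # "foo" declared as an integer
table.enter_scope()
table.declare("foo", "float")   # an inner "foo" shadows the outer one
print(table.lookup("foo"))      # float
table.exit_scope()
print(table.lookup("foo"))      # int
```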

Intermediate Code: Intermediate Code may or may not be present in a compiler; it is up to the compiler designers whether they want it to be generated. Intermediate Code can take many forms, such as an Abstract Syntax Tree, Three Address Code, or Static Single Assignment (SSA) form.
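
As an illustration of one of these forms, here is a sketch that flattens a tuple-shaped AST (a shape assumed here for illustration, not taken from any real compiler) into Three Address Code, where each instruction has at most one operator:

```python
def to_tac(node):
    """Flatten a tuple AST into a list of three-address instructions."""
    code, counter = [], 0

    def walk(n):
        nonlocal counter
        if n[0] == "num":
            return str(n[1])           # constants are used directly as addresses
        op, left, right = n
        l, r = walk(left), walk(right)
        counter += 1
        tmp = f"t{counter}"            # fresh temporary to hold the result
        code.append(f"{tmp} = {l} {op} {r}")
        return tmp

    walk(node)
    return code

# "1 + 2 * 3" as a tuple AST, the kind of tree a parser might hand us:
print(to_tac(("+", ("num", 1), ("*", ("num", 2), ("num", 3)))))
# ['t1 = 2 * 3', 't2 = 1 + t1']
```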

Code Optimizer: Code Optimizers mainly work on Intermediate Code, applying many algorithms to optimize it. Like Intermediate Code, Code Optimization is not a mandatory part of a compiler. Many compilers, such as GCC and Clang/LLVM, even provide command-line switches to turn Code Optimization on or off and to set its level.
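
One classic optimization is constant folding: sub-expressions made only of constants are evaluated at compile time instead of at run time. A sketch over an assumed tuple-shaped AST:

```python
import operator

# Map source operators to Python functions so we can evaluate at compile time.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold(node):
    """Replace constant sub-trees with their computed value."""
    if node[0] == "num":
        return node
    op, left, right = node
    left, right = fold(left), fold(right)
    if left[0] == "num" and right[0] == "num":
        return ("num", OPS[op](left[1], right[1]))  # fold this sub-expression
    return (op, left, right)

print(fold(("+", ("num", 1), ("*", ("num", 2), ("num", 3)))))
# ('num', 7)
```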

Code Generation: Code Generation is the final phase of the compiler. In this phase the target language code is generated, which can be Byte Code or x86 Assembly Code.
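
As a toy example, here is a code generator that walks an assumed tuple-shaped AST and emits instructions for a made-up stack machine (the mnemonics PUSH/ADD/SUB/MUL are invented for this sketch):

```python
# Map each AST operator to a stack-machine mnemonic (invented instruction set).
OPCODES = {"+": "ADD", "-": "SUB", "*": "MUL"}

def gen(node, out):
    """Emit post-order stack-machine code for a tuple AST into `out`."""
    if node[0] == "num":
        out.append(f"PUSH {node[1]}")
        return out
    op, left, right = node
    gen(left, out)      # operands are pushed first...
    gen(right, out)
    out.append(OPCODES[op])  # ...then the operator pops them and pushes the result
    return out

print(gen(("+", ("num", 1), ("*", ("num", 2), ("num", 3))), []))
# ['PUSH 1', 'PUSH 2', 'PUSH 3', 'MUL', 'ADD']
```

Emitting in post-order is what makes a stack machine work: by the time an operator instruction runs, its operands are already on top of the stack.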

In the next post I will start with Lexical Analysis phase of compiler.

Cheers 😀

Introduction to Compilers

Hey all,

I have been trying to create my own compiler for a simple language like C (by simple I mean it doesn’t contain many constructs; sure, programming correctly in C is not easy :P), but I couldn’t find any good tutorial that goes through everything about creating compilers. And I didn’t want to go through the 1000-page Dragon Book (well, who would :o). I found some other books too, like Writing Compilers and Interpreters: A Software Engineering Approach, 3rd Edition, but it concentrates on Top Down parsers instead of Bottom Up parsers (don’t worry, I will tell you what these mean). So, finally (believe me, it required so much motivation) I decided to go through the Dragon Book. And I was able to create a working compiler for a subset of the C language targeting x86. The source code can be found here: CCompiler – GitHub. But I don’t think you would want to go through the source code, because 1) the project is difficult to understand and 2) the way I wrote the code makes it even harder to read (after all, it was just a test project).

Now, I am writing a new language named Leopard (I love Leopards 🙂 ). Its source code will be good, for sure. You can find more about the code here: Leopard – GitHub. It will run on a Virtual Machine, is Garbage Collected, and is JITed. In this and the following posts I will write about how to write Compilers, all the problems I faced while writing the Compiler, Assembler, Virtual Machine etc., and the solutions I found for them. I didn’t use any compiler generators like Lex or Yacc; instead, everything has been written by hand (that is the beauty of Compilers, it is hard isn’t it :D). The Compiler and Assembler have been written in C# (Mono), and the Virtual Machine in Java (yes, I wanted to write it in C#, but I want to try my hand at Java too; let’s see if Java is any better than C# :P).

There are many prerequisites for Compilers. You must know Theory of Computation, many Data Structures, an Assembly language (I would prefer x86), and much more. But no one would want to learn all these first, right? (It is human nature, isn’t it? Who wants to study.) So, I will tell you about all these things along the way. These are the things I am going to write about:

  1. Introduction to Compilers, Interpreters, Virtual Machines
  2. Automata, Languages, Grammars
  3. Lexer
  4. Parser
  5. Semantic Analyser
  6. Intermediate Code Generation
  7. Code Generation
  8. Assembler
  9. Virtual Machine
  10. Garbage Collection
  11. JIT

I know it’s a lot, but if you want to write your own compiler you have to do it. So, at the end of this blog series you will have the knowledge of how to develop a Compiler, Assembler, and Virtual Machine with JIT and Garbage Collection. We will not use LLVM or any other compiler infrastructure project; instead we will start from scratch.

Let’s start with an Introduction to Compilers.

It is said that there are two types of languages in this world.

  • Compiled
  • Interpreted

Compiled Languages compile directly to machine code. C, C++, and Objective-C are examples of compiled languages. The code written in C is compiled directly to machine code. This is the typical work flow of a program written in a compiled language:

[Figure: compilation work flow]

Source files contain the code, and the Compiler compiles them into Object Files. These object files are then linked with different libraries (if required), and an executable file is produced. This executable file contains only binary code, which can be executed directly by the processor. Compiled code gives us the best speed, as it is executed directly by the processor. But we do not have any control over such compiled code: it can be malicious, and if executed by the processor it could wreak havoc.

Interpreted Languages do not compile to machine code in one go. Rather, a program known as an Interpreter reads a line of source code, checks whether it is correct, and then executes it. If it finds an error, the Interpreter stops the program. Examples of such languages are BASIC, Shell Scripting, Perl, Python, and Ruby. Unlike with a Compiler, the processor doesn’t execute the code directly: a line is first converted to code and then executed by the processor. Interpreted Languages give the worst speed. But we have control over such code. The Interpreter knows which code is going to be executed, and it can check whether that code could perform some malicious operation; if it does, the Interpreter will stop it right there.

As you can see, we can have good security at the cost of speed, and good speed at the cost of security. LISP (which dates back to the late 1950s) was among the first languages to bring something in between Compiled and Interpreted Languages. Later, with Java, this concept became very popular. Here is the basic idea.

The source code is compiled into Byte Code. This Byte Code is then executed by a Virtual Machine.

[Figure: Java compilation and execution work flow]

This is the work flow of a Java program. The execution of Byte Code by the Virtual Machine can be done in two ways:

  • Interpreting the Byte Code: The Virtual Machine behaves like an Interpreter for the Byte Code. It reads each instruction and executes it. The speed of a program executed this way is much better than that of Interpreted Languages but less than that of Compiled Languages.
  • Just In Time Compilation: Instead of reading and converting each Byte Code instruction one at a time, the Virtual Machine can compile a chunk of Byte Code (maybe one function or one compound statement block) into machine code and save the compiled code for future use. Whenever it finds that the Byte Code instructions to be compiled have already been compiled, it reloads the saved compiled code. This increases the speed significantly compared to “Interpreting the Byte Code”. JIT has even been seen to surpass the performance of compiled C/C++ code; this is because at runtime the JIT has information about the current processor being used and can apply optimizations specific to that processor. In most cases, though, optimized C/C++ code is still faster than JIT. We will talk more about this when we develop the JIT.
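
To make the first approach concrete, here is a minimal Python sketch of a virtual machine interpreting a made-up stack byte code, reading each instruction and executing it (the instruction set is invented for illustration):

```python
def run(program):
    """Interpret a list of stack-machine instructions and return the result."""
    stack = []
    for instr in program:
        parts = instr.split()
        if parts[0] == "PUSH":
            stack.append(int(parts[1]))     # push a constant onto the stack
        elif parts[0] == "ADD":
            b, a = stack.pop(), stack.pop() # pop two operands, push their sum
            stack.append(a + b)
        elif parts[0] == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"unknown instruction {instr!r}")
    return stack.pop()

# The byte code for "1 + 2 * 3" on this toy machine:
print(run(["PUSH 1", "PUSH 2", "PUSH 3", "MUL", "ADD"]))  # 7
```

A JIT, by contrast, would translate such a chunk of instructions into native machine code once and reuse that translation on later executions.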

Not only Java but many languages today follow this approach, including C#, VB.NET, C++/CLI, Python, and Ruby.

Next time we will see a block diagram of Compiler and how it works.

Cheers 😀