I have been trying to create my own compiler for a simple language like C (by language I mean it don’t contains many construct, sure programming correctly in C is not easy :P) but I couldn’t find any good tutorial which go through every thing about creating compilers. And I don’t wanted to go through a 1000 pages Dragon Book (well who would :o). I found some other books too like Writing Compilers and Interpreters: A Software Engineering Approach, 3rd Edition but it concentrates on Top Down parsers instead of Bottom Up Parsers (don’t worry I will tell you what they mean). So, finally (Believe me it required so much of motivation) I decided to go through Dragon Book. And I was able to create a working compiler for a subset of C language for x86. The source code could be found here: CCompiler – GitHub. But I don’t think you would want to go through the source code because 1) The project is difficult to understand and 2) The way I wrote the code it is even more difficult to read it (After all it was just a test project).
Now, I am writing a new language named Leopard (I love Leopards 🙂 ). Its source code will be good for sure. You can find more about the code here Leopard – GitHub. It will run on a Virtual Machine and is Garbage Collected and JITed. In this and the following blogs I will write about how to write Compilers, all the problems I faced while writing the Compiler, Assembler, Virtual Machine etc. I will tell you a solution for them. I didn’t use any Compiler Writers like Lex or Yacc. Instead everything has been written by hand (that is the beauty of Compilers, it is hard isn’t it :D). Compiler and Assembler has been written in C#, Mono and Virtual Machine in Java (yes, I want to write it in C# but I want to try my hands over Java too, lets see if Java is any better than C# :P).
Pre-requisites for Compilers are so many. You must know Theory of Computation, many Data Structures, any Assembly language (I would prefer x86) and many others. But no one would want to learn all these right? (It is human nature isn’t it? Who wants to study). So, I will tell you about all these things. These are things I am going to write about:
- Introduction to Compilers, Interpreters, Virtual Machines
- Automata, Languages, Grammars.
- Semantic Analyser
- Intermediate Code Generation
- Code Generation
- Virtual Machine
- Garbage Collection
I know its too much, but if you want to write your own compiler you have to do that. So, at the end of this blog series you would have the knowledge of how to develop a Compiler, Assembler, Virtual Machine with JIT and Garbage Collection. We will not use LLVM or any other compiler infrastructure project instead we will start with scratch.
Lets start with Introduction to Compilers.
It is said that there are two types of languages in this world.
Compiled Languages compiles directly to the machine code. C, C++, Objective-C are examples of compiled languages. The code written in C is compiled directly to machine code. This is the typical work flow of a program written in compiled languages.
Source files contains the code, Compiler compiles it into Object File. These object files are then linked to different libraries (if required) and an executable file is produced. Now this executable file contains only binary code. This binary code can be executed directly by the processor. Compiled code gives us the best speed as it is directly executed by processor. But we do not have any control over such compiled code. It can be malicious and if executed by the processor it could lead to havoc.
Interpreted Languages do not compiles to machine code in one go. Rather a program known as Interpreter will read a line of source code, checks if it is correct and then executes it. If it founds an error then the program will be stopped by the Interpreter. Examples of such languages are BASIC, Shell Scripting, Perl, Python, Ruby. Unlike Compiler, processor doesn’t executes the code directly. A line is first converted to code and then executed by the processor. Interpreted Languages gives worst speed. But, we have a control over such code. Interpreter knows which code is going to be executed and it could check whether that code could do some malicious operations or not. If it does then it will stop it right there.
As you can see, we could have good security but on the cost of speed and good speed on the cost of security. In 1970s, LISP was introduced. It was the first language to bring something in between of Compiled and Interpreted Languages. Later with Java this concept became very popular. Here is the basic idea about it.
The source code is compiled into Byte Code. This Byte Code is then executed by a Virtual Machine.
This is the work flow of a Java Program. The execution of Byte Code by Virtual Machine can be done is two ways:
- Interpreting the Byte Code: Virtual Machine will behave similarly as an Interpreter to Byte Code. It will read each instruction an execute it. The speed of a program executed now is much better than Interpreted Languages but less than Compiled Languages.
- Just In Time Compilation: Instead of read and converting each instruction of Byte Code one at a time, Virtual Machine can compile some amount of Byte Code (may be one function or one compound statement block) instruction into machine code and save the compiled code for future use. Whenever it founds that the Byte Code instructions to be compiled have already been compiled it will reload the saved compiled code. This increases the speed significantly as compared to “Interpreting the Byte Code”. It is even seen that JIT has surpassed the performance of C/C++ Compiled code. This is because at runtime JIT has information about the current processor being used and it can then apply some optimizations specific to that processor. But Optimized C/C++ Code is faster than JIT. When we will be developing JIT we will talk more about it.
Not only Java but today many languages follow this approach. It includes C#, VB.NET, C++/CLI, Python, Ruby.
Next time we will see a block diagram of Compiler and how it works.