# Tame the dragon 1

Learn about compiler development and the dark arts of bending the fabric of the code you write.
The dawn of computers revolutionized the way we do things. It was the start of something big: past the industrial era, we entered the era of information. We soon realized that coding in `0`s and `1`s is not an efficient way to describe a solution.
Imagine typing

```
01101000 01100101 01101100 01101100 01101111
```

instead of just saying hello 😱
Compilers have been used ever since to let users describe computations in a high-level syntax. The way we write compilers has not changed that much from the 1959 COBOL compiler to JavaScript.
## What is a Compiler

A compiler takes your high-level representation and converts it to machine code. That's it!
```c
printf("Hi Mom!");
```

This gets converted to machine code by `gcc`, which is a C compiler and part of the GNU toolchain.
## Why learn Compiler Development

Making a compiler is an excellent test of your skills. From scanning to code generation, there are countless avenues to decide your path forward. Each decision and trade-off you make will mould the compiler into something different. The skills learned here will give you a deeper understanding of how computers do things. Piping the various parts together will teach you about code organization and the different paradigms for it. I mainly do it to test new languages out 🤓
## What is an Interpreter

For now, just remember that an interpreter, unlike a compiler, does not generate machine code ahead of time, i.e. no executable file is produced.
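To make the distinction concrete, here is a toy sketch of an interpreter (the module and tuple shapes are hypothetical, not from this post): it walks an expression tree and evaluates it on the spot, producing a value instead of an executable.

```elixir
defmodule ToyInterpreter do
  # Evaluate a tiny expression tree directly.
  # No machine code is generated and nothing is written to disk.
  def eval({:num, n}), do: n
  def eval({:add, left, right}), do: eval(left) + eval(right)
end

# Evaluate 1 + 2 straight from its tree form.
IO.inspect(ToyInterpreter.eval({:add, {:num, 1}, {:num, 2}}))
# => 3
```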
### The weird case of Python

Is Python interpreted or compiled? It's not easy to answer: Python uses both compilation and interpretation. Normally the Python source code file (`.py`) is compiled to another representation, and then that representation is interpreted. This mix of compilation and interpretation is also used by Java, C#, and possibly others, given there exists a tool. Even C or C++ can be interpreted, though they are generally compiled. A language merely defines a syntax, which can be implemented by both interpreters and compilers.
## Phases of Compilation
- Lexing / Scanning
- Parsing
- Code Generation
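These phases chain together naturally with Elixir's pipe operator. A hypothetical skeleton of the whole pipeline (the `lex/1`, `parse/1`, and `codegen/1` bodies here are stand-ins, not the real implementations we build later):

```elixir
defmodule Pipeline do
  # Stand-in lexer: split the source into characters.
  def lex(source), do: String.split(source, "", trim: true)

  # Stand-in parser: pass the tokens through unchanged.
  def parse(tokens), do: tokens

  # Stand-in code generator: join everything back into a string.
  def codegen(ast), do: Enum.join(ast)

  # Each phase's output becomes the next phase's input.
  def compile(source) do
    source
    |> lex()
    |> parse()
    |> codegen()
  end
end

IO.inspect(Pipeline.compile("1+1"))
```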
## Lexing
**All code is just words**

- What's the difference between a `.cpp` file and a `.py` file? It's just the file extension! There is nothing fundamentally different about `.cpp` files compared to `.py` files. They are just text files with specific extensions like `.c` or `.m`.
- Lexing (or Scanning) is always the first phase in a compilation pipeline. In this phase we assign meaning to the words inside the file so that appropriate actions can be taken upon them.
In a statement like

```
user@> 1 + 1
2
```

we have 3 tokens: `1`, `+`, and `1`. Do we consider `1` a string or a number? Addition only works on numbers, so the word `1` represents a number. What about this code:

```
user@> print "hello"
hello
```

Here we see that we are calling the `print` method with the value `"hello"`. `print` here is a function call that takes one argument and prints it; `print` is neither a number nor a string. To summarize:

- `1` -> Number
- `+` -> Binary function
- `print` -> Unary function
- `"hello"` -> String

Finding out what each word in the code means is the essence of Lexing.
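The classification above can be sketched as a toy function (hypothetical, not the lexer we build below), which takes a whole word and returns its category:

```elixir
# Map a word to the kind of token it represents.
classify = fn word ->
  cond do
    word =~ ~r/^[0-9]+$/ -> :number                 # 1 -> Number
    word == "+" -> :binary_function                 # + -> Binary function
    word == "print" -> :unary_function              # print -> Unary function
    String.starts_with?(word, "\"") -> :string      # "hello" -> String
    true -> :unknown
  end
end

IO.inspect(classify.("print"))
# => :unary_function
```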
## Code time

Our language of choice will be Elixir, a functional programming language. I have found functional languages to be a great choice for making compilers because of how effortlessly and fluently the different components can be piped together. Being functional also makes things easier to debug, as state is not mutated, which is a big W when building complex software like compilers.
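A quick illustration of that immutability claim, as a minimal sketch: "changing" data in Elixir always builds new data, and the original stays untouched.

```elixir
tokens = [{:num, "1"}]

# "Appending" creates a brand-new list; `tokens` itself is never mutated.
more_tokens = tokens ++ [{:op, "+"}]

IO.inspect(tokens)       # still [{:num, "1"}]
IO.inspect(more_tokens)  # [{:num, "1"}, {:op, "+"}]
```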
Note: Refer to this beautifully crafted website for any Elixir needs: hexdocs.pm
We will start by defining our entry point:
```elixir
def lexer(data) do
  a =
    data
    |> String.split("", trim: true)
    |> emp_expresssion([])

  IO.inspect(a)
end
```
### Explanation

- The `lexer` function will receive the text from a file after reading it.
- We use the `String.split` function with the separator `""` to split the text into individual characters (graphemes); the lexer then walks through these one by one, and spaces act as the separators between words.

```
print "hello" -> [print, "hello"]

# Something like `print"hello"` might work in other languages,
# but it will not work here because there is no whitespace
# separating the function from its argument.
# More complex languages have more sophisticated lexers,
# but for simplicity our words are separated by spaces.
```

- `trim: true` makes sure there aren't any empty strings at the start or the end.
- `|>` is used to pipe the result from one expression into another.
- We `|>` the result from `String.split` into `emp_expresssion`.
- `emp_expresssion` takes 2 arguments: the result being piped in, and `[]`, an empty list. In Elixir we would say that `emp_expresssion/2` takes in the list of characters after the split and lexes them.
- The empty list is used as an accumulator for the recursion, which will make sense in a bit.
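You can check what that `String.split` call actually produces; a minimal sketch:

```elixir
# Splitting on "" breaks the string into individual graphemes;
# trim: true drops the empty strings at the edges.
chars = String.split("1+1", "", trim: true)
IO.inspect(chars)  # ["1", "+", "1"]
```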
Let's start by defining `emp_expresssion`:
```elixir
def emp_expresssion([token | rest], acc) do
  case token do
    "" ->
      emp_expresssion(rest, acc)

    " " ->
      emp_expresssion(rest, acc)

    "+" ->
      emp_expresssion(rest, acc ++ [{:op, "+"}])

    "-" ->
      emp_expresssion(rest, acc ++ [{:op, "-"}])

    "/" ->
      emp_expresssion(rest, acc ++ [{:op, "/"}])

    "*" ->
      emp_expresssion(rest, acc ++ [{:op, "*"}])

    "(" ->
      emp_expresssion(rest, acc ++ [{:identifier, "("}])

    ")" ->
      emp_expresssion(rest, acc ++ [{:identifier, ")"}])

    "\"" ->
      string = lex_string(rest, "")
      size = String.length(string)
      rest = Enum.drop(rest, size + 1)
      emp_expresssion(rest, acc ++ [{:string, string}])

    _ ->
      cond do
        Empless.Lexer.is_number(token) ->
          num = lex_number(rest, token)
          size = String.length(num)
          rest = Enum.drop(rest, size - 1)
          emp_expresssion(rest, acc ++ [{:num, num}])

        is_char(token) ->
          identifier = lex_identifier(rest, token)
          size = String.length(identifier)
          rest = Enum.drop(rest, size - 1)

          if is_identifier(identifier) do
            emp_expresssion(rest, acc ++ [{:identifier, identifier}])
          else
            raise "Unknown identifier found"
          end

        true ->
          {:error, "Unknown lexeme found"}
      end
  end
end

# Base case: no characters left, so return the accumulated tokens.
def emp_expresssion([], acc), do: acc
```
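The helper functions (`lex_number`, `lex_string`, `is_char`, `is_identifier`, and so on) are defined elsewhere in Empless. Here is one possible sketch of them, assuming single-character tokens; the names match the code above but the bodies are my guesses, not the actual Empless implementations:

```elixir
defmodule LexerHelpers do
  # True when the single-character token is a digit.
  def is_number(token), do: token =~ ~r/^[0-9]$/

  # True when the single-character token is a letter.
  def is_char(token), do: token =~ ~r/^[A-Za-z]$/

  # Only `print` is a known identifier in this toy language (assumption).
  def is_identifier(word), do: word == "print"

  # Keep consuming digits from the stream, building on the first digit `acc`.
  def lex_number([head | rest], acc) do
    if is_number(head), do: lex_number(rest, acc <> head), else: acc
  end

  def lex_number([], acc), do: acc

  # Keep consuming characters until the closing quote.
  def lex_string(["\"" | _rest], acc), do: acc
  def lex_string([head | rest], acc), do: lex_string(rest, acc <> head)
  def lex_string([], acc), do: acc

  # Keep consuming letters to build up an identifier.
  def lex_identifier([head | rest], acc) do
    if is_char(head), do: lex_identifier(rest, acc <> head), else: acc
  end

  def lex_identifier([], acc), do: acc
end
```

Note how `lex_number` stops at the first non-digit: for `123fb`, it returns `123`, the lexer then reads `fb` as an identifier, and `is_identifier` rejects it, which gives exactly the error behavior described below.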
### Explanation

Let's divide the function into smaller, bite-sized pieces, starting with the function declaration:

```elixir
def emp_expresssion([token | rest], acc) do
```

Here we have defined a function named `emp_expresssion`. You might notice something weird inside the brackets: that is called pattern matching. As mentioned above, `emp_expresssion/2` takes 2 arguments, and using pattern matching with the `|` operator we split the passed list into its first element, `token`, and the remaining list, `rest`:
```elixir
iex> [head | rest] = [1, 2, 3]
iex> head
1
iex> rest
[2, 3]
```
The rest of the code consists of case branches, kind of like if-else chains in C/C++. Let's dive into one of the branches, specifically the one that identifies whether a word represents a number.

Note: We will treat numbers as whole integers, not decimals/floats. This makes the implementation a lot easier, and afterwards you can update it to work with floats.
```elixir
cond do
  Empless.Lexer.is_number(token) ->
    num = lex_number(rest, token)
    size = String.length(num)
    rest = Enum.drop(rest, size - 1)
    emp_expresssion(rest, acc ++ [{:num, num}])
```
Forget about the `cond` part for now and let's look at the code inside the block. It first calls another function, `Empless.Lexer.is_number(token)`, which takes the token/head that we split off (remember?). If that function returns `true`, it means we have encountered a number and have to extract all of it from the text.
```
1234 + 4566
^
|
```

`1` is a number, but it is not present on its own: it's part of a single bigger number, i.e. `1234`. If the first character in a word is a digit, we keep lexing until we reach the end of the number or hit an error (`123fb + 45` should give an error because the first word cannot be categorized as a number or a string).
We call another function, `lex_number`, which extracts the number from the stream of text; then we move the cursor forward and recursively call the function again. Let me give you an example:
```
1234 + 456
^
|
# First character is a digit, so `lex_number` is called and returns 1234, the full number.
# We have lexed this word and drop it to continue the scanning phase.

+ 456
^
|
# `+` is encountered, which represents a binary operation. We drop `+` from the string as well.

456
^
|
# First character is a digit, so `lex_number` is called and returns 456.
# We drop this from the string as well and are left with an empty string.
```
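The cursor movement in that walkthrough is what the `Enum.drop` calls implement; a minimal sketch (the list below is a hand-written stand-in for the character stream):

```elixir
# After matching the first digit "1" of "1234 + 456",
# `rest` holds the remaining characters of the stream.
rest = ["2", "3", "4", " ", "+", " ", "4", "5", "6"]
num = "1234"

# The matched token already consumed one character,
# so drop length - 1 more to land past the number.
rest = Enum.drop(rest, String.length(num) - 1)
IO.inspect(rest)  # [" ", "+", " ", "4", "5", "6"]
```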
Successfully parsed the whole string 🤟🏼
The `emp_expresssion` function contains a few more conditions, for:

- strings that start with `"`
- all the binary operators `-`, `*`
- parentheses

which are just variations of the code we just worked through.
## Tokens

In the code above you will see statements like

```elixir
emp_expresssion(rest, acc ++ [{:op, "+"}])
```

The question is: what is that `acc` list, and what are `:op` and `"+"` inside it? Remember we said that we are giving meaning to words, so we attach a tag to the word, like `:op`, which means an operation, alongside the `"+"` itself. For numbers we use `:num` together with the number itself.
After running this program we get a single list of tokens, like:

```elixir
iex> lexer("1+1")
[{:num, "1"}, {:op, "+"}, {:num, "1"}]
```
One more thing: notice how we have given meaning to all the words in the string. That's how we do Lexing, the first phase of compilation.
Code discussed here is available at: Empless
## Glossary

- Word: a sequence of characters/digits representing something
  - `1234` represents a number
  - `"hello"` represents a string