Tame the dragon

Learn about compiler development


Learn about the dark arts of bending the fabric of code you write

Table of contents

  1. What is a Compiler
  2. Why learn Compiler Development
  3. What is an Interpreter
  4. The weird case of Python
  5. Phases of Compilation
  6. Lexing
  7. Code time
  8. Tokens
  9. Glossary

The dawn of computers revolutionized the way we do things. It was the start of something big: past the industrial era, we entered the era of information. We soon realized that coding in 0s and 1s is not an efficient way to describe a solution.

Imagine typing 01101000 01100101 01101100 01101100 01101111 instead of just saying hello 😱

Compilers have been used ever since to let users describe computations in a high-level syntax. The way we write compilers has not changed that much from the 1959 COBOL compiler to today's JavaScript engines.

What is a Compiler

A compiler takes your high-level representation and converts it to machine code. That's it!

// hello.c
#include <stdio.h>
int main(void) { printf("Hi Mom!\n"); }

This gets converted to machine code by gcc, the C compiler from the GNU project: running gcc hello.c -o hello produces an executable you can run.

Why learn Compiler Development

Making a compiler is an excellent test of your skills. From scanning to code generation, there are countless avenues to decide your path forward, and each decision and trade-off you make will mould the compiler into something different. The skills learned here will give you a deeper understanding of how computers do things, and piping the various parts together will teach you about code organization and the different paradigms for it. I mainly do it to test new languages out 🤓

What is an Interpreter

For now, just remember that an interpreter, unlike a compiler, does not generate machine code ahead of time, i.e. no executable file is produced.

The weird case of Python

Is Python interpreted or compiled? It's not easy to answer: Python uses both compilation and interpretation. Normally the Python source file (.py) is compiled to another representation (bytecode), and then that representation is interpreted. This mix of compilation and interpretation is also used by Java, C#, and others. Even C or C++ can be interpreted, although they are generally compiled, given there exists a tool for it. A language merely defines the syntax; it can be implemented by both interpreters and compilers.

Phases of Compilation

  • Lexing / Scanning
  • Parsing
  • Code Generation

Lexing

All code is just words

  • What's the difference between a .cpp file and a .py file? It's just the file extension! There is nothing fundamentally different about .cpp files compared to .py files. They are just text files with specific extensions.

  • Lexing or Scanning is always the first phase in a compilation pipeline. In this phase we assign meaning to the words inside the file so that appropriate actions can be taken upon them.

    • In a statement like
    user@> 1 + 1
    2
    

    We have 3 tokens: 1, +, and 1.
    Do we consider 1 a string or a number? Addition only works on numbers, so the word 1 represents a number.

    What about this code:

    user@> print "hello"
    hello
    

    Here we see that we are calling the print function with the value hello.
    print here is a function call that takes one argument and prints it; print is neither a number nor a string.

    To Summarize:
    1 -> Number
    + -> Binary function
    print -> Unary function
    "hello" -> String

    Finding out what each word in the code means is the essence of Lexing.
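In our Elixir lexer below, this "meaning" will literally be a tag attached to each word. As a preview, the token format we will build later in this post looks like this:

  [{:num, "1"}, {:op, "+"}, {:string, "hello"}, {:identifier, "print"}]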

Code time

Our language of choice will be Elixir, a functional programming language. I have found functional languages to be a great choice for making compilers because of how the different components can be piped together almost effortlessly and fluently.
Being functional also means easier debugging, since state is not mutated, which is a big W when building complex software like compilers.
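Here is the kind of pipeline this enables. This is only a sketch: parse/1 and codegen/1 are hypothetical later phases that we will not build in this post.

  source
  |> lexer()
  # hypothetical later phases of the compiler
  |> parse()
  |> codegen()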

Note: Refer to this beautifully crafted website for any Elixir needs: hexdocs.pm

We will start by defining our entry point

  def lexer(data) do
    tokens =
      data
      # split the source text into a list of single characters
      |> String.split("", trim: true)
      # lex them, starting from an empty accumulator
      |> emp_expression([])

    IO.inspect(tokens)
  end

Explanation

  • The lexer function will receive the text from a file after reading it.
  • We use the String.split function with "" as the separator, which splits the text into a list of individual characters:
print "hello" -> ["p", "r", "i", "n", "t", " ", "\"", "h", "e", "l", "l", "o", "\""]
    Something like print"hello" might work in other languages, but it will not work here because there is no whitespace to separate the function from its arguments. More complex languages have better lexers, but for simplicity we rely on spaces to separate words.
  • The trim: true option makes sure no empty strings end up at the start or the end of the list
  • |> is used to pipe the result from one expression into the next
  • We |> the result from String.split into emp_expression
  • emp_expression takes 2 arguments:
    1. The list that is being piped in
    2. [] an empty list (the accumulator)
    In Elixir we would say emp_expression/2, i.e. a function named emp_expression that takes 2 arguments
  • emp_expression/2 takes in the list of characters produced by the split and lexes them
  • The empty list is used as an accumulator for the recursion, which will make sense in a bit
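To make the splitting concrete, here is what String.split does in iex, along with one assumed way of feeding a source file into the lexer (the file name program.emp is hypothetical):

iex> String.split("1 + 1", "", trim: true)
["1", " ", "+", " ", "1"]

  # read a source file and lex its contents
  "program.emp"
  |> File.read!()
  |> lexer()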

Let's start by defining emp_expression

  # base case: no characters left, so return the accumulated tokens
  def emp_expression([], acc), do: acc

  def emp_expression([token | rest], acc) do
    case token do
      "" ->
        emp_expression(rest, acc)

      " " ->
        emp_expression(rest, acc)

      "+" ->
        emp_expression(rest, acc ++ [{:op, "+"}])

      "-" ->
        emp_expression(rest, acc ++ [{:op, "-"}])

      "/" ->
        emp_expression(rest, acc ++ [{:op, "/"}])

      "*" ->
        emp_expression(rest, acc ++ [{:op, "*"}])

      "(" ->
        emp_expression(rest, acc ++ [{:identifier, "("}])

      ")" ->
        emp_expression(rest, acc ++ [{:identifier, ")"}])

      "\"" ->
        string = lex_string(rest, "")
        size = String.length(string)
        # drop the string's characters plus the closing quote
        rest = Enum.drop(rest, size + 1)
        emp_expression(rest, acc ++ [{:string, string}])

      _ ->
        cond do
          Empless.Lexer.is_number(token) ->
            num = lex_number(rest, token)
            size = String.length(num)
            # the first digit was already consumed as `token`
            rest = Enum.drop(rest, size - 1)
            emp_expression(rest, acc ++ [{:num, num}])

          is_char(token) ->
            identifier = lex_identifier(rest, token)
            size = String.length(identifier)
            rest = Enum.drop(rest, size - 1)

            if is_identifier(identifier) do
              emp_expression(rest, acc ++ [{:identifier, identifier}])
            else
              raise "Unknown identifier found"
            end

          true ->
            {:error, "Unknown lexeme found"}
        end
    end
  end

Explanation

Let's divide the function into smaller, bite-sized pieces

  • Let's start with the function declaration
  def emp_expression([token | rest], acc) do

Here we have defined a function with the name emp_expression. You might notice something weird inside the brackets; that is called pattern matching. As mentioned above, emp_expression/2 takes 2 arguments, and using pattern matching with the | operator we split the passed list into the first element, token, and the remaining list, rest:

iex> [head|rest] = [1,2,3]
iex> head 
1
iex> rest
[2,3]
  • The rest of the code contains case branches, kind of like if-else chains in C/C++. When the input list runs out, the base clause emp_expression([], acc) simply returns the accumulated tokens. Let's dive into one of the branches, specifically the one that identifies whether the word represents a number.

Note: We will be treating numbers as whole integers, not decimals/floats. This makes the implementation a lot easier, and afterwards you can update it to work with floats.

cond do
  Empless.Lexer.is_number(token) ->
    num = lex_number(rest, token)
    size = String.length(num)
    rest = Enum.drop(rest, size - 1)
    emp_expression(rest, acc ++ [{:num, num}])

Forget about the cond part and let's look at the code inside the block. It first calls another function, Empless.Lexer.is_number(token), which takes the token/head that we split off (remember?). If the function returns true, that means we have encountered a number and have to extract all of it from the text.

1234 + 4566
^
|
1 is a number, but it is not present on its own. It's part of a single bigger number, i.e. 1234.

If the first character in a word is a number, we keep consuming characters until we reach the end of the number or hit an error (123fb + 45 should give an error because the first word cannot be categorized as a number or a string).

We make a call to another function, lex_number, which will extract the number from the stream of text; then we move the cursor forward and recursively call the function again. Let me give you an example:

1234 + 456
^
|
# First character is a number, so `lex_number` is called, which returns 1234, the full number.
# We have lexed this word, so we drop it and continue the scanning phase.

+ 456
^
|
# The `+` character is encountered, which represents a binary operation. Now we drop + from the string as well.

456
^
|
# First character is a number, so `lex_number` is called, which returns 456.
# Now we drop this from the string as well and are left with an empty string.
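We never defined the helpers is_number and lex_number above. Here is a minimal sketch of how they could look, assuming tokens are single-character strings; the real versions live in the Empless repo linked at the end.

  # Kernel auto-imports an is_number/1 guard, so we must exclude it
  # before defining our own function with the same name
  import Kernel, except: [is_number: 1]

  # a single character counts as a number if it is one of the digits 0-9
  def is_number(token), do: token in ~w(0 1 2 3 4 5 6 7 8 9)

  # keep consuming digit characters, appending them to the accumulator;
  # called as lex_number(rest, token), so the first digit is already in acc
  def lex_number([token | rest], acc) do
    if is_number(token) do
      lex_number(rest, acc <> token)
    else
      acc
    end
  end

  # running out of input also ends the number
  def lex_number([], acc), do: acc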

Successfully parsed the whole string 🤟🏼 The emp_expression function contains a few more conditions for

  • strings that start with "
  • all the binary operators -, * and /
  • parentheses

all of which are just variations of the code we just worked through.
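Those branches lean on helpers we also haven't shown. A possible sketch, assuming whitespace ends an identifier (which is why print"hello" fails to lex, as noted earlier) and assuming print is the only known identifier for now:

  # consume characters until the closing quote; the caller drops the
  # consumed characters (plus the quote itself) from the input
  def lex_string([token | rest], acc) do
    if token == "\"" do
      acc
    else
      lex_string(rest, acc <> token)
    end
  end

  def lex_string([], _acc), do: raise "Unterminated string"

  # does this character start an identifier? (letters only here)
  def is_char(token), do: token =~ ~r/^[a-zA-Z]$/

  # consume characters until whitespace (or the end of input)
  def lex_identifier([" " | _rest], acc), do: acc
  def lex_identifier([token | rest], acc), do: lex_identifier(rest, acc <> token)
  def lex_identifier([], acc), do: acc

  # only names we know about are valid identifiers
  def is_identifier(identifier), do: identifier in ["print"]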

Tokens

In the code above you will see some statements like

emp_expression(rest, acc ++ [{:op, "+"}])

The question is: what is that acc list, and what are :op and + inside it? Remember we said that we are giving meaning to words, so we attach an identifier like :op (which means an operation) to the word +. For numbers we attach :num along with the number itself.

After running this program we get a single flat list of tokens:

iex> lexer("1+1")
[{:num, "1"}, {:op, "+"}, {:num, "1"}]

One more thing: see how we have given meaning to all the words in the string? That's how we do Lexing, the first phase of compilation.


Code discussed here is available at: Empless

Glossary

  • Word: a sequence of characters/numbers representing something
    • 1234 represents a number
    • "hello" represents a string