TCombinator

Tame the dragon

Learn the dark arts of bending the fabric of the code you write

Table of contents

  1. Introduction
  2. Why learn Compiler Development
  3. What is Lexing

The dawn of computers revolutionized the way we do things. It was the start of something big: past the industrial era, we entered the era of information. We soon realized that coding in 0s and 1s is not an efficient way to describe a solution.

Imagine typing 01101000 01100101 01101100 01101100 01101111 instead of just saying hello 😱

Compilers have been used ever since to let users describe computations in a high-level syntax. The way we write compilers has not changed much from the 1959 COBOL compiler to modern JavaScript.

What is a Compiler

Compilers take your high-level representation and convert it to machine code. That's it!

  printf("Hi Mom!");

This gets converted to machine code by gcc, a C compiler that is part of the GNU Compiler Collection.

Why learn Compiler Development

Making a compiler is an excellent test of your skills. From scanning to code generation, there are countless avenues to decide your path forward. Each decision and trade-off you make will mould the compiler into something different. The skills learned here will give you a deeper understanding of how computers do things, and piping the various parts together will teach you about code organization and the different paradigms for it. I mainly do it to test new languages out 🤓

What is an Interpreter

For now, just remember that an interpreter, unlike a compiler, does not generate machine code ahead of time, i.e. no executable file is produced.

The weird case of Python

Is Python interpreted or compiled? It's not easy to answer: Python uses both compilation and interpretation. Normally the Python source file (.py) is compiled to another representation (bytecode), and that representation is then interpreted. This mix of compilation and interpretation is also used by Java, C# and possibly others. Even C or C++ can be interpreted, given there exists a tool for it, even though they are generally compiled. A language merely defines the syntax; it can be implemented by both interpreters and compilers.

Phases of Compilation

Lexing

All code is just words

Code time

Our language of choice will be Elixir, a functional programming language. I have found functional languages to be a great choice for making compilers because of how effortlessly and fluently the different components can be piped together.
Being functional also makes debugging easier, since state is not mutated, which is a big W when building complex software like compilers.

Note: Refer to this beautifully crafted website for any Elixir needs: hexdocs.pm

We will start by defining our entry point:

  def lexer(data) do
    a =
      data
      |> String.split("", trim: true)
      |> emp_expresssion([])

    IO.inspect(a)
  end

Explanation

print "hello" -> [print, "hello"]
// Something like
print"hello"
// might work in other languages but will not work here, because there is no
// whitespace to separate the function from its arguments.
// More complex languages have smarter lexers, but for simplicity we require
// whitespace between tokens.
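Before the characters reach `emp_expresssion`, the entry point splits the source into graphemes. Calling `String.split/3` with an empty separator and `trim: true` yields one character per list element, which is exactly what the lexer consumes:

```elixir
iex> String.split("1234 + 456", "", trim: true)
["1", "2", "3", "4", " ", "+", " ", "4", "5", "6"]
```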

Let's start by defining emp_expresssion

  def emp_expresssion([token | rest], acc) do
    case token do
      "" ->
        emp_expresssion(rest, acc)

      " " ->
        emp_expresssion(rest, acc)

      "+" ->
        emp_expresssion(rest, acc ++ [{:op, "+"}])

      "-" ->
        emp_expresssion(rest, acc ++ [{:op, "-"}])

      "/" ->
        emp_expresssion(rest, acc ++ [{:op, "/"}])

      "*" ->
        emp_expresssion(rest, acc ++ [{:op, "*"}])

      "(" ->
        emp_expresssion(rest, acc ++ [{:identifier, "("}])

      ")" ->
        emp_expresssion(rest, acc ++ [{:identifier, ")"}])

      "\"" ->
        string = lex_string(rest, "")
        size = String.length(string)
        rest = Enum.drop(rest, size + 1)
        emp_expresssion(rest, acc ++ [{:string, string}])

      _ ->
        cond do
          Empless.Lexer.is_number(token) ->
            num = lex_number(rest, token)
            size = String.length(num)
            rest = Enum.drop(rest, size - 1)
            emp_expresssion(rest, acc ++ [{:num, num}])

          is_char(token) ->
            identifier = lex_identifier(rest, token)
            size = String.length(identifier)
            rest = Enum.drop(rest, size - 1)

            if(is_identifier(identifier)) do
              emp_expresssion(rest, acc ++ [{:identifier, identifier}])
            else
              raise "Unknown identifier found"
            end

          true ->
            {:error, "Unknown lexeme found"}
        end
    end
  end
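The helper functions called above (`is_number/1`, `is_char/1`, `lex_number/2`, `lex_string/2`, `lex_identifier/2`, `is_identifier/1`) are not shown in this post; the full versions live in the Empless repo. Here is a minimal sketch of how they could look, inferred from how they are called. In particular, the keyword list in `is_identifier/1` is a hypothetical placeholder:

```elixir
# Sketch only - inferred from the call sites above, not the repo code.
def is_number(token), do: token =~ ~r/^[0-9]$/
def is_char(token), do: token =~ ~r/^[A-Za-z]$/

# Keep consuming graphemes while they extend the current number.
def lex_number([next | rest], acc) do
  if is_number(next), do: lex_number(rest, acc <> next), else: acc
end

def lex_number([], acc), do: acc

# Same idea for identifiers, but with letters.
def lex_identifier([next | rest], acc) do
  if is_char(next), do: lex_identifier(rest, acc <> next), else: acc
end

def lex_identifier([], acc), do: acc

# Consume everything up to the closing double quote.
def lex_string(["\"" | _rest], acc), do: acc
def lex_string([next | rest], acc), do: lex_string(rest, acc <> next)
def lex_string([], _acc), do: raise "Unterminated string"

# Hypothetical keyword list - adjust to whatever your language supports.
def is_identifier(identifier), do: identifier in ["print", "let", "if"]
```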

Explanation

Let's divide the function into smaller, bite-sized pieces

  def emp_expresssion([token | rest], acc) do

Here we have defined a function named emp_expresssion. You might notice something weird inside the brackets; that is called pattern matching. As mentioned above, emp_expresssion/2 takes 2 arguments, and using pattern matching with the | operator we split the passed list into its first element, token, and the remaining list, rest:

iex> [head|rest] = [1,2,3]
iex> head 
1
iex> rest
[2,3]

Note: We will be treating numbers as whole integers, not decimals/floats. This makes the implementation a lot easier, and you can later update it to work with floats.

cond do
  Empless.Lexer.is_number(token) ->
    num = lex_number(rest, token)
    size = String.length(num)
    rest = Enum.drop(rest, size - 1)
    emp_expresssion(rest, acc ++ [{:num, num}])

Forget about the cond part and let's look at the code inside the block. It first calls another function, Empless.Lexer.is_number(token), which takes the token/head that we split off (remember?). If the function returns true, that means we have encountered a number and have to extract all of it from the text.

1234 + 4566
^
|
1 is a number, but it does not stand alone: it is part of a single bigger number, i.e. 1234.

If the first character in a word is a number we keep parsing until we have reached the end of the number or an error (123fb + 45 should give an error because the first word cannot be categorized as a number or string).

We call another function, lex_number, which extracts the number from the stream of text; then we move the cursor forward and recursively call the function again. Let me give you an example:
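One subtlety: the pattern match already consumed the first digit as `token`, so only `size - 1` characters need to be dropped from `rest`. For `"1234 + 456"`, after matching off the `"1"`:

```elixir
iex> rest = ["2", "3", "4", " ", "+", " ", "4", "5", "6"]
iex> Enum.drop(rest, String.length("1234") - 1)
[" ", "+", " ", "4", "5", "6"]
```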

1234 + 456
^
| 
# First character is a number so `lex_number` is called, which returns 1234, the full number.
# We have parsed the word and remove it to continue the scanning phase.

+ 456 
^
|
# The `+` character is encountered, which represents a binary operation. Now we drop `+` from the string as well.

456
^
| 
# First character is a number so `lex_number` is called, which returns 456.
# Now we drop this from the string as well and are left with an empty string.

Successfully parsed the whole string 🤟🏼 The emp_expresssion function contains a few more conditions for strings, parentheses and identifiers, which all follow the same pattern.

Tokens

In the code above you will see some statements like

emp_expresssion(rest, acc ++ [{:op, "+"}])

The question is: what is that acc list, and what are :op and "+" inside it? acc is the accumulator that collects the tokens produced so far. Remember, we said we are giving meaning to words, so we attach a tag to each word, like :op (meaning an operation) along with "+". For numbers we use :num with the number itself.

After running this program we get a single list of tokens like

iex> lexer("1+1")
[{:num, "1"}, {:op, "+"}, {:num, "1"}]

One more thing: see how we have given meaning to all the words in the string? That's lexing, the first phase of compilation.


Code discussed here is available at: Empless

Tags: #Elixir #Compilers #LLVM