KindaSloth Blog

Building Blocks of a Compiler

A Journey Through My Toy Language

Basic idea

In this post, I'll explain basic concepts about compilers and interpreters. The idea is to dive into a simple implementation of a toy language I did to see and understand how compilers and interpreters work under the hood.

The parts of a language

Below are the most common parts that almost every compiler/interpreter has; I'll try to explain each one.

  • Lexing
  • Parsing
  • Static analysis
  • Intermediate representation
  • Optimization
  • Code generation and/or runtime

Lexing

Lexing, or lexical analysis, is the first phase of a compiler. It's a process where the compiler breaks down the source code, a string of characters, into meaningful elements called tokens.

Tokens are the smallest units that carry standalone meaning. They can represent variable names, data types, function names, operators, keywords, punctuation like semicolons or brackets, and more.

The lexer scans the source code character by character. When it identifies a sequence of characters that forms a valid token, it generates that token. For example, if the lexer encounters the characters "i", "n", and "t" in sequence, and they are not part of a longer word such as "print", it generates a token for the keyword "int".

However, not all sequences of characters in the source code help describe the program's structure or behavior. The lexer typically discards sequences like whitespace and comments, or categorizes them as dedicated tokens.

Here you can see the basic lexer implementation I did for the toy language I wrote for this post: Lexer.

Just a note: This lexer could be way more efficient and faster; I did it to make it work as simply as possible.
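To make the idea concrete, here is a minimal lexer sketch in TypeScript for a tiny hypothetical language with `let` bindings, integers, identifiers, and `+`. The token names and the mini-language are assumptions I'm making for this post's examples, not the actual types from the linked implementation.

```typescript
// Hypothetical token set for a tiny "let x = 1 + 2;" style language.
type Token =
  | { kind: "let" }
  | { kind: "ident"; name: string }
  | { kind: "int"; value: number }
  | { kind: "equals" }
  | { kind: "plus" }
  | { kind: "semicolon" };

function lex(source: string): Token[] {
  const tokens: Token[] = [];
  let i = 0;
  while (i < source.length) {
    const c = source[i];
    if (/\s/.test(c)) { i++; continue; }                      // discard whitespace
    if (c === "=") { tokens.push({ kind: "equals" }); i++; continue; }
    if (c === "+") { tokens.push({ kind: "plus" }); i++; continue; }
    if (c === ";") { tokens.push({ kind: "semicolon" }); i++; continue; }
    if (/[0-9]/.test(c)) {                                    // integer literal
      let j = i;
      while (j < source.length && /[0-9]/.test(source[j])) j++;
      tokens.push({ kind: "int", value: Number(source.slice(i, j)) });
      i = j; continue;
    }
    if (/[A-Za-z_]/.test(c)) {                                // keyword or identifier (longest match)
      let j = i;
      while (j < source.length && /[A-Za-z0-9_]/.test(source[j])) j++;
      const word = source.slice(i, j);
      tokens.push(word === "let" ? { kind: "let" } : { kind: "ident", name: word });
      i = j; continue;
    }
    throw new Error(`Unexpected character '${c}' at position ${i}`);
  }
  return tokens;
}
```

Running `lex("let x = 1 + 2;")` would produce the tokens `let`, `ident(x)`, `equals`, `int(1)`, `plus`, `int(2)`, `semicolon`. Note how the longest-match rule means a word like `letter` becomes a single identifier instead of the keyword `let` followed by leftover characters.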

Parsing

After the lexing phase of a compiler, which produces a series of tokens, we move on to the parsing phase. Parsing, or syntax analysis, takes these tokens and organizes them into a more complex structure, typically known as an abstract syntax tree (aka AST).

In parsing, the compiler checks that the program follows the language's syntax rules. It uses these rules to build the parse tree, a hierarchical structure that represents the organization of the code. Each tree node represents a construct in the language, such as an "if" statement or a loop.

This phase usually identifies and reports syntax errors, like a missing semicolon or an unmatched bracket. If the parser finds a syntax error, it typically stops further processing and reports the error to the user.

By the end of the parsing phase, if no errors are detected, the compiler has a structured representation of the source code.

Below you can see the basic parser implementation that I did.

First, we define our AST.
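Since the actual definitions live in the linked code, here is only a rough, hypothetical sketch of what an AST for the tiny language from the lexer example above could look like:

```typescript
// A hypothetical AST for the tiny "let" language from the lexer sketch.
type Expr =
  | { kind: "intLit"; value: number }           // e.g. 42
  | { kind: "variable"; name: string }          // e.g. x
  | { kind: "add"; left: Expr; right: Expr };   // e.g. 1 + x

type Stmt = { kind: "letBinding"; name: string; value: Expr };   // let x = ...;

// The whole program is simply a list of statements.
type Program = Stmt[];
```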

And with that, we can write our parser: Parser.

Just a note: I'm using an array of expressions for this toy language, but that's not recommended; I'm just simplifying stuff here.
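For illustration, here is a small recursive-descent parser sketch that consumes the hypothetical Token and AST types from the earlier examples; it is not the parser linked above, just one way such a parser can be written.

```typescript
// Recursive-descent parser sketch over the hypothetical Token and AST types above.
function parse(tokens: Token[]): Program {
  let pos = 0;

  const peek = (): Token | undefined => tokens[pos];

  // Consume a token of the expected kind or report a syntax error.
  const expect = (kind: Token["kind"]): Token => {
    const tok = tokens[pos];
    if (!tok || tok.kind !== kind) {
      throw new Error(`Syntax error: expected '${kind}' at token ${pos}`);
    }
    pos++;
    return tok;
  };

  // atom := int | ident
  function parseAtom(): Expr {
    const tok = peek();
    if (tok?.kind === "int") { pos++; return { kind: "intLit", value: tok.value }; }
    if (tok?.kind === "ident") { pos++; return { kind: "variable", name: tok.name }; }
    throw new Error(`Syntax error: unexpected token at position ${pos}`);
  }

  // expr := atom ("+" atom)*
  function parseExpr(): Expr {
    let left = parseAtom();
    while (peek()?.kind === "plus") {
      expect("plus");
      left = { kind: "add", left, right: parseAtom() };
    }
    return left;
  }

  // stmt := "let" ident "=" expr ";"
  function parseStmt(): Stmt {
    expect("let");
    const nameTok = expect("ident") as { kind: "ident"; name: string };
    expect("equals");
    const value = parseExpr();
    expect("semicolon");
    return { kind: "letBinding", name: nameTok.name, value };
  }

  const program: Program = [];
  while (pos < tokens.length) program.push(parseStmt());
  return program;
}
```

Calling `parse(lex("let y = 1 + 2 + x;"))` builds a single `letBinding` whose value is the nested `add` node `((1 + 2) + x)`; feeding it something like `let y = ;` makes `parseAtom` throw a syntax error instead.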

Static analysis

Static analysis, often called semantic analysis in the context of compilers, is the next phase after parsing. While the lexer and parser ensure that the source code adheres to the language's syntax, the semantic analysis phase further verifies that the program makes sense regarding the language's semantics or meaning.

In other words, the semantic analyzer checks that the parsed code is well-formed, meaningful, and logically coherent. This involves checking aspects of the code that go beyond what the syntax alone can express.

Here are some of the tasks that can be part of semantic analysis:

  • Type checking
  • Scope resolution
  • Control flow checks
  • Other language-specific checks

Errors caught in this stage can include type mismatch, undefined variables, improper use of operations, and unreachable code. If any semantic errors are found, the compiler reports them to the user.

Here you can see the basic type checker implementation that I did: Type checker.

Just a note: Like the lexer, this could be way more efficient and faster; I did it to make it work as simply as possible.
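As a rough illustration, here is a tiny semantic-analysis sketch over the hypothetical AST from the parsing section. It covers only two of the checks listed above, type checking and scope resolution: every variable must be bound before use, and both operands of `+` must be integers. It is not the linked type checker.

```typescript
// Minimal semantic analysis sketch: scope resolution + type checking.
type Type = "int";

function typeCheck(program: Program): void {
  const env = new Map<string, Type>();   // single scope: variable name -> type

  function checkExpr(expr: Expr): Type {
    switch (expr.kind) {
      case "intLit":
        return "int";
      case "variable": {
        const ty = env.get(expr.name);
        if (!ty) throw new Error(`Undefined variable '${expr.name}'`);   // scope error
        return ty;
      }
      case "add": {
        const left = checkExpr(expr.left);
        const right = checkExpr(expr.right);
        if (left !== "int" || right !== "int") {
          throw new Error("Type mismatch: '+' expects integer operands");
        }
        return "int";
      }
    }
  }

  for (const stmt of program) {
    env.set(stmt.name, checkExpr(stmt.value));   // record the binding for later statements
  }
}
```

With only one type in this toy sketch the mismatch branch can never fire, but the shape generalizes: add more types and the same walk starts catching real type errors, while the environment lookup already rejects programs like `let y = x + 1;` when `x` was never declared.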

Intermediate representation and Optimization

The intermediate representation (IR) is essential to the compilation process. After the source code has been lexed, parsed, and semantically analyzed, the next step is to translate it into an intermediate form.

The intermediate representation is a generic form of code that abstracts the specifics of the source language and the target machine language. It is designed to be easy for the compiler to manipulate and analyze, making it a powerful tool for optimization. Additionally, using an intermediate representation, compilers can be modularized more effectively, separating the front end (language-specific) and the back end (machine-specific).

There are two main types of intermediate representation: high-level IRs and low-level IRs.

  • High-Level IR: This form of IR is closer to the structure of the source code. It maintains much of the high-level abstraction of the source language, which can be helpful for certain kinds of analysis and optimization.

  • Low-Level IR: This form of IR is closer to machine code and less abstract than high-level IR. It's typically used for optimizations that depend on the specifics of the target machine and for generating machine code in the final step of the compilation process (see the sketch below).
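To make the difference tangible, here is a hypothetical low-level IR for the toy examples above: a flat list of instructions over numbered temporaries, which is much closer to how a machine executes code than the tree-shaped AST is. The instruction names are my own invention for illustration.

```typescript
// Hypothetical low-level IR: a flat sequence of instructions over temporaries t0, t1, ...
type Instr =
  | { op: "loadConst"; dest: number; value: number }        // tN = constant
  | { op: "loadVar"; dest: number; name: string }           // tN = variable
  | { op: "add"; dest: number; lhs: number; rhs: number }   // tN = tA + tB
  | { op: "store"; name: string; src: number };             // variable = tN

// "let y = 1 + 2 + x;" lowered by hand into that IR:
const lowered: Instr[] = [
  { op: "loadConst", dest: 0, value: 1 },
  { op: "loadConst", dest: 1, value: 2 },
  { op: "add", dest: 2, lhs: 0, rhs: 1 },
  { op: "loadVar", dest: 3, name: "x" },
  { op: "add", dest: 4, lhs: 2, rhs: 3 },
  { op: "store", name: "y", src: 4 },
];
```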

The compiler manipulates the intermediate representation to optimize the code. Optimizations can be as simple as eliminating redundant computations and dead code or as complex as loop transformations and inline expansion. These optimizations aim to improve the resulting program's performance, reducing runtime, memory usage, power consumption, or some combination of these.
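As a tiny example of what an optimization pass can look like, here is a constant-folding sketch that works directly on the hypothetical AST from earlier: whenever both operands of an addition are literals, the addition is performed at compile time.

```typescript
// Constant folding: collapse "add" nodes whose operands are both literals.
function foldConstants(expr: Expr): Expr {
  switch (expr.kind) {
    case "intLit":
    case "variable":
      return expr;                                  // nothing to fold
    case "add": {
      const left = foldConstants(expr.left);
      const right = foldConstants(expr.right);
      if (left.kind === "intLit" && right.kind === "intLit") {
        return { kind: "intLit", value: left.value + right.value };
      }
      return { kind: "add", left, right };
    }
  }
}
```

Applied to the value of `let y = 1 + 2 + x;`, this turns `(1 + 2) + x` into `3 + x`, so the addition of the two constants never happens at runtime.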

Code generation and/or runtime

The final stages of a compiler's operation focus on Code Generation and, potentially, Runtime Management. Let's take a closer look at each.

Code generation

At this point, we've been through all the stages of analysis and abstraction, optimized the code using the intermediate representation, and now it's time to generate the target code. This could be machine code for a specific hardware architecture, but it could also be another high-level language.

The goal of the code generator is to take the intermediate code and transform it into efficient code in the target language. This can involve mapping constructs from the source language to equivalent constructs in the target language, ensuring that the semantics of the original code are preserved in the translation.

For instance, if you're transpiling TypeScript to JavaScript, the compiler needs to decide how to translate TypeScript's static typing system into JavaScript, which is dynamically typed. Similarly, if you're compiling a functional language to an imperative language, the compiler must figure out how to represent functional programming constructs using the imperative language's features.

Runtime management

Depending on the source and target languages, the compiler may also need to generate code to manage the runtime environment. This can include setting up stacks and heaps for dynamic memory allocation, managing the invocation and return of functions (including parameter passing), handling exceptions, and potentially implementing some form of garbage collection for languages that support it.

Here you can see the basic printer implementation that I did to emit JavaScript from my "AST": Printer.
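For a rough sense of what such a printer can look like, here is a minimal JavaScript emitter over the hypothetical AST from the parsing section; it is a simplified stand-in for the linked Printer, not the real thing.

```typescript
// Minimal JavaScript emitter sketch: walk the AST and build source text.
function emitExpr(expr: Expr): string {
  switch (expr.kind) {
    case "intLit":
      return String(expr.value);
    case "variable":
      return expr.name;
    case "add":
      // Parenthesize to keep the evaluation order explicit in the output.
      return `(${emitExpr(expr.left)} + ${emitExpr(expr.right)})`;
  }
}

function emitProgram(program: Program): string {
  return program
    .map((stmt) => `let ${stmt.name} = ${emitExpr(stmt.value)};`)
    .join("\n");
}
```

Chaining the earlier sketches, `emitProgram(parse(lex("let y = 1 + 2 + x;")))` produces the JavaScript `let y = ((1 + 2) + x);`, and running constant folding first would give `let y = (3 + x);`.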

Conclusion

The design and implementation of compilers (or transpilers) is a complex and fascinating subject, exploring the intersection of theory and practice, language design, and software engineering. It offers a unique insight into the workings of programming languages and the translation between them.

Here you can see the complete source code of the toy language described in this post: source code