Building the Tundra Language and Interpreter, Part 1
Building the Tundra Language and Interpreter, Part 1
It’s time, y’all. I’m building a language, a compiler, and an interpreter.
“If you don’t know how compilers work, then you don’t know how computers work. If you’re not 100% sure whether you know how compilers work, then you don’t know how they work.”
— Steve Yegge
It feels long overdue.
I’ve developing software some time- mainly UI UX Front-End, and I still often feel like I don’t understand it deeply enough. So it’s time to step up my game. In this series, I’ll share my thinking process, research, notes, and code — and hopefully you’ll enjoy following along.
I still remember watching Dylan Beattie’s talk on YouTube about creating his own language, Rockstar, with all of its rock-and-roll flair. That talk left a strong impression on me years ago, and now I finally feel ready to build a language of my own.
Here is the idea to start with the basics and try to really understand the core concepts and then we design the language itself, build a tree-walking interpreter, and finish with a bytecode virtual machine.
I’ll be using Crafting Interpreters by Robert Nystrom as my main guide. thats to keep some sort of a structure. I’ll also be taking notes and ideas from Static Program Analysis by Anders Møller and Michael I. Schwartzbach, along with notes from Asa and a bunch of articles here and there.
I’m starting from limited knowledge, but the goal is to build something real enough to teach me how languages actually work.
And don’t worry if some of the jargon or concepts feel confusing at first. I’m learning this in public, and the goal of the series is to make these ideas clearer as I go. My hope is that, as I progress, both I and anyone reading this series will have a much better understanding of how programming languages — and computers in general — work.
Why study interpreters and compilers?
There are at least three good reasons.
Firstly, writing an interpreter or compiler forces you to bring together a wide range of technical skills. It’s the kind of project that help your understanding of parsing, data structures, control flow, execution models, and software design all at once, not bad huh?. The skills you develop carry over into all kinds of software work.
Secondly, studying interpreters and compilers helps remove the magic of how computers work. These systems can seem almost magical from the outside, but they shouldn’t stay that way. Learning how source code is turned into something a machine can execute gives you a much clearer mental model of what programming actually is.
Thirdly, maybe you want to build your own programming language — or even just a small domain-specific language. If that’s the case, you’ll eventually need to build either an interpreter or a compiler for it. Moreover this feels like a great time to learn. Interest in programming languages has surged over the past several years, and new languages keep appearing, each exploring different tradeoffs and ideas.
So buckle up, grab a cup of coffee or tea, and let’s dive in.
How a Programming Language Is Built
I’d like to start by looking at how a programming language is generally built.
Here are my notes as of now.
At a high level, I’m going to divide the process into three main phases:
- The Front End (FE)
- Intermediate Representation (IR)
- The Back End (BE)
I’ll also briefly talk about compilers and interpreters, since those terms get used a lot and are easy to blur together.
┌──────────────┐
│ Source Code │
└──────────────┘
│
▼
┌─────────────────────────────┐
│ 1. Scanning │
│ (Lexical Analysis) │
│ - creates tokens │
│ - removes whitespace/comments│
└─────────────────────────────┘
│
▼
┌─────────────────────────────┐
│ 2. Parsing │
│ (Syntax Trees / AST) │
│ - builds tree structure │
│ - follows grammar │
└─────────────────────────────┘
│
▼
┌─────────────────────────────┐
│ 3. Static Analysis & │
│ Binding │
│ - resolves names/scope │
│ - links definitions/uses │
└─────────────────────────────┘
│
▼
┌─────────────────────────────┐
│ Intermediate Representation │
│ (IR) │
│ - supports multiple targets │
└─────────────────────────────┘
│
▼
┌─────────────────────────────┐
│ 4. Code Generation │
│ - produces assembly or │
│ bytecode │
└─────────────────────────────┘
│
▼
┌─────────────────────────────┐
│ 5. Runtime / VM │
│ - executes code │
│ - manages memory, etc. │
└─────────────────────────────┘
1. The Front End (FE) Phase
1.1. Scanning
Scanning, also known as lexing or lexical analysis, is the very first step in processing source code.
Initally, a code is just a flat stream of characters — basically one large string of text. The scanner’s job is to group those characters into meaningful chunks called tokens.
You can think of tokens as the “words” of a programming language. The term lexical comes from the Greek root lex, meaning “word.”
A scanner usually identifies several different kinds of tokens:
- Single-character tokens, such as
(or, - Multi-character tokens, such as numbers (
123), string literals ("hi!"), and identifiers (min) - Keywords, which are reserved words that have a special meaning in the language
The scanner also filters out noise that the rest of the implementation does not need to care about. This usually includes:
- Whitespace, such as spaces, tabs, and newlines used for readability
- Comments, which are written for humans and ignored by the language implementation
By the time the scanning phase ends, the source code has already been converted into a string of tokens. The resulting tokens are passed on to the parser stage.
Scanning (Lexical Analysis)
Grouping a flat stream of characters into meaningful “words” (tokens), while discarding the noise.
┌───────────────┐
│ a = 1 + 2; │
└───────────────┘
│
▼
┌────────────────────┐
│ Scanner │
└────────────────────┘
│
├──► [IDENTIFIER: a]
├──► [EQUALS]
├──► [NUMBER: 1]
├──► [PLUS]
└──► [NUMBER: 2]
discarded:
- spaces
- tabs
- comments
1.2. Parsing
Parsing is the stage where those tokens are organized into meaningful structure according to the grammar of the language.
If scanning turns characters into words, parsing turns those words into sentences.
The most important thing here is that parsing makes it clear how these sentences are structured hierarchically in the source code. The source code itself is not simply a sequence of tokens but rather an arrangement where expressions and statements are nested within each other.
If tokens are words, parsing makes the sentence.
Input tokens:
[IDENTIFIER: a] [EQUALS] [NUMBER: 1] [PLUS] [NUMBER: 2]
Parse tree:
[EQUALS]
/ \
[IDENTIFIER: a] [PLUS]
/ \
[NUMBER: 1] [NUMBER: 2]
This phase enforces the language grammar and reports syntax errors if the structure is invalid.
Building the syntax tree
The parser’s main job is to take the flat sequence of tokens and transform it into a tree structure. This is often represented as a parse tree or, more commonly in practice, an Abstract Syntax Tree (AST) - sounds fancier and smarter.
That tree reflects the nested structure of the grammar.
For instance, within a mathematical statement, the tree represents the operations that are to be performed within other operations and in what sequence. In this regard, it is somewhat similar to the diagramming of a sentence during construction of a sentance in school.
Enforcing grammar and reporting errors
Parsing is also where the rules of the language are enforced. The grammar of the language is used to build the syntax for forming larger structures from the smaller elements. A syntax error will arise whenever the source program violates this grammar.
Error reporting for this type of error is thus very important because programmers do make mistakes.
Different parsing strategies
Not all language implementations handle parsing in exactly the same way:
- Tree-walk interpreters build an AST, and then the interpreter walks that tree node by node to execute the program.
- Single-pass compilers may skip building a full tree and instead generate output as they parse, using a technique often called syntax-directed translation.
- Transpilers still scan and parse source code, but if the source and target languages are similar enough, they may skip deeper analysis and generate the destination code more directly.
1.3 Static analysis
At this stage, the parser will have generated a syntax tree for the implementation to understand the structure of the expressions. The one thing that the implementation still needs to find out is whether the identifiers are making any sense and meaningful.
Binding and resolution
One of the important tasks in static analysis is called binding or name resolution.
The binding process is responsible for identifying in which part of the code the identifier has been declared and binding occurrences of the identifier to declarations of the same identifier.
It relies on scope, which refers to an area in the source code where a name has meaning and/or reachable.
For instance, when analyzing a + b, static analysis is used to find out whether the identifiers a and b represent local variables, global variables, or even undefined names.
Type checking
If the programming language is statically typed,, that is when type checking takes place.
In a dynamically typed programming language, most of this is left for run-time.
Storing the results
The information discovered during static analysis needs to be stored somewhere so later stages can use it. There are a few common ways to do that:
- Attributes: extra fields attached directly to AST nodes
- Symbol tables: separate data structures mapping names to information about those names
- New data structures: more specialized representations that capture the semantics of the program more directly
Main jobs:
- Binding / Resolution:
connect names to their declarations (scope)
- Type Checking:
ensure operations are valid for the given data types
Parsed tree:
[EQUALS]
/ \
[IDENTIFIER: a] [PLUS]
/ \
[NUMBER: 1] [NUMBER: 2]
Annotations:
- [IDENTIFIER: a] -> Scope: Global
- [NUMBER: 1] -> Type: Integer
- [NUMBER: 2] -> Type: Integer
Symbol table:
+------+------------------+
| Name | Data |
+------+------------------+
| a | Global, Integer |
+———+------------------+
Static analysis is generally considered the final part of the front end.
2. Intermediate Representation
An Intermediate Representation, shortly IR, lies between the front end specific to the input language and the back end specific to the output platform.
The Intermediate Representation is an intermediary representation of the program code which is more suitable for analysis, transformation, optimization, and translation.
Why use an IR?
2.1. It separates source languages from target machines
-
The front end is tied to the source language. It knows how to understand something like C, Pascal.
-
The back end is tied to the target platform. It knows how to generate code for something like x86 or ARM.
Input programs
- source A
- source B
- source C
│
▼
┌──────────────────────────────┐
│ Intermediate Representation │
│ (IR) │
└──────────────────────────────┘
│
▼
Output programs
- target A
- target B
- target C
Common IR forms:
- Control-flow graphs
- SSA
- Three-address code
- Bytecode
2.2. It solves the “M × N” problem
Supporting multiple languages on multiple architectures becomes expensive very quickly without Without an IR.
If you had 3 source languages and 3 target architectures, you would need 9 separate compilers.
With an IR, you only need:
- one front end per language
- one back end per target
- That reduces the problem from 9 implementations to 6. This is one of the big reasons modern compiler frameworks are built this way.
Separating the source language from the target architecture
makes scaling much cheaper.
Without IR:
Languages: C, Pascal, Tundra
Targets: x86, ARM, MIPS
Each language needs its own compiler for each target:
C ──► x86
C ──► ARM
C ──► MIPS
Pascal ──► x86
Pascal ──► ARM
Pascal ──► MIPS
Tundra ──► x86
Tundra ──► ARM
Tundra ──► MIPS
Total: 9 compilers
With IR:
Languages:
C ──►
Pascal ──► IR ──► x86
Tundra ──► ──► ARM
──► MIPS
Total: 6 implementations
- 3 front ends (one per language)
- 3 back ends (one per target)
2.3. It gives the compiler a place to optimize
IR is also where many optimizations happen.
Once the program reaches that stage the compiler is able to substitute existing program for another one that performs the same function but works much faster.
One of the well-known techniques is called constant folding. For example in the case of 2 + 3, the compiler will perform the operation at compile-time and will replace it with the constant 5. That saves a lot of expenses.
Common forms of IR
There are several well-known styles of intermediate representation, including:
- Control-flow graphs
- Static Single Assignment (SSA)
- Three-address code
- Bytecode, sometimes called p-code, which is a compact, low-level instruction format often executed by a virtual machine
3. The Back End
But even after the analysis and transformation process, there is the need for a mechanism to execute the program, either by compiling it into machine language, translating it into bytecode, or interpreting it directly.
3.1. Optimization
Before generating final code, the implementation may perform optimizations.
This will involve replacing the original code used by the programmer with one that performs at a higher level.
3.2. Code generation
The core job of the back end is code generation.
This is where the implementation turns the program into:
- machine code, which runs directly on the CPU, or
- bytecode, which runs on a virtual machine
3.3. Runtime support
After a program runs, it usually relies on the runtime system for memory management, garbage collection, stack handling, or exception management.
3.4. Common shortcuts
Not every language implementation follows the full front-end → IR → back-end pipeline. Some common variations include:
- Tree-walk interpreters, which skip the back end entirely and execute the AST directly
- Transpilers, which translate one high-level language into another and let another toolchain do the rest
- JIT compilers (just-in-time compilers), which compile code to machine code at runtime
Okay, but what are compilers and interpreters?
These are two distinct concepts, although they are very similar.
- A compiler is used to convert source code into some other form; however, it will not execute the code itself.
- The interpreter executes the program by running either source code or code converted into some other form.
+------------+--------------------------------+----------------------------------+
| Category | Compiler | Interpreter |
+------------+--------------------------------+----------------------------------+
| Action | Translates source code | Executes the program directly |
| | into another form | |
+------------+--------------------------------+----------------------------------+
| Execution | Does not execute the code | Executes step-by-step |
| | itself | from source or IR |
+------------+--------------------------------+----------------------------------+
Hybrid:
Many modern systems do both.
Example: Python compiles source to bytecode,
then a virtual machine interprets that bytecode.
It turns out that in reality, the difference is not so clear-cut. In many modern programming languages, both methods are used at the same time. Thus, Python code is compiled into bytecode, which is interpreted by a virtual machine.
That is, “compiler” and “interpreter” are not necessarily opposite concepts. More often, they are simply separate stages of the same process.
Thats it for now, see ya in a bit. Stay tuned!