Writing a Compiler (Part 2): Tokenizing


The first stage of any compiler is the tokenizer. A tokenizer splits a string of code into "tokens", where each token represents the smallest chunk of code that is still meaningful on its own. A tokenizer might take in some code like this:


x + 1

And convert it into something like this:


[
  { "contents": "x", "type": "identifier", "line": 1, "column": 1 },
  { "contents": "+", "type": "operator", "line": 1, "column": 3 },
  { "contents": "1", "type": "number", "line": 1, "column": 5 },
]

Each element in this array is a token, and has some basic information about it, such as its contents, type, and the line/column it is on.


The code for this blog is viewable here.


What is a Token?


Let's start by defining what a token is:


In wac/parse/token.py:


from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class TokenType(Enum):
    IDENTIFIER = auto()
    WHITESPACE = auto()
    NEWLINE = auto()

@dataclass
class Token:
    content: str
    line: int
    column: int
    type: Optional[TokenType] = None

A basic token holds its content, the line and column it starts on, and its type. The type is optional, since we won't know what kind of token we are dealing with until a later stage.


We are using a dataclass for Token. This lets us write just the outline of the class, and the boilerplate __init__, __eq__, etc. methods are created for us automatically.
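

To get a feel for what the decorator is doing for us, here is a rough sketch of the kind of __init__ and __eq__ that @dataclass generates for Token (simplified; the real generated code handles more edge cases, and TokenByHand is just a throwaway name for illustration):


class TokenByHand:
    def __init__(self, content, line, column, type=None):
        self.content = content
        self.line = line
        self.column = column
        self.type = type

    def __eq__(self, other):
        # Compare field by field, like the generated dataclass __eq__ does
        if other.__class__ is not self.__class__:
            return NotImplemented
        return (self.content, self.line, self.column, self.type) == (
            other.content,
            other.line,
            other.column,
            other.type,
        )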


To verify that what we have currently is working, let's add the following tests in test/test_tokenizer.py:


from wac.parse.token import Token, TokenType

def test_create_token():
    token = Token("hello", 1, 2)

    assert token.content == "hello"
    assert token.line == 1
    assert token.column == 2
    assert not token.type


def test_create_token_with_type():
    token = Token("hello", 1, 2, TokenType.IDENTIFIER)

    assert token.content == "hello"
    assert token.line == 1
    assert token.column == 2
    assert token.type == TokenType.IDENTIFIER


def test_token_compare():
    assert Token("hello", 1, 1) == Token("hello", 1, 1)

And run it with pytest. We shouldn't get any errors.
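

If pytest isn't installed yet, it can be installed with pip. Assuming the project is set up so that the wac package is importable (for example, with the project root on PYTHONPATH), running the tests looks something like this:


$ pip install pytest
$ pytest test/test_tokenizer.py

All three tests should pass.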


Creating Location Info


Next up is taking a string of code and assigning location info to each of the characters. The logic is pretty simple:


- Start at line 1, column 1.
- For each character, record the current line and column.
- If the character is a newline, move to the next line and reset the column to 1.
- Otherwise, move to the next column.


We might change this later, but this should cover all of the important use cases.


We will create a NamedTuple to store our new LocationInfo data structure:


from typing import NamedTuple

class LocationInfo(NamedTuple):
    char: str
    line: int
    column: int
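

Since a NamedTuple is still a plain tuple under the hood, we get named attribute access and ordinary tuple equality for free, which will keep our tests short. A quick sketch:


info = LocationInfo("a", 1, 1)

assert info.char == "a"       # access fields by name
assert info == ("a", 1, 1)    # still compares equal to a plain tuple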

Now we will create a function that takes in some code and, using the logic we defined above, returns a generator which emits our location info:


from typing import Generator

def generate_location_info(
    code: str,
) -> Generator[LocationInfo, None, None]:
    line = 1
    column = 1

    for c in code:
        yield LocationInfo(c, line, column)

        if c == "\n":
            line += 1
            column = 1

        else:
            column += 1

The -> Generator[...] part is a return type annotation, which tells mypy (and your intellisense, if you're using VSCode) what the generate_location_info function actually returns.
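

As a quick sanity check, we can consume the generator in the REPL (output wrapped onto multiple lines here for readability):


>>> list(generate_location_info("a\nb"))
[LocationInfo(char='a', line=1, column=1),
 LocationInfo(char='\n', line=1, column=2),
 LocationInfo(char='b', line=2, column=1)]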


We will test the functionality of our new function in the existing test_tokenizer.py file:


from wac.parse.token import generate_location_info

def test_generate_location_info():
    locations = list(generate_location_info("a\nbc\ndef"))

    assert locations == [
        ("a", 1, 1),
        ("\n", 1, 2),
        ("b", 2, 1),
        ("c", 2, 2),
        ("\n", 2, 3),
        ("d", 3, 1),
        ("e", 3, 2),
        ("f", 3, 3),
    ]

Generating the Tokens


Now that we have the ability to get location info for each of the characters in our string, we can start to group similar characters and make tokens!


from itertools import groupby
from typing import Generator, List, Optional

def char_to_token_type(c: str) -> Optional[TokenType]:
    if c == "\n":
        return TokenType.NEWLINE

    if c.isspace():
        return TokenType.WHITESPACE

    if c.isalpha():
        return TokenType.IDENTIFIER

    return None


def tokenize(code: str) -> List[Token]:
    def create_token(token_info):
        contents = "".join([info.char for info in token_info])

        return Token(contents, token_info[0].line, token_info[0].column)

    location_info = generate_location_info(code)

    grouped = groupby(
        location_info,
        lambda info: char_to_token_type(info.char),
    )

    return [create_token(list(group[1])) for group in grouped]

char_to_token_type takes a single char and tries to figure out what kind of token type it corresponds to. We will add to this later, but this should be good enough for now.
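

A quick REPL check of a few characters (the numbers attached to each enum member come from auto(), which assigns them in definition order):


>>> char_to_token_type("a")
<TokenType.IDENTIFIER: 1>
>>> char_to_token_type(" ")
<TokenType.WHITESPACE: 2>
>>> char_to_token_type("+") is None
True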


tokenize is a little more complex, but not too much. Basically, we generate the location info, figure out the token type for each character (using the lambda function), and group consecutive characters which share the same token type. Here is an example of what groupby does:


>>> from itertools import groupby

>>> nums = [1, 2, 3, -1, -2, 10, -10]

>>> groups = groupby(nums, lambda x: x > 0)

>>> [(group[0], list(group[1])) for group in groups]
[(True, [1, 2, 3]), (False, [-1, -2]), (True, [10]), (False, [-10])]

What we get back is a sequence of tuples, each with the key returned by our lambda on the left and the consecutive values that produced that key on the right. Note that groupby only groups elements that sit next to each other, which is exactly what we want here: a run of letters becomes one identifier token, a run of whitespace becomes one whitespace token, and so on.


Once we have our location info all grouped up, we loop through each group, joining all of the chars together, taking the line/column info from the first location in the group, and creating a token from that.
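

Putting it all together, tokenizing the snippet from the start of this post gives us the following (a REPL sketch, with the list wrapped for readability; every token's type is still None at this point):


>>> tokenize("x + 1")
[Token(content='x', line=1, column=1, type=None),
 Token(content=' ', line=1, column=2, type=None),
 Token(content='+', line=1, column=3, type=None),
 Token(content=' ', line=1, column=4, type=None),
 Token(content='1', line=1, column=5, type=None)]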


Here are the tests for the code above:


from wac.parse.token import Token, tokenize

def test_tokenize_single_token():
    tokens = tokenize("hello")

    assert tokens == [Token("hello", 1, 1)]


def test_tokenize_2_tokens():
    tokens = tokenize("hello\n")

    assert tokens == [Token("hello", 1, 1), Token("\n", 1, 6)]


def test_tokenize_many_tokens():
    tokens = tokenize("hello world")

    assert tokens == [
        Token("hello", 1, 1),
        Token(" ", 1, 6),
        Token("world", 1, 7),
    ]


def test_tokenize_unknown_token():
    tokens = tokenize("+")

    assert tokens == [Token("+", 1, 1)]

Next Steps


In the next blog we will further improve upon our tokenizer, and start actually classifying the tokens based on their contents.

