Writing a Compiler (Part 3): Classifying + Better Tokenizing


Now that we have some tokens strung together, the next step is to classify them and give them more meaning. Classification by itself is not super interesting or super hard, so in addition to classifying the tokens, we will add basic parsing of comments, which will require some changes to our tokenizer.


Let's begin!


The code used for this blog can be viewed here.

Basic Token Type Classification


Let's start by adding some more token types to our TokenType enum:


 class TokenType(Enum):
     IDENTIFIER = auto()
     WHITESPACE = auto()
     NEWLINE = auto()
+    PLUS = auto()
+    DASH = auto()
+    ASTERISK = auto()
+    SLASH = auto()
+    POWER = auto()
+    OPEN_PAREN = auto()
+    CLOSE_PAREN = auto()
+    EQUAL = auto()
+    LESS_THEN = auto()
+    GREATER_THEN = auto()
+    DOT = auto()
+    COMMENT = auto()

All of these new token types (except COMMENT) are single-character tokens, meaning we won't need much code to classify them properly. Also, note that we are using PLUS instead of ADD, and DASH instead of MINUS. Although it might be tempting to name these tokens after their uses, things start to get confusing when you lock a token to a specific use case. For example, a "dash" can be used for both binary subtraction and unary negation: 1 - 2 (binary) and -1 (unary). By using the ambiguous term "dash", we can cover both operations, since the token type describes the token itself, not the way the token is used.
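

To make that concrete, here is a small hypothetical test you could drop into the test file once char_to_token_type learns about these characters (see the diff further down): both expressions yield exactly one DASH token, and it is the parser's job, much later, to decide what that dash means.


from wac.parse.token import TokenType, tokenize


def test_dash_is_ambiguous():
    # one DASH token in each case; subtraction vs negation is decided later,
    # by the parser, not by the tokenizer
    binary = [t for t in tokenize("a - b") if t.type == TokenType.DASH]
    unary = [t for t in tokenize("-a") if t.type == TokenType.DASH]

    assert len(binary) == 1
    assert len(unary) == 1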


Next we need to update our char_to_token_type function to handle the new token types we just added:


 def char_to_token_type(c: str) -> Optional[TokenType]:
-    if c == "\n":
-        return TokenType.NEWLINE
+    simple_token_types = {
+        "\n": TokenType.NEWLINE,
+        "+": TokenType.PLUS,
+        "-": TokenType.DASH,
+        "*": TokenType.ASTERISK,
+        "/": TokenType.SLASH,
+        "^": TokenType.POWER,
+        "(": TokenType.OPEN_PAREN,
+        ")": TokenType.CLOSE_PAREN,
+        "=": TokenType.EQUAL,
+        "<": TokenType.LESS_THEN,
+        ">": TokenType.GREATER_THEN,
+        ".": TokenType.DOT,
+        "#": TokenType.COMMENT,
+    }
+
+    if token_type := simple_token_types.get(c):
+        return token_type

This basically matches up a character like + with its respective token type. The walrus operator lets us assign and use our temporary token_type variable in one line. We use the .get(c) method instead of a subscript ([c]) because .get returns None if the key is not found, whereas a subscript raises a KeyError exception. Not very pleasant!
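

If the walrus operator or .get are new to you, here is a tiny standalone sketch (not part of the tokenizer) showing the difference:


simple_token_types = {"+": "PLUS"}

# .get returns None for a missing key instead of raising an exception
print(simple_token_types.get("!"))  # None

# a subscript blows up on the same lookup
try:
    simple_token_types["!"]
except KeyError:
    print("KeyError!")

# the walrus operator assigns and tests the value in a single expression
if token_type := simple_token_types.get("+"):
    print(token_type)  # PLUS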


When we run pytest, we will see that our test_tokenize_unknown_token test fails, since + is now a recognized token type. To fix it, we change the test to use a character which doesn't have a token type yet:


 def test_tokenize_unknown_token():
-    tokens = tokenize("+")
+    tokens = tokenize("!")

-    assert tokens == [Token("+", 1, 1, None)]
+    assert tokens == [Token("!", 1, 1, None)]

Cleanup


As we start to make our tokenizer more complex, it is important to recognize things which won't work well in the future. Currently, the LocationInfo and Token concepts are very similar, so we might as well just use Tokens for everything. That way we will only be grouping/merging/manipulating tokens, instead of both token and location info objects.


Let's start by re-writing our tokenize function:


def tokenize(code: str) -> List[Token]:
    def collapse_token(tokens: List[Token]) -> Token:
        contents = "".join([token.content for token in tokens])

        first = tokens[0]
        return Token(contents, first.line, first.column, first.type)

    def location_info_to_token(info: LocationInfo) -> Token:
        return Token(
            info.char, info.line, info.column, char_to_token_type(info.char)
        )

    tokens = [
        location_info_to_token(info) for info in generate_location_info(code)
    ]

    grouped = groupby(tokens, lambda token: token.type)

    return [collapse_token(list(group[1])) for group in grouped]

Basically, once we get our location info, we immediately turn it into a token, and then group on the token's type. This will make things easier down below.
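

If groupby is unfamiliar, here is a small standalone sketch (using plain tuples instead of our Token class) of how it behaves:


from itertools import groupby

# groupby only merges *consecutive* items that share a key, which is exactly
# the behavior we rely on when grouping by token type
chars = [("h", "IDENTIFIER"), ("i", "IDENTIFIER"), (" ", "WHITESPACE")]

for token_type, group in groupby(chars, lambda pair: pair[1]):
    print(token_type, repr("".join(c for c, _ in group)))

# IDENTIFIER 'hi'
# WHITESPACE ' '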


Comment Parsing


Now we are going to add the ability to parse comments. The basic requirements for a comment token are simple: a comment starts at a # character, it runs to the end of the line, and it does not include the newline that ends it.


To achieve this, we will write a function that will take the many tokens inside of our comment and group them into a single "comment" token. Let's write some tests to see what we should expect:


def test_collapse_comment():
    tokens = tokenize("# hello world")

    assert tokens == [Token("# hello world", 1, 1, TokenType.COMMENT)]


def test_collapse_comment_respect_newlines():
    tokens = tokenize("# hello\n# world")

    assert tokens == [
        Token("# hello", 1, 1, TokenType.COMMENT),
        Token("\n", 1, 8, TokenType.NEWLINE),
        Token("# world", 2, 1, TokenType.COMMENT),
    ]

If we run our tests, we will see that the new tests that we added are not passing. Let's write some code to fix that!


def collapse_comment(tokens: List[Token]) -> List[Token]:
    out: List[Token] = []
    in_comment = False

    for token in tokens:
        if token.type == TokenType.COMMENT:
            in_comment = True

        elif in_comment:
            if token.type == TokenType.NEWLINE:
                in_comment = False

            else:
                out[-1].content += token.content
                continue

        out.append(token)

    return out

This (in short) loops through each token until we find a token with a COMMENT type. Once we find it, we append the contents of each following token to the last token in our out list (the comment token), until we reach a NEWLINE or the loop ends.
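

Since collapse_comment runs before the groupby call (see the diff below), it receives one token per character. A quick hypothetical trace, assuming collapse_comment is importable from wac.parse.token alongside tokenize:


from wac.parse.token import Token, TokenType, collapse_comment

# one token per character, exactly as collapse_comment will see them
tokens = [
    Token("#", 1, 1, TokenType.COMMENT),
    Token("h", 1, 2, TokenType.IDENTIFIER),
    Token("i", 1, 3, TokenType.IDENTIFIER),
    Token("\n", 1, 4, TokenType.NEWLINE),
]

print(collapse_comment(tokens))
# roughly: [Token("#hi", 1, 1, COMMENT), Token("\n", 1, 4, NEWLINE)]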


And now we need to update our tokenize function:


     ]

+    tokens = collapse_comment(tokens)
+
     grouped = groupby(tokens, lambda token: token.type)

When we run our tests again, they should be passing.


More Cleanup/Refactoring


As we mentioned before, LocationInfo objects look very similar to Tokens. It would probably be best to just merge the two:


-class LocationInfo(NamedTuple):
-    char: str
-    line: int
-    column: int
-
-
-def generate_location_info(
+def generate_token_locations(
     code: str,
-) -> Generator[LocationInfo, None, None]:
+) -> Generator[Token, None, None]:
     line = 1
     column = 1

     for c in code:
-        yield LocationInfo(c, line, column)
+        yield Token(c, line, column)

 ...

-    def location_info_to_token(info: LocationInfo) -> Token:
-        return Token(
-            info.char, info.line, info.column, char_to_token_type(info.char)
-        )
+    def add_token_type(token: Token) -> Token:
+        token.type = char_to_token_type(token.content)
+
+        return token

     tokens = [
-        location_info_to_token(info) for info in generate_location_info(code)
+        add_token_type(token) for token in generate_token_locations(code)
     ]
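
Note that this diff leans on two properties of Token that we have been using all along: its type field needs a default (so Token("a", 1, 1) is valid), and tokens need to be mutable (so add_token_type and collapse_comment can assign to their fields). A sketch of what such a Token could look like, assuming it is a dataclass in wac/parse/token.py (the actual definition may differ):


from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    content: str
    line: int
    column: int
    # TokenType is the enum defined earlier in the same module
    type: Optional[TokenType] = None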

And in our tests:


-from wac.parse.token import Token, TokenType, generate_location_info, tokenize
+from wac.parse.token import (
+    Token,
+    TokenType,
+    generate_token_locations,
+    tokenize,
+)

 ...

 def test_generate_location_info():
-    locations = list(generate_location_info("a\nbc\ndef"))
+    locations = list(generate_token_locations("a\nbc\ndef"))

     assert locations == [
-        ("a", 1, 1),
-        ("\n", 1, 2),
-        ("b", 2, 1),
-        ("c", 2, 2),
-        ("\n", 2, 3),
-        ("d", 3, 1),
-        ("e", 3, 2),
-        ("f", 3, 3),
+        Token("a", 1, 1),
+        Token("\n", 1, 2),
+        Token("b", 2, 1),
+        Token("c", 2, 2),
+        Token("\n", 2, 3),
+        Token("d", 3, 1),
+        Token("e", 3, 2),
+        Token("f", 3, 3),
     ]

Refactoring done!


Fixing isort


Remember that isort is what we use for sorting our imports. We didn't set it up in our first blog post, so now we need to configure it (the imports in the test file are now too long and are not being formatted correctly).


In .isort.cfg:


[settings]
multi_line_output=3
include_trailing_comma=true
color_output=true

Since we enabled color output, isort needs the colorama package, so we also need to update our dev-requirements.txt file:


 click==8.0.4
+colorama==0.4.4
 coverage==6.3.2

Don't forget to re-run pip install -r dev-requirements.txt afterwards!

Now when we run make isort, it will tell us whether isort was successful or not, and give us an indication of what failed. Note that with --diff alone it will not return a non-zero exit code upon failure, which means failures would slip through our CI workflow unnoticed! We can fix this by adding the following to our Makefile:


 isort:
-       isort . --diff
+       isort . --diff --check

That's it!


What's Next


In the next blog post, we will discuss the general structure of our new programming language, and how we will structure our AST nodes. After that, we will write an AST parser to create said nodes.


[prev] [next]