From 4ec6074b5505118b6132e4a2b735896895a9f0ac Mon Sep 17 00:00:00 2001 From: System Date: Tue, 25 Apr 2017 17:16:30 +0000 Subject: [PATCH] Original project from Glitch --- 0-introduction.md | 285 ++++++++++ 1-tokenizer.js | 180 +++++++ 2-parser.js | 161 ++++++ 3-traverser.js | 97 ++++ 4-transformer.js | 142 +++++ 5-code-generator.js | 66 +++ 6-compiler.js | 50 ++ LICENSE | 2 +- README.md | 43 +- package.json | 14 +- server.js | 70 +++ template.html.ejs | 320 +++++++++++ test.js | 50 +- the-super-tiny-compiler.js | 1038 ------------------------------------ 14 files changed, 1430 insertions(+), 1088 deletions(-) create mode 100644 0-introduction.md create mode 100644 1-tokenizer.js create mode 100644 2-parser.js create mode 100644 3-traverser.js create mode 100644 4-transformer.js create mode 100644 5-code-generator.js create mode 100644 6-compiler.js mode change 100755 => 100644 LICENSE mode change 100755 => 100644 README.md create mode 100644 server.js create mode 100644 template.html.ejs mode change 100755 => 100644 test.js delete mode 100755 the-super-tiny-compiler.js diff --git a/0-introduction.md b/0-introduction.md new file mode 100644 index 0000000..67e428b --- /dev/null +++ b/0-introduction.md @@ -0,0 +1,285 @@ +# Introduction + +Today we're going to write a compiler together. But not just any compiler... A +super duper teeny tiny compiler! A compiler that is so small that if you remove +all the comments this file would only be ~200 lines of actual code. + +We're going to compile some lisp-like function calls into some C-like function +calls. + +If you are not familiar with one or the other. I'll just give you a quick intro. 
+ +If we had two functions `add` and `subtract` they would be written like this: + +| | LISP-style | C-style | +| ------------- | ------------------------ | ------------------------ | +| `2 + 2` | `(add 2 2)` | `add(2, 2)` | +| `4 - 2` | `(subtract 4 2)` | `subtract(4, 2)` | +| `2 + (4 - 2)` | `(add 2 (subtract 4 2))` | `add(2, subtract(4, 2))` | + +Easy peasy, right? + +Well good, because this is exactly what we are going to compile. While this is +neither a complete LISP nor a complete C syntax, it will be enough of the syntax to +demonstrate many of the major pieces of a modern compiler. + +# Stages of a Compiler + +Most compilers break down into three primary stages: Parsing, Transformation, +and Code Generation. + +1. *Parsing* is taking raw code and turning it into a more abstract + representation of the code. +2. *Transformation* takes this abstract representation and manipulates it to do + whatever the compiler wants it to. +3. *Code Generation* takes the transformed representation of the code and turns + it into new code. + +## Parsing + +Parsing typically gets broken down into two phases: Lexical Analysis and +Syntactic Analysis. + +*Lexical Analysis* takes the raw code and splits it apart into these things +called tokens by a thing called a tokenizer (or lexer). + +Tokens are an array of tiny little objects that describe an isolated piece of +the syntax. They could be numbers, labels, punctuation, operators, whatever. + +*Syntactic Analysis* takes the tokens and reformats them into a representation +that describes each part of the syntax and their relation to one another. This +is known as an **Intermediate Representation** or **Abstract Syntax Tree**. + +An Abstract Syntax Tree, or AST for short, is a deeply nested object that +represents code in a way that is both easy to work with and tells us a lot of +information. 
+ +For the following syntax: + +```lisp +(add 2 (subtract 4 2)) +``` + +Tokens might look something like this: + +```js +[ + { type: 'paren', value: '(' }, + { type: 'name', value: 'add' }, + { type: 'number', value: '2' }, + { type: 'paren', value: '(' }, + { type: 'name', value: 'subtract' }, + { type: 'number', value: '4' }, + { type: 'number', value: '2' }, + { type: 'paren', value: ')' }, + { type: 'paren', value: ')' }, +] +``` + +And an Abstract Syntax Tree (AST) might look like this: + +```js +{ + type: 'Program', + body: [{ + type: 'CallExpression', + name: 'add', + params: [{ + type: 'NumberLiteral', + value: '2', + }, { + type: 'CallExpression', + name: 'subtract', + params: [{ + type: 'NumberLiteral', + value: '4', + }, { + type: 'NumberLiteral', + value: '2', + }] + }] + }] +} +``` + +## Transformation + +The next type of stage for a compiler is transformation. Again, this just takes +the AST from the last step and makes changes to it. It can manipulate the AST +in the same language or it can translate it into an entirely new language. + +Let's look at how we would transform an AST. + +You might notice that our AST has elements within it that look very similar. +There are these objects with a type property. Each of these are known as an AST +Node. These nodes have defined properties on them that describe one isolated +part of the tree. + +We can have a node for a "NumberLiteral": + +```js +{ + type: 'NumberLiteral', + value: '2', +} +``` + +Or maybe a node for a "CallExpression": + +```js +{ + type: 'CallExpression', + name: 'subtract', + params: [ + // nested nodes go here... + ], +} +``` + +When transforming the AST we can manipulate nodes by adding/removing/replacing +properties, we can add new nodes, remove nodes, or we could leave the existing +AST alone and create an entirely new one based on it. + +Since we're targeting a new language, we're going to focus on creating an +entirely new AST that is specific to the target language. 
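As a rough illustration of what "creating an entirely new AST" can mean, here is a minimal sketch that converts a single Lisp-style node into a C-style node. The helper name `toTargetNode` is an assumption made up for this example, and the recursive approach shown is a simplification, not the visitor-based method this project uses later on.

```javascript
// Hypothetical helper for illustration only: `toTargetNode` is an assumed
// name and is not part of the project.
function toTargetNode(node) {
  switch (node.type) {
    case 'NumberLiteral':
      // Literals carry over unchanged.
      return { type: 'NumberLiteral', value: node.value };

    case 'CallExpression':
      // The Lisp-style `name` becomes a C-style `callee` identifier, and
      // `params` become `arguments`, converted recursively.
      return {
        type: 'CallExpression',
        callee: { type: 'Identifier', name: node.name },
        arguments: node.params.map(toTargetNode),
      };

    default:
      throw new TypeError(node.type);
  }
}

// Source node for `(subtract 4 2)`.
const sourceNode = {
  type: 'CallExpression',
  name: 'subtract',
  params: [
    { type: 'NumberLiteral', value: '4' },
    { type: 'NumberLiteral', value: '2' },
  ],
};

// The source node is left untouched; a new node is built from it.
const targetNode = toTargetNode(sourceNode);
console.log(JSON.stringify(targetNode, null, 2));
```

The point of the sketch is only that the original node is read, never modified, and the result is a separate tree shaped for the target language.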
+ +## Traversal + +In order to navigate through all of these nodes, we need to be able to traverse +through them. This traversal process goes to each node in the AST depth-first. + +```js +{ + type: 'Program', + body: [{ + type: 'CallExpression', + name: 'add', + params: [{ + type: 'NumberLiteral', + value: '2' + }, { + type: 'CallExpression', + name: 'subtract', + params: [{ + type: 'NumberLiteral', + value: '4' + }, { + type: 'NumberLiteral', + value: '2' + }] + }] + }] +} +``` + +So for the above AST we would go: + +1. Program - Starting at the top level of the AST +2. CallExpression (add) - Moving to the first element of the Program's body +3. NumberLiteral (2) - Moving to the first element of CallExpression's params +4. CallExpression (subtract) - Moving to the second element of CallExpression's params +5. NumberLiteral (4) - Moving to the first element of CallExpression's params +6. NumberLiteral (2) - Moving to the second element of CallExpression's params + +If we were manipulating this AST directly, instead of creating a separate AST, +we would likely introduce all sorts of abstractions here. But just visiting +each node in the tree is enough. + +The reason I use the word "visiting" is because there is this pattern of how +to represent operations on elements of an object structure. + +### Visitors + +The basic idea here is that we are going to create a "visitor" object that has +methods that will accept different node types. + +```js +var visitor = { + NumberLiteral() {}, + CallExpression() {}, +}; +``` + +When we traverse our AST, we will call the methods on this visitor whenever we +"enter" a node of a matching type. + +In order to make this useful we will also pass the node and a reference to the +parent node. + +```js +var visitor = { + NumberLiteral(node, parent) {}, + CallExpression(node, parent) {}, +}; +``` + +However, there also exists the possibility of calling things on "exit". 
Imagine +our tree structure from before in list form: + +- Program + - CallExpression + - NumberLiteral + - CallExpression + - NumberLiteral + - NumberLiteral + +As we traverse down, we're going to reach branches with dead ends. As we finish +each branch of the tree we "exit" it. So going down the tree we "enter" each +node, and going back up we "exit". + +- → Program (enter) + - → CallExpression (enter) + - → NumberLiteral (enter) + - ← NumberLiteral (exit) + - → CallExpression (enter) + - → NumberLiteral (enter) + - ← NumberLiteral (exit) + - → NumberLiteral (enter) + - ← NumberLiteral (exit) + - ← CallExpression (exit) + - ← CallExpression (exit) +- ← Program (exit) + +In order to support that, the final form of our visitor will look like this: + +```js +var visitor = { + NumberLiteral: { + enter(node, parent) {}, + exit(node, parent) {}, + } +}; +``` + +## Code Generation + +The final phase of a compiler is code generation. Sometimes compilers will do +things that overlap with transformation, but for the most part code generation +just means taking our AST and stringifying it back into code. + +Code generators work in several different ways: some compilers will reuse the +tokens from earlier, others will have created a separate representation of the +code so that they can print nodes linearly, but from what I can tell most will +use the same AST we just created, which is what we're going to focus on. + +Effectively our code generator will know how to "print" all of the different +node types of the AST, and it will recursively call itself to print nested +nodes until everything is printed into one long string of code. + +--- + +And that's it! That's all the different pieces of a compiler. + +Now that isn't to say every compiler looks exactly like I described here. +Compilers serve many different purposes, and they might need more steps than I +have detailed. + +But now you should have a general high-level idea of what most compilers look +like. 
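To make the enter/exit order from the Visitors section a little more concrete, here is a small sketch that records each event while walking the AST for `(add 2 (subtract 4 2))`. The `traverse` function, the `events` array, and the `log` object are assumptions invented for this example; the project's real traverser is more explicit about node types.

```javascript
// A deliberately tiny traversal, assumed for illustration. It treats `body`
// and `params` as the only places child nodes can live.
function traverse(node, visitor, parent = null) {
  const methods = visitor[node.type];

  // Call `enter` on the way down...
  if (methods && methods.enter) methods.enter(node, parent);

  const children = node.body || node.params || [];
  children.forEach(child => traverse(child, visitor, node));

  // ...and `exit` on the way back up.
  if (methods && methods.exit) methods.exit(node, parent);
}

// Record every event so the order can be inspected afterwards.
const events = [];
const log = {
  enter(node) { events.push('enter ' + node.type); },
  exit(node) { events.push('exit ' + node.type); },
};

const ast = {
  type: 'Program',
  body: [{
    type: 'CallExpression',
    name: 'add',
    params: [
      { type: 'NumberLiteral', value: '2' },
      {
        type: 'CallExpression',
        name: 'subtract',
        params: [
          { type: 'NumberLiteral', value: '4' },
          { type: 'NumberLiteral', value: '2' },
        ],
      },
    ],
  }],
};

traverse(ast, { Program: log, CallExpression: log, NumberLiteral: log });
console.log(events.join('\n'));
```

Each node is entered before its children and exited after them, so the recorded events should line up with the arrow list in the Visitors section.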
+ +Now that I've explained all of this, you're all good to go write your own +compilers right? + +Just kidding, that's what I'm here to help with :P + +So let's begin... diff --git a/1-tokenizer.js b/1-tokenizer.js new file mode 100644 index 0000000..770aefd --- /dev/null +++ b/1-tokenizer.js @@ -0,0 +1,180 @@ +/** + * ============================================================================ + * (/^▽^)/ + * THE TOKENIZER! + * ============================================================================ + */ + +/** + * We're gonna start off with our first phase of parsing, lexical analysis, with + * the tokenizer. + * + * We're just going to take our string of code and break it down into an array + * of tokens. + * + * (add 2 (subtract 4 2)) => [{ type: 'paren', value: '(' }, ...] + */ + +// We start by accepting an input string of code, and we're gonna set up two +// things... +function tokenizer(input) { + + // A `current` variable for tracking our position in the code like a cursor. + let current = 0; + + // And a `tokens` array for pushing our tokens to. + let tokens = []; + + // We start by creating a `while` loop where we are setting up our `current` + // variable to be incremented as much as we want `inside` the loop. + // + // We do this because we may want to increment `current` many times within a + // single loop because our tokens can be any length. + while (current < input.length) { + + // We're also going to store the `current` character in the `input`. + let char = input[current]; + + // The first thing we want to check for is an open parenthesis. This will + // later be used for `CallExpression` but for now we only care about the + // character. + // + // We check to see if we have an open parenthesis: + if (char === '(') { + + // If we do, we push a new token with the type `paren` and set the value + // to an open parenthesis. 
+ tokens.push({ + type: 'paren', + value: '(', + }); + + // Then we increment `current` + current++; + + // And we `continue` onto the next cycle of the loop. + continue; + } + + // Next we're going to check for a closing parenthesis. We do the same exact + // thing as before: Check for a closing parenthesis, add a new token, + // increment `current`, and `continue`. + if (char === ')') { + tokens.push({ + type: 'paren', + value: ')', + }); + current++; + continue; + } + + // Moving on, we're now going to check for whitespace. This is interesting + // because we care that whitespace exists to separate characters, but it + // isn't actually important for us to store as a token. We would only throw + // it out later. + // + // So here we're just going to test for existence and if it does exist we're + // going to just `continue` on. + let WHITESPACE = /\s/; + if (WHITESPACE.test(char)) { + current++; + continue; + } + + // The next type of token is a number. This is different than what we have + // seen before because a number could be any number of characters and we + // want to capture the entire sequence of characters as one token. + // + // (add 123 456) + // ^^^ ^^^ + // Only two separate tokens + // + // So we start this off when we encounter the first number in a sequence. + let NUMBERS = /[0-9]/; + if (NUMBERS.test(char)) { + + // We're going to create a `value` string that we are going to push + // characters to. + let value = ''; + + // Then we're going to loop through each character in the sequence until + // we encounter a character that is not a number, pushing each character + // that is a number to our `value` and incrementing `current` as we go. + while (NUMBERS.test(char)) { + value += char; + char = input[++current]; + } + + // After that we push our `number` token to the `tokens` array. + tokens.push({ type: 'number', value }); + + // And we continue on. 
+ continue; + } + + // We'll also add support for strings in our language which will be any + // text surrounded by double quotes ("). + // + // (concat "foo" "bar") + // ^^^ ^^^ string tokens + // + // We'll start by checking for the opening quote: + if (char === '"') { + // Keep a `value` variable for building up our string token. + let value = ''; + + // We'll skip the opening double quote in our token. + char = input[++current]; + + // Then we'll iterate through each character until we reach another + // double quote. + while (char !== '"') { + value += char; + char = input[++current]; + } + + // Skip the closing double quote. + char = input[++current]; + + // And add our `string` token to the `tokens` array. + tokens.push({ type: 'string', value }); + + continue; + } + + // The last type of token will be a `name` token. This is a sequence of + // letters instead of numbers, that are the names of functions in our lisp + // syntax. + // + // (add 2 4) + // ^^^ + // Name token + // + let LETTERS = /[a-z]/i; + if (LETTERS.test(char)) { + let value = ''; + + // Again we're just going to loop through all the letters pushing them to + // a value. + while (LETTERS.test(char)) { + value += char; + char = input[++current]; + } + + // And pushing that value as a token with the type `name` and continuing. + tokens.push({ type: 'name', value }); + + continue; + } + + // Finally if we have not matched a character by now, we're going to throw + // an error and completely exit. + throw new TypeError('I dont know what this character is: ' + char); + } + + // Then at the end of our `tokenizer` we simply return the tokens array. + return tokens; +} + +// Just exporting our tokenizer to be used in the final compiler... 
+module.exports = tokenizer; \ No newline at end of file diff --git a/2-parser.js b/2-parser.js new file mode 100644 index 0000000..fe66a46 --- /dev/null +++ b/2-parser.js @@ -0,0 +1,161 @@ +/** + * ============================================================================ + * ヽ/❀o ل͜ o\ノ + * THE PARSER!!! + * ============================================================================ + */ + +/** + * For our parser we're going to take our array of tokens and turn it into an + * AST. + * + * [{ type: 'paren', value: '(' }, ...] => { type: 'Program', body: [...] } + */ + +// Okay, so we define a `parser` function that accepts our array of `tokens`. +function parser(tokens) { + + // Again we keep a `current` variable that we will use as a cursor. + let current = 0; + + // But this time we're going to use recursion instead of a `while` loop. So we + // define a `walk` function. + function walk() { + + // Inside the walk function we start by grabbing the `current` token. + let token = tokens[current]; + + // We're going to split each type of token off into a different code path, + // starting off with `number` tokens. + // + // We test to see if we have a `number` token. + if (token.type === 'number') { + + // If we have one, we'll increment `current`. + current++; + + // And we'll return a new AST node called `NumberLiteral` and setting its + // value to the value of our token. + return { + type: 'NumberLiteral', + value: token.value, + }; + } + + // If we have a string we will do the same as number and create a + // `StringLiteral` node. + if (token.type === 'string') { + current++; + + return { + type: 'StringLiteral', + value: token.value, + }; + } + + // Next we're going to look for CallExpressions. We start this off when we + // encounter an open parenthesis. + if ( + token.type === 'paren' && + token.value === '(' + ) { + + // We'll increment `current` to skip the parenthesis since we don't care + // about it in our AST. 
+ token = tokens[++current]; + + // We create a base node with the type `CallExpression`, and we're going + // to set the name as the current token's value since the next token after + // the open parenthesis is the name of the function. + let node = { + type: 'CallExpression', + name: token.value, + params: [], + }; + + // We increment `current` *again* to skip the name token. + token = tokens[++current]; + + // And now we want to loop through each token that will be the `params` of + // our `CallExpression` until we encounter a closing parenthesis. + // + // Now this is where recursion comes in. Instead of trying to parse a + // potentially infinitely nested set of nodes we're going to rely on + // recursion to resolve things. + // + // To explain this, let's take our Lisp code. You can see that the + // parameters of the `add` are a number and a nested `CallExpression` that + // includes its own numbers. + // + // (add 2 (subtract 4 2)) + // + // You'll also notice that in our tokens array we have multiple closing + // parenthesis. + // + // [ + // { type: 'paren', value: '(' }, + // { type: 'name', value: 'add' }, + // { type: 'number', value: '2' }, + // { type: 'paren', value: '(' }, + // { type: 'name', value: 'subtract' }, + // { type: 'number', value: '4' }, + // { type: 'number', value: '2' }, + // { type: 'paren', value: ')' }, <<< Closing parenthesis + // { type: 'paren', value: ')' }, <<< Closing parenthesis + // ] + // + // We're going to rely on the nested `walk` function to increment our + // `current` variable past any nested `CallExpression`. + + // So we create a `while` loop that will continue until it encounters a + // token with a `type` of `'paren'` and a `value` of a closing + // parenthesis. + while ( + (token.type !== 'paren') || + (token.type === 'paren' && token.value !== ')') + ) { + // we'll call the `walk` function which will return a `node` and we'll + // push it into our `node.params`. 
+ node.params.push(walk()); + token = tokens[current]; + } + + // Finally we will increment `current` one last time to skip the closing + // parenthesis. + current++; + + // And return the node. + return node; + } + + // Again, if we haven't recognized the token type by now we're going to + // throw an error. + throw new TypeError(token.type); + } + + // Now, we're going to create our AST which will have a root which is a + // `Program` node. + let ast = { + type: 'Program', + body: [], + }; + + // And we're going to kickstart our `walk` function, pushing nodes to our + // `ast.body` array. + // + // The reason we are doing this inside a loop is because our program can have + // `CallExpression` after one another instead of being nested. + // + // (add 2 2) + // (subtract 4 2) + // + while (current < tokens.length) { + ast.body.push(walk()); + } + + // At the end of our parser we'll return the AST. + return ast; +} + +// Just exporting our parser to be used in the final compiler... +module.exports = parser; \ No newline at end of file diff --git a/3-traverser.js b/3-traverser.js new file mode 100644 index 0000000..d199874 --- /dev/null +++ b/3-traverser.js @@ -0,0 +1,97 @@ +/** + * ============================================================================ + * ⌒(❀>◞౪◟<❀)⌒ + * THE TRAVERSER!!! + * ============================================================================ + */ + +/** + * So now we have our AST, and we want to be able to visit different nodes with + * a visitor. We need to be able to call the methods on the visitor whenever we + * encounter a node with a matching type. + * + * traverse(ast, { + * Program(node, parent) { + * // ... + * }, + * + * CallExpression(node, parent) { + * // ... + * }, + * + * NumberLiteral(node, parent) { + * // ... + * }, + * }); + */ + +// So we define a traverser function which accepts an AST and a +// visitor. Inside we're going to define two functions... 
+function traverser(ast, visitor) { + + // A `traverseArray` function that will allow us to iterate over an array and + // call the next function that we will define: `traverseNode`. + function traverseArray(array, parent) { + array.forEach(child => { + traverseNode(child, parent); + }); + } + + // `traverseNode` will accept a `node` and its `parent` node. So that it can + // pass both to our visitor methods. + function traverseNode(node, parent) { + + // We start by testing for the existence of a method on the visitor with a + // matching `type`. + let methods = visitor[node.type]; + + // If there is an `enter` method for this node type we'll call it with the + // `node` and its `parent`. + if (methods && methods.enter) { + methods.enter(node, parent); + } + + // Next we are going to split things up by the current node type. + switch (node.type) { + + // We'll start with our top level `Program`. Since Program nodes have a + // property named body that has an array of nodes, we will call + // `traverseArray` to traverse down into them. + // + // (Remember that `traverseArray` will in turn call `traverseNode` so we + // are causing the tree to be traversed recursively) + case 'Program': + traverseArray(node.body, node); + break; + + // Next we do the same with `CallExpression` and traverse their `params`. + case 'CallExpression': + traverseArray(node.params, node); + break; + + // In the cases of `NumberLiteral` and `StringLiteral` we don't have any + // child nodes to visit, so we'll just break. + case 'NumberLiteral': + case 'StringLiteral': + break; + + // And again, if we haven't recognized the node type then we'll throw an + // error. + default: + throw new TypeError(node.type); + } + + // If there is an `exit` method for this node type we'll call it with the + // `node` and its `parent`. 
+ if (methods && methods.exit) { + methods.exit(node, parent); + } + } + + // Finally we kickstart the traverser by calling `traverseNode` with our ast + // with no `parent` because the top level of the AST doesn't have a parent. + traverseNode(ast, null); +} + +// Just exporting our traverser to be used in the final compiler... +module.exports = traverser; \ No newline at end of file diff --git a/4-transformer.js b/4-transformer.js new file mode 100644 index 0000000..52d5fa8 --- /dev/null +++ b/4-transformer.js @@ -0,0 +1,142 @@ +var traverser = require('./3-traverser'); + +/** + * ============================================================================ + * ⁽(◍˃̵͈̑ᴗ˂̵͈̑)⁽ + * THE TRANSFORMER!!! + * ============================================================================ + */ + +/** + * Next up, the transformer. Our transformer is going to take the AST that we + * have built and pass it to our traverser function with a visitor and will + * create a new ast. + * + * ---------------------------------------------------------------------------- + * Original AST | Transformed AST + * ---------------------------------------------------------------------------- + * { | { + * type: 'Program', | type: 'Program', + * body: [{ | body: [{ + * type: 'CallExpression', | type: 'ExpressionStatement', + * name: 'add', | expression: { + * params: [{ | type: 'CallExpression', + * type: 'NumberLiteral', | callee: { + * value: '2' | type: 'Identifier', + * }, { | name: 'add' + * type: 'CallExpression', | }, + * name: 'subtract', | arguments: [{ + * params: [{ | type: 'NumberLiteral', + * type: 'NumberLiteral', | value: '2' + * value: '4' | }, { + * }, { | type: 'CallExpression', + * type: 'NumberLiteral', | callee: { + * value: '2' | type: 'Identifier', + * }] | name: 'subtract' + * }] | }, + * }] | arguments: [{ + * } | type: 'NumberLiteral', + * | value: '4' + * ---------------------------------- | }, { + * | type: 'NumberLiteral', + * | value: '2' + * | }] + * (sorry the 
other one is longer.) | } + * | } + * | }] + * | } + * ---------------------------------------------------------------------------- + */ + +// So we have our transformer function which will accept the lisp ast. +function transformer(ast) { + + // We'll create a `newAst` which like our previous AST will have a program + // node. + let newAst = { + type: 'Program', + body: [], + }; + + // Next I'm going to cheat a little and create a bit of a hack. We're going to + // use a property named `context` on our parent nodes that we're going to push + // nodes to their parent's `context`. Normally you would have a better + // abstraction than this, but for our purposes this keeps things simple. + // + // Just take note that the context is a reference *from* the old ast *to* the + // new ast. + ast._context = newAst.body; + + // We'll start by calling the traverser function with our ast and a visitor. + traverser(ast, { + + // The first visitor method accepts any `NumberLiteral` + NumberLiteral: { + // We'll visit them on enter. + enter(node, parent) { + // We'll create a new node also named `NumberLiteral` that we will push to + // the parent context. + parent._context.push({ + type: 'NumberLiteral', + value: node.value, + }); + }, + }, + + // Next we have `StringLiteral` + StringLiteral: { + enter(node, parent) { + parent._context.push({ + type: 'StringLiteral', + value: node.value, + }); + }, + }, + + // Next up, `CallExpression`. + CallExpression: { + enter(node, parent) { + + // We start creating a new node `CallExpression` with a nested + // `Identifier`. + let expression = { + type: 'CallExpression', + callee: { + type: 'Identifier', + name: node.name, + }, + arguments: [], + }; + + // Next we're going to define a new context on the original + // `CallExpression` node that will reference the `expression`'s arguments + // so that we can push arguments. + node._context = expression.arguments; + + // Then we're going to check if the parent node is a `CallExpression`. 
+ // If it is not... + if (parent.type !== 'CallExpression') { + + // We're going to wrap our `CallExpression` node with an + // `ExpressionStatement`. We do this because the top level + // `CallExpression` in JavaScript are actually statements. + expression = { + type: 'ExpressionStatement', + expression: expression, + }; + } + + // Last, we push our (possibly wrapped) `CallExpression` to the `parent`'s + // `context`. + parent._context.push(expression); + }, + } + }); + + // At the end of our transformer function we'll return the new ast that we + // just created. + return newAst; +} + +// Just exporting our transformer to be used in the final compiler... +module.exports = transformer; \ No newline at end of file diff --git a/5-code-generator.js b/5-code-generator.js new file mode 100644 index 0000000..af86c4e --- /dev/null +++ b/5-code-generator.js @@ -0,0 +1,66 @@ +/** + * ============================================================================ + * ヾ(〃^∇^)ノ♪ + * THE CODE GENERATOR!!!! + * ============================================================================ + */ + +/** + * Now let's move onto our last phase: The Code Generator. + * + * Our code generator is going to recursively call itself to print each node in + * the tree into one giant string. + */ + +function codeGenerator(node) { + + // We'll break things down by the `type` of the `node`. + switch (node.type) { + + // If we have a `Program` node. We will map through each node in the `body` + // and run them through the code generator and join them with a newline. + case 'Program': + return node.body.map(codeGenerator) + .join('\n'); + + // For `ExpressionStatement` we'll call the code generator on the nested + // expression and we'll add a semicolon... 
+ case 'ExpressionStatement': + return ( + codeGenerator(node.expression) + + ';' // << (...because we like to code the *correct* way) + ); + + // For `CallExpression` we will print the `callee`, add an open + // parenthesis, we'll map through each node in the `arguments` array and run + // them through the code generator, joining them with a comma, and then + // we'll add a closing parenthesis. + case 'CallExpression': + return ( + codeGenerator(node.callee) + + '(' + + node.arguments.map(codeGenerator) + .join(', ') + + ')' + ); + + // For `Identifier` we'll just return the `node`'s name. + case 'Identifier': + return node.name; + + // For `NumberLiteral` we'll just return the `node`'s value. + case 'NumberLiteral': + return node.value; + + // For `StringLiteral` we'll add quotations around the `node`'s value. + case 'StringLiteral': + return '"' + node.value + '"'; + + // And if we haven't recognized the node, we'll throw an error. + default: + throw new TypeError(node.type); + } +} + +// Just exporting our code generator to be used in the final compiler... +module.exports = codeGenerator; \ No newline at end of file diff --git a/6-compiler.js b/6-compiler.js new file mode 100644 index 0000000..7b4ae99 --- /dev/null +++ b/6-compiler.js @@ -0,0 +1,50 @@ +var tokenizer = require('./1-tokenizer'); +var parser = require('./2-parser'); +// Note: The traverser is only used inside of the transformer... +var transformer = require('./4-transformer'); +var codeGenerator = require('./5-code-generator'); + +/** + * ============================================================================ + * (۶* ‘ヮ’)۶” + * !!!!!!!!THE COMPILER!!!!!!!! + * ============================================================================ + */ + +/** + * FINALLY! We'll create our `compiler` function. Here we will link together + * every part of the pipeline. + * + * 1. input => tokenizer => tokens + * 2. tokens => parser => ast + * 3. ast => transformer => newAst + * 4. 
newAst => generator => output + */ + +function compiler(input) { + let tokens = tokenizer(input); + let ast = parser(tokens); + let newAst = transformer(ast); + let output = codeGenerator(newAst); + + // and simply return the output! + return output; +} + +/** + * ============================================================================ + * (๑˃̵ᴗ˂̵)و + * !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!YOU MADE IT!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! + * ============================================================================ + */ + +/** + * Now, if you enjoyed this, please give it a star on GitHub and follow me on + * Twitter (the links are up on the top right). + * + * You can also play around with this code/website on glitch.com (link is also + * on top right). + */ + +// Just exporting our compiler to be used in the tests +module.exports = compiler; diff --git a/LICENSE b/LICENSE old mode 100755 new mode 100644 index 40f19b9..bc384f0 --- a/LICENSE +++ b/LICENSE @@ -390,4 +390,4 @@ understandings, or agreements concerning use of licensed material. For the avoidance of doubt, this paragraph does not form part of the public licenses. -Creative Commons may be contacted at creativecommons.org. +Creative Commons may be contacted at creativecommons.org. \ No newline at end of file diff --git a/README.md b/README.md old mode 100755 new mode 100644 index 79a0d63..abc7c56 --- a/README.md +++ b/README.md @@ -1,24 +1,21 @@ -[![The Super Tiny Compiler](https://cloud.githubusercontent.com/assets/952783/21579290/5755288a-cf75-11e6-90e0-029529a44a38.png)](the-super-tiny-compiler.js) +# Welcome to The Super Tiny Compiler! -***Welcome to The Super Tiny Compiler!*** +***An [@thejameskyle](http://thejameskyle.com/) production*** -This is an ultra-simplified example of all the major pieces of a modern compiler -written in easy to read JavaScript. +--- + +This is an ultra-simplified example of all the major pieces of a modern +compiler written in easy to read JavaScript. 
Reading through the guided code will help you learn about how *most* compilers work from end to end. -### [Want to jump into the code? Click here](the-super-tiny-compiler.js) - -### [You can also check it out on Glitch](https://the-super-tiny-compiler.glitch.me/) - ---- - ### Why should I care? -That's fair, most people don't really have to think about compilers in their day -jobs. However, compilers are all around you, tons of the tools you use are based -on concepts borrowed from compilers. +That's fair, most people don't really have to think about compilers in their +day jobs. However, compilers are all around you, tons of the tools you use are +based on concepts borrowed from compilers. These are really useful concepts to +have at your disposal. ### But compilers are scary! @@ -27,22 +24,14 @@ taken something that is reasonably straightforward and made it so scary that most think of it as this totally unapproachable thing that only the nerdiest of the nerds are able to understand. -### Okay so where do I begin? - -Awesome! Head on over to the [the-super-tiny-compiler.js](the-super-tiny-compiler.js) -file. - -### I'm back, that didn't make sense +I've done my best to try and keep this from being a scary experience. I hope +that reading this will be a positive learning experience for you. -Ouch, I'm really sorry. I'm planning on doing a lot more work on this to add -inline annotations. If you want to come back when that's done, you can either -watch/star this repo or follow me on -[twitter](https://twitter.com/thejameskyle) for updates. - -### Tests +### Okay so where do I begin? -Run with `node test.js` +Awesome! Head on over to [**0-introduction.md**](./intro) and then +work your way down the list of files. 
--- -[![cc-by-4.0](https://licensebuttons.net/l/by/4.0/80x15.png)](http://creativecommons.org/licenses/by/4.0/) +[![cc-by-4.0](https://licensebuttons.net/l/by/4.0/80x15.png)](http://creativecommons.org/licenses/by/4.0/) \ No newline at end of file diff --git a/package.json b/package.json index b3df43d..7b25ab7 100644 --- a/package.json +++ b/package.json @@ -1,7 +1,15 @@ { "name": "the-super-tiny-compiler", - "version": "1.0.0", "author": "James Kyle (thejameskyle.com)", "license": "CC-BY-4.0", - "main": "./the-super-tiny-compiler.js" -} + "repository": "thejameskyle/the-super-tiny-compiler", + "dependencies": { + "express": "^4.15.2", + "markdown-it": "^8.3.1", + "ejs": "^2.5.6", + "prismjs": "^9000.0.1" + }, + "scripts": { + "start": "node server.js" + } +} \ No newline at end of file diff --git a/server.js b/server.js new file mode 100644 index 0000000..75b03a6 --- /dev/null +++ b/server.js @@ -0,0 +1,70 @@ +var markdown = require('markdown-it')(); +var Prism = require('prismjs'); +var express = require('express'); +var path = require('path'); +var ejs = require('ejs'); +var fs = require('fs'); + +var app = express(); + +var ROUTES_MAP = { + '/' : 'README.md', + '/intro' : '0-introduction.md', + '/tokenizer' : '1-tokenizer.js', + '/parser' : '2-parser.js', + '/traverser' : '3-traverser.js', + '/transformer' : '4-transformer.js', + '/code-generator' : '5-code-generator.js', + '/compiler' : '6-compiler.js' +}; + +var routes = Object.keys(ROUTES_MAP).map(function(routePath) { + return { + routePath: routePath, + routeName: ROUTES_MAP[routePath] + }; +}); + +function readFile(fileName) { + return fs.readFileSync(path.join(__dirname, fileName)).toString(); +} + +function renderMarkdown(fileContents) { + return markdown.render(fileContents); +} + +function renderJavaScript(fileName, fileContents) { + return Prism.highlight(fileContents, Prism.languages.javascript); +} + +var template = ejs.compile(readFile('./template.html.ejs')); + +function render(routeName) { + var 
fileName = routeName; + var fileContents = readFile(fileName); + + var extName = path.extname(fileName); + if (extName === '.md') fileContents = renderMarkdown(fileContents); + if (extName === '.js') fileContents = renderJavaScript(fileName, fileContents); + + let isCode = extName !== '.md'; + + return template({ + routes: routes, + fileName: fileName, + fileContents: fileContents, + isCode: isCode, + }); +} + +routes.forEach(function(route) { + var html = render(route.routeName); + + app.get(route.routePath, function(req, res) { + res.send(html); + }); +}); + +var listener = app.listen(process.env.PORT, function () { + console.log('Your app is listening on port ' + listener.address().port); +}); diff --git a/template.html.ejs b/template.html.ejs new file mode 100644 index 0000000..9f2c3ed --- /dev/null +++ b/template.html.ejs @@ -0,0 +1,320 @@ + +class="is-code"<% } %>> + + The Super Tiny Compiler - <%= fileName %> + + + + + + + + +
+
+ + /Users/thejameskyle/code/the-super-tiny-compiler/<%= fileName %> + + + + Star this in GitHub + + + + Follow me on Twitter + + + + Remix this in Glitch + +
+ + + +
+ <% if (isCode) { %> +
<%- fileContents %>
+ <% } else { %> +
+ <%- fileContents %> +
+ <% } %> + + <% if (fileName === '6-compiler.js') { %> + Carlton Dance + <% } %> +
+
+ + diff --git a/test.js b/test.js old mode 100755 new mode 100644 index 20d7c5f..618995a --- a/test.js +++ b/test.js @@ -1,16 +1,22 @@ -const { - tokenizer, - parser, - transformer, - codeGenerator, - compiler, -} = require('./the-super-tiny-compiler'); -const assert = require('assert'); +var tokenizer = require('./1-tokenizer'); +var parser = require('./2-parser'); +// Note: The traverser is only used inside of the transformer... +var transformer = require('./4-transformer'); +var codeGenerator = require('./5-code-generator'); +var compiler = require('./6-compiler'); -const input = '(add 2 (subtract 4 2))'; -const output = 'add(2, subtract(4, 2));'; -const tokens = [ +// assert is a Node.js utility for asserting values and throwing an error if +// they aren't what you expect +var assert = require('assert'); + +/** + * Setting up all of the expected values throughout our compiler phases: + */ +var input = '(add 2 (subtract 4 2))'; +var output = 'add(2, subtract(4, 2));'; + +var tokens = [ { type: 'paren', value: '(' }, { type: 'name', value: 'add' }, { type: 'number', value: '2' }, @@ -22,7 +28,7 @@ const tokens = [ { type: 'paren', value: ')' } ]; -const ast = { +var ast = { type: 'Program', body: [{ type: 'CallExpression', @@ -44,7 +50,7 @@ const ast = { }] }; -const newAst = { +var newAst = { type: 'Program', body: [{ type: 'ExpressionStatement', @@ -75,10 +81,16 @@ const newAst = { }] }; -assert.deepStrictEqual(tokenizer(input), tokens, 'Tokenizer should turn `input` string into `tokens` array'); -assert.deepStrictEqual(parser(tokens), ast, 'Parser should turn `tokens` array into `ast`'); -assert.deepStrictEqual(transformer(ast), newAst, 'Transformer should turn `ast` into a `newAst`'); -assert.deepStrictEqual(codeGenerator(newAst), output, 'Code Generator should turn `newAst` into `output` string'); -assert.deepStrictEqual(compiler(input), output, 'Compiler should turn `input` into `output`'); +/** + * Now let's write some assertions to make sure our 
compiler does everything we + * want it to... + */ + +assert.deepStrictEqual( tokenizer(input), tokens, 'Tokenizer should turn `input` string into `tokens` array'); +assert.deepStrictEqual( parser(tokens), ast, 'Parser should turn `tokens` array into `ast`'); +assert.deepStrictEqual( transformer(ast), newAst, 'Transformer should turn `ast` into a `newAst`'); +assert.deepStrictEqual( codeGenerator(newAst), output, 'Code Generator should turn `newAst` into `output` string'); +assert.deepStrictEqual( compiler(input), output, 'Compiler should turn `input` into `output`'); -console.log('All Passed!'); +// If none of the above asserts threw an error... +console.log('All Passed!'); \ No newline at end of file diff --git a/the-super-tiny-compiler.js b/the-super-tiny-compiler.js deleted file mode 100755 index 2bafccc..0000000 --- a/the-super-tiny-compiler.js +++ /dev/null @@ -1,1038 +0,0 @@ -'use strict'; - -/** - * TTTTTTTTTTTTTTTTTTTTTTTHHHHHHHHH HHHHHHHHHEEEEEEEEEEEEEEEEEEEEEE - * T:::::::::::::::::::::TH:::::::H H:::::::HE::::::::::::::::::::E - * T:::::::::::::::::::::TH:::::::H H:::::::HE::::::::::::::::::::E - * T:::::TT:::::::TT:::::THH::::::H H::::::HHEE::::::EEEEEEEEE::::E - * TTTTTT T:::::T TTTTTT H:::::H H:::::H E:::::E EEEEEE - * T:::::T H:::::H H:::::H E:::::E - * T:::::T H::::::HHHHH::::::H E::::::EEEEEEEEEE - * T:::::T H:::::::::::::::::H E:::::::::::::::E - * T:::::T H:::::::::::::::::H E:::::::::::::::E - * T:::::T H::::::HHHHH::::::H E::::::EEEEEEEEEE - * T:::::T H:::::H H:::::H E:::::E - * T:::::T H:::::H H:::::H E:::::E EEEEEE - * TT:::::::TT HH::::::H H::::::HHEE::::::EEEEEEEE:::::E - * T:::::::::T H:::::::H H:::::::HE::::::::::::::::::::E - * T:::::::::T H:::::::H H:::::::HE::::::::::::::::::::E - * TTTTTTTTTTT HHHHHHHHH HHHHHHHHHEEEEEEEEEEEEEEEEEEEEEE - * - * SSSSSSSSSSSSSSS UUUUUUUU UUUUUUUUPPPPPPPPPPPPPPPPP EEEEEEEEEEEEEEEEEEEEEERRRRRRRRRRRRRRRRR - * SS:::::::::::::::SU::::::U U::::::UP::::::::::::::::P E::::::::::::::::::::ER::::::::::::::::R - * 
S:::::SSSSSS::::::SU::::::U U::::::UP::::::PPPPPP:::::P E::::::::::::::::::::ER::::::RRRRRR:::::R - * S:::::S SSSSSSSUU:::::U U:::::UUPP:::::P P:::::PEE::::::EEEEEEEEE::::ERR:::::R R:::::R - * S:::::S U:::::U U:::::U P::::P P:::::P E:::::E EEEEEE R::::R R:::::R - * S:::::S U:::::U U:::::U P::::P P:::::P E:::::E R::::R R:::::R - * S::::SSSS U:::::U U:::::U P::::PPPPPP:::::P E::::::EEEEEEEEEE R::::RRRRRR:::::R - * SS::::::SSSSS U:::::U U:::::U P:::::::::::::PP E:::::::::::::::E R:::::::::::::RR - * SSS::::::::SS U:::::U U:::::U P::::PPPPPPPPP E:::::::::::::::E R::::RRRRRR:::::R - * SSSSSS::::S U:::::U U:::::U P::::P E::::::EEEEEEEEEE R::::R R:::::R - * S:::::S U:::::U U:::::U P::::P E:::::E R::::R R:::::R - * S:::::S U::::::U U::::::U P::::P E:::::E EEEEEE R::::R R:::::R - * SSSSSSS S:::::S U:::::::UUU:::::::U PP::::::PP EE::::::EEEEEEEE:::::ERR:::::R R:::::R - * S::::::SSSSSS:::::S UU:::::::::::::UU P::::::::P E::::::::::::::::::::ER::::::R R:::::R - * S:::::::::::::::SS UU:::::::::UU P::::::::P E::::::::::::::::::::ER::::::R R:::::R - * SSSSSSSSSSSSSSS UUUUUUUUU PPPPPPPPPP EEEEEEEEEEEEEEEEEEEEEERRRRRRRR RRRRRRR - * - * TTTTTTTTTTTTTTTTTTTTTTTIIIIIIIIIINNNNNNNN NNNNNNNNYYYYYYY YYYYYYY - * T:::::::::::::::::::::TI::::::::IN:::::::N N::::::NY:::::Y Y:::::Y - * T:::::::::::::::::::::TI::::::::IN::::::::N N::::::NY:::::Y Y:::::Y - * T:::::TT:::::::TT:::::TII::::::IIN:::::::::N N::::::NY::::::Y Y::::::Y - * TTTTTT T:::::T TTTTTT I::::I N::::::::::N N::::::NYYY:::::Y Y:::::YYY - * T:::::T I::::I N:::::::::::N N::::::N Y:::::Y Y:::::Y - * T:::::T I::::I N:::::::N::::N N::::::N Y:::::Y:::::Y - * T:::::T I::::I N::::::N N::::N N::::::N Y:::::::::Y - * T:::::T I::::I N::::::N N::::N:::::::N Y:::::::Y - * T:::::T I::::I N::::::N N:::::::::::N Y:::::Y - * T:::::T I::::I N::::::N N::::::::::N Y:::::Y - * T:::::T I::::I N::::::N N:::::::::N Y:::::Y - * TT:::::::TT II::::::IIN::::::N N::::::::N Y:::::Y - * T:::::::::T I::::::::IN::::::N N:::::::N YYYY:::::YYYY - * T:::::::::T 
I::::::::IN::::::N N::::::N Y:::::::::::Y - * TTTTTTTTTTT IIIIIIIIIINNNNNNNN NNNNNNN YYYYYYYYYYYYY - * - * CCCCCCCCCCCCC OOOOOOOOO MMMMMMMM MMMMMMMMPPPPPPPPPPPPPPPPP IIIIIIIIIILLLLLLLLLLL EEEEEEEEEEEEEEEEEEEEEERRRRRRRRRRRRRRRRR - * CCC::::::::::::C OO:::::::::OO M:::::::M M:::::::MP::::::::::::::::P I::::::::IL:::::::::L E::::::::::::::::::::ER::::::::::::::::R - * CC:::::::::::::::C OO:::::::::::::OO M::::::::M M::::::::MP::::::PPPPPP:::::P I::::::::IL:::::::::L E::::::::::::::::::::ER::::::RRRRRR:::::R - * C:::::CCCCCCCC::::CO:::::::OOO:::::::OM:::::::::M M:::::::::MPP:::::P P:::::PII::::::IILL:::::::LL EE::::::EEEEEEEEE::::ERR:::::R R:::::R - * C:::::C CCCCCCO::::::O O::::::OM::::::::::M M::::::::::M P::::P P:::::P I::::I L:::::L E:::::E EEEEEE R::::R R:::::R - * C:::::C O:::::O O:::::OM:::::::::::M M:::::::::::M P::::P P:::::P I::::I L:::::L E:::::E R::::R R:::::R - * C:::::C O:::::O O:::::OM:::::::M::::M M::::M:::::::M P::::PPPPPP:::::P I::::I L:::::L E::::::EEEEEEEEEE R::::RRRRRR:::::R - * C:::::C O:::::O O:::::OM::::::M M::::M M::::M M::::::M P:::::::::::::PP I::::I L:::::L E:::::::::::::::E R:::::::::::::RR - * C:::::C O:::::O O:::::OM::::::M M::::M::::M M::::::M P::::PPPPPPPPP I::::I L:::::L E:::::::::::::::E R::::RRRRRR:::::R - * C:::::C O:::::O O:::::OM::::::M M:::::::M M::::::M P::::P I::::I L:::::L E::::::EEEEEEEEEE R::::R R:::::R - * C:::::C O:::::O O:::::OM::::::M M:::::M M::::::M P::::P I::::I L:::::L E:::::E R::::R R:::::R - * C:::::C CCCCCCO::::::O O::::::OM::::::M MMMMM M::::::M P::::P I::::I L:::::L LLLLLL E:::::E EEEEEE R::::R R:::::R - * C:::::CCCCCCCC::::CO:::::::OOO:::::::OM::::::M M::::::MPP::::::PP II::::::IILL:::::::LLLLLLLLL:::::LEE::::::EEEEEEEE:::::ERR:::::R R:::::R - * CC:::::::::::::::C OO:::::::::::::OO M::::::M M::::::MP::::::::P I::::::::IL::::::::::::::::::::::LE::::::::::::::::::::ER::::::R R:::::R - * CCC::::::::::::C OO:::::::::OO M::::::M M::::::MP::::::::P I::::::::IL::::::::::::::::::::::LE::::::::::::::::::::ER::::::R 
R:::::R - * CCCCCCCCCCCCC OOOOOOOOO MMMMMMMM MMMMMMMMPPPPPPPPPP IIIIIIIIIILLLLLLLLLLLLLLLLLLLLLLLLEEEEEEEEEEEEEEEEEEEEEERRRRRRRR RRRRRRR - * - * ======================================================================================================================================================================= - * ======================================================================================================================================================================= - * ======================================================================================================================================================================= - * ======================================================================================================================================================================= - */ - -/** - * Today we're going to write a compiler together. But not just any compiler... A - * super duper teeny tiny compiler! A compiler that is so small that if you - * remove all the comments this file would only be ~200 lines of actual code. - * - * We're going to compile some lisp-like function calls into some C-like - * function calls. - * - * If you are not familiar with one or the other. I'll just give you a quick intro. - * - * If we had two functions `add` and `subtract` they would be written like this: - * - * LISP C - * - * 2 + 2 (add 2 2) add(2, 2) - * 4 - 2 (subtract 4 2) subtract(4, 2) - * 2 + (4 - 2) (add 2 (subtract 4 2)) add(2, subtract(4, 2)) - * - * Easy peezy right? - * - * Well good, because this is exactly what we are going to compile. While this - * is neither a complete LISP or C syntax, it will be enough of the syntax to - * demonstrate many of the major pieces of a modern compiler. - */ - -/** - * Most compilers break down into three primary stages: Parsing, Transformation, - * and Code Generation - * - * 1. *Parsing* is taking raw code and turning it into a more abstract - * representation of the code. - * - * 2. 
*Transformation* takes this abstract representation and manipulates to do - * whatever the compiler wants it to. - * - * 3. *Code Generation* takes the transformed representation of the code and - * turns it into new code. - */ - -/** - * Parsing - * ------- - * - * Parsing typically gets broken down into two phases: Lexical Analysis and - * Syntactic Analysis. - * - * 1. *Lexical Analysis* takes the raw code and splits it apart into these things - * called tokens by a thing called a tokenizer (or lexer). - * - * Tokens are an array of tiny little objects that describe an isolated piece - * of the syntax. They could be numbers, labels, punctuation, operators, - * whatever. - * - * 2. *Syntactic Analysis* takes the tokens and reformats them into a - * representation that describes each part of the syntax and their relation - * to one another. This is known as an intermediate representation or - * Abstract Syntax Tree. - * - * An Abstract Syntax Tree, or AST for short, is a deeply nested object that - * represents code in a way that is both easy to work with and tells us a lot - * of information. 
- * - * For the following syntax: - * - * (add 2 (subtract 4 2)) - * - * Tokens might look something like this: - * - * [ - * { type: 'paren', value: '(' }, - * { type: 'name', value: 'add' }, - * { type: 'number', value: '2' }, - * { type: 'paren', value: '(' }, - * { type: 'name', value: 'subtract' }, - * { type: 'number', value: '4' }, - * { type: 'number', value: '2' }, - * { type: 'paren', value: ')' }, - * { type: 'paren', value: ')' }, - * ] - * - * And an Abstract Syntax Tree (AST) might look like this: - * - * { - * type: 'Program', - * body: [{ - * type: 'CallExpression', - * name: 'add', - * params: [{ - * type: 'NumberLiteral', - * value: '2', - * }, { - * type: 'CallExpression', - * name: 'subtract', - * params: [{ - * type: 'NumberLiteral', - * value: '4', - * }, { - * type: 'NumberLiteral', - * value: '2', - * }] - * }] - * }] - * } - */ - -/** - * Transformation - * -------------- - * - * The next type of stage for a compiler is transformation. Again, this just - * takes the AST from the last step and makes changes to it. It can manipulate - * the AST in the same language or it can translate it into an entirely new - * language. - * - * Let’s look at how we would transform an AST. - * - * You might notice that our AST has elements within it that look very similar. - * There are these objects with a type property. Each of these are known as an - * AST Node. These nodes have defined properties on them that describe one - * isolated part of the tree. - * - * We can have a node for a "NumberLiteral": - * - * { - * type: 'NumberLiteral', - * value: '2', - * } - * - * Or maybe a node for a "CallExpression": - * - * { - * type: 'CallExpression', - * name: 'subtract', - * params: [...nested nodes go here...], - * } - * - * When transforming the AST we can manipulate nodes by - * adding/removing/replacing properties, we can add new nodes, remove nodes, or - * we could leave the existing AST alone and create an entirely new one based - * on it. 
- * - * Since we’re targeting a new language, we’re going to focus on creating an - * entirely new AST that is specific to the target language. - * - * Traversal - * --------- - * - * In order to navigate through all of these nodes, we need to be able to - * traverse through them. This traversal process goes to each node in the AST - * depth-first. - * - * { - * type: 'Program', - * body: [{ - * type: 'CallExpression', - * name: 'add', - * params: [{ - * type: 'NumberLiteral', - * value: '2' - * }, { - * type: 'CallExpression', - * name: 'subtract', - * params: [{ - * type: 'NumberLiteral', - * value: '4' - * }, { - * type: 'NumberLiteral', - * value: '2' - * }] - * }] - * }] - * } - * - * So for the above AST we would go: - * - * 1. Program - Starting at the top level of the AST - * 2. CallExpression (add) - Moving to the first element of the Program's body - * 3. NumberLiteral (2) - Moving to the first element of CallExpression's params - * 4. CallExpression (subtract) - Moving to the second element of CallExpression's params - * 5. NumberLiteral (4) - Moving to the first element of CallExpression's params - * 6. NumberLiteral (2) - Moving to the second element of CallExpression's params - * - * If we were manipulating this AST directly, instead of creating a separate AST, - * we would likely introduce all sorts of abstractions here. But just visiting - * each node in the tree is enough. - * - * The reason I use the word "visiting" is because there is this pattern of how - * to represent operations on elements of an object structure. - * - * Visitors - * -------- - * - * The basic idea here is that we are going to create a “visitor” object that - * has methods that will accept different node types. - * - * var visitor = { - * NumberLiteral() {}, - * CallExpression() {}, - * }; - * - * When we traverse our AST, we will call the methods on this visitor whenever we - * "enter" a node of a matching type. 
- * - * In order to make this useful we will also pass the node and a reference to - * the parent node. - * - * var visitor = { - * NumberLiteral(node, parent) {}, - * CallExpression(node, parent) {}, - * }; - * - * However, there also exists the possibilty of calling things on "exit". Imagine - * our tree structure from before in list form: - * - * - Program - * - CallExpression - * - NumberLiteral - * - CallExpression - * - NumberLiteral - * - NumberLiteral - * - * As we traverse down, we're going to reach branches with dead ends. As we - * finish each branch of the tree we "exit" it. So going down the tree we - * "enter" each node, and going back up we "exit". - * - * -> Program (enter) - * -> CallExpression (enter) - * -> Number Literal (enter) - * <- Number Literal (exit) - * -> Call Expression (enter) - * -> Number Literal (enter) - * <- Number Literal (exit) - * -> Number Literal (enter) - * <- Number Literal (exit) - * <- CallExpression (exit) - * <- CallExpression (exit) - * <- Program (exit) - * - * In order to support that, the final form of our visitor will look like this: - * - * var visitor = { - * NumberLiteral: { - * enter(node, parent) {}, - * exit(node, parent) {}, - * } - * }; - */ - -/** - * Code Generation - * --------------- - * - * The final phase of a compiler is code generation. Sometimes compilers will do - * things that overlap with transformation, but for the most part code - * generation just means take our AST and string-ify code back out. - * - * Code generators work several different ways, some compilers will reuse the - * tokens from earlier, others will have created a separate representation of - * the code so that they can print node linearly, but from what I can tell most - * will use the same AST we just created, which is what we’re going to focus on. 
- * - * Effectively our code generator will know how to “print” all of the different - * node types of the AST, and it will recursively call itself to print nested - * nodes until everything is printed into one long string of code. - */ - -/** - * And that's it! That's all the different pieces of a compiler. - * - * Now that isn’t to say every compiler looks exactly like I described here. - * Compilers serve many different purposes, and they might need more steps than - * I have detailed. - * - * But now you should have a general high-level idea of what most compilers look - * like. - * - * Now that I’ve explained all of this, you’re all good to go write your own - * compilers right? - * - * Just kidding, that's what I'm here to help with :P - * - * So let's begin... - */ - -/** - * ============================================================================ - * (/^▽^)/ - * THE TOKENIZER! - * ============================================================================ - */ - -/** - * We're gonna start off with our first phase of parsing, lexical analysis, with - * the tokenizer. - * - * We're just going to take our string of code and break it down into an array - * of tokens. - * - * (add 2 (subtract 4 2)) => [{ type: 'paren', value: '(' }, ...] - */ - -// We start by accepting an input string of code, and we're gonna set up two -// things... -function tokenizer(input) { - - // A `current` variable for tracking our position in the code like a cursor. - let current = 0; - - // And a `tokens` array for pushing our tokens to. - let tokens = []; - - // We start by creating a `while` loop where we are setting up our `current` - // variable to be incremented as much as we want `inside` the loop. - // - // We do this because we may want to increment `current` many times within a - // single loop because our tokens can be any length. - while (current < input.length) { - - // We're also going to store the `current` character in the `input`. 
- let char = input[current]; - - // The first thing we want to check for is an open parenthesis. This will - // later be used for `CallExpression` but for now we only care about the - // character. - // - // We check to see if we have an open parenthesis: - if (char === '(') { - - // If we do, we push a new token with the type `paren` and set the value - // to an open parenthesis. - tokens.push({ - type: 'paren', - value: '(', - }); - - // Then we increment `current` - current++; - - // And we `continue` onto the next cycle of the loop. - continue; - } - - // Next we're going to check for a closing parenthesis. We do the same exact - // thing as before: Check for a closing parenthesis, add a new token, - // increment `current`, and `continue`. - if (char === ')') { - tokens.push({ - type: 'paren', - value: ')', - }); - current++; - continue; - } - - // Moving on, we're now going to check for whitespace. This is interesting - // because we care that whitespace exists to separate characters, but it - // isn't actually important for us to store as a token. We would only throw - // it out later. - // - // So here we're just going to test for existence and if it does exist we're - // going to just `continue` on. - let WHITESPACE = /\s/; - if (WHITESPACE.test(char)) { - current++; - continue; - } - - // The next type of token is a number. This is different than what we have - // seen before because a number could be any number of characters and we - // want to capture the entire sequence of characters as one token. - // - // (add 123 456) - // ^^^ ^^^ - // Only two separate tokens - // - // So we start this off when we encounter the first number in a sequence. - let NUMBERS = /[0-9]/; - if (NUMBERS.test(char)) { - - // We're going to create a `value` string that we are going to push - // characters to. 
- let value = ''; - - // Then we're going to loop through each character in the sequence until - // we encounter a character that is not a number, pushing each character - // that is a number to our `value` and incrementing `current` as we go. - while (NUMBERS.test(char)) { - value += char; - char = input[++current]; - } - - // After that we push our `number` token to the `tokens` array. - tokens.push({ type: 'number', value }); - - // And we continue on. - continue; - } - - // We'll also add support for strings in our language which will be any - // text surrounded by double quotes ("). - // - // (concat "foo" "bar") - // ^^^ ^^^ string tokens - // - // We'll start by checking for the opening quote: - if (char === '"') { - // Keep a `value` variable for building up our string token. - let value = ''; - - // We'll skip the opening double quote in our token. - char = input[++current]; - - // Then we'll iterate through each character until we reach another - // double quote. - while (char !== '"') { - value += char; - char = input[++current]; - } - - // Skip the closing double quote. - char = input[++current]; - - // And add our `string` token to the `tokens` array. - tokens.push({ type: 'string', value }); - - continue; - } - - // The last type of token will be a `name` token. This is a sequence of - // letters instead of numbers, that are the names of functions in our lisp - // syntax. - // - // (add 2 4) - // ^^^ - // Name token - // - let LETTERS = /[a-z]/i; - if (LETTERS.test(char)) { - let value = ''; - - // Again we're just going to loop through all the letters pushing them to - // a value. - while (LETTERS.test(char)) { - value += char; - char = input[++current]; - } - - // And pushing that value as a token with the type `name` and continuing. - tokens.push({ type: 'name', value }); - - continue; - } - - // Finally if we have not matched a character by now, we're going to throw - // an error and completely exit. 
- throw new TypeError('I dont know what this character is: ' + char); - } - - // Then at the end of our `tokenizer` we simply return the tokens array. - return tokens; -} - -/** - * ============================================================================ - * ヽ/❀o ل͜ o\ノ - * THE PARSER!!! - * ============================================================================ - */ - -/** - * For our parser we're going to take our array of tokens and turn it into an - * AST. - * - * [{ type: 'paren', value: '(' }, ...] => { type: 'Program', body: [...] } - */ - -// Okay, so we define a `parser` function that accepts our array of `tokens`. -function parser(tokens) { - - // Again we keep a `current` variable that we will use as a cursor. - let current = 0; - - // But this time we're going to use recursion instead of a `while` loop. So we - // define a `walk` function. - function walk() { - - // Inside the walk function we start by grabbing the `current` token. - let token = tokens[current]; - - // We're going to split each type of token off into a different code path, - // starting off with `number` tokens. - // - // We test to see if we have a `number` token. - if (token.type === 'number') { - - // If we have one, we'll increment `current`. - current++; - - // And we'll return a new AST node called `NumberLiteral` and setting its - // value to the value of our token. - return { - type: 'NumberLiteral', - value: token.value, - }; - } - - // If we have a string we will do the same as number and create a - // `StringLiteral` node. - if (token.type === 'string') { - current++; - - return { - type: 'StringLiteral', - value: token.value, - }; - } - - // Next we're going to look for CallExpressions. We start this off when we - // encounter an open parenthesis. - if ( - token.type === 'paren' && - token.value === '(' - ) { - - // We'll increment `current` to skip the parenthesis since we don't care - // about it in our AST. 
- token = tokens[++current]; - - // We create a base node with the type `CallExpression`, and we're going - // to set the name as the current token's value since the next token after - // the open parenthesis is the name of the function. - let node = { - type: 'CallExpression', - name: token.value, - params: [], - }; - - // We increment `current` *again* to skip the name token. - token = tokens[++current]; - - // And now we want to loop through each token that will be the `params` of - // our `CallExpression` until we encounter a closing parenthesis. - // - // Now this is where recursion comes in. Instead of trying to parse a - // potentially infinitely nested set of nodes we're going to rely on - // recursion to resolve things. - // - // To explain this, let's take our Lisp code. You can see that the - // parameters of the `add` are a number and a nested `CallExpression` that - // includes its own numbers. - // - // (add 2 (subtract 4 2)) - // - // You'll also notice that in our tokens array we have multiple closing - // parenthesis. - // - // [ - // { type: 'paren', value: '(' }, - // { type: 'name', value: 'add' }, - // { type: 'number', value: '2' }, - // { type: 'paren', value: '(' }, - // { type: 'name', value: 'subtract' }, - // { type: 'number', value: '4' }, - // { type: 'number', value: '2' }, - // { type: 'paren', value: ')' }, <<< Closing parenthesis - // { type: 'paren', value: ')' }, <<< Closing parenthesis - // ] - // - // We're going to rely on the nested `walk` function to increment our - // `current` variable past any nested `CallExpression`. - - // So we create a `while` loop that will continue until it encounters a - // token with a `type` of `'paren'` and a `value` of a closing - // parenthesis. - while ( - (token.type !== 'paren') || - (token.type === 'paren' && token.value !== ')') - ) { - // we'll call the `walk` function which will return a `node` and we'll - // push it into our `node.params`. 
- node.params.push(walk()); - token = tokens[current]; - } - - // Finally we will increment `current` one last time to skip the closing - // parenthesis. - current++; - - // And return the node. - return node; - } - - // Again, if we haven't recognized the token type by now we're going to - // throw an error. - throw new TypeError(token.type); - } - - // Now, we're going to create our AST which will have a root which is a - // `Program` node. - let ast = { - type: 'Program', - body: [], - }; - - // And we're going to kickstart our `walk` function, pushing nodes to our - // `ast.body` array. - // - // The reason we are doing this inside a loop is because our program can have - // `CallExpression` after one another instead of being nested. - // - // (add 2 2) - // (subtract 4 2) - // - while (current < tokens.length) { - ast.body.push(walk()); - } - - // At the end of our parser we'll return the AST. - return ast; -} - -/** - * ============================================================================ - * ⌒(❀>◞౪◟<❀)⌒ - * THE TRAVERSER!!! - * ============================================================================ - */ - -/** - * So now we have our AST, and we want to be able to visit different nodes with - * a visitor. We need to be able to call the methods on the visitor whenever we - * encounter a node with a matching type. - * - * traverse(ast, { - * Program(node, parent) { - * // ... - * }, - * - * CallExpression(node, parent) { - * // ... - * }, - * - * NumberLiteral(node, parent) { - * // ... - * }, - * }); - */ - -// So we define a traverser function which accepts an AST and a -// visitor. Inside we're going to define two functions... -function traverser(ast, visitor) { - - // A `traverseArray` function that will allow us to iterate over an array and - // call the next function that we will define: `traverseNode`. 
- function traverseArray(array, parent) { - array.forEach(child => { - traverseNode(child, parent); - }); - } - - // `traverseNode` will accept a `node` and its `parent` node. So that it can - // pass both to our visitor methods. - function traverseNode(node, parent) { - - // We start by testing for the existence of a method on the visitor with a - // matching `type`. - let methods = visitor[node.type]; - - // If there is an `enter` method for this node type we'll call it with the - // `node` and its `parent`. - if (methods && methods.enter) { - methods.enter(node, parent); - } - - // Next we are going to split things up by the current node type. - switch (node.type) { - - // We'll start with our top level `Program`. Since Program nodes have a - // property named body that has an array of nodes, we will call - // `traverseArray` to traverse down into them. - // - // (Remember that `traverseArray` will in turn call `traverseNode` so we - // are causing the tree to be traversed recursively) - case 'Program': - traverseArray(node.body, node); - break; - - // Next we do the same with `CallExpression` and traverse their `params`. - case 'CallExpression': - traverseArray(node.params, node); - break; - - // In the cases of `NumberLiteral` and `StringLiteral` we don't have any - // child nodes to visit, so we'll just break. - case 'NumberLiteral': - case 'StringLiteral': - break; - - // And again, if we haven't recognized the node type then we'll throw an - // error. - default: - throw new TypeError(node.type); - } - - // If there is an `exit` method for this node type we'll call it with the - // `node` and its `parent`. - if (methods && methods.exit) { - methods.exit(node, parent); - } - } - - // Finally we kickstart the traverser by calling `traverseNode` with our ast - // with no `parent` because the top level of the AST doesn't have a parent. 
- traverseNode(ast, null); -} - -/** - * ============================================================================ - * ⁽(◍˃̵͈̑ᴗ˂̵͈̑)⁽ - * THE TRANSFORMER!!! - * ============================================================================ - */ - -/** - * Next up, the transformer. Our transformer is going to take the AST that we - * have built and pass it to our traverser function with a visitor and will - * create a new ast. - * - * ---------------------------------------------------------------------------- - * Original AST | Transformed AST - * ---------------------------------------------------------------------------- - * { | { - * type: 'Program', | type: 'Program', - * body: [{ | body: [{ - * type: 'CallExpression', | type: 'ExpressionStatement', - * name: 'add', | expression: { - * params: [{ | type: 'CallExpression', - * type: 'NumberLiteral', | callee: { - * value: '2' | type: 'Identifier', - * }, { | name: 'add' - * type: 'CallExpression', | }, - * name: 'subtract', | arguments: [{ - * params: [{ | type: 'NumberLiteral', - * type: 'NumberLiteral', | value: '2' - * value: '4' | }, { - * }, { | type: 'CallExpression', - * type: 'NumberLiteral', | callee: { - * value: '2' | type: 'Identifier', - * }] | name: 'subtract' - * }] | }, - * }] | arguments: [{ - * } | type: 'NumberLiteral', - * | value: '4' - * ---------------------------------- | }, { - * | type: 'NumberLiteral', - * | value: '2' - * | }] - * (sorry the other one is longer.) | } - * | } - * | }] - * | } - * ---------------------------------------------------------------------------- - */ - -// So we have our transformer function which will accept the lisp ast. -function transformer(ast) { - - // We'll create a `newAst` which like our previous AST will have a program - // node. - let newAst = { - type: 'Program', - body: [], - }; - - // Next I'm going to cheat a little and create a bit of a hack. 
We're going to - // use a property named `context` on our parent nodes that we're going to push - // nodes to their parent's `context`. Normally you would have a better - // abstraction than this, but for our purposes this keeps things simple. - // - // Just take note that the context is a reference *from* the old ast *to* the - // new ast. - ast._context = newAst.body; - - // We'll start by calling the traverser function with our ast and a visitor. - traverser(ast, { - - // The first visitor method accepts any `NumberLiteral` - NumberLiteral: { - // We'll visit them on enter. - enter(node, parent) { - // We'll create a new node also named `NumberLiteral` that we will push to - // the parent context. - parent._context.push({ - type: 'NumberLiteral', - value: node.value, - }); - }, - }, - - // Next we have `StringLiteral` - StringLiteral: { - enter(node, parent) { - parent._context.push({ - type: 'StringLiteral', - value: node.value, - }); - }, - }, - - // Next up, `CallExpression`. - CallExpression: { - enter(node, parent) { - - // We start creating a new node `CallExpression` with a nested - // `Identifier`. - let expression = { - type: 'CallExpression', - callee: { - type: 'Identifier', - name: node.name, - }, - arguments: [], - }; - - // Next we're going to define a new context on the original - // `CallExpression` node that will reference the `expression`'s arguments - // so that we can push arguments. - node._context = expression.arguments; - - // Then we're going to check if the parent node is a `CallExpression`. - // If it is not... - if (parent.type !== 'CallExpression') { - - // We're going to wrap our `CallExpression` node with an - // `ExpressionStatement`. We do this because the top level - // `CallExpression` in JavaScript are actually statements. - expression = { - type: 'ExpressionStatement', - expression: expression, - }; - } - - // Last, we push our (possibly wrapped) `CallExpression` to the `parent`'s - // `context`. 
- parent._context.push(expression); - }, - } - }); - - // At the end of our transformer function we'll return the new ast that we - // just created. - return newAst; -} - -/** - * ============================================================================ - * ヾ(〃^∇^)ノ♪ - * THE CODE GENERATOR!!!! - * ============================================================================ - */ - -/** - * Now let's move onto our last phase: The Code Generator. - * - * Our code generator is going to recursively call itself to print each node in - * the tree into one giant string. - */ - -function codeGenerator(node) { - - // We'll break things down by the `type` of the `node`. - switch (node.type) { - - // If we have a `Program` node. We will map through each node in the `body` - // and run them through the code generator and join them with a newline. - case 'Program': - return node.body.map(codeGenerator) - .join('\n'); - - // For `ExpressionStatement` we'll call the code generator on the nested - // expression and we'll add a semicolon... - case 'ExpressionStatement': - return ( - codeGenerator(node.expression) + - ';' // << (...because we like to code the *correct* way) - ); - - // For `CallExpression` we will print the `callee`, add an open - // parenthesis, we'll map through each node in the `arguments` array and run - // them through the code generator, joining them with a comma, and then - // we'll add a closing parenthesis. - case 'CallExpression': - return ( - codeGenerator(node.callee) + - '(' + - node.arguments.map(codeGenerator) - .join(', ') + - ')' - ); - - // For `Identifier` we'll just return the `node`'s name. - case 'Identifier': - return node.name; - - // For `NumberLiteral` we'll just return the `node`'s value. - case 'NumberLiteral': - return node.value; - - // For `StringLiteral` we'll add quotations around the `node`'s value. - case 'StringLiteral': - return '"' + node.value + '"'; - - // And if we haven't recognized the node, we'll throw an error. 
- default: - throw new TypeError(node.type); - } -} - -/** - * ============================================================================ - * (۶* ‘ヮ’)۶” - * !!!!!!!!THE COMPILER!!!!!!!! - * ============================================================================ - */ - -/** - * FINALLY! We'll create our `compiler` function. Here we will link together - * every part of the pipeline. - * - * 1. input => tokenizer => tokens - * 2. tokens => parser => ast - * 3. ast => transformer => newAst - * 4. newAst => generator => output - */ - -function compiler(input) { - let tokens = tokenizer(input); - let ast = parser(tokens); - let newAst = transformer(ast); - let output = codeGenerator(newAst); - - // and simply return the output! - return output; -} - -/** - * ============================================================================ - * (๑˃̵ᴗ˂̵)و - * !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!YOU MADE IT!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! - * ============================================================================ - */ - -// Now I'm just exporting everything... -module.exports = { - tokenizer, - parser, - traverser, - transformer, - codeGenerator, - compiler, -};