Introduction to the prototype Perl6 Compiler

Synopsis

A simple introduction and overview to P6C. This document is only to be thought of as a reference to the general flow of things; to get complete details, refer to the source code. If you ignore this bit now, don't worry, it will be said many more times :)

Objects

There are 3 main types of objects that are used in P6C; it's best to become familiar with them before anything else:

Raw Parse Object

Raw parse objects are generated by Parse::RecDescent when it uses the grammar found in P6C/Parser.pm. They contain the raw data from the parse. After the parse, these objects process themselves using their tree methods, defined in P6C/Tree.pm.

Node Object

Nodes are processed Parser objects, and are defined in P6C/Nodes.pm. A tree of these objects is called an op tree. This tree is used in generating IMC code with P6C/IMCC.pm. Names mostly correspond to the rulenames from the grammar, but not entirely. Some are omitted, and two -- ValueList and Register -- are added.

Context Object

Context objects are defined in P6C/Context.pm, and are actually a data member of Node objects (although they are added during the context propagation phase, rather that the parse tree deciphering phase.) Context objects contain the current context information of the associated node.

Flow Control

As a brief overview, the whole process works like this: First, perl6 code is parsed using P6C/Parser.pm, which generates a raw parse tree. The raw tree is then processed by methods found in P6C/Tree.pm, turning the raw parse objects into an optree. Context is propagated onto the optree with P6C/Context.pm, and then evaluated and compiled to intermediate code by imcc. imcc then translates...this into parrot assembly code, which is then assembled into bytecode by the parrot assembler and run. It's just that simple.

In more detail

The first thing perl6 will do is start the first pass over your code: parse it. Assuming a normal run (no flags), perl6 will use the precompiled parser, Perl6Grammar.pm. If you made any changes to the grammar in P6C/Parser.pm, you'll need to run your code with the --force-grammar flag to recompile it.

The Parser

Before the grammar itself is described, an overview of the parser will be helpful. The parser works by recursively defining a statement down to its most basic levels, and then turning the matched statement into a tree-like structure of objects. For instance, a small program like this:

$x = 4;
$y = $x + 6;
print $y;

Could be visualized as:

Program
   scalar expression
       operator '='
           left  => variable
                       sigil   => '$'
                       varname => 'x'
           right => scalar literal
                       type  => number
                       value => 4
   scalar expression
       operator '='
           left  => variable
                       sigil   => '$'
                       varname => 'y'
           right => scalar expression
                       operator '+'
                           left  => variable
                                       sigil   => '$'
                                       varname => 'x'
                           right => scalar literal
                                       type  => number
                                       value => 6
   scalar expression
       prefix 'print'
           args => variable
                       sigil   => '$'
                       varname => 'y'

Each node on this tree is actually an object in the form of a blessed anonymous array, created as a result of the grammar's autoaction. It holds the raw data from the parse; one element per production.

Next, on to the grammar. As stated earlier, the grammar is written in Parse::RecDescent. For almost all cases, rules should be written without an action as the final production, so that the autoaction that turns the match into an object can be applied. To get a real idea of what's going on, you should read through the grammar; it's long, but most rules are short. The base rule for the grammar is prog, but you can probably start at stmt unless you care about error handling and such like that.

A few notes about the grammar in general:

"Want" Rules

Keywords and prefix operators use special "want_for" rules to match their correct definition. This is for modularity purposes. The prefix rule will match a keyword, and then do a hash lookup to find its corresponding "want_for" rule. For instance, if the "grep" keyword is found, then the lookup will then attempt to match the "want_for_grep" rule. The "want_for" rules match the "body" of the corresponding rule. Many builtin functions and unary operators will need their own individual "want_rule" for the parser to be accurate.

Regex as Speed Optimizations

Simple rules have been redefined as compiled regular expressions for speed purposes. These include all of the binary operators, as well as the numerical constant, error flusher, and special string delimiter.

Special Parser Functions

A few notes about some of the functions in the parser: add_class adds user-defined classes to the parser, and add_function does the same for functions. add_context is used by add_function to create the resulting "want_for" rule for the function (to handle prototypes if it has them), as well as generating the tree deciphering method (see "Deciphering the Object Tree" for more details). Finally, got_err handles error messages.

Deciphering the Object Tree

Once the parse is complete, the object tree is deciphered by a depth-first recursive traversal of the parse tree using the tree methods found in Tree. While there is a matching "tree" method for nearly every single rule defined in Parser, there is no way to "generate" the tree methods as they need to handle the data very specifically. However, the general gist of what they do is like this:

Remove junk data from the raw object

Removing junk amounts to removing the syntax that won't really matter to the compiler. For instance: removing the "," from argument lists, removing special quote delimiters, or ignoring unnecessary whitespace. None of this syntax really matters internally, so it is thrown away.

Turn the data into a processed node for "base" types

If the object is a "base" type (i.e has a corresponding node type in P6C/Nodes.pm), the proper information is extracted from the parse object and used to create and return a new node of that type.

Otherwise, call tree on the remaining sub-objects

Calls the tree methods of its members, gathers up the return values, and builds a branch of the optree.

Propagating Context

Propagating Context is another way of saying "Figuring out what kind of value a node is expected to yield during compilation." Context information is added to a node by adding another member, ctx, to the node, and setting its value equal to a context object. Context objects are defined in P6C/Context.pm, which also contains the functions for figuring out context. P6C/Addcontext.pm defines the ctx_left and <ctx_right> methods for each Node type (which deal for lvalue and rvalue context, respectively), which define how each Node type propagates its context.

Compiling to IMC

Once the parse tree has been post-processed, the next stage is to travel through the op tree and compile it to IMC code. Each node type has a val method that will be called by its parent node. This method gathers values from child nodes (by calling their val methods), emits the code for the node's operation (using P6C/IMCC.pm::code), and then returns the proper type for its context. Code is appended to the current function in the order in which it is generated, so subnodes have to be evaluated in the proper order. Things acting as lvalues need to define an assign function. Currently only variables, variable declarations, and the ternary operator do so, but other things will need to as well at some point. Finally, regex nodes actually use an rx_val function instead, which behaves differently, but serves basically the same purpose.

The compiler is very large, but there are a few things to note:

scope

The compiler maintains a "current function" in which code is emitted, locals are declared, and symbol lookups begin. P6C/IMCC.pm::code($x) will add the IMC code $x to the current function. Scope can further be manipulated "inside" a function by using push_scope, which increases the depth of the scope by one, and pop_scope, which lowers it by one.

symbol table

The compiler also maintains a primitive symbol table. Locally scoped variables (right now, meaning "function" variables) can be added with add_localvar. Global variables can be added with add_globalvar. Lookups can be done with findvar, which first searches the active function's locals, then parameters, and then finally globals. This implementation may change when lexical scoping is fully implemented.

code generation

There are many functions that aid in the generation of IMC code; see the P6C/IMCC.pm documentation for more details.

Finishing the process

From this point on, most of the work is done by external parrot processes. First, P6C/IMCC.pm is used to convert the op tree to .imc code. IMCC converts this code to .pasm. The assembler assembles the the .pasm to parrot bytecode, and then finally the bytecode is run by the driver.

Author of this document

Joseph F. Ryan (ryan.311@osu.edu)

$Id: overview.pod,v 1.2 2003/02/16 21:23:25 scog Exp $