Introduction to the prototype Perl6 Compiler
Synopsis
A simple introduction and overview to P6C. This document is only to be thought of as a reference to the general flow of things; to get complete details, refer to the source code. If you ignore this bit now, don't worry, it will be said many more times :)
Objects
There are 3 main types of objects that are used in P6C; its best to become familiar with them before anything else:
- Raw Parse Object
-
Raw parse objects are generated by Parse::RecDescent when it uses the grammar found in P6C/Parser.pm. They contain the raw data from the parse. After the parse, these objects process themselves using their
tree
methods, defined in P6C/Tree.pm - Node Object
-
Nodes are processed Parser objects, and are defined in P6C/Nodes.pm. A tree of these objects is called an op tree. This tree is used in generating IMC code with P6C/IMCC.pm. Names mostly correspond to the rulenames from the grammar, but not entirely. Some are omitted, and two -- ValueList and Register -- are added.
- Context Object
-
Context objects are defined in P6C/Context.pm, and are actually a data member of Node objects (although they are added during the context propagation phase, rather that the parse tree deciphering phase.) Context objects contain the current context information of the associated node.
Flow Control
As a brief overview, the whole process works like this: First, perl6 code is parsed using P6C/Parser.pm, which generates a raw parse tree. The raw tree is then processed by methods found in P6C/Tree.pm, turning the raw parse objects into an optree. Context is propagated onto the optree with P6C/Context.pm, and then evaluated and compiled to intermediate code by imcc
. imcc
then translates...this into parrot assembly code, which is then assembled into bytecode by the parrot assembler and run. Its just that simple.
In more detail
The first thing perl6 will do is start the first pass over your code: parse it. Assuming a normal run (no flags), perl6 will use the precompiled parser, Perl6Grammar.pm If you made any changes to the grammar in P6C/Parser.pm, you'll need to run your code with the --force-grammar flag to recompile it.
The Parser
Before the grammar itself is described, an overview of the parser will be helpful. The parser works by recursively defining a statement down to its most basic levels, and then turning the matched statement into a tree-like structure of objects. For instance, a small program like this:
$x = 4;
$y = $x + 6;
print $y;
Could be visualized as:
Program
scalar expression
operator '='
left => variable
sigil => '$'
varname => 'x'
right => scalar literal
type => number
value => 4
scalar expression
operator '='
left => variable
sigil => '$'
varname => 'y'
right => scalar expression
operator '+'
left => variable
sigil => '$'
varname => 'x'
right => scalar literal
type => number
value => 6
scalar expression
prefix 'print'
args => variable
sigil => '$'
varname => 'y'
Each node on this tree is actually an object in the form of a blessed anyonomous array, created as a result of the grammar's autoaction. It holds the raw data from the parse; one element per production.
Next, on to the grammar. As stated earlier, the grammar is written in Parse::RecDescent. For almost all cases, rules should be written without an action as the final production, so that the autoaction that turns the match into an object can be applied. To get a real idea of whats going on, you should read through the grammar; its long, but most rules are short. The base rule for the grammar is prog
; but you can probably start at stmt
unless you care about error handling and such like that. A few notes about the grammar in general:
- "Want" Rules Keywords and prefix operators use special "want_for" rules to match their correct definition. This is for modularity purposes. The prefix rule will match a keyword, and then do a hash lookup to find its corresponding "want_for" rule. For instance, if the "grep" keyword is found, then the lookup will then attempt to match the "want_for_grep" rule. The "want_for" rules match the "body" of the corresponding rule. Many builtin functions and unary operators will need their own individual "want_rule" for the parser to be accurate.
- Regex as Speed Optimizations Simple rules have been redefined as compiled regular expressions for speed purposes. These include all of the binary operators, as well as the numerical constant, error flusher, and special string delimiter.
- Special Parser Functions A few notes about some of the functions in the parser:
add_class
adds user-defined classes to the parser, andadd_function
does the same for functions.add_context
is used byadd_function
to create the resulting "want_for" rule for the function (to handle prototypes if it has them), as well as generating the tree deciphering method (see "Deciphering the Object Tree" for more details). Finally,got_err
handles error messages.
Deciphering the Object Tree
Once the parse is complete, the object tree is deciphered by a depth-first recursive traversal of the parse tree using the tree
methods found in Tree. While there is a matching "tree" method for nearly every single rule defined in Parser, there is no way to "generate" the tree methods as they need to handle the data very specifically. However, the general gist of what they do is like this:
- Remove junk data from the raw object Removing junk amounts to removing the syntax that won't really matter to the compiler. For instance: removing the "," from argument lists, removing special quote delimeters, or ignoring unnecessary whitespace. None of this syntax really matters internally, so it is thrown away.
- Turn the data into a processed node for "base" types If the object is a "base" type (i.e has a corresponding node type in P6C/Nodes.pm), the proper information is extracted from the parse object and used to create and return a new node of that type.
- Otherwise, call tree on the remaining sub-objects Calls the
tree
methods of its members, gathers up the return values, and builds a branch of the optree.
Propagating Context
Propagating Context is another way of saying "Figuring out what kind of value a node is expected to yield during compilition." Context information is added to a node by adding another member, ctx
, to the node, and setting its value equal to a context object. Context objects are defined in P6C/Context.pm, which also contains the functions for figuring out context. P6C/Addcontext.pm defines the ctx_left
and <ctx_right> methods for each Node type (which deal for lvalue and rvalue context, respectively), which define how each Node type propagates its context.
Compiling to IMC
Once the parse tree has been post-processed, the next stage is to travel through the op tree and compile it to IMC code. Each node type has a val
method that will be called by its parent node. This method gathers values from child nodes (by calling their val
methods), emits the code for the node's operation (using P6C/IMCC.pm::code
), and then returns the proper type for its context. Code is appended to the current function in the order in which it is generated, so subnodes have to be evaluated in the proper order. Things acting as lvalues need to define an assign
function. Currently only variables, variable declarations, and the ternary operator do so, but other things will need to as well at some point. Finally, regex nodes actually use an rx_val
function instead, which behaves differently, but serves basically the same purpose.
The compiler is very large, but there are a few things to note:
- scope
-
The compiler maintains a "current function" in which code is emitted, locals are declared, and symbol lookups begin.
P6C/IMCC.pm::code($x)
will add the IMC code $x to the current function. Scope can futher be manipulated "inside" a function by usingpush_scope
, which increases the depth of the scope by one, andpop_scope
, which lowers it by one. - symbol table
-
The compiler also maintains a primitive symbol table. Locally scoped variables (right now, meaning "function" variables) can be added with
add_localvar
. Global variables can be added withadd_globalvar
. Lookups can be done withfindvar
, which first searches the active function's locals, then parameters, and then finally globals. This implementation may change when lexical scoping is fully implemented. - code generation
-
There are many functions that aid in the generation of IMC code; see the P6C/IMCC.pm documentation for more details.
Finishing the process
From this point on, most of the work is done by external parrot processes. First, P6C/IMCC.pm is used to convert the op tree to .imc code. IMCC converts this code to .pasm. The assembler assembles the the .pasm to parrot bytecode, and then finally the bytecode is run by the driver.
Author of this document
Joseph F. Ryan (ryan.311@osu.edu)