Name

SPVM::Document::Language::Tokenization - Tokenization in the SPVM Language

Description

This document describes the tokenization in the SPVM language.

Tokenization

This section describes lexical analysis in the SPVM Language.

This is called tokenization.

See SPVM::Document::Language::SyntaxParsing about syntax parsing.

Character Encoding

The character encoding of SPVM source code is UTF-8.

If a character is an ASCII character, it must be an ASCII printable character or a space character.

Compilation Errors:

The character encoding of SPVM source code must be UTF-8. Otherwise a compilation error occurs.

If a character is an ASCII character, it must be an ASCII printable character or a space character. Otherwise a compilation error occurs.

Line Terminators

The line terminator is ASCII LF.

When a line terminator appears, the current line number is incremented by 1.

Space Characters

The space characters are ASCII SP, HT, FF, LF.
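The character rules above can be checked before any token is read. The following Python sketch is only an illustration: the name check_source is hypothetical, and it assumes that "space character" in the Character Encoding section means exactly the four characters listed in this section.

```python
# Hypothetical pre-tokenization check, not part of SPVM itself.
SPACE_BYTES = {0x20, 0x09, 0x0C, 0x0A}  # SP, HT, FF, LF


def check_source(data: bytes) -> None:
    """Raise ValueError if the bytes break the encoding rules."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as error:
        raise ValueError("source code must be UTF-8") from error
    for offset, byte in enumerate(data):
        # Bytes >= 0x80 belong to multi-byte UTF-8 sequences and were
        # already validated by the decode step above.
        if byte >= 0x80:
            continue
        printable = 0x21 <= byte <= 0x7E
        if not printable and byte not in SPACE_BYTES:
            raise ValueError(
                f"invalid ASCII character 0x{byte:02X} at offset {offset}"
            )
```

A caller would run check_source on the raw bytes of a file and treat a ValueError as the compilation error described above.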

Word Characters

The word characters are ASCII a-zA-Z, 0-9, _.

Names

This section describes names.

Symbol Name

A symbol name consists of word characters and ::.

It does not contain __.

It does not begin with 0-9.

It does not begin with ::.

It does not end with ::.

It does not contain ::::.

Compilation Errors:

If a symbol name is invalid, a compilation error occurs.

Examples:

# Symbol names
foo
foo_bar2
Foo::Bar

# Invalid symbol names
2foo
foo__bar
::Foo
Foo::
Foo::::Bar
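The symbol name rules can be expressed as one regular expression plus two extra checks. This is a minimal Python sketch; the function name is_symbol_name is hypothetical and the code only mirrors the rules listed in this section.

```python
import re

# Word-character runs joined by single "::" separators. This shape already
# excludes a leading "::", a trailing "::", and "::::".
_SYMBOL_SHAPE = re.compile(r"[A-Za-z0-9_]+(?:::[A-Za-z0-9_]+)*")


def is_symbol_name(name: str) -> bool:
    if _SYMBOL_SHAPE.fullmatch(name) is None:
        return False
    if "__" in name:
        return False
    # The shape check guarantees the first character is an ASCII word character.
    return not name[0].isdigit()
```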

Class Name

A class name is a symbol name.

Each partial name of a class name must begin with an uppercase letter.

Partial names are individual names separated by ::. For example, the partial names of Foo::Bar::Baz are Foo, Bar, and Baz.

Compilation Errors:

If a class name is invalid, a compilation error occurs.

Examples:

# Class names
Foo
Foo::Bar
Foo::Bar::Baz3
Foo_Bar::Baz_Baz

# Invalid class names
Foo::::Bar
Foo::Bar::
Foo__Bar
Foo::bar
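The rule that each partial name begins with an uppercase letter can be checked by splitting on ::. The Python sketch below is illustrative only; is_class_name is a hypothetical name, and the symbol name check is repeated so that the block is self-contained.

```python
import re

_SYMBOL_SHAPE = re.compile(r"[A-Za-z0-9_]+(?:::[A-Za-z0-9_]+)*")


def is_class_name(name: str) -> bool:
    # A class name must first be a valid symbol name.
    if _SYMBOL_SHAPE.fullmatch(name) is None:
        return False
    if "__" in name or name[0].isdigit():
        return False
    # Every partial name must begin with an ASCII uppercase letter.
    return all("A" <= part[0] <= "Z" for part in name.split("::"))
```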

Method Name

A method name is either a symbol name that does not contain ::, or an empty string "".

Method names with the same name as keywords are allowed.

Compilation Errors:

If a method name is invalid, a compilation error occurs.

Examples:

# Method names
FOO
FOO_BAR3
foo
foo_bar
_foo
_foo_bar_

# Invalid method names
foo__bar
3foo

Field Name

A field name is a symbol name without ::.

Field names with the same name as keywords are allowed.

Compilation Errors:

If a field name is invalid, a compilation error occurs.

Examples:

# Field names
FOO
FOO_BAR3
foo
foo_bar
_foo
_foo_bar_

# Invalid field names
foo__bar
3foo
Foo::Bar

Variable Name

A variable name begins with $ and is followed by a symbol name.

The symbol name in a variable name can be surrounded by { and }.

Compilation Errors:

If a variable name is invalid, a compilation error occurs.

If an opening { exists and the closing } does not exist, a compilation error occurs.

Examples:

# Variable names
$name
$my_name
${name}
$Foo::name
$Foo::Bar::name
${Foo::name}

# Invalid variable names
$::name
$name::
$Foo::::name
$my__name
${name
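A variable name can be reduced to its symbol name by removing $ and the optional braces. The following Python sketch shows one way to do that; parse_variable_name and its helper are hypothetical names, and the error messages are not the ones SPVM prints.

```python
import re

_SYMBOL_SHAPE = re.compile(r"[A-Za-z0-9_]+(?:::[A-Za-z0-9_]+)*")


def _is_symbol_name(name: str) -> bool:
    return (
        _SYMBOL_SHAPE.fullmatch(name) is not None
        and "__" not in name
        and not name[0].isdigit()
    )


def parse_variable_name(token: str) -> str:
    """Return the symbol name inside a variable name, or raise ValueError."""
    if not token.startswith("$"):
        raise ValueError("a variable name must begin with $")
    body = token[1:]
    if body.startswith("{"):
        if not body.endswith("}"):
            raise ValueError("the closing } does not exist")
        body = body[1:-1]
    if not _is_symbol_name(body):
        raise ValueError("invalid variable name")
    return body
```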

Class Variable Name

A class variable name is a variable name.

Examples:

# Class variable names
$NAME
$MY_NAME
${NAME}
$FOO::NAME
$FOO::BAR::NAME
${FOO::NAME_BRACE}
$FOO::name

# Invalid class variable names
$::NAME
$NAME::
$FOO::::NAME
$MY__NAME
$3FOO
${NAME

Local Variable Name

A local variable name is a variable name without ::.

Examples:

# Local variable names
$name
$my_name
${name_brace}
$_name
$NAME

# Invalid local variable names
$::name
$name::
$Foo::name
$Foo::::name
$my__name
${name
$3foo

Keywords

The List of Keywords:

alias
allow
as
basic_type_id
break
byte
can
case
cmp
class
compile_type_name
copy
default
die
div_uint
div_ulong
double
dump
elsif
else
enum
eq
eval
eval_error_id
extends
for
float
false
gt
ge
has
if
interface
int
interface_t
isa
isa_error
isweak
is_compile_type
is_type
is_error
is_read_only
args_width
last
length
lt
le
long
make_read_only
my
mulnum_t
method
mod_uint
mod_ulong
mutable
native
ne
next
new
new_string_len
of
our
object
print
private
protected
public
precompile
pointer
return
require
required
rw
ro
say
static
switch
string
short
scalar
true
type_name
undef
unless
unweaken
use
version
void
warn
while
weaken
wo
INIT
__END__
__PACKAGE__
__FILE__
__LINE__

Operator Tokens

The List of Operator Tokens:

!
!=
$
%
&
&&
&=
=
==
^
^=
|
||
|=
-
--
-=
~
@
+
++
+=
*
*=
<
<=
>
>=
<=>
%=
<<
<<=
>>=
>>
>>>
>>>=
.
.=
/
/=
\
(
)
{
}
[
]
;
:
,
->
=>

Comment

Comments have no meaning.

#COMMENT

A comment begins with #.

It is followed by any string COMMENT.

It ends with ASCII LF.

Line directives take precedence over comments.

File directives take precedence over comments.

Examples:

# This is a comment line

Line Directive

A line directive sets the current line number.

#line NUMBER

A line directive begins with #line at the beginning of a line.

It is followed by one or more ASCII SP.

It is followed by NUMBER. NUMBER is a positive 32-bit integer.

It ends with ASCII LF.

The current line number of the source code is set to NUMBER.

Line directives take precedence over comments.

Compilation Errors:

A line directive must begin at the beginning of a line. Otherwise a compilation error occurs.

A line directive must end with "\n". Otherwise a compilation error occurs.

A line directive must have a line number. Otherwise a compilation error occurs.

The line number given to a line directive must be a positive 32-bit integer. Otherwise a compilation error occurs.

Examples:

class MyClass {
  
  static method main : void () {
    
#line 39
    
  }
}
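The format of a line directive is simple enough to check with one regular expression. The Python sketch below is an illustration, not the SPVM implementation: parse_line_directive is a hypothetical name, and the upper bound of 2147483647 assumes that "positive 32-bit integer" means a signed 32-bit value.

```python
import re

# "#line" at the start of the line, one or more SP, digits, then LF.
_LINE_DIRECTIVE = re.compile(r"#line +([0-9]+)\n")

INT32_MAX = 2**31 - 1  # assumption: NUMBER is a signed 32-bit integer


def parse_line_directive(line: str) -> int:
    """Return NUMBER from a line directive, or raise ValueError."""
    match = _LINE_DIRECTIVE.fullmatch(line)
    if match is None:
        raise ValueError("malformed line directive")
    number = int(match.group(1))
    if not 1 <= number <= INT32_MAX:
        raise ValueError("the line number must be a positive 32-bit integer")
    return number
```

The argument is expected to be one complete source line including its terminating LF, so a directive that is indented or that lacks the LF is rejected.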

File Directive

A file directive sets the current file path.

#file "FILE_PATH"

A file directive begins with #file at the beginning of the source code.

It is followed by one or more ASCII SP.

It is followed by ".

It is followed by FILE_PATH. FILE_PATH is a string that represents a file path.

It is closed with ".

It ends with ASCII LF.

The current file path is set to FILE_PATH.

File directives take precedence over comments.

Compilation Errors:

A file directive must begin at the beginning of the source code. Otherwise a compilation error occurs.

A file directive must end with "\n". Otherwise a compilation error occurs.

A file directive must have a file path. Otherwise a compilation error occurs.

The file path of a file directive must be closed with ". Otherwise a compilation error occurs.

Examples:

#file "/path/MyClass.spvm"
class MyClass {

}
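A file directive can be recognized at offset 0 of the source code and removed before the rest is tokenized. This Python sketch is a rough illustration; parse_file_directive is a hypothetical name, and it assumes that FILE_PATH cannot itself contain " or LF.

```python
import re

# '#file', one or more SP, a quoted non-empty path, then LF.
_FILE_DIRECTIVE = re.compile(r'#file +"([^"\n]+)"\n')


def parse_file_directive(source: str):
    """Return (file_path, remaining_source), or raise ValueError."""
    # match() anchors at offset 0, which is the beginning of the source code.
    match = _FILE_DIRECTIVE.match(source)
    if match is None:
        raise ValueError("malformed file directive")
    return match.group(1), source[match.end():]
```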

__END__

If a line begins with __END__ and ends with ASCII LF, that line and all following lines are interpreted as comments.

Examples:

class MyClass {
  
}

__END__

foo
bar
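The effect of __END__ can be pictured as cutting the source at the first line that begins with it. The Python sketch below is illustrative; strip_after_end is a hypothetical name, and it follows the text literally by requiring the line to end with LF.

```python
def strip_after_end(source: str) -> str:
    """Return the source up to the first line that begins with __END__."""
    pos = 0
    while pos < len(source):
        newline = source.find("\n", pos)
        if newline == -1:
            # The last line has no LF, so it cannot be an __END__ line.
            break
        if source.startswith("__END__", pos):
            return source[:pos]
        pos = newline + 1
    return source
```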

POD

POD is a syntax for writing multiline comments. POD has no meaning.

The Beginning of a POD:

=NAME

The beginning of a POD starts with = at the beginning of a line.

It is followed by NAME. NAME is any string that begins with ASCII a-zA-Z.

It ends with ASCII LF.

The End of a POD:

=cut

The end of a POD starts with = at the beginning of a line.

It is followed by cut.

It ends with ASCII LF.

Examples:

=pod

Comment1
Comment2

=cut

=head1

Comment1
Comment2

=cut
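Because POD has no meaning, a tokenizer can skip every line from the beginning of a POD through its =cut line. The following Python sketch shows the idea; strip_pod is a hypothetical name, and the handling of a stray =cut outside a POD is a choice made for this sketch, since the text does not cover that case.

```python
import re

# "=" at the start of a line followed by an ASCII letter begins a POD.
_POD_BEGIN = re.compile(r"=[A-Za-z]")


def strip_pod(source: str) -> str:
    """Return the source with all POD blocks removed."""
    kept = []
    in_pod = False
    for line in source.split("\n"):
        if in_pod:
            if line == "=cut":
                in_pod = False
            continue  # drop every line inside the POD, including =cut
        if _POD_BEGIN.match(line):
            # A stray =cut outside a POD also lands here and opens a block.
            in_pod = True
            continue
        kept.append(line)
    return "\n".join(kept)
```

This sketch drops the POD lines entirely, so line numbers after a POD block would shift; a real tokenizer that reports line numbers would still count the skipped lines.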

Fat Comma

A fat comma is

=>

The fat comma is an alias for a comma ,.

# Comma
["a", "b", "c", "d"]

# Fat Comma
["a" => "b", "c" => "d"]

If the left operand of a fat comma is a symbol name without ::, it is wrapped in " and treated as a string literal.

# foo_bar2 is treated as "foo_bar2"
[foo_bar2 => "Mark"]

["foo_bar2" => "Mark"]

Literals

A literal represents a constant value.

Numeric Literals

A numeric literal represents a constant number.

Integer Literals

An integer literal represents a constant number of an integer type.

Integer Literal Decimal Notation

The integer literal decimal notation represents a number of int type or long type using decimal numbers 0-9.

It can begin with a minus -.

It is followed by one or more of 0-9.

_ can be placed at any position after the first 0-9 as a separator. _ has no meaning.

It can end with the suffix L or l.

If the suffix L or l exists, the return type is long type. Otherwise the return type is int type.

Compilation Errors:

If the return type is int type and the value is greater than the maximum value of int type or less than the minimum value of int type, a compilation error occurs.

If the return type is long type and the value is greater than the maximum value of long type or less than the minimum value of long type, a compilation error occurs.

Examples:

123
-123
123L
123l
123_456_789
-123_456_789L
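The decimal notation and its range checks can be sketched in a few lines of Python. The name parse_decimal_integer is hypothetical. The sketch also assumes that a literal starting with 0 followed by more digits is left to the octal notation, and that int and long are signed 32-bit and 64-bit types.

```python
import re

# Optional minus, digits with optional "_" separators, optional L/l suffix.
# A leading 0 followed by more digits is not matched here (assumed octal).
_DECIMAL = re.compile(r"(-?)(0|[1-9][0-9_]*)([Ll]?)")


def parse_decimal_integer(literal: str):
    """Return (value, type_name), or raise ValueError."""
    match = _DECIMAL.fullmatch(literal)
    if match is None:
        raise ValueError("not an integer literal in decimal notation")
    sign, digits, suffix = match.groups()
    value = int(sign + digits.replace("_", ""))
    bits = 64 if suffix else 32
    if not -(2 ** (bits - 1)) <= value <= 2 ** (bits - 1) - 1:
        raise ValueError("the value is out of range for the return type")
    return value, ("long" if suffix else "int")
```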

Integer Literal Hexadecimal Notation

The integer literal hexadecimal notation represents a number of int type or long type using hexadecimal numbers 0-9a-fA-F.

It can begin with a minus -.

It is followed by 0x or 0X.

It is followed by one or more 0-9a-fA-F. This is called the hexadecimal numbers part.

_ can be placed at any position after 0x or 0X as a separator. _ has no meaning.

It can end with the suffix L or l.

If the suffix L or l exists, the return type is long type. Otherwise the return type is int type.

If the return type is int type, the hexadecimal numbers part is interpreted as an unsigned 32-bit integer, and is converted to a signed 32-bit integer without changing the bits. For example, 0xFFFFFFFF is -1.

If the return type is long type, the hexadecimal numbers part is interpreted as an unsigned 64-bit integer, and is converted to a signed 64-bit integer without changing the bits. For example, 0xFFFFFFFFFFFFFFFFL is -1L.

Compilation Errors:

If the return type is int type and the hexadecimal numbers part is greater than hexadecimal FFFFFFFF, a compilation error occurs.

If the return type is long type and the hexadecimal numbers part is greater than hexadecimal FFFFFFFFFFFFFFFF, a compilation error occurs.

Examples:

0x3b4f
0X3b4f
-0x3F1A
0xDeL
0xFFFFFFFF
0xFF_FF_FF_FF
0xFFFFFFFFFFFFFFFFL
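The conversion "without changing the bits" is a two's complement reinterpretation. The Python sketch below illustrates it for the hexadecimal notation; parse_hex_integer is a hypothetical name, and the sketch deliberately leaves out the leading minus because the text does not spell out how it combines with the converted value.

```python
import re

# 0x or 0X, hexadecimal digits with optional "_" separators, optional suffix.
_HEX = re.compile(r"0[xX]([0-9a-fA-F_]+)([Ll]?)")


def parse_hex_integer(literal: str):
    """Return (signed_value, type_name), or raise ValueError."""
    match = _HEX.fullmatch(literal)
    if match is None:
        raise ValueError("not an integer literal in hexadecimal notation")
    digits = match.group(1).replace("_", "")
    if not digits:
        raise ValueError("the hexadecimal numbers part is missing")
    suffix = match.group(2)
    bits = 64 if suffix else 32
    unsigned = int(digits, 16)
    if unsigned >= 1 << bits:
        raise ValueError("the hexadecimal numbers part is too large")
    # Reinterpret the unsigned bit pattern as a signed integer of the same width.
    signed = unsigned - (1 << bits) if unsigned >= 1 << (bits - 1) else unsigned
    return signed, ("long" if suffix else "int")
```

The octal and binary notations can be handled the same way by changing the prefix, the digit set, and the base passed to int.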

Integer Literal Octal Notation

The integer literal octal notation represents a number of int type or long type using octal numbers 0-7.

It can begin with a minus -.

It is followed by 0.

It is followed by one or more 0-7. This is called the octal numbers part.

_ can be placed at any position after 0 as a separator. _ has no meaning.

It can end with the suffix L or l.

If the suffix L or l exists, the return type is long type. Otherwise the return type is int type.

If the return type is int type, the octal numbers part is interpreted as an unsigned 32-bit integer, and is converted to a signed 32-bit integer without changing the bits. For example, 037777777777 is -1.

If the return type is long type, the octal numbers part is interpreted as an unsigned 64-bit integer, and is converted to a signed 64-bit integer without changing the bits. For example, 01777777777777777777777L is -1L.

Compilation Errors:

If the return type is int type and the octal numbers part is greater than octal 37777777777, a compilation error occurs.

If the return type is long type and the octal numbers part is greater than octal 1777777777777777777777, a compilation error occurs.

Examples:

0755
-0644
0666L
0655_755

Integer Literal Binary Notation

The integer literal binary notation represents a number of int type or long type using binary numbers 0 and 1.

It can begin with a minus -.

It is followed by 0b or 0B.

It is followed by one or more 0 and 1. This is called the binary numbers part.

_ can be placed at any position after 0b or 0B as a separator. _ has no meaning.

It can end with the suffix L or l.

If the suffix L or l exists, the return type is long type. Otherwise the return type is int type.

If the return type is int type, the binary numbers part is interpreted as an unsigned 32-bit integer, and is converted to a signed 32-bit integer without changing the bits. For example, 0b11111111111111111111111111111111 is -1.

If the return type is long type, the binary numbers part is interpreted as an unsigned 64-bit integer, and is converted to a signed 64-bit integer without changing the bits. For example, 0b1111111111111111111111111111111111111111111111111111111111111111L is -1L.

Compilation Errors:

If the return type is int type and the binary numbers part is greater than binary 11111111111111111111111111111111, a compilation error occurs.

If the return type is long type and the binary numbers part is greater than binary 1111111111111111111111111111111111111111111111111111111111111111, a compilation error occurs.

Examples:

0b0101
-0b1010
0b110000L
0b10101010_10101010

Floating Point Literals

A floating point literal represents a floating point number.

Floating Point Literal Decimal Notation

The floating point literal decimal notation represents a number of float type or double type using decimal numbers 0-9.

It can begin with a minus -.

It is followed by one or more 0-9.

_ can be placed at any position after the first 0-9.

It can be followed by a floating point part, an exponent part, or a combination of a floating point part and an exponent part.

[Floating Point Part Begin]

A floating point part begins with a period (.).

It is followed by one or more 0-9.

[Floating Point Part End]

[Exponent Part Begin]

An exponent part begins with e or E.

It can be followed by + or -.

It is followed by one or more 0-9.

[Exponent Part End]

A floating point literal in decimal notation can end with a suffix f, F, d, or D.

If a suffix does not exist, a floating point literal in decimal notation must have a floating point part or an exponent part.

If the suffix f or F exists, the return type is float type. Otherwise the return type is double type.

Compilation Errors:

If the return type is float type, the literal without its suffix must be parsable by the strtof function in the C language. Otherwise, a compilation error occurs.

If the return type is double type, the literal without its suffix must be parsable by the strtod function in the C language. Otherwise, a compilation error occurs.

Examples:

1.32
-1.32
1.32f
1.32F
1.32d
1.32D
1.32e3
1.32e-3
1.32E+3
1.32E-3
1.32e3f
12e7
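The shape of the decimal notation and the suffix rule can be written as one regular expression. This Python sketch only classifies a literal; float_literal_type is a hypothetical name, and the sketch does not reproduce the strtof and strtod checks.

```python
import re

_FLOAT = re.compile(
    r"-?[0-9][0-9_]*"       # optional minus and the leading digits
    r"(\.[0-9]+)?"          # floating point part
    r"([eE][+-]?[0-9]+)?"   # exponent part
    r"([fFdD]?)"            # suffix
)


def float_literal_type(literal: str) -> str:
    """Return "float" or "double", or raise ValueError."""
    match = _FLOAT.fullmatch(literal)
    if match is None:
        raise ValueError("not a floating point literal in decimal notation")
    fraction, exponent, suffix = match.groups()
    if not suffix and fraction is None and exponent is None:
        # Without a suffix, a floating point part or an exponent part is required.
        raise ValueError("this is an integer literal")
    return "float" if suffix in ("f", "F") else "double"
```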

Floating Point Literal Hexadecimal Notation

The floating point literal hexadecimal notation represents a number of float type or double type using hexadecimal numbers 0-9a-fA-F.

It can begin with a minus -.

It is followed by 0x or 0X.

It is followed by one or more 0-9a-fA-F.

_ can be placed at any position after 0x or 0X.

It can be followed by a floating point part, an exponent part, or a combination of a floating point part and an exponent part.

[Floating Point Part Begin]

A floating point part begins with a period (.).

It is followed by one or more 0-9a-fA-F.

[Floating Point Part End]

[Exponent Part Begin]

An exponent part begins with p or P.

It can be followed by + or -.

It is followed by one or more 0-9.

[Exponent Part End]

A floating point literal in hexadecimal notation can end with a suffix f, F, d, or D.

If a suffix does not exist, a floating point literal in hexadecimal notation must have a floating point part or an exponent part.

Compilation Errors:

If the return type is float type, the literal without its suffix must be parsable by the strtof function in the C language. Otherwise, a compilation error occurs.

If the return type is double type, the literal without its suffix must be parsable by the strtod function in the C language. Otherwise, a compilation error occurs.

Examples:

0x3d3d.edp0
0x3d3d.edp3
0x3d3d.edP3
0x3d3d.edP+3
0x3d3d.edP-3f
0x3d3d.edP-3F
0x3d3d.edP-3d
0x3d3d.edP-3D
0x3d3dP+3
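The value of a hexadecimal floating point literal can be worked out with Python's float.fromhex, which accepts the same mantissa and p-exponent form. The sketch below is an illustration under stated assumptions: parse_hex_float is a hypothetical name, the suffix is assumed to select the type as in the decimal notation, and only literals with an exponent part are handled, because f and d are also hexadecimal digits and a suffix is unambiguous only after an exponent.

```python
import re

_HEX_FLOAT = re.compile(
    r"(-?0[xX][0-9a-fA-F_]+(?:\.[0-9a-fA-F]+)?[pP][+-]?[0-9]+)"  # value
    r"([fFdD]?)"                                                   # suffix
)


def parse_hex_float(literal: str):
    """Return (value, type_name), or raise ValueError."""
    match = _HEX_FLOAT.fullmatch(literal)
    if match is None:
        raise ValueError("unsupported hexadecimal floating point literal")
    body, suffix = match.groups()
    # float.fromhex does not accept "_" separators, so they are removed first.
    # The value is computed in double precision; no rounding to float is done.
    value = float.fromhex(body.replace("_", ""))
    return value, ("float" if suffix in ("f", "F") else "double")
```

For example, 0x3d3d.edp0 is 0x3d3d + 0xed/256, which is 15677.92578125.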

Bool Literals

The bool literal represents a bool object.

true

true is the alias for Bool#TRUE.

true

Examples:

# true
my $bool_object_true = true;

false

false is the alias for Bool#FALSE.

false

Examples:

# false
my $bool_object_false = false;

Character Literal

A character literal represents a number of byte type that normally represents an ASCII character.

It begins with '.

It is followed by a printable ASCII character 0x20-0x7e or a character literal escape character.

It ends with '.

The return type is byte type.

Compilation Errors:

If the format of the character literal is invalid, a compilation error occurs.

Character Literal Escape Characters

The List of Character Literal Escape Characters:

Character Literal Escape Characters    Values
\a                                     0x07 BEL
\t                                     0x09 HT
\n                                     0x0A LF
\f                                     0x0C FF
\r                                     0x0D CR
\"                                     0x22 "
\'                                     0x27 '
\\                                     0x5C \
Octal Escape Character                 A number represented by an octal escape character
Hexadecimal Escape Character           A number represented by a hexadecimal escape character

The type of every character literal escape character is byte type.

Examples:

# Character literals
'a'
'x'
'\a'
'\t'
'\n'
'\f'
'\r'
'\"'
'\''
'\\'
' '
'\0'
'\012'
'\377'
'\o{1}'
'\xab'
'\xAB'
'\x0D'
'\x0A'
'\xD'
'\xA'
'\xFF'
'\x{A}'

Octal Escape Character

The octal escape character represents an unsigned 8-bit integer using octal numbers 0-7.

The octal escape character is a part of a string literal and a character literal.

It begins with \0, \1, \2, \3, \4, \5, \6, \7, or \o{.

If it begins with \0, \1, \2, \3, \4, \5, \6, or \7, it is followed by zero to two 0-7.

If it begins with \o{, it is followed by one to three 0-7, and ends with }.

The octal numbers after \ or \o{ are called the octal numbers part.

The octal numbers part is interpreted as an unsigned 8-bit integer, and is converted to a number of byte type without changing the bits.

Compilation Errors:

The octal numbers part must be less than or equal to 377. Otherwise a compilation error occurs.

If an octal escape character begins with \o{, the close } must exist. Otherwise a compilation error occurs.

Examples:

# Octal escape characters
\0
\01
\03
\012
\001
\077
\377
\o{1}
\o{12}

Hexadecimal Escape Character

The hexadecimal escape character represents an unsigned 8-bit integer using hexadecimal numbers 0-9a-fA-F.

The hexadecimal escape character is a part of a string literal and a character literal.

The hexadecimal escape character begins with \x.

It can be followed by {.

It is followed by one or two 0-9a-fA-F. This is called hexadecimal numbers part.

If it contains {, it must be followed by }.

The hexadecimal numbers part is interpreted as an unsigned 8-bit integer, and is converted to a number of byte type without changing the bits.

Compilation Errors:

If the format of the hexadecimal escape character is invalid, a compilation error occurs.

Examples:

# Hexadecimal escape characters
\xab
\xAB
\x0D
\x0A
\xD
\xA
\xFF
\x{A}
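A hexadecimal escape character has two forms, with and without braces, and both hold one or two hexadecimal digits. This Python sketch shows one way to decode them; decode_hex_escape is a hypothetical name, and byte is assumed to be a signed 8-bit type.

```python
import re

# \x followed by one or two hexadecimal digits, optionally inside { }.
_HEX_ESCAPE = re.compile(r"\\x([0-9a-fA-F]{1,2})|\\x\{([0-9a-fA-F]{1,2})\}")


def decode_hex_escape(escape: str) -> int:
    """Return the byte value of a hexadecimal escape, or raise ValueError."""
    match = _HEX_ESCAPE.fullmatch(escape)
    if match is None:
        raise ValueError("malformed hexadecimal escape character")
    value = int(match.group(1) or match.group(2), 16)
    # Keep the bits and reinterpret them as a signed 8-bit integer.
    return value - 256 if value >= 128 else value
```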

String Literal

A string literal represents a constant string.

A string literal begins with ".

It is followed by zero or more UTF-8 characters, string literal escape characters, or variable expansions.

It ends with ".

The return type is string type.

Compilation Errors:

If the format of the string literal is invalid, a compilation error occurs.

Examples:

# String literals
""
"abc";
"あいう"
"hello\tworld\n"
"hello\x0D\x0A"
"hello\xA"
"hello\x{0A}"
"hello\0"
"hello\012"
"hello\377"
"AAA $foo BBB"
"AAA $FOO BBB"
"AAA $$foo BBB"
"AAA $foo->{x} BBB"
"AAA $foo->[3] BBB"
"AAA $foo->{x}[3] BBB"
"AAA $@ BBB"
"\N{U+3042}\N{U+3044}\N{U+3046}"

String Literal Escape Characters

The List of String Literal Escape Characters:

String Literal Escape Characters    Values
\a                                  0x07 BEL
\t                                  0x09 HT
\n                                  0x0A LF
\f                                  0x0C FF
\r                                  0x0D CR
\"                                  0x22 "
\$                                  0x24 $
\'                                  0x27 '
\\                                  0x5C \
Octal Escape Character              A number represented by an octal escape character
Hexadecimal Escape Character        A number represented by a hexadecimal escape character
Unicode Escape Character            Numbers represented by a Unicode escape character
Raw Escape Character                Numbers represented by a raw escape character

The type of every string literal escape character other than the Unicode escape character and the raw escape character is byte type.

The type of each number contained in the Unicode escape character and the raw escape character is byte type.

Unicode Escape Character

The Unicode escape character represents a UTF-8 character.

A UTF-8 character is represented by a Unicode code point written with hexadecimal numbers 0-9a-fA-F.

This is one to four numbers of byte type.

The Unicode escape character is a part of a string literal.

It begins with \N{U+.

It is followed by one or more 0-9a-fA-F. This is called code point part.

It ends with }.

Compilation Errors:

If a code point part is not a Unicode scalar value, a compilation error occurs.

Examples:

# Unicode escape characters

# あ
\N{U+3042}

# い
\N{U+3044}

# う
\N{U+3046}
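A Unicode escape character expands to the UTF-8 bytes of its code point, which is where the "one to four numbers of byte type" come from. The Python sketch below illustrates the conversion; decode_unicode_escape is a hypothetical name.

```python
import re

_UNICODE_ESCAPE = re.compile(r"\\N\{U\+([0-9a-fA-F]+)\}")


def decode_unicode_escape(escape: str) -> bytes:
    """Return the UTF-8 bytes of a Unicode escape, or raise ValueError."""
    match = _UNICODE_ESCAPE.fullmatch(escape)
    if match is None:
        raise ValueError("malformed Unicode escape character")
    code_point = int(match.group(1), 16)
    # A Unicode scalar value is any code point except the surrogate range.
    if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("the code point part is not a Unicode scalar value")
    return chr(code_point).encode("utf-8")
```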

Raw Escape Characters

A raw escape character is an escape character in which \ is interpreted as ASCII \ and the following character is interpreted as itself.

For example, a raw escape character \s is the two ASCII characters \s.

A raw escape character is a part of a string literal.

The List of Raw Escape Characters:

Raw Escape Characters
\!
\#
\%
\&
\(
\)
\*
\+
\,
\-
\.
\/
\:
\;
\<
\=
\>
\?
\@
\A
\B
\D
\G
\H
\K
\N
\P
\R
\S
\V
\W
\X
\Z
\[
\]
\^
\_
\`
\b
\d
\g
\h
\k
\p
\s
\v
\w
\z
\{
\|
\}
\~

Variable Expansion

Variable expansion is a syntax for embedding the following operations into a string literal: getting a local variable, getting a class variable, a dereference, getting a field, getting an array element, and getting the exception variable.

"AAA $foo BBB"
"AAA $FOO BBB"
"AAA $$foo BBB"
"AAA $foo->{x} BBB"
"AAA $foo->[3] BBB"
"AAA $foo->{x}[3] BBB"
"AAA $foo->{x}->[3] BBB"
"AAA $@ BBB"
"AAA ${foo}BBB"

The above code is expanded to the following code.

"AAA " . $foo . " BBB"
"AAA " . $FOO . " BBB"
"AAA " . $$foo . " BBB"
"AAA " . $foo->{x} . " BBB"
"AAA " . $foo->[3] . " BBB"
"AAA " . $foo->{x}[3] . " BBB"
"AAA " . $foo->{x}->[3] . " BBB"
"AAA " . $@ . " BBB"
"AAA " . ${foo} . "BBB"

The operation of getting a field does not contain space characters between { and }.

The index of getting an array element must be a constant integer.

The operation of getting an array element does not contain space characters between [ and ].
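Variable expansion can be pictured as splitting the body of a string literal into text pieces and variable expressions, then joining them with the . operator. The Python sketch below is a simplified illustration, not the SPVM tokenizer: expand is a hypothetical name, the regular expression covers only the forms shown in this section, and every escape character is left untouched inside the text pieces.

```python
import re

_TOKEN = re.compile(
    r"""
    (?P<escape> \\. )                       # any escape pair, e.g. \$ or \n
  | (?P<var>
        \$\$?                               # $ or $$ (dereference)
        (?: \{ [A-Za-z_][A-Za-z0-9_:]* \}   # ${name}
          | [A-Za-z_][A-Za-z0-9_]* (?: :: [A-Za-z_][A-Za-z0-9_]* )*
        )
        (?: -> (?: \{ [A-Za-z0-9_]+ \} | \[ [0-9]+ \] )
            (?: (?:->)? (?: \{ [A-Za-z0-9_]+ \} | \[ [0-9]+ \] ) )*
        )?
      | \$@                                 # the exception variable
    )
    """,
    re.VERBOSE | re.DOTALL,
)


def expand(body: str) -> str:
    """Rewrite the body of a string literal as a concatenation expression."""
    pieces = []
    text_start = 0
    for match in _TOKEN.finditer(body):
        if match.group("var") is None:
            continue  # an escape pair stays inside the surrounding text
        if match.start() > text_start:
            pieces.append('"' + body[text_start:match.start()] + '"')
        pieces.append(match.group("var"))
        text_start = match.end()
    if text_start < len(body) or not pieces:
        pieces.append('"' + body[text_start:] + '"')
    return " . ".join(pieces)
```

A $ that is not followed by a name, such as one at the very end of the body, matches nothing in this pattern and therefore stays in the text piece.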

A $ at the end of a string literal is interpreted as the character $, not as a variable expansion.

# AAA$
"AAA$"

Single-Quoted String Literal

A single-quoted string literal represents a constant string without variable expansions with a few escape characters.

It begins with q'.

It is followed by zero or more UTF-8 characters, or single-quoted string literal escape characters.

It ends with '.

The return type is string type.

Compilation Errors:

A single-quoted string literal must end with '. Otherwise a compilation error occurs.

If the escape character in a single-quoted string literal is invalid, a compilation error occurs.

Examples:

# Single-quoted string literals
q'abc';
q'abc\'\\';
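Since only two escape characters exist in a single-quoted string literal, decoding one is a short loop. The Python sketch below is illustrative; decode_single_quoted is a hypothetical name, and it takes the literal itself without the trailing ; used in the examples.

```python
def decode_single_quoted(literal: str) -> str:
    """Return the string value of a q'...' literal, or raise ValueError."""
    if len(literal) < 3 or not literal.startswith("q'") or not literal.endswith("'"):
        raise ValueError("a single-quoted string literal must be q'...'")
    body = literal[2:-1]
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            # Only \' and \\ are valid escape characters here.
            if i + 1 < len(body) and body[i + 1] in "'\\":
                out.append(body[i + 1])
                i += 2
            else:
                raise ValueError("invalid escape character")
        elif ch == "'":
            raise ValueError("an unescaped ' ends the literal early")
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```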

Single-Quoted String Literal Escape Characters

The List of Single-Quoted String Literal Escape Characters:

Single-Quoted String Literal Escape Characters    Values
\'                                                0x27 '
\\                                                0x5C \

The type of every single-quoted string literal escape character is byte type.

Here Document

A here document represents a constant string in multiple lines without escape characters and variable expansions.

<<'HERE_DOCUMENT_NAME';
LINE1
LINE2
LINEn
HERE_DOCUMENT_NAME

A here document begins with <<'HERE_DOCUMENT_NAME'; and ASCII LF.

HERE_DOCUMENT_NAME is a here document name.

It is followed by a string in multiple lines.

It ends with HERE_DOCUMENT_NAME from the beginning of a line and ASCII LF.

Compilation Errors:

<<'HERE_DOCUMENT_NAME'; must not contain space characters. Otherwise a compilation error occurs.

Examples:

# Here document
my $string = <<'EOS';
Hello
World
EOS
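Reading a here document means finding the first later line that consists of the here document name alone. The Python sketch below shows the idea; read_here_document is a hypothetical name, and it assumes that the resulting string keeps the LF of each body line, which the text does not state.

```python
import re

_HEREDOC_START = re.compile(r"<<'([A-Za-z0-9_]*)';\n")


def read_here_document(source: str, pos: int = 0):
    """Return (string_value, end_offset), or raise ValueError."""
    match = _HEREDOC_START.match(source, pos)
    if match is None:
        raise ValueError("malformed beginning of a here document")
    name = match.group(1)
    if name[:1].isdigit() or "__" in name:
        raise ValueError("invalid here document name")
    terminator = name + "\n"
    body_start = line_start = match.end()
    while line_start < len(source):
        # The terminator must start at the beginning of a line.
        if source.startswith(terminator, line_start):
            return source[body_start:line_start], line_start + len(terminator)
        newline = source.find("\n", line_start)
        if newline == -1:
            break
        line_start = newline + 1
    raise ValueError("the here document is not terminated")
```

The returned end offset lets a caller continue tokenizing right after the terminator line.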

Here Document Name

A here document name consists of a-z, A-Z, _, 0-9.

The length of a here document name is greater than or equal to 0.

A here document name cannot begin with 0-9.

A here document name cannot contain __.

Compilation Errors:

If the format of a here document name is invalid, a compilation error occurs.

See Also

Copyright & License

Copyright (c) 2023 Yuki Kimoto

MIT License