NAME

HeaderParser - A minimal header file parser that can be hooked by other porting scripts.

SYNOPSIS

my $o= HeaderParser->new();
my $lines= $o->parse_fh($fh);

DESCRIPTION

HeaderParser is a tool to parse C preprocessor header files. The tool understands the syntax of preprocessor conditions, and is capable of creating a parse tree of the expressions involved, and normalizing them as well.

C preprocessor files are a bit tricky to parse properly, especially with a "line by line" model. There are three issues that must be dealt with:

Line Continuations

Any line ending in "\\\n" (that is, a backslash followed by a newline) is considered to be part of a longer logical line which continues on the next physical line. Processors should typically replace the "\\\n" with a space when converting to a "real" line.

Comments Acting As A Line Continuation

The rules for header files stipulate that C-style comments are stripped before any other content is processed. This means that comments can serve as a form of line continuation:

#if defined(foo) /*
*/ && defined(bar)

is the same as

#if defined(foo) && defined(bar)

This type of comment usage is often overlooked by people writing header file parsers for the first time.

Indented Preprocessor Directives

It is easy to forget that there may be multiple spaces between the "#" character and the directive. It is also easy to forget that there may be spaces in *front* of the "#" character. Both of these cases are often overlooked.
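The three issues above can be illustrated with a small, self-contained sketch. It is not HeaderParser's implementation: the helper names logical_lines(), normalize() and directive_name() are hypothetical, and the sketch deliberately ignores "//" comments and "/*" sequences inside string literals.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only; not part of HeaderParser.

# Fold physical lines into logical lines. A logical line keeps growing
# while it ends in backslash-newline or contains an unterminated comment.
sub logical_lines {
    my ($text) = @_;
    my @logical;
    my $buf = "";
    for my $phys (split /^/m, $text) {
        $buf .= $phys;
        next if $buf =~ /\\\n\z/;                 # continued by a backslash
        (my $probe = $buf) =~ s{/\*.*?\*/}{ }gs;  # drop complete comments
        next if $probe =~ m{/\*};                 # a comment is still open
        push @logical, $buf;
        $buf = "";
    }
    push @logical, $buf if length $buf;           # unterminated final line
    return @logical;
}

# Turn the raw text of a logical line into a single "real" line.
sub normalize {
    my ($raw) = @_;
    (my $line = $raw) =~ s/\\\n/ /g;   # a continuation becomes a space
    $line =~ s{/\*.*?\*/}{ }gs;        # and so does each comment
    $line =~ s/\s+/ /g;                # collapse runs of whitespace
    $line =~ s/^ | $//g;               # trim both ends
    return $line;
}

# Return the directive name, allowing spaces before and after the "#".
sub directive_name {
    my ($line) = @_;
    return $line =~ /^\s*#\s*(\w+)/ ? $1 : undef;
}

my $text = "#if defined(foo) /*\n*/ && defined(bar)\n"
         . "  #   define Z \\\n    1\n";
print normalize($_), "\n" for logical_lines($text);
```

With the sample text, the comment-continued "#if" collapses to one line and the indented, backslash-continued "#define" becomes "# define Z 1".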

The main idea of this module is to provide a single framework for correctly parsing the content of our header files in a consistent manner. A secondary purpose is to make various tasks we want to do easier, such as normalizing content or preprocessor expressions, or just extracting the real "content" of the file properly.

parse_fh

This function parses a filehandle into a set of lines. Each line is represented by a hash-based object which contains the following fields:

bless {
    cond     => [['defined(a)'],['defined(b)']],
    type     => "content",
    sub_type => undef,
    raw      => $raw_content_of_line,
    line     => $normalized_content_of_line,
    level    => $level,
    source         => $filename_or_string,
    start_line_num => $line_num_for_first_line,
    n_lines        => $line_num - $line_num_for_first_line,
}, "HeaderLine"

A "line" in this context is a logical line. Because of line continuations and comments, it may span more than one physical line and thus contain more than one newline. It will always contain at least one newline and will always end with one (unless there is no newline at the end of the file). Thus

before /*
 this is a comment
*/ after \
and continues

will be treated as a single logical line even though the content is spread over four lines.

cond

An array of arrays containing the normalized expressions of any C preprocessor conditional blocks which include the line. Each line currently has its own copy of the conditions that apply to it, but that may change, so do not alter this data. The inner arrays may contain more than one element. If so, the line is part of an "#else" or "#elif" clause, and the elements should be treated as a conjunction when asking "when is this line included?". Viewed as part of an if/elif/else chain, each added element represents the most recent condition, and the elements before it are the negations of the earlier clauses. The following shows how:

before          /* cond => [ ]                      */
#if A           /* cond => [ ['A'] ]                */
do-a            /* cond => [ ['A'] ]                */
#elif B         /* cond => [ ['!A', 'B'] ]          */
do-b            /* cond => [ ['!A', 'B'] ]          */
#else           /* cond => [ ['!A', '!B'] ]         */
do-c            /* cond => [ ['!A', '!B'] ]         */
# if D          /* cond => [ ['!A', '!B'], ['D'] ]  */
do-d            /* cond => [ ['!A', '!B'], ['D'] ]  */
# endif         /* cond => [ ['!A', '!B'], ['D'] ]  */
#endif          /* cond => [ ['!A', '!B'] ]         */
after           /* cond => [ ]                      */

So in the above we can see how the three clauses of the if produce a single "frame" in the cond array, but that frame "grows" and changes as additional else clauses are added. When an entirely new if block is started (D) it gets its own frame. Each endif includes the clause it terminates.
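The frame bookkeeping described above can be sketched as a small stack machine. This is an illustrative reconstruction from the table, not HeaderParser's code: cond_for_lines() and negate() are hypothetical names, the negation is purely textual, and only already-normalized "#if", "#elif", "#else" and "#endif" directives are recognized.

```perl
use strict;
use warnings;

# Hypothetical sketch; not HeaderParser's implementation.

# Negate a clause textually; anything but a bare word gets parentheses.
sub negate {
    my ($expr) = @_;
    return $expr =~ /^\w+$/ ? "!$expr" : "!($expr)";
}

# Return one "cond" structure (array of frames) per input line.
sub cond_for_lines {
    my (@lines) = @_;
    my (@stack, @conds);
    for my $line (@lines) {
        my ($dir, $expr) =
            $line =~ /^\s*#\s*(if|elif|else|endif)\b\s*(.*?)\s*$/;
        $dir //= "";    # plain content, or a directive we do not track
        die "unbalanced #$dir\n" if $dir =~ /^e/ && !@stack;
        if ($dir eq "if") {
            push @stack, [$expr];              # a new frame
        }
        elsif ($dir eq "elif" or $dir eq "else") {
            my $frame = $stack[-1];
            $frame->[-1] = negate($frame->[-1]);   # previous clause failed
            push @$frame, $expr if $dir eq "elif";
        }
        push @conds, [ map { [@$_] } @stack ]; # each line gets its own copy
        pop @stack if $dir eq "endif";         # endif still sees its frame
    }
    return @conds;
}

my @lines = ("before", "#if A", "do-a", "#elif B", "do-b", "#else",
             "do-c", "# if D", "do-d", "# endif", "#endif", "after");
my @conds = cond_for_lines(@lines);
for my $i (0 .. $#lines) {
    my $shown = join " | ", map { join ",", @$_ } @{ $conds[$i] };
    printf "%-10s %s\n", $lines[$i], $shown;
}
```

Recording the frames before popping on "#endif" is what makes each endif include the clause it terminates.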

type

This value indicates the type of the line. This may be one of the following: 'content', 'cond', 'define', 'include' and 'error'. Several of the types have a sub_type.

sub_type

This value gives more detail on the type of the line where necessary. Not all types have a subtype.

Type    | Sub Type
--------+----------
content | text
        | include
        | define
        | error
cond    | #if
        | #elif
        | #else
        | #endif

Note that there are no '#ifdef' or '#elifndef' or similar expressions. All expressions of that form are normalized into the '#if defined' form to simplify processing.
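As a rough illustration of that normalization, the following sketch rewrites the shorthand directives into their "defined" form. The function name normalize_ifdef() is hypothetical, and HeaderParser's own normalization may well do more than this single substitution.

```perl
use strict;
use warnings;

# Hypothetical helper; not part of HeaderParser.
# Rewrites #ifdef, #ifndef, #elifdef and #elifndef into the equivalent
# "#if defined(...)" or "#elif !defined(...)" form, keeping the indent.
sub normalize_ifdef {
    my ($line) = @_;
    $line =~ s{^(\s*#\s*)(el)?if(n?)def\s+(\w+)}
              { $1 . ($2 // "") . "if " . ($3 ? "!" : "") . "defined($4)" }e;
    return $line;
}

print normalize_ifdef($_), "\n"
    for "#ifdef FOO", "# ifndef BAR", "#elifndef BAZ", "#if defined(X)";
```

Lines that are already in the "#if" form are returned unchanged.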

raw

This is the raw, original text as it was before HeaderParser performed any modifications to it.

line

This is the normalized and modified text after HeaderParser or any callbacks have processed it.

level

This is the "indent level" of a line and corresponds to the number of blocks that the line is within, not including any blocks that might be created by the line itself.

before          /* level => 0 */
#if A           /* level => 0 */
do-a            /* level => 1 */
#elif B         /* level => 0 */
do-b            /* level => 1 */
#else           /* level => 0 */
do-c            /* level => 1 */
# if D          /* level => 1 */
do-d            /* level => 2 */
# endif         /* level => 1 */
#endif          /* level => 0 */
after           /* level => 0 */
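The level values in the table above follow from a simple depth counter. The sketch below is a hypothetical reconstruction (levels() is not a HeaderParser function) and assumes the input is a list of logical lines.

```perl
use strict;
use warnings;

# Hypothetical sketch; not HeaderParser's implementation.
# Returns the "level" of each line: the number of enclosing blocks,
# not counting a block opened or continued by the line itself.
sub levels {
    my (@lines) = @_;
    my @levels;
    my $depth = 0;
    for my $line (@lines) {
        if ($line =~ /^\s*#\s*if/) {              # #if, #ifdef, #ifndef
            push @levels, $depth++;               # record, then open
        }
        elsif ($line =~ /^\s*#\s*el(?:if|se)/) {  # #elif, #else
            die "unbalanced directive\n" unless $depth;
            push @levels, $depth - 1;             # same level as its #if
        }
        elsif ($line =~ /^\s*#\s*endif\b/) {
            die "unbalanced #endif\n" unless $depth;
            push @levels, --$depth;               # close, then record
        }
        else {
            push @levels, $depth;                 # content inside the blocks
        }
    }
    return @levels;
}

my @lines = ("before", "#if A", "do-a", "#elif B", "do-b", "#else",
             "do-c", "# if D", "do-d", "# endif", "#endif", "after");
print join(" ", levels(@lines)), "\n";
```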

parse_fh() will throw an exception if it encounters a malformed expression or input it cannot handle.

lines_as_str

This function will return a string representation of the lines it is provided.

group_content

This function will group the text in the file by the conditions which contain it. This is only useful for files where the content is essentially a list and where changing the order that lines are output in will not break the resulting file.

Each content line will be grouped into a structure of nested if/else blocks (elif will produce a new nested block) such that the content under the control of a given set of normalized condition clauses is grouped together in the order it occurred in the file, and each combined conditional clause is output only once.

This means a file like this:

#if A
A
#elif K
AK
#else
ZA
#endif
#if B && Q
B
#endif
#if Q && B
BC
#endif
#if A
AD
#endif
#if !A
ZZ
#endif

will end up looking roughly like this:

#if A
A
AD
#else
ZZ
# if K
AK
# else
ZA
# endif
#endif
#if B && Q
B
BC
#endif

Content at a given block level always goes before conditional clauses at the same nesting level.
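The core idea, merging content that sits under equal normalized conditions, can be shown with a much-reduced sketch. It is not how group_content() is implemented: group_flat() is a hypothetical name, it handles only non-nested "#if"/"#endif" blocks without "#elif" or "#else", and it assumes that sorting the "&&" terms is an adequate normalization.

```perl
use strict;
use warnings;

# Hypothetical, much-reduced sketch; not HeaderParser's group_content().
# Groups content lines by normalized condition, in first-seen order, with
# unconditional content (key "") placed first.
sub group_flat {
    my (@lines) = @_;
    my %content_for = ("" => []);
    my @order = ("");
    my $key = "";
    for my $line (@lines) {
        if ($line =~ /^\s*#\s*if\s+(.+?)\s*$/) {
            die "nested #if is not handled by this sketch\n" if $key ne "";
            my @terms = split /\s*&&\s*/, $1;
            $key = join " && ", sort @terms;   # "Q && B" becomes "B && Q"
        }
        elsif ($line =~ /^\s*#\s*endif\b/) {
            $key = "";
        }
        elsif ($line =~ /^\s*#\s*(?:elif|else)\b/) {
            die "#elif and #else are not handled by this sketch\n";
        }
        else {
            push @order, $key unless $content_for{$key};
            push @{ $content_for{$key} }, $line;
        }
    }
    return map { [ $_, $content_for{$_} ] } @order;
}

my @groups = group_flat("#if B && Q", "B", "#endif", "top",
                        "#if Q && B", "BC", "#endif",
                        "#if A", "AD", "#endif");
for my $group (@groups) {
    my ($cond, $content) = @$group;
    print "#if $cond\n" if $cond ne "";
    print "$_\n" for @$content;
    print "#endif\n" if $cond ne "";
}
```

In this sample the two blocks guarded by "B && Q" and "Q && B" are emitted as one block, and the unconditional line "top" comes first.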

HOOKS

There are several hooks available: pre_process_content, post_process_content, and post_process_grouped_content. All of these hooks are called with the HeaderParser object as the first argument. The "process_content" callbacks are called with a line hash as the second argument, and post_process_grouped_content is called with an array of line hashes for the content in that group, so that the array may be modified or sorted. Callbacks called from inside group_content() (that is, post_process_content and post_process_grouped_content) are called with an additional argument containing an array specifying the actual conditional "path" to the content (which may differ somewhat from the data in a line's "cond" property).

These hooks may do what they like, but generally they will modify the "line" property of the line hash to change the final output returned by lines_as_str() or group_content().
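The shape of such callbacks might look like the sketch below. How hooks are registered is not shown in this section, so the callbacks are invoked directly on stand-in line hashes. Passing the group and the conditional path as array references, and the variable names used, are assumptions made for illustration.

```perl
use strict;
use warnings;

# A content hook: receives the parser object and one line hash, and
# edits the "line" field in place (here: strip trailing whitespace).
my $strip_trailing_ws = sub {
    my ($parser, $line_data) = @_;
    $line_data->{line} =~ s/[ \t]+$//mg;
    return;
};

# A grouped-content hook: receives the parser object, the group of line
# hashes (assumed to be an array reference), and the conditional path.
my $sort_group = sub {
    my ($parser, $group, $cond_path) = @_;
    @$group = sort { $a->{line} cmp $b->{line} } @$group;
    return;
};

# Stand-in data; real line hashes come from parse_fh().
my @group = map { +{ type => "content", raw => $_, line => $_ } }
            "b  \n", "a\n";
$strip_trailing_ws->(undef, $_) for @group;
$sort_group->(undef, \@group, []);
print $_->{line} for @group;
```

Both callbacks leave "raw" untouched and change only "line", which matches the convention described above.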

FORMATTING AND INDENTING

HeaderParser tries hard to produce neat and readable output with a consistent style and form. For example:

#if defined(FOO)
# define HAS_FOO
# if defined(BAR)
#   define HAS_FOO_AND_BAR
# else /* !defined(BAR) */
#   define HAS_FOO_NO_BAR
# endif /* !defined(BAR) */
#endif /* defined(FOO) */

HeaderParser uses two-space tab stops for indenting C preprocessor directives. It puts the spaces between the "#" and the directive. The "#" is considered "part" of the indent, even though the space comes after it. This means the first indent level "looks" like one space, and following indents look like two. This should match what a sensible editor would do with two-space tab stops. The indent_chars() method can be used to convert an indent level into a string that contains the appropriate number of spaces to go in between the "#" and the directive.
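The arithmetic behind that rule is small enough to sketch. The function indent_spaces() below is a hypothetical stand-in written from the description above, not the module's indent_chars() method, whose exact behavior is not shown here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for illustration. The "#" occupies the first
# column of a two-space indent, so level N needs 2*N - 1 spaces after it.
sub indent_spaces {
    my ($level) = @_;
    return $level > 0 ? " " x (2 * $level - 1) : "";
}

for my $level (0 .. 2) {
    print "#", indent_spaces($level), "define LEVEL_$level\n";
}
```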

When emitting "#endif", "#elif" and "#else" directives, comments are also emitted to show the conditions that apply. These comments may be wrapped to cover multiple lines. Some effort is made to get these comments to line up visually, but this relies on heuristics which may not always produce the best result.