Marpa::R2::Semantics::Rank(3pm) | User Contributed Perl Documentation | Marpa::R2::Semantics::Rank(3pm)
NAME¶
Marpa::R2::Semantics::Rank - How ranks are computed
Synopsis¶
my $source = <<'END_OF_SOURCE';
:discard ~ ws; ws ~ [\s]+
:default ::= action => ::array
Top ::= List action => main::group
List ::= Item3 rank => 3
List ::= Item2 rank => 2
List ::= Item1 rank => 1
List ::= List Item3 rank => 3
List ::= List Item2 rank => 2
List ::= List Item1 rank => 1
Item3 ::= VAR '=' VAR action => main::concat
Item2 ::= VAR '=' action => main::concat
Item1 ::= VAR action => main::concat
VAR ~ [\w]+
END_OF_SOURCE

my @tests = (
    [ 'a',                 '(a)' ],
    [ 'a = b',             '(a=b)' ],
    [ 'a = b = c',         '(a=)(b=c)' ],
    [ 'a = b = c = d',     '(a=)(b=)(c=d)' ],
    [ 'a = b c = d',       '(a=b)(c=d)' ],
    [ 'a = b c = d e =',   '(a=b)(c=d)(e=)' ],
    [ 'a = b c = d e',     '(a=b)(c=d)(e)' ],
    [ 'a = b c = d e = f', '(a=b)(c=d)(e=f)' ],
);

my $grammar = Marpa::R2::Scanless::G->new( { source => \$source } );
my @results = ();
for my $test (@tests) {
    my ( $input, $output ) = @{$test};
    my $recce = Marpa::R2::Scanless::R->new(
        { grammar => $grammar, ranking_method => 'high_rule_only' } );
    $recce->read( \$input );
    my $value_ref = $recce->value();
    if ( not defined $value_ref ) {
        die 'No parse';
    }
    push @results, ${$value_ref};
}
sub flatten {
    my ($array) = @_;
    # say STDERR 'flatten arg: ', Data::Dumper::Dumper($array);
    my $ref = ref $array;
    return [$array] if $ref ne 'ARRAY';
    my @flat = ();
    ELEMENT: for my $element ( @{$array} ) {
        my $ref = ref $element;
        if ( $ref ne 'ARRAY' ) {
            push @flat, $element;
            next ELEMENT;
        }
        my $flat_piece = flatten($element);
        push @flat, @{$flat_piece};
    }
    return \@flat;
}

sub concat {
    my ( $pp, @args ) = @_;
    # say STDERR 'concat: ', Data::Dumper::Dumper(\@args);
    my $flat = flatten( \@args );
    return join '', @{$flat};
}

sub group {
    my ( $pp, @args ) = @_;
    # say STDERR 'comma_sep args: ', Data::Dumper::Dumper(\@args);
    my $flat = flatten( \@args );
    return join '', map { +'(' . $_ . ')'; } @{$flat};
}
Description¶
This document describes rule ranking. Rule ranking plays a role in parse ordering, which is described in a separate document.
Lexeme locations and fenceposts¶
The lexeme locations are an ordered set of sets of lexemes. Numbering of lexeme locations is 0-based.
It is also useful to have an idea of locations between lexeme locations. This is the classic "fencepost" issue: sometimes you want to count sections of fence, and in other cases it is more convenient to count fenceposts.
Let the lexeme locations be from 0 to "N-1", so that there are "N" of them. If "I" is greater than 0 and less than "N", then lexeme fencepost "I" is before lexeme location "I" and after lexeme location "I-1". Lexeme fencepost 0 is before lexeme location 0. Lexeme fencepost "N" is after lexeme location "N-1".
The above implies that
- If there are "N" lexeme locations, there are "N+1" lexeme fenceposts.
- lexeme location "I" is always between lexeme fencepost "I" and lexeme fencepost "I+1".
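The fencepost arithmetic above can be sketched in a few lines of Perl. This is an illustration only; these helper functions are invented for this document and are not part of the Marpa::R2 API:

```perl
use strict;
use warnings;

# With N lexeme locations, numbered 0 through N-1, there are N+1
# fenceposts, numbered 0 through N.
sub fencepost_count {
    my ($location_count) = @_;
    return $location_count + 1;
}

# Lexeme location I always lies between fencepost I and fencepost I+1.
sub fenceposts_around {
    my ($location) = @_;
    return ( $location, $location + 1 );
}

my ( $before, $after ) = fenceposts_around(3);
print "Location 3 lies between fenceposts $before and $after\n";
```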
Lexeme locations and fenceposts are closely related to G1 locations.
Dotted Rules¶
To understand this document, it is important to understand what a dotted rule is. An acquaintance with dotted rules is also important in understanding Marpa's progress reports. Dotted rules are thoroughly described in the progress report documentation. This section repeats the main ideas from the perspective of this document.
Recall that a rule is a LHS (left hand side) and a RHS (right hand side). The LHS is always exactly one symbol. The RHS is zero or more symbols. Consider the following example of a rule, given in the syntax of Marpa's DSL.
S ::= A B C
Dotted rules are used to track the progress of a parse through a rule. They consist of a rule and a dot position, which marks the point in the RHS which scanning has reached. It is called the dot position because, traditionally, the position is represented by a dot. The symbol before the dot is called the predot symbol.
The following is an example of a dotted rule. (The dot of a dotted rule is not part of Marpa's DSL but, when it is useful for illustration, we will use it in the notation in this document.)
S ::= A B . C
In this rule, B is the predot symbol.
When the dot is after the last symbol of the RHS, the dotted rule is called a completion. Here is the completion for the above rule:
S ::= A B C .
In this completion example, the symbol C is the predot symbol.
When the dot is before the first symbol of the RHS, the rule is called a prediction. Here is the prediction of the rule we've been using for our examples:
S ::= . A B C
In predictions, there is no predot symbol.
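For illustration, a dotted rule can be modeled as a small Perl structure. This sketch is not Marpa's internal representation, and the helper names are invented for this document:

```perl
use strict;
use warnings;

# A dotted rule as a hash: the rule's LHS, its RHS symbols, and the
# dot position, counting the RHS symbols already scanned.
sub dotted_rule {
    my ( $lhs, $rhs, $dot ) = @_;
    return { lhs => $lhs, rhs => $rhs, dot => $dot };
}

# The dot is before the first RHS symbol.
sub is_prediction { my ($dr) = @_; return $dr->{dot} == 0 }

# The dot is after the last RHS symbol.
sub is_completion { my ($dr) = @_; return $dr->{dot} == scalar @{ $dr->{rhs} } }

# The symbol just before the dot; a prediction has no predot symbol.
sub predot {
    my ($dr) = @_;
    return if is_prediction($dr);
    return $dr->{rhs}[ $dr->{dot} - 1 ];
}

# S ::= A B . C
my $dr = dotted_rule( 'S', [qw(A B C)], 2 );
print 'Predot symbol: ', predot($dr), "\n";    # B
```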
Choicepoints¶
Informally, a choicepoint is a place where the parser can decide among one or more parse choices. A choicepoint can be thought of either as a set of parse choices, or as the tuple of the properties which all the parse choices for the choicepoint must have in common.
For a choicepoint to work, all the choices must have enough in common that each of them could be replaced with any other. Thought of as a tuple of properties, a choicepoint is a triple: "(dp, start, current)". In this triple,
- "dp" is a dotted rule. The predot symbol of "dp" is the predot symbol of the choicepoint -- if "dp" is a prediction, there is no predot symbol. The rule of "dp" is the rule of the choicepoint. The dot position of "dp" is the dot position of the choicepoint.
- "start" is the lexeme location where the dotted rule begins. "start" is sometimes called the origin of the choicepoint.
- "current" is the lexeme location corresponding to the dot in "dp". "current" is sometimes called the "current location" of the choicepoint.
The choicepoint is a prediction choicepoint if "dp" is a prediction. The choicepoint is a token choicepoint if it is not a prediction choicepoint and the predot symbol of "dp" is a token symbol. The choicepoint is a rule choicepoint if it is not a prediction choicepoint and the predot symbol of "dp" is the LHS of a rule. (Token symbols are never the LHS of a rule, and vice versa.)
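Using the same illustrative representation of a dotted rule as a hash with "rhs" and "dot" fields (not Marpa's internal form), the three kinds of choicepoint can be distinguished as follows. The "%is_lhs" table and the symbol names are hypothetical:

```perl
use strict;
use warnings;

# Classify a choicepoint from its dotted rule.  %is_lhs records which
# symbols occur as the LHS of some rule; token symbols never do.
sub classify_choicepoint {
    my ( $dp, %is_lhs ) = @_;
    return 'prediction' if $dp->{dot} == 0;
    my $predot = $dp->{rhs}[ $dp->{dot} - 1 ];
    return $is_lhs{$predot} ? 'rule' : 'token';
}

my %is_lhs = ( Top => 1, List => 1 );    # hypothetical grammar symbols
print classify_choicepoint( { rhs => [qw(List VAR)], dot => 0 }, %is_lhs ), "\n";    # prediction
print classify_choicepoint( { rhs => [qw(List VAR)], dot => 1 }, %is_lhs ), "\n";    # rule
print classify_choicepoint( { rhs => [qw(List VAR)], dot => 2 }, %is_lhs ), "\n";    # token
```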
Parse choices¶
As mentioned, a choicepoint can be seen as a set of one or more parse choices. From the point of view of the individual parse trees, the traversal is top-down and left-to-right.
Often, there is only one parse choice. When there is only one parse choice, the choicepoint is said to be trivial. Prediction and token choicepoints are always trivial. If all of the choicepoints of a parse are trivial, the parse is unambiguous.
Every rule choicepoint, and therefore every non-trivial choicepoint, has a set of parse choices associated with it. For a rule choicepoint, each parse choice is a duple: "(predecessor, cause)", where "predecessor" and "cause" are also choicepoints. The first element of the duple is the predecessor choicepoint, or predecessor, of the parse choice. The second element of the duple is the cause choicepoint, or cause, of the parse choice.
Let "res" be a rule choicepoint and let "(pred, cuz)" be one of the parse choices of "res". Then we say that "res" is the result of "cuz" and "pred"; and we say that "cuz" is one of the causes of "res".
The predecessor of a parse choice represents a portion of the parse that "leads up to" the cause. The predecessor of a parse choice plays no role in ranking decisions, and this document will mostly ignore predecessors.
Causes¶
Since it is a choicepoint, a cause choicepoint must also be a triple. Let the result of "cuz" be "res". Let the triple for "cuz" be "(cuz-dp, cuz-origin, cuz-current)". Let the triple for "res" be "(res-dp, res-origin, res-current)".
If "res" is the result of "cuz", then
- "cuz-current" is always the same as "res-current". In other words, the current location of "cuz" is the same as the current location of "res".
- "cuz-origin" is before "res-current". In other words, the origin of "cuz" is before the current location of "res".
- "cuz-origin" is at or after "res-origin". In other words, the origin of "cuz" is at or after the origin of "res". While it is not always properly between "res-origin" and "res-current", "cuz-origin" is always in the range bounded by "res-origin" on one side, and by "res-current" on the other. "cuz-origin" can therefore be thought of as "in the middle" of the parse choice. For this reason, "cuz-origin" is often called the middle location of the parse choice.
- The dotted rule of "cuz" will be a completion.
- The LHS of the dotted rule of "cuz" will be the predot symbol of "res".
Summing up the above, the causes of a rule choicepoint vary only by middle position and rule RHS. The causes of a rule choicepoint always share the same current lexeme location and rule LHS, and their dotted rules are always completions.
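The constraints above can be collected into a single predicate. The following is an illustrative sketch, not Marpa code: choicepoints are modeled as hashes with invented field names "lhs", "rhs", "dot", "origin" and "current":

```perl
use strict;
use warnings;

# Check whether choicepoint $cuz could be a cause for result
# choicepoint $res, following the constraints listed above.
sub is_plausible_cause {
    my ( $cuz, $res ) = @_;
    my $res_predot = $res->{rhs}[ $res->{dot} - 1 ];
    return
         $cuz->{current} == $res->{current}        # same current location
      && $cuz->{origin} < $res->{current}          # origin before result's current
      && $cuz->{origin} >= $res->{origin}          # origin at or after result's origin
      && $cuz->{dot} == scalar @{ $cuz->{rhs} }    # the cause is a completion
      && $cuz->{lhs} eq $res_predot;               # cause's LHS is result's predot symbol
}

# "Top ::= List ." with "List ::= List Item1 ." as a candidate cause.
my $res = { lhs => 'Top',  rhs => ['List'],            dot => 1, origin => 0, current => 3 };
my $cuz = { lhs => 'List', rhs => [ 'List', 'Item1' ], dot => 2, origin => 0, current => 3 };
print is_plausible_cause( $cuz, $res ) ? "plausible\n" : "not plausible\n";
```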
Rank is by Cause¶
The rank of a parse choice is that of its cause. The rank of a cause is the rank of its rule. Therefore, the rank of a parse choice is the rank of the rule of its cause. That last sentence is the most important sentence of this document, so we will repeat it:
The rank of a parse choice is the rank of the rule of its cause.
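As an illustrative sketch (not Marpa's implementation), the effect of the "high_rule_only" ranking method on the parse choices of a single choicepoint can be modeled as follows. The data structures and the rank table are invented for the example, echoing the synopsis grammar:

```perl
use strict;
use warnings;
use List::Util qw(max);

# Each parse choice carries a cause; the cause names a rule; the rule
# has a rank.  Under "high_rule_only", only the choices whose cause's
# rule has the highest rank survive.
sub high_rule_only {
    my ( $rank_of_rule, @choices ) = @_;
    my $top = max map { $rank_of_rule->{ $_->{cause}{rule} } } @choices;
    return grep { $rank_of_rule->{ $_->{cause}{rule} } == $top } @choices;
}

# Hypothetical ranks, echoing the synopsis grammar.
my %rank_of_rule = ( 'List ::= Item3' => 3, 'List ::= Item1' => 1 );
my @survivors = high_rule_only(
    \%rank_of_rule,
    { cause => { rule => 'List ::= Item3' } },
    { cause => { rule => 'List ::= Item1' } },
);
print $survivors[0]{cause}{rule}, "\n";    # List ::= Item3
```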
Motivation¶
By the definition of the previous section, the rule of the choicepoint itself plays no direct role in ranking. Instead, the ranking is based on the rules of one of the choicepoint's child nodes. At first, this may seem unnecessarily roundabout, and even counter-intuitive, but on reflection, it will make sense.
Ranking cannot be based directly on the rank of the choicepoint's rule, because every parse choice for that choicepoint shares the choicepoint's rule. We have not examined the internal structure of predecessors, but ranking is not based on them for the same reason: the rule of a predecessor is always the same as the rule of its result choicepoint, and is therefore the same for every parse choice of the choicepoint.
On the other hand, the causes of parse choices can and often do differ in their rules. Therefore ranking is based on the rules of the causes of a choicepoint.
Note that ranking is only by direct causes. It might reasonably be asked: why not, at least in the case of a tie, look at causes of causes? Or why not resolve ties by looking at causes of predecessors?
Marpa's built-in rule ranking was chosen to be the most powerful system that could be implemented with effectively zero cost. Ranking by direct causes uses only information that is quickly and directly available, so that its runtime cost is probably not measurable.
Complexity was also an issue. If indirect causes are taken into account, we would need to specify which causes, and under what circumstances they are used. Only causes of causes? Or causes of predecessors? Depth-first, or breadth-first? Only in case of ties, or using a more complex metric? To arbitrary depth, or using a traversal that is cut off at some point?
Ranking by direct causes only means the answer to all of these questions is as simple as it can be. Once we get into analyzing the synopsis grammar in detail, the importance of simplicity will become clear.
To be sure, there are apps whose requirements justify extra overhead and extra complexity. For these apps, Marpa's ASFs allow full generality in ranking.
Examples¶
Our examples in this document will look at the ranked grammar in the synopsis, and at minor variations of it.
Longest highest, version 1¶
We will first consider the example as it is given in the synopsis:
:discard ~ ws; ws ~ [\s]+
:default ::= action => ::array
Top ::= List action => main::group
List ::= Item3 rank => 3
List ::= Item2 rank => 2
List ::= Item1 rank => 1
List ::= List Item3 rank => 3
List ::= List Item2 rank => 2
List ::= List Item1 rank => 1
Item3 ::= VAR '=' VAR action => main::concat
Item2 ::= VAR '=' action => main::concat
Item1 ::= VAR action => main::concat
VAR ~ [\w]+
Longest highest ranking
The DSL in the synopsis ranks its items "longest highest". Here "items" are represented by the symbols, "<Item3>", "<Item2>" and "<Item1>". The "longest" choice is considered to be the one with the most lexemes. Working this idea out for this grammar, we see that the items should rank, from highest to lowest: "<Item3>", "<Item2>" and "<Item1>".
The non-trivial choicepoints
Several examples and their results are shown in the synopsis. If you study the grammar, you will see that the non-trivial choicepoints will be for the dotted rules:
Top ::= List .
List ::= List . Item3
List ::= List . Item2
List ::= List . Item1
The above are all of the potential dotted rules for result choicepoints.
The cause choicepoints
In all of these dotted rules, the predot symbol is "<List>". Recall that cause choicepoints are always completions. Since the predot symbol of all the non-trivial choicepoints is "<List>", the dotted rule of a cause of the non-trivial choicepoints may be any of
List ::= Item1 .
List ::= Item2 .
List ::= Item3 .
List ::= List Item1 .
List ::= List Item2 .
List ::= List Item3 .
In fact, if we study the grammar more closely, we will see that the only possible ambiguity is in the sequence of items, and that ambiguities always take the form ""(v=)(v)"" versus ""(v=v)"". Therefore the only causes of non-trivial parse choices are
List ::= Item3 .
List ::= Item1 .
List ::= List Item3 .
List ::= List Item1 .
The first non-trivial choicepoints
Further study of the grammar shows that the first non-trivial choice must be between two parse choices with these cause choicepoints:
List ::= Item3 .
List ::= Item1 .
The rule "List ::= Item3" has rank 3, and it obviously outranks the rule "List ::= Item1", which has rank 1. Our example uses "high_rule_only" ranking, and our example therefore will leave only this choice:
List ::= Item3 .
With only one choice left, the resulting choicepoint becomes trivial, and, as defined above, that remaining choice is "longest", and therefore the correct one for the "longest highest" ranking.
The second and subsequent non-trivial choicepoints
We've looked at only the first non-trivial choice for our example code. Again examining the grammar, we see that the second and subsequent non-trivial choices will all be between parse choices whose causes have these two dotted rules:
List ::= List Item3 .
List ::= List Item1 .
The rule "List ::= List Item3" has rank 3, and it obviously outranks the rule "List ::= List Item1", which has rank 1. Our example uses "high_rule_only" ranking, and our example therefore will leave only this choice:
List ::= List Item3 .
Conclusion
We have now shown that our example will reduce all choicepoints to a single choice, one which is consistent with "longest highest" ranking. Since all choicepoints are reduced to a single choice, the ranked grammar is unambiguous.
This analysis made a lot of unstated assumptions. Below, there is a "Proof of correctness". It deals with this same example, but proceeds much more carefully.
Shortest highest, version 1¶
Here we see the grammar of the synopsis, reworked for a "shortest highest" ranking. "Shortest highest" is the reverse of "longest highest".
:discard ~ ws; ws ~ [\s]+
:default ::= action => ::array
Top ::= List action => main::group
List ::= Item3 rank => 1
List ::= Item2 rank => 2
List ::= Item1 rank => 3
List ::= List Item3 rank => 1
List ::= List Item2 rank => 2
List ::= List Item1 rank => 3
Item3 ::= VAR '=' VAR action => main::concat
Item2 ::= VAR '=' action => main::concat
Item1 ::= VAR action => main::concat
VAR ~ [\w]+
Here are what the results will look like for "shortest highest".
my @tests = (
    [ 'a',                 '(a)' ],
    [ 'a = b',             '(a=)(b)' ],
    [ 'a = b = c',         '(a=)(b=)(c)' ],
    [ 'a = b = c = d',     '(a=)(b=)(c=)(d)' ],
    [ 'a = b c = d',       '(a=)(b)(c=)(d)' ],
    [ 'a = b c = d e =',   '(a=)(b)(c=)(d)(e=)' ],
    [ 'a = b c = d e',     '(a=)(b)(c=)(d)(e)' ],
    [ 'a = b c = d e = f', '(a=)(b)(c=)(d)(e=)(f)' ],
);
The reader who wants an example of a ranking scheme to work out for themselves may find this one suitable. The reasoning will be very similar to that for the "longest highest" example, just above.
Longest highest, version 2¶
The previous examples have shown the rule involved in parse ranking in "spelled out" form. In fact, a more compact form of the grammar can be used, as shown below for "longest highest" ranking.
:discard ~ ws; ws ~ [\s]+
:default ::= action => ::array
Top ::= List action => main::group
List ::= Item rank => 1
List ::= List Item rank => 0
Item ::= VAR '=' VAR rank => 3 action => main::concat
Item ::= VAR '=' rank => 2 action => main::concat
Item ::= VAR rank => 1 action => main::concat
VAR ~ [\w]+
Shortest highest, version 2¶
This is the grammar for "shortest highest", in compact form:
:discard ~ ws; ws ~ [\s]+
:default ::= action => ::array
Top ::= List action => main::group
List ::= Item rank => 0
List ::= List Item rank => 1
Item ::= VAR '=' VAR rank => 1 action => main::concat
Item ::= VAR '=' rank => 2 action => main::concat
Item ::= VAR rank => 3 action => main::concat
VAR ~ [\w]+
Again, the reader looking for simple examples to work out for themselves may want to rework the argument given above for the "spelled out" examples for the compact examples.
Alternatives to Ranking¶
Reimplementation as pure BNF¶
It is generally better, when possible, to write a language as BNF, instead of using ranking. The advantage of using BNF is that you can more readily determine exactly what language it is that you are parsing. Ranked grammars may look easier to analyze at first glance, but the more you look at them, the trickier you realize they are.
These "pure BNF" reimplementations rely on an observation used in the detailed proof of the ranked BNF example: The parse string becomes easier to analyze when you see it as a sequence of "var-bounded" substrings, where "var-bounded" means the substring is delimited by the start of the string, the end of the string, or the fencepost between two "<VAR>" lexemes.
Longest highest as pure BNF
Here is the "longest highest" example, reimplemented as BNF:
:discard ~ ws; ws ~ [\s]+
:default ::= action => ::array
Top ::= Max_Boundeds action => main::group
Top ::= Max_Boundeds Unbounded action => main::group
Top ::= Unbounded action => main::group
Max_Boundeds ::= Max_Bounded+
Max_Bounded ::= Eq_Finals Var_Final3
Max_Bounded ::= Var_Final
Unbounded ::= Eq_Finals
Eq_Finals ::= Eq_Final+
Var_Final ::= Var_Final3 | Var_Final1
Var_Final3 ::= VAR '=' VAR action => main::concat
Eq_Final ::= VAR '=' action => main::concat
Var_Final1 ::= VAR action => main::concat
VAR ~ [\w]+
Shortest highest as pure BNF
We can also reimplement the "shortest highest" example as BNF. One of the advantages of a BNF (re)implementation is that it often clarifies the grammar. For example, in this case, we note that the DSL rule
Var_Final3 ::= VAR '=' VAR action => main::concat
is, in fact, never used. We therefore omit it:
:discard ~ ws; ws ~ [\s]+
:default ::= action => ::array
Top ::= Max_Boundeds action => main::group
Top ::= Max_Boundeds Unbounded action => main::group
Top ::= Unbounded action => main::group
Max_Boundeds ::= Max_Bounded+
Max_Bounded ::= Eq_Finals Var_Final
Max_Bounded ::= Var_Final
Unbounded ::= Eq_Finals
Eq_Finals ::= Eq_Final+
Eq_Final ::= VAR '=' action => main::concat
Var_Final ::= VAR action => main::concat
VAR ~ [\w]+
Abstract Syntax Forests (ASFs)¶
Ranking can also be implemented using Marpa's abstract syntax forests (ASFs). ASFs are likely to be less efficient than built-in ranking, but they are a more general and powerful solution.
Comparison with PEG¶
For those familiar with PEG, Marpa's ranking may seem familiar. In fact, Marpa's ranking can be seen as a "better PEG".
A PEG specification looks like a BNF grammar. It is a common misconception that a PEG specification implements the same language that it would if interpreted as pure, unranked BNF. This is the case only when the grammar is LL(1) -- in other words, rarely in practical use.
Typically, a PEG implementer thinks their problem out in BNF, perhaps ordering the PEG rules to resolve a few of the choices, and then twiddles the PEG specification until the test suite passes. This results in a parser whose behavior is to a large extent unknown.
Marpa with ranking allows a safer form of PEG-style parsing. Both Marpa's DSL and PEG allow the implementer to specify any context-free grammar, but Marpa parses all context-free grammars correctly, while PEG only correctly parses a small subset of the context-free grammars and fakes the rest.
In Marpa, the implementer of a ranked grammar can work systematically. He can first write a "superset grammar", and then "sculpt" it using ranks until he has exactly the language he wants. At all points, the superset grammar will act as a "safety net". That is, ranking or no ranking, Marpa returns no parse trees that are not specified by the BNF grammar.
This "sculpted superset" is more likely to result in a satisfactory solution than the PEG approach. The PEG specification, re-interpreted as pure BNF, will usually only parse a subset of what the implementer needs, and the implementer must use ranks to "stretch" the grammar to describe what he has in mind.
Ranking is not as predictable as specifying BNF. The "safety net" provided by Marpa's "sculpted superset" guards against over-liberal parsing. This is important: over-liberal parsing can be a security loophole, and bugs caused by over-liberal parsing have been the topic of security advisories.
Determining the exact language parsed by a ranked grammar in PEG is extremely hard -- a small specialist literature has looked at this subject. Most implementers don't bother trying -- when the test suite passes, they consider the PEG implementation complete.
Ranking's effect is more straightforward in Marpa. Above, we showed very informally that the example in our synopsis does what it claims to do -- that the parses returned are actually, and only, those desired. A more formal proof is given below.
Since Marpa parses all context-free grammars, and Marpa parses most unambiguous grammars in linear time, Marpa often parses your language efficiently when it is implemented as pure BNF, without rule ranking. As one example, the ranked grammar in the synopsis can be reimplemented as pure unranked BNF. Where it is possible to convert your ranked grammar to a pure BNF grammar, that will usually be the best and safest choice.
Details¶
This section contains additional explanations, not essential to understanding the rest of this document. Often they are formal or mathematical. While some people find these helpful, others find them distracting, which is why they are segregated here.
Earley parsing¶
In this section we assume that the reader is comfortable reading and analyzing BNF. We will often use the basic theorem of Earley parsing. As a reminder, the theorem states that an instance of a dotted rule is present in a parse, if and only if that instance is valid for the input so far. More formally:
(EARLEY): Let "G" be a grammar. Let "Syms" be the set of symbols in "G" and let exactly one of the symbols in "Syms" be distinguished as the "start symbol". Call the "start symbol", "<Start>".
Let "Terms" be a subset of "Syms" called the "terminals" of "G". Let "W" be a string of symbols from "Terms". Let "W[i]" be the "i"'th terminal of "W", so that the first terminal of "W" is "W[1]".
Where "alpha" and "beta" are sequences of symbols in "Syms", let "alpha => beta" mean that "alpha" derives "beta" in grammar "G" in zero or more steps.
In this theorem, let "alpha", "beta", "gamma", "delta" be sequences of zero or more symbols in "Syms".
Theorem: These two statement sets are equivalent:
Statement set 1 is true if and only if we have all of the following:
1a. "eitem" is an instance of a dotted rule, which we will treat as a triple: "dp, origin, current". In the literature, a triple of this form is also called an "Earley item".
1b. "dp" is the dotted rule "<A> ::= beta . gamma".
Statement set 2 is true if and only if we have all of the following:
2a. "alpha => W[1] ... W[origin]"
2b. "beta => W[origin+1] ... W[current]"
2c. "<Start> => alpha <A> delta"
The proof is omitted. It can be found in Jay Earley's thesis, in his original paper, and in Aho and Ullman's 1972 textbook.
Proof of correctness¶
Let "G" be the grammar in the Marpa DSL synopsis. Let "W" be a string, and let "ps-pure" be the set of parses allowed by "G", treated as if it were pure unranked BNF. Let "ps-ranked" be the set of parses allowed by "G" after ranking is taken into account.
We will say string "W" is valid if and only if "ps-pure" is non-empty. We will say that a parse set is "unambiguous" if and only if it contains at most one parse.
In the less formal discussion above, we took an intuitive approach to the idea of "longest highest", one which made the unstated assumption that optimizing for "longest highest" locally would also optimize globally. Here we make our idea of "longest highest" explicit and justify it.
What "longest highest" means locally is reasonably obvious -- if "item3" has more lexemes than "item1", then "item3" is "longer" than "item1". It is less clear what it means globally. Items might overlap, so that optimizing locally might make the parse as a whole suboptimal.
We will define a "longest highest" parse of "W" as one of the parses of "W" with the fewest "items". This is based on the observations that "W" is fixed in length; and that, if items are longer, fewer of them will fit into "W".
More formally, let "p" be a parse, and let "ps" be a parse set. Let Items(p) be the number of items in "p". Let Items(ps) be the minimum number of items of any parse in "ps". Then, if "lh" is a parse in "ps" and if, for every parse "p" in parse set "ps", "Items(lh) <= Items(p)", we say that "lh" is a "longest highest" parse.
To conveniently represent token strings, we will write them as literal strings in which ""v"" stands for a "<VAR>" lexeme and equal signs represent themselves. For example, we might represent the token string
<VAR> = <VAR>
as the string
v=v
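Using this notation, the "fewest items" definition of "longest highest" can be checked by brute force. The following sketch enumerates every factoring of a token string into the items "v", "v=" and "v=v", and keeps a factoring with the fewest items. It is purely illustrative: Marpa never enumerates factorings this way.

```perl
use strict;
use warnings;

# All factorings of a token string (written with 'v' for <VAR> and
# '=' for itself) into the items 'v', 'v=' and 'v=v'.
sub factorings {
    my ($w) = @_;
    return ( [] ) if $w eq '';
    my @result;
    for my $item (qw(v v= v=v)) {
        next unless index( $w, $item ) == 0;
        my $rest = substr( $w, length $item );
        push @result, map { [ $item, @{$_} ] } factorings($rest);
    }
    return @result;
}

# A "longest highest" parse is one with the fewest items.
sub longest_highest {
    my ($w) = @_;
    my ($best) = sort { @{$a} <=> @{$b} } factorings($w);
    return join '', map { "($_)" } @{$best};
}

print longest_highest('v=v=v'), "\n";    # (v=)(v=v)
```

Note that 'v=v=v' corresponds to the synopsis test input 'a = b = c', whose expected output is '(a=)(b=c)'.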
Theorem: If "W" is a valid string, "ps-ranked" contains exactly one parse, and that parse is the "longest highest" parse.
Proof: Recall that a Marpa "high_rule_only" parse first parses according to the pure unranked BNF, and then prunes the parse using ranking.
Part 1: We will first examine "ps-pure", the parse set which results from the pure BNF phase.
(1) "Item": An "item" will mean an instance of one of the symbols "Item1", "Item2" or "Item3". When we want to show a token string's division into items, we will enclose them in parentheses. For example ""(v=)(v=)(v)"" will indicate that the token string is divided into 3 items; and ""(v=)(v=v)"" will represent the same token string divided into 2 items.
(2) "Factoring": Analyzing the grammar "G", we see that it is a sequence of items, and that "G" is ambiguous for a string "W" if and only if that string divides into items in more than one way. We will call a division of "W" into items a "factoring".
(3) "Independence": Let "seq1" and "seq2" be two factorings of "W" into items. Let "first" and "last" be two lexeme locations such that "first <= last".
Let "sub1" be a subsequence of one or more items of "seq1" that starts at "first" and ends at "last". Similarly, let "sub2" be a subsequence of one or more items of "seq2" that starts at "first" and ends at "last". Let "seq3" be the result of replacing "sub1" in "seq1" with "sub2". Then, because "G" is a context free grammar, "seq3" is also a factoring of "W".
As an example, let "W" be "v=vv=v", let "seq1" be "(v=v)(v=v)" and let "seq2" be "(v=)(v)(v=)(v)". Let "first" be 4 and "last" be 6, so that "sub1" is "(v=v)" and "sub2" is "(v=)(v)". Then "seq3" is "(v=v)(v=)(v)". We claimed that "seq3" must be a factoring of "W", and we can see that our claim is correct.
(4): Recall the discussion of lexeme fenceposts. We will say that a lexeme fencepost "fp" is a "factoring barrier" if it is not properly contained in any item. More formally, let First(it1) be the first lexeme location of an item "it1", and let Last(it1) be the last lexeme location of "it1". Let "i" be a lexeme fencepost such that, for every item "it", either "i <= First(it)" or "i > Last(it)". Then lexeme fencepost "i" is a factoring barrier.
(5): Where "|W|" is the length of "W", we can see from (4) that fencepost "|W|" is a factoring barrier.
(6): Also from (4) we see that lexeme fencepost 0 is a factoring barrier.
(7): Let a "subfactoring" be a sequence of items bounded by factoring barriers.
(8): Let a "minimal subfactoring" be a subfactoring which does not properly contain any factoring barriers. From (5) and (6) we can see that "W" can be divided into one or more minimal subfactorings.
(9): We can see from "G" that two "VAR"s never occur together in the same item. This means that wherever we do find two "VAR"s in a row, the lexeme fencepost between them is a factoring barrier.
(10): No item begins with an equal sign (""=""). Every item begins with a "<VAR>" token.
(11): Every minimal subfactoring starts with a "<VAR>" token. This is because every subfactoring starts with an item, and from (10).
(12): Every minimal subfactoring starts with a "<VAR>" token and alternates equal signs and "<VAR>" lexemes: that is, it has the pattern ""v=v=v= ..."". This follows from (9) and (11).
(13): We will call a minimal subfactoring "var-bounded" or simply "bounded" if it starts and ends with a "<VAR>" token. By (12), this means that it has the pattern ""v=v=v= ... v"".
(14): We will call a minimal subfactoring "var-unbounded" or simply "unbounded" if it is not var-bounded. By (12), this means that it has the pattern ""v=v= ... v="".
(15): If a "<Item1>" or "<Item3>" occurs in a minimal subfactoring, it is the last item in that subfactoring.
To show (15), assume for a reductio ad absurdum that there is a minimal subfactoring with a non-final "<Item1>" or "<Item3>". Call that subfactoring "f", and call that item "it1".
(15a): Both "<Item1>" and "<Item3>" end with a "<VAR>" token, so that "it1" ends in a "<VAR>" token.
(15b): Since "it1", by assumption for the reductio, is non-final, there is a next item. Call the next item "it2".
(15c): By (10), all items start with "<VAR>" lexemes, so that "it2" starts with a "<VAR>" token.
(15d): From (15a) and (15c), we know that "it1" ends with a "<VAR>" token and "it2" starts with a "<VAR>" token, so that there are two "<VAR>" lexemes at the fencepost between "it1" and "it2".
(15e): From (9) and (15d), there is a factoring barrier at the fencepost between "it1" and "it2".
(15f): All items, including "it1" and "it2", are of non-zero length so that, from (15e), we know that the fencepost between "it1" and "it2" is properly inside subfactoring "f".
(15g): From (15f) we see that there is a factoring barrier properly inside "f". But "f" is assumed for the reductio to be minimal and therefore cannot properly contain a factoring barrier.
(15h): (15g) shows the reductio, and allows us to conclude that, if "f" contains an "<Item1>" or an "<Item3>", that item must be the final item. This is what we needed to show for (15).
(16): By (3), every minimal subfactoring is independent -- in other words, no ambiguity crosses factoring barriers. So we narrow our consideration of ambiguities to two cases: ambiguities in unbounded minimal subfactorings, and ambiguities in bounded minimal subfactorings.
(17): Every unbounded minimal subfactoring is unambiguous. To show (17), let "u" be an unbounded minimal subfactoring. It follows from the definition of unbounded minimal subfactoring (14), that "u" ends in an equal sign. Both "<Item1>" and "<Item3>" end in "<VAR>" lexemes, so that no "<Item1>" or "<Item3>" can be the last item in "u". Therefore, by (15), there is no "<Item1>" or "<Item3>" item in "u". This means that every item in "u" is an "<Item2>" item. This in turn means that "u" is unambiguous, which shows (17).
(18): Every bounded minimal subfactoring parses either as a sequence of "Item2"'s followed by an "Item1", or as a sequence of "Item2"'s followed by an "Item3". This follows from (1), (13) and (15).
(19): From (16), (17) and (18) it follows that every ambiguity between subfactorings is between bounded subfactorings whose divisions into items take the two forms
    "(v=) ... (v=) (v)"

and

    "(v=) ... (v=v)"
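As a concrete illustration of the two forms above, the item strings for a three-variable bounded subfactoring can be built by hand in plain Perl. These are illustrative strings only, not Marpa output:

```perl
use strict;
use warnings;

# The two possible divisions of a bounded subfactoring into items,
# built by hand for the variables a, b, c.  Illustrative only.
my @vars = qw(a b c);

# Form 1: a sequence of Item2's ("v=") followed by an Item1 ("v")
my $form1 = join '',
    ( map { "($_=)" } @vars[ 0 .. $#vars - 1 ] ),
    "($vars[-1])";

# Form 2: a sequence of Item2's followed by an Item3 ("v=v")
my $form2 = join '',
    ( map { "($_=)" } @vars[ 0 .. $#vars - 2 ] ),
    "($vars[-2]=$vars[-1])";

# $form1 is "(a=)(b=)(c)"; $form2 is "(a=)(b=c)"
```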
Part 2: We have now shown how pure BNF parsing works for the grammar in the synopsis. We next show the consequences of ranking.
(20): A completed instance of "List ::= Item1" can only be found at lexeme location 1. We know this from "G" and (EARLEY).
(21): A completed instance of "List ::= List Item1" can only be found at lexeme location 2 or after. We know this from "G" and (EARLEY).
(22): At most one dotted rule with "Item1" as its last item appears in a cause at any choicepoint. We know this because all choices of a choicepoint must be completions with the same current location, and from (20) and (21).
(23): A completed instance of "List ::= Item3" can only be found at lexeme location 3. We know this from "G" and (EARLEY).
(24): A completed instance of "List ::= List Item3" can only be found at lexeme location 4 or after. We know this from "G" and (EARLEY).
(25): At most one dotted rule with "Item3" as its last item appears in a cause at any choicepoint. We know this because all choices of a choicepoint must be completions with the same current location, and from (23) and (24).
(26): Consider the ambiguity shown in (19) and let "current" be the lexeme fencepost at its end. An item also ends at "current" so from "G" and (EARLEY), we know that there is at least one dotted rule instance whose predot symbol is "List" and whose current location is "current". The choicepoint can have one of these dotted rules:
    Top ::= List .
    List ::= List . Item3
    List ::= List . Item2
    List ::= List . Item1
with location 0 as its origin and "current" as its current location. There may be more than one choicepoint fulfilling these criteria.
Of the choicepoints fulfilling the criteria, we arbitrarily pick one. We call this choicepoint "result". We will see as we proceed that, while the choicepoints may have different dotted rules, our choice of one of them for "result" makes no difference in the ranking logic. The only properties of the "result" that we will use are dot position, predot symbol, and current location. These properties are the same regardless of which choicepoint we picked for "result".
(27): Recall that the "middle location" of a choice is always the origin of its cause. From (26) we see that, for every possible rule in "result", the predot symbol is the first symbol of the dotted rule. This means that the origin of the causes of "result" must be the same as the origin of "result". From (26) we also know that the origin of "result" is always location 0. Therefore the origin of every one of the causes of "result" must be location 0. Therefore every one of the causes of "result" has the same middle location.
(28): From (26) and (27) we know that, if "cuz" is a cause of "result", it has an origin at location 0; its current location is "current"; and its LHS is "List". "cuz" may be any one of
    List ::= Item3 .
    List ::= Item2 .
    List ::= Item1 .
    List ::= List Item3 .
    List ::= List Item2 .
    List ::= List Item1 .
(29): From (19) and (EARLEY) we know that dotted rules ending in "Item2" are not valid choices at location "current".
(30): From (19), (22), (28) and (EARLEY) we know that exactly one of the following two dotted rules is the cause of a parse choice for "result":
    List ::= Item1 .
    List ::= List Item1 .
(31): From (19), (25), (28) and (EARLEY) we know that exactly one of the following two dotted rules is the cause of a parse choice for "result":
    List ::= Item3 .
    List ::= List Item3 .
(32): From (28), (29), (30) and (31) we know that we will have exactly two choices for "result"; that the final symbol of the cause in one will be "Item1"; and that the final symbol of the cause in the other will be "Item3". From the grammar in the synopsis, we know that the cause ending in "Item1" will have rank 1; and that the cause ending in "Item3" will have rank 3. From this we see, since the ranking method is "high_rule_only", that the choice whose final symbol is "Item3" will be the only one kept in "P-ranked". We further see from (19) that this choice will have fewer items than the alternative.
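The selection made in (32) can be sketched in a few lines of plain Perl. This is a hedged model of the "high_rule_only" keep rule, not Marpa internals, and the choice data structure is hypothetical:

```perl
use strict;
use warnings;
use List::Util qw(max);

# The two parse choices at the choicepoint, as established in (30) and (31).
my @choices = (
    { cause => 'List ::= List Item1 .', rank => 1 },
    { cause => 'List ::= List Item3 .', rank => 3 },
);

# "high_rule_only": keep only the choices whose cause has maximal rank.
my $max_rank = max map { $_->{rank} } @choices;
my @kept = grep { $_->{rank} == $max_rank } @choices;

# Only the choice whose cause ends in "Item3" survives into "P-ranked".
```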
(33): From (19) we know that there is at most one ambiguity per minimal subfactoring. From (16) we know that the ambiguity resolutions of a minimal subfactoring are independent of the ambiguity resolutions in other minimal subfactorings. From (32) we know that every subfactoring in "P-ranked" has exactly one factoring, and that that factoring is the one with the fewest items. Therefore "P-ranked" will contain exactly one parse, and that parse will be "longest highest".
QED.
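The "longest highest" result proved above can be modeled by a short greedy scan in plain Perl. The sub below ("longest_highest" is a hypothetical name) reproduces the expected outputs of the synopsis test table; it is a sketch of the result, not of Marpa's ranking machinery:

```perl
use strict;
use warnings;

# Greedy model of the "longest highest" parse: every item is an Item2
# ("v="), except that the last item before a factoring barrier (two
# adjacent VARs) or before end of input is an Item3 ("v=v") when
# possible, else an Item1 ("v").
sub longest_highest {
    my ($input) = @_;
    my @t = split /\s+/, $input;
    my @items;
    my $i = 0;
    while ( $i < @t ) {
        if ( $i + 1 < @t and $t[ $i + 1 ] eq '=' ) {
            if ( $i + 2 < @t and ( $i + 3 >= @t or $t[ $i + 3 ] ne '=' ) ) {
                push @items, "$t[$i]=$t[$i+2]";    # Item3: v=v
                $i += 3;
            }
            else {
                push @items, "$t[$i]=";            # Item2: v=
                $i += 2;
            }
        }
        else {
            push @items, $t[$i];                   # Item1: v
            $i += 1;
        }
    }
    return join '', map { "($_)" } @items;
}

# longest_highest('a = b = c') is "(a=)(b=c)"
```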
Copyright and License¶
Copyright 2022 Jeffrey Kegler.

This file is part of Marpa::R2. Marpa::R2 is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

Marpa::R2 is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with Marpa::R2. If not, see http://www.gnu.org/licenses/.
2025-06-25 | perl v5.40.1 |