Simple but Powerful Pratt Parsing › 迷你 Pratt 解析器 Minimal Pratt Parser [#268]

我们将解析一个基本原子为单字符数字和变量,操作符为标点的表达式。我们定义一个简单的 分词器 ( tokenizer )
We will be parsing expressions where basic atoms are single character numbers and variables, and which uses punctuation for operators. Let’s define a simple tokenizer:

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Token {
    Atom(char),
    Op(char),
    Eof,
}

struct Lexer {
    tokens: Vec<Token>,
}

impl Lexer {
    fn new(input: &str) -> Lexer {
        let mut tokens = input
            .chars()
            .filter(|it| !it.is_ascii_whitespace())
            .map(|c| match c {
                '0'..='9' |
                'a'..='z' | 'A'..='Z' => Token::Atom(c),
                _ => Token::Op(c),
            })
            .collect::<Vec<_>>();
        tokens.reverse();
        Lexer { tokens }
    }

    fn next(&mut self) -> Token {
        self.tokens.pop().unwrap_or(Token::Eof)
    }
    fn peek(&mut self) -> Token {
        self.tokens.last().copied().unwrap_or(Token::Eof)
    }
}

为了确保我们能正确地处理优先级绑定力,我们将把 中缀 ( infix ) 表达式转换为一种具有明确含义的标准符号⸺S-expressions(无论出于和种原因,在波兰不太流行)。
To make sure that we got the precedence binding power correctly, we will be transforming infix expressions into a gold-standard (not so popular in Poland, for whatever reason) unambiguous notation — S-expressions:

1 + 2 * 3 == (+ 1 (* 2 3))
use std::fmt;

enum S {
    Atom(char),
    Cons(char, Vec<S>),
}

impl fmt::Display for S {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            S::Atom(i) => write!(f, "{}", i),
            S::Cons(head, rest) => {
                write!(f, "({}", head)?;
                for s in rest {
                    write!(f, " {}", s)?
                }
                write!(f, ")")
            }
        }
    }
}

接着让我们从这里开始:包括原子和两个二元操作符 +* 的表达式:
And let’s start with just this: expressions with atoms and two infix binary operators, + and *:

fn expr(input: &str) -> S {
    let mut lexer = Lexer::new(input);
    expr_bp(&mut lexer)
}

fn expr_bp(lexer: &mut Lexer) -> S {
    todo!()
}

#[test]
fn tests() {
    let s = expr("1 + 2 * 3");
    assert_eq!(s.to_string(), "(+ 1 (* 2 3))")
}

那么,通用方法大致是我们用来处理左递归的方法⸺从第一个数字开始解析,进入循环,消费操作符然后…做某事?
So, the general approach is roughly the one we used to deal with left recursion — start with parsing a first number, and then loop, consuming operators and doing … something?

fn expr_bp(lexer: &mut Lexer) -> S {
    let lhs = match lexer.next() {
        Token::Atom(it) => S::Atom(it),
        t => panic!("bad token: {:?}", t),
    };

    loop {
        let op = match lexer.next() {
            Token::Eof => break,
            Token::Op(op) => op,
            t => panic!("bad token: {:?}", t),
        };

        todo!()
    }

    lhs
}

#[test]
fn tests() {
    let s = expr("1"); 
    assert_eq!(s.to_string(), "1");
}

  1. 注意我们已经能够解析这个简单的测试表达式!
(1) Note that we already can parse this simple test!

我们想使用绑定力的想法,因此让我们计算操作符左边和右边的力。我们用 u8 来表示绑定力,所以,为了满足结合性,我们会加 1,然后为输入的结尾保留 0,操作符的最低绑定力就是 1 了。
We want to use this power idea, so let’s compute both left and right powers of the operator. We’ll use u8 to represent power, so, for associativity, we’ll add 1. And we’ll reserve the 0 power for the end of input, so the lowest power operator can have is 1.

fn expr_bp(lexer: &mut Lexer) -> S {
    let lhs = match lexer.next() {
        Token::Atom(it) => S::Atom(it),
        t => panic!("bad token: {:?}", t),
    };

    loop {
        let op = match lexer.peek() {
            Token::Eof => break,
            Token::Op(op) => op,
            t => panic!("bad token: {:?}", t),
        };
        let (l_bp, r_bp) = infix_binding_power(op);

        todo!()
    }

    lhs
}

fn infix_binding_power(op: char) -> (u8, u8) {
    match op {
        '+' | '-' => (1, 2),
        '*' | '/' => (3, 4),
        _ => panic!("bad op: {:?}", op)
    }
}

现在到了关键部分,我们将引入递归。考虑这个例子(绑定力在下):
And now comes the tricky bit, where we introduce recursion into the picture. Let’s think about this example (with powers below):

a   +   b   *   c   *   d   +   e
  1   2   3   4   3   4   1   2

光标现在位于第一个 + 号,我们知道左边的绑定力 bp1,右边的是 2。左操作数 lhs 存储 a+ 之后的下一个操作符是 *,所以我们不应该将 b 添加到 a 中。问题在于我们还没有看到下一个操作符,我们刚刚经过 +。我们能加入 前瞻 ( lookahead ) 吗?看来不行⸺我们需要跳过 bcd 找到下一个具有更低绑定力的操作符,这听起来相当 无界 ( unbounded ) 。不过我们在逐步接近了!当前的右优先级是 2,为了能够折叠表达式,我们需要找到下一个优先级更低的操作符。因此,让我们从 b 开始递归调用 expr_bp,但也让它一旦遇到绑定优先级低于 2 的就停下。这就需要在主函数中增加 min_bp 参数。
The cursor is at the first +, we know that the left bp is 1 and the right one is 2. The lhs stores a. The next operator after + is *, so we shouldn’t add b to a. The problem is that we haven’t yet seen the next operator, we are just past +. Can we add a lookahead? Looks like no — we’d have to look past all of b, c and d to find the next operator with lower binding power, which sounds pretty unbounded. But we are onto something! Our current right priority is 2, and, to be able to fold the expression, we need to find the next operator with lower priority. So let’s recursively call expr_bp starting at b, but also tell it to stop as soon as bp drops below 2. This necessitates the addition of min_bp argument to the main function.

于是,我们有了一个功能齐全的迷你 Pratt 解析器:
And lo, we have a fully functioning minimal Pratt parser:

fn expr(input: &str) -> S {
    let mut lexer = Lexer::new(input);
    expr_bp(&mut lexer, 0) 
}

fn expr_bp(lexer: &mut Lexer, min_bp: u8) -> S { 
    let mut lhs = match lexer.next() {
        Token::Atom(it) => S::Atom(it),
        t => panic!("bad token: {:?}", t),
    };

    loop {
        let op = match lexer.peek() {
            Token::Eof => break,
            Token::Op(op) => op,
            t => panic!("bad token: {:?}", t),
        };

        let (l_bp, r_bp) = infix_binding_power(op);
        if l_bp < min_bp { 
            break;
        }

        lexer.next(); 
        let rhs = expr_bp(lexer, r_bp);

        lhs = S::Cons(op, vec![lhs, rhs]); 
    }

    lhs
}

fn infix_binding_power(op: char) -> (u8, u8) {
    match op {
        '+' | '-' => (1, 2),
        '*' | '/' => (3, 4),
        _ => panic!("bad op: {:?}"),
    }
}

#[test]
fn tests() {
    let s = expr("1");
    assert_eq!(s.to_string(), "1");

    let s = expr("1 + 2 * 3");
    assert_eq!(s.to_string(), "(+ 1 (* 2 3))");

    let s = expr("a + b * c * d + e");
    assert_eq!(s.to_string(), "(+ (+ a (* (* b c) d)) e)");
}

  1. min_bp 参数是关键的新增项。expr_bp 现在可以解析具有相对较高绑定力的表达式。一旦它遇到比 min_bp 更弱的绑定力时,就会停止解析。
  2. 这就是 “停止解析” 的地方
  3. 在这里,我们越过操作符本身并进行递归调用,注意我们如何使用 l_bpmin_bp 对比进行检查,并将 r_bp 作为递归调用的新 min_bp。因此你可以把 min_bp 看作当前表达式左侧操作符的绑定力。
  4. 最后,在解析完正确的右侧表达式之后,我们组装出新的当前表达式。
  5. 要开始递归,我们使用零作为绑定力。请记住,一开始左侧操作符的绑定优先级是可能的最低值,零,因为那里实际上没有操作符。
(1) min_bp argument is the crucial addition. expr_bp now parses expressions with relatively high binding power. As soon as it sees something weaker than min_bp, it stops.
(2) This is the “it stops” point.
(3) And here we bump past the operator itself and make the recursive call. Note how we use l_bp to check against min_bp, and r_bp as the new min_bp of the recursive call. So, you can think about min_bp as the binding power of the operator to the left of the current expressions.
(4) Finally, after parsing the correct right hand side, we assemble the new current expression.
(5) To start the recursion, we use binding power of zero. Remember, at the beginning the binding power of the operator to the left is the lowest possible, zero, as there’s no actual operator there.

是的,这 40 行代码就是 Pratt 解析算法。它们有些难对付,但如果你理解了它们,其他部分就只是简单的扩展。
So, yup, these 40 lines are the Pratt parsing algorithm. They are tricky, but, if you understand them, everything else is straightforward additions.