Now let’s add all kinds of weird expressions to show the power and flexibility of the algorithm. First, let’s add ., a high-priority, right-associative function composition operator:
fn infix_binding_power(op: char) -> (u8, u8) {
match op {
'+' | '-' => (1, 2),
'*' | '/' => (3, 4),
'.' => (6, 5),
_ => panic!("bad op: {:?}"),
}
}
Yup, it’s a single line! Note how the left side of the operator binds tighter, which gives us the desired right associativity:
let s = expr("f . g . h");
assert_eq!(s.to_string(), "(. f (. g h))");
let s = expr(" 1 + 2 + f . g . h * 3 * 4");
assert_eq!(s.to_string(), "(+ (+ 1 2) (* (* (. f (. g h)) 3) 4))");
Now, let’s add unary -, which binds tighter than the binary arithmetic operators, but less tightly than composition. This requires changes to how we start our loop, as we can no longer assume that the first token is an atom, and need to handle minus as well. But let the types drive us. First, we start with binding powers. As this is a unary operator, it really only has a right binding power, so, ahem, let’s just code that up:
fn prefix_binding_power(op: char) -> ((), u8) {
match op {
'+' | '-' => ((), 5),
_ => panic!("bad op: {:?}", op),
}
}
fn infix_binding_power(op: char) -> (u8, u8) {
match op {
'+' | '-' => (1, 2),
'*' | '/' => (3, 4),
'.' => (8, 7),
_ => panic!("bad op: {:?}"),
}
}
(1) Here, we return a dummy () to make it clear that this is a prefix, and not a postfix, operator, and thus can only bind things to the right.
(2) Note that, because we want to add unary - between . and *, we need to shift the priority of . up by two. The general rule is that we use an odd priority as the base, and bump it by one on one side for associativity, if the operator is binary. For unary minus it doesn’t matter and we could have used either 5 or 6, but sticking to odd is more consistent. The resulting ladder is sketched just below.
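Spelled out as a ladder (just a restatement of the two functions above; higher numbers bind tighter):
// infix  '+' '-' : base 1 -> (1, 2)   left-associative (right side bumped)
// infix  '*' '/' : base 3 -> (3, 4)   left-associative
// prefix '+' '-' : base 5 -> ((), 5)  only a right binding power
// infix  '.'     : base 7 -> (8, 7)   right-associative (left side bumped)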
Plugging this into expr_bp, we get:
fn expr_bp(lexer: &mut Lexer, min_bp: u8) -> S {
let mut lhs = match lexer.next() {
Token::Atom(it) => S::Atom(it),
Token::Op(op) => {
let ((), r_bp) = prefix_binding_power(op);
todo!()
}
t => panic!("bad token: {:?}", t),
};
...
}
Now, we only have r_bp and not l_bp, so let’s just copy-paste half of the code from the main loop? Remember, we use r_bp for recursive calls.
fn expr_bp(lexer: &mut Lexer, min_bp: u8) -> S {
let mut lhs = match lexer.next() {
Token::Atom(it) => S::Atom(it),
Token::Op(op) => {
let ((), r_bp) = prefix_binding_power(op);
let rhs = expr_bp(lexer, r_bp);
S::Cons(op, vec![rhs])
}
t => panic!("bad token: {:?}", t),
};
loop {
let op = match lexer.peek() {
Token::Eof => break,
Token::Op(op) => op,
t => panic!("bad token: {:?}", t),
};
let (l_bp, r_bp) = infix_binding_power(op);
if l_bp < min_bp {
break;
}
lexer.next();
let rhs = expr_bp(lexer, r_bp);
lhs = S::Cons(op, vec![lhs, rhs]);
}
lhs
}
#[test]
fn tests() {
...
let s = expr("--1 * 2");
assert_eq!(s.to_string(), "(* (- (- 1)) 2)");
let s = expr("--f . g");
assert_eq!(s.to_string(), "(- (- (. f g)))");
}
Amusingly, this purely mechanical, type-driven transformation works. You can also reason through why it works, of course. The same argument applies: after we’ve consumed a prefix operator, its operand consists of operators that bind tighter, and we just so conveniently happen to have a function which can parse expressions that bind tighter than a specified power.
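To make that concrete, here is a rough trace of the first new test, using the binding powers above:
// "--1 * 2", with prefix '-' = ((), 5) and '*' = (3, 4):
//   expr_bp(0): next token is '-', a prefix op, so rhs = expr_bp(5)
//     expr_bp(5): next token is '-' again, so rhs = expr_bp(5)
//       expr_bp(5): lhs = 1; sees '*', l_bp = 3 < 5 -> break; returns 1
//     builds (- 1); '*' still has l_bp = 3 < 5 -> break; returns (- 1)
//   builds (- (- 1)); back at min_bp = 0, '*' has l_bp = 3 >= 0, so it is consumed,
//   2 is parsed with min_bp = 4, and the result is (* (- (- 1)) 2)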
Ok, this is getting stupid. If using ((), u8) “just worked” for prefix operators, can (u8, ()) deal with postfix ones? Well, let’s add ! for factorials. It should bind tighter than -, because -(92!) is obviously more useful than (-92)!. So, the familiar drill — a new priority function, shifting the priority of . (this bit is annoying in Pratt parsers), copy-pasting the code…
let (l_bp, ()) = postfix_binding_power(op);
if l_bp < min_bp {
break;
}
let (l_bp, r_bp) = infix_binding_power(op);
if l_bp < min_bp {
break;
}
Wait, something’s wrong here. After we’ve parsed the prefix expression, we can see either a postfix or an infix operator. But we bail on unrecognized operators, which is not going to work… So, let’s make postfix_binding_power return an Option, for the case where the operator is not postfix:
fn expr_bp(lexer: &mut Lexer, min_bp: u8) -> S {
let mut lhs = match lexer.next() {
Token::Atom(it) => S::Atom(it),
Token::Op(op) => {
let ((), r_bp) = prefix_binding_power(op);
let rhs = expr_bp(lexer, r_bp);
S::Cons(op, vec![rhs])
}
t => panic!("bad token: {:?}", t),
};
loop {
let op = match lexer.peek() {
Token::Eof => break,
Token::Op(op) => op,
t => panic!("bad token: {:?}", t),
};
if let Some((l_bp, ())) = postfix_binding_power(op) {
if l_bp < min_bp {
break;
}
lexer.next();
lhs = S::Cons(op, vec![lhs]);
continue;
}
let (l_bp, r_bp) = infix_binding_power(op);
if l_bp < min_bp {
break;
}
lexer.next();
let rhs = expr_bp(lexer, r_bp);
lhs = S::Cons(op, vec![lhs, rhs]);
}
lhs
}
fn prefix_binding_power(op: char) -> ((), u8) {
match op {
'+' | '-' => ((), 5),
_ => panic!("bad op: {:?}", op),
}
}
fn postfix_binding_power(op: char) -> Option<(u8, ())> {
let res = match op {
'!' => (7, ()),
_ => return None,
};
Some(res)
}
fn infix_binding_power(op: char) -> (u8, u8) {
match op {
'+' | '-' => (1, 2),
'*' | '/' => (3, 4),
'.' => (10, 9),
_ => panic!("bad op: {:?}"),
}
}
#[test]
fn tests() {
let s = expr("-9!");
assert_eq!(s.to_string(), "(- (! 9))");
let s = expr("f . g !");
assert_eq!(s.to_string(), "(! (. f g))");
}
Amusingly, both the old and the new tests pass.
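A rough trace of the -9! test shows how the prefix and postfix powers interact:
// "-9!", with prefix '-' = ((), 5) and postfix '!' = (7, ()):
//   expr_bp(0): next token is '-', so rhs = expr_bp(5)
//     expr_bp(5): lhs = 9; sees '!', l_bp = 7 >= 5, so '!' attaches to 9 first,
//       giving (! 9); then Eof -> break
//   '-' wraps the result: (- (! 9))
// Had '!' been given an l_bp below 5, the inner call would have returned a bare 9
// and the outer loop would have produced (! (- 9)) instead.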
Now we are ready to add a new kind of expression: parenthesised expressions. They are actually not that hard, and we could have done them from the start, but it makes sense to handle them now; you’ll see why in a moment. Parens are just a primary expression, handled similarly to atoms:
let mut lhs = match lexer.next() {
Token::Atom(it) => S::Atom(it),
Token::Op('(') => {
let lhs = expr_bp(lexer, 0);
assert_eq!(lexer.next(), Token::Op(')'));
lhs
}
Token::Op(op) => {
let ((), r_bp) = prefix_binding_power(op);
let rhs = expr_bp(lexer, r_bp);
S::Cons(op, vec![rhs])
}
t => panic!("bad token: {:?}", t),
};
Unfortunately, the following test fails:
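let s = expr("(((0)))");
assert_eq!(s.to_string(), "0"); // presumably the nested-parens case from the final test suite below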
The panic comes from the loop below — the only termination condition we have is reaching eof, and ) is definitely not eof. The easiest way to fix that is to change infix_binding_power to return None on unrecognized operators. That way, it’ll become similar to postfix_binding_power again!
fn expr_bp(lexer: &mut Lexer, min_bp: u8) -> S {
let mut lhs = match lexer.next() {
Token::Atom(it) => S::Atom(it),
Token::Op('(') => {
let lhs = expr_bp(lexer, 0);
assert_eq!(lexer.next(), Token::Op(')'));
lhs
}
Token::Op(op) => {
let ((), r_bp) = prefix_binding_power(op);
let rhs = expr_bp(lexer, r_bp);
S::Cons(op, vec![rhs])
}
t => panic!("bad token: {:?}", t),
};
loop {
let op = match lexer.peek() {
Token::Eof => break,
Token::Op(op) => op,
t => panic!("bad token: {:?}", t),
};
if let Some((l_bp, ())) = postfix_binding_power(op) {
if l_bp < min_bp {
break;
}
lexer.next();
lhs = S::Cons(op, vec![lhs]);
continue;
}
if let Some((l_bp, r_bp)) = infix_binding_power(op) {
if l_bp < min_bp {
break;
}
lexer.next();
let rhs = expr_bp(lexer, r_bp);
lhs = S::Cons(op, vec![lhs, rhs]);
continue;
}
break;
}
lhs
}
fn prefix_binding_power(op: char) -> ((), u8) {
match op {
'+' | '-' => ((), 5),
_ => panic!("bad op: {:?}", op),
}
}
fn postfix_binding_power(op: char) -> Option<(u8, ())> {
let res = match op {
'!' => (7, ()),
_ => return None,
};
Some(res)
}
fn infix_binding_power(op: char) -> Option<(u8, u8)> {
let res = match op {
'+' | '-' => (1, 2),
'*' | '/' => (3, 4),
'.' => (10, 9),
_ => return None,
};
Some(res)
}
And now let’s add the array indexing operator: a[i]. What kind of -fix is it? Around-fix? If it were just a[], it would clearly be postfix. If it were just [i], it would work exactly like parens. And that is the key: the i part doesn’t really participate in the binding-power game at all, as it is unambiguously delimited. So, let’s just do this:
fn expr_bp(lexer: &mut Lexer, min_bp: u8) -> S {
let mut lhs = match lexer.next() {
Token::Atom(it) => S::Atom(it),
Token::Op('(') => {
let lhs = expr_bp(lexer, 0);
assert_eq!(lexer.next(), Token::Op(')'));
lhs
}
Token::Op(op) => {
let ((), r_bp) = prefix_binding_power(op);
let rhs = expr_bp(lexer, r_bp);
S::Cons(op, vec![rhs])
}
t => panic!("bad token: {:?}", t),
};
loop {
let op = match lexer.peek() {
Token::Eof => break,
Token::Op(op) => op,
t => panic!("bad token: {:?}", t),
};
if let Some((l_bp, ())) = postfix_binding_power(op) {
if l_bp < min_bp {
break;
}
lexer.next();
lhs = if op == '[' {
let rhs = expr_bp(lexer, 0);
assert_eq!(lexer.next(), Token::Op(']'));
S::Cons(op, vec![lhs, rhs])
} else {
S::Cons(op, vec![lhs])
};
continue;
}
if let Some((l_bp, r_bp)) = infix_binding_power(op) {
if l_bp < min_bp {
break;
}
lexer.next();
let rhs = expr_bp(lexer, r_bp);
lhs = S::Cons(op, vec![lhs, rhs]);
continue;
}
break;
}
lhs
}
fn prefix_binding_power(op: char) -> ((), u8) {
match op {
'+' | '-' => ((), 5),
_ => panic!("bad op: {:?}", op),
}
}
fn postfix_binding_power(op: char) -> Option<(u8, ())> {
let res = match op {
'!' | '[' => (7, ()),
_ => return None,
};
Some(res)
}
fn infix_binding_power(op: char) -> Option<(u8, u8)> {
let res = match op {
'+' | '-' => (1, 2),
'*' | '/' => (3, 4),
'.' => (10, 9),
_ => return None,
};
Some(res)
}
#[test]
fn tests() {
...
let s = expr("x[0][1]");
assert_eq!(s.to_string(), "([ ([ x 0) 1)");
}
(1) Note that we use the same priority for ! as for [. In general, for the correctness of our algorithm it’s pretty important that, when we make decisions, priorities are never equal. Otherwise, we might end up in a situation like the one before the tiny adjustment for associativity, where there were two equally good candidates for reduction. However, we only ever compare a right bp with a left bp! So for two postfix operators it’s fine to have equal priorities, as they are both left binding powers.
Finally, the ultimate boss of all operators, the dreaded ternary:
c ? e1 : e2
Is this … an all-over-the-place-fix operator? (Translator’s note: Agda really does have operators of this mixed shape; they can be user-defined and are called mixfix operators.) Well, let’s change the syntax of the ternary slightly:
c [ e1 ] e2
And let’s recall that a[i] turned out to be a postfix operator plus a pair of parens… So, yeah, ? and : are actually a weird pair of parens! Let’s handle them as such! Now, what about priority and associativity? What does associativity even mean in this case?
a ? b : c ? d : e
To figure it out, we just squash the parens part:
a ?: c ?: e
This can be parsed as
(a ?: c) ?: e
or as
a ?: (c ?: e)
Which is more useful? For ?-chains like this:
a ? b :
c ? d :
e
the right-associative reading is more useful. Priority-wise, the ternary sits low. In C, only = and , have lower priority. While we are at it, let’s add C-style right-associative = as well.
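One detail worth spelling out (a sketch, using the '?' powers chosen in the listing below): the : closes off the middle operand for the same reason ) and ] do, because neither binding-power function recognizes it:
// "a ? b : c", with '?' = (4, 3):
//   expr_bp(0): lhs = a; sees '?', l_bp = 4 >= 0, consumes it
//     mhs = expr_bp(0): lhs = b; sees ':' -> not postfix, not infix -> break; returns b
//     the ':' itself is consumed by the assert_eq! on lexer.next()
//     rhs = expr_bp(3): lhs = c; Eof -> break; returns c
//   lhs = (? a b c)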
Here’s the most complete and perfect version of our simple Pratt parser:
use std::{fmt, io::BufRead};
enum S {
Atom(char),
Cons(char, Vec<S>),
}
impl fmt::Display for S {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
match self {
S::Atom(i) => write!(f, "{}", i),
S::Cons(head, rest) => {
write!(f, "({}", head)?;
for s in rest {
write!(f, " {}", s)?
}
write!(f, ")")
}
}
}
}
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Token {
Atom(char),
Op(char),
Eof,
}
struct Lexer {
tokens: Vec<Token>,
}
impl Lexer {
fn new(input: &str) -> Lexer {
let mut tokens = input
.chars()
.filter(|it| !it.is_ascii_whitespace())
.map(|c| match c {
'0'..='9'
| 'a'..='z' | 'A'..='Z' => Token::Atom(c),
_ => Token::Op(c),
})
.collect::<Vec<_>>();
tokens.reverse();
Lexer { tokens }
}
fn next(&mut self) -> Token {
self.tokens.pop().unwrap_or(Token::Eof)
}
fn peek(&mut self) -> Token {
self.tokens.last().copied().unwrap_or(Token::Eof)
}
}
fn expr(input: &str) -> S {
let mut lexer = Lexer::new(input);
expr_bp(&mut lexer, 0)
}
fn expr_bp(lexer: &mut Lexer, min_bp: u8) -> S {
let mut lhs = match lexer.next() {
Token::Atom(it) => S::Atom(it),
Token::Op('(') => {
let lhs = expr_bp(lexer, 0);
assert_eq!(lexer.next(), Token::Op(')'));
lhs
}
Token::Op(op) => {
let ((), r_bp) = prefix_binding_power(op);
let rhs = expr_bp(lexer, r_bp);
S::Cons(op, vec![rhs])
}
t => panic!("bad token: {:?}", t),
};
loop {
let op = match lexer.peek() {
Token::Eof => break,
Token::Op(op) => op,
t => panic!("bad token: {:?}", t),
};
if let Some((l_bp, ())) = postfix_binding_power(op) {
if l_bp < min_bp {
break;
}
lexer.next();
lhs = if op == '[' {
let rhs = expr_bp(lexer, 0);
assert_eq!(lexer.next(), Token::Op(']'));
S::Cons(op, vec![lhs, rhs])
} else {
S::Cons(op, vec![lhs])
};
continue;
}
if let Some((l_bp, r_bp)) = infix_binding_power(op) {
if l_bp < min_bp {
break;
}
lexer.next();
lhs = if op == '?' {
let mhs = expr_bp(lexer, 0);
assert_eq!(lexer.next(), Token::Op(':'));
let rhs = expr_bp(lexer, r_bp);
S::Cons(op, vec![lhs, mhs, rhs])
} else {
let rhs = expr_bp(lexer, r_bp);
S::Cons(op, vec![lhs, rhs])
};
continue;
}
break;
}
lhs
}
fn prefix_binding_power(op: char) -> ((), u8) {
match op {
'+' | '-' => ((), 9),
_ => panic!("bad op: {:?}", op),
}
}
fn postfix_binding_power(op: char) -> Option<(u8, ())> {
let res = match op {
'!' => (11, ()),
'[' => (11, ()),
_ => return None,
};
Some(res)
}
fn infix_binding_power(op: char) -> Option<(u8, u8)> {
let res = match op {
'=' => (2, 1),
'?' => (4, 3),
'+' | '-' => (5, 6),
'*' | '/' => (7, 8),
'.' => (14, 13),
_ => return None,
};
Some(res)
}
#[test]
fn tests() {
let s = expr("1");
assert_eq!(s.to_string(), "1");
let s = expr("1 + 2 * 3");
assert_eq!(s.to_string(), "(+ 1 (* 2 3))");
let s = expr("a + b * c * d + e");
assert_eq!(s.to_string(), "(+ (+ a (* (* b c) d)) e)");
let s = expr("f . g . h");
assert_eq!(s.to_string(), "(. f (. g h))");
let s = expr(" 1 + 2 + f . g . h * 3 * 4");
assert_eq!(
s.to_string(),
"(+ (+ 1 2) (* (* (. f (. g h)) 3) 4))",
);
let s = expr("--1 * 2");
assert_eq!(s.to_string(), "(* (- (- 1)) 2)");
let s = expr("--f . g");
assert_eq!(s.to_string(), "(- (- (. f g)))");
let s = expr("-9!");
assert_eq!(s.to_string(), "(- (! 9))");
let s = expr("f . g !");
assert_eq!(s.to_string(), "(! (. f g))");
let s = expr("(((0)))");
assert_eq!(s.to_string(), "0");
let s = expr("x[0][1]");
assert_eq!(s.to_string(), "([ ([ x 0) 1)");
let s = expr(
"a ? b :
c ? d
: e",
);
assert_eq!(s.to_string(), "(? a b (? c d e))");
let s = expr("a = 0 ? b : c = d");
assert_eq!(s.to_string(), "(= a (= (? 0 b c) d))")
}
fn main() {
for line in std::io::stdin().lock().lines() {
let line = line.unwrap();
let s = expr(&line);
println!("{}", s)
}
}
The code is also available in this repository. Eof :-)