Jinser Kafka [jsr-0001]
I am a Junior Computer Science major and a fancier of programming languages. Here is my WIP &'static CV.
NOTE: I always forget, so let me write it down here: press Ctrl + k or Cmd + k to open the ninja "telescope".
I am working at IDEA as an intern, building the surrounding tooling for MoonBit.
The source code of this site is hosted on GitHub: jetjinser/forest.
Recently added [jsr-0013]
Waiting for Forester 5.0 to be released.
Blog posts [jsr-0003]
Forester anatomy of \def [jsr-0006]
- May 19, 2024
- Jinser Kafka
A brief description of how \def is used in Forester, starting from the syntax.
Forester def of \def [fst-0005]
- May 19, 2024
- Jinser Kafka
Forester's \def syntax definition, roughly excerpted from lib/frontend/Grammar.mly in ocaml-forester:
%token <string> IDENT

let ident :=
| ident = IDENT; { String.split_on_char '/' ident }

let bvar :=
| x = TEXT; { [x] }

let binder == list(squares(bvar))

let textual_expr == list(locate(textual_node))

let arg == braces(textual_expr)

let textual_node :=
| ~ = TEXT; <Code.Text>
| ~ = WHITESPACE; <Code.Text>
| ~ = head_node; <Fun.id>

let fun_spec == ~ = ident; ~ = binder; ~ = arg; <>

let head_node :=
(* ... *)
| DEF; (~,~,~) = fun_spec; <Code.Def>
From the grammar, the \def syntax consists of roughly three parts: an ident, a binder, and an arg. Consider a complete \def instance:
Example. Forester example of \def with binder [fst-0004]
- May 19, 2024
- Jinser Kafka
\def\lam[x]{ \lambda\x\mathpunct{.} }
Let's look at each part:
- \def is the keyword, marking the start of a macro definition.
- \lam is the ident, the name of the macro being \def'd.
- [x] is the binder, the place where parameter names are bound; the parameter name bound here is x.
- \lambda\x\mathpunct{.} is the arg. The parser calls it arg perhaps to suggest that it is the argument to \def, but in practice it can be read as the body of the macro being defined. Looking at the body, the parameter bound in the binder is used inside it: \x.
This macro named \lam, defined via \def, is designed to be used in a \(\LaTeX\) context, which is entered with ##{ }.
##{
  \begin{aligned}
    \lam{x} \\
    \lam{x y} \\
    \lam{x}\lam{y} \\
  \end{aligned}
}
Note how arguments are passed to the \lam macro: write a pair of curly braces after the macro, and whatever appears between the braces is passed to the macro as its argument. The three calls here pass x, x y, and then x and y respectively.
##{
  \begin{aligned}
    \lambda\x\mathpunct{.} \\
    \lambda\x y\mathpunct{.} \\
    \lambda\x\mathpunct{.}\lambda\y\mathpunct{.} \\
  \end{aligned}
}
Looking at the expanded form, each argument appears exactly where \x originally stood; this can be seen as a kind of rewriting or substitution.
The rendered result looks like this:
\[ \begin{aligned} \lambda x\mathpunct{.} \\ \lambda x y\mathpunct{.} \\ \lambda x\mathpunct{.}\lambda y\mathpunct{.} \\ \end{aligned} \]

A \def can take more than one parameter: writing any number of binders in the binder position is allowed.
Example. Forester example of \def with multi binder [fst-0007]
- May 19, 2024
- Jinser Kafka
\def\SubSup[arg1][arg2]{#{_{\arg1}^{\arg2}}}
Each binder must be wrapped in square brackets, one bound parameter name per pair of brackets; the bound parameter names can then be used in the arg.

\SubSup{x}{y} => \(_{x}^{y}\).

Arguments are likewise passed as multiple brace groups, corresponding to the binders in order.
Finally, a simpler example:
Example. Forester example of \def without binder [fst-0006]
- May 19, 2024
- Jinser Kafka
\def\forester{ [ocaml-forester](https://git.sr.ht/~jonsterling/ocaml-forester) }
In this example the binder is omitted, meaning the macro binds no parameters. Using a macro that takes no arguments is simple and direct: just don't write any brace groups.
\forester is an experimental forest markup language.
expands to:
[ocaml-forester](https://git.sr.ht/~jonsterling/ocaml-forester) is an experimental forest markup language.
ocaml-forester is an experimental forest markup language.
Notes on using Forester with Neovim / Nix [jsr-0004]
- May 1, 2024
- Jinser Kafka
The navigation problem of a flat file tree can be (partially) solved with virtual nesting; unless the tree gets particularly large, it works quite well.
Forester nesting tree [fst-0001]
- April 9, 2024
- Jinser Kafka
Forester's trees are flat, and organizing them as a file tree is discouraged, so once there are many trees, navigating with a file tree becomes hard. I don't (yet) have the habit of using fuzzy-finder-style plugins, so I configured virtual file nesting at the file-tree-plugin level instead.
Neo-tree file nesting [nvim-0001]
- April 9, 2024
- Jinser Kafka
:help neo-tree-file-nesting
nesting_rules = {
  ["forester"] = {
    pattern = "(.*)-0001%.tree$",
    files = { "%1-*.tree" },
  },
},
hls-0001.tree
  hls-0002.tree
  hls-0003.tree
  hls-0004.tree
TREENAME-0001.tree
  TREENAME-0002.tree
  TREENAME-0003.tree
  ...
The result is pretty good. One small issue remains: git status is only rendered on a changed file and on the folders containing it; a nesting parent is not a folder, and, being a file itself, the pattern file should not take on that rendering either.
Neo-tree file nesting git-status issue [nvim-0002]
- April 9, 2024
- Jinser Kafka
expand nested files
trees/
  hls-0001.tree
    hls-0002.tree
    hls-0003.tree
    hls-0004.tree
    ...
  TREENAME-0001.tree
    TREENAME-0002.tree
    TREENAME-0003.tree
    ...
collapse nested files
trees/
hls-0001.tree
...
TREENAME-0001.tree
This is a bit of a dilemma; so far I can think of two approaches.
- Dynamic rendering: after the files are collapsed, render git status on the pattern file as well. This probably has problems: a folder's status also comes from git status, and it seems hard to decide what status the nesting pattern file should take.
- Virtual folders: for Forester's use case, wouldn't a virtual folder actually be better? But that still doesn't let git compute the folder's status.
Fix Forester Nix flake on x86_64-darwin [jsr-0005]
- May 1, 2024
- Jinser Kafka
Nix flake support was added starting with Forester 3.0; unfortunately, neither jon nor kento uses x86_64-darwin. As the second-class citizen of a second-class citizen (nix-darwin), x86_64-darwin was unluckily broken.
Fix Forester opam-nix ocaml-cf broken [fst-0002]
- May 1, 2024
- Jinser Kafka
ocaml-cf's lib_gen/dune file appears to contain a mistake, which makes Forester fail to build on x86_64-darwin.
(rule
 (targets detect.exe)
 (enabled_if
  (= %{system} macosx))
 (deps
  detect.c
  (package ctypes))
 (action
  (run %{cc} -I %{ocaml_where} -I %{lib:ctypes:} -o %{targets} %{deps})))
File "lib_gen/dune", line 27, characters 0-185:
27 | (rule
28 |  (targets detect.exe)
29 |  (enabled_if
30 |   (= %{system} macosx))
31 |  (deps
32 |   detect.c
33 |   (package ctypes))
34 |  (action
35 |   (run %{cc} -I %{ocaml_where} -I %{lib:ctypes:} -o %{targets} %{deps})))
Error: File unavailable: /nix/store/xb81cf7qgnyvsyvp9nfj15dmqchc24gk-ctypes-0.22.0/doc/ctypes/CHANGES.md
I don't understand how dune works, but my guess is that (deps detect.c (package ctypes)) treats ctypes as a source dependency to be built. I removed (package ctypes) outright, and everything worked.
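For reference, a sketch of the rule with that dependency dropped (my reconstruction of the change just described, not necessarily the exact upstream patch):

(rule
 (targets detect.exe)
 (enabled_if
  (= %{system} macosx))
 (deps detect.c)
 (action
  (run %{cc} -I %{ocaml_where} -I %{lib:ctypes:} -o %{targets} %{deps})))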
I opened a PR on GitHub.
Fix Forester Nix flake on x86_64-darwin link error [fst-0003]
- May 1, 2024
- Jinser Kafka
Building Forester with its original Nix flake on x86_64-darwin fails at the final linking stage with ld errors: Unresolved Symbol(s): `_preadv` and Unresolved Symbol(s): `_pwritev`.
Well, that's because the default Apple SDK for x86_64-darwin in nixpkgs is 10.12, a very old version that indeed lacks these two symbols. The fix is to override the default 10.12 SDK with a newer version, usually 11.
Opam-nix override stdenv [nix-0008]
- May 1, 2024
- Jinser Kafka
pkgs = import nixpkgs {
  inherit system;
  config.replaceStdenv = { pkgs }:
    if pkgs.stdenv.isDarwin
    then pkgs.overrideSDK pkgs.stdenv "11.0"
    else pkgs.stdenv;
};
...
scope = on.buildOpamProject' {
  repos = [ "${opam-repository}" ];
  inherit pkgs;
} ./. query;
There is a complete example here.
Translations [jsr-000N]
A list of translated articles; the dates shown are the translation dates.
Translation. Simple but Powerful Pratt Parsing [jsr-0017]
- March 19, 2025
- Alex Kladov with contributions from Jinser Kafka
- Original
Translated according to the version last updated on Apr 13, 2020.
Welcome to my article about Pratt parsing — the monad tutorial of syntactic analysis. The number of Pratt parsing articles is so large that there exists a survey post :)
The goals of this particular article are:
- Raising an issue that the so-called left-recursion problem is overstated.
- Complaining about the inadequacy of BNF for representing infix expressions.
- Providing a description and implementation of the Pratt parsing algorithm which sticks to the core and doesn't introduce a DSL-y abstraction.
- Understanding the algorithm myself, hopefully for the last time. I've implemented a production-grade Pratt parser once, but I no longer immediately understand that code :-)
This post assumes a fair bit of familiarity with parsing techniques, and, for example, does not explain what a context free grammar is.
Introduction [#265]
- March 19, 2025
- Alex Kladov
Parsing is the process by which a compiler turns a sequence of tokens into a tree representation:
                           Add
               Parser     /   \
 "1 + 2 * 3"  ------->   1    Mul
                             /   \
                            2     3
There are many approaches to this task, which roughly fall into one of two broad categories:
- Using a DSL to specify an abstract grammar of the language
- Hand-writing the parser
Pratt parsing is one of the most frequently used techniques for hand-written parsing.
BNF [#266]
- March 19, 2025
- Alex Kladov
The pinnacle of syntactic analysis theory is discovering the context free grammar notation (often using BNF concrete syntax) for decoding linear structures into trees:
Item ::=
  StructItem
| EnumItem
| ...

StructItem ::=
  'struct' Name '{' FieldList '}'
...
I remember being fascinated by this idea, especially by parallels with natural language sentence structure. However, my optimism quickly waned once we got to describing expressions. The natural expression grammar indeed allows one to see what is an expression.
Expr ::=
  Expr '+' Expr
| Expr '*' Expr
| '(' Expr ')'
| 'number'
Although this grammar looks great, it is in fact ambiguous and imprecise, and needs to be rewritten to be amenable to automated parser generation. Specifically, we need to specify the precedence and associativity of operators. The fixed grammar looks like this:
Expr ::=
  Factor
| Expr '+' Factor

Factor ::=
  Atom
| Factor '*' Atom

Atom ::=
  'number'
| '(' Expr ')'
To me, the “shape” of expressions feels completely lost in this new formulation. Moreover, it took me three or four courses in formal languages before I was able to reliably create this grammar myself.
And that’s why I love Pratt parsing — it is an enhancement of recursive descent parsing algorithm, which uses the natural terminology of precedence and associativity for parsing expressions, instead of grammar obfuscation techniques.
Recursive descent and left-recursion [#267]
- March 19, 2025
- Alex Kladov
The simplest technique for hand-writing a parser is recursive descent, which models the grammar as a set of mutually recursive functions. For example, the above item grammar fragment can look like this:
fn item(p: &mut Parser) {
match p.peek() {
STRUCT_KEYWORD => struct_item(p),
ENUM_KEYWORD => enum_item(p),
...
}
}
fn struct_item(p: &mut Parser) {
p.expect(STRUCT_KEYWORD);
name(p);
p.expect(L_CURLY);
field_list(p);
p.expect(R_CURLY);
}
...
Traditionally, text-books point out left-recursive grammars as the Achilles heel of this approach, and use this drawback to motivate more advanced LR parsing techniques. An example of problematic grammar can look like this:
Sum ::= Sum '+' Int | Int
确实,如果我们朴素地编码
sum
函数,这不会太有用:
Indeed, if we naively code the sum function, it wouldn’t be too useful:
fn sum(p: &mut Parser) {
// Try first alternative
sum(p); // (1)
p.expect(PLUS);
int(p);
// If that fails, try the second one
...
}
(1): At this point we immediately loop and overflow the stack
A theoretical fix to the problem involves rewriting the grammar to eliminate the left recursion. However in practice, for a hand-written parser, a solution is much simpler — breaking away with a pure recursive paradigm and using a loop:
fn sum(p: &mut Parser) {
int(p);
while p.eat(PLUS) {
int(p);
}
}
Pratt parsing, the general shape [#268]
- March 19, 2025
- Alex Kladov
Using just loops won’t be enough for parsing infix expressions. Instead, Pratt parsing uses both loops and recursion:
fn parse_expr() {
...
loop {
...
parse_expr()
...
}
}
Not only does it send your mind into a Möbius-shaped hamster wheel, it also handles associativity and precedence!
From Precedence to Binding Power [#269]
- March 19, 2025
- Alex Kladov
I have a confession to make: I am always confused by “high precedence” and “low precedence”. In a + b * c, addition has a lower precedence, but it is at the top of the parse tree…
So instead, I find thinking in terms of binding power more intuitive.
expr:    A + B * C
power:     3 3 5 5
The * is stronger, it has more power to hold together B and C, and so the expression is parsed as A + (B * C).
What about associativity though? In A + B + C all operators seem to have the same power, and it is unclear which + to fold first. But this can also be modelled with power, if we make it slightly asymmetric:
expr:     A + B + C
power:  0   3 3.1   3 3.1   0
Here, we pumped the right power of + just a little bit, so that it holds the right operand tighter. We also added zeros at both ends, as there are no operators to bind from the sides. Here, the first (and only the first) + holds both of its arguments tighter than the neighbors, so we can reduce it:
expr:     (A + B) + C
power:  0           3 3.1   0
Now we can fold the second plus and get (A + B) + C. Or, in terms of the syntax tree, the second + really likes its right operand more than the left one, so it rushes to get hold of C. While he does that, the first + captures both A and B, as they are uncontested.
What Pratt parsing does is find these badass, stronger-than-neighbors operators by processing the string left to right. We are almost at the point where we can finally start writing some code, but let's first look at one more running example. We will use the function composition operator, . (dot), as a right-associative operator with a high binding power. That is, f . g . h is parsed as f . (g . h), or, in terms of power:
expr:    f . g . h
power:  0  8.5 8  8.5 8  0
Minimal Pratt Parser [#270]
- March 19, 2025
- Alex Kladov
We will be parsing expressions where basic atoms are single character numbers and variables, and which uses punctuation for operators. Let’s define a simple tokenizer:
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Token {
Atom(char),
Op(char),
Eof,
}
struct Lexer {
tokens: Vec<Token>,
}
impl Lexer {
fn new(input: &str) -> Lexer {
let mut tokens = input
.chars()
.filter(|it| !it.is_ascii_whitespace())
.map(|c| match c {
'0'..='9' |
'a'..='z' | 'A'..='Z' => Token::Atom(c),
_ => Token::Op(c),
})
.collect::<Vec<_>>();
tokens.reverse();
Lexer { tokens }
}
fn next(&mut self) -> Token {
self.tokens.pop().unwrap_or(Token::Eof)
}
fn peek(&mut self) -> Token {
self.tokens.last().copied().unwrap_or(Token::Eof)
}
}
To make sure that we got the precedence binding power correctly, we will be transforming infix expressions into a gold-standard (not so popular in Poland, for whatever reason) unambiguous notation — S-expressions:
1 + 2 * 3 == (+ 1 (* 2 3))
use std::fmt;
enum S {
Atom(char),
Cons(char, Vec<S>),
}
impl fmt::Display for S {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
match self {
S::Atom(i) => write!(f, "{}", i),
S::Cons(head, rest) => {
write!(f, "({}", head)?;
for s in rest {
write!(f, " {}", s)?
}
write!(f, ")")
}
}
}
}
And let’s start with just this: expressions with atoms and two infix binary operators, + and *:
fn expr(input: &str) -> S {
let mut lexer = Lexer::new(input);
expr_bp(&mut lexer)
}
fn expr_bp(lexer: &mut Lexer) -> S {
todo!()
}
#[test]
fn tests() {
let s = expr("1 + 2 * 3");
assert_eq!(s.to_string(), "(+ 1 (* 2 3))")
}
So, the general approach is roughly the one we used to deal with left recursion — start with parsing a first number, and then loop, consuming operators and doing … something?
fn expr_bp(lexer: &mut Lexer) -> S {
let lhs = match lexer.next() {
Token::Atom(it) => S::Atom(it),
t => panic!("bad token: {:?}", t),
};
loop {
let op = match lexer.next() {
Token::Eof => break,
Token::Op(op) => op,
t => panic!("bad token: {:?}", t),
};
todo!()
}
lhs
}
#[test]
fn tests() {
let s = expr("1");
assert_eq!(s.to_string(), "1");
}
(1) Note that we already can parse this simple test!
We want to use this power idea, so let’s compute both left and right powers of the operator. We’ll use u8 to represent power, so, for associativity, we’ll add 1. And we’ll reserve the 0 power for the end of input, so the lowest power operator can have is 1.
fn expr_bp(lexer: &mut Lexer) -> S {
let lhs = match lexer.next() {
Token::Atom(it) => S::Atom(it),
t => panic!("bad token: {:?}", t),
};
loop {
let op = match lexer.peek() {
Token::Eof => break,
Token::Op(op) => op,
t => panic!("bad token: {:?}", t),
};
let (l_bp, r_bp) = infix_binding_power(op);
todo!()
}
lhs
}
fn infix_binding_power(op: char) -> (u8, u8) {
match op {
'+' | '-' => (1, 2),
'*' | '/' => (3, 4),
_ => panic!("bad op: {:?}", op)
}
}
And now comes the tricky bit, where we introduce recursion into the picture. Let’s think about this example (with powers below):
expr:    a + b * c * d + e
power:     1 2 3 4 3 4 1 2
The cursor is at the first +, we know that the left bp is 1 and the right one is 2. The lhs stores a. The next operator after + is *, so we shouldn’t add b to a. The problem is that we haven’t yet seen the next operator, we are just past +. Can we add a lookahead? Looks like no — we’d have to look past all of b, c and d to find the next operator with lower binding power, which sounds pretty unbounded. But we are onto something! Our current right priority is 2, and, to be able to fold the expression, we need to find the next operator with lower priority. So let’s recursively call expr_bp starting at b, but also tell it to stop as soon as bp drops below 2. This necessitates the addition of min_bp argument to the main function.
And lo, we have a fully functioning minimal Pratt parser:
fn expr(input: &str) -> S {
let mut lexer = Lexer::new(input);
expr_bp(&mut lexer, 0) // (5)
}
fn expr_bp(lexer: &mut Lexer, min_bp: u8) -> S { // (1)
let mut lhs = match lexer.next() {
Token::Atom(it) => S::Atom(it),
t => panic!("bad token: {:?}", t),
};
loop {
let op = match lexer.peek() {
Token::Eof => break,
Token::Op(op) => op,
t => panic!("bad token: {:?}", t),
};
let (l_bp, r_bp) = infix_binding_power(op);
if l_bp < min_bp { // (2)
break;
}
lexer.next(); // (3)
let rhs = expr_bp(lexer, r_bp);
lhs = S::Cons(op, vec![lhs, rhs]); // (4)
}
lhs
}
fn infix_binding_power(op: char) -> (u8, u8) {
match op {
'+' | '-' => (1, 2),
'*' | '/' => (3, 4),
_ => panic!("bad op: {:?}", op),
}
}
#[test]
fn tests() {
let s = expr("1");
assert_eq!(s.to_string(), "1");
let s = expr("1 + 2 * 3");
assert_eq!(s.to_string(), "(+ 1 (* 2 3))");
let s = expr("a + b * c * d + e");
assert_eq!(s.to_string(), "(+ (+ a (* (* b c) d)) e)");
}
(1) The min_bp argument is the crucial addition. expr_bp now parses expressions with relatively high binding power. As soon as it sees something weaker than min_bp, it stops.
(2) This is the “it stops” point.
(3) And here we bump past the operator itself and make the recursive call. Note how we use l_bp to check against min_bp, and r_bp as the new min_bp of the recursive call. So, you can think about min_bp as the binding power of the operator to the left of the current expressions.
(4) Finally, after parsing the correct right hand side, we assemble the new current expression.
(5) To start the recursion, we use binding power of zero. Remember, at the beginning the binding power of the operator to the left is the lowest possible, zero, as there’s no actual operator there.
So, yup, these 40 lines are the Pratt parsing algorithm. They are tricky, but, if you understand them, everything else is straightforward additions.
Bells and Whistles [#271]
- March 19, 2025
- Alex Kladov
Now let’s add all kinds of weird expressions to show the power and flexibility of the algorithm. First, let’s add a high-priority, right associative function composition operator: .:
fn infix_binding_power(op: char) -> (u8, u8) {
match op {
'+' | '-' => (1, 2),
'*' | '/' => (3, 4),
'.' => (6, 5),
_ => panic!("bad op: {:?}", op),
}
}
Yup, it’s a single line! Note how the left side of the operator binds tighter, which gives us desired right associativity:
let s = expr("f . g . h");
assert_eq!(s.to_string(), "(. f (. g h))");
let s = expr(" 1 + 2 + f . g . h * 3 * 4");
assert_eq!(s.to_string(), "(+ (+ 1 2) (* (* (. f (. g h)) 3) 4))");
Now, let’s add unary -, which binds tighter than binary arithmetic operators, but less tight than composition. This requires changes to how we start our loop, as we no longer can assume that the first token is an atom, and need to handle minus as well. But let the types drive us. First, we start with binding powers. As this is an unary operator, it really only have right binding power, so, ahem, let’s just code this:
fn prefix_binding_power(op: char) -> ((), u8) { // (1)
match op {
'+' | '-' => ((), 5),
_ => panic!("bad op: {:?}", op),
}
}
fn infix_binding_power(op: char) -> (u8, u8) {
match op {
'+' | '-' => (1, 2),
'*' | '/' => (3, 4),
'.' => (8, 7), // (2)
_ => panic!("bad op: {:?}", op),
}
}
(1) Here, we return a dummy () to make it clear that this is a prefix, and not a postfix, operator, and thus can only bind things to the right.
(2) Note, as we want to add unary - between . and *, we need to shift priorities of . by two. The general rule is that we use an odd priority as base, and bump it by one for associativity, if the operator is binary. For unary minus it doesn’t matter and we could have used either 5 or 6, but sticking to odd is more consistent.
Plugging this into expr_bp, we get:
fn expr_bp(lexer: &mut Lexer, min_bp: u8) -> S {
let mut lhs = match lexer.next() {
Token::Atom(it) => S::Atom(it),
Token::Op(op) => {
let ((), r_bp) = prefix_binding_power(op);
todo!()
}
t => panic!("bad token: {:?}", t),
};
...
}
Now, we only have r_bp and not l_bp, so let’s just copy-paste half of the code from the main loop? Remember, we use r_bp for recursive calls.
fn expr_bp(lexer: &mut Lexer, min_bp: u8) -> S {
let mut lhs = match lexer.next() {
Token::Atom(it) => S::Atom(it),
Token::Op(op) => {
let ((), r_bp) = prefix_binding_power(op);
let rhs = expr_bp(lexer, r_bp);
S::Cons(op, vec![rhs])
}
t => panic!("bad token: {:?}", t),
};
loop {
let op = match lexer.peek() {
Token::Eof => break,
Token::Op(op) => op,
t => panic!("bad token: {:?}", t),
};
let (l_bp, r_bp) = infix_binding_power(op);
if l_bp < min_bp {
break;
}
lexer.next();
let rhs = expr_bp(lexer, r_bp);
lhs = S::Cons(op, vec![lhs, rhs]);
}
lhs
}
#[test]
fn tests() {
...
let s = expr("--1 * 2");
assert_eq!(s.to_string(), "(* (- (- 1)) 2)");
let s = expr("--f . g");
assert_eq!(s.to_string(), "(- (- (. f g)))");
}
Amusingly, this purely mechanical, type-driven transformation works. You can also reason why it works, of course. The same argument applies; after we’ve consumed a prefix operator, the operand consists of operators that bind tighter, and we just so conveniently happen to have a function which can parse expressions tighter than the specified power.
Ok, this is getting stupid. If using ((), u8) “just worked” for prefix operators, can (u8, ()) deal with postfix ones? Well, let’s add ! for factorials. It should bind tighter than -, because -(92!) is obviously more useful than (-92)!. So, the familiar drill — new priority function, shifting priority of . (this bit is annoying in Pratt parsers), copy-pasting the code…
let (l_bp, ()) = postfix_binding_power(op);
if l_bp < min_bp {
break;
}
let (l_bp, r_bp) = infix_binding_power(op);
if l_bp < min_bp {
break;
}
Wait, something’s wrong here. After we’ve parsed the prefix expression, we can see either a postfix or an infix operator. But we bail on unrecognized operators, which is not going to work… So, let’s make postfix_binding_power to return an option, for the case where the operator is not postfix:
fn expr_bp(lexer: &mut Lexer, min_bp: u8) -> S {
let mut lhs = match lexer.next() {
Token::Atom(it) => S::Atom(it),
Token::Op(op) => {
let ((), r_bp) = prefix_binding_power(op);
let rhs = expr_bp(lexer, r_bp);
S::Cons(op, vec![rhs])
}
t => panic!("bad token: {:?}", t),
};
loop {
let op = match lexer.peek() {
Token::Eof => break,
Token::Op(op) => op,
t => panic!("bad token: {:?}", t),
};
if let Some((l_bp, ())) = postfix_binding_power(op) {
if l_bp < min_bp {
break;
}
lexer.next();
lhs = S::Cons(op, vec![lhs]);
continue;
}
let (l_bp, r_bp) = infix_binding_power(op);
if l_bp < min_bp {
break;
}
lexer.next();
let rhs = expr_bp(lexer, r_bp);
lhs = S::Cons(op, vec![lhs, rhs]);
}
lhs
}
fn prefix_binding_power(op: char) -> ((), u8) {
match op {
'+' | '-' => ((), 5),
_ => panic!("bad op: {:?}", op),
}
}
fn postfix_binding_power(op: char) -> Option<(u8, ())> {
let res = match op {
'!' => (7, ()),
_ => return None,
};
Some(res)
}
fn infix_binding_power(op: char) -> (u8, u8) {
match op {
'+' | '-' => (1, 2),
'*' | '/' => (3, 4),
'.' => (10, 9),
_ => panic!("bad op: {:?}", op),
}
}
#[test]
fn tests() {
let s = expr("-9!");
assert_eq!(s.to_string(), "(- (! 9))");
let s = expr("f . g !");
assert_eq!(s.to_string(), "(! (. f g))");
}
Amusingly, both the old and the new tests pass.
Now, we are ready to add a new kind of expression: the parenthesised expression. It is actually not that hard, and we could have done it from the start, but it makes sense to handle it here; you'll see why in a moment. Parens are just a primary expression, and are handled similarly to atoms:
let mut lhs = match lexer.next() {
Token::Atom(it) => S::Atom(it),
Token::Op('(') => {
let lhs = expr_bp(lexer, 0);
assert_eq!(lexer.next(), Token::Op(')'));
lhs
}
Token::Op(op) => {
let ((), r_bp) = prefix_binding_power(op);
let rhs = expr_bp(lexer, r_bp);
S::Cons(op, vec![rhs])
}
t => panic!("bad token: {:?}", t),
};
Unfortunately, the following test fails:
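let s = expr("(((0)))");
assert_eq!(s.to_string(), "0");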
The panic comes from the loop below — the only termination condition we have is reaching eof, and ) is definitely not eof. The easiest way to fix that is to change infix_binding_power to return None on unrecognized operands. That way, it’ll become similar to postfix_binding_power again!
fn expr_bp(lexer: &mut Lexer, min_bp: u8) -> S {
let mut lhs = match lexer.next() {
Token::Atom(it) => S::Atom(it),
Token::Op('(') => {
let lhs = expr_bp(lexer, 0);
assert_eq!(lexer.next(), Token::Op(')'));
lhs
}
Token::Op(op) => {
let ((), r_bp) = prefix_binding_power(op);
let rhs = expr_bp(lexer, r_bp);
S::Cons(op, vec![rhs])
}
t => panic!("bad token: {:?}", t),
};
loop {
let op = match lexer.peek() {
Token::Eof => break,
Token::Op(op) => op,
t => panic!("bad token: {:?}", t),
};
if let Some((l_bp, ())) = postfix_binding_power(op) {
if l_bp < min_bp {
break;
}
lexer.next();
lhs = S::Cons(op, vec![lhs]);
continue;
}
if let Some((l_bp, r_bp)) = infix_binding_power(op) {
if l_bp < min_bp {
break;
}
lexer.next();
let rhs = expr_bp(lexer, r_bp);
lhs = S::Cons(op, vec![lhs, rhs]);
continue;
}
break;
}
lhs
}
fn prefix_binding_power(op: char) -> ((), u8) {
match op {
'+' | '-' => ((), 5),
_ => panic!("bad op: {:?}", op),
}
}
fn postfix_binding_power(op: char) -> Option<(u8, ())> {
let res = match op {
'!' => (7, ()),
_ => return None,
};
Some(res)
}
fn infix_binding_power(op: char) -> Option<(u8, u8)> {
let res = match op {
'+' | '-' => (1, 2),
'*' | '/' => (3, 4),
'.' => (10, 9),
_ => return None,
};
Some(res)
}
And now let’s add array indexing operator: a[i]. What kind of -fix is it? Around-fix? If it were just a[], it would clearly be postfix. if it were just [i], it would work exactly like parens. And it is the key: the i part doesn’t really participate in the whole power game, as it is unambiguously delimited. So, let’s do this:
fn expr_bp(lexer: &mut Lexer, min_bp: u8) -> S {
let mut lhs = match lexer.next() {
Token::Atom(it) => S::Atom(it),
Token::Op('(') => {
let lhs = expr_bp(lexer, 0);
assert_eq!(lexer.next(), Token::Op(')'));
lhs
}
Token::Op(op) => {
let ((), r_bp) = prefix_binding_power(op);
let rhs = expr_bp(lexer, r_bp);
S::Cons(op, vec![rhs])
}
t => panic!("bad token: {:?}", t),
};
loop {
let op = match lexer.peek() {
Token::Eof => break,
Token::Op(op) => op,
t => panic!("bad token: {:?}", t),
};
if let Some((l_bp, ())) = postfix_binding_power(op) {
if l_bp < min_bp {
break;
}
lexer.next();
lhs = if op == '[' {
let rhs = expr_bp(lexer, 0);
assert_eq!(lexer.next(), Token::Op(']'));
S::Cons(op, vec![lhs, rhs])
} else {
S::Cons(op, vec![lhs])
};
continue;
}
if let Some((l_bp, r_bp)) = infix_binding_power(op) {
if l_bp < min_bp {
break;
}
lexer.next();
let rhs = expr_bp(lexer, r_bp);
lhs = S::Cons(op, vec![lhs, rhs]);
continue;
}
break;
}
lhs
}
fn prefix_binding_power(op: char) -> ((), u8) {
match op {
'+' | '-' => ((), 5),
_ => panic!("bad op: {:?}", op),
}
}
fn postfix_binding_power(op: char) -> Option<(u8, ())> {
let res = match op {
'!' | '[' => (7, ()),
_ => return None,
};
Some(res)
}
fn infix_binding_power(op: char) -> Option<(u8, u8)> {
let res = match op {
'+' | '-' => (1, 2),
'*' | '/' => (3, 4),
'.' => (10, 9),
_ => return None,
};
Some(res)
}
#[test]
fn tests() {
...
let s = expr("x[0][1]");
assert_eq!(s.to_string(), "([ ([ x 0) 1)");
}
(1) Note that we use the same priority for ! as for [. In general, for the correctness of our algorithm, it's pretty important that priorities are never equal when we make decisions. Otherwise, we might end up in a situation like the one before the tiny associativity adjustment, where there were two equally good candidates for reduction. However, we only ever compare a right bp with a left bp! So for two postfix operators it's OK to have the same priority, as they are both right ones.
Finally, the ultimate boss of all operators, the dreaded ternary:
c ? e1 : e2
Is this … an all-over-the-place-fix operator? Well, let's change the syntax of the ternary slightly:
Translator's note: Agda really does have such mixed operators; they are user-definable and are called mixfix operators.
c [ e1 ] e2
And let’s recall that a[i] turned out to be a postfix operator + parenthesis… So, yeah, ? and : are actually a weird pair of parens! And let’s handle it as such! Now, what about priority and associativity? What associativity even is in this case?
a ? b : c ? d : e
To figure it out, we just squash the parens part:
a ?: c ?: e
This can be parsed as
(a ?: c) ?: e
or as
a ?: (c ?: e)
What is more useful? For ?-chains like this:
a ? b :
c ? d :
e
the right-associative reading is more useful. Priority-wise, the ternary is low priority. In C, only = and , have lower priority. While we are at it, let’s add C-style right associative = as well.
Here’s our the most complete and perfect version of a simple Pratt parser:
use std::{fmt, io::BufRead};
enum S {
Atom(char),
Cons(char, Vec<S>),
}
impl fmt::Display for S {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
match self {
S::Atom(i) => write!(f, "{}", i),
S::Cons(head, rest) => {
write!(f, "({}", head)?;
for s in rest {
write!(f, " {}", s)?
}
write!(f, ")")
}
}
}
}
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Token {
Atom(char),
Op(char),
Eof,
}
struct Lexer {
tokens: Vec<Token>,
}
impl Lexer {
fn new(input: &str) -> Lexer {
let mut tokens = input
.chars()
.filter(|it| !it.is_ascii_whitespace())
.map(|c| match c {
'0'..='9'
| 'a'..='z' | 'A'..='Z' => Token::Atom(c),
_ => Token::Op(c),
})
.collect::<Vec<_>>();
tokens.reverse();
Lexer { tokens }
}
fn next(&mut self) -> Token {
self.tokens.pop().unwrap_or(Token::Eof)
}
fn peek(&mut self) -> Token {
self.tokens.last().copied().unwrap_or(Token::Eof)
}
}
fn expr(input: &str) -> S {
let mut lexer = Lexer::new(input);
expr_bp(&mut lexer, 0)
}
fn expr_bp(lexer: &mut Lexer, min_bp: u8) -> S {
let mut lhs = match lexer.next() {
Token::Atom(it) => S::Atom(it),
Token::Op('(') => {
let lhs = expr_bp(lexer, 0);
assert_eq!(lexer.next(), Token::Op(')'));
lhs
}
Token::Op(op) => {
let ((), r_bp) = prefix_binding_power(op);
let rhs = expr_bp(lexer, r_bp);
S::Cons(op, vec![rhs])
}
t => panic!("bad token: {:?}", t),
};
loop {
let op = match lexer.peek() {
Token::Eof => break,
Token::Op(op) => op,
t => panic!("bad token: {:?}", t),
};
if let Some((l_bp, ())) = postfix_binding_power(op) {
if l_bp < min_bp {
break;
}
lexer.next();
lhs = if op == '[' {
let rhs = expr_bp(lexer, 0);
assert_eq!(lexer.next(), Token::Op(']'));
S::Cons(op, vec![lhs, rhs])
} else {
S::Cons(op, vec![lhs])
};
continue;
}
if let Some((l_bp, r_bp)) = infix_binding_power(op) {
if l_bp < min_bp {
break;
}
lexer.next();
lhs = if op == '?' {
let mhs = expr_bp(lexer, 0);
assert_eq!(lexer.next(), Token::Op(':'));
let rhs = expr_bp(lexer, r_bp);
S::Cons(op, vec![lhs, mhs, rhs])
} else {
let rhs = expr_bp(lexer, r_bp);
S::Cons(op, vec![lhs, rhs])
};
continue;
}
break;
}
lhs
}
fn prefix_binding_power(op: char) -> ((), u8) {
match op {
'+' | '-' => ((), 9),
_ => panic!("bad op: {:?}", op),
}
}
fn postfix_binding_power(op: char) -> Option<(u8, ())> {
let res = match op {
'!' => (11, ()),
'[' => (11, ()),
_ => return None,
};
Some(res)
}
fn infix_binding_power(op: char) -> Option<(u8, u8)> {
let res = match op {
'=' => (2, 1),
'?' => (4, 3),
'+' | '-' => (5, 6),
'*' | '/' => (7, 8),
'.' => (14, 13),
_ => return None,
};
Some(res)
}
#[test]
fn tests() {
let s = expr("1");
assert_eq!(s.to_string(), "1");
let s = expr("1 + 2 * 3");
assert_eq!(s.to_string(), "(+ 1 (* 2 3))");
let s = expr("a + b * c * d + e");
assert_eq!(s.to_string(), "(+ (+ a (* (* b c) d)) e)");
let s = expr("f . g . h");
assert_eq!(s.to_string(), "(. f (. g h))");
let s = expr(" 1 + 2 + f . g . h * 3 * 4");
assert_eq!(
s.to_string(),
"(+ (+ 1 2) (* (* (. f (. g h)) 3) 4))",
);
let s = expr("--1 * 2");
assert_eq!(s.to_string(), "(* (- (- 1)) 2)");
let s = expr("--f . g");
assert_eq!(s.to_string(), "(- (- (. f g)))");
let s = expr("-9!");
assert_eq!(s.to_string(), "(- (! 9))");
let s = expr("f . g !");
assert_eq!(s.to_string(), "(! (. f g))");
let s = expr("(((0)))");
assert_eq!(s.to_string(), "0");
let s = expr("x[0][1]");
assert_eq!(s.to_string(), "([ ([ x 0) 1)");
let s = expr(
"a ? b :
c ? d
: e",
);
assert_eq!(s.to_string(), "(? a b (? c d e))");
let s = expr("a = 0 ? b : c = d");
assert_eq!(s.to_string(), "(= a (= (? 0 b c) d))")
}
fn main() {
for line in std::io::stdin().lock().lines() {
let line = line.unwrap();
let s = expr(&line);
println!("{}", s)
}
}
The code is also available in this repository, Eof :-)
Translation. Partial evaluation in guile [jsr-0015]
- March 16, 2025
- Andy Wingo with contributions from Jinser Kafka
- Original
Translated according to the version last updated on 11 October 2011 10:01 AM.
Friends, something awesome just happened: Guile just got itself a respectable inliner.
I have said before on this blog, quoting commenter Rémi Forax, that "inlining is the mother of all optimizations". It is true that inlining opens up space for code motion, constant folding, dead code elimination, and loop optimizations. However, credit might be better laid at the feet of partial evaluation, the mother of all inlining algorithms.
Partial evaluation is a source-to-source transformation that takes your program and produces a better one: one in which any computation that can be done at compile-time is already made, leaving only those computations that need to happen at run-time.
For example, the application
(+ 2 3)
can clearly be evaluated at compile-time. We say that the source expression (+ 2 3) reduces to 5 via constant folding. The result, 5 in this case, is the residual expression.
A more complicated example would look like:
(let ((string->chars
(lambda (s)
(define char-at
(lambda (n) (string-ref s n)))
(define len
(lambda () (string-length s)))
(let loop ((i 0))
(if (< i (len))
(cons (char-at i)
(loop (1+ i)))
'())))))
(string->chars "yo"))
=> (list #\y #\o)
Here when I write =>, you should read it as, "residualizes at compile-time to". In this case our input program residualized, at compile-time, to a simple list construction. The loop was totally unrolled, the string-refs folded, and all leaf procedures were inlined.
Neat, eh?
optimization enables expressiveness [#272]
- March 16, 2025
- Andy Wingo
If the partial evaluator does its job right, the residual program will run faster. However this isn't the real reason that I'm so pleased with it; rather, it's that it lets me write different programs.
You see, I hack on Guile's compiler and VM and all that. When I write code, I know what Guile is going to do with it. Unfortunately, this caused my programs to be uglier than necessary, because I knew that Guile wasn't going to inline some important things for me. I wrote at a lower level of abstraction, because I couldn't trust the compiler.
Now, with the partial evaluator, I'm happy to use helper functions, even higher-order helpers, with the knowledge that Guile will mostly do the right thing. This is particularly important in the context of languages that support syntactic abstraction, like Scheme. If you're a Schemer and haven't seen Kent Dybvig's Macro Writers' Bill of Rights talk (slides), do check it out.
Incidentally, there was a sad moment in JSConf.eu a couple weekends ago when Andreas Gal (of all people!) indicated that he had to manually inline some functions in PDF.js in order to get adequate speed. More on JavaScript a little later, though.
about partial evaluation [#273]
- March 16, 2025
- Andy Wingo
A partial evaluator looks a lot like a regular meta-circular evaluator. It's a recursive function that takes an expression and an environment and yields a value. Guile's partial evaluator, peval, builds up lexical environments when it sees let and other binding constructs, and tries to propagate copies when it sees lexical references.
Inlining is facilitated by copy-propagation of lambda expressions. Just as the initial value 0 in the example above propagates through the lexical variable i to reach (< i (len)), (lambda () (string-length s)) propagates to len. Application of a lambda expression reduces to the equivalent of a let binding. So for the first iteration of loop above, we have:
(< i (len))
;; copy propagation
=> (< 0 ((lambda () (string-length s))))
;; beta-reduction
=> (< 0 (string-length s))
;; copy-propagation
=> (< 0 (string-length "yo"))
;; constant-folding
=> (< 0 2)
;; constant-folding
=> #t
In this case the condition folded to a constant, so we know at compile-time which branch to take. The second branch is dead, so we eliminate it. The process continues until we finally produce the resulting list.
down the rabbit hole [#274]
- March 16, 2025
- Andy Wingo
Up to here things are easy: we have a simple, well-typed example that terminates. But to be part of a real-world compiler, a partial evaluator needs to handle real-world code: accessors for mutable data, access to mutable bindings (lexical and global), indefinite recursion, unbound variables, and poorly-typed programs. In addition, a real-world inliner needs to run quickly and avoid producing bloated residual code.
I should take a moment and note that statically-typed, functional languages can avoid a number of these problems, simply by defining them away. It is no wonder that compiler people tend towards early binding. Scheme does exhibit a fair amount of early binding through its use of lexical scope, but it is not a pure functional language. Working on this peval was the first time that I wished for immutable pairs in Scheme, as promoted by Racket and R6RS.
Anyway, having mutability in your language isn't so bad. You do miss some optimization opportunities, but that is OK. What is not OK in a production peval is spending too much time on an expression.
Guile's solution, following Waddell and Dybvig's excellent Fast and Effective Procedure Inlining, is to simply count the number of times through the inliner. Each inlining attempt gets a fresh counter, and any work performed within an inlining attempt decrements the counter. When the counter reaches zero, the inlining attempt is aborted, and a call is residualized instead. Since the number of call sites in the program is fixed, and there is a maximum amount of work that will be done at each call site, the resulting algorithm is O(N) in the size of the source program.
Guile's partial evaluator also uses the on-demand, online strategy of Waddell and Dybvig, to allow definitions to be processed in their use contexts. For example, (cons 1 2) may be reduced to #t when processed as a test, in a conditional. If, after processing the body of a let, a binding is unreferenced, then it is processed for effect. Et cetera.
With the effort counter in place, Guile simply tries to inline every call site in the program, knowing that it will bail out if things don't work. It sounds a little crazy, but it works, as Waddell and Dybvig show. The effort counter also serves to limit code growth, though it is a bit crude. In any case I got less than a percent of code growth when optimizing the psyntax expander that Guile uses, which is a win in my book.
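To make the counter idea concrete, here is a minimal sketch (my illustration of the scheme described above, not Guile's actual peval; the uniform one-unit cost per node and the walk traversal are made up for the example):

;; Each inlining attempt gets a fresh fuel counter; any work done
;; during the attempt decrements it.  When the fuel reaches zero the
;; attempt aborts and the caller residualizes a call instead.
(define (attempt-inline body initial-fuel)
  (call-with-current-continuation
   (lambda (abort)
     (let ((fuel initial-fuel))
       (define (spend!)
         (set! fuel (- fuel 1))
         (when (<= fuel 0)
           (abort #f)))               ; out of fuel: give up on inlining
       (let walk ((expr body))
         (spend!)                     ; every node processed costs effort
         (if (pair? expr)
             (map walk expr)
             expr))))))

;; (attempt-inline '(+ 1 (* 2 3)) 100) => (+ 1 (* 2 3))
;; (attempt-inline '(+ 1 (* 2 3)) 3)   => #f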
caveats [#275]
- March 16, 2025
- Andy Wingo
Partial evaluation can only propagate bindings whose definitions are known. In the case of Guile, then, that restricts inlining to lexical references and primitive references, and notably excludes global references and module imports, or fields of mutable objects. So this does not yet give us cross-module inlining, beyond the hacks that abuse the macro expander.
This observation has a correlary, in that some languages promote a style of programming that is difficult to analyze. I'm really talking about object-oriented languages here, and the dynamic ones in particular. When you see o.foo() in Java, there is at least the possibility that foo is a final method, so you know you can inline it if you choose to. But in JavaScript if you see o.foo(), you don't know anything: the set of properties of o can and does vary at runtime as people monkey-patch the object o, its prototype, or Object.prototype. You can even change o.__proto__ in most JS implementations. Even if you can see that your o.foo() call is dominated by a o.foo = ... assignment, you still don't know anything in ES5, as o could have a setter for the foo property.
This situation is mitigated in the JavaScript world by a couple of things.
First of all, you don't have to program this way: you can use lexical scoping in a more functional style. Coupled with strict mode, this allows a compiler to see that a call to foo can be inlined, as long as foo isn't mutated in the source program. That is a property that is cheap to prove statically.
However, as Andreas Gal found out, this isn't something that the mainstream JS implementations do. It is really a shame, and it has a lasting impact on the programs we write.
I even heard a couple people say that in JS, you should avoid deep lexical bindings, because the access time depends on the binding depth. While this is true for current implementations, it is a property of the implementations and not of the language. Absent with and eval-introduced bindings, a property that is true in strict-mode code, it is possible to quickly compute the set of free variables for every function expression. When the closure is made, instead of grabbing a handle on some sort of nested scope object, a JS implementation can just copy the values of the free variables, and store them in a vector associated with the function code. (You see, a closure is code with data.) Then any accesses to those variables go through the vector instead of the scope.
For assigned variables -- again, a property that can be proven statically -- you put the variables in a fresh "box", and rewrite accesses to those variables to go through that box. Capturing a free variable copies the box instead of its value.
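A minimal sketch of the box technique in Scheme (my illustration; a real implementation performs this rewrite during closure conversion inside the compiler, not in source code):

;; `n` is assigned, so it lives in a one-slot box (a vector here).
;; A closure capturing `n` copies the box, not the value, so every
;; copy shares the same mutable cell.
(define (make-counter)
  (let ((n (vector 0)))                ; the box
    (lambda ()
      (vector-set! n 0 (+ 1 (vector-ref n 0)))
      (vector-ref n 0))))

;; (define c (make-counter))
;; (c) => 1, (c) => 2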
There is nothing new about this technique; Cardelli and Dybvig (and probably others) discovered it independently in the 80s.
This point about closure implementation is related to partial evaluation: people don't complain much about the poor static inliners of JS, because the generally poor closure implementations penalize lexical abstraction. Truly a shame!
It seems I have digressed. Sorry about that!
I spoke about closures and lexical scope, properties of the JS language that can enable static inlining. The second (and more important) way that JS implementations can support inlining is dynamically. I trolled about that some months ago. Dynamic inlining is fantastic, when it works, though there are limiting heuristics (scroll down to "inlining", and note that the exact set of heuristics have changed in the intervening months).
So my last point was about something that Guile does well that JS implementations do poorly, and it's fair that this point should be the reverse. I would like to be able to dynamically inline, but this would mean associating the intermediate representation with Scheme functions. As Guile can compile code ahead-of-time, this means we would have to serialize the IR out to disk, in much the same way as GCC's new link-time optimizer (LTO) does. But I would like to put that off until we change the format of compiled Guile code to be ELF. Otherwise we run the risk of bloating our runtime memory size.
try it out [#276]
- March 16, 2025
- Andy Wingo
Guile's partial evaluator was joint work between myself and my fellow Guile maintainer Ludovic Courtès, and was inspired by a presentation by William Cook at DSL 2011, along with Waddell and Dybvig's Fast and Effective Procedure Inlining.
This code is currently only in the development Guile tree, built from git. Barring problems, it will be part of Guile 2.0.3, which should be out in a couple weeks.
You can check out what the optimizer does at the command prompt:
>,optimize (let ((x 13)) (* x x))
$1 = 169
>,optimize (let ((x 13)) (* x foo))
$2 = (* 13 foo)
Have fun, and send bugs to [email protected].
Translation. Python Negatypes [jsr-0014]
- March 5, 2025
- Hillel Wayne with contributions from Jinser Kafka
- Original
Translated according to the version last updated on Sep 26, 2019.
Main text [#277]
- March 5, 2025
- Hillel Wayne
from abc import ABC, abstractmethod
class AbstractIterable(ABC):
@abstractmethod
def __iter__(self):
while False:
yield None
def get_iterator(self):
return self.__iter__()
ABCs were added to strengthen the duck typing a little. If you inherited AbstractIterable, then everybody knew you had an implemented __iter__ method, and could handle that appropriately.
Unsurprisingly, this idea never caught on. People instead preferred “better ask forgiveness than permission” and wrapped calls to __iter__ in a try block. This could be useful for static type checking, but in practice Mypy doesn’t use it. What if you wanted to typecheck it had __iter__ but the person did not inherit from AbstractIterable? The Mypy team instead uses protocols, which is bootstrapped off ABCs but hides that detail from the user.
But ABC was intended to be backwards compatible. And there were already existing classes that had an __iter__ method. How could we include them under our AbstractIterable ABC? To handle this, the Python team added a special ABC method:
class AbstractIterable(ABC):
@classmethod
def __subclasshook__(cls, C):
return hasattr(C, "__iter__")
__subclasshook__ is the runtime condition that makes something count as a child of this ABC. isinstance(OurClass(), AbstractIterable) is true if OurClass has an __iter__ attribute, even if it didn't inherit from AbstractIterable.
That function is a runtime function. We can put arbitrary code in it. It passes in the object’s class, not the object itself, so we can’t inspect the properties of the specific object. We can still do some weird stuff:
class PalindromicName(ABC):
    @classmethod
    def __subclasshook__(cls, C):
        name = C.__name__.lower()
        return name[::-1] == name
任何名字是回文的类,像是“Noon”,都将被视为
PalindromicName
的子类。我们可以更进一步:既然可以跳下去,为什么还要
凝视深渊
呢?
Any class with a palindromic name, like “Noon”, will counts as a child class of PalindromicName. We can push this even further: why gaze into the abyss when you can jump in?
class NotIterable(ABC):
    @classmethod
    def __subclasshook__(cls, C):
        return not hasattr(C, "__iter__")
这是所有不可迭代的类型。比如
isinstance(5, NotAString)
之类的。我们创建了一个
负类型
:一种仅指定它不是什么的类型。我们可以将其作为
正类型
集减去给定类型的子集,剩下的那部分。没有什么可以阻止我们为“不是字符串的可迭代对象”或“返回的对象与同一可调用对象不相同的可调用对象”创建 ABC。
This is the type of everything that isn’t iterable. We have isinstance(5, NotAString), etc. We’ve created a negatype: a type that only specifies what it isn’t. We can include this as part of a set of positive types, subtracting out a subset of a given type. There’s nothing stopping us from making an ABC for “iterables that aren’t strings”, or “callable objects that don’t return an object of the same callable”.
这有什么用?
How is this useful?
[#278]
- March 5, 2025
- Hillel Wayne
这有什么用? How is this useful? [#278]
- March 5, 2025
- Hillel Wayne
不知道。
No idea.
ABCs 不是作为方法解析顺序的一部分被检查的，因此你不能用它来
打补丁
式地加入属性。Mypy 无法检查
__subclasshook__
。如果你想使用它进行运行时检查,编写函数会比创建 ABC 更简单、更可移植。唯一不同的情况是
单分派
函数,它可以在 virtual ABC 上分派。但仅此而已。
ABCs aren’t checked as part of the method resolution order, so you can’t use this to patch in properties. Mypy can’t check __subclasshook__. If you want it for runtime-checks, writing a function would be simpler and more portable than creating an ABC. Just about the only case where there is a difference is with single-dispatch functions, which can dispatch on virtual ABCs. But that’s about it.
但它很酷!
It’s pretty cool, though!
Translation. Grasping `all-the-apples-at-once' [jsr-000U]
- March 3, 2025
- Oleg Kiselyov with contributions from Jinser Kafka, 阅卜录
- Original
Translation. Grasping `all-the-apples-at-once' [jsr-000U]
- March 3, 2025
- Oleg Kiselyov with contributions from Jinser Kafka, 阅卜录
- Original
tips: Click on the translation to expand and view the original text. 点击译文可展开查看原文。
Translated according to the version last updated on June 3, 2024.
根据2024年6月3日更新的版本翻译。
介绍
Introduction
[#279]
- March 3, 2025
- Oleg Kiselyov
介绍 Introduction [#279]
- March 3, 2025
- Oleg Kiselyov
这是一次轻松的讨论,内容是对一次即兴编程竞赛的分析:参与者使用自选的编程语言解决一个简单但现实的问题。结果显示,那些明显正确且用恰当语言表达出的解决方案,都不是用我预期的语言写成的。这个讨论记录了令人不安的观察和几个意想不到的鼓励,同时也反思了常见的编程语言教学及讨论方式。
This is a light talk with the analysis of an impromptu programming contest: solving a simple but realistic problem in a language of participants' choice. It turns out that the solutions that were obviously right, expressed in just the right words were written not in the languages I have expected. This talk notes the unsettling observations and a few unexpected encouragements, along with a reflection on the common way of teaching and arguing about programming languages.
这场竞赛发生在一个以计算机为中心的社区 Lobsters。Lobsters 有点像 HN，也是一个论坛，会员们会提交和讨论各种文章，从学术期刊论文到博客文章、公告和视频。这个社区可以被称为“主流 IT”：从嵌入式系统到 Web 的开发人员、系统管理员、学生、计算机爱好者以及少数学者。
The contest happened at a computer-focused community Lobsters. Somewhat like HN, Lobsters is a forum for its members to submit and discuss articles: from journal papers to blog posts, announcements and videos. The community is what one may call the `mainstream IT': developers, from embedded systems to the Web, system administrators as well as students, computer hobbyists, and a few academics.
某天,一篇关于常见工作面试问题的文章被发布出来以供讨论。随后,成员们自发地开始提交他们用自己选择的编程语言编写的解决方案。他们还对自己的和其他人的解决方案进行了评论。
One day an article about a common job interview question was posted for discussion. Spontaneously, members started submitting their own solutions, in their language of choice. They also commented on their own and others solutions.
令我惊讶的是,无论它们是使用什么编程语言编写的,是否是函数式语言(如 Haskell、Erlang、Clojure)或命令式语言(如 Go、Ruby、Raku),这些解决方案被整齐地分成了两类。我现在才意识到,递归不是只在理论上才是迭代的对偶,就像循环一样,它是 low-level 的:一次选择一个苹果。同样,“纯函数式”的状态处理和命令式的状态处理也是如此。
It has struck me how the solutions neatly fall into two classes -- irrespective of the language they are written in, be it a functional (such as Haskell, Erlang or Clojure) or imperative (Go, Ruby, Raku). Only now I realized that recursion is dual to iteration not just in theory. Like loops, it is low-level: selecting one apple at a time. Likewise is the `pure-functional' and imperative handling of state.
我们如何一次抓住所有苹果?就像在自然语言的学习和教学中一样,我们是否应该少关注语法,而更多地关注对话和讲故事?
How do we grasp all the apples at once? As in learning and teaching of natural languages, should we focus less on grammar and more on conversation and storytelling?
该演讲是在 2021年7月1日 的 IFIP WG 2.1 线上会议进行的。
The talk was given at an online meeting of IFIP WG 2.1 on July 1, 2021.
The Lobsters thread with the contest. February 2021.
<https://lobste.rs/s/zpqrpc/random_job_interview_challenge_clojure>
竞赛问题
Contest problem
[jsr-000V]
竞赛问题 Contest problem [jsr-000V]
Alexey Grigorev (@Al_Grigor) February 3, 2021 (Twitter)
大多数候选人都无法解决这个面试问题:
输入: "aaaabbbcca"
输出: [("a",4), ("b", 3), ("c", 2), ("a", 1)]
写一个将输入转换为输出的函数
我在视频面试中提出这个问题,给予 25 分钟时间
你会如何解决?
Most candidates cannot solve this interview problem:
Input: "aaaabbbcca"
Output: [("a",4), ("b", 3), ("c", 2), ("a", 1)]
Write a function that converts the input to the output
I ask it in the screening interview and give it 25 minutes
How would you solve it?
Alexey Grigorev (@Al_Grigor) February 3, 2021 (Twitter)
The statement of the problem is quoted from the web page posted on Lobsters:
A Random Job Interview Challenge In Clojure. Feb 4 2021 by Savo Djuric
<https://savo.rocks/posts/a-random-job-interview-challenge-in-clojure/>
解法样本
Sample solutions
[jsr-000W]
解法样本 Sample solutions [jsr-000W]
提交到 Lobsters 供讨论的博客文章描述了在 Clojure 中的一个解决方案。但不知为何,这促使会员们用自己选择的编程语言来解决这个问题并发布结果。于是这就形成了一场即兴竞赛。
The blog post submitted to Lobsters for discussion described one solution, in Clojure. Somehow it prompted the members to solve the problem in their language of choice, and post the results. Hence there developed an impromptu contest.
以下是已发布的解法样本。这些解决方案由不同的人用不同的编程语言编写⸺然而不知为何,它们分成了两组,我暂时称之为 Group I 和 Group II。我稍后会告诉你这些组的意义⸺请让我卖个关子。
Below is a sample of posted solutions. They are written in various languages by various people -- yet somehow fall into two groups, which I call for now Group I and Group II. I'll tell what they mean only later -- I hope you'd allow me a bit of suspense.
Clojure (Group I) [#297]
Clojure (Group I) [#297]
(->> "aaaabbbcca"
(partition-by identity)
(map (juxt (comp str first) count)))
提交者还提供了另一个解法:“有点繁琐,但更清晰”:
(for [p (partition-by identity "aaaabbbcca")]
  [(-> p first str) (count p)])
The submitter also gave another solution, ``a bit chattier but also more legible:''
值得注意的是,
comp
是
函数组合
,而 juxt
粗略的类型是 (a -> b * a -> c) -> a -> (b * c)
,这是一个组合子,它接收一个函数二元组和一个值,对该值应用两个函数,然后返回由这两个函数的返回值组成的元组。
It is worth noting that comp is the functional composition, and juxt, with the rough type (a->b * a->c) -> a -> (b * c), is the combinator that takes a tuple of functions and a value, applies each function to it and returns the results as a tuple.
Ruby (Group I) [#298]
Ruby (Group I) [#298]
Ruby 有三种解法⸺都是单行的,而且很相似。第一个可能是最容易理解的。
There were three solutions in Ruby -- all one-liners, and rather similar. The first is probably the easiest to understand.
"aaaabbbcca".split("").chunk_while {|a,b| a==b}.map {|a| [a.first, a.size]}
"aaaabbbcca".chars.slice_when {|c,n| c != n }.map {|s| [s[0], s.size]}
'aaaabbbcca'.chars.chunk(&:itself).map{|x,y| [x, y.size]}
APL (Group I) [#299]
APL (Group I) [#299]
两个 APL 解法:
Two APL solutions:
(((1∘+ +\) ⊢ ≠ ¯1∘⌽) (((⊃,≢) ⊢) ⌸) ⊢) 'aaaabbbcca'
f←(⊃,≢)¨⊢⊆⍨1+(+\0,2≠/⊢)
⊢x←f'aaaabbbcca'
Go (Group II) [#300]
Go (Group II) [#300]
我知道这是一个 Clojure 的帖子,但是出于好奇,我用 Go 编写了它(我最近一直在使用 Go)。只是为了看看是否可以快速解决。我花了大约 10 分钟,减去 3-4 分钟与 runes 战斗的时间。 这不是特别优雅,但能工作。
``I know this is a Clojure post, but out of curiosity I wrote it in Go (what I've been working in lately) just to see if I could solve it quickly. Took about 10 minutes, subtract 3-4 for fighting with runes. It's not particularly elegant, but it works.''
type result struct {
    letter string
    count  int
}

func main() {
    const input = "aaaabbbcca"
    var ret []result
    currentLetter := string(input[0])
    countCurrentLetter := 1
    for _, elem := range input[1:] {
        elemAsString := string(elem)
        if currentLetter == elemAsString {
            countCurrentLetter++
        } else {
            ret = append(ret, result{currentLetter, countCurrentLetter})
            currentLetter = elemAsString
            countCurrentLetter = 1
        }
    }
    ret = append(ret, result{currentLetter, countCurrentLetter})
    fmt.Printf("%+v", ret)
}
该解法与类似解法的显著特征是明确的迭代:
for
循环。
The salient feature of this and similar solutions is explicit iteration: the for loop.
Racket (Group II) [#301]
Racket (Group II) [#301]
(for/fold ([p #f]
           [cnt 0]
           [counts null]
           #:result (cdr (reverse (cons `(,p ,cnt) counts))))
          ([c (in-string "aaaabbbcca")])
  (if (equal? p c)
      (values c (add1 cnt) counts)
      (values c 1 (cons `(,p ,cnt) counts))))
该解法是纯函数式的⸺但与上面 Go 的解决方案非常相似。
The solution is pure functional -- yet closely resembling the Go solution above.
Raku (Group I) [#302]
Raku (Group I) [#302]
"aaaabbbcca".subst( /(.) $0*/ , {"({$0},{$/.chars})" }, :g )
Raku 就是那位“前身为 Perl 6 的艺术家”。Perl 一直以其（不）正则的表达式而闻名，其威力在此可见一斑。
Raku is the artist formerly known as Perl 6. Perl has always been known for its (ir)regular expressions, whose power is on display.
Erlang (Group II) [#303]
Erlang (Group II) [#303]
L = "aaaabbbcca",
lists:reverse(lists:foldl(
fun
(Elt, [ {[Elt], N} | R ]) -> [ {[Elt], N + 1} | R];
(Elt, Acc) -> [{[Elt], 1} | Acc]
end,
[],
L
)).
与 Racket 解法相似。我们看到了 Erlang
非线性模式匹配
的特征:在第一个模式中变量
Elt
出现了两次。
It is rather similar to the Racket solution. We see the characteristic for Erlang non-linear pattern matching: the variable Elt appears twice in the first pattern.
Python (Group I) [#304]
Python (Group I) [#304]
[(k, len(list(g))) for k, g in itertools.groupby("aaaabbbcca")]
Rust (Group II) [#305]
Rust (Group II) [#305]
fn run_length_encode(ins: &str) -> Vec<(char, usize)> {
    let mut out = vec![];
    let mut i = ins.chars();
    if let Some(mut c) = i.next() {
        let mut count = 1;
        for new_c in i {
            if new_c == c {
                count += 1;
            } else {
                out.push((c, count));
                count = 1;
                c = new_c;
            }
        }
        out.push((c, count));
    }
    out
}
就像 Go 的解法一样⸺但是更干净，并且更好地处理边缘情况。作者确实把握住了问题的本质：
游程编码
。
It is quite like the solution in Go -- but cleaner and with a better handling of edge cases. The author did grasp the nature of the problem: run-length encoding.
Julia (Group II) [#306]
Julia (Group II) [#306]
function rle(itr)
    acc = Tuple{eltype(itr), Int}[]
    x = iterate(itr)
    isnothing(x) && return acc # iterable is empty
    last, state = x
    n = 1
    for v in Iterators.rest(itr, state)
        if last == v
            n += 1
        else
            push!(acc, (last, n))
            last = v
            n = 1
        end
    end
    push!(acc, (last, n))
    return acc
end
镜像了上面的 Rust 解法。
It mirrors the Rust solution above.
Haskell (Group II) [#307]
Haskell (Group II) [#307]
main = interact soln
soln = show . reverse . go []
go counts [] = counts
go counts (x:xs) = go (inc x counts) xs
inc c ((x,ct):xs) | c == x = (x,ct+1):xs
inc c xs = (c, 1):xs
尽管该解法依赖于显式递归而不是
fold
,但它与 Erlang 解法类似。(后来有另一个人提到了另一种使用 group
的解法)。
Although this solution relies on the explicit recursion rather than fold, it is rather similar to the Erlang solution. (A different person later mentioned another solution, using group).
评论
Comments
[jsr-000X]
评论 Comments [jsr-000X]
Lobsters 竞赛的一个特色是参与者对自己和他人的解法的评论。以下是几条值得注意的评论,它们进一步证明了我在本次演讲中的观点。
A particular feature of the Lobsters' contest is comments of the participants, about their and others' solutions. Below are several notable comments, which go on to make my point in this talk.
一位评论者想知道:
One commenter wondered:
阅读 Clojure 解法需要跟着作者的思维过程。阅读等效的 Go 解法几乎没有认知开销。Clojure 程序员如何编写代码，才能不仅在作者眼中简洁优雅，而且还可以让其他人快速阅读而无需花费时间进行脑力劳动？
``Reading the Clojure solution involves following the thought process of how the author ended up there. Reading the equivalent Go solution has very little cognitive overhead.
How does Clojure programmers write code that is not just brief and elegant in the eyes of the author, but can also be read quickly by others without time consuming mental gymnastics?''
另一人回答:
and another replied:
我想这只是你的习惯的问题。我发现上面的两种 Go 解法都很难理解,因为你必须遵循算法并在脑海中构建它的功能。而在大多数 Clojure 解法中,代码会直接告诉你发生了什么。
I guess it's just a matter of what you're used to. I find both Go solutions above arduous to understand because you have to follow the algorithm and construct what it functionally does in your head. Whereas in most of the clojure solutions the code directly tells you what's happening.
例如,带有注释的最高帖子:
Eg. the top post with comments:
; threading macro (take input, then do stuff to it in steps)
(->> "aaaabbbcca"
     ; split into lists of consecutive same character
     (partition-by identity)
     ; map each item (list of chars) with a function
     (map
      ; return list of outputs of functions
      (juxt
       ; first character as string
       (comp str first)
       ; count of characters
       count)))
该信息更加密集。如果你擅长阅读它,你可以一目了然地理解它。
The information is much more dense. If you are good at reading it, you understand it at a glance.
我相信 APL 解法也是如此。我读不懂它们，但确实有人能清楚地看到其中的算法。
I believe the same is true for the APL solutions. I can't read them, but someone who does sees the algorithm clearly.''
另一位评论者同意:
Yet another commenter concurred:
我也是；他们的 [Go 解法] 似乎超级繁琐和低级。就像你要给编译器灌输基本的东西，而不是自然地描述算法一样。
``Same here; they [Go solutions] just seem super tedious and low-level; like you're spoon-feeding the compiler on basic things instead of just describing the algorithm naturally.
如果你没有词汇来描述算法,那么当然低级版本会更容易阅读,就像刚刚学习英语的人会发现“简化英语”维基百科比经常使用专业术语的常规维基百科更容易理解一样。但如果你每天都使用算法,你就应该投资学习这些术语!
If you don't have the vocabulary to describe an algorithm, of course the low-level version is going to be easier to read, much like someone who is only just learning English will find the `Simplified English' wikipedia more accessible than the regular one that often uses specialized jargon. But if you work with algorithms every day, you owe it to yourself to invest in learning the jargon!''
苹果类比
The apples analogy
[jsr-000Y]
苹果类比 The apples analogy [jsr-000Y]
我已经提到了几次苹果。以下引用能说明一切。
I have mentioned apples several times already. The following quote explains what is it all about.
假设我们有一箱苹果,想从中挑选出所有好苹果。我们不想自己做这个任务,所以我们写了一张清单,要求助手来完成。
``Suppose we have a box of apples from which we wish to select all of the good apples. Not wishing to do the task ourselves, we write a list of instructions to be carried out by a helper.
常规语言的指令可能表达如下内容:从箱子里选一个苹果。如果是好的,就把它放在为好苹果保留的地方;如果不好,就把它扔掉。再选一个苹果;如果是好的,就把它放在保留的地方;如果不好,就把它扔掉…继续以这种方式依次检查每个苹果,直到选出所有好苹果。总之,从我们拿起的第一个苹果开始,一次检查一个苹果,直到检查到箱子里的最后一个苹果,直到我们选出并把所有好苹果都放在一边。
The instructions corresponding to a conventional language might be expressed something like the following: Select an apple from the box. If it is good, set it aside in some place reserved for the good apples; if it is not good, discard it. Select a second apple; if it is good put it in the reserved place, and if it is not good discard it.... Continue in this manner examining each apple in turn until all of the good apples have been selected. In summary, then, examine the apples one at a time starting with the first one we pick up and finishing with the last one in the box until we have selected and set aside all of the good apples.
另一方面,与数组语言相对应的指令可以简单地说明“从箱子里选择所有好苹果”。
On the other hand the instructions corresponding to an array language could be stated simply as ``Select all of the good apples from the box.'' Of course, the apples would still have to be examined individually, but the apple-by-apple details could be left to the helper.
可以考虑常规的编程语言是以机器语言为原始形式的“
一次一苹果
”语言,而数组语言可以视为“
一次所有苹果
”语言。
Conventional programming languages may be considered then ``one-apple-at-a-time'' languages with machine languages being a very primitive form, whereas array languages may be considered ``all-the-apples-at-once'' languages.''
2005年10月,
基思·斯迈利
:《My Life with Array Languages》。这个比喻要感谢那个以写出《人月神话》 而闻名的
弗雷德
·P·
布鲁克斯
。
Keith Smillie: ``My Life with Array Languages''. October 2005. This analogy is credited to Frederick P. Brooks: that Fred Brooks, of the Mythical Man-Month fame.
（译者按：《My Life with Array Languages》原文为 “The apple analogy originated with Frederick P. Brooks who worked closely with Ken Iverson in the early days of the development of APL.” 原翻译有错，感谢 阅卜录 指出。）
观察
Observations
[jsr-000Z]
观察 Observations [jsr-000Z]
在一场不受控制的竞赛中，仅凭几个针对单个问题的解法，我们无法得出任何统计结论。但仍有足够的数据供我们观察和思考。第一个观察结果是，人们成功地使用了多种语言来解决问题，提交者自述大约花了 5-10 分钟。提交的解法看起来也很地道，反映了“社区标准”，即这些语言被（自）学以及在业界使用的方式。
A few solutions to a single problem in an uncontrolled contest do not let us draw any statistical conclusions. Still there is enough data for observations and reflections. The first observation is that quite a variety of languages were used to solve the problem, successfully, and it took the submitters, by their self-reports, about 5-10 minutes. The submitted solutions also seem idiomatic, reflecting the `community standards', the way these languages are (self-) taught and used in the industry.
第二个令人惊讶的观察结果是,提交的解法分为两类:
The second, surprising observation is that the submitted solutions fall into two classes:
- scan the input string char by char, building the result: Group II
- see the problem as groupBy followed by mapping: Group I
这是编写各组解法的语言:
Here are the languages used to write the solutions in each group:
- Group I: Clojure, Ruby, Raku, Python, Haskell, APL
- Group II: Go, Racket, Erlang, Rust, Julia, Haskell
无论选择哪种方式分类编程语言⸺函数式与命令式、旧式与新型、由“业余爱好者”创建与由“专业人士”创建⸺这两类语言中都有表现。
Whichever the classification of programming languages one may choose -- functional vs. imperative, old vs. new, created by `amateurs' vs. `professionals' -- there is a representative in both groups.
看看 Racket 和 Rust(或 Go、Julia)的解法,可以得出另一个关于
状态
的结论。尽管 Racket 解法是“纯函数式”的,但它与 Rust 和 Go 解法一样拥有状态。状态数量相同。Racket 解法中的状态使用
values
子句一次性更新;更新按位置而不是按名称与其目标匹配,这很容易出错。(顺便说一句,Rust 和 Julia 对边缘情况的处理比 Racket 或 Go 解法更清晰,显然更正确。)
Looking at the Racket and Rust (or Go, Julia) solutions leads to another conclusion, about state. Although the Racket solution is `pure functional', it is just as stateful as Rust and Go solutions. There is the same amount of state. The state is updated in the Racket solution at once, with the values clause; the update is matched to its target by position rather by name, which is quite error prone. (As a side observation, the handling of edge cases is clearer and obviously correct in Rust and Julia than in the Racket or Go solutions.)
因此,“纯函数式”并不会消除状态;它只是改变了处理状态的方式⸺这可能或多或少更清晰和方便。毫无疑问,改进状态处理的方法是将其划分并封装到“
更高级
”、
可组合
的操作中,如“groupBy”:参见 Clojure、Ruby、Python 解法。groupBy 本身可以实现可变或不可变状态⸺无论怎样,状态都被抽象出来,groupBy 的用户不必关心它。
Thus being `pure functional' does not eradicate state; it merely changes the way state is handled -- which could be either more or less clearer and convenient. What unequivocally improves the handling of the state is partitioning and encapsulating it into `higher-level', composable operations like `groupBy': see Clojure, Ruby, Python solutions. The groupBy may itself be implemented with mutable state or without -- no matter, the state is abstracted away and the user of groupBy does not have to care about it.
惊喜
Pleasant surprises
[jsr-0010]
惊喜 Pleasant surprises [jsr-0010]
第一个惊喜是 Group I(即使用
一次所有苹果
解法)的人数众多⸺而且当今使用的语言也具有足够的能力来表达这些解决方案。
The first pleasant surprise was the notable number of Group I, that is, all-the-apples-at-once solutions -- and the number of languages in use today with the adequate facilities to express these solutions.
引发此次竞赛的 Savo Djuric 的博客文章也描述了一个解法,值得引用。
The blog post by Savo Djuric that prompted the contest has also described a solution, which is worth quoting.
由于我最近才开始学习 Clojure,而且我经常做 4clojure 练习,所以这个问题激起了我的兴趣,我决定立即尝试解决它。我启动了我的 REPL,几分钟后,我写出了:
Since I recently started learning Clojure, and I regularly do 4clojure exercises, this problem seemed intriguing, so I decided to try to solve it immediately. I fired up my REPL, and within several minutes i came up with:
(interleave (map str (map first (partition-by identity "aaaabbbcca")))
            (map count (partition-by identity "aaaabbbcca")))
=> ("a" 4 "b" 3 "c" 2 "a" 1)
耶！这有用！但它并不完全符合要求。可以说，它也不是很“干净”。这个解法最让我困扰的是重复的代码，所以我决定用 let 来解决这个问题：...
Yaay! It works! But it is not exactly what was asked for. Nor is it very ‘clean’, so to say. What bothers me the most in this solution is that i have repetitive code, so I decided to solve that with let:...
(defn chop-chop [coll]
  (let [x (partition-by identity coll)]
    (map list (map (comp str first) x) (map count x))))
我必须说,在几分钟之内解决问题感觉很棒(尽管这并不难),因为我只是刚开始 Clojure 的旅程。
I must say that it felt great to solve a problem within several minutes (albeit it was not that hard), since I'm only at the beginning of my journey with Clojure.
这篇博文展示了我认为的解决问题的理想方法:把握问题的本质⸺
groupBy
,或 Clojure 中的partition-by
⸺设计解法原型,理解其缺陷并希望做得更好,采取措施改进。我认为作者作为程序员前途光明。
The blog post shows the ideal, to me, way to solve the problem: grasping its nature -- groupBy, or partition-by in Clojure, -- prototyping the solution, understanding its imperfections and wanting to do better, taking steps to improve. I think the author has a bright future as a programmer.
我从他的博客中得知,作者从未上过大学,偶然进入编程领域(但受到他才华横溢的兄弟的启发),并且一直在自学编程语言(现在是 Clojure)。也许这就是(他能做到这件事的)秘密。
From his blog I learned that the author never went to college, got into programming by accident (but prompted by his talented brother), and has been studying programming languages (now, Clojure) entirely on his own. May be that's the secret.
另一个惊喜是看到了我眼中理想的解法:
Another pleasant surprise was seeing the solution that in my eyes is ideal:
[(k, len(list(g))) for k, g in itertools.groupby("aaaabbbcca")]
这个 Python 解法本质上是用常规的数学符号来表达的,每个人都可以读懂(与 APL 不同)。更重要的是,它在语言上很经济⸺同样,与 APL 不同。它只使用两个直接相关的功能:
groupBy
和 comprehension。符号在这里显然是思维的工具。
This Python solution is expressed in what is essentially the conventional mathematical notation, readable by everyone (unlike APL). More importantly, it is linguistically economical -- again, unlike APL. It uses just the two directly relevant facilities: groupBy and comprehension. Notation is clearly here the tool of thought.
结论:编程和自然语言,软件和文学 [jsr-0011]
结论:编程和自然语言,软件和文学 [jsr-0011]
前面引用的一条评论一针见血:要描述一种算法,就需要
词汇表
。编程的很大一部分是
构建
(
学习
或
构造
)词汇来讨论常见的(子)问题,例如
游程编码
、groupBy、
映射
、
累积
、
过滤
等。学习、寻找和命名“正确”的抽象是数学的核心活动⸺正如“符号作为思维工具”中所阐述的那样。
A comment cited earlier hits the nail on the head: to describe an algorithm, one needs vocabulary. A large part of programming is building -- learning or constructing -- vocabulary to talk about common (sub)problems, e.g., run-length-encoding, groupBy, mapping, accumulating, filtering, etc. Learning, finding and naming the `right' abstractions is the core mathematical activity -- as explicated in the `Notation as a tool of thought'.
自然语言也是一种思维工具。建立词汇表并理解何时以及如何使用某个词是学习一门语言最难的部分。自然语言和形式语言有很多共同之处,这一观点由来已久⸺
蒙塔古
在 20 世纪 70 年代就曾对此作出著名而有力的阐述。在之前引用过的
基思·斯迈利
关于数组语言的论文中,他进一步指出,编程教学应该借鉴外语教学最好的理念。
A natural language is also a tool of thought. Building vocabulary and understanding when and how to use a particular word is the hardest part of learning a language. That natural and formal languages have a lot in common is the idea with a long pedigree -- famously and forcefully articulated by Montague in 1970s. Keith Smillie, in his paper about array languages cited earlier further argues that teaching programming ought to adapt the best ideas of teaching foreign languages.
无论是编程语言还是自然语言,人们都可以相对较快地掌握基础知识。流利掌握则需要几十年的时间。似乎没有什么比不断练习、说和听、写和重写更好。我想起
欧内斯特·海明威
和
乔治·普林普顿
(《Interviewer》杂志的创始人和编辑)之间那次著名访谈的结尾(《
巴黎评论
》第 18 期,1958 年春季):
Both in programming and natural languages, one can acquire basics relatively quickly. Fluency takes decades. There doesn't seem to be anything better than constant practice, speaking and listening, writing and rewriting. The ending of the famous interview between Ernest Hemingway and George Plimpton (the `Interviewer', the magazine's founder and editor) (Paris Review, issue 18, Spring 1958) comes to mind:
采访者 你在读到前一天停下的地方时会重写吗?还是等到全部完成后再重写?
海明威 我总是每天重写到我停下的地方。都完成后,你自然会再看一遍。在别人打字时,你会在干净的排版中看到它,这也是另一个修改和重写的机会。在校样时你还有最后的机会。你会对这几次不同的机会心怀感激。
采访者 你进行了多少次重写?
海明威 看情况。《 永别了,武器 》的结尾,也就是最后一页,我重写了三十九次才满意。
采访者 是有什么技术问题吗?什么问题难住了你?
海明威 推敲用词。
INTERVIEWER Do you do any rewriting as you read up to the place you left off the day before? Or does that come later, when the whole is finished?
HEMINGWAY I always rewrite each day up to the point where I stopped. When it is all finished, naturally you go over it. You get another chance to correct and rewrite when someone else types it, and you see it clean in type. The last chance is in the proofs. You’re grateful for these different chances.
INTERVIEWER How much rewriting do you do?
HEMINGWAY It depends. I rewrote the ending to Farewell to Arms, the last page of it, thirty-nine times before I was satisfied.
INTERVIEWER Was there some technical problem there? What was it that had stumped you?
HEMINGWAY Getting the words right.
尾声:我的“苹果流”解决方案 [jsr-0012]
尾声:我的“苹果流”解决方案 [jsr-0012]
我该如何解决这个问题?我第一次看到它时,我脑海中浮现的第一个想法是:
groupBy
,这是对序列中连续元素进行分组的操作的通用名称。然后我们需要以 (group-element, counter) 对的形式呈现这些组。
How would I solve the problem? When I first saw it, the very first thought that came to mind was: groupBy, which is the common name for the operation to group consecutive elements of a sequence. We then need to present the groups in the form of (group-element, counter) pairs.
为了具体起见（以及其他很快就会清楚的原因），我将使用 OCaml。假设我们有以下操作
I will use OCaml for concreteness (and for other reasons that should be clear soon). Suppose we have the operation
group : 'a list -> 'a list list
它将列表中(根据某些标准)相等的连续元素分组到它们自己的列表中,并返回这些组的列表(组不能为空)。然后,该问题解法的第一个近似写法如下,非常接近上面的第一个 Clojure 解决方案:
which groups equal (according to some criterion) consecutive elements of a list in their own list, and returns the list of such groups. (A group cannot be empty). Then the first approximation to the solution is as follows, quite close to the first Clojure solution above:
let rll (group : char list -> char list list) : char list -> (char * int) list =
  group >> List.(map (fun g -> (hd g, length g)))
类型注释是可选的，但我倾向于写下它们（使用 OCaml 的一个理由就是类型：不花哨，但足以可视化数据的转换，又不带过多细节）。由于 group 是假设存在的，它以参数（parameter/argument）的形式出现。这里的
(>>)
是从左到右的函数组合(就像 F#)。将输入视为
序列
(例如列表)很方便。但字符串在 OCaml 中不是列表⸺不过可以借助几个标准库函数轻松地将其转换为列表:
Type annotations are optional, but I prefer to write them. (One reason to use OCaml is types: not fancy, but sufficient to visualize data transformations, without too many details.) Since group is the assumption, it appears as the parameter (argument). Here, (>>) is the left-to-right functional composition (as in F#). It is convenient to think of input as a sequence, such as list. But a string is not a list in OCaml -- but can be easily turned into one, with the help of a couple standard library functions:
let rll (group : char list -> char list list) : string -> (char * int) list =
  String.to_seq >> List.of_seq >> group >> List.(map (fun g -> (hd g, length g)))
使用 APL 中经常使用的 fork 组合器（但为从左到右的版本）：
With the fork combinator, so frequent in APL (but a left-to-right version):
let fork : ('a -> 'b) -> ('a -> 'c) -> ('b -> 'c -> 'd) -> 'a -> 'd =
  fun f g h -> fun x -> h (f x) (g x)
这个解法甚至可以写得更清楚些:
the solution can be written even cleaner:
let rll (group : char list -> char list list) : string -> (char * int) list =
  String.to_seq >> List.of_seq >> group >> List.(map (fork hd length pair))
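（译者按：以下补充并非原文内容。原文假设了从左到右的函数组合 (>>) 与配对函数 pair，这里给出一种最小的定义示意：）
let ( >> ) f g = fun x -> g (f x)  (* 从左到右的函数组合，如 F# *)
let pair a b = (a, b)              (* fork hd length pair 中用到的配对函数 *)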
我们只需要
group
本身的实现即可运行此代码。可惜的是,OCaml 标准库目前不提供分组操作。我当然可以编写它。但让我更详细地重新表述这种方式。
We only need the implementation of group itself to run this code. Alas, the OCaml standard library at present does not provide a grouping operation. I can of course write it. But let me rephrase this approach in more detail.
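（译者按：这里补充一个非原文的 group 最小实现示意，便于实际运行上面的 rll；它把（按 = 相等的）相邻元素收进同一个列表，产生的组保证非空：）
let group (xs : 'a list) : 'a list list =
  let rec go acc cur = function
    | [] -> List.rev (List.rev cur :: acc)
    | x :: rest ->
      (match cur with
       | y :: _ when y = x -> go acc (x :: cur) rest   (* 同组：继续累积 *)
       | _ -> go (List.rev cur :: acc) [x] rest)       (* 新组：封存上一组 *)
  in
  match xs with
  | [] -> []
  | x :: rest -> go [] [x] rest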
我们从头开始,逐步开发解法,强调越来越紧地“
抓住
”数据流。重点在于理解数据流,而不是编写循环或递归函数,与终止条件和边缘情况作斗争。我们也不会沉迷于状态:它在需要时自然而然地出现。分组也是自然而然的,而且更简单。当我们找到解法时,只要一看就能知道它一定是正确的。确实如此。它也相当高效。它甚至可以稍后机械地重铸成 C 以提高效率⸺不是通常手写的那种 C,而是正确且没有问题的。
Let us start from scratch and develop the solution in smaller steps, emphasizing tighter and tighter `grasping' of the data flow. The stress is on understanding data flow, rather than on writing loops or recursive functions and struggling with termination conditions and edge cases. We won't obsess about state either: it comes naturally when needed. Grouping also comes naturally, and simpler. When we reach the solution, one look at it will tell that it must be right. It is. It is also reasonably efficient. It can even be later mechanically recast into C for better efficiency -- not the sort of C one typically writes by hand, but correct and problem-free.
首先从问题陈述中,我们很快就能明白我们面临的是一个
顺序
问题:将字符序列转换为其他序列(成对的),这是
局部
进行的。实际上,向示例输入
"aaaabbbcca"
添加更多字符不会影响示例输出的前缀 ("a",4), ("b", 3), ("c", 2)
。一旦理解了这个顺序特性,其余的东西几乎是自然而然的。
First, from the problem statement one quickly understands that we are facing a sequential problem: transforming a sequence of characters into some other sequence (of pairs), locally. Indeed, appending more characters to the sample input "aaaabbbcca" will not affect the prefix ("a",4), ("b", 3), ("c", 2) of the sample output. Once this sequential character is understood, the rest follows almost automatically.
为了思考和开发解法,我们需要一个
词汇表
,我们在过程中不断建立它。首先,我们需要一个表示序列的词。我们称之为
'a seq
:元素类型为 'a
的序列类型。现在我们先不要担心它是如何实现的,而是将其视为抽象类型。由于输入是字符串,我们显然需要操作
To think about and develop the solution, we need a vocabulary, which we build up as we go along. First we need a word for a sequence. Let's call it 'a seq: a type of a sequence of elements of type 'a. Let us not worry for now how it is actually implemented and treat it as an abstract type. Since the input is given as string, we obviously need the operation
of_string : string -> char seq
来将字符串
转换
为(或
渲染
为)字符序列。同样假设它已给定。再次重申,现在重要的是代表数据流的类型。
to transform (or, render) a string as a sequence of characters. Assume it is given. Again, what is important for now is the type, which represents the data flow.
接着我们必须处理如何将相邻字符分组在一起,这本质上是
有状态的
:字符属于哪个组取决于上下文,即取决于之前或之后的字符。我们采取后者,假设操作
Next we have to deal with grouping of neighboring characters, which is inherently stateful: which group the character belongs to depends on the context, that is, on the character that came before, or follows next. Let's go with the latter and assume the operation
look_ahead : 'a seq -> ('a * 'a option) seq
将序列的当前元素与下一个元素配对。下一个元素可能不存在(如果序列结束),所以使用
'a option
,允许缺失值。
that pairs the current element of sequence with the element that comes next. The next element may be absent (if the sequence is about to end), hence the 'a option type, allowing for the missing value.
然后,通过查看下一个元素,分组本身会插入分组中断:
The grouping itself is then inserting group breaks, determined by peeking at the next element:
type 'a annot = Plain of 'a | Break of 'a

let group : ('a * 'a option) seq -> 'a annot seq =
  map @@ function
  | (x, Some next) when x = next -> Plain x
  | (x, _) -> Break x
也就是说,我们将
前瞻序列
转换为用
组边界
注释的元素序列:
Plain x
表示当前组继续,Break x
表示 x
是其组中的最后一个元素。如果有下一个元素,开始一个不同的组。('a annot
类型可以以不同的方式实现,比如一个对 'a * bool
。)我们还假设
映射
操作存在
That is, we transform a looked-ahead sequence into a sequence of elements annotated with group boundaries: Plain x means the current group continues, Break x means x is the last element in its group. The next element (if any) begins a different group. (The 'a annot type could have been realized differently, as a pair 'a * bool.) We have also assumed the existence of the mapping operation
map : ('a -> 'b) -> 'a seq -> 'b seq
最后,我们必须计数,这再次需要状态(当前组中元素的数量)并在组边界
发出
元组:
Finally, we have to count, which again requires state (the count of elements in the current group) and emit the tuples at the group boundary:
let count : 'a annot seq -> ('a * int) seq =
  map_accum_filter 0 @@ fun cnt -> function
  | Plain x -> (None, cnt + 1)
  | Break x -> (Some (x, cnt + 1), 0)
分组中断标记强制输出(描述刚刚结束的组)并重置计数器。我们假设存在 accumulating map-filter,流处理中相当流行的操作:
The group break forces the output (describing the group that has just ended) and resets the counter. We assumed the existence of accumulating map-filter, a rather popular operation in stream processing:
map_accum_filter : 'state -> ('state -> 'a -> 'b option * 'state) -> 'a seq -> 'b seq
映射
('state -> 'a -> 'b option * 'state)
接受当前状态和当前元素,并返回可能转换的元素和新状态的元组。如果可能转换的元素为 None,则输出流中不会发出任何内容;但状态会继续前进。从类型来看,map_accum_filter
看起来是“
纯
”的,是函数式的状态传递。记住这个想法。
The mapping ('state ->'a -> 'b option * 'state) takes the current state and the current element and returns a tuple of a possibly transformed element and the new state. If the possibly transformed element is None, nothing is emitted in the output stream; the state advances nevertheless. From the type of it, map_accum_filter looks `pure', with functional-state passing. Hold that thought.
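（译者按：下面补充一个非原文的小追踪示例，演示 "aab" 依次经过 look_ahead、group、count 后各阶段的结果：）
(*  "aab" --look_ahead--> ('a', Some 'a'); ('a', Some 'b'); ('b', None)
          --group-------> Plain 'a';       Break 'a';       Break 'b'
          --count-------> ('a', 2);        ('b', 1)                      *)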
就是这样。问题的解法,即游程编码的字符序列,就是我们到现在开发的操作的组合:
That is it. The solution to the problem, the run-length-encoded sequence of characters, is then the composition of what we have developed so far:
let rll = of_string >> look_ahead >> group >> count
它以简洁且明显正确的方式表达了问题。我们只需要在链中添加一个观察操作,将结果
(char * int) seq
转换为列表,或者直接将其打印出来。
which expresses the problem in the concise and obviously correct way. We only need to add to our chain an observation operation, to transform the resulting (char * int) seq to a list, or to print it out.
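（译者按：补充一个非原文的用法示意，假设下文 simple_seq 签名中的 to_list 可用，以它作为观察操作：）
let () =
  "aaaabbbcca" |> rll |> to_list
  |> List.iter (fun (c, n) -> Printf.printf "(%c,%d) " c n)
(* 输出：(a,4) (b,3) (c,2) (a,1) *)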
我们解法的一个优点是易于修改。例如,如果源字符串是 UTF-8 编码的,我们只需要在管道中添加一个操作:UTF-8 解码和
字素聚类
:
One advantage of our solution is the ease of modification. For example, if the source string is UTF-8 encoded, we only need to add to the pipeline one more operation: UTF-8 decoding and grapheme clustering:
utf_decode : char seq -> uchar seq
留给读者作为练习。
We leave this as an exercise to the reader.
要真正运行
rll
程序,我们必须履行承诺:为 'a seq
选择一个具体表示并实现我们为其假设的操作。用 OCaml 术语来说,我们必须实现以下签名:
To actually run the rll program, we have to fulfill our promises: to pick a concrete representation for 'a seq and implement the operations that we assumed for it. In OCaml terms, we have to implement the following signature:
module type simple_seq = sig
  type 'a seq
  val of_string : string -> char seq
  val map : ('a -> 'b) -> 'a seq -> 'b seq
  val look_ahead : 'a seq -> ('a * 'a option) seq
  val map_accum_filter : 'state -> ('state -> 'a -> 'b option * 'state) -> 'a seq -> 'b seq
  val iter : ('a -> unit) -> 'a seq -> unit
  val to_list : 'a seq -> 'a list
end
事实证明,OCaml 标准库几乎拥有我们需要的一切:
Seq.t
数据类型以及映射、字符串/列表转换等操作。缺少的是 map_accum_filter
和 look_ahead
。不过,标准库提供了序列上的 filter_map
。我们要做的只是添加状态,实现为可变状态:
It turns out that the OCaml standard library has almost everything we need: the Seq.t data type and the mapping, string/list conversion, etc. operations on it. What is missing is map_accum_filter and look_ahead. However, the standard library provides filter_map on sequences. We merely need to add state, realized as mutable state:
let map_accum_filter : 'state -> ('state -> 'a -> 'b option * 'state) -> 'a seq -> 'b seq =
  fun init f seq ->
    let stat = ref init in
    seq |> Seq.filter_map (fun a -> let (bo, st) = f !stat a in stat := st; bo)
总体而言,
map_accum_filter
确实有一个具有显式
状态传递
的“函数式”接口;但实现是命令式的。明确的规范和快速的实现之间没有内在冲突。剩下的 look_ahead
很容易通过 map_accum_filter
来表达(我们将实现留给读者⸺或者参见随附的代码)。
The overall map_accum_filter does have a `functional' interface with the explicit state-passing; the implementation is imperative however. There is no inherent conflict between a clear specification and a fast implementation. The remaining look_ahead turns out easily expressed via map_accum_filter (we leave the implementation to the reader -- or see the accompanying code).
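（译者按：这里补充一个非原文的 look_ahead 实现示意，假设 'a seq 就是标准库的 'a Seq.t（与上面 map_accum_filter 的实现一致）：先在序列末尾接上一个结束标记 None，再用 map_accum_filter 把每个元素与其后继配对。）
let look_ahead : 'a seq -> ('a * 'a option) seq = fun seq ->
  Seq.append (Seq.map (fun x -> Some x) seq) (Seq.return None)
  |> map_accum_filter None (fun prev cur ->
       match prev with
       | None -> (None, cur)              (* 第一个元素：先缓存，不输出 *)
       | Some p -> (Some (p, cur), cur))  (* 输出 (前一个, 当前)；流末尾 cur 为 None *)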
回顾我们的解法,我们注意到所有状态最终都被封装了。
前瞻
的状态被封装在
look_ahead
中。分组本身变成了无状态的。计数需要多一个状态。每个操作都保留其所需的状态,而无需了解或干涉其他操作的状态。
Looking back at our solution, we notice that all state ends up encapsulated. The state for look-ahead is encapsulated in look_ahead. The grouping per se turned out stateless. Counting requires one more piece of state. Each operation keeps the state it needs, without knowing or interfering with the state of others.
我们还注意到这里没有循环或递归函数。我们也不必处理终止条件。这是所有 Group I 解决方案的特点,在 APL 中明确阐述:指定如何更改集合的内容或其形状,但不要浪费时间思考如何获取元素或将它们放在哪里。
We also notice that there are no loops or recursive functions. Neither did we have to deal with termination conditions. That is the feature of all Group I solutions, clearly articulated back in APL: specify how to change the content of a collection or its shape, but don't waste time thinking how to get the elements or where to put them.
我们的
'a seq
也可以实现为
迭代器
/
生成器
,通过 for
循环进行组合,拥有良好的性能。即使是 OCaml 的 Seq.t
也设计为了一定的
融合
（尽管不完整：仍有中间数据结构，但大小恒定）。但如果使用如 strymonas 这样的库，就可以实现完美的融合，还能生成 C 代码。如下所示。
Our 'a seq can also be realized as iterator/generator, to be composed via for-loops -- with good performance. Even OCaml Seq.t fuses by design (albeit incompletely: there are still intermediate data structures, but of constant size). Perfect fusion is achievable, if one uses, say, strymonas, which also permits the generation of C code. Here it is.
void rll(const int64_t * aa_1,const int64_t al_2)
{
int64_t v_3 = 0;
int64_t v_4 = 0;
bool v_5 = 0;
int64_t v_6 = 0;
while (v_6 <= al_2)
{
int64_t t_7;
t_7 = v_6;
v_6++;
if (t_7 < al_2)
{
int64_t t_9;
t_9 = aa_1[t_7];
if (v_5)
{
int64_t t_10;
t_10 = v_4;
v_4 = t_9;
if (!(t_10 == t_9))
{
int64_t t_11;
t_11 = v_3;
v_3 = 0;
printf("%ld\n",(long)t_10);
printf("%ld\n",(long)(t_11 + 1));
}
else {
v_3++;
}
}
else {
v_4 = t_9;
v_5 = 1;
}
}
else {
if (v_5)
{
int64_t t_8;
t_8 = v_3;
v_3 = 0;
printf("%ld\n",(long)v_4);
printf("%ld\n",(long)(t_8 + 1));
}
}
}}
还有一件事:第一次尝试就成功了。
One more thing: it all worked on the first try.
apples.ml [4K]
The complete source code
apples.c [2K]
strymonas 解法：生成的 C 代码。生成器参见 strymonas 发行版中的 examples/run-length-encoding 目录。
The strymonas solution: generated C code. For the generator, see the directory examples/run-length-encoding in the strymonas distribution.
Incremental, Linear Pretty-printing
The stepwise pipeline development can be taken quite far. The above reference develops a rather non-trivial algorithm with the optimal, difficult to achieve performance.
Translation. Using zippers to handle huge trees [jsr-000Q]
- December 22, 2024
- Diego Olivier Fernandez Pons with contributions from Jinser Kafka
- Original
Translation. Using zippers to handle huge trees [jsr-000Q]
- December 22, 2024
- Diego Olivier Fernandez Pons with contributions from Jinser Kafka
- Original
tips: Click on the translation to expand and view the original text. 点击译文可展开查看原文。有修复原文 typo。
嗨,
Bonjour,
2003年4月9日星期三,Yang Shouxun 写道:
我不知道如何编写尾递归版本来构建树。如果连续属性不多且数据集也不是很大,那树会在堆栈溢出之前停止生长。
(译者按:连续属性是指其在数据集中可以取无限多值,通常是实数,例如身高、温度等。)
On Wednesday, 09 Apr 2003, Yang Shouxun wrote:
> I don't know how to write a tail recursive version to build trees.
> If there are not that many continuous attributes and the dataset is
> not so large, the tree stops growing before stack overflow.
2003 年 4 月 9 日星期三,Markus Mottl 写道:
这里的诀窍是使用延续传递风格(CPS):传递一个包含后续计算所需的所有内容的函数闭包(延续)。子函数不返回结果,而是使用结果调用延续,这使得函数成为尾递归。
On Wednesday 09 April 2003, Markus Mottl wrote:
> The trick is to use continuation passing style (CPS): you pass a
> function closure (continuation) containing everything that's needed
> in subsequent computations. Instead of returning a result, the
> sub-function calls the continuation with the result, which makes the
> functions tail-recursive.
zipper 是一种处理巨大(实际是无限)tree 的简单方法。
Zippers are a simple way to handle huge (in fact infinite) trees.
Zipper 的解释
Explanation of zippers
[#282]
- December 22, 2024
- Diego Olivier Fernandez Pons
Zipper 的解释 Explanation of zippers [#282]
- December 22, 2024
- Diego Olivier Fernandez Pons
zipper 可以被视为一种“函数式指针”，因为它能够提供：
- 纯函数式且带类型的操作
- O(1) 的指向元素访问
- O(1) 的指针移动
Zippers may be seen as 'functional pointers' since they offer :
- purely functional and typed operations
- O(1) access to the pointed element
- O(1) pointer movements
限制是每个数据结构只允许有一个指针，且每次指针移动都会分配内存。
The restrictions are that only one pointer is allowed by data structure and every pointer movement allocates memory.
采用二叉搜索树的经典类型声明
Take a classical type declaration for binary search trees
type 'a tree = E | N of 'a tree * 'a * 'a tree * int
考虑一棵二叉搜索树和一个你想要指向的内部节点。要对指向的子树进行
O(1)
访问,它必须可以直接从数据结构的基础中获取。然后,树的其余部分必须保存在单独的地方。我们将沿着从根到指向的节点的路径来解构它
Consider a binary search tree and an inner node to which you want to
point. To have an O(1) access to the pointed subtree, it has to be
directly available from the base of the data structure. Then, the
rest of the tree must be kept in a separate place. We will deconstruct
it along the path from the root to the pointed node
type 'a path =
| Root
| Left of 'a * 'a tree * 'a path
| Right of 'a tree * 'a * 'a path
type 'a zipper = 'a tree * 'a path
zipper
约束
声明如下:
The zipper constraint as announced :
- the pointed subtree
- the rest of the tree breaked along the path to the root
接着我们定义指针的移动(数据结构中的每个指针各一个):
Then we define the pointer movements (one for each pointer in the data structure) :
exception Bottom
exception Top (* raised when moving up from the root *)

(* To be replaced by a balancing constructor *)
let makeDTree = fun l v r -> N (l, v, r, 0)

let move_left = fun (tree, path) ->
  match tree with
  | E -> raise Bottom
  | N (l, v, r, _) -> (l, Left (v, r, path))

let move_right = fun (tree, path) ->
  match tree with
  | E -> raise Bottom
  | N (l, v, r, _) -> (r, Right (l, v, path))

let move_up = fun (tree, path) ->
  match path with
  | Root -> raise Top
  | Left (v, r, tail) -> (makeDTree tree v r, tail)
  | Right (l, v, tail) -> (makeDTree l v tree, tail)
现在我们可以通过以下过程构建任意大的树:
Now we can build an arbitrary large tree by the following procedure :
- build a tree of bounded depth
- choose the node which will be developed next
- move the current pointer to that node
- continue building the tree
该过程使用有界的堆栈空间。
This procedure uses a bounded stack space
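（译者按：补充一个非原文的小示例，演示上述过程：用上面的 makeDTree 与 move_right，以尾递归（常数栈空间）一路向右生长出任意深的树。）
let build_spine (n : int) : int zipper =
  let rec loop i ((_, path) : int zipper) =
    if i >= n then (E, path)
    else loop (i + 1) (move_right (makeDTree E i E, path))  (* 展开当前节点并下移 *)
  in
  loop 0 (E, Root)
(* 之后可用 move_up（或下文的 move_to_top）同样以常数栈空间回到根。 *)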
相关工作
Related work
[#283]
- December 22, 2024
- Diego Olivier Fernandez Pons
相关工作 Related work [#283]
- December 22, 2024
- Diego Olivier Fernandez Pons
zipper 由 Gérard Huet 提出。JFP 上有一篇论文:G. Huet. The Zipper. J. Functional Programming, 7 (5), Sept 1997, pp. 549--554。
Zippers were invented by Gérard Huet. There is a paper on the JFP
G. Huet. The Zipper. J. Functional Programming, 7 (5), Sept 1997, pp. 549--554.
他在例子中使用了 n 叉树和二叉树。这两者主要的区别在于,在二叉树中,指针的组织方式不同(他的“左”操作在 Baire 中是
(left o up)
)。
(译者按:Baire 是 Diego Olivier Fernandez Pons 开发的 Caml 的数据结构库。)
He uses n-ary trees and binary trees in his examples. The main
difference is that in binary trees the pointers are not organized in
the same way (his 'left' operation is what in Baire is (left o up))
Ralf Hinze 试图为函数指针提供一个通用框架,称之为
网
(你提供数据结构和基本转换,数据结构完成其余的工作)。
Ralf Hinze has tried to give a general framework for functional
pointers named a web (you give your data structure and the basic
transformations and the data structure does the rest)
Ralf Hinze and Johan Jeuring. Functional Pearl: Weaving a Web. Journal of Functional Programming, 11(6):681-689, November 2001。
Ralf Hinze and Johan Jeuring. Functional Pearl: Weaving a Web. Journal
of Functional Programming, 11(6):681-689, November 2001.
这篇文章可以通过 Hinze 的主页在网上获取。在我看来,他的文章并没有真正令人信服,也不是很清楚。
Available on the net via Hinze's home page.
In my opinion his article is not really convincing and not very clear.
一些已经使用 zipper 的库:
所有这些代码都可以在网络上找到。
Several libraries already use zippers
- Zen (Gérard Huet, Caml) uses zippers to handle acyclic automata
minimization
- SML/NJ Standard library (John Reppy) uses zippers to handle deletion
in red-black trees
- MetaPRL (Caml) uses zippers to handle insertion and deletion in
splay and red-black trees
- Grammatical Framework (Aarne Ranta, Haskell) uses zippers to
navigate through n-ary trees.
All this code is available on the web.
代码示例
Examples of code
[#284]
- December 22, 2024
- Diego Olivier Fernandez Pons
代码示例 Examples of code [#284]
- December 22, 2024
- Diego Olivier Fernandez Pons
以下是从 Baire 中提取的一些常用二叉搜索树操作的示例(
insert
,delete
,move_pointer_to
,...)(当前版本 11 avril 2003,大量错误和损坏的代码,您可以在 Baire 的下载页面中找到它) :
(译者按:Baire 相关资源基本已不可用。)
Here are some examples of usual binary search trees operations made
with zippers (insert, delete, move_pointer_to, ...) extracted from
Baire (current version 11 avril 2003, plenty of bugs and breaked
code, you will find it in Baire's download pages) :
(译者按:使用 ocamlformat
重新格式化,与原文的格式稍有不同。)
let rec move_to_top = function
| (tree, path) as pointer ->
(match path with
| Root -> pointer
| Left (v, r, tail) -> move_to_top (makeDTree tree v r, tail)
| Right (l, v, tail) -> move_to_top (makeDTree l v tree, tail))
;;
let rec move_to x = function
| (tree, path) as pointer ->
(match tree with
| E ->
(match path with
| Right (_, rv, _) when x <= rv -> move_to x (move_up pointer)
| Left (lv, _, _) when x >= lv -> move_to x (move_up pointer)
| _ -> pointer)
| N (_, v, _, _) ->
(match compare x v with
| n when n < 0 ->
(match path with
| Right (_, rv, _) when x < rv -> move_to x (move_up pointer)
| Right _ | Root | Left _ -> move_to x (move_left pointer))
| n when n > 0 ->
(match path with
| Left (lv, _, _) when x > lv -> move_to x (move_up pointer)
| Left _ | Root | Right _ -> move_to x (move_right pointer))
| _ -> pointer))
;;
let rec member_path x = function
| Right (l, v, tail) ->
(match compare x v with
| n when n < 0 -> member x l
| 0 -> true
| _ -> member_path x tail)
| Left (v, r, tail) ->
(match compare x v with
| n when n > 0 -> member x r
| 0 -> true
| _ -> member_path x tail)
| Root -> false
;;
let rec zipper_member x = function
| tree, path ->
(match tree with
| E -> member_path x path
| N (l, v, r, _) ->
(match compare x v with
| n when n < 0 ->
(match path with
| Right (_, rv, _) when x <= rv -> member_path x path
| Right _ | Root | Left _ -> member x l)
| n when n > 0 ->
(match path with
| Left (lv, _, _) when x >= lv -> member_path x path
| Left _ | Root | Right _ -> member x r)
| _ -> true))
;;
let current_tree = function
| tree, _ -> tree
;;
let current_value = function
| tree, _ ->
(match tree with
| E -> None
| N (_, v, _, _) -> Some v)
;;
let current_value' = function
| tree, _ ->
(match tree with
| E -> raise Empty
| N (_, v, _, _) -> v)
;;
let rec zipper_insert x = function
| (tree, path) as pointer ->
(match tree with
| E ->
(match path with
| Right (_, rv, _) when x <= rv -> zipper_insert x (move_up pointer)
| Left (lv, _, _) when x >= lv -> zipper_insert x (move_up pointer)
| _ -> makeTree E x E, path)
| N (_, v, _, _) ->
(match compare x v with
| n when n < 0 ->
(match path with
| Right (_, rv, _) when x < rv -> zipper_insert x (move_up pointer)
| Right _ | Root | Left _ -> zipper_insert x (move_left pointer))
| n when n > 0 ->
(match path with
| Left (lv, _, _) when x > lv -> zipper_insert x (move_up pointer)
| Left _ | Root | Right _ -> zipper_insert x (move_right pointer))
| _ -> pointer))
;;
let rec zipper_delete x = function
| (tree, path) as pointer ->
(match tree with
| E ->
(match path with
| Right (_, rv, _) when x <= rv -> zipper_delete x (move_up pointer)
| Left (lv, _, _) when x >= lv -> zipper_delete x (move_up pointer)
| _ -> pointer)
| N (l, v, r, _) ->
(match compare x v with
| n when n < 0 ->
(match path with
| Right (_, rv, _) when x < rv -> zipper_delete x (move_up pointer)
| Right _ | Root | Left _ -> zipper_delete x (move_left pointer))
| n when n > 0 ->
(match path with
| Left (lv, _, _) when x > lv -> zipper_delete x (move_up pointer)
| Left _ | Root | Right _ -> zipper_delete x (move_right pointer))
| _ -> move_to x (appendB l r, path)))
;;
let rec path_to_list result = function
| Root -> result
| Left (v, r, path) -> path_to_list (result @ (v :: to_ordered_list r)) path
| Right (l, v, path) -> path_to_list (to_ordered_list_rec (v :: result) l) path
;;
let zipper_to_list = function
| tree, path -> path_to_list (to_list tree) path
;;
Translation. Guile's reader, in guile [jsr-000R]
- December 22, 2024
- Andy Wingo with contributions from Jinser Kafka
- Original
Translation. Guile's reader, in guile [jsr-000R]
- December 22, 2024
- Andy Wingo with contributions from Jinser Kafka
- Original
tips: Click on the translation to expand and view the original text. 点击译文可展开查看原文。
晚上好！今天是一篇简短（大概？）的记录，是一些关于 Guile 的
书呆子
内容。
Good evening! A brief note today about some Guile nargery.
历史的弧线
the arc of history
[#280]
- December 22, 2024
- Andy Wingo
历史的弧线 the arc of history [#280]
- December 22, 2024
- Andy Wingo
就像在你打开收音机并期望听到
威豹乐队
时开始出现的许多语言实现一样,Guile 也有下半部分和上半部分。下半部分是用 C 编写的,暴露了一个共享库和一个可执行文件,而上半部分是用语言本身(在 Guile 的情况下是“Scheme”)编写的,并在语言实现开始时由 C 代码以某种方式加载。
Like many language implementations that started life when you could turn on the radio and expect to hear Def Leppard, Guile has a bottom half and a top half. The bottom half is written in C and exposes a shared library and an executable, and the top half is written in the language itself (Scheme, in the case of Guile) and somehow loaded by the C code when the language implementation starts.
自2010年左右以来,我们一直在努力将用C语言编写的部分改用 Scheme 编写。上周的信件讨论了动态链接的实现的替换,从使用
libltdl
库替换为在低级dlopen
包装器之上的 Scheme 实现。我以前写过关于在 Scheme 中重写 eval
的内容,更近期则谈到在 Scheme 中实现与 C 语言实现相同性能的道路有时是漫长的。
Since 2010 or so we have been working at replacing bits written in C with bits written in Scheme. Last week's missive was about replacing the implementation of dynamic-link from using the libltdl library to using Scheme on top of a low-level dlopen wrapper. I've written about rewriting eval in Scheme, and more recently about how the road to getting the performance of C implementations in Scheme has been sometimes long.
这些重写有
堂吉诃德式
的一面。我内心深处有一些关于正确与错误的感觉,并且我从根本上知道从 C 转向 Scheme 是正确的事情。很多时候这种感觉都是完全不理性的,在许多情况下也显得不合时宜⸺比如,如果你有一项需要为客户完成的任务,你需要坐下来思考从这里到目标的最小步骤,而直觉对于你如何到达那里并没有多大作用。但有一个项目可以让你按照自己喜欢的方式做某件事是美好的,即使这需要 10 年,那也没关系。
These rewrites have a quixotic aspect to them. I feel something in my gut about rightness and wrongness and I know at a base level that moving from C to Scheme is the right thing. Much of it is completely irrational and can be out of place in a lot of contexts -- like if you have a task to get done for a customer, you need to sit and think about minimal steps from here to the goal and the gut doesn't have much of a role to play in how you get there. But it's nice to have a project where you can do a thing in the way you'd like, and if it takes 10 years, that's fine.
不过除了难以言表的动机之外,用 Scheme 重写一些东西还是有具体的好处的。我发现 Scheme 代码更容易维护,嗯,而且相比 C 的常见
陷阱
,Scheme 显然更安全。如果有一天我重写 Guile 的垃圾收集器,这会减少我的工作量。而且,Scheme 代码还具有 C 语言所没有的功能:
尾部调用
、
可恢复的分隔延续
、
运行时检测
等等。
But besides the ineffable motivations, there are concrete advantages to rewriting something in Scheme. I find Scheme code to be more maintainable, yes, and more secure relative to the common pitfalls of C, obviously. It decreases the amount of work I will have when one day I rewrite Guile's garbage collector. But also, Scheme code gets things that C can't have: tail calls, resumable delimited continuations, run-time instrumentation, and so on.
以
定界延续
为例,大约五年前,我为 Guile 编写了一个以并行并发 ML 为模型的轻量级并发设施。它允许数百万条 fibers 存在于系统上。当一个 fiber 需要阻塞 I/O 操作(读或写)时,它会暂停其延续,并在操作变得可能时安排重启它。
Taking delimited continuations as an example, five years ago or so I wrote a lightweight concurrency facility for Guile, modelled on Parallel Concurrent ML. It lets millions of fibers to exist on a system. When a fiber would need to block on an I/O operation (read or write), instead it suspends its continuation, and arranges to restart it when the operation becomes possible.
为了让这一切成真,Guile 必须做出很多改变。首先是定界延续本身。后来是 Scheme 中 ports 设施上半部分的完全重写,以允许 port 操作挂起和恢复。可恢复 fibers 的许多障碍已被消除,但 Fibers 手册仍然列出了相当多的障碍。
A lot had to change in Guile for this to become a reality. Firstly, delimited continuations themselves. Later, a complete rewrite of the top half of the ports facility in Scheme, to allow port operations to suspend and resume. Many of the barriers to resumable fibers were removed, but the Fibers manual still names quite a few.
Scheme read, in Scheme [#281]
- December 22, 2024
- Andy Wingo
Scheme read, in Scheme [#281]
- December 22, 2024
- Andy Wingo
这给带来了我们今天的记录:我刚刚在 Scheme 中也重写了 Guile 的 reader!reader 是获取字符流并将其解析为 S 表达式的部分。以前是C语言,现在是 Scheme。
Which brings us to today's note: I just rewrote Guile's reader in Scheme too! The reader is the bit that takes a stream of characters and parses it into S-expressions. It was in C, and now is in Scheme.
这样做的主要动机之一是希望使 read 可挂起。通过此更改,现在可以在 fibers 上实现 REPL(读取-评估-打印循环)。
One of the primary motivators for this was to allow read to be suspendable. With this change, read-eval-print loops are now implementable on fibers.
另一个动机是最终修复 Guile 无法记录某些数据源位置的 bug。Guile 过去会使用
弱键
哈希表来使从 read 返回的数据与源位置相关联。但这仅适用于 fresh value,不适用于小整数或字符等立即数,也不适用于 keyword 和 symbol 等全局唯一的非立即数。因此对于这些,我们就没有任何源位置。
Another motivation was to finally fix a bug in which Guile couldn't record source locations for some kinds of datums. It used to be that Guile would use a weak-key hash table to associate datums returned from read with source locations. But this only works for fresh values, not for immediate values like small integers or characters, nor does it work for globally unique non-immediates like keywords and symbols. So for these, we just wouldn't have any source locations.
该问题的一个可靠解决方案是返回带
注解
的对象,而不是使用另外的表。由于 Scheme 的宏扩展器已经被设置为与带注解的对象(语法对象)一起使用,因此一个新的 read-syntax 接口会非常好用。
A robust solution to that problem is to return annotated objects rather than using a side table. Since Scheme's macro expander is already set to work with annotated objects (syntax objects), a new read-syntax interface would do us a treat.
在 C 语言中实现
read
很难做到。但在 Scheme 中实现 read
则毫无问题。不过,调整扩展器以期望在语法对象内包含源位置有些繁琐,且源位置信息的增加使得输出文件的大小增大了几个百分比⸺这在部分上是 .debug_lines
DWARF 数据的增加带来的,但也和宏中语法对象的序列化源位置有关。
With read in C, this was hard to do. But with read in Scheme, it was no problem to implement. Adapting the expander to expect source locations inside syntax objects was a bit fiddly, though, and the resulting increase in source location information makes the output files bigger by a few percent -- due somewhat to the increased size of the .debug_lines DWARF data, but also due to serialized source locations for syntax objects in macros.
速度方面,目前切换到 Scheme 的
read
是一个
退步
。旧的 reader 在这台笔记本电脑上记录源位置时每秒大概可以解析 15 或 16 MB,或者关闭源位置,那么有 22 或 23 MB/s。新的 reader 在旧模式下,使用弱键侧表记录源位置的解析速度大概为 10.5 MB/s,关闭位置时为 13.5 MB/s。新的 read-syntax
速度大约是 12 MB/s。我们将在未来几个月继续优化这些性能,但与原来的 reader 编写时的情况不同的是,现在的 reader 主要在编译时使用。(它在读取 s 表达式作为数据时仍然有用,因此仍然有理由提升其速度。)
Speed-wise, switching to read in Scheme is a regression, currently. The old reader could parse around 15 or 16 megabytes per second when recording source locations on this laptop, or around 22 or 23 MB/s with source locations off. The new one parses more like 10.5 MB/s, or 13.5 MB/s with positions off, when in the old mode where it uses a weak-key side table to record source locations. The new read-syntax runs at around 12 MB/s. We'll be noodling at these in the coming months, but unlike when the original reader was written, at least now the reader is mainly used only at compile time. (It still has a role when reading s-expressions as data, so there is still a reason to make it fast.)
与
eval
的情况一样 ,在加载 Scheme 版本之前,我们仍然有一个 C 版本的 reader 可用于引导目的。这次重写令人高兴的是,我能够从 C reader 中删除与非默认词法语法相关的所有缺陷,很好地简化了未来的维护。
As is the case with eval, we still have a C version of the reader available for bootstrapping purposes, before the Scheme version is loaded. Happily, with this rewrite I was able to remove all of the cruft from the C reader related to non-default lexical syntax, which simplifies maintenance going forward.
尝试
逐个 bug
重写的一个有趣方面是你会发现 bug 和意外行为。比如,事实证明,从出现以来,Guile 总是不需要终止分隔符地 read
#t
和 #f
,因此 read "(#t1)"
将得到列表 (#t 1)
。很奇怪,对吧?更奇怪的是,当 #true
和 #false
别名被添加到语言中,Guile 决定默认支持它们,但以一种奇怪的向后兼容的方式…所以 "(#false1)"
读作 (#f 1)
但 "(#falsa1)"
读作 (#f alsa1)
。诸如此类的事还有不少。
An interesting aspect of attempting to make a bug-for-bug rewrite is that you find bugs and unexpected behavior. For example, it turns out that since the dawn of time, Guile always read #t and #f without requiring a terminating delimiter, so reading "(#t1)" would result in the list (#t 1). Weird, right? Weirder still, when the #true and #false aliases were added to the language, Guile decided to support them by default, but in an oddly backwards-compatible way... so "(#false1)" reads as (#f 1) but "(#falsa1)" reads as (#f alsa1). Quite a few more things like that.
总的来说,这次重写似乎是成功的,没有引入新的行为,甚至产生了相同的错误。然而,对于
回溯
而言,情况并非如此,因为回溯可以暴露出 read 函数的内部实现,而之前由于 C 栈对 Scheme 是不透明的,这种情况并不会发生。因此,我们可能需要在调用 read 的地方添加更合理的错误处理,因为回溯信息无论如何都不是一个好的面向用户的错误反馈。
All in all it would seem to be a successful rewrite, introducing no new behavior, even producing the same errors. However, this is not the case for backtraces, which can expose the guts of read in cases where that previously wouldn't happen because the C stack was opaque to Scheme. Probably we will simply need to add more sensible error handling around callers to read, as a backtrace isn't a good user-facing error anyway.
好吧,今晚的闲聊已经够多了。祝大家 happy hacking,晚安!
OK enough rambling for this evening. Happy hacking to all and to all a good night!
Translation. Baby’s First Garbage Collector [jsr-000P]
- December 20, 2024
- Bob Nystrom with contributions from Jinser Kafka
- Original
Translation. Baby’s First Garbage Collector [jsr-000P]
- December 20, 2024
- Bob Nystrom with contributions from Jinser Kafka
- Original
tips: Click on the translation to expand and view the original text. 点击译文可展开查看原文。
当我感到压力大、要做的事情太多时,我会产生一种矛盾的反应:通过找其他事情来逃避。通常这件事是一个我可以编写和完成的小型独立程序。
When I get stressed out and have too much to do, I have this paradoxical reaction where I escape from that by coming up with another thing to do. Usually it’s a tiny self-contained program that I can write and finish.
一天早上,我正对我正在写的书、我在工作中必须做的事情以及我正在为《Strange Loop》准备的演讲感到害怕,突然间,我想,“我应该写一个垃圾回收器。”
The other morning, I was freaking myself out about the book I’m working on (http://gameprogrammingpatterns.com/) and the stuff I have to do at work (http://dart.dev/) and a talk I’m preparing for Strange Loop (https://www.infoq.com/presentations/dart-introduction/), and all of the sudden, I thought, “I should write a garbage collector.”
是的,我知道那段话让我显得有多疯狂。但我这根搭错的神经,给你带来的是一份关于编程语言实现基础组件的免费教程!在大约一百行常规 C 代码中,我成功鼓捣出了一个基础的标记清除(mark-and-sweep)回收器,而且你知道吗,它真的能回收。
Yes, I realize how crazy that paragraph makes me seem. But my faulty wiring is your free tutorial on a fundamental piece of programming language implementation! In about a hundred lines of vanilla C, I managed to whip up a basic mark-and-sweep (https://en.wikipedia.org/wiki/Tracing_garbage_collection#Na%C3%AFve_mark-and-sweep) collector that actually, you know, collects.
一般认为,垃圾回收是编程中鲨鱼出没的水域之一,不过在这篇文章里,我会给你一个可以安心扑腾的儿童泳池。(池里也许仍有鲨鱼,但至少水会浅一些。)
Garbage collection is considered one of the more shark-infested waters of programming, but in this post, I’ll give you a nice kiddie pool to paddle around in. (There may still be sharks in it, but at least it will be shallower.)
少使用、物尽其用、循环再用 Reduce, reuse, recycle [#285]
- December 20, 2024
- Bob Nystrom
垃圾回收背后的基本思想是该语言(在大多数情况下)似乎可以访问无限的内存。开发者可以不断地分配、分配、分配,就像变魔术一样,内存永远不会用完。
The basic idea behind garbage collection is that the language (for the most part) appears to have access to infinite memory. The developer can just keep allocating and allocating and allocating and, as if by magic, never run out.
当然,计算机并不具有无限的内存。因此,实现这个魔术的方式是,当它需要分配一些内存并意识到内存不足时,它会回收垃圾。
Of course, machines don’t have infinite memory. So the way the implementation does this is that when it needs to allocate a bit of memory and it realizes it’s running low, it collects garbage.
在这个语境中,“垃圾”是指之前分配的、现在不再使用的内存。为了让无限内存的幻觉成立,语言在判定“不再被使用”这件事上必须非常保险。如果你的程序正要访问某些对象,它们却开始被莫名回收,那可就不好玩了。
“Garbage” in this context means memory it previously allocated that is no longer being used. For the illusion of infinite memory to work, the language needs to be very safe about “no longer being used”. It would be no fun if random objects just started getting reclaimed while your program was trying to access them.
要让一个对象可以被回收,语言必须确保程序无法再次使用它。如果程序无法获得指向该对象的引用,那么它显然无法再使用它。所以“使用中”的定义实际上非常简单:
In order to be collectible, the language has to ensure there’s no way for the program to use that object again. If the program can’t get a reference to the object, then it obviously can’t use it again. So the definition of “in use” is actually pretty simple:
1. 任何被仍在作用域内的变量引用的对象,都在使用中。
2. 任何被其他使用中的对象引用的对象,也在使用中。
1. Any object being referenced by a variable still in scope is in use.
2. Any object referenced by another in-use object is in use.
第二条规则是递归规则。如果对象 A 被变量引用,并且它有一些引用对象 B 的字段,则 B 正在使用中,因为你可以通过 A 访问它。
The second rule is the recursive one. If object A is referenced by a variable, and it has some field that references object B, then B is in use since you can get to it through A.
最终我们得到的是一张可达对象图⸺即从某个变量出发、沿对象间的引用遍历所能到达的全部对象。任何不在这张可达对象图中的对象,对程序来说都已经死了,它的内存正等着被收割。
The end result is a graph of reachable objects—all of the objects in the world that you can get to by starting at a variable and traversing through objects. Any object not in that graph of reachable objects is dead to the program and its memory is ripe for a reaping.
标记清除 Marking and sweeping [#286]
- December 20, 2024
- Bob Nystrom
查找并回收所有未使用对象的过程有很多种不同的实现方法,其中最简单、也是最早被发明的算法称为“标记清除”(mark-sweep)。它由 Lisp 和大胡子的发明者约翰·麦卡锡(John McCarthy)发明,所以今天实现这个算法就有点像与一位远古之神交流⸺希望不是以某种洛夫克拉夫特式的方式告终,否则你的思想和视网膜最终会被炸得一干二净。
There are a bunch of different ways (https://en.wikipedia.org/wiki/Tracing_garbage_collection) you can implement the process of finding and reclaiming all of the unused objects, but the simplest and first algorithm ever invented for it is called “mark-sweep”. It was invented by John McCarthy, the man who invented Lisp and beards, so implementing today is like communing with one of the Elder Gods, but hopefully not in some Lovecraftian way that ends with you having your mind and retinas blasted clean.
该算法的工作原理几乎与我们对可达性的定义完全相同:
The algorithm works almost exactly like our definition of reachability:
1. 从根出发,遍历整个对象图。每到达一个对象,就把它的“标记”位设为 true。
2. 完成之后,找出所有标记位未被设置的对象,并删除它们。
1. Starting at the roots, traverse the entire object graph. Every time you reach an object, set a “mark” bit on it to true.
2. Once that’s done, find all of the objects whose mark bits are not set and delete them.
就是这样。我知道,你本来也能想出这个办法,对吧?如果你真想出来了,你就会是一篇被引用数百次的论文的作者。这件事给我们的教训是:要在 CS 领域出名,你不必想出真正晦涩难懂的东西,你只需第一个想出显而易见的东西即可。
That’s it. I know, you could have come up with that, right? If you had, you’d be the author of a paper cited hundreds of times. The lesson here is that to be famous in CS, you don’t have to come up with really obscure stuff, you just have to come up with obvious stuff first.
一对对象 A pair of objects [#287]
- December 20, 2024
- Bob Nystrom
在开始实现这两个步骤之前,让我们先处理一些准备工作。我们不会真的实现一门语言的解释器⸺没有语法分析器、字节码或任何诸如此类的蠢东西⸺但我们确实需要一点最少量的代码,来制造一些可供回收的垃圾。
Before we can get to implementing those two steps, let’s get a couple of preliminaries out of the way. We won’t be actually implementing an interpreter for a language—no parser, bytecode, or any of that foolishness—but we do need some minimal amount of code to create some garbage to collect.
假设我们正在为一种小语言编写一个解释器。它是动态类型的,有两种类型的对象:整数(int)和对(pair)。这是一个用来标记对象类型的枚举:
Let’s play pretend that we’re writing an interpreter for a little language. It’s dynamically typed, and has two types of objects: ints and pairs. Here’s an enum to identify an object’s type:
typedef enum {
OBJ_INT,
OBJ_PAIR
} ObjectType;
一个对(pair)可以由任何东西组成:两个整数、一个整数和另一个对,等等。仅凭这一点,你就能走得出人意料地远。由于 VM 中的对象可以是其中任何一种,C 中实现它的典型方法是使用标签联合(tagged union)。我们这样定义它:
A pair can be a pair of anything, two ints, an int and another pair, whatever. You can go surprisingly far (http://www.flickr.com/photos/raganwald/212588975/) with just that. Since an object in the VM can be either of these, the typical way in C to implement it is with a tagged union (http://en.wikipedia.org/wiki/Tagged_union). We define it thusly:
typedef struct sObject {
ObjectType type;
union {
/* OBJ_INT */
int value;
/* OBJ_PAIR */
struct {
struct sObject* head;
struct sObject* tail;
};
};
} Object;
主要的 Object 结构中有一个
type
字段,用于标识它的值类型⸺整数或对。接着它有一个联合(union)来保存这个整数或对的数据。如果你的 C 很生疏,那么联合就是一个结构体,其中字段在内存中重叠。由于给定的对象只能是一个整数或一个对,因此没有理由在单个对象中同时为所有三个字段提供内存。联合就是这么做的。非常好。
The main Object struct has a type field that identifies what kind of value it is—either an int or a pair. Then it has a union to hold the data for the int or pair. If your C is rusty, a union is a struct where the fields overlap in memory. Since a given object can only be an int or a pair, there’s no reason to have memory in a single object for all three fields at the same time. A union does that. Groovy.
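A tiny sketch of the tagged-union discipline in action (my illustration, not from the article; it assumes the Object and ObjectType definitions above are in scope):

#include <stdio.h>

int main(void) {
  Object o;
  o.type = OBJ_INT;  /* the tag records which union member is valid */
  o.value = 42;      /* this int shares its storage with head/tail */
  if (o.type == OBJ_INT) printf("%d\n", o.value);  /* prints 42 */
  return 0;
}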
迷你虚拟机 A minimal virtual machine [#288]
- December 20, 2024
- Bob Nystrom
现在我们可以在一个小虚拟机中使用这个数据类型了。虚拟机在这个故事里的作用,是维护一个存放当前作用域内变量的栈。大多数语言的虚拟机要么是基于栈的(如 JVM 和 CLR),要么是基于寄存器的(如 Lua)。在这两种情况下,实际上都仍然存在一个栈,用于存储局部变量以及表达式求值中途所需的临时变量。我们把它显式而简单地建模如下:
Now we can use that datatype in a little virtual machine. The VM’s role in this story is to have a stack that stores the variables that are currently in scope. Most language VMs are either stack-based (like the JVM and CLR) or register-based (like Lua). In both cases, there is actually still a stack. It’s used to store local variables and temporary variables needed in the middle of an expression. We model that explicitly and simply like so:
#define STACK_MAX 256
typedef struct {
Object* stack[STACK_MAX];
int stackSize;
} VM;
现在已经有了基本的数据结构,让我们拼凑代码来创建一些东西。首先,编写一个创建并初始化虚拟机的函数:
Now that we’ve got our basic data structures in place, let’s slap together a bit of code to create some stuff. First, let’s write a function that creates and initializes a VM:
VM* newVM() {
VM* vm = malloc(sizeof(VM));
vm->stackSize = 0;
return vm;
}
一旦我们有了虚拟机,我们就需要能够操作它的栈:
Once we have a VM, we need to be able to manipulate its stack:
void push(VM* vm, Object* value) {
  /* Standard C assert() takes a single expression; && keeps the message. */
  assert(vm->stackSize < STACK_MAX && "Stack overflow!");
  vm->stack[vm->stackSize++] = value;
}
Object* pop(VM* vm) {
  assert(vm->stackSize > 0 && "Stack underflow!");
  return vm->stack[--vm->stackSize];
}
现在我们可以将东西放入“变量”中,我们需要能够实际创建对象。首先,一个小辅助函数:
Now that we can stick stuff in “variables”, we need to be able to actually create objects. First, a little helper function:
Object* newObject(VM* vm, ObjectType type) {
Object* object = malloc(sizeof(Object));
object->type = type;
return object;
}
这会执行实际的内存分配并设置类型标记。我们稍后会重新讨论这个问题。使用这个函数,我们可以编写其他函数将每种类型的对象推送到虚拟机的栈上:
That does the actual memory allocation and sets the type tag. We’ll be revisiting this in a bit. Using that, we can write functions to push each kind of object onto the VM’s stack:
void pushInt(VM* vm, int intValue) {
Object* object = newObject(vm, OBJ_INT);
object->value = intValue;
push(vm, object);
}
Object* pushPair(VM* vm) {
Object* object = newObject(vm, OBJ_PAIR);
object->tail = pop(vm);
object->head = pop(vm);
push(vm, object);
return object;
}
这就是我们的小虚拟机了。如果我们再有一个语法分析器和一个解释器来调用这些函数,我们手上就有一门货真价实的语言了。而且,如果我们有无限的内存,它甚至能运行真正的程序。既然我们没有,那就开始回收一些垃圾吧。
And that’s it for our little VM. If we had a parser and an interpreter that called those functions, we’d have an honest to God language on our hands. And, if we had infinite memory, it would even be able to run real programs. Since we don’t, let’s start collecting some garbage.
标志标记 Marky mark [#289]
- December 20, 2024
- Bob Nystrom
第一阶段是标记。我们需要遍历所有可到达的对象并设置它们的标记位。我们需要的第一件事是向
Object
添加一个标记位:
The first phase is marking. We need to walk all of the reachable objects and set their mark bit. The first thing we need then is to add a mark bit to Object:
typedef struct sObject {
unsigned char marked;
/* Previous stuff... */
} Object;
当我们创建新对象时,我们要修改
newObject()
以将 marked
初始化为零。
When we create a new object, we modify newObject() to initialize marked to zero.
为了标记所有可到达的对象,我们从内存中的变量开始,这意味着遍历栈。看起来像这样:
To mark all of the reachable objects, we start with the variables that are in memory, so that means walking the stack. That looks like this:
void markAll(VM* vm)
{
for (int i = 0; i < vm->stackSize; i++) {
mark(vm->stack[i]);
}
}
这又调用了
mark
。我们将分阶段构建它。首先是:
That in turn calls mark. We’ll build that in phases. First:
void mark(Object* object) {
object->marked = 1;
}
这是字面上最重要的一点(bit)。我们已将对象本身标记为可达。但记住,我们还需要处理对象中的引用⸺可达性是递归的。如果该对象是一个对,则它的两个字段也是可达的。处理递归很简单:
This is the most important bit, literally. We’ve marked the object itself as reachable. But remember we also need to handle references in objects—reachability is recursive. If the object is a pair, its two fields are reachable too. Handling that is simple:
void mark(Object* object) {
object->marked = 1;
if (object->type == OBJ_PAIR) {
mark(object->head);
mark(object->tail);
}
}
你看到这里有一个 bug 了吗?我们现在正在递归,但我们没有检查成环。如果你有一堆在循环中相互指向的对,这将使 C 调用栈溢出并崩溃。
There’s a bug here. Do you see it? We’re recursing now, but we aren’t checking for cycles. If you have a bunch of pairs that point to each other in a loop, this will overflow the C callstack and crash.
为了解决这个问题,我们只需要在到达已经处理过的对象时退出即可。所以完整的
mark()
函数是:
To handle that, we simply need to bail out if we get to an object that we’ve already processed. So the complete mark() function is:
void mark(Object* object) {
/* If already marked, we're done. Check this first
to avoid recursing on cycles in the object graph. */
if (object->marked) return;
object->marked = 1;
if (object->type == OBJ_PAIR) {
mark(object->head);
mark(object->tail);
}
}
现在我们可以调用
markAll()
,它将正确标记内存中每个可达的对象。我们已经完成一半了!
Now we can call markAll() and it will correctly mark every reachable object in memory. We’re halfway done!
清楚清除 Sweepy sweep [#290]
- December 20, 2024
- Bob Nystrom
下一阶段是清扫我们分配过的所有对象,并释放其中未被标记的。但这里有一个问题:根据定义,所有未标记的对象都是不可达的,我们根本没法到达它们!
The next phase is to sweep through all of the objects we’ve allocated and free any of them that aren’t marked. But there’s a problem here: all of the unmarked objects are, by definition, unreachable! We can’t get to them!
我们的虚拟机已经实现了语言关于对象引用的语义,因此我们只在变量和对的字段中存储指向对象的指针。一旦一个对象不再被其中任何一个指向,虚拟机就彻底失去了它,实际上就泄漏了内存。
The VM has implemented the language’s semantics for object references, so we’re only storing pointers to objects in variables and the pair fields. As soon as an object is no longer pointed to by one of those, the VM has lost it entirely and actually leaked memory.
解决这个问题的技巧是,虚拟机可以拥有自己的对象引用,与语言用户可见的语义不同。换句话说,我们可以自己追踪它们。
The trick to solve this is that the VM can have its own references to objects that are distinct from the semantics that are visible to the language user. In other words, we can keep track of them ourselves.
最简单的方法是维护我们分配的每个对象的链表。我们将
Object
本身扩展为该链表中的一个节点:
The simplest way to do this is to just maintain a linked list of every object we’ve ever allocated. We extend Object itself to be a node in that list:
typedef struct sObject {
/* The next object in the list of all objects. */
struct sObject* next;
/* Previous stuff... */
} Object;
虚拟机追踪该链表的头节点:
The VM keeps track of the head of that list:
typedef struct {
/* The first object in the list of all objects. */
Object* firstObject;
/* Previous stuff... */
} VM;
在
newVM()
中,我们确保将 firstObject
初始化为 NULL
。每当我们创建一个对象时,我们都会将其添加到链表中:
In newVM() we make sure to initialize firstObject to NULL. Whenever we create an object, we add it to the list:
Object* newObject(VM* vm, ObjectType type) {
Object* object = malloc(sizeof(Object));
object->type = type;
object->marked = 0;
/* Insert it into the list of allocated objects. */
object->next = vm->firstObject;
vm->firstObject = object;
return object;
}
这样,即使语言本身找不到某个对象,语言实现仍然找得到。要清扫并删除未标记的对象,我们遍历这个链表:
This way, even if the language can’t find an object, the language implementation still can. To sweep through and delete the unmarked objects, we traverse the list:
void sweep(VM* vm)
{
Object** object = &vm->firstObject;
while (*object) {
if (!(*object)->marked) {
/* This object wasn't reached, so remove it from the list
and free it. */
Object* unreached = *object;
*object = unreached->next;
free(unreached);
} else {
/* This object was reached, so unmark it (for the next GC)
and move on to the next. */
(*object)->marked = 0;
object = &(*object)->next;
}
}
}
由于用到了指向指针的指针,这段代码读起来有点绕,但只要仔细推敲,你会发现它非常直接:它就是遍历整个链表,每当遇到未标记的对象,就释放其内存并将其从链表中摘除。做完这些,所有不可达的对象就都被删除了。
That code is a bit tricky to read because of that pointer to a pointer, but if you work through it, you can see it’s pretty straightforward. It just walks the entire linked list. Whenever it hits an object that isn’t marked, it frees its memory and removes it from the list. When this is done, we will have deleted every unreachable object.
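The pointer-to-a-pointer idiom is worth seeing in isolation. Here is a stripped-down sketch of my own (not from the article) that deletes odd numbers from a linked list of ints with the same trick: because we overwrite the pointer that points at the current node, there is no special case for the head and no separate "previous" pointer.

#include <stdlib.h>

typedef struct Node {
  int value;
  struct Node* next;
} Node;

void removeOdds(Node** node) {
  while (*node) {
    if ((*node)->value % 2 != 0) {
      /* Splice the node out by overwriting whatever pointed to it. */
      Node* unwanted = *node;
      *node = unwanted->next;
      free(unwanted);
    } else {
      /* Advance the cursor to the next node's "next" field. */
      node = &(*node)->next;
    }
  }
}

Called as removeOdds(&head), it updates head itself when leading nodes are removed, exactly like sweep() updates vm->firstObject.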
恭喜!我们有了一个垃圾回收器!只差最后一块拼图:实际去调用它。首先,让我们把两个阶段包在一起:
Congratulations! We have a garbage collector! There’s just one missing piece: actually calling it. First let’s wrap the two phases together:
void gc(VM* vm) {
markAll(vm);
sweep(vm);
}
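Before the post wires the collector into allocation, here is a hypothetical smoke test of my own (not from the article) that exercises what we have so far; it assumes newVM() also initializes firstObject to NULL, as described earlier:

int main(void) {
  VM* vm = newVM();

  pushInt(vm, 1);
  pushInt(vm, 2);
  pushPair(vm);  /* the pair and both ints stay reachable from the stack */

  pushInt(vm, 3);
  pop(vm);       /* 3 is now unreachable: nothing references it */

  gc(vm);        /* marks the pair and its ints, sweeps the 3 */
  return 0;
}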
你找不到更清晰的标记清除实现了。最棘手的部分是要弄清楚什么时候真正调用它。“内存不足”到底意味着什么,尤其是在虚拟内存接近无限的现代计算机上?
You couldn’t ask for a more obvious mark-sweep implementation. The trickiest part is figuring out when to actually call this. What does “low on memory” even mean, especially on modern computers with near-infinite virtual memory?
事实证明,这里没有精确的正确或错误答案。这实际上取决于你使用虚拟机的目的以及它运行的硬件类型。为了使这个示例简单,我们将在一定次数的分配后进行回收。这实际上就是某些语言实现的工作原理,而且很容易实现。
It turns out there’s no precise right or wrong answer here. It really depends on what you’re using your VM for and what kind of hardware it runs on. To keep this example simple, we’ll just collect after a certain number of allocations. That’s actually how some language implementations work, and it’s easy to implement.
我们扩展
VM
来追踪我们已经创建了多少个对象:
We extend VM to track how many we’ve created:
typedef struct {
/* The total number of currently allocated objects. */
int numObjects;
/* The number of objects required to trigger a GC. */
int maxObjects;
/* Previous stuff... */
} VM;
然后初始化它们:
And then initialize them:
VM* newVM() {
/* Previous stuff... */
vm->numObjects = 0;
vm->maxObjects = INITIAL_GC_THRESHOLD;
return vm;
}
INITIAL_GC_THRESHOLD
将是我们启动第一次 GC 时的对象数量。较小的数字在内存方面更加保守,较大的数字在垃圾回收上花费的时间较少。按个人口味调整。
The INITIAL_GC_THRESHOLD will be the number of objects at which we kick off the first GC. A smaller number is more conservative with memory, a larger number spends less time on garbage collection. Adjust to taste.
我们每创建一个对象,都增加
numObjects
并在其达到最大值时进行一次回收:
Whenever we create an object, we increment numObjects and run a collection if it reaches the max:
Object* newObject(VM* vm, ObjectType type) {
if (vm->numObjects == vm->maxObjects) gc(vm);
/* Create object... */
vm->numObjects++;
return object;
}
我不会费心展示它,但我们还会调整
sweep()
以在每次释放时减少 numObjects
。最后我们修改 gc()
来更新最大值:
I won’t bother showing it, but we’ll also tweak sweep() to decrement numObjects every time it frees one. Finally, we modify gc() to update the max:
void gc(VM* vm) {
int numObjects = vm->numObjects;
markAll(vm);
sweep(vm);
vm->maxObjects = vm->numObjects * 2;
}
每次回收后,我们都会根据回收后剩余的存活对象数量更新
maxObjects
。那里的乘数让我们的堆随着存活对象数量的增加而增长。同样,如果一堆对象最终被释放,它会自动收缩。
After every collection, we update maxObjects based on the number of live objects left after the collection. The multiplier there lets our heap grow as the number of living objects increases. Likewise, it will shrink automatically if a bunch of objects end up being freed.
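For instance (illustrative numbers of mine, not from the article): with INITIAL_GC_THRESHOLD set to 8, if 6 objects survive a collection, the next one triggers once 12 objects are live; if a later collection leaves only 2 survivors, the threshold shrinks back to 4.

As for the sweep() tweak the author skips above, a plausible version looks like this (my sketch; the only assumed change from the earlier sweep() is the vm->numObjects-- line):

void sweep(VM* vm)
{
  Object** object = &vm->firstObject;
  while (*object) {
    if (!(*object)->marked) {
      Object* unreached = *object;
      *object = unreached->next;
      free(unreached);
      /* Keep the allocation count in sync with the object list. */
      vm->numObjects--;
    } else {
      (*object)->marked = 0;
      object = &(*object)->next;
    }
  }
}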
简单 Simple [#291]
- December 20, 2024
- Bob Nystrom
你做到了!如果你一路跟到了这里,那么你现在已经掌握了一个简单的垃圾回收算法。如果你想一次看到全部内容,可在这里参阅完整代码。我要在这里强调:虽然这个回收器很简单,但它不是一个玩具。
You made it! If you followed all of this, you’ve now got a handle on a simple garbage collection algorithm. If you want to see it all together, here’s the full code. Let me stress here that while this collector is simple, it isn’t a toy.
你可以在此基础上构建大量优化⸺在垃圾回收和编程语言中,优化占了 90% 的工作量⸺但这里的核心代码是一个合法的、真正的垃圾回收器。它与不久之前 Ruby 和 Lua 中的回收器都非常相似。你可以在生产代码中使用与此完全类似的东西。现在,去构建一些很棒的东西吧!
There are a ton of optimizations you can build on top of this—in GCs and programming languages, optimization is 90% of the effort—but the core code here is a legitimate, real garbage collector. It’s very similar to the collectors that were in Ruby and Lua until recently. You can ship production code that uses something exactly like this. Now go build something awesome!
Translation. Zipper in scheme [jsr-000O]
- December 18, 2024
- Oleg Kiselyov with contributions from Jinser Kafka
- Original
Title added by translator. 标题由译者添加。
tips: Click on the translation to expand and view the original text. 点击译文可展开查看原文。
前言 [#292]
- December 18, 2024
- Oleg Kiselyov
zipper 是一种非常方便的数据结构,它让我们无需任何原地修改(mutation),就能替换复杂数据结构(例如树或 term)深处的某一项。得到的新数据结构会尽可能多地与旧结构共享组件[见附录],而旧的数据结构仍然可用(如果我们之后想“撤销”这次操作,这会很方便)。本质上,zipper 是一个指向数据结构内部的、“可更新”却又纯函数式的游标。
Zipper is a very handy data structure that lets us replace an item deep in a complex data structure, e.g., a tree or a term, without any mutation. The resulting data structure will share as much of its components with the old structure as possible [see addendum]. The old data structure is still available (which can be handy if we wish to 'undo' the operation later on). Zipper is essentially an `updateable' and yet pure functional cursor into a data structure.
有用的参考:
Useful references:
http://www.nist.gov/dads/HTML/zipper.html
http://citeseer.ist.psu.edu/hinze01web.html
zipper 是定界延续(delimited continuation)具体化(reify)而成的数据结构。不知何故,这个想法在 zipper 的文献中很少被讨论。由于 Scheme 对定界延续有一等支持,我们可以更容易地派生和使用 zipper。
Zipper is a _delimited continuation_ reified as a data structure. Somehow that idea is not commonly discussed in the zipper literature. Because Scheme has first-class and delimited continuations, we can derive and use zipper far more easily.
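As a warm-up, here is a one-liner of mine (not from the original message) showing how shift and reset behave; it assumes a shift/reset implementation such as the one loaded in the session below:

;; reset delimits a context; shift captures "the rest of the computation
;; up to the nearest reset" as an ordinary function f.
;; Here f = (lambda (x) (+ 1 x)), so (f (f 10)) = 12.
(reset (+ 1 (shift f (f (f 10)))))  ; => 12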
本文将给出 zipper 的派生及其使用示例:交换两棵树的两个分支。该使用示例是遗传编程中典型的交叉操作。
Given below is a derivation of zipper and an example of its use: swapping out of two branches of a two trees. The latter is a typical cross-over operation in genetic programming.
[email protected] (Noel Welsh) 在消息中写
news:<[email protected]>...
我想以纯函数式的方式进行交叉,我设想的算法是:
- 数出两棵树各自的节点数
- 在 0 ...(节点数 - 1)之间随机选择一个整数,这就是将发生交叉的节点。
[email protected] (Noel Welsh) wrote in message
news:<[email protected]>...
I would like to do the crossover in a purely functional manner. The
algorithm I envisage is:
- Count the number of nodes for each of the two trees
- Choose an random integer between 0 ... (number of nodes - 1)
This is the node where crossover will take place.
正如上文指出的,我们不需要通过计数来从树中选择随机节点。选择节点后,我们可以使用
eq?
测试一路下行(zip down)到树中的该节点。然而在下文中,为了简单起见,我们跳过随机选择,改为按节点在深度优先顺序中的索引来选择节点。
As pointed out earlier, we don't need counting to select a random node from a tree. After we selected the node, we can zip down to that node in the tree using the eq? test. In the following however, we skip the random selection for simplicity and thus we shall be selecting nodes by their index in the depth-first order.
派生 [#293]
- December 18, 2024
- Oleg Kiselyov
为了派生 zipper,我们首先编写更熟悉的深度优先遍历例程:
To derive zipper, we first write the familiar depth-first traversal routine:
Welcome to Scheme 48 1.1
> ,open srfi-9
> ,open escapes signals
> ,load /usr/local/lib/scheme48/misc/shift-reset.scm
; deterministic, left-to-right map
(define (map* f l)
(if (null? l) l
(cons (f (car l)) (map* f (cdr l)))))
(define (depth-first handle tree)
(cond
((null? tree) tree)
((handle tree) => (lambda (new-tree) new-tree))
; the node was not handled -- descend
((not (pair? tree)) tree) ; an atom
(else
(cons (car tree) ; node name
(map* (lambda (kid) (depth-first handle kid)) (cdr tree))))))
函数
handle
,作为 depth-first
的第一个参数,接受一个 node
并应该返回一个 node
或 #f
。在第一种情况下,返回的 node
替换结果树中的现有节点。如果 handle
返回 #f
,则表明它已拒绝处理该节点,因此我们保留该节点,在可能的情况下继续处理该节点的子节点。
The function handle, the first-argument of depth-first, receives a node and should yield either a node or #f. In the first case, the return node replaces the existing node in the result tree. If the handle returned #f, it has declined to handle that node, so we keep that node and descend into it, if possible.
为了了解其工作原理,我们定义两个示例树并打印出它们的节点:
To see how this works, we define two sample trees and print out their nodes:
(define tree1 '(a (b) (c (d 1 2)) e))
(define tree2 '(z (u) (v (w 10 12)) y))
(depth-first (lambda (node) (display node) (newline) #f) tree1)
==> prints
(a (b) (c (d 1 2)) e)
(b)
(c (d 1 2))
(d 1 2)
1
2
e
==> yields
'(a (b) (c (d 1 2)) e)
我们现在可以定义数据类型 zipper:
We can now define the zipper data structure:
(define-record-type zzipper
(zipper curr-node k)
zipper?
(curr-node z-curr-node)
(k z-k))
它包含两个字段:树的当前节点和延续。延续应该接收一个
node
或 #f
。在前一种情况,接收到的节点将替换现有节点。在后一种情况,我们保留现有节点并继续遍历。这个延续返回一个新的 zipper
或一棵树(如果遍历完成)。可以看出,zipper 在某种意义上是函数 handle
的“逆”。
It contains two fields: the current node of a tree, and the continuation. The continuation should receive a node or #f. In the former case, the received node will replace the existing node. In the latter case, we keep the existing node and continue the traversal. The continuation returns either a new zipper, or a tree (if the traversal is finished). One can see that zipper is in a sense an 'inverse' of the function handle.
(define (zip-tree tree)
(reset (depth-first (lambda (tree) (shift f (zipper tree f))) tree)))
正如所承诺的,zipper 确实是定界延续的一种表现形式。
As promised, zipper is indeed a manifestation of a delimited continuation.
我们应该指出,record
zipper
和构造函数 zip-tree
都是 generic 的。它们本身既不依赖树的表示,也不依赖遍历策略。有关树数据结构及其遍历的所有信息都被封装在一个函数 depth-first
中。我们只需改变 depth-first,就能从深度优先策略切换到广度优先策略,或者把树的实现从嵌套列表换成嵌套向量。无论是 zipper
、zip-tree
还是任何使用 zipper
的代码(见下文)都不需要任何修改。我们的 zipper 的这一特性与 Gerard Huet 的 zipper 派生形成了鲜明的对比。在 Gerard Huet 的形式中,zipper 确实取决于数据类型的具体实现:zipper 是从数据类型 derived(双关语)的(译者按:派生和求导)。不同的数据类型(以及抽象数据类型的不同实现)将具有不同的对应 zipper 结构。在我们的形式中,zipper 是遍历函数的 generic
derivation(双关语)(译者按:派生和导数)。zipper 是遍历函数的导数——就只是机械的导数。因此,shift/reset 可以被视为遍历函数的导数运算符。
We should point out that both the zipper record and the constructor function zip-tree are _generic_. They by themselves depend neither on the representation of the tree nor on the traversal strategy. All the information about the tree data structure and its traversal is encapsulated in one single function depth-first. We can switch from depth-first to breadth-first strategy or from a nested list to a nested vector realization of trees just by changing depth-first. Neither zipper, nor zip-tree, nor any code that uses zipper (see below) will require any modifications. This property of our zipper is in a marked contrast with Gerard Huet's derivation of zipper. In Gerard Huet's formulation, zipper does depend on the concrete realization of the data type: zipper is derived (pun intended) from the data type. Different data types (and different realizations of an abstract data type) will have different corresponding zipper structures. In our formulation, zipper is a _generic_ derivation (pun intended) on the traversal function. Zipper is a derivative of the traversal function -- mechanical derivative at that. So, shift/reset can be considered traversal function derivative operators.
使用 [#294]
- December 18, 2024
- Oleg Kiselyov
我们现在可以用一种不同的方式打印树:
We can now print out the tree in a different way:
(define (print-tree tree)
(do ((cursor (zip-tree tree) ((z-k cursor) #f)))
((not (zipper? cursor)))
(display (z-curr-node cursor))
(newline)))
我们使用 zipper(即
cursor
)逐个节点地检查整个树。从某种意义上说,我们翻转了 depth-first
的操作。
we use zipper, which is a cursor, to examine all of the tree, node by node. In a sense, we have inverted the operation of depth-first.
(print-tree tree1)
; prints as before
(print-tree tree2)
(z (u) (v (w 10 12)) y)
(u)
(v (w 10 12))
(w 10 12)
10
12
y
引入一些有用的函数
We introduce a few helpful functions
(define (zip-all-the-way-up zipper)
(if (zipper? zipper) (zip-all-the-way-up ((z-k zipper) (z-curr-node zipper)))
zipper))
(define (locate-nth-node n tree)
(do ((i 0 (+ 1 i)) (cursor (zip-tree tree) ((z-k cursor) #f)))
((and (= i n)
(if (zipper? cursor) #t
(error "too few nodes"))) cursor)
))
我们已准备好做一些事:
And we are ready for some action:
; replace the 3-d node of tree1 with 'xxx
(let ((desired-node (locate-nth-node 3 tree1)))
(display "Replacing the node: ")
(display (z-curr-node desired-node))
(newline)
(zip-all-the-way-up ((z-k desired-node) 'xxx)))
==> prints
Replacing the node: (d 1 2)
==> yields
'(a (b) (c xxx) e)
它确实替换了,不是吗?
It did replace it, didn't it?
; cross-over of the 3d node of tree1 and 1st node of tree2
(let* ((desired-node1 (locate-nth-node 3 tree1))
(_ (begin
(display "Cross-over the node1: ")
(display (z-curr-node desired-node1))
(newline)))
(desired-node2 (locate-nth-node 1 tree2))
(_ (begin
(display "Cross-over the node2: ")
(display (z-curr-node desired-node2))
(newline)))
(new-tree1
(zip-all-the-way-up ((z-k desired-node1)
(z-curr-node desired-node2))))
(new-tree2
(zip-all-the-way-up ((z-k desired-node2)
(z-curr-node desired-node1))))
)
(display "new tree1: ") (display new-tree1) (newline)
(display "new tree2: ") (display new-tree2) (newline)
)
==> prints
Cross-over the node1: (d 1 2)
Cross-over the node2: (u)
new tree1: (a (b) (c (u)) e)
new tree2: (z (d 1 2) (v (w 10 12)) y)
嗯,看来可行...
Well, it seems to work...
如果我们交换
tree1
的第3个节点和 tree2
的第5个节点,我们得到
If we swap the 3d node of tree1 and the 5th node of tree2, we get
Cross-over the node1: (d 1 2)
Cross-over the node2: 12
new tree1: (a (b) (c 12) e)
new tree2: (z (u) (v (w 10 (d 1 2))) y)
总结 [#295]
- December 18, 2024
- Oleg Kiselyov
总而言之,定界延续非常有用。它们可以在任何 R5RS Scheme 系统中模拟;但若有原生支持,性能会更好。Scheme48 原生支持定界延续(Martin Gasbichler and Michael Sperber, ICFP 2002)。如果你最喜欢的 Scheme 系统默认没有提供它们,请向实现者投诉。至于支持哪种特定的定界延续运算符(shift、control、shift0、splitter、cupto 等)并不重要⸺它们的表达能力都是等价的:
To conclude, delimited continuations are quite useful. They can be emulated in any R5RS Scheme system; yet it is better for performance if they are supported natively. Scheme48 does support delimited continuations natively (Martin Gasbichler and Michael Sperber, ICFP 2002). If your favorite Scheme system does not offer them by default, please complain to the implementors. It doesn't matter which particular delimited continuation operator (shift, control, shift0, splitter, cupto, etc) is supported -- all of them are equally expressible:
- Chung-chieh Shan, Scheme2004 workshop
- http://www.eecs.harvard.edu/~ccshan/recur/
附录 [#296]
- December 18, 2024
- Oleg Kiselyov
附录,2006 年 6 月 7 日[受 Andrew Wilcox 的一个问题启发]。更准确地说,zipper 保留共享的程度与底层枚举器相同。下面给出的是最大程度保留共享的枚举器,这两个函数应当替换本文前面的同名函数。
Addendum, June 7, 2006 [inspired by a question from Andrew Wilcox] To be more precise, the zipper preserves sharing as much as the underlying enumerator does. The following is the maximal sharing preserving enumerator. Those two functions should replace the ones in the article.
; deterministic, left-to-right map
; It preserves sharing as much as possible: that is, if given the pair
; l == (cons h t), (and (eq? h (f h)) (eq? t (map* f t))) holds, then
; (eq? (map* f l) l) holds as well.
(define (map* f l)
(if (null? l) l
(let ((h (car l)) (t (cdr l)))
(let ((h1 (f h)) (t1 (map* f t)))
(if (and (eq? h1 h) (eq? t1 t)) l
(cons h1 t1))))))
(define (depth-first handle tree)
(cond
((null? tree) tree)
((handle tree) => (lambda (new-tree) new-tree))
; the node was not handled -- descend
((not (pair? tree)) tree) ; an atom
(else
(let ((kids1
(map* (lambda (kid) (depth-first handle kid)) (cdr tree))))
(if (eq? kids1 (cdr tree)) tree
(cons (car tree) ; node name
kids1))))))
为了测试新的
depth-first
确实保留了共享,我们求值
To test that the new depth-first indeed preserves sharing, we evaluate
(eq? tree1
(depth-first (lambda (node) (display node) (newline) #f) tree1))
以深度优先顺序打印所有节点后,给出结果
#t
。在这种情况下,深度优先(depth-first
)返回的树确实是原始树。
which, after printing all nodes in depth-first order, gives the result #t. The tree returned by depth-first in this case is indeed the original tree as it is.
zipper 代码无需任何更改,按原样工作,结果也相同。为了测试共享是否被保留,我们首先通过替换
tree2
中的第 6 个节点(即 y
)来生成一棵树:
The zipper code needs no changes, and it works as it was, with the same results. To test the sharing preservation, we first produce a tree by replacing the 6th node (which is y) in tree2:
(define tree2*
(let ((desired-node (locate-nth-node 6 tree2)))
(display "Replacing the node: ")
(display (z-curr-node desired-node))
(newline)
(zip-all-the-way-up ((z-k desired-node) 'newy))))
这是结果:
(z (u) (v (w 10 12)) newy)
here's the result: (z (u) (v (w 10 12)) newy)
现在,我们编写一个函数,它接受两棵树,同步遍历它们并打印出节点以及它们是否共享:
Now, we write a function that takes two trees, traverses them in lockstep and prints out the nodes and if they are shared:
(define (tree-compare-sharing t1 t2)
(do ((cursor1 (zip-tree t1) ((z-k cursor1) #f))
(cursor2 (zip-tree t2) ((z-k cursor2) #f)))
((cond
((and (zipper? cursor1) (zipper? cursor2)) #f)
((zipper? cursor1) (display "t2 finished early") #t)
((zipper? cursor2) (display "t1 finished early") #t)
(else #t)))
(let ((n1 (z-curr-node cursor1)) (n2 (z-curr-node cursor2)))
(cond
((eq? n1 n2) (display "shared node: ") (display n1))
(else (display "t1 node: ") (display n1) (newline)
(display "t2 node: ") (display n2)))
(newline))))
(tree-compare-sharing tree2 tree2*)
===>
t1 node: (z (u) (v (w 10 12)) y)
t2 node: (z (u) (v (w 10 12)) newy)
shared node: (u)
shared node: (v (w 10 12))
shared node: (w 10 12)
shared node: 10
shared node: 12
t1 node: y
t2 node: newy