Jinser Kafka › translations [jsr-000N]
翻译文章的列表,标记的时间是翻译时间。
Translation. Python Negatypes [jsr-0014]
- March 5, 2025
- Hillel Wayne with contributions from Jinser Kafka
- Original
This article is translated from Hillel Wayne's article titled "Python Negatypes".
本文翻译自 Hillel Wayne 题为《Python Negatypes》的文章。
tips: Click on the translation to expand and view the original text. 点击译文可展开查看原文。
Translated according to the version last updated on Sep 26, 2019.
根据2019年9月26日更新的版本翻译。
正文 [#255]
- March 5, 2025
- Hillel Wayne
from abc import ABC, abstractmethod

class AbstractIterable(ABC):
    @abstractmethod
    def __iter__(self):
        while False:
            yield None

    def get_iterator(self):
        return self.__iter__()
ABC 被添加进来,是为了稍微增强鸭子类型。如果你继承了 AbstractIterable,那么所有人都知道你有一个已实现的 __iter__ 方法,并能对其进行适当的处理。
ABCs were added to strengthen the duck typing a little. If you inherited AbstractIterable, then everybody knew you had an implemented __iter__ method, and could handle that appropriately.
不出所料,这个想法从未流行起来。人们更喜欢“请求原谅而不是请求许可”,把对 __iter__ 的调用包装在 try 块中。这本可以对静态类型检查有用,但实际上 Mypy 并不使用它。如果你想对一个有 __iter__ 但没有从 AbstractIterable 继承的类型做类型检查,该怎么办?Mypy 团队改用协议,协议由 ABC 引导实现,但对用户隐藏了这个细节。
Unsurprisingly, this idea never caught on. People instead preferred “better ask forgiveness than permission” and wrapped calls to __iter__ in a try block. This could be useful for static type checking, but in practice Mypy doesn’t use it. What if you wanted to typecheck it had __iter__ but the person did not inherit from AbstractIterable? The Mypy team instead uses protocols, which is bootstrapped off ABCs but hides that detail from the user.
但 ABC 旨在向后兼容。而且已经存在具有 __iter__ 方法的类。我们如何将它们纳入我们的 AbstractIterable ABC?为了解决这个问题,Python 团队添加了一个特殊的 ABC 方法:
But ABC was intended to be backwards compatible. And there were already existing classes that had an __iter__ method. How could we include them under our AbstractIterable ABC? To handle this, the Python team added a special ABC method:
class AbstractIterable(ABC):
    @classmethod
    def __subclasshook__(cls, C):
        return hasattr(C, "__iter__")
__subclasshook__ 是使某些类算作这个 ABC 的子类的运行时条件。如果 OurClass 具有 __iter__ 属性,那么 isinstance(OurClass(), AbstractIterable) 就为真,即使它不是从 AbstractIterable 继承的。
__subclasshook__ is the runtime condition that makes something count as a child of this ABC. isinstance(OurClass(), AbstractIterable) is true if OurClass has an __iter__ attribute, even if it didn’t inherit from AbstractIterable.
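作为示意,下面是一个可以直接运行的小例子(译者补充的假设性代码,非原文内容),演示 __subclasshook__ 如何让一个从未继承 AbstractIterable 的类通过 isinstance/issubclass 检查:

```python
from abc import ABC

class AbstractIterable(ABC):
    @classmethod
    def __subclasshook__(cls, C):
        # 只要类上有 __iter__ 属性,就算作虚拟子类
        return hasattr(C, "__iter__")

class OurClass:
    def __iter__(self):
        return iter([])

# OurClass 没有继承 AbstractIterable,但检查仍然通过
print(isinstance(OurClass(), AbstractIterable))  # True
print(issubclass(dict, AbstractIterable))        # True:dict 定义了 __iter__
```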
这个函数是一个运行时函数,我们可以在其中放入任意代码。它传入的是对象的类,而不是对象本身,因此我们无法检查特定对象的属性。但我们仍然可以做一些奇怪的事情:
That function is a runtime function. We can put arbitrary code in it. It passes in the object’s class, not the object itself, so we can’t inspect the properties of the specific object. We can still do some weird stuff:
class PalindromicName(ABC):
    @classmethod
    def __subclasshook__(cls, C):
        name = C.__name__.lower()
        return name[::-1] == name
任何名字是回文的类,比如“Noon”,都将被视为 PalindromicName 的子类。我们可以更进一步:既然可以直接跳下去,为什么还要凝视深渊呢?
Any class with a palindromic name, like “Noon”, will counts as a child class of PalindromicName. We can push this even further: why gaze into the abyss when you can jump in?
class NotIterable(ABC):
    @classmethod
    def __subclasshook__(cls, C):
        return not hasattr(C, "__iter__")
这是一切不可迭代之物的类型。我们有了 isinstance(5, NotAString) 之类的东西。我们创建了一个负类型:一种只规定自己不是什么的类型。我们可以把它纳入一组正类型,从给定类型中减去一个子集。没有什么能阻止我们为“不是字符串的可迭代对象”,或“返回的对象与自身类型不同的可调用对象”创建 ABC。
This is the type of everything that isn’t iterable. We have isinstance(5, NotAString), etc. We’ve created a negatype: a type that only specifies what it isn’t. We can include this as part of a set of positive types, subtracting out a subset of a given type. There’s nothing stopping us from making an ABC for “iterables that aren’t strings”, or “callable objects that don’t return an object of the same callable”.
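下面的可运行小例子(译者补充的示意性代码,非原文内容)验证这个负类型的行为:

```python
from abc import ABC

class NotIterable(ABC):
    @classmethod
    def __subclasshook__(cls, C):
        # 没有 __iter__ 的类才算作子类
        return not hasattr(C, "__iter__")

print(isinstance(5, NotIterable))      # True:int 不可迭代
print(isinstance("abc", NotIterable))  # False:str 可迭代
```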
这有什么用? How is this useful? [#256]
- March 5, 2025
- Hillel Wayne
不知道。
No idea.
ABC 不会作为方法解析顺序的一部分被检查,因此你不能用它来给类打补丁加属性。Mypy 无法检查 __subclasshook__。如果你想用它做运行时检查,写一个普通函数会比创建 ABC 更简单、更可移植。唯一有所不同的情况是单分派函数,它可以在虚拟 ABC 上分派。但仅此而已。
ABCs aren’t checked as part of the method resolution order, so you can’t use this to patch in properties. Mypy can’t check __subclasshook__. If you want it for runtime-checks, writing a function would be simpler and more portable than creating an ABC. Just about the only case where there is a difference is with single-dispatch functions, which can dispatch on virtual ABCs. But that’s about it.
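文中提到单分派是唯一的例外。下面是一个示意性例子(译者补充,HasIter、describe 均为假设的名字),演示 functools.singledispatch 可以在这样的虚拟 ABC 上分派:

```python
from abc import ABC
from functools import singledispatch

class HasIter(ABC):
    """假设的 ABC:任何带 __iter__ 的类都算虚拟子类。"""
    @classmethod
    def __subclasshook__(cls, C):
        return hasattr(C, "__iter__")

@singledispatch
def describe(x):
    return "not iterable"

@describe.register(HasIter)  # 在虚拟 ABC 上注册分派
def _(x):
    return "iterable"

print(describe([1, 2]))  # iterable
print(describe(5))       # not iterable
```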
但它很酷!
It’s pretty cool, though!
Translation. Grasping `all-the-apples-at-once' [jsr-000U]
- March 3, 2025
- Oleg Kiselyov with contributions from Jinser Kafka, 阅卜录
- Original
This article is translated from Oleg Kiselyov's article titled "Grasping `all-the-apples-at-once'".
本文翻译自 Oleg Kiselyov 题为《Grasping `all-the-apples-at-once'》的文章。
tips: Click on the translation to expand and view the original text. 点击译文可展开查看原文。
Translated according to the version last updated on June 3, 2024.
根据2024年6月3日更新的版本翻译。
介绍 Introduction [#257]
- March 3, 2025
- Oleg Kiselyov
这是一次轻松的讨论,内容是对一次即兴编程竞赛的分析:参与者使用自选的编程语言解决一个简单但现实的问题。结果显示,那些明显正确且用恰当语言表达出的解决方案,都不是用我预期的语言写成的。这个讨论记录了令人不安的观察和几个意想不到的鼓励,同时也反思了常见的编程语言教学及讨论方式。
This is a light talk with the analysis of an impromptu programming contest: solving a simple but realistic problem in a language of participants' choice. It turns out that the solutions that were obviously right, expressed in just the right words were written not in the languages I have expected. This talk notes the unsettling observations and a few unexpected encouragements, along with a reflection on the common way of teaching and arguing about programming languages.
这场竞赛发生在一个以计算机为中心的社区 Lobsters。Lobsters 有点像 HN,也是一个论坛,会员们会提交和讨论各种文章,从学术期刊论文到博客文章、公告和视频。这个社区可以被称为“主流 IT”:包括嵌入式系统到 Web 的开发人员,系统管理员,学生,计算机爱好者以及少数学者。
The contest happened at a computer-focused community Lobsters. Somewhat like HN, Lobsters is a forum for its members to submit and discuss articles: from journal papers to blog posts, announcements and videos. The community is what one may call the `mainstream IT': developers, from embedded systems to the Web, system administrators as well as students, computer hobbyists, and a few academics.
某天,一篇关于常见工作面试问题的文章被发布出来以供讨论。随后,成员们自发地开始提交他们用自己选择的编程语言编写的解决方案。他们还对自己的和其他人的解决方案进行了评论。
One day an article about a common job interview question was posted for discussion. Spontaneously, members started submitting their own solutions, in their language of choice. They also commented on their own and others solutions.
令我惊讶的是,无论这些解法是用什么语言编写的,是函数式语言(如 Haskell、Erlang、Clojure)还是命令式语言(如 Go、Ruby、Raku),它们都整齐地分成了两类。直到现在我才意识到,递归不只是在理论上才是迭代的对偶。和循环一样,它是 low-level 的:一次选择一个苹果。“纯函数式”的状态处理与命令式的状态处理亦是如此。
It has struck me how the solutions neatly fall into two classes -- irrespective of the language they are written in, be it a functional (such as Haskell, Erlang or Clojure) or imperative (Go, Ruby, Raku). Only now I realized that recursion is dual to iteration not just in theory. Like loops, it is low-level: selecting one apple at a time. Likewise is the `pure-functional' and imperative handling of state.
我们如何一次抓住所有苹果?就像在自然语言的学习和教学中一样,我们是否应该少关注语法,而更多地关注对话和讲故事?
How do we grasp all the apples at once? As in learning and teaching of natural languages, should we focus less on grammar and more on conversation and storytelling?
该演讲是在 2021年7月1日 的 IFIP WG 2.1 线上会议进行的。
The talk was given at an online meeting of IFIP WG 2.1 on July 1, 2021.
引用 References [#258]
- March 3, 2025
- Oleg Kiselyov
Lobsters 上竞赛的讨论串。2021年2月。
<https://lobste.rs/s/zpqrpc/random_job_interview_challenge_clojure>
The Lobsters thread with the contest. February 2021.
<https://lobste.rs/s/zpqrpc/random_job_interview_challenge_clojure>
竞赛问题 Contest problem [jsr-000V]
Alexey Grigorev (@Al_Grigor) February 3, 2021 (Twitter)
大多数候选人都无法解决这个面试问题:
输入: "aaaabbbcca"
输出: [("a",4), ("b", 3), ("c", 2), ("a", 1)]
写一个将输入转换为输出的函数
我在视频面试中提出这个问题,给予 25 分钟时间
你会如何解决?
Most candidates cannot solve this interview problem:
Input: "aaaabbbcca"
Output: [("a",4), ("b", 3), ("c", 2), ("a", 1)]
Write a function that converts the input to the output
I ask it in the screening interview and give it 25 minutes
How would you solve it?
Alexey Grigorev (@Al_Grigor) February 3, 2021 (Twitter)
引用 References [#288]
该问题的说明引用自 Lobsters 上发布的网页:
A Random Job Interview Challenge In Clojure. Feb 4 2021 by Savo Djuric
<https://savo.rocks/posts/a-random-job-interview-challenge-in-clojure/>
The statement of the problem is quoted from the web page posted on Lobsters:
A Random Job Interview Challenge In Clojure. Feb 4 2021 by Savo Djuric
<https://savo.rocks/posts/a-random-job-interview-challenge-in-clojure/>
解法样本 Sample solutions [jsr-000W]
提交到 Lobsters 供讨论的博客文章描述了在 Clojure 中的一个解决方案。但不知为何,这促使会员们用自己选择的编程语言来解决这个问题并发布结果。于是这就形成了一场即兴竞赛。
The blog post submitted to Lobsters for discussion described one solution, in Clojure. Somehow it prompted the members to solve the problem in their language of choice, and post the results. Hence there developed an impromptu contest.
以下是已发布的解法样本。这些解决方案由不同的人用不同的编程语言编写⸺然而不知为何,它们分成了两组,我暂时称之为 Group I 和 Group II。我稍后会告诉你这些组的意义⸺请让我卖个关子。
Below is a sample of posted solutions. They are written in various languages by various people -- yet somehow fall into two groups, which I call for now Group I and Group II. I'll tell what they mean only later -- I hope you'd allow me a bit of suspense.
Clojure (Group I) [#277]
(->> "aaaabbbcca"
     (partition-by identity)
     (map (juxt (comp str first) count)))
提交者还提供了另一个解法:“有点繁琐,但更清晰”:
(for [p (partition-by identity "aaaabbbcca")]
  [(-> p first str) (count p)])
The submitter also gave another solution, ``a bit chattier but also more legible:''
值得注意的是,comp 是函数组合,而 juxt 的粗略类型是 (a -> b * a -> c) -> a -> (b * c),它是一个组合子:接收一个函数二元组和一个值,将每个函数应用于该值,然后把结果组成元组返回。
It is worth noting that comp is the functional composition, and juxt, with the rough type (a->b * a->c) -> a -> (b * c), is the combinator that takes a tuple of functions and a value, applies each function to it and returns the results as a tuple.
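对不熟悉 Clojure 的读者,comp 和 juxt 的行为可以用几行 Python 粗略模拟(译者补充的示意性代码,非原文内容):

```python
def comp(f, g):
    # 函数组合:comp(f, g)(x) == f(g(x))
    return lambda x: f(g(x))

def juxt(*fns):
    # 把同一个值交给每个函数,收集结果为元组
    return lambda x: tuple(f(x) for f in fns)

# 对应 (juxt (comp str first) count):取组的首字符与组的长度
pair = juxt(comp(str, lambda g: g[0]), len)
print(pair(["a", "a", "a", "a"]))  # ('a', 4)
```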
Ruby (Group I) [#278]
Ruby 有三种解法⸺都是单行的,而且很相似。第一个可能是最容易理解的。
There were three solutions in Ruby -- all one-liners, and rather similar. The first is probably the easiest to understand.
"aaaabbbcca".split("").chunk_while {|a,b| a==b}.map {|a| [a.first, a.size]}
"aaaabbbcca".chars.slice_when {|c,n| c != n }.map {|s| [s[0], s.size]}
'aaaabbbcca'.chars.chunk(&:itself).map{|x,y| [x, y.size]}
APL (Group I) [#279]
两个 APL 解法:
Two APL solutions:
(((1∘+ +\) ⊢ ≠ ¯1∘⌽) (((⊃,≢) ⊢) ⌸) ⊢) 'aaaabbbcca'
f←(⊃,≢)¨⊢⊆⍨1+(+\0,2≠/⊢) ⊢
x←f'aaaabbbcca'
Go (Group II) [#280]
“我知道这是一个 Clojure 帖子,但出于好奇,我用 Go(我最近一直在用的语言)写了它,只是想看看能否快速解决。花了大约 10 分钟,减去 3-4 分钟与 rune 搏斗的时间。它不是特别优雅,但能工作。”
``I know this is a Clojure post, but out of curiosity I wrote it in Go (what I've been working in lately) just to see if I could solve it quickly. Took about 10 minutes, subtract 3-4 for fighting with runes. It's not particularly elegant, but it works.''
package main

import "fmt"

type result struct {
    letter string
    count  int
}

func main() {
    const input = "aaaabbbcca"
    var ret []result
    currentLetter := string(input[0])
    countCurrentLetter := 1
    for _, elem := range input[1:] {
        elemAsString := string(elem)
        if currentLetter == elemAsString {
            countCurrentLetter++
        } else {
            ret = append(ret, result{currentLetter, countCurrentLetter})
            currentLetter = elemAsString
            countCurrentLetter = 1
        }
    }
    ret = append(ret, result{currentLetter, countCurrentLetter})
    fmt.Printf("%+v", ret)
}
该解法与类似解法的显著特征是显式迭代:for 循环。
The salient feature of this and similar solutions is explicit iteration: the for loop.
Racket (Group II) [#281]
(for/fold ([p #f]
           [cnt 0]
           [counts null]
           #:result (cdr (reverse (cons `(,p ,cnt) counts))))
          ([c (in-string "aaaabbbcca")])
  (if (equal? p c)
      (values c (add1 cnt) counts)
      (values c 1 (cons `(,p ,cnt) counts))))
该解法是纯函数式的⸺但与上面 GO 的解决方案非常相似。
The solution is pure functional -- yet closely resembling the Go solution above.
Raku (Group I) [#282]
"aaaabbbcca".subst( /(.) $0*/ , {"({$0},{$/.chars})" }, :g )
Raku 是那位“前身为 Perl 6 的艺术家”。Perl 一直以其(不)正则表达式而闻名,其威力在此可见一斑。
Raku is the artist formerly known as Perl 6. Perl has always been known for its (ir)regular expressions, whose power is on display.
Erlang (Group II) [#283]
L = "aaaabbbcca",
lists:reverse(lists:foldl(
  fun
    (Elt, [ {[Elt], N} | R ]) -> [ {[Elt], N + 1} | R];
    (Elt, Acc) -> [ {[Elt], 1} | Acc]
  end,
  [],
  L
)).
与 Racket 解法相当相似。我们看到了 Erlang 特有的非线性模式匹配:变量 Elt 在第一个模式中出现了两次。
It is rather similar to the Racket solution. We see the characteristic for Erlang non-linear pattern matching: the variable Elt appears twice in the first pattern.
Python (Group I) [#284]
[(k, len(list(g))) for k, g in itertools.groupby("aaaabbbcca")]
Rust (Group II) [#285]
fn run_length_encode(ins: &str) -> Vec<(char, usize)> {
    let mut out = vec![];
    let mut i = ins.chars();
    if let Some(mut c) = i.next() {
        let mut count = 1;
        for new_c in i {
            if new_c == c {
                count += 1;
            } else {
                out.push((c, count));
                count = 1;
                c = new_c;
            }
        }
        out.push((c, count));
    }
    out
}
就像 Go 的解法一样,但更干净,并且更好地处理了边缘情况。作者确实抓住了问题的本质:游程编码。
It is quite like the solution in Go -- but cleaner and with a better handling of edge cases. The author did grasp the nature of the problem: run-length encoding.
Julia (Group II) [#286]
function rle(itr)
    acc = Tuple{eltype(itr), Int}[]
    x = iterate(itr)
    isnothing(x) && return acc  # iterable is empty
    last, state = x
    n = 1
    for v in Iterators.rest(itr, state)
        if last == v
            n += 1
        else
            push!(acc, (last, n))
            last = v
            n = 1
        end
    end
    push!(acc, (last, n))
    return acc
end
镜像了上面的 Rust 解法。
It mirrors the Rust solution above.
Haskell (Group II) [#287]
main = interact soln
soln = show . reverse . go []
go counts [] = counts
go counts (x:xs) = go (inc x counts) xs
inc c ((x,ct):xs) | c == x = (x,ct+1):xs
inc c xs = (c, 1):xs
尽管该解法依赖显式递归而不是 fold,但它与 Erlang 解法相当相似。(后来另有一人提到了另一种使用 group 的解法。)
Although this solution relies on the explicit recursion rather than fold, it is rather similar to the Erlang solution. (A different person later mentioned another solution, using group).
评论 Comments [jsr-000X]
Lobsters 竞赛的一个特色是参与者对自己和他人的解法的评论。以下是几条值得注意的评论,它们进一步证明了我在本次演讲中的观点。
A particular feature of the Lobsters' contest is comments of the participants, about their and others' solutions. Below are several notable comments, which go on to make my point in this talk.
一位评论者想知道:
One commenter wondered:
“阅读 Clojure 解法需要跟随作者最终得出该解法的思维过程。阅读等效的 Go 解法则几乎没有认知开销。
Clojure 程序员如何才能写出不仅在作者眼中简洁优雅、而且还能让其他人快速阅读、无需费时做脑力体操的代码?”
``Reading the Clojure solution involves following the thought process of how the author ended up there. Reading the equivalent Go solution has very little cognitive overhead.
How does Clojure programmers write code that is not just brief and elegant in the eyes of the author, but can also be read quickly by others without time consuming mental gymnastics?''
另一人回答:
and another replied:
我想这只是你的习惯的问题。我发现上面的两种 Go 解法都很难理解,因为你必须遵循算法并在脑海中构建它的功能。而在大多数 Clojure 解法中,代码会直接告诉你发生了什么。
I guess it's just a matter of what you're used to. I find both Go solutions above arduous to understand because you have to follow the algorithm and construct what it functionally does in your head. Whereas in most of the clojure solutions the code directly tells you what's happening.
例如,带有注释的最高帖子:
Eg. the top post with comments:
; threading macro (take input, then do stuff to it in steps)
(->> "aaaabbbcca"
     ; split into lists of consecutive same character
     (partition-by identity)
     ; map each item (list of chars) with a function
     (map
      ; return list of outputs of functions
      (juxt
       ; first character as string
       (comp str first)
       ; count of characters
       count)))
该信息更加密集。如果你擅长阅读它,你可以一目了然地理解它。
The information is much more dense. If you are good at reading it, you understand it at a glance.
我相信 APL 解法也是如此。我读不懂它们,但能读懂的人可以清楚地看到其中的算法。
I believe the same is true for the APL solutions. I can't read them, but someone who does sees the algorithm clearly.''
另一位评论者同意:
Yet another commenter concurred:
我也一样;它们[Go 解法]看起来超级繁琐和低级,就像你在给编译器灌输基本的东西,而不是自然地描述算法。
``Same here; they [Go solutions] just seem super tedious and low-level; like you're spoon-feeding the compiler on basic things instead of just describing the algorithm naturally.
如果你没有词汇来描述算法,那么当然低级版本会更容易阅读,就像刚刚学习英语的人会发现“简化英语”维基百科比经常使用专业术语的常规维基百科更容易理解一样。但如果你每天都使用算法,你就应该投资学习这些术语!
If you don't have the vocabulary to describe an algorithm, of course the low-level version is going to be easier to read, much like someone who is only just learning English will find the `Simplified English' wikipedia more accessible than the regular one that often uses specialized jargon. But if you work with algorithms every day, you owe it to yourself to invest in learning the jargon!''
苹果类比 The apples analogy [jsr-000Y]
我已经提到了几次苹果。以下引用能说明一切。
I have mentioned apples several times already. The following quote explains what is it all about.
假设我们有一箱苹果,想从中挑选出所有好苹果。我们不想自己做这个任务,所以我们写了一张清单,要求助手来完成。
``Suppose we have a box of apples from which we wish to select all of the good apples. Not wishing to do the task ourselves, we write a list of instructions to be carried out by a helper.
常规语言的指令可能表达如下内容:从箱子里选一个苹果。如果是好的,就把它放在为好苹果保留的地方;如果不好,就把它扔掉。再选一个苹果;如果是好的,就把它放到保留的地方;如果不好,就把它扔掉……以这种方式继续,依次检查每个苹果,直到选出所有好苹果。总而言之,从我们拿起的第一个苹果开始,一次检查一个,到箱子里的最后一个为止,直到把所有好苹果都挑出并放在一边。
The instructions corresponding to a conventional language might be expressed something like the following: Select an apple from the box. If it is good, set it aside in some place reserved for the good apples; if it is not good, discard it. Select a second apple; if it is good put it in the reserved place, and if it is not good discard it.... Continue in this manner examining each apple in turn until all of the good apples have been selected. In summary, then, examine the apples one at a time starting with the first one we pick up and finishing with the last one in the box until we have selected and set aside all of the good apples.
另一方面,与数组语言相对应的指令可以简单地陈述为“从箱子里选出所有好苹果”。当然,苹果仍然需要被逐个检查,但逐个苹果的细节可以留给助手。
On the other hand the instructions corresponding to an array language could be stated simply as ``Select all of the good apples from the box.'' Of course, the apples would still have to be examined individually, but the apple-by-apple details could be left to the helper.
可以认为,常规编程语言是“一次一苹果”的语言,机器语言是其非常原始的形式;而数组语言则可以被视为“一次所有苹果”的语言。
Conventional programming languages may be considered then ``one-apple-at-a-time'' languages with machine languages being a very primitive form, whereas array languages may be considered ``all-the-apples-at-once'' languages.''
基思·斯迈利:《My Life with Array Languages》,2005年10月。这个比喻要归功于弗雷德里克·P·布鲁克斯:就是那位以《人月神话》闻名的 Fred Brooks。
Keith Smillie: ``My Life with Array Languages''. October 2005. This analogy is credited to Frederick P. Brooks: that Fred Brooks, of the Mythical Man-Month fame.
(译者按:《My Life with Array Languages》原文为:“The apple analogy originated with Frederick P. Brooks who worked closely with Ken Iverson in the early days of the development of APL.”原翻译有错,感谢 阅卜录 指出。)
观察 Observations [jsr-000Z]
在一场不受控制的竞赛中,仅凭针对单个问题的几个解法,我们无法得出任何统计结论。但仍有足够的数据供我们观察和思考。第一个观察结果是,人们成功地使用了相当多样的语言来解决问题,而且据提交者自述,大约只花了 5-10 分钟。提交的解法看起来也很地道,反映了“社区标准”,即这些语言被教授(或自学)以及在工业界被使用的方式。
A few solutions to a single problem in an uncontrolled contest do not let us draw any statistical conclusions. Still there is enough data for observations and reflections. The first observation is that quite a variety of languages were used to solve the problem, successfully, and it took the submitters, by their self-reports, about 5-10 minutes. The submitted solutions also seem idiomatic, reflecting the `community standards', the way these languages are (self-) taught and used in the industry.
第二个令人惊讶的观察结果是,提交的解法分为两类:
The second, surprising observation is that the submitted solutions fall into two classes:
- 逐字符扫描输入字符串,构造结果:Group II
- 将问题视为 groupBy 后接映射:Group I
- scan the input string char by char, building the result: Group II
- see the problem as groupBy followed by mapping: Group I
这是编写各组解法的语言:
Here are the languages used to write the solutions in each group:
- Group I: Clojure, Ruby, Raku, Python, Haskell, APL
- Group II: Go, Racket, Erlang, Rust, Julia, Haskell
无论选择哪种编程语言分类方式⸺函数式与命令式、旧与新、由“业余爱好者”创建与由“专业人士”创建⸺每一组中都有其代表。
Whichever the classification of programming languages one may choose -- functional vs. imperative, old vs. new, created by `amateurs' vs. `professionals' -- there is a representative in both groups.
观察 Racket 和 Rust(或 Go、Julia)的解法,可以得出另一个关于状态的结论。尽管 Racket 解法是“纯函数式”的,但它与 Rust 和 Go 解法一样有状态,而且状态数量相同。Racket 解法中的状态通过 values 子句一次性更新;更新按位置而不是按名称与其目标匹配,这很容易出错。(顺便一提,Rust 和 Julia 对边缘情况的处理比 Racket 或 Go 解法更清晰、明显更正确。)
Looking at the Racket and Rust (or Go, Julia) solutions leads to another conclusion, about state. Although the Racket solution is `pure functional', it is just as stateful as Rust and Go solutions. There is the same amount of state. The state is updated in the Racket solution at once, with the values clause; the update is matched to its target by position rather by name, which is quite error prone. (As a side observation, the handling of edge cases is clearer and obviously correct in Rust and Julia than in the Racket or Go solutions.)
因此,“纯函数式”并不会消除状态;它只是改变了处理状态的方式,而这可能更清晰方便,也可能更不清晰方便。毫无疑问能改进状态处理的,是把状态划分并封装进“更高级”的、可组合的操作中,比如 groupBy:参见 Clojure、Ruby、Python 解法。groupBy 本身可以用可变状态实现,也可以不用。无论如何,状态都被抽象掉了,groupBy 的用户不必关心它。
Thus being `pure functional' does not eradicate state; it merely changes the way state is handled -- which could be either more or less clearer and convenient. What unequivocally improves the handling of the state is partitioning and encapsulating it into `higher-level', composable operations like `groupBy': see Clojure, Ruby, Python solutions. The groupBy may itself be implemented with mutable state or without -- no matter, the state is abstracted away and the user of groupBy does not have to care about it.
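作为示意(译者补充的假设性实现,非原文代码),下面用可变状态实现一个连续分组的 groupBy:状态被封装在函数内部,调用者只看到纯粹的输入与输出:

```python
def group_by_run(xs):
    # 将相等的相邻元素分到同一组;内部使用可变状态,
    # 但这些状态对调用者完全不可见
    groups, cur = [], []
    for x in xs:
        if cur and cur[-1] != x:
            groups.append(cur)
            cur = []
        cur.append(x)
    if cur:
        groups.append(cur)
    return groups

print(group_by_run("aaaabbbcca"))
# [['a', 'a', 'a', 'a'], ['b', 'b', 'b'], ['c', 'c'], ['a']]
```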
惊喜 Pleasant surprises [jsr-0010]
第一个惊喜是 Group I(即“一次所有苹果”)解法的数量可观,而且当今使用的语言中,具备足够设施来表达这些解法的语言也不少。
The first pleasant surprise was the notable number of Group I, that is, all-the-apples-at-once solutions -- and the number of languages in use today with the adequate facilities to express these solutions.
引发此次竞赛的 Savo Djuric 的博客文章也描述了一个解法,值得引用。
The blog post by Savo Djuric that prompted the contest has also described a solution, which is worth quoting.
由于我最近才开始学习 Clojure,而且经常做 4clojure 练习,这个问题激起了我的兴趣,我决定立即尝试解决它。我启动了我的 REPL,几分钟后写出了:
(interleave (map str (map first (partition-by identity "aaaabbbcca")))
            (map count (partition-by identity "aaaabbbcca")))
=> ("a" 4 "b" 3 "c" 2 "a" 1)
Since I recently started learning Clojure, and I regularly do 4clojure exercises, this problem seemed intriguing, so I decided to try to solve it immediately. I fired up my REPL, and within several minutes i came up with:
耶!这有用!但它并不完全符合要求,也算不上很“干净”。这个解法最让我困扰的是重复的代码,所以我决定用 let 来解决:
(defn chop-chop [coll]
  (let [x (partition-by identity coll)]
    (map list (map (comp str first) x) (map count x))))
Yaay! It works! But it is not exactly what was asked for. Nor is it very ‘clean’, so to say. What bothers me the most in this solution is that i have repetitive code, so I decided to solve that with let:
我必须说,在几分钟之内解决问题感觉很棒(尽管这并不难),因为我只是刚开始 Clojure 的旅程。
I must say that it felt great to solve a problem within several minutes (albeit it was not that hard), since I'm only at the beginning of my journey with Clojure.
这篇博文展示了在我看来解决问题的理想方式:抓住问题的本质⸺groupBy,即 Clojure 中的 partition-by⸺给解法做出原型,理解其缺陷并想做得更好,然后采取步骤改进。我认为作者作为程序员前途光明。
我从他的博客中得知,作者从未上过大学,偶然进入编程领域(但受到他才华横溢的兄弟的启发),并且一直在自学编程语言(现在是 Clojure)。也许这就是(他能做到这件事的)秘密。
From his blog I learned that the author never went to college, got into programming by accident (but prompted by his talented brother), and has been studying programming languages (now, Clojure) entirely on his own. May be that's the secret.
另一个惊喜是看到了我眼中理想的解法:
Another pleasant surprise was seeing the solution that in my eyes is ideal:
[(k, len(list(g))) for k, g in itertools.groupby("aaaabbbcca")]
这个 Python 解法本质上是用常规的数学记号表达的,每个人都能读懂(与 APL 不同)。更重要的是,它在语言上很经济,这一点同样与 APL 不同。它只使用了两个直接相关的设施:groupBy 和 comprehension。在这里,记号显然是思维的工具。
This Python solution is expressed in what is essentially the conventional mathematical notation, readable by everyone (unlike APL). More importantly, it is linguistically economical -- again, unlike APL. It uses just the two directly relevant facilities: groupBy and comprehension. Notation is clearly here the tool of thought.
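这个单行解法可以直接包装成函数运行验证(译者补充的示意,rll 是假设的函数名):

```python
import itertools

def rll(s):
    # groupBy(itertools.groupby)后接 comprehension 映射
    return [(k, len(list(g))) for k, g in itertools.groupby(s)]

print(rll("aaaabbbcca"))  # [('a', 4), ('b', 3), ('c', 2), ('a', 1)]
```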
结论:编程和自然语言,软件和文学 [jsr-0011]
前面引用的一条评论一针见血:要描述一种算法,就需要词汇表。编程的很大一部分是构建(学习或构造)词汇,用来讨论常见的(子)问题,例如游程编码、groupBy、映射、累积、过滤等。学习、寻找和命名“正确的”抽象是数学的核心活动⸺正如《Notation as a Tool of Thought》(符号作为思维工具)中所阐述的那样。
A comment cited earlier hits the nail on the head: to describe an algorithm, one needs vocabulary. A large part of programming is building -- learning or constructing -- vocabulary to talk about common (sub)problems, e.g., run-length-encoding, groupBy, mapping, accumulating, filtering, etc. Learning, finding and naming the `right' abstractions is the core mathematical activity -- as explicated in the `Notation as a tool of thought'.
自然语言也是一种思维工具。建立词汇表并理解何时以及如何使用某个词,是学习一门语言最难的部分。自然语言和形式语言有很多共同之处,这一观点由来已久⸺蒙塔古在 20 世纪 70 年代曾对此作出著名而有力的阐述。在前面引用过的关于数组语言的论文中,基思·斯迈利进一步指出,编程教学应该借鉴外语教学的最佳理念。
A natural language is also a tool of thought. Building vocabulary and understanding when and how to use a particular word is the hardest part of learning a language. That natural and formal languages have a lot in common is the idea with a long pedigree -- famously and forcefully articulated by Montague in 1970s. Keith Smillie, in his paper about array languages cited earlier further argues that teaching programming ought to adapt the best ideas of teaching foreign languages.
无论是编程语言还是自然语言,人们都可以相对较快地掌握基础。而要达到流利,则需要几十年。似乎没有什么比不断练习(说和听、写和重写)更好的办法了。我想起欧内斯特·海明威与乔治·普林普顿(访谈中的“采访者”,也是《巴黎评论》的创始人和编辑)那次著名访谈的结尾(《巴黎评论》第 18 期,1958 年春季):
Both in programming and natural languages, one can acquire basics relatively quickly. Fluency takes decades. There doesn't seem to be anything better than constant practice, speaking and listening, writing and rewriting. The ending of the famous interview between Ernest Hemingway and George Plimpton (the `Interviewer', the magazine's founder and editor) (Paris Review, issue 18, Spring 1958) comes to mind:
采访者 你在读到前一天停下的地方时会重写吗?还是等到全部完成后再重写?
海明威 我总是每天重写到我停下的地方。都完成后,你自然会再看一遍。在别人打字时,你会在干净的排版中看到它,这也是另一个修改和重写的机会。在校样时你还有最后的机会。你会对这几次不同的机会心怀感激。
采访者 你进行了多少次重写?
海明威 看情况。《永别了,武器》的结尾,也就是最后一页,我重写了三十九次才满意。
采访者 是有什么技术问题吗?什么问题难住了你?
海明威 推敲用词。
INTERVIEWER Do you do any rewriting as you read up to the place you left off the day before? Or does that come later, when the whole is finished?
HEMINGWAY I always rewrite each day up to the point where I stopped. When it is all finished, naturally you go over it. You get another chance to correct and rewrite when someone else types it, and you see it clean in type. The last chance is in the proofs. You’re grateful for these different chances.
INTERVIEWER How much rewriting do you do?
HEMINGWAY It depends. I rewrote the ending to Farewell to Arms, the last page of it, thirty-nine times before I was satisfied.
INTERVIEWER Was there some technical problem there? What was it that had stumped you?
HEMINGWAY Getting the words right.
尾声:我的“苹果流”解决方案 [jsr-0012]
我会如何解决这个问题?第一次看到它时,我脑海中浮现的第一个念头是:groupBy,这是对序列中连续元素进行分组的操作的通用名称。然后我们需要把这些组以 (group-element, counter) 对的形式呈现出来。
How would I solve the problem? When I first saw it, the very first thought that came to mind was: groupBy, which is the common name for the operation to group consecutive elements of a sequence. We then need to present the groups in the form of (group-element, counter) pairs.
为了具体起见(以及其他很快就会清楚的原因),我将使用 OCaml。假设我们有以下操作
I will use OCaml for concreteness (and for other reasons that should be clear soon). Suppose we have the operation
group : 'a list -> 'a list list
它将列表中(根据某些标准)相等的连续元素分组到它们自己的列表中,并返回这些组的列表(组不能为空)。然后,该问题解法的第一个近似写法如下,非常接近上面的第一个 Clojure 解决方案:
which groups equal (according to some criterion) consecutive elements of a list in their own list, and returns the list of such groups. (A group cannot be empty). Then the first approximation to the solution is as follows, quite close to the first Clojure solution above:
let rll (group : char list -> char list list) : char list -> (char * int) list =
  group >> List.(map (fun g -> (hd g, length g)))
类型注释是可选的,但我倾向于写下它们。(使用 OCaml 的一个理由就是类型:不花哨,但足以展示数据的转换,又没有过多细节。)由于 group 是假设出来的,它作为参数(parameter/argument)出现。这里的 (>>) 是从左到右的函数组合(就像 F# 中那样)。把输入视为序列(例如列表)很方便。但在 OCaml 中字符串不是列表⸺不过借助几个标准库函数,可以轻松地把它转换成列表:
Type annotations are optional, but I prefer to write them. (One reason to use OCaml is types: not fancy, but sufficient to visualize data transformations, without too many details.) Since group is the assumption, it appears as the parameter (argument). Here, (>>) is the left-to-right functional composition (as in F#). It is convenient to think of input as a sequence, such as list. But a string is not a list in OCaml -- but can be easily turned into one, with the help of a couple standard library functions:
let rll (group : char list -> char list list) : string -> (char * int) list =
  String.to_seq >> List.of_seq >> group >> List.(map (fun g -> (hd g, length g)))
使用 APL 中十分常见的 fork 组合器(不过是从左到右的版本):
With the fork combinator, so frequent in APL (but a left-to-right version):
let fork : ('a -> 'b) -> ('a -> 'c) -> ('b -> 'c -> 'd) -> 'a -> 'd =
  fun f g h -> fun x -> h (f x) (g x)
这个解法甚至可以写得更清楚些:
the solution can be written even cleaner:
let rll (group : char list -> char list list) : string -> (char * int) list =
  String.to_seq >> List.of_seq >> group >> List.(map (fork hd length pair))
我们只需要 group 本身的实现就能运行这段代码。可惜,OCaml 标准库目前并不提供分组操作。我当然可以自己写一个。但请让我以更多的细节重新表述这个方法。
We only need the implementation of group itself to run this code. Alas, the OCaml standard library at present does not provide a grouping operation. I can of course write it. But let me rephrase this approach in more detail.
让我们从头开始,以更小的步骤开发解法,强调对数据流越来越紧的“抓握”。重点在于理解数据流,而不是编写循环或递归函数、与终止条件和边缘情况搏斗。我们也不会纠结于状态:需要时它会自然出现。分组同样自然而然,而且更简单。当我们得到解法时,只需看一眼就能知道它一定是正确的。它确实正确,也相当高效。它甚至可以在之后被机械地重铸成 C 以获得更高的效率⸺不是人们通常手写的那种 C,而是正确且没有问题的 C。
Let us start from scratch and develop the solution in smaller steps, emphasizing tighter and tighter `grasping' of the data flow. The stress is on understanding data flow, rather than on writing loops or recursive functions and struggling with termination conditions and edge cases. We won't obsess about state either: it comes naturally when needed. Grouping also comes naturally, and simpler. When we reach the solution, one look at it will tell that it must be right. It is. It is also reasonably efficient. It can even be later mechanically recast into C for better efficiency -- not the sort of C one typically writes by hand, but correct and problem-free.
首先,从问题陈述中我们很快就能理解,我们面对的是一个顺序问题:把字符序列局部地转换为其他序列(由对组成)。实际上,向示例输入 "aaaabbbcca" 追加更多字符,不会影响示例输出的前缀 ("a",4), ("b", 3), ("c", 2)。一旦理解了这种顺序特性,其余的几乎是水到渠成。
First, from the problem statement one quickly understands that we are facing a sequential problem: transforming a sequence of characters into some other sequence (of pairs), locally. Indeed, appending more characters to the sample input "aaaabbbcca" will not affect the prefix ("a",4), ("b", 3), ("c", 2) of the sample output. Once this sequential character is understood, the rest follows almost automatically.
为了思考和开发解法,我们需要一个
词汇表
,我们在过程中不断建立它。首先,我们需要一个表示序列的词。我们称之为
'a seq
:元素类型为 'a
的序列类型。现在我们先不要担心它是如何实现的,而是将其视为抽象类型。由于输入是字符串,我们显然需要操作
To think about and develop the solution, we need a vocabulary, which we build up as we go along. First we need a word for a sequence. Let's call it 'a seq: a type of a sequence of elements of type 'a. Let us not worry for now how it is actually implemented and treat it as an abstract type. Since the input is given as string, we obviously need the operation
of_string : string -> char seq
来将字符串
转换
为(或
渲染
为)字符序列。同样假设它已给定。再次重申,现在重要的是代表数据流的类型。
to transform (or, render) a string as a sequence of characters. Assume it is given. Again, what is important for now is the type, which represents the data flow.
接着我们必须处理如何将相邻字符分组在一起,这本质上是
有状态的
:字符属于哪个组取决于上下文,即取决于之前或之后的字符。我们采取后者,假设操作
Next we have to deal with grouping of neighboring characters, which is inherently stateful: which group the character belongs to depends on the context, that is, on the character that came before, or follows next. Let's go with the latter and assume the operation
look_ahead : 'a seq -> ('a * 'a option) seq
将序列的当前元素与下一个元素配对。下一个元素可能不存在(如果序列结束),所以使用
'a option
,允许缺失值。
that pairs the current element of sequence with the element that comes next. The next element may be absent (if the sequence is about to end), hence the 'a option type, allowing for the missing value.
然后,通过查看下一个元素,分组本身会插入分组中断:
The grouping itself is then inserting group breaks, determined by peeking at the next element:
type 'a annot = Plain of 'a | Break of 'a
let group : ('a * 'a option) seq -> 'a annot seq =
map @@ function
| (x,Some next) when x = next -> Plain x
| (x,_) -> Break x
也就是说,我们将
前瞻序列
转换为用
组边界
注释的元素序列:
Plain x
表示当前组继续,Break x
表示 x
是其组中的最后一个元素。下一个元素(如果有)则开始一个新的组。('a annot
类型可以以不同的方式实现,比如一个对 'a * bool
。)我们还假设
映射
操作存在
That is, we transform a looked-ahead sequence into a sequence of elements annotated with group boundaries: Plain x means the current group continues, Break x means x is the last element in its group. The next element (if any) begins a different group. (The 'a annot type could have been realized differently, as a pair 'a * bool.) We have also assumed the existence of the mapping operation
map : ('a -> 'b) -> 'a seq -> 'b seq
最后,我们必须计数,这再次需要状态(当前组中元素的数量)并在组边界
发出
元组:
Finally, we have to count, which again requires state (the count of elements in the current group) and emit the tuples at the group boundary:
let count : 'a annot seq -> ('a * int) seq =
map_accum_filter 0 @@ fun cnt -> function
| Plain x -> (None, cnt+1)
| Break x -> (Some (x,cnt+1), 0)
分组中断标记强制输出(描述刚刚结束的组)并重置计数器。我们假设存在 accumulating map-filter,这是流处理中相当流行的操作:
The group break forces the output (describing the group that has just ended) and resets the counter. We assumed the existence of accumulating map-filter, a rather popular operation in stream processing:
map_accum_filter : 'state -> ('state -> 'a -> 'b option * 'state) -> 'a seq -> 'b seq
映射
('state -> 'a -> 'b option * 'state)
接受当前状态和当前元素,并返回可能转换的元素和新状态的元组。如果可能转换的元素为 None,则输出流中不会发出任何内容;但状态会继续前进。从类型来看,map_accum_filter
看起来是“
纯
”的,是函数式的状态传递。记住这个想法。
The mapping ('state ->'a -> 'b option * 'state) takes the current state and the current element and returns a tuple of a possibly transformed element and the new state. If the possibly transformed element is None, nothing is emitted in the output stream; the state advances nevertheless. From the type of it, map_accum_filter looks `pure', with functional-state passing. Hold that thought.
就是这样。问题的解法,即游程编码的字符序列,就是我们到现在开发的操作的组合:
That is it. The solution to the problem, the run-length-encoded sequence of characters, is then the composition of what we have developed so far:
let rll = of_string >> look_ahead >> group >> count
它以简洁且明显正确的方式表达了问题。我们只需要在链中添加一个观察操作,将结果
(char * int) seq
转换为列表,或者直接将其打印出来。
which expresses the problem in the concise and obviously correct way. We only need to add to our chain an observation operation, to transform the resulting (char * int) seq to a list, or to print it out.
我们解法的一个优点是易于修改。例如,如果源字符串是 UTF-8 编码的,我们只需要在管道中添加一个操作:UTF-8 解码和
字素聚类
:
One advantage of our solution is the ease of modification. For example, if the source string is UTF-8 encoded, we only need to add to the pipeline one more operation: UTF-8 decoding and grapheme clustering:
utf_decode : char seq -> uchar seq
留给读者作为练习。
We leave this as an exercise to the reader.
要真正运行
rll
程序,我们必须履行承诺:为 'a seq
选择一个具体表示并实现我们为其假设的操作。用 OCaml 术语来说,我们必须实现以下签名:
To actually run the rll program, we have to fulfill our promises: to pick a concrete representation for 'a seq and implement the operations that we assumed for it. In OCaml terms, we have to implement the following signature:
module type simple_seq = sig
type 'a seq
val of_string : string -> char seq
val map : ('a -> 'b) -> 'a seq -> 'b seq
val look_ahead : 'a seq -> ('a * 'a option) seq
val map_accum_filter : 'state -> ('state ->'a -> 'b option * 'state) -> 'a seq -> 'b seq
val iter : ('a -> unit) -> 'a seq -> unit
val to_list : 'a seq -> 'a list
end
事实证明,OCaml 标准库几乎拥有我们需要的一切:
Seq.t
数据类型以及映射、字符串/列表转换等操作。缺少的是 map_accum_filter
和 look_ahead
。不过,标准库提供了序列上的 filter_map
。我们要做的只是添加状态,实现为可变状态:
It turns out that the OCaml standard library has almost everything we need: the Seq.t data type and the mapping, string/list conversion, etc. operations on it. What is missing is map_accum_filter and look_ahead. However, the standard library provides filter_map on sequences. We merely need to add state, realized as mutable state:
let map_accum_filter : 'state -> ('state -> 'a -> 'b option * 'state) -> 'a seq -> 'b seq =
fun init f seq ->
let stat = ref init in
seq |> Seq.filter_map (fun a -> let (bo,st) = f !stat a in stat := st; bo)
总体而言,
map_accum_filter
确实有一个具有显式
状态传递
的“函数式”接口;但实现是命令式的。明确的规范和快速的实现之间没有内在冲突。剩下的 look_ahead
很容易通过 map_accum_filter
来表达(我们将实现留给读者⸺或者参见随附的代码)。
The overall map_accum_filter does have a `functional' interface with the explicit state-passing; the implementation is imperative however. There is no inherent conflict between a clear specification and a fast implementation. The remaining look_ahead turns out easily expressed via map_accum_filter (we leave the implementation to the reader -- or see the accompanying code).
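(译者按:原文把 look_ahead 的实现留给读者。下面是译者基于 map_accum_filter 与标准库 Seq 的一种示意实现,并顺带组装出整条管道;其中用 None 哨兵标记序列结束的做法是译者的假设,非原文代码。)

```ocaml
let ( >> ) f g x = g (f x)

type 'a annot = Plain of 'a | Break of 'a

(* functional-state interface, imperative inside, as in the article *)
let map_accum_filter init f seq =
  let st = ref init in
  Seq.filter_map (fun a -> let b, s = f !st a in st := s; b) seq

(* look_ahead via map_accum_filter: append a None sentinel so the last
   element can still be emitted; the state is the previous element *)
let look_ahead seq =
  let opts = Seq.append (Seq.map Option.some seq) (Seq.cons None Seq.empty) in
  map_accum_filter None
    (fun prev cur ->
      match prev with
      | None -> (None, cur)                  (* first element: emit nothing yet *)
      | Some p -> (Some (p, cur), cur))
    opts

let group =
  Seq.map (function
    | x, Some next when x = next -> Plain x
    | x, _ -> Break x)

let count seq =
  map_accum_filter 0
    (fun cnt -> function
      | Plain _ -> (None, cnt + 1)
      | Break x -> (Some (x, cnt + 1), 0))
    seq

let rll = String.to_seq >> look_ahead >> group >> count

let () =
  assert (List.of_seq (rll "aaaabbbcca")
          = [ ('a', 4); ('b', 3); ('c', 2); ('a', 1) ])
```

注意这里的 Seq 只被遍历一次,因此 map_accum_filter 内部的可变引用是安全的,这正是正文所说“函数式接口、命令式实现”的含义。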
回顾我们的解法,我们注意到所有状态最终都被封装了。
前瞻
的状态被封装在
look_ahead
中。分组本身变成了无状态的。计数需要多一个状态。每个操作都保留其所需的状态,而无需了解或干涉其他操作的状态。
Looking back at our solution, we notice that all state ends up encapsulated. The state for look-ahead is encapsulated in look_ahead. The grouping per se turned out stateless. Counting requires one more piece of state. Each operation keeps the state it needs, without knowing or interfering with the state of others.
我们还注意到这里没有循环或递归函数,也不必处理终止条件。这是所有 Group I 解法的特点,早在 APL 中就已明确阐述:指定如何更改集合的内容或其形状,但不要浪费时间思考如何取出元素或把它们放到哪里。
We also notice that there are no loops or recursive functions. Neither did we have to deal with termination conditions. That is the feature of all Group I solutions, clearly articulated back in APL: specify how to change the content of a collection or its shape, but don't waste time thinking how to get the elements or where to put them.
我们的
'a seq
也可以实现为
迭代器
/
生成器
,通过 for
循环进行组合,拥有良好的性能。即使是 OCaml 的 Seq.t
在设计上也会进行一定的
融合
(尽管不完全:仍有中间数据结构,但大小恒定)。但如果使用 strymonas 这样的库,就可以实现完美的融合,还能生成 C 代码。生成的 C 代码如下。
Our 'a seq can also be realized as iterator/generator, to be composed via for-loops -- with good performance. Even OCaml Seq.t fuses by design (albeit incompletely: there are still intermediate data structures, but of constant size). Perfect fusion is achievable, if one uses, say, strymonas, which also permits the generation of C code. Here it is.
void rll(const int64_t * aa_1,const int64_t al_2)
{
int64_t v_3 = 0;
int64_t v_4 = 0;
bool v_5 = 0;
int64_t v_6 = 0;
while (v_6 <= al_2)
{
int64_t t_7;
t_7 = v_6;
v_6++;
if (t_7 < al_2)
{
int64_t t_9;
t_9 = aa_1[t_7];
if (v_5)
{
int64_t t_10;
t_10 = v_4;
v_4 = t_9;
if (!(t_10 == t_9))
{
int64_t t_11;
t_11 = v_3;
v_3 = 0;
printf("%ld\n",(long)t_10);
printf("%ld\n",(long)(t_11 + 1));
}
else {
v_3++;
}
}
else {
v_4 = t_9;
v_5 = 1;
}
}
else {
if (v_5)
{
int64_t t_8;
t_8 = v_3;
v_3 = 0;
printf("%ld\n",(long)v_4);
printf("%ld\n",(long)(t_8 + 1));
}
}
}}
还有一件事:第一次尝试就成功了。
One more thing: it all worked on the first try.
引用
References
[#276]
引用 References [#276]
apples.ml [4K]
完整的源码
apples.ml [4K]
The complete source code
apples.c [2K]
strymonas 解法:生成的 C 代码。有关生成器,请参阅 strymonas 发行版中的 examples/run-length-encoding
目录。
apples.c [2K]
The strymonas solution: generated C code. For the generator, see the directory examples/run-length-encoding in the strymonas distribution.
Incremental, Linear Pretty-printing
分步式
的
管道
开发可以走得很远。上述参考资料开发了一种相当
不平凡
的算法,具有最理想的、难以实现的性能。
Incremental, Linear Pretty-printing
The stepwise pipeline development can be taken quite far. The above reference develops a rather non-trivial algorithm with the optimal, difficult to achieve performance.
Translation. Using zippers to handle huge trees [jsr-000Q]
- December 22, 2024
-
Diego Olivier Fernandez Pons with contributions from Jinser Kafka
- Original
Translation. Using zippers to handle huge trees [jsr-000Q]
- December 22, 2024
- Diego Olivier Fernandez Pons with contributions from Jinser Kafka
- Original
This article is translated from Diego Olivier Fernandez Pons's article titled "Using zippers to handle huge trees".
本文翻译自 Diego Olivier Fernandez Pons 题为《Using zippers to handle huge trees》的文章
tips: Click on the translation to expand and view the original text. 点击译文可展开查看原文。有修复原文 typo。
嗨,
Bonjour,
2003年4月9日星期三,Yang Shouxun 写道:
我不知道如何编写尾递归版本来构建树。如果连续属性不多且数据集也不是很大,那树会在堆栈溢出之前停止生长。
(译者按:连续属性是指其在数据集中可以取无限多值,通常是实数,例如身高、温度等。)
On Wednesday, 09 Apr 2003, Yang Shouxun wrote:
> I don't know how to write a tail recursive version to build trees.
> If there are not that many continuous attributes and the dataset is
> not so large, the tree stops growing before stack overflow.
2003 年 4 月 9 日星期三,Markus Mottl 写道:
这里的诀窍是使用延续传递风格(CPS):传递一个包含后续计算所需的所有内容的函数闭包(延续)。子函数不返回结果,而是使用结果调用延续,这使得函数成为尾递归。
On Wednesday 09 April 2003, Markus Mottl wrote:
> The trick is to use continuation passing style (CPS): you pass a
> function closure (continuation) containing everything that's needed
> in subsequent computations. Instead of returning a result, the
> sub-function calls the continuation with the result, which makes the
> functions tail-recursive.
zipper 是一种处理巨大(实际是无限)tree 的简单方法。
Zippers are a simple way to handle huge (in fact infinite) trees.
Zipper 的解释
Explanation of zippers
[#261]
- December 22, 2024
- Diego Olivier Fernandez Pons
Zipper 的解释 Explanation of zippers [#261]
- December 22, 2024
- Diego Olivier Fernandez Pons
zipper 可以被视为一种“函数式指针”,因为它能够提供:
- 纯函数式且有类型的操作
- O(1) 的指向元素访问
- O(1) 的指针移动
Zippers may be seen as 'functional pointers' since they offer :
- purely functional and typed operations
- O(1) access to the pointed element
- O(1) pointer movements
限制是每个数据结构只允许有一个指针,且每次指针移动都会分配内存。
The restrictions are that only one pointer is allowed by data structure and every pointer movement allocates memory.
采用二叉搜索树的经典类型声明
Take a classical type declaration for binary search trees
type 'a tree = E | N of 'a tree * 'a * 'a tree * int
考虑一棵二叉搜索树和一个你想要指向的内部节点。要对指向的子树进行
O(1)
访问,它必须可以直接从数据结构的基础中获取。然后,树的其余部分必须保存在单独的地方。我们将沿着从根到指向的节点的路径来解构它
Consider a binary search tree and an inner node to which you want to
point. To have a 0(1) access to the pointed subtree, it has to be
directly available from the base of the data structure. Then, the
rest of the tree must be kept in a separate place. We will deconstruct
it along the path from the root to the pointed node
type 'a path =
| Root
| Left of 'a * 'a tree * 'a path
| Right of 'a tree * 'a * 'a path
type 'a zipper = 'a tree * 'a path
zipper
约束
如前所述,提供:
- 被指向的子树
- 沿着到根的路径被拆开的树的其余部分
The zipper constraint as announced :
- the pointed subtree
- the rest of the tree breaked along the path to the root
接着我们定义指针的移动(数据结构中的每个指针各一个):
Then we define the pointer movements (one for each pointer in the data structure) :
exception Bottom
exception Top
(* To be replaced by a balancing constructor *)
let makeDTree = fun l v r -> N (l, v, r, 0)
let move_left = fun (tree, path) ->
match tree with
| E -> raise Bottom
| N (l, v, r, _) -> (l, Left (v, r, path))
let move_right = fun (tree, path) ->
match tree with
| E -> raise Bottom
| N (l, v, r, _) -> (r, Right (l, v, path))
let move_up = fun (tree, path) ->
match path with
| Root -> raise Top
| Left (v, r, tail) -> (makeDTree tree v r, tail)
| Right (l, v, tail) -> (makeDTree l v tree, tail)
现在我们可以通过以下过程构建任意大的树:
- 构建一棵深度有界的树
- 选择接下来要展开的节点
- 将当前指针移动到该节点
- 继续构建树
Now we can build an arbitrary large tree by the following procedure :
- build a tree of bounded depth
- choose the node which will be developed next
- move the current pointer to that node
- continue building the tree
该过程使用有界的堆栈空间。
This procedure uses a bounded stack space
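(译者按:下面是译者按上述过程写的示意代码,非原文内容:用上文定义的类型,沿最右侧的脊柱构建任意深的树。grow 是尾递归的,所以无论树多深,OCaml 的调用栈都保持有界;节点取整数值仅为演示。)

```ocaml
type 'a tree = E | N of 'a tree * 'a * 'a tree * int

type 'a path =
  | Root
  | Left of 'a * 'a tree * 'a path
  | Right of 'a tree * 'a * 'a path

let makeDTree l v r = N (l, v, r, 0)

(* grow n nodes along the rightmost spine; tail-recursive, so the stack
   stays bounded no matter how deep the tree gets *)
let rec grow n ((tree, path) : 'a tree * 'a path) =
  if n = 0 then (tree, path)
  else
    match tree with
    | E -> grow (n - 1) (makeDTree E n E, path)         (* develop the pointed leaf *)
    | N (l, v, r, _) -> grow n (r, Right (l, v, path))  (* move to the right child *)

(* depth of the path = how far the pointer is from the root *)
let rec path_depth acc = function
  | Root -> acc
  | Left (_, _, p) | Right (_, _, p) -> path_depth (acc + 1) p

let () =
  let _, path = grow 100_000 (E, Root) in
  (* the pointer sits at the last created node, 99_999 steps below the root *)
  assert (path_depth 0 path = 99_999)
```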
相关工作
Related work
[#262]
- December 22, 2024
- Diego Olivier Fernandez Pons
相关工作 Related work [#262]
- December 22, 2024
- Diego Olivier Fernandez Pons
zipper 由 Gérard Huet 提出。JFP 上有一篇论文:G. Huet. The Zipper. J. Functional Programming, 7 (5), Sept 1997, pp. 549--554。
Zippers were invented by Gérard Huet. There is a paper on the JFP
G. Huet. The Zipper. J. Functional Programming, 7 (5), Sept 1997, pp. 549--554.
他在例子中使用了 n 叉树和二叉树。这两者主要的区别在于,在二叉树中,指针的组织方式不同(他的“左”操作在 Baire 中是
(left o up)
)。
(译者按:Baire 是 Diego Olivier Fernandez Pons 开发的 Caml 的数据结构库。)
He uses n-ary trees and binary trees in his examples. The main
difference is that in binary trees the pointers are not organized in
the same way (his 'left' operation is what in Baire is (left o up))
Ralf Hinze 试图为函数指针提供一个通用框架,称之为
网
(你提供数据结构和基本转换,数据结构完成其余的工作)。
Ralf Hinze has tried to give a general framework for functional
pointers named a web (you give your data structure and the basic
transformations and the data structure does the rest)
Ralf Hinze and Johan Jeuring. Functional Pearl: Weaving a Web. Journal of Functional Programming, 11(6):681-689, November 2001。
Ralf Hinze and Johan Jeuring. Functional Pearl: Weaving a Web. Journal
of Functional Programming, 11(6):681-689, November 2001.
这篇文章可以通过 Hinze 的主页在网上获取。在我看来,他的文章并没有真正令人信服,也不是很清楚。
Available on the net via Hinze's home page.
In my opinion his article is not really convincing and not very clear.
一些已经使用 zipper 的库:
- Zen(Gérard Huet,Caml)使用 zipper 处理无环自动机的最小化
- SML/NJ 标准库(John Reppy)使用 zipper 处理红黑树中的删除
- MetaPRL(Caml)使用 zipper 处理伸展树和红黑树中的插入与删除
- Grammatical Framework(Aarne Ranta,Haskell)使用 zipper 在 n 叉树中导航
所有这些代码都可以在网络上找到。
Several libraries already use zippers
- Zen (Gérard Huet, Caml) uses zippers to handle acyclic automata
minimization
- SML/NJ Standard library (John Reppy) uses zippers to handle deletion
in red-black trees
- MetaPRL (Caml) uses zippers to handle insertion and deletion in
splay and red-black trees
- Grammatical Framework (Aarne Ranta, Haskell) uses zippers to
navigate through n-ary trees.
All this code is available on the web.
代码示例
Examples of code
[#263]
- December 22, 2024
- Diego Olivier Fernandez Pons
代码示例 Examples of code [#263]
- December 22, 2024
- Diego Olivier Fernandez Pons
以下是从 Baire 中提取的一些常用二叉搜索树操作的示例(
insert
,delete
,move_pointer_to
,...)(当前版本 11 avril 2003,大量错误和损坏的代码,您可以在 Baire 的下载页面中找到它) :
(译者按:Baire 相关资源基本已不可用。)
Here are some examples of usual binary search trees operations made
with zippers (insert, delete, move_pointer_to, ...) extracted from
Baire (current version 11 avril 2003, plenty of bugs and breaked
code, you will find it in Baire's download pages) :
(译者按:使用 ocamlformat
重新格式化,与原文的格式稍有不同。)
let rec move_to_top = function
| (tree, path) as pointer ->
(match path with
| Root -> pointer
| Left (v, r, tail) -> move_to_top (makeDTree tree v r, tail)
| Right (l, v, tail) -> move_to_top (makeDTree l v tree, tail))
;;
let rec move_to x = function
| (tree, path) as pointer ->
(match tree with
| E ->
(match path with
| Right (_, rv, _) when x <= rv -> move_to x (move_up pointer)
| Left (lv, _, _) when x >= lv -> move_to x (move_up pointer)
| _ -> pointer)
| N (_, v, _, _) ->
(match compare x v with
| n when n < 0 ->
(match path with
| Right (_, rv, _) when x < rv -> move_to x (move_up pointer)
| Right _ | Root | Left _ -> move_to x (move_left pointer))
| n when n > 0 ->
(match path with
| Left (lv, _, _) when x > lv -> move_to x (move_up pointer)
| Left _ | Root | Right _ -> move_to x (move_right pointer))
| _ -> pointer))
;;
let rec member_path x = function
| Right (l, v, tail) ->
(match compare x v with
| n when n < 0 -> member x l
| 0 -> true
| _ -> member_path x tail)
| Left (v, r, tail) ->
(match compare x v with
| n when n > 0 -> member x r
| 0 -> true
| _ -> member_path x tail)
| Root -> false
;;
let rec zipper_member x = function
| tree, path ->
(match tree with
| E -> member_path x path
| N (l, v, r, _) ->
(match compare x v with
| n when n < 0 ->
(match path with
| Right (_, rv, _) when x <= rv -> member_path x path
| Right _ | Root | Left _ -> member x l)
| n when n > 0 ->
(match path with
| Left (lv, _, _) when x >= lv -> member_path x path
| Left _ | Root | Right _ -> member x r)
| _ -> true))
;;
let current_tree = function
| tree, _ -> tree
;;
let current_value = function
| tree, _ ->
(match tree with
| E -> None
| N (_, v, _, _) -> Some v)
;;
let current_value' = function
| tree, _ ->
(match tree with
| E -> raise Empty
| N (_, v, _, _) -> v)
;;
let rec zipper_insert x = function
| (tree, path) as pointer ->
(match tree with
| E ->
(match path with
| Right (_, rv, _) when x <= rv -> zipper_insert x (move_up pointer)
| Left (lv, _, _) when x >= lv -> zipper_insert x (move_up pointer)
| _ -> makeTree E x E, path)
| N (_, v, _, _) ->
(match compare x v with
| n when n < 0 ->
(match path with
| Right (_, rv, _) when x < rv -> zipper_insert x (move_up pointer)
| Right _ | Root | Left _ -> zipper_insert x (move_left pointer))
| n when n > 0 ->
(match path with
| Left (lv, _, _) when x > lv -> zipper_insert x (move_up pointer)
| Left _ | Root | Right _ -> zipper_insert x (move_right pointer))
| _ -> pointer))
;;
let rec zipper_delete x = function
| (tree, path) as pointer ->
(match tree with
| E ->
(match path with
| Right (_, rv, _) when x <= rv -> zipper_delete x (move_up pointer)
| Left (lv, _, _) when x >= lv -> zipper_delete x (move_up pointer)
| _ -> pointer)
| N (l, v, r, _) ->
(match compare x v with
| n when n < 0 ->
(match path with
| Right (_, rv, _) when x < rv -> zipper_delete x (move_up pointer)
| Right _ | Root | Left _ -> zipper_delete x (move_left pointer))
| n when n > 0 ->
(match path with
| Left (lv, _, _) when x > lv -> zipper_delete x (move_up pointer)
| Left _ | Root | Right _ -> zipper_delete x (move_right pointer))
| _ -> move_to x (appendB l r, path)))
;;
let rec path_to_list result = function
| Root -> result
| Left (v, r, path) -> path_to_list (result @ (v :: to_ordered_list r)) path
| Right (l, v, path) -> path_to_list (to_ordered_list_rec (v :: result) l) path
;;
let zipper_to_list = function
| tree, path -> path_to_list (to_list tree) path
;;
Translation. Guile's reader, in guile [jsr-000R]
- December 22, 2024
-
Andy Wingo with contributions from Jinser Kafka
- Original
Translation. Guile's reader, in guile [jsr-000R]
- December 22, 2024
- Andy Wingo with contributions from Jinser Kafka
- Original
This article is translated from Andy Wingo's article titled "guile's reader, in guile".
本文翻译自 Andy Wingo 题为《guile's reader, in guile》的文章
tips: Click on the translation to expand and view the original text. 点击译文可展开查看原文。
晚上好!今天简短地记录一些关于 Guile 的
书呆子式
琐事。
Good evening! A brief note today about some Guile nargery.
历史的弧线
the arc of history
[#259]
- December 22, 2024
- Andy Wingo
历史的弧线 the arc of history [#259]
- December 22, 2024
- Andy Wingo
就像许多诞生于打开收音机就能期望听到
威豹乐队
的年代的语言实现一样,Guile 也有下半部分和上半部分。下半部分是用 C 编写的,暴露出一个共享库和一个可执行文件;上半部分是用语言本身(在 Guile 的情况下是 Scheme)编写的,并在语言实现启动时由 C 代码以某种方式加载。
Like many language implementations that started life when you could turn on the radio and expect to hear Def Leppard, Guile has a bottom half and a top half. The bottom half is written in C and exposes a shared library and an executable, and the top half is written in the language itself (Scheme, in the case of Guile) and somehow loaded by the C code when the language implementation starts.
自 2010 年左右以来,我们一直在努力把用 C 编写的部分改用 Scheme 编写。上周的信件讨论了 dynamic-link 实现的替换:从使用
libltdl
库,换成基于低级 dlopen
包装器的 Scheme 实现。我以前写过关于在 Scheme 中重写 eval
的内容,更近期则谈到在 Scheme 中实现与 C 语言实现相同性能的道路有时是漫长的。
Since 2010 or so we have been working at replacing bits written in C with bits written in Scheme. Last week's missive was about replacing the implementation of dynamic-link from using the libltdl library to using Scheme on top of a low-level dlopen wrapper. I've written about rewriting eval in Scheme, and more recently about how the road to getting the performance of C implementations in Scheme has been sometimes long.
这些重写有
堂吉诃德式
的一面。我内心深处有一些关于正确与错误的感觉,并且我从根本上知道从 C 转向 Scheme 是正确的事情。很多时候这种感觉都是完全不理性的,在许多情况下也显得不合时宜⸺比如,如果你有一项需要为客户完成的任务,你需要坐下来思考从这里到目标的最小步骤,而直觉对于你如何到达那里并没有多大作用。但有一个项目可以让你按照自己喜欢的方式做某件事是美好的,即使这需要 10 年,那也没关系。
These rewrites have a quixotic aspect to them. I feel something in my gut about rightness and wrongness and I know at a base level that moving from C to Scheme is the right thing. Much of it is completely irrational and can be out of place in a lot of contexts -- like if you have a task to get done for a customer, you need to sit and think about minimal steps from here to the goal and the gut doesn't have much of a role to play in how you get there. But it's nice to have a project where you can do a thing in the way you'd like, and if it takes 10 years, that's fine.
不过除了难以言表的动机之外,用 Scheme 重写一些东西还是有具体的好处的。我发现 Scheme 代码更容易维护,嗯,而且相比 C 的常见
陷阱
,Scheme 显然更安全。如果有一天我重写 Guile 的垃圾收集器,这会减少我的工作量。而且,Scheme 代码还具有 C 语言所没有的功能:
尾部调用
、
可恢复的分隔延续
、
运行时检测
等等。
But besides the ineffable motivations, there are concrete advantages to rewriting something in Scheme. I find Scheme code to be more maintainable, yes, and more secure relative to the common pitfalls of C, obviously. It decreases the amount of work I will have when one day I rewrite Guile's garbage collector. But also, Scheme code gets things that C can't have: tail calls, resumable delimited continuations, run-time instrumentation, and so on.
以
定界延续
为例,大约五年前,我以 Parallel Concurrent ML 为模型,为 Guile 编写了一个轻量级并发设施。它允许系统上存在数百万条 fiber。当一条 fiber 需要在 I/O 操作(读或写)上阻塞时,它转而暂停自己的延续,并安排在操作变得可能时重启它。
Taking delimited continuations as an example, five years ago or so I wrote a lightweight concurrency facility for Guile, modelled on Parallel Concurrent ML. It lets millions of fibers to exist on a system. When a fiber would need to block on an I/O operation (read or write), instead it suspends its continuation, and arranges to restart it when the operation becomes possible.
为了让这一切成真,Guile 必须做出很多改变。首先是定界延续本身。后来是 Scheme 中 ports 设施上半部分的完全重写,以允许 port 操作挂起和恢复。可恢复 fibers 的许多障碍已被消除,但 Fibers 手册仍然列出了相当多的障碍。
A lot had to change in Guile for this to become a reality. Firstly, delimited continuations themselves. Later, a complete rewrite of the top half of the ports facility in Scheme, to allow port operations to suspend and resume. Many of the barriers to resumable fibers were removed, but the Fibers manual still names quite a few.
Scheme read, in Scheme [#260]
- December 22, 2024
- Andy Wingo
Scheme read, in Scheme [#260]
- December 22, 2024
- Andy Wingo
这就带来了我们今天的记录:我刚刚用 Scheme 重写了 Guile 的 reader!reader 是获取字符流并将其解析为 S 表达式的部分。它以前是 C 写的,现在是 Scheme。
Which brings us to today's note: I just rewrote Guile's reader in Scheme too! The reader is the bit that takes a stream of characters and parses it into S-expressions. It was in C, and now is in Scheme.
这样做的主要动机之一是希望使 read 可挂起。通过此更改,现在可以在 fibers 上实现 REPL(读取-评估-打印循环)。
One of the primary motivators for this was to allow read to be suspendable. With this change, read-eval-print loops are now implementable on fibers.
另一个动机是最终修复 Guile 无法记录某些数据源位置的 bug。Guile 过去会使用
弱键
哈希表来使从 read 返回的数据与源位置相关联。但这仅适用于 fresh value,不适用于小整数或字符等立即数,也不适用于 keyword 和 symbol 等全局唯一的非立即数。因此对于这些,我们就没有任何源位置。
Another motivation was to finally fix a bug in which Guile couldn't record source locations for some kinds of datums. It used to be that Guile would use a weak-key hash table to associate datums returned from read with source locations. But this only works for fresh values, not for immediate values like small integers or characters, nor does it work for globally unique non-immediates like keywords and symbols. So for these, we just wouldn't have any source locations.
该问题的一个可靠解决方案是返回带
注解
的对象,而不是使用另外的表。由于 Scheme 的宏扩展器已经被设置为与带注解的对象(语法对象)一起使用,因此一个新的 read-syntax 接口会非常好用。
A robust solution to that problem is to return annotated objects rather than using a side table. Since Scheme's macro expander is already set to work with annotated objects (syntax objects), a new read-syntax interface would do us a treat.
在 C 语言中实现
read
很难做到。但在 Scheme 中实现 read
则毫无问题。不过,调整扩展器以期望语法对象内包含源位置有些繁琐,而且源位置信息的增加使输出文件大了几个百分比⸺部分来自 .debug_lines
DWARF 数据的增大,也来自宏中语法对象的序列化源位置。
With read in C, this was hard to do. But with read in Scheme, it was no problem to implement. Adapting the expander to expect source locations inside syntax objects was a bit fiddly, though, and the resulting increase in source location information makes the output files bigger by a few percent -- due somewhat to the increased size of the .debug_lines DWARF data, but also due to serialized source locations for syntax objects in macros.
速度方面,目前切换到 Scheme 的
read
是一个
退步
。旧的 reader 在这台笔记本电脑上记录源位置时每秒大概可以解析 15 或 16 MB,或者关闭源位置,那么有 22 或 23 MB/s。新的 reader 在旧模式下,使用弱键侧表记录源位置的解析速度大概为 10.5 MB/s,关闭位置时为 13.5 MB/s。新的 read-syntax
速度大约是 12 MB/s。我们将在未来几个月继续优化这些性能,但与原来的 reader 编写时的情况不同的是,现在的 reader 主要在编译时使用。(它在读取 s 表达式作为数据时仍然有用,因此仍然有理由提升其速度。)
Speed-wise, switching to read in Scheme is a regression, currently. The old reader could parse around 15 or 16 megabytes per second when recording source locations on this laptop, or around 22 or 23 MB/s with source locations off. The new one parses more like 10.5 MB/s, or 13.5 MB/s with positions off, when in the old mode where it uses a weak-key side table to record source locations. The new read-syntax runs at around 12 MB/s. We'll be noodling at these in the coming months, but unlike when the original reader was written, at least now the reader is mainly used only at compile time. (It still has a role when reading s-expressions as data, so there is still a reason to make it fast.)
与
eval
的情况一样,在加载 Scheme 版本之前,我们仍然有一个 C 版本的 reader 可用于引导。令人高兴的是,借着这次重写,我得以从 C reader 中删除所有与非默认词法语法相关的陈年杂物,很好地简化了未来的维护。
As is the case with eval, we still have a C version of the reader available for bootstrapping purposes, before the Scheme version is loaded. Happily, with this rewrite I was able to remove all of the cruft from the C reader related to non-default lexical syntax, which simplifies maintenance going forward.
尝试
逐个 bug
重写的一个有趣方面是你会发现 bug 和意外行为。比如,事实证明,从出现以来,Guile 总是不需要终止分隔符地 read
#t
和 #f
,因此 read "(#t1)"
将得到列表 (#t 1)
。很奇怪,对吧?更奇怪的是,当 #true
和 #false
别名被添加到语言中,Guile 决定默认支持它们,但以一种奇怪的向后兼容的方式…所以 "(#false1)"
读作 (#f 1)
但 "(#falsa1)"
读作 (#f alsa1)
。诸如此类的事还有不少。
An interesting aspect of attempting to make a bug-for-bug rewrite is that you find bugs and unexpected behavior. For example, it turns out that since the dawn of time, Guile always read #t and #f without requiring a terminating delimiter, so reading "(#t1)" would result in the list (#t 1). Weird, right? Weirder still, when the #true and #false aliases were added to the language, Guile decided to support them by default, but in an oddly backwards-compatible way... so "(#false1)" reads as (#f 1) but "(#falsa1)" reads as (#f alsa1). Quite a few more things like that.
总的来说,这次重写似乎是成功的,没有引入新的行为,甚至产生了相同的错误。然而,对于
回溯
而言,情况并非如此,因为回溯可以暴露出 read 函数的内部实现,而之前由于 C 栈对 Scheme 是不透明的,这种情况并不会发生。因此,我们可能需要在调用 read 的地方添加更合理的错误处理,因为回溯信息无论如何都不是一个好的面向用户的错误反馈。
All in all it would seem to be a successful rewrite, introducing no new behavior, even producing the same errors. However, this is not the case for backtraces, which can expose the guts of read in cases where that previously wouldn't happen because the C stack was opaque to Scheme. Probably we will simply need to add more sensible error handling around callers to read, as a backtrace isn't a good user-facing error anyway.
好吧,今晚的闲聊已经够多了。祝大家 happy hacking,晚安!
OK enough rambling for this evening. Happy hacking to all and to all a good night!
Translation. Baby’s First Garbage Collector [jsr-000P]
- December 20, 2024
-
Bob Nystrom with contributions from Jinser Kafka
- Original
Translation. Baby’s First Garbage Collector [jsr-000P]
- December 20, 2024
- Bob Nystrom with contributions from Jinser Kafka
- Original
This article is translated from Bob Nystrom's article titled "Baby’s First Garbage Collector".
本文翻译自 Bob Nystrom 题为《Baby’s First Garbage Collector》的文章
tips: Click on the translation to expand and view the original text. 点击译文可展开查看原文。
当我感到压力大、要做的事情太多时,我会产生一种矛盾的反应:通过找其他事情来逃避。通常这件事是一个我可以编写和完成的小型独立程序。
When I get stressed out and have too much to do, I have this paradoxical reaction where I escape from that by coming up with another thing to do. Usually it’s a tiny self-contained program that I can write and finish.
一天早上,我正对我正在写的书、我在工作中必须做的事情以及我正在为《Strange Loop》准备的演讲感到害怕,突然间,我想,“我应该写一个垃圾回收器。”
The other morning, I was freaking myself out about the book I’m working on (http://gameprogrammingpatterns.com/) and the stuff I have to do at work (http://dart.dev/) and a talk I’m preparing for Strange Loop (https://www.infoq.com/presentations/dart-introduction/), and all of the sudden, I thought, “I should write a garbage collector.”
是的,我意识到那段话让我看起来有多么疯狂。但我搭错的神经,给你带来的是一份关于编程语言实现基础部件的免费教程!用大约一百行常规 C 代码,我成功地写出了一个基础的标记清除回收器,它真的,你知道的,能回收。
Yes, I realize how crazy that paragraph makes me seem. But my faulty wiring is your free tutorial on a fundamental piece of programming language implementation! In about a hundred lines of vanilla C, I managed to whip up a basic mark-and-sweep (https://en.wikipedia.org/wiki/Tracing_garbage_collection#Na%C3%AFve_mark-and-sweep) collector that actually, you know, collects.
垃圾回收被认为是编程中鲨鱼出没的水域之一,但在这篇文章里,我会给你一个可以戏水的儿童泳池。(里面可能还是有鲨鱼,但至少水会浅一些。)
Garbage collection is considered one of the more shark-infested waters of programming, but in this post, I’ll give you a nice kiddie pool to paddle around in. (There may still be sharks in it, but at least it will be shallower.)
少使用、物尽其用、循环再用
Reduce, reuse, recycle
[#264]
- December 20, 2024
- Bob Nystrom
少使用、物尽其用、循环再用 Reduce, reuse, recycle [#264]
- December 20, 2024
- Bob Nystrom
垃圾回收背后的基本思想是该语言(在大多数情况下)似乎可以访问无限的内存。开发者可以不断地分配、分配、分配,就像变魔术一样,内存永远不会用完。
The basic idea behind garbage collection is that the language (for the most part) appears to have access to infinite memory. The developer can just keep allocating and allocating and allocating and, as if by magic, never run out.
当然,计算机并不具有无限的内存。因此,实现这个魔术的方式是,当它需要分配一些内存并意识到内存不足时,它会回收垃圾。
Of course, machines don’t have infinite memory. So the way the implementation does this is that when it needs to allocate a bit of memory and it realizes it’s running low, it collects garbage.
在这个语境中,“垃圾”是指之前分配的、现在不再使用的内存。为了让无限内存的幻觉成立,语言对“不再被使用”的判定必须非常保守。如果你的程序正试图访问某些对象时它们就突然被回收了,那可不好玩。
“Garbage” in this context means memory it previously allocated that is no longer being used. For the illusion of infinite memory to work, the language needs to be very safe about “no longer being used”. It would be no fun if random objects just started getting reclaimed while your program was trying to access them.
为了可回收,该语言必须确保程序无法再次使用该对象。如果程序无法获取该对象的引用,那么它显然无法再次使用它。所以“使用中”的定义实际上非常简单:
In order to be collectible, the language has to ensure there’s no way for the program to use that object again. If the program can’t get a reference to the object, then it obviously can’t use it again. So the definition of “in use” is actually pretty simple:
1. Any object being referenced by a variable still in scope is in use.
2. Any object referenced by another in-use object is in use.
第二条规则是递归规则。如果对象 A 被变量引用,并且它有一些引用对象 B 的字段,则 B 正在使用中,因为你可以通过 A 访问它。
The second rule is the recursive one. If object A is referenced by a variable, and it has some field that references object B, then B is in use since you can get to it through A.
最后我们得到的是可达对象的图⸺可达对象即所有可以从一个变量开始,遍历其对象来到达的对象。任何不在可达对象图中的对象对于程序来说都是死的,它的内存已经时机成熟,可以被回收了。
The end result is a graph of reachable objects—all of the objects in the world that you can get to by starting at a variable and traversing through objects. Any object not in that graph of reachable objects is dead to the program and its memory is ripe for a reaping.
标记清除
Marking and sweeping
[#265]
- December 20, 2024
- Bob Nystrom
标记清除 Marking and sweeping [#265]
- December 20, 2024
- Bob Nystrom
有很多不同的方法可以实现查找并回收所有未使用对象的过程,但其中最简单、也是最早被发明的算法叫做“标记清除”(mark-sweep)。它由发明了 Lisp 和大胡子的约翰·麦卡锡(John McCarthy)发明,所以今天实现这个算法有点像在与一位远古之神交流⸺但愿不是以某种洛夫克拉夫特式的方式结束,否则你的思想和视网膜最终会被炸得一干二净。
There are a bunch of different ways (https://en.wikipedia.org/wiki/Tracing_garbage_collection) you can implement the process of finding and reclaiming all of the unused objects, but the simplest and first algorithm ever invented for it is called “mark-sweep”. It was invented by John McCarthy, the man who invented Lisp and beards, so implementing today is like communing with one of the Elder Gods, but hopefully not in some Lovecraftian way that ends with you having your mind and retinas blasted clean.
该算法的工作原理几乎与我们对可达性的定义完全相同:
The algorithm works almost exactly like our definition of reachability:
1. Starting at the roots, traverse the entire object graph. Every time you reach an object, set a “mark” bit on it to true.
2. Once that’s done, find all of the objects whose mark bits are not set and delete them.
就是这样。我知道,你本来也能想出这个,对吧?如果你当年先想到了,你就会是一篇被引用数百次的论文的作者。这件事给我们的教训是:要在 CS 领域出名,你不必想出真正晦涩难懂的东西,你只需第一个想出显而易见的东西即可。
That’s it. I know, you could have come up with that, right? If you had, you’d be the author of a paper cited hundreds of times. The lesson here is that to be famous in CS, you don’t have to come up with really obscure stuff, you just have to come up with obvious stuff first.
一对对象
A pair of objects
[#266]
- December 20, 2024
- Bob Nystrom
一对对象 A pair of objects [#266]
- December 20, 2024
- Bob Nystrom
在我们开始实现这两个步骤之前,让我们先做一点准备工作。我们不会真的实现一门语言的解释器⸺没有语法分析器、字节码或诸如此类的蠢东西⸺但我们确实需要一点最少量的代码,来制造一些可供回收的垃圾。
Before we can get to implementing those two steps, let’s get a couple of preliminaries out of the way. We won’t be actually implementing an interpreter for a language—no parser, bytecode, or any of that foolishness—but we do need some minimal amount of code to create some garbage to collect.
假设我们正在为一种小语言编写一个解释器。它是动态类型的,有两种类型的对象:整数(int)和对(pair)。这是一个用来标记对象类型的枚举:
Let’s play pretend that we’re writing an interpreter for a little language. It’s dynamically typed, and has two types of objects: ints and pairs. Here’s an enum to identify an object’s type:
typedef enum {
OBJ_INT,
OBJ_PAIR
} ObjectType;
一个对可以是一对任何东西:两个整数、一个整数和另一个对,等等。仅凭这一点,你就能走得出奇地远。由于 VM 中的对象可以是其中任何一种,C 中实现它的典型方法是使用标签联合(tagged union)。我们这样定义它:
A pair can be a pair of anything, two ints, an int and another pair, whatever. You can go surprisingly far (http://www.flickr.com/photos/raganwald/212588975/) with just that. Since an object in the VM can be either of these, the typical way in C to implement it is with a tagged union (http://en.wikipedia.org/wiki/Tagged_union). We define it thusly:
typedef struct sObject {
ObjectType type;
union {
/* OBJ_INT */
int value;
/* OBJ_PAIR */
struct {
struct sObject* head;
struct sObject* tail;
};
};
} Object;
主要的 Object 结构体有一个 type 字段,用于标识它的值类型⸺整数或对。接着是一个联合(union),用来保存整数或对的数据。如果你的 C 已经生疏了:联合是一种字段在内存中相互重叠的结构体。由于给定对象只能是整数或对,因此没有理由在单个对象中同时为所有三个字段都保留内存。联合就是干这个的。非常好。
The main Object struct has a type field that identifies what kind of value it is—either an int or a pair. Then it has a union to hold the data for the int or pair. If your C is rusty, a union is a struct where the fields overlap in memory. Since a given object can only be an int or a pair, there’s no reason to have memory in a single object for all three fields at the same time. A union does that. Groovy.
迷你虚拟机
A minimal virtual machine
[#267]
- December 20, 2024
- Bob Nystrom
迷你虚拟机 A minimal virtual machine [#267]
- December 20, 2024
- Bob Nystrom
现在我们可以在一个小小的虚拟机中使用这个数据类型了。虚拟机在这个故事里的职责是维护一个栈,用来存储当前在作用域内的变量。大多数语言的虚拟机要么是基于栈的(如 JVM 和 CLR),要么是基于寄存器的(如 Lua)。但在这两种情况下,实际上都仍然存在一个栈,用于存储表达式求值中途所需的局部变量和临时变量。我们明确而简单地将其建模如下:
Now we can use that datatype in a little virtual machine. The VM’s role in this story is to have a stack that stores the variables that are currently in scope. Most language VMs are either stack-based (like the JVM and CLR) or register-based (like Lua). In both cases, there is actually still a stack. It’s used to store local variables and temporary variables needed in the middle of an expression. We model that explicitly and simply like so:
#define STACK_MAX 256
typedef struct {
Object* stack[STACK_MAX];
int stackSize;
} VM;
现在已经有了基本的数据结构,让我们拼凑代码来创建一些东西。首先,编写一个创建并初始化虚拟机的函数:
Now that we’ve got our basic data structures in place, let’s slap together a bit of code to create some stuff. First, let’s write a function that creates and initializes a VM:
VM* newVM() {
VM* vm = malloc(sizeof(VM));
vm->stackSize = 0;
return vm;
}
一旦我们有了虚拟机,我们就需要能够操作它的栈:
Once we have a VM, we need to be able to manipulate its stack:
void push(VM* vm, Object* value) {
assert(vm->stackSize < STACK_MAX && "Stack overflow!");
vm->stack[vm->stackSize++] = value;
}
Object* pop(VM* vm) {
assert(vm->stackSize > 0 && "Stack underflow!");
return vm->stack[--vm->stackSize];
}
现在我们可以将东西放入“变量”中,我们需要能够实际创建对象。首先,一个小辅助函数:
Now that we can stick stuff in “variables”, we need to be able to actually create objects. First, a little helper function:
Object* newObject(VM* vm, ObjectType type) {
Object* object = malloc(sizeof(Object));
object->type = type;
return object;
}
这会执行实际的内存分配并设置类型标记。我们稍后会重新讨论这个问题。使用这个函数,我们可以编写其他函数将每种类型的对象推送到虚拟机的栈上:
That does the actual memory allocation and sets the type tag. We’ll be revisiting this in a bit. Using that, we can write functions to push each kind of object onto the VM’s stack:
void pushInt(VM* vm, int intValue) {
Object* object = newObject(vm, OBJ_INT);
object->value = intValue;
push(vm, object);
}
Object* pushPair(VM* vm) {
Object* object = newObject(vm, OBJ_PAIR);
object->tail = pop(vm);
object->head = pop(vm);
push(vm, object);
return object;
}
这就是我们的小虚拟机的全部了。如果我们有一个语法分析器和一个解释器来调用这些函数,我们手上就有了一门名副其实的语言。而且,如果我们有无限的内存,它甚至能运行真正的程序。既然我们没有,那就开始回收一些垃圾吧。
And that’s it for our little VM. If we had a parser and an interpreter that called those functions, we’d have an honest to God language on our hands. And, if we had infinite memory, it would even be able to run real programs. Since we don’t, let’s start collecting some garbage.
标志标记
Marky mark
[#268]
- December 20, 2024
- Bob Nystrom
标志标记 Marky mark [#268]
- December 20, 2024
- Bob Nystrom
第一阶段是标记。我们需要遍历所有可达对象并设置它们的标记位。因此我们首先需要给 Object 添加一个标记位:
The first phase is marking. We need to walk all of the reachable objects and set their mark bit. The first thing we need then is to add a mark bit to Object:
typedef struct sObject {
unsigned char marked;
/* Previous stuff... */
} Object;
当我们创建新对象时,我们修改 newObject(),将 marked 初始化为零。
When we create a new object, we modify newObject() to initialize marked to zero.
为了标记所有可到达的对象,我们从内存中的变量开始,这意味着遍历栈。看起来像这样:
To mark all of the reachable objects, we start with the variables that are in memory, so that means walking the stack. That looks like this:
void markAll(VM* vm)
{
for (int i = 0; i < vm->stackSize; i++) {
mark(vm->stack[i]);
}
}
它转而调用 mark。我们将分阶段构建这个函数。首先:
That in turn calls mark. We’ll build that in phases. First:
void mark(Object* object) {
object->marked = 1;
}
这是字面上最重要的一点(bit)。我们已将对象本身标记为可达。但记住,我们还需要处理对象中的引用⸺可达性是递归的。如果该对象是一个对,则它的两个字段也是可达的。处理递归很简单:
This is the most important bit, literally. We’ve marked the object itself as reachable. But remember we also need to handle references in objects—reachability is recursive. If the object is a pair, its two fields are reachable too. Handling that is simple:
void mark(Object* object) {
object->marked = 1;
if (object->type == OBJ_PAIR) {
mark(object->head);
mark(object->tail);
}
}
你看到这里有一个 bug 了吗?我们现在正在递归,但我们没有检查成环。如果你有一堆在循环中相互指向的对,这将使 C 调用栈溢出并崩溃。
There’s a bug here. Do you see it? We’re recursing now, but we aren’t checking for cycles. If you have a bunch of pairs that point to each other in a loop, this will overflow the C callstack and crash.
为了解决这个问题,我们只需要在到达已处理过的对象时退出即可。所以完整的 mark() 函数是:
To handle that, we simply need to bail out if we get to an object that we’ve already processed. So the complete mark() function is:
void mark(Object* object) {
/* If already marked, we're done. Check this first
to avoid recursing on cycles in the object graph. */
if (object->marked) return;
object->marked = 1;
if (object->type == OBJ_PAIR) {
mark(object->head);
mark(object->tail);
}
}
现在我们可以调用 markAll(),它会正确标记内存中每一个可达的对象。我们已经完成一半了!
Now we can call markAll() and it will correctly mark every reachable object in memory. We’re halfway done!
清扫清除
Sweepy sweep
[#269]
- December 20, 2024
- Bob Nystrom
清扫清除 Sweepy sweep [#269]
- December 20, 2024
- Bob Nystrom
下一阶段是清扫我们分配过的所有对象,并释放其中任何未被标记的对象。但这里有一个问题:根据定义,所有未标记的对象都是不可达的⸺我们根本无法访问到它们!
The next phase is to sweep through all of the objects we’ve allocated and free any of them that aren’t marked. But there’s a problem here: all of the unmarked objects are, by definition, unreachable! We can’t get to them!
我们的虚拟机已经实现了对象引用的语言语义,因此我们只在变量和对的字段中存储指向对象的指针。一旦其中一个对象不再指向某个对象,虚拟机就完全丢失了它,并且实际上泄漏了内存。
The VM has implemented the language’s semantics for object references, so we’re only storing pointers to objects in variables and the pair fields. As soon as an object is no longer pointed to by one of those, the VM has lost it entirely and actually leaked memory.
解决这个问题的技巧是,虚拟机可以拥有自己的对象引用,与语言用户可见的语义不同。换句话说,我们可以自己追踪它们。
The trick to solve this is that the VM can have its own references to objects that are distinct from the semantics that are visible to the language user. In other words, we can keep track of them ourselves.
最简单的方法是维护一个记录我们分配过的每个对象的链表。我们将 Object 本身扩展为该链表中的一个节点:
The simplest way to do this is to just maintain a linked list of every object we’ve ever allocated. We extend Object itself to be a node in that list:
typedef struct sObject {
/* The next object in the list of all objects. */
struct sObject* next;
/* Previous stuff... */
} Object;
虚拟机追踪该链表的头节点:
The VM keeps track of the head of that list:
typedef struct {
/* The first object in the list of all objects. */
Object* firstObject;
/* Previous stuff... */
} VM;
在 newVM() 中,我们确保将 firstObject 初始化为 NULL。每当我们创建一个对象时,都会将其添加到链表中:
In newVM() we make sure to initialize firstObject to NULL. Whenever we create an object, we add it to the list:
Object* newObject(VM* vm, ObjectType type) {
Object* object = malloc(sizeof(Object));
object->type = type;
object->marked = 0;
/* Insert it into the list of allocated objects. */
object->next = vm->firstObject;
vm->firstObject = object;
return object;
}
这样一来,即使语言层面找不到某个对象,语言实现仍然可以找到。要清扫并删除未标记的对象,我们遍历这个链表:
This way, even if the language can’t find an object, the language implementation still can. To sweep through and delete the unmarked objects, we traverse the list:
void sweep(VM* vm)
{
Object** object = &vm->firstObject;
while (*object) {
if (!(*object)->marked) {
/* This object wasn't reached, so remove it from the list
and free it. */
Object* unreached = *object;
*object = unreached->next;
free(unreached);
} else {
/* This object was reached, so unmark it (for the next GC)
and move on to the next. */
(*object)->marked = 0;
object = &(*object)->next;
}
}
}
由于那个指向指针的指针,这段代码读起来有点绕,但如果你仔细推敲,会发现它非常直接:它就是遍历整条链表,每当遇到未标记的对象,就释放其内存并将其从链表中移除。这一步完成后,我们就删除了所有不可达的对象。
That code is a bit tricky to read because of that pointer to a pointer, but if you work through it, you can see it’s pretty straightforward. It just walks the entire linked list. Whenever it hits an object that isn’t marked, it frees its memory and removes it from the list. When this is done, we will have deleted every unreachable object.
恭喜!我们有了一个垃圾回收器!但还缺最后一块:实际去调用它。首先,让我们把这两个阶段包装在一起:
Congratulations! We have a garbage collector! There’s just one missing piece: actually calling it. First let’s wrap the two phases together:
void gc(VM* vm) {
markAll(vm);
sweep(vm);
}
你找不到更清晰的标记清除实现了。最棘手的部分是要弄清楚什么时候真正调用它。“内存不足”到底意味着什么,尤其是在虚拟内存接近无限的现代计算机上?
You couldn’t ask for a more obvious mark-sweep implementation. The trickiest part is figuring out when to actually call this. What does “low on memory” even mean, especially on modern computers with near-infinite virtual memory?
事实证明,这里没有精确的正确或错误答案。这实际上取决于你使用虚拟机的目的以及它运行的硬件类型。为了使这个示例简单,我们将在一定次数的分配后进行回收。这实际上就是某些语言实现的工作原理,而且很容易实现。
It turns out there’s no precise right or wrong answer here. It really depends on what you’re using your VM for and what kind of hardware it runs on. To keep this example simple, we’ll just collect after a certain number of allocations. That’s actually how some language implementations work, and it’s easy to implement.
我们扩展 VM 来追踪我们创建了多少个对象:
We extend VM to track how many we’ve created:
typedef struct {
/* The total number of currently allocated objects. */
int numObjects;
/* The number of objects required to trigger a GC. */
int maxObjects;
/* Previous stuff... */
} VM;
然后初始化它们:
And then initialize them:
VM* newVM() {
/* Previous stuff... */
vm->numObjects = 0;
vm->maxObjects = INITIAL_GC_THRESHOLD;
return vm;
}
INITIAL_GC_THRESHOLD 是我们触发第一次 GC 时的对象数量。较小的数字在内存方面更保守,较大的数字在垃圾回收上花的时间更少。按个人口味调整。
The INITIAL_GC_THRESHOLD will be the number of objects at which we kick off the first GC. A smaller number is more conservative with memory, a larger number spends less time on garbage collection. Adjust to taste.
每当我们创建一个对象,都递增 numObjects,并在其达到上限时运行一次回收:
Whenever we create an object, we increment numObjects and run a collection if it reaches the max:
Object* newObject(VM* vm, ObjectType type) {
if (vm->numObjects == vm->maxObjects) gc(vm);
/* Create object... */
vm->numObjects++;
return object;
}
我不会费心展示了,但我们还会调整 sweep(),让它每释放一个对象就递减 numObjects。最后,我们修改 gc() 来更新上限:
I won’t bother showing it, but we’ll also tweak sweep() to decrement numObjects every time it frees one. Finally, we modify gc() to update the max:
void gc(VM* vm) {
int numObjects = vm->numObjects;
markAll(vm);
sweep(vm);
vm->maxObjects = vm->numObjects * 2;
}
每次回收后,我们都会根据剩余存活对象的数量更新 maxObjects。那里的乘数让我们的堆随着存活对象数量的增加而增长;同样,如果一大批对象最终被释放,它也会自动收缩。
After every collection, we update maxObjects based on the number of live objects left after the collection. The multiplier there lets our heap grow as the number of living objects increases. Likewise, it will shrink automatically if a bunch of objects end up being freed.
简单
Simple
[#270]
- December 20, 2024
- Bob Nystrom
简单 Simple [#270]
- December 20, 2024
- Bob Nystrom
你做到了!如果你一路跟了下来,你现在已经掌握了一个简单的垃圾回收算法。如果你想把所有代码放在一起看,这里有完整代码。我要在此强调:虽然这个回收器很简单,但它不是一个玩具。
You made it! If you followed all of this, you’ve now got a handle on a simple garbage collection algorithm. If you want to see it all together, here’s the full code. Let me stress here that while this collector is simple, it isn’t a toy.
你可以在此之上构建大量优化⸺在垃圾回收和编程语言领域,优化占了 90% 的工作量⸺但这里的核心代码是一个货真价实的垃圾回收器。它与 Ruby 和 Lua 直到不久前所用的回收器非常相似。你可以把从这样的代码起步的东西用在生产环境里。现在去构建一些很棒的东西吧!
There are a ton of optimizations you can build on top of this—in GCs and programming languages, optimization is 90% of the effort—but the core code here is a legitimate, real GC. It's very similar to the collectors that were in Ruby and Lua until recently. You can ship production code that started with this. Now go build something awesome!
Translation. Zipper in scheme [jsr-000O]
- December 18, 2024
-
Oleg Kiselyov with contributions from Jinser Kafka
- Original
Translation. Zipper in scheme [jsr-000O]
- December 18, 2024
- Oleg Kiselyov with contributions from Jinser Kafka
- Original
This article is translated from Oleg Kiselyov's article titled "zipper in scheme".
本文翻译自 Oleg Kiselyov 题为《zipper in scheme》的文章
Title added by translator. 标题由译者添加。
tips: Click on the translation to expand and view the original text. 点击译文可展开查看原文。
前言 [#271]
- December 18, 2024
- Oleg Kiselyov
前言 [#271]
- December 18, 2024
- Oleg Kiselyov
zipper 是一种非常方便的数据结构,它让我们无需任何可变性就能替换复杂数据结构(例如树或 term)深处的某一项。得到的新数据结构会尽可能多地与旧结构共享组件 [见附录],而旧的数据结构仍然可用(如果我们之后想“撤销”该操作,这会很方便)。本质上,zipper 是一个“可更新”却又纯函数式的、指向数据结构内部的游标。
Zipper is a very handy data structure that lets us replace an item deep in a complex data structure, e.g., a tree or a term, without any mutation. The resulting data structure will share as much of its components with the old structure as possible [see addendum]. The old data structure is still available (which can be handy if we wish to 'undo' the operation later on). Zipper is essentially an `updateable' and yet pure functional cursor into a data structure.
有用的参考:
Useful references:
http://www.nist.gov/dads/HTML/zipper.html
http://citeseer.ist.psu.edu/hinze01web.html
zipper 是一种将定界延续具体化(reify)而成的数据结构。不知为何,这个想法在 zipper 的文献中并不常见。因为 Scheme 拥有一等的延续和定界延续,我们可以更容易地派生和使用 zipper。
Zipper is a _delimited continuation_ reified as a data structure. Somehow that idea is not commonly discussed in the zipper literature. Because Scheme has first-class and delimited continuations, we can derive and use zipper far more easily.
本文将给出 zipper 的派生及其使用示例:交换两棵树的两个分支。该使用示例是遗传编程中典型的交叉操作。
Given below is a derivation of zipper and an example of its use: swapping out of two branches of a two trees. The latter is a typical cross-over operation in genetic programming.
[email protected] (Noel Welsh) 在消息中写
news:<[email protected]>...
我想以纯函数式的方式进行交叉,我设想的算法是:
[email protected] (Noel Welsh) wrote in message
news:<[email protected]>...
I would like to do the crossover in a purely functional manner. The
algorithm I envisage is:
- Count the number of nodes for each of the two trees
- Choose an random integer between 0 ... (number of nodes - 1)
This is the node where crossover will take place.
正如前文指出的,我们不需要通过计数来从树中选择随机节点。选定节点后,我们可以使用 eq? 测试向下 zip 到树中的那个节点。不过在下文中,为了简单起见,我们跳过随机选择,按深度优先顺序的索引来选择节点。
As pointed out earlier, we don't need counting to select a random node from a tree. After we selected the node, we can zip down to that node in the tree using the eq? test. In the following however, we skip the random selection for simplicity and thus we shall be selecting nodes by their index in the depth-first order.
派生 [#272]
- December 18, 2024
- Oleg Kiselyov
派生 [#272]
- December 18, 2024
- Oleg Kiselyov
为了派生 zipper,我们首先编写更熟悉的深度优先遍历例程:
To derive zipper, we first write the familiar depth-first traversal routine:
Welcome to Scheme 48 1.1
> ,open srfi-9
> ,open escapes signals
> ,load /usr/local/lib/scheme48/misc/shift-reset.scm
; deterministic, left-to-right map
(define (map* f l)
(if (null? l) l
(cons (f (car l)) (map* f (cdr l)))))
(define (depth-first handle tree)
(cond
((null? tree) tree)
((handle tree) => (lambda (new-tree) new-tree))
; the node was not handled -- descend
((not (pair? tree)) tree) ; an atom
(else
(cons (car tree) ; node name
(map* (lambda (kid) (depth-first handle kid)) (cdr tree))))))
函数 handle 作为 depth-first 的第一个参数,接收一个节点,并应当返回一个节点或 #f。在第一种情况下,返回的节点会替换结果树中的现有节点。如果 handle 返回 #f,则表明它拒绝处理该节点,因此我们保留该节点,并在可能的情况下继续下降处理其子节点。
The function handle, the first-argument of depth-first, receives a node and should yield either a node or #f. In the first case, the return node replaces the existing node in the result tree. If the handle returned #f, it has declined to handle that node, so we keep that node and descend into it, if possible.
为了解其工作原理,我们定义两个示例树并打印出它们的节点:
To see how this works, we define two sample trees and print out their nodes:
(define tree1 '(a (b) (c (d 1 2)) e))
(define tree2 '(z (u) (v (w 10 12)) y))
(depth-first (lambda (node) (display node) (newline) #f) tree1)
==> prints
(a (b) (c (d 1 2)) e)
(b)
(c (d 1 2))
(d 1 2)
1
2
e
==> yields
'(a (b) (c (d 1 2)) e)
我们现在可以定义数据类型 zipper:
We can now define the zipper data structure:
(define-record-type zzipper
(zipper curr-node k)
zipper?
(curr-node z-curr-node)
(k z-k))
它包含两个字段:树的当前节点,以及延续。延续应当接收一个节点或 #f:在前一种情况下,接收到的节点将替换现有节点;在后一种情况下,我们保留现有节点并继续遍历。这个延续返回一个新的 zipper,或者(如果遍历完成)一棵树。可以看出,zipper 在某种意义上是函数 handle
的“逆”。
It contains two fields: the current node of a tree, and the continuation. The continuation should receive a node or #f. In the former case, the received node will replace the existing node. In the latter case, we keep the existing node and continue the traversal. The continuation returns either a new zipper, or a tree (if the traversal is finished). One can see that zipper is in a sense an 'inverse' of the function handle.
(define (zip-tree tree)
(reset (depth-first (lambda (tree) (shift f (zipper tree f))) tree)))
正如所承诺的,zipper 确实是定界延续的一种表现形式。
As promised, zipper is indeed a manifestation of a delimited continuation.
我们应该指出,record zipper 和构造函数 zip-tree 都是 generic 的:它们本身既不依赖树的表示,也不依赖遍历策略。有关树数据结构及其遍历的所有信息都封装在唯一一个函数 depth-first 中。只需更换 depth-first,我们就可以从深度优先策略切换到广度优先策略,或者从嵌套列表切换到嵌套向量实现的树。无论是 zipper、zip-tree,还是任何使用 zipper 的代码(见下文),都不需要任何修改。我们的 zipper 的这一特性与 Gerard Huet 的 zipper 派生形成了鲜明对比。在 Gerard Huet 的表述中,zipper 确实依赖于数据类型的具体实现:zipper 是从数据类型 derived(双关语,译者按:派生/求导)出来的。不同的数据类型(以及同一抽象数据类型的不同实现)会有不同的对应 zipper 结构。而在我们的表述中,zipper 是对遍历函数的 generic
derivation(双关语,译者按:派生/导数)。zipper 是遍历函数的导数⸺而且是机械求出的导数。因此,shift/reset 可以被视为作用在遍历函数上的求导运算符。
We should point out that both the zipper record and the constructor function zip-tree are _generic_. They by themselves depend neither on the representation of the tree nor on the traversal strategy. All the information about the tree data structure and its traversal is encapsulated in one single function depth-first. We can switch from depth-first to breadth-first strategy or from a nested list to a nested vector realization of trees just by changing depth-first. Neither zipper, nor zip-tree, nor any code that uses zipper (see below) will require any modifications. This property of our zipper is in a marked contrast with Gerard Huet's derivation of zipper. In Gerard Huet's formulation, zipper does depend on the concrete realization of the data type: zipper is derived (pun intended) from the data type. Different data types (and different realizations of an abstract data type) will have different corresponding zipper structures. In our formulation, zipper is a _generic_ derivation (pun intended) on the traversal function. Zipper is a derivative of the traversal function -- mechanical derivative at that. So, shift/reset can be considered traversal function derivative operators.
使用 [#273]
- December 18, 2024
- Oleg Kiselyov
使用 [#273]
- December 18, 2024
- Oleg Kiselyov
我们现在可以用一种不同的方式打印树:
We can now print out the tree in a different way:
(define (print-tree tree)
(do ((cursor (zip-tree tree) ((z-k cursor) #f)))
((not (zipper? cursor)))
(display (z-curr-node cursor))
(newline)))
我们使用 zipper(它是一个游标)逐个节点地检查整棵树。从某种意义上说,我们反转了 depth-first 的操作。
we use zipper, which is a cursor, to examine all of the tree, node by node. In a sense, we have inverted the operation of depth-first.
(print-tree tree1)
; prints as before
(print-tree tree2)
(z (u) (v (w 10 12)) y)
(u)
(v (w 10 12))
(w 10 12)
10
12
y
引入一些有用的函数
We introduce a few helpful functions
(define (zip-all-the-way-up zipper)
(if (zipper? zipper) (zip-all-the-way-up ((z-k zipper) (z-curr-node zipper)))
zipper))
(define (locate-nth-node n tree)
(do ((i 0 (+ 1 i)) (cursor (zip-tree tree) ((z-k cursor) #f)))
((and (= i n)
(if (zipper? cursor) #t
(error "too few nodes"))) cursor)
))
我们已准备好做一些事:
And we are ready for some action:
; replace the 3-d node of tree1 with 'xxx
(let ((desired-node (locate-nth-node 3 tree1)))
(display "Replacing the node: ")
(display (z-curr-node desired-node))
(newline)
(zip-all-the-way-up ((z-k desired-node) 'xxx)))
==> prints
Replacing the node: (d 1 2)
==> yields
'(a (b) (c xxx) e)
它确实替换了,不是吗?
It did replace it, didn't it?
; cross-over of the 3d node of tree1 and 1st node of tree2
(let* ((desired-node1 (locate-nth-node 3 tree1))
(_ (begin
(display "Cross-over the node1: ")
(display (z-curr-node desired-node1))
(newline)))
(desired-node2 (locate-nth-node 1 tree2))
(_ (begin
(display "Cross-over the node2: ")
(display (z-curr-node desired-node2))
(newline)))
(new-tree1
(zip-all-the-way-up ((z-k desired-node1)
(z-curr-node desired-node2))))
(new-tree2
(zip-all-the-way-up ((z-k desired-node2)
(z-curr-node desired-node1))))
)
(display "new tree1: ") (display new-tree1) (newline)
(display "new tree2: ") (display new-tree2) (newline)
)
==> prints
Cross-over the node1: (d 1 2)
Cross-over the node2: (u)
new tree1: (a (b) (c (u)) e)
new tree2: (z (d 1 2) (v (w 10 12)) y)
嗯,看来可行...
Well, it seems to work...
如果我们交换 tree1 的第 3 个节点和 tree2 的第 5 个节点,我们得到
If we swap the 3d node of tree1 and the 5th node of tree2, we get
Cross-over the node1: (d 1 2)
Cross-over the node2: 12
new tree1: (a (b) (c 12) e)
new tree2: (z (u) (v (w 10 (d 1 2))) y)
总结 [#274]
- December 18, 2024
- Oleg Kiselyov
总结 [#274]
- December 18, 2024
- Oleg Kiselyov
总而言之,定界延续非常有用。它们可以在任何 R5RS Scheme 系统中模拟;但如果得到原生支持,性能会更好。Scheme48 就原生支持定界延续(Martin Gasbichler and Michael Sperber, ICFP 2002)。如果你最喜欢的 Scheme 系统默认没有提供它们,请向实现者抱怨。支持哪个具体的定界延续运算符(shift、control、shift0、splitter、cupto 等)并不重要⸺它们的表达能力都是等价的:
To conclude, delimited continuations are quite useful. They can be emulated in any R5RS Scheme system; yet it is better for performance if they are supported natively. Scheme48 does support delimited continuations natively (Martin Gasbichler and Michael Sperber, ICFP 2002). If your favorite Scheme system does not offer them by default, please complain to the implementors. It doesn't matter which particular delimited continuation operator (shift, control, shift0, splitter, cupto, etc) is supported -- all of them are equally expressible:
- Chung-chieh Shan, Scheme2004 workshop
- http://www.eecs.harvard.edu/~ccshan/recur/
附录 [#275]
- December 18, 2024
- Oleg Kiselyov
附录 [#275]
- December 18, 2024
- Oleg Kiselyov
附录,2006 年 6 月 7 日[受 Andrew Wilcox 的一个问题启发]。更准确地说,zipper 保留共享的程度与底层枚举器一致。以下是最大限度保留共享的枚举器,这两个函数应当替换本文前面的同名函数。
Addendum, June 7, 2006 [inspired by a question from Andrew Wilcox] To be more precise, the zipper preserves sharing as much as the underlying enumerator does. The following is the maximal sharing preserving enumerator. Those two functions should replace the ones in the article.
; deterministic, left-to-right map
; It preserves sharing as much as possible: that is, if given the pair
; l == (cons h t), (and (eq? h (f h)) (eq? t (map* f t))) holds, then
; (eq? (map* f l) l) holds as well.
(define (map* f l)
(if (null? l) l
(let ((h (car l)) (t (cdr l)))
(let ((h1 (f h)) (t1 (map* f t)))
(if (and (eq? h1 h) (eq? t1 t)) l
(cons h1 t1))))))
(define (depth-first handle tree)
(cond
((null? tree) tree)
((handle tree) => (lambda (new-tree) new-tree))
; the node was not handled -- descend
((not (pair? tree)) tree) ; an atom
(else
(let ((kids1
(map* (lambda (kid) (depth-first handle kid)) (cdr tree))))
(if (eq? kids1 (cdr tree)) tree
(cons (car tree) ; node name
kids1))))))
为了测试新的 depth-first 确实保留了共享,我们求值
To test that the new depth-first indeed preserves sharing, we evaluate
(eq? tree1
(depth-first (lambda (node) (display node) (newline) #f) tree1))
以深度优先顺序打印所有节点后,其结果为 #t。在这种情况下,depth-first 返回的树确实就是原来那棵树。
which, after printing all nodes in depth-first order, gives the result #t. The tree returned by depth-first in this case is indeed the original tree as it is.
zipper 代码无需任何更改,照原样工作,结果也相同。为了测试共享保留,我们首先通过替换 tree2 中的第 6 个节点(即 y)来生成一棵树:
The zipper code needs no changes, and it works as it was, with the same results. To test the sharing preservation, we first produce a tree by replacing the 6th node (which is y) in tree2:
(define tree2*
(let ((desired-node (locate-nth-node 6 tree2)))
(display "Replacing the node: ")
(display (z-curr-node desired-node))
(newline)
(zip-all-the-way-up ((z-k desired-node) 'newy))))
这是结果:
(z (u) (v (w 10 12)) newy)
here's the result: (z (u) (v (w 10 12)) newy)
现在,我们编写一个函数,它接受两棵树,同步遍历它们并打印出节点以及它们是否共享:
Now, we write a function that takes two trees, traverses them in lockstep and prints out the nodes and if they are shared:
(define (tree-compare-sharing t1 t2)
(do ((cursor1 (zip-tree t1) ((z-k cursor1) #f))
(cursor2 (zip-tree t2) ((z-k cursor2) #f)))
((cond
((and (zipper? cursor1) (zipper? cursor2)) #f)
((zipper? cursor1) (display "t2 finished early") #t)
((zipper? cursor2) (display "t1 finished early") #t)
(else #t)))
(let ((n1 (z-curr-node cursor1)) (n2 (z-curr-node cursor2)))
(cond
((eq? n1 n2) (display "shared node: ") (display n1))
(else (display "t1 node: ") (display n1) (newline)
(display "t2 node: ") (display n2)))
(newline))))
(tree-compare-sharing tree2 tree2*)
===>
t1 node: (z (u) (v (w 10 12)) y)
t2 node: (z (u) (v (w 10 12)) newy)
shared node: (u)
shared node: (v (w 10 12))
shared node: (w 10 12)
shared node: 10
shared node: 12
t1 node: y
t2 node: newy