Guile's reader, in guile › Scheme read, in Scheme [#260]
Which brings us to today's note: I just rewrote Guile's reader in Scheme too! The reader is the bit that takes a stream of characters and parses it into S-expressions. It was in C, and now is in Scheme.
One of the primary motivators for this was to allow read to be suspendable. With this change, read-eval-print loops are now implementable on fibers.
Another motivation was to finally fix a bug in which Guile couldn't record source locations for some kinds of datums. It used to be that Guile would use a weak-key hash table to associate datums returned from read with source locations. But this only works for fresh values, not for immediate values like small integers or characters, nor does it work for globally unique non-immediates like keywords and symbols. So for these, we just wouldn't have any source locations.
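To make the side-table limitation concrete, here is a minimal sketch that reads three datums from strings and asks for their source properties. It assumes a Guile whose source-properties accepts any object (true of recent releases, as far as I know), and read-from-string is just a helper name made up for this example.

```scheme
;; Sketch only: `read-from-string' is a hypothetical helper, not a Guile API.
(read-enable 'positions)   ; ask `read' to record source positions

(define (read-from-string str)
  (call-with-input-string str read))

(define pair-datum   (read-from-string "(a b c)"))  ; freshly allocated pair
(define symbol-datum (read-from-string "foo"))      ; globally unique symbol
(define fixnum-datum (read-from-string "42"))       ; immediate value

;; Only the fresh pair can serve as a key in the weak-key side table,
;; so only it should come back with a non-empty property list.
(for-each
 (lambda (datum)
   (format #t "~s has source properties ~s\n"
           datum (source-properties datum)))
 (list pair-datum symbol-datum fixnum-datum))
```

The symbol and the fixnum have no identity of their own to hang a location on, which is exactly why the side table can't help them.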
A robust solution to that problem is to return annotated objects rather than using a side table. Since Scheme's macro expander is already set to work with annotated objects (syntax objects), a new read-syntax interface would do us a treat.
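As a rough illustration of the annotated-object approach, the sketch below reads a few atoms with read-syntax and prints the source information carried by each syntax object. It assumes a Guile recent enough to provide read-syntax, and read-all-syntax is a hypothetical helper written for this example.

```scheme
;; Sketch only: `read-all-syntax' is a hypothetical helper, not a Guile API.
(define (read-all-syntax str)
  "Read every datum in STR with read-syntax; return a list of syntax objects."
  (let ((port (open-input-string str)))
    (let loop ((acc '()))
      (let ((stx (read-syntax port)))
        (if (eof-object? stx)
            (reverse acc)
            (loop (cons stx acc)))))))

;; A symbol, a small integer and a keyword: the kinds of datum the
;; weak-key side table could not annotate.
(for-each
 (lambda (stx)
   (format #t "~s was read at ~s\n"
           (syntax->datum stx)     ; strip the annotation
           (syntax-source stx)))   ; source info carried by the wrapper
 (read-all-syntax "foo 42 #:key"))
```

Because the location travels with the wrapper rather than with the datum, it no longer matters whether the datum is fresh, immediate or interned.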
With read in C, this was hard to do. But with read in Scheme, it was no problem to implement. Adapting the expander to expect source locations inside syntax objects was a bit fiddly, though, and the resulting increase in source location information makes the output files bigger by a few percent -- due somewhat to the increased size of the .debug_line DWARF data, but also due to serialized source locations for syntax objects in macros.
Speed-wise, switching to read in Scheme is a regression, currently. The old reader could parse around 15 or 16 megabytes per second when recording source locations on this laptop, or around 22 or 23 MB/s with source locations off. The new one parses more like 10.5 MB/s, or 13.5 MB/s with positions off, when in the old mode where it uses a weak-key side table to record source locations. The new read-syntax runs at around 12 MB/s. We'll be noodling at these in the coming months, but unlike when the original reader was written, at least now the reader is mainly used only at compile time. (It still has a role when reading s-expressions as data, so there is still a reason to make it fast.)
As is the case with eval, we still have a C version of the reader available for bootstrapping purposes, before the Scheme version is loaded. Happily, with this rewrite I was able to remove all of the cruft from the C reader related to non-default lexical syntax, which simplifies maintenance going forward.
An interesting aspect of attempting to make a bug-for-bug rewrite is that you find bugs and unexpected behavior. For example, it turns out that since the dawn of time, Guile always read #t and #f without requiring a terminating delimiter, so reading "(#t1)" would result in the list (#t 1). Weird, right? Weirder still, when the #true and #false aliases were added to the language, Guile decided to support them by default, but in an oddly backwards-compatible way... so "(#false1)" reads as (#f 1) but "(#falsa1)" reads as (#f alsa1). Quite a few more things like that.
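These quirks are easy to poke at from a REPL. A small sketch, with read-from-string again being a helper name made up for the example:

```scheme
;; Sketch only: `read-from-string' is a hypothetical helper, not a Guile API.
(define (read-from-string str)
  (call-with-input-string str read))

;; Per the behavior described above, these should read as
;; (#t 1), (#f 1) and (#f alsa1) respectively.
(for-each
 (lambda (str)
   (format #t "~s reads as ~s\n" str (read-from-string str)))
 '("(#t1)" "(#false1)" "(#falsa1)"))
```

If a later Guile tightens up delimiter handling, the printed results will differ, so treat this as a probe rather than a specification.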
All in all it would seem to be a successful rewrite, introducing no new behavior, even producing the same errors. However, this is not the case for backtraces, which can expose the guts of read in cases where that previously wouldn't happen because the C stack was opaque to Scheme. Probably we will simply need to add more sensible error handling around callers to read, as a backtrace isn't a good user-facing error anyway.
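One possible shape for that error handling, as a sketch rather than anything Guile ships: wrap the call to read in catch, so a malformed input yields a tagged result instead of a backtrace through the reader's internals. The name try-read-from-string and the (ok . datum) / (error . key) convention are invented for this example.

```scheme
;; Sketch only: `try-read-from-string' and its return convention are
;; hypothetical, chosen for illustration.
(define (try-read-from-string str)
  "Return (ok . DATUM) on success, or (error . KEY) if read throws."
  (catch #t
    (lambda ()
      (cons 'ok (call-with-input-string str read)))
    (lambda (key . args)
      ;; The stack is already unwound here, so no reader frames leak
      ;; out to the user; ARGS still holds the details for a message.
      (cons 'error key))))

(format #t "~s\n" (try-read-from-string "(1 2)"))  ; well-formed input
(format #t "~s\n" (try-read-from-string "(1 2"))   ; unterminated list
```

Tagging the result keeps a successfully read #f distinguishable from a failure, which a bare #f return value would not.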
OK enough rambling for this evening. Happy hacking to all and to all a good night!