Planet Scheme

July 03, 2020

Programming Praxis

Spelling Numbers

Your task is to write a program that lists all of the numbers from zero to one hundred, inclusive, in alphabetical order. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
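
For the curious, here is one possible sketch in Guile Scheme (not the official suggested solution); it spells 0 through 100 with hyphenated compounds and relies only on Guile's built-in sort and iota:

(define ones
  '("zero" "one" "two" "three" "four" "five" "six" "seven" "eight" "nine"
    "ten" "eleven" "twelve" "thirteen" "fourteen" "fifteen" "sixteen"
    "seventeen" "eighteen" "nineteen"))

(define tens
  '("twenty" "thirty" "forty" "fifty" "sixty" "seventy" "eighty" "ninety"))

(define (number->words n)
  ;; Spell N in English; only 0-100 is supported here.
  (cond ((< n 20)  (list-ref ones n))
        ((= n 100) "one hundred")
        (else (let ((t (list-ref tens (- (quotient n 10) 2)))
                    (u (remainder n 10)))
                (if (zero? u)
                    t
                    (string-append t "-" (number->words u)))))))

(for-each (lambda (w) (display w) (newline))
          (sort (map number->words (iota 101)) string<?))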

by programmingpraxis at July 03, 2020 09:00 AM

July 01, 2020

GNU Guix

Securing updates

Software deployment tools like Guix are in a key position when it comes to securing the “software supply chain”—taking source code fresh from repositories and providing users with ready-to-use binaries. We have been paying attention to several aspects of this problem in Guix: authentication of pre-built binaries, reproducible builds, bootstrapping, and security updates.

A couple of weeks ago, we addressed the elephant in the room: authentication of Guix code itself by guix pull, the tool that updates Guix and its package collection. This article looks at what we set out to address, how we achieved it, and how it compares to existing work in this area.

What updates should be protected against

The problem of securing distro updates is often viewed through the lens of binary distributions such as Debian, where the main assets to be protected are the binaries themselves. The functional deployment model that Guix and Nix implement is very different: conceptually, Guix is a source distribution, like Gentoo if you will.

Pre-built binaries are of course available and very useful, but they’re optional; we call them substitutes because they’re just that: substitutes for local builds. When you do choose to accept substitutes, they must be signed by one of the keys you authorized (this has been the case since version 0.6 in 2014).

Guix consists of source code for the tools as well as all the package definitions—the distro. When users run guix pull, what happens behind the scenes is equivalent to git clone or git pull. There are many ways this can go wrong. An attacker can trick the user into pulling code from an alternate repository that contains malicious code or definitions for backdoored packages. This is made more difficult by the fact that code is fetched over HTTPS from Savannah by default. If Savannah is compromised (as happened in 2010), an attacker can push code to the Guix repository, which everyone would pull. The change might even go unnoticed and remain in the repository forever. An attacker with access to Savannah can also reset the main branch to an earlier revision, leading users to install outdated software with known vulnerabilities—a downgrade attack. These are the kinds of attacks we want to protect against.

Authenticating Git checkouts

If we take a step back, the problem we’re trying to solve is not specific to Guix and to software deployment tools: it’s about authenticating Git checkouts. By that, we mean that when guix pull obtains code from Git, it should be able to tell that all the commits it fetched were pushed by authorized developers of the project. We’re really looking at individual commits, not tags, because users can choose to pull arbitrary points in the commit history of Guix and third-party channels.

Checkout authentication requires cryptographically signed commits. By signing a commit, a Guix developer asserts that they are the one who made the commit; they may be its author, or they may be the person who applied somebody else’s changes after review. It also requires a notion of authorization: we don’t simply want commits to have a valid signature, we want them to be signed by an authorized key. The set of authorized keys changes over time as people join and leave the project.

To implement that, we came up with the following mechanism and rule:

  1. The repository contains a .guix-authorizations file that lists the OpenPGP key fingerprints of authorized committers.
  2. A commit is considered authentic if and only if it is signed by one of the keys listed in the .guix-authorizations file of each of its parents. This is the authorization invariant.

(Remember that Git commits form a directed acyclic graph (DAG) where each commit can have zero or more parents; merge commits have two parent commits, for instance. Do not miss Git for Computer Scientists for a pedagogical overview!)
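In Scheme terms, the authorization invariant of rule 2 might be sketched as follows; commit-parents, commit-signing-key, and authorized-keys are hypothetical accessors standing in for what Guix actually implements on top of Guile-Git and its OpenPGP code:

(use-modules (srfi srfi-1))   ; for 'every'

;; A commit is authentic iff its signing key is listed in the
;; .guix-authorizations file of *each* of its parents.
(define (commit-authentic? commit)
  (every (lambda (parent)
           (member (commit-signing-key commit)
                   (authorized-keys parent)))
         (commit-parents commit)))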

Let’s take an example to illustrate. In the figure below, each box is a commit, and each arrow is a parent relationship:

Example commit graph.

This figure shows two lines of development: the orange line may be the main development branch, while the purple line may correspond to a feature branch that was eventually merged in commit F. F is a merge commit, so it has two parents: D and E.

Labels next to boxes show who’s in .guix-authorizations: for commit A, only Alice is an authorized committer, and for all the other commits, both Bob and Alice are authorized committers. For each commit, we see that the authorization invariant holds; for example:

  • commit B was made by Alice, who was the only authorized committer in its parent, commit A;
  • commit C was made by Bob, who was among the authorized committers as of commit B;
  • commit F was made by Alice, who was among the authorized committers of both parents, commits D and E.

The authorization invariant has the nice property that it’s simple to state, and it’s simple to check and enforce. This is what guix pull implements. If your current Guix, as returned by guix describe, is at commit A and you want to pull to commit F, guix pull traverses all these commits and checks the authorization invariant.

Once a commit has been authenticated, all the commits in its transitive closure are known to be already authenticated. guix pull keeps a local cache of the commits it has previously authenticated, which allows it to traverse only new commits. For instance, if you’re at commit F and later update to a descendant of F, authentication starts at F.

Since .guix-authorizations is a regular file under version control, granting or revoking commit authorization does not require special support. In the example above, commit B is an authorized commit by Alice that adds Bob’s key to .guix-authorizations. Revocation is similar: any authorized committer can remove entries from .guix-authorizations. Key rotation can be handled similarly: a committer can remove their former key and add their new key in a single commit, signed by the former key.
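
For illustration, a .guix-authorizations file is itself a Scheme datum along these lines (fingerprints and names are invented for the example; the manual has the authoritative format):

(authorizations
 (version 0)
 (("AAAA BBBB CCCC DDDD EEEE  FFFF 0000 1111 2222 3333"
   (name "alice"))
  ("4444 5555 6666 7777 8888  9999 AAAA BBBB CCCC DDDD"
   (name "bob"))))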

The authorization invariant satisfies our needs for Guix. It has one downside: it prevents pull-request-style workflows. Indeed, merging the branch of a contributor not listed in .guix-authorizations would break the authorization invariant. It’s a good tradeoff for Guix because our workflow relies on patches carved into stone tablets (patch tracker), but it’s not suitable for every project out there.

Bootstrapping

The attentive reader may have noticed that something’s missing from the explanation above: what do we do about commit A in the example above? In other words, which commit do we pick as the first one where we can start verifying the authorization invariant?

We solve this bootstrapping issue by defining channel introductions. Previously, one would identify a channel simply by its URL. Now, when introducing a channel to users, one needs to provide an additional piece of information: the first commit where the authorization invariant holds, and the fingerprint of the OpenPGP key used to sign that commit (it’s not strictly necessary but provides an additional check). Consider this commit graph:

Example commit graph with introduction.

In this figure, B is the introduction commit. Its ancestors, such as A, are considered authentic. To authenticate C, D, E, and F, we check the authorization invariant.

As always when it comes to establishing trust, distributing channel introductions is very sensitive. The introduction of the official guix channel is built into Guix. Users obtain it when they install Guix for the first time; hopefully they verify the signature on the Guix tarball or ISO image, as noted in the installation instructions, which reduces the chances of getting the “wrong” Guix, but it is still very much trust-on-first-use (TOFU).

For signed third-party channels, users have to provide the channel’s introduction in their channels.scm file, like so:

(channel
  (name 'my-channel)
  (url "https://example.org/my-channel.git")
  (introduction
   (make-channel-introduction
    "6f0d8cc0d88abb59c324b2990bfee2876016bb86"
    (openpgp-fingerprint
     "CABB A931 C0FF EEC6 900D  0CFB 090B 1199 3D9A EBB5"))))

The guix describe command now prints the introduction if there’s one. That way, one can share their channel configuration, including introductions, without having to be an expert.

Channel introductions also solve another problem: forks. Respecting the authorization invariant “forever” would effectively prevent “unauthorized” forks—forks made by someone who’s not in .guix-authorizations. Someone publishing a fork simply needs to emit a new introduction for their fork, pointing to a different starting commit.

Last, channel introductions give a point of reference: if an attacker manipulates branch heads on Savannah to have them point to unrelated commits (such as commits on an orphan branch that do not share any history with the “official” branches), authentication will necessarily fail as it stumbles upon the first unauthorized commit made by the attacker. In the figure above, the red branch with commits G and H cannot be authenticated because it starts from A, which lacks .guix-authorizations and thus fails the authorization invariant.

That’s all for authentication! I’m glad you read this far. At this point you can take a break or continue with the next section on how guix pull prevents downgrade attacks.

Downgrade attacks

An important threat for software deployment tools is downgrade or roll-back attacks. The attack consists in tricking users into installing older, known-vulnerable software packages, which in turn may offer new ways to break into their system. This is not strictly related to the authentication issue we’ve been discussing, except that it’s another important issue in this area that we took the opportunity to address.

Guix saves provenance info for itself: guix describe prints that information, essentially the Git commits of the channels used during guix pull:

$ guix describe
Generation 149  Jun 17 2020 20:00:14    (current)
  guix 8b1f7c0
    repository URL: https://git.savannah.gnu.org/git/guix.git
    branch: master
    commit: 8b1f7c03d239ca703b56f2a6e5f228c79bc1857e

Thus, guix pull, once it has retrieved the latest commit of the selected branch, can verify that it is doing a fast-forward update in Git parlance—just like git pull does, but compared to the previously-deployed Guix. A fast-forward update is when the new commit is a descendant of the current commit. Going back to the figure above, going from commit A to commit F is a fast-forward update, but going from F to A or from D to E is not.
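
Conceptually the check is plain graph reachability. A hedged sketch, with commit-parents again a hypothetical accessor (the real code walks the commit graph with Guile-Git):

(use-modules (srfi srfi-1))

;; NEW is a fast forward from CURRENT iff CURRENT is among NEW's ancestors.
(define (fast-forward? current new)
  (let loop ((frontier (list new)) (seen '()))
    (cond ((null? frontier) #f)
          ((memq current frontier) #t)
          (else
           (let ((parents (append-map commit-parents frontier)))
             (loop (lset-difference eq? parents seen)
                   (append frontier seen)))))))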

Not doing a fast-forward update would mean that the user is deploying an older version of Guix than the one currently used, or an unrelated version from another branch. In both cases, the user is at risk of ending up installing older, vulnerable software.

By default guix pull now errors out on non-fast-forward updates, thereby protecting from roll-backs. Users who understand the risks can override that by passing --allow-downgrades.

Authentication and roll-back prevention allow users to safely refer to mirrors of the Git repository. If git.savannah.gnu.org is down, one can still update by fetching from a mirror, for instance with:

guix pull --url=https://github.com/guix-mirror/guix

If the repository at this URL is behind what the user already deployed, or if it’s not a genuine mirror, guix pull will abort. In other cases, it will proceed.

Unfortunately, there is no way to answer the general question “is X the latest commit of branch B?”. Rollback detection prevents just that, rollbacks, but there’s no mechanism in place to tell whether a given mirror is stale. To mitigate that, channel authors can specify, in the repository, the channel’s primary URL. This piece of information lives in the .guix-channel file, in the repository, so it’s authenticated. guix pull uses it to print a warning when the user pulls from a mirror:

$ guix pull --url=https://github.com/guix-mirror/guix
Updating channel 'guix' from Git repository at 'https://github.com/guix-mirror/guix'...
Authenticating channel 'guix', commits 9edb3f6 to 3e51f9e (44 new commits)...
guix pull: warning: pulled channel 'guix' from a mirror of https://git.savannah.gnu.org/git/guix.git, which might be stale
Building from this channel:
  guix      https://github.com/guix-mirror/guix 3e51f9e
…
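
For reference, the primary URL declaration in .guix-channel is itself a Scheme datum, roughly like so (an illustrative sketch; see the manual for the authoritative set of fields):

(channel
 (version 0)
 (url "https://example.org/my-channel.git"))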

So far we talked about mechanics in a rather abstract way. That might satisfy the graph theorist or the Git geek in you, but if you’re up for a quick tour of the implementation, the next section is for you!

A long process

We’re kinda celebrating these days, but the initial bug report was opened… in 2016. One of the reasons was that we were hoping the general problem was solved already and we’d “just” have to adapt what others had done. As for the actual design: you would think it can be implemented in ten lines of shell script invoking gpgv and git. Perhaps that’s a possibility, but the resulting performance would be problematic—keep in mind that users may routinely have to authenticate hundreds of commits. So we took a long road, but the end result is worth it. Let’s recap.

Back in April 2016, committers started signing commits, with a server-side hook prohibiting unsigned commits. In July 2016, we had proof-of-concept libgit2 bindings with the primitives needed to verify signatures on commits, passing them to gpgv; later Guile-Git was born, providing good coverage of the libgit2 interface. Then there was a two-year hiatus during which no code was produced in that area.

Everything went faster starting from December 2019. Progress was incremental and may have been hard to follow, even for die-hard Guix hackers.

Whether you’re a channel author or a user, the feature is now fully documented in the manual, and we’d love to get your feedback!

SHA-1

We can’t really discuss Git commit signing without mentioning SHA-1. The venerable cryptographic hash function is approaching end of life, as evidenced by recent breakthroughs. Signing a Git commit boils down to signing a SHA-1 hash, because all objects in the Git store are identified by their SHA-1 hash.

Git now relies on a collision attack detection library to mitigate practical attacks. Furthermore, the Git project is planning a hash function transition to address the problem.

Some projects such as Bitcoin Core choose to not rely on SHA-1 at all. Instead, for the commits they sign, they include in the commit log the SHA512 hash of the tree, which the verification scripts check.

Computing a tree hash for each commit in Guix would probably be prohibitively costly. For now, for lack of a better solution, we rely on Git’s collision attack detection and look forward to a hash function transition.

As for SHA-1 in an OpenPGP context: our authentication code rejects SHA-1 OpenPGP signatures, as recommended.

Related work

A lot of work has gone into securing the software supply chain, often in the context of binary distros, sometimes in a more general context; more recent work also looks into Git authentication and related issues. This section attempts to summarize how Guix relates to similar work that we’re aware of in these two areas. More detailed discussions can be found in the issue tracker.

The Update Framework (TUF) is a reference for secure update systems, with a well-structured spec and a number of implementations. TUF is a great source of inspiration to think about this problem space. Many of its goals are shared by Guix. Not all the attacks it aims to protect against (Section 1.5.2 of the spec) are addressed by what’s presented in this post: indefinite freeze attacks, where updates never become available, are not addressed per se (though easily observable), and slow retrieval attacks aren’t addressed either. The notion of role is also something currently missing from the Guix authentication model, where any authorized committer can touch any files, though the model and .guix-authorizations format leave room for such an extension.

However, both in its goals and system descriptions, TUF is biased towards systems that distribute binaries as plain files with associated meta-data. That creates a fundamental impedance mismatch. As an example, attacks such as fast-forward attacks or mix-and-match attacks don’t apply in the context of Guix; likewise, the repository depicted in Section 3 of the spec has little in common with a Git repository.

Developers of OPAM, the OCaml package manager, adapted TUF for use with their Git-based package repository, and later went on to write Conex, a separate tool to authenticate OPAM repositories. OPAM is interesting because, like Guix, it’s a source distro and its package repository is a Git repository containing “build recipes”. To date, it appears that opam update itself does not authenticate repositories, though; it’s up to users or developers to run Conex.

Another very insightful piece of work is the 2016 paper On omitting commits and committing omissions. The paper focuses on the impact of malicious modifications to Git repository meta-data. An attacker with access to the repository can modify, for instance, branch references, to cause a rollback attack or a “teleport” attack, causing users to pull an older commit or an unrelated commit. As written above, guix pull would detect such attacks. However, guix pull would fail to detect cases where metadata modification does not yield a rollback or teleport, yet gives users a different view than the intended one—for instance, a user is directed to an authentic but different branch rather than the intended one. The “secure push” operation and the associated reference state log (RSL) the authors propose would be an improvement.

Wrap-up and outlook

Guix now has a mechanism that allows it to authenticate updates. If you’ve run guix pull recently, perhaps you’ve noticed additional output and a progress bar as new commits are being authenticated. Apart from that, the switch has been completely transparent. The authentication mechanism is built around the commit graph of Git; in fact, it’s a mechanism to authenticate Git checkouts and in that sense it is not tied to Guix and its application domain. It is available not only for the main guix channel, but also for third-party channels.

To bootstrap trust, we added the notion of channel introductions. These are now visible in the user interface, in particular in the output of guix describe and in the configuration file of guix pull and guix time-machine. While channel configuration remains a few lines of code that users typically paste, this extra bit of configuration might be intimidating. It certainly gives an incentive to provide a command-line interface to manage the user’s list of channels: guix channel add, etc.

The solution here is built around the assumption that Guix is fundamentally a source-based distribution, and is thus completely orthogonal to the public key infrastructure (PKI) Guix uses for the signature of substitutes. Yet, the substitute PKI could probably benefit from the fact that we now have a secure update mechanism for the Guix source code: since guix pull can securely retrieve a new substitute signing key, perhaps it could somehow handle substitute signing key revocation and delegation automatically? Related to that, channels could perhaps advertise a substitute URL and its signing key, possibly allowing users to register those when they first pull from the channel. All this requires more thought, but it looks like there are new opportunities here.

Until then, if you’re a user or a channel author, we’d love to hear from you! We’ve already gotten feedback that these new mechanisms broke someone’s workflow; hopefully it didn’t break yours, but either way your input is important in improving the system. If you’re into security and think this design is terrible or awesome, please do provide feedback.

It’s a long article describing a long ride on a path we discovered as we went, and it felt like an important milestone to share!

Acknowledgments

Thanks to everyone who provided feedback, ideas, or carried out code review during this long process, notably (in no particular order): Christopher Lemmer Webber, Leo Famulari, David Thompson, Mike Gerwitz, Ricardo Wurmus, Werner Koch, Justus Winter, Vagrant Cascadian, Maxim Cournoyer, Simon Tournier, John Soo, and Jakub Kądziołka. Thanks also to janneke, Ricardo, Marius, and Simon for reviewing an earlier draft of this post.

About GNU Guix

GNU Guix is a transactional package manager and an advanced distribution of the GNU system that respects user freedom. Guix can be used on top of any system running the kernel Linux, or it can be used as a standalone operating system distribution for i686, x86_64, ARMv7, and AArch64 machines.

In addition to standard package management features, Guix supports transactional upgrades and roll-backs, unprivileged package management, per-user profiles, and garbage collection. When used as a standalone GNU/Linux distribution, Guix offers a declarative, stateless approach to operating system configuration management. Guix is highly customizable and hackable through Guile programming interfaces and extensions to the Scheme language.

by Ludovic Courtès at July 01, 2020 05:40 PM

June 30, 2020

Programming Praxis

Shuffle An Array

Today’s exercise comes to us from Leetcode via Reddit:

Given an array consisting of 2n elements in the form [x1,x2,…,xn,y1,y2,…,yn], return the array in the form [x1,y1,x2,y2,…,xn,yn].

The Reddit poster claims to be new to Scheme and functional programming, and was thinking of a solution using length and list-ref, but couldn’t solve the problem.

Your task is to show the student how to solve the problem. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
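
One way to show the student, sketched with SRFI-1: split the list in half, then interleave the halves.

(use-modules (srfi srfi-1))   ; append-map

(define (shuffle xs)
  ;; Turn (x1 ... xn y1 ... yn) into (x1 y1 ... xn yn).
  (let* ((n     (quotient (length xs) 2))
         (front (list-head xs n))
         (back  (list-tail xs n)))
    (append-map list front back)))

(shuffle '(2 5 1 3 4 7))   ; => (2 3 5 4 1 7)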

by programmingpraxis at June 30, 2020 09:00 AM

June 20, 2020

Göran Weinholt

Akku Archive Improvements

Akku.scm is a language package manager for R6RS and R7RS Scheme. The software that powers the package index has been growing beyond the simple one-liner it was in the beginning and today I’ve finally pushed it to a public repository. I’ve also made preparations for hosting packages as tarballs directly in the archive.

Tarballs

The Akku archive has never hosted packages directly. The index points at git repositories and commit revisions. These are added to each project’s Akku.lock file and are used when akku install clones the repository.

This has two major drawbacks. Cloning a git repository can be really slow. The repositories are also hosted on sites like GitHub, where users sometimes decide to force-push or remove their repositories completely. I feel this is likely to happen more often the more politics and business influence GitHub in the future.

I’ve prepared the Akku archive to host tarballs directly. These are made with git archive from the submitted git repository. Downloading these is much faster than cloning a repository, they are not at risk of being removed at a whim, and they are cached in a local shared cache. Other package managers as a rule host their own archives as well, so this is nothing unusual.

Provenance

It’s important to me that users of Akku can trust that they get original software that has not been tampered with. I review all code that goes into the archive to protect Akku against use in supply chain attacks.

Building tarballs changes the equation a little bit since you now need to trust that the tarballs have not been tampered with. Tarballs are verified when they are downloaded, but how do you know that they match the original software?

This can be seen as an issue of provenance, or providing proof of the history of a piece of software. Here is the chain for the new tarballs:

  • Akku packages are submitted through akku publish by a developer (or by the Snow mirror software) as a .akku file with a detached GPG signature. This signature can be independently verified by fetching the key from the keyservers.

    The signed .akku file contains a git commit id. Because it is signed by the person who submitted the package, we can use the signature to verify that it was not tampered with after it went into the archive.

    Copies of these files are hosted under /archive/packages.

  • The archive software creates a tarball from the original repository using the git archive command. It also creates a new .akku file which contains information about the original repository and commit id as a comment. The non-comment part of the file contains the URL and hash of the new tarball. Like other .akku files, it is signed. This provides a signature linking the original git commit to the new tarball’s hash.

    These files are available under /archive/pkg. The signature is made with the current Akku archive key, which is in turn signed by my own key (which is in the Debian keyring).

  • The .akku files for Snow packages and the new tarballs are combined using akku archive-scan and written to Akku-index.scm, which is then XZ-compressed and signed with the archive key. The akku update command verifies the signature when it downloads this file. When Akku creates an Akku.lock file it incorporates the hash from the index, which is verified when akku install runs.

The above should make it possible for any interested party to check the integrity of the archive. It also protects against attackers uploading funky tarballs that don’t match the git repository.

All git repositories and Snow packages are mirrored in the archive under /archive/mirror. This mirror is not used in the index and is mostly provided for backup purposes.

Beta testers

The new index with tarballs is not live yet; it needs some testing.

Anyone who wants to do so can try it and report successes or failures in the comments section below or in GitLab issues. Here is how to update to the new archive manually:

curl https://archive.akkuscm.org/beta/Akku-index.scm \
  > ~/.local/share/akku/index.db

There is a GPG signature (.sig) in the same directory in case you want to verify that it was not tampered with.

Run akku lock in your existing project to get a lockfile that uses the new index. Then run akku install to download your packages as usual.

If all goes well, the switch to the new index will happen some time soon and akku update will use the new-style index. You will still be able to revert to the old index by downloading Akku-origin.scm manually from the archive site and then using that as your index. This file will keep being maintained because that is where the Akku website generator finds pointers to upstream Git repositories.

Further reading

More about Akku:

by weinholt at June 20, 2020 12:00 AM

June 19, 2020

Programming Praxis

Summing A String

In a string consisting of digits and other non-digit characters, the digits form an embedded series of positive integers. For instance, the string “123abc45def” contains the embedded integers 123 and 45, which sum to 168.

Your task is to write a program that takes a string and writes the sum of the integers embedded in the string. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
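
A sketch of one approach: walk the characters, accumulating the current run of digits as a number and adding it to the total whenever the run ends.

(define (sum-string s)
  (let loop ((chars (string->list s)) (current 0) (total 0))
    (cond ((null? chars)
           (+ total current))                        ; flush the last run
          ((char-numeric? (car chars))
           (loop (cdr chars)
                 (+ (* 10 current)
                    (- (char->integer (car chars)) (char->integer #\0)))
                 total))
          (else
           (loop (cdr chars) 0 (+ total current)))))) ; run ended

(sum-string "123abc45def")   ; => 168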

by programmingpraxis at June 19, 2020 09:00 AM

June 16, 2020

Programming Praxis

Counting Fingers

A little girl counts on her fingers in a curious way. She counts 1 on her thumb, 2 on her index finger, 3 on her middle finger, 4 on her ring finger, and 5 on her pinkie finger, then works back, counting 6 on her ring finger, 7 on her middle finger, 8 on her index finger, and 9 on her thumb, when she again turns around and counts 10 on her index finger, 11 on her middle finger, and so on.

Your task is to write a program that determines which finger the little girl will be on when she reaches a thousand. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
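
The counting repeats with period 8 (thumb, index, middle, ring, pinkie, ring, middle, index), so modular arithmetic answers the question directly; a sketch:

(define fingers #("thumb" "index" "middle" "ring" "pinkie"))

(define (finger n)
  ;; Positions 1-5 run out to the pinkie; 6-8 fold back toward the thumb.
  (let ((m (+ 1 (modulo (- n 1) 8))))
    (vector-ref fingers (- (if (<= m 5) m (- 10 m)) 1))))

(display (finger 1000)) (newline)   ; => index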

by programmingpraxis at June 16, 2020 09:00 AM

June 15, 2020

GNU Guix

Guix Further Reduces Bootstrap Seed to 25%

We are delighted to announce that the second reduction by 50% of the Guix bootstrap binaries has now been officially released!

The initial set of binaries from which packages are built now weighs in at approximately 60 MiB, a quarter of what it used to be.

In a previous blog post we elaborate on why this reduction, and bootstrappability in general, is so important. One reason is to eliminate---or greatly reduce the attack surface of---a “trusting trust” attack. Last summer at the Breaking Bitcoin conference, Carl Dong gave a fun and remarkably gentle introduction, and at FOSDEM 2020 I also gave a short talk about this. If you choose to believe that building from source is the proper way to do computing, then it follows that the “trusting trust” attack is only a symptom of an incomplete or missing bootstrap story.

Further Reduced Binary Seed bootstrap

Last year, the first reduction removed the GCC, glibc and Binutils binary seeds. The new Further Reduced Binary Seed bootstrap, merged in Guix master last month, removes the “static-binaries tarball” containing GNU Awk, Bash, Bzip2, the GNU Core Utilities, Grep, Gzip, GNU Make, Patch, sed, Tar, and Xz. It replaces them by Gash and Gash Core Utils. Gash is a minimalist POSIX shell written in Guile Scheme, while Gash Core Utils is a Scheme implementation for most of the tools found in GNU Coreutils, as well as the most essential bits of Awk, grep and sed.

After three new GNU Mes releases with numerous Mes C Library updates and fixes, a major update of Gash and the first official Gash Utils release, and the delicate balancing of 17 new bootstrap source packages and versions, the bottom of the package graph now looks like this (woohoo!):

                              gcc-mesboot (4.9.4)
                                      ^
                                      |
                                    (...)
                                      ^
                                      |
               binutils-mesboot (2.14), glibc-mesboot (2.2.5),
                          gcc-core-mesboot (2.95.3)
                                      ^
                                      |
            bash-mesboot (2.05), bzip2-mesboot, gawk-mesboot (3.0.0)
       diffutils-mesboot (2.7), patch-mesboot (2.5.9), sed-mesboot (1.18)
                                      ^
                                      |
                             gnu-make-mesboot (3.82)
                                      ^
                                      |
                                gzip-mesboot (1.2.4)
                                      ^
                                      |
                                  tcc-boot
                                      ^
                                      |
                                  mes-boot
                                      ^
                                      |
                          gash-boot, gash-utils-boot
                                      ^
                                      |
                                      *
                 bootstrap-mescc-tools, bootstrap-mes (~12 MiB)
                            bootstrap-guile (~48 MiB)

full graph

We are excited that the NLnet Foundation has sponsored this work!

However, we aren't done yet; far from it.

Lost Paths

The idea of reproducible builds and bootstrappable software is not very new. Much of that was implemented for the GNU tools in the early 1990s. Working to recreate it in present time shows us much of that practice was forgotten.

Readers who are familiar with the GNU toolchain may have noticed the version numbers of the *-mesboot source packages in this great new bootstrap: They are ancient! That's a problem.

Typically, newer versions of the tool chain fix all kinds of bugs, make the software easier to build and add support for new CPU architectures, which is great. However, more often than not, new features are simultaneously introduced or dependencies are added that are not necessary for bootstrapping and may increase the bootstrap hurdle. Sometimes, newer tools are stricter, or old configure scripts do not recognise newer tool versions.

A trivial example is GNU sed. In the current bootstrap we are using version 1.18, which was released in 1993. Until recently the latest version of sed we could hope to bootstrap was sed-4.2.2 (2012). Newer releases ship as xz-compressed tarballs only, and xz is notoriously difficult to bootstrap (it needs a fairly recent GCC; try building that without sed).

Luckily, the sed maintainers (Jim Meyering) were happy to correct this mistake: starting from release sed-4.8 (2020), gzip-compressed tarballs will also be shipped. Similarly for the GNU Core Utils: releases made between 2011 and 2019 will probably be useless for bootstrapping. Confronted with this information, the coreutils maintainers (Pádraig Brady) were happy to release coreutils-8.32 in gzip compression as well, and will do so from now on.

Even these simple cases show that solving bootstrap problems can only be done together: For GNU it really is a project-wide responsibility that needs to be addressed.

Most bootstrap problems or loops are not so easy to solve and sometimes there are no obvious answers. While such puzzles can be delightful from a bootstrappability perspective, we would love to see the maintainers of GNU software consider bootstrappability and start taking more responsibility for the bootstrap story of their packages.

Towards a Universal, Full Source Bootstrap

Our next target will be a third reduction by ~50%; the Full-Source bootstrap will replace the MesCC-Tools and GNU Mes binaries by Stage0 and M2-Planet.

The Stage0 project by Jeremiah Orians starts everything from ~512 bytes; virtually nothing. Have a look at this incredible project if you haven’t already done so.

We are most grateful and excited that the NLnet Foundation has again decided to sponsor this work!

While the reduced bootstrap currently only applies to the i686-linux and x86_64-linux architectures, we are thrilled that ARM will be joining soon. The Trusted ARM bootstrapping work is progressing nicely: GNU Mes is now passing its entire mescc test suite on native ARMv7, and nearly its entire gcc test suite on native ARMv7. Work is underway to compile tcc using that GNU Mes. Adding this second architecture is a very important step towards the creation of a universal bootstrap!

Upcoming releases of Gash and Gash-Utils will allow us to clean up the bottom of the package graph and remove many of the “vintage” packages. In particular, the next version of Gash-Utils will be sophisticated enough to build everything up to gcc-mesboot using only old versions of GNU Make and Gzip. This is largely thanks to improvements to the implementation of Awk, which now includes nearly all of the standard features.

Looking even further into the future, we will likely have to remove the “vintage” GCC-2.95.3 that was such a helpful stepping stone and reach straight for GCC-4.6.4. Interesting times ahead!

About Bootstrappable Builds and GNU Mes

Software is bootstrappable when it does not depend on a binary seed that cannot be built from source. Software that is not bootstrappable---even if it is free software---is a serious security risk for a variety of reasons. The Bootstrappable Builds project aims to reduce the number and size of binary seeds to a bare minimum.

GNU Mes is closely related to the Bootstrappable Builds project. Mes aims to create an entirely source-based bootstrapping path for the Guix System and other interested GNU/Linux distributions. The goal is to start from a minimal, easily inspectable binary (which should be readable as source) and bootstrap into something close to R6RS Scheme.

Currently, Mes consists of a mutual self-hosting scheme interpreter and C compiler. It also implements a C library. Mes, the scheme interpreter, is written in about 5,000 lines of code of simple C. MesCC, the C compiler, is written in scheme. Together, Mes and MesCC can compile a lightly patched TinyCC that is self-hosting. Using this TinyCC and the Mes C library, it is possible to bootstrap the entire Guix System for i686-linux and x86_64-linux.

About GNU Guix

GNU Guix is a transactional package manager and an advanced distribution of the GNU system that respects user freedom. Guix can be used on top of any system running the kernel Linux, or it can be used as a standalone operating system distribution for i686, x86_64, ARMv7, and AArch64 machines.

In addition to standard package management features, Guix supports transactional upgrades and roll-backs, unprivileged package management, per-user profiles, and garbage collection. When used as a standalone GNU/Linux distribution, Guix offers a declarative, stateless approach to operating system configuration management. Guix is highly customizable and hackable through Guile programming interfaces and extensions to the Scheme language.

by Jan (janneke) Nieuwenhuizen at June 15, 2020 12:00 PM

June 05, 2020

Programming Praxis

2Max

Today’s exercise comes from Stack Overflow:

Given an array A consisting of N integers, return the maximum sum of two numbers whose digits add up to an equal sum. If there are not two numbers whose digits have an equal sum, the function should return -1. For example, A = [51, 71, 17, 42] would output 93 because there are two sets of numbers with the same digit-sum, (51, 42) with a digit-sum of 6 and (17, 71) with a digit-sum of 8, and the first pair has the maximum sum of two numbers of 93.

Your task is to write a program to calculate the requested maximum sum. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
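
A sketch of one approach using Guile's built-in hash tables: bucket the numbers by digit sum, then take the largest pair sum over every bucket with at least two members.

(define (digit-sum n)
  (let loop ((n n) (s 0))
    (if (zero? n) s (loop (quotient n 10) (+ s (remainder n 10))))))

(define (two-max a)
  (let ((groups (make-hash-table)))
    ;; Bucket the numbers by digit sum.
    (for-each (lambda (x)
                (let ((k (digit-sum x)))
                  (hash-set! groups k (cons x (hash-ref groups k '())))))
              a)
    ;; Best sum of the two largest members of any bucket, or -1.
    (hash-fold (lambda (k xs best)
                 (if (>= (length xs) 2)
                     (let ((sorted (sort xs >)))
                       (max best (+ (car sorted) (cadr sorted))))
                     best))
               -1 groups)))

(two-max '(51 71 17 42))   ; => 93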

by programmingpraxis at June 05, 2020 09:00 AM

June 03, 2020

Andy Wingo

a baseline compiler for guile

Greets, my peeps! Today's article is on a new compiler for Guile. I made things better by making things worse!

The new compiler is a "baseline compiler", in the spirit of what modern web browsers use to get things running quickly. It is a very simple compiler whose goal is speed of compilation, not speed of generated code.

Honestly I didn't think Guile needed such a thing. Guile's distribution model isn't like the web, where every page you visit requires the browser to compile fresh hot mess; in Guile I thought it would be reasonable for someone to compile once and run many times. I was never happy with compile latency but I thought it was inevitable and anyway amortized over time. Turns out I was wrong on both points!

The straw that broke the camel's back was Guix, which defines the graph of all installable packages in an operating system using Scheme code. Lately it has been apparent that when you update the set of available packages via a "guix pull", Guix would spend too much time compiling the Scheme modules that contain the package graph.

The funny thing is that it's not important that the package definitions be optimized; they just need to be compiled in a basic way so that they are quick to load. This is the essential use-case for a baseline compiler: instead of trying to make an optimizing compiler go fast by turning off all the optimizations, just write a different compiler that goes from a high-level intermediate representation straight to code.

So that's what I did!

it don't do much

The baseline compiler skips any kind of flow analysis: there's no closure optimization, no contification, no unboxing of tagged numbers, no type inference, no control-flow optimizations, and so on. The only whole-program analysis that is done is a basic free-variables analysis so that closures can capture variables, as well as assignment conversion. Otherwise the baseline compiler just does a traversal over programs as terms of a simple tree intermediate language, emitting bytecode as it goes.

Interestingly the quality of the code produced at optimization level -O0 is pretty much the same.

This graph shows the generated code performance of the CPS compiler relative to the new baseline compiler, at optimization level 0. Bars below the line mean the CPS compiler produces slower code. Bars above mean CPS makes faster code. Note that the Y axis is logarithmic.

The tests in which -O0 CPS wins are mostly because the CPS-based compiler does a robust closure optimization pass that reduces allocation rate.

At optimization level -O1, which adds partial evaluation over the high-level tree intermediate language and support for inlining "primitive calls" like + and so on, I am not sure why CPS peels out in the lead. No additional important optimizations are enabled in CPS at that level. That's probably something to look into.

Note that the baseline of this graph is optimization level -O1, with the new baseline compiler.

But as I mentioned, I didn't write the baseline compiler to produce fast code; I wrote it to produce code fast. So does it actually go fast?

Well against the -O0 and -O1 configurations of the CPS compiler, it does excellently:

Here you can see comparisons between what will be Guile 3.0.3's -O0 and -O1, compared against their equivalents in 3.0.2. (In 3.0.2 the -O1 equivalent is actually -O1 -Oresolve-primitives, if you are following along at home.) What you can see is that at these optimization levels, for these 8 files, the baseline compiler is around 4 times as fast.

If we compare to Guile 3.0.3's default -O2 optimization level, or -O3, we see bigger disparities:

Which is to say that Guile's baseline compiler runs at about 10x the speed of its optimizing compiler, which incidentally is similar to what I found for WebAssembly compilers a while back.

Also of note is that -O0 and -O1 take essentially the same time, with -O1 often taking less time than -O0. This is because partial evaluation can make the program smaller, at a cost of being less straightforward to debug.

Similarly, -O3 usually takes less time than -O2. This is because -O3 is allowed to assume top-level bindings that aren't exported from a module can be transformed to lexical bindings, which are more available for contification and inlining, which usually leads to smaller programs; it is a similar debugging/performance tradeoff to the -O0/-O1 case.

But what does one gain when choosing to spend 10 times more on compilation? Here I have a gnarly graph that plots performance on some microbenchmarks for all the different optimization levels.

Like I said, it's gnarly, but the summary is that -O1 typically gets you a factor of 2 or 4 over -O0, and -O2 often gets you another factor of 2 above that. -O3 is mostly the same as -O2 except in magical circumstances like the mbrot case, where it adds an extra 16x or so over -O2.

worse is better

I haven't seen the numbers yet of this new compiler in Guix, but I hope it can have a good impact. Already in Guile itself though I've seen a couple interesting advantages.

One is that because it produces code faster, Guile's bootstrap from source can take less time. There is also a felicitous feedback effect in that because the baseline compiler is much smaller than the CPS compiler, it takes less time to macro-expand, which reduces bootstrap time (as bootstrap has to pay the cost of expanding the compiler, until the compiler is compiled).

The second fortunate result is that now I can use the baseline compiler as an oracle for the CPS compiler, when I'm working on new optimizations. There's nothing worse than suspecting that your compiler miscompiled itself, after all, and having a second compiler helps keep me sane.

stay safe, friends

The code, you ask? Voici.

Although this work has been ongoing throughout the past month, I need to add some words on the now before leaving you: there is a kind of cognitive dissonance between nerding out on compilers in the comfort of my home, rain pounding on the patio, and at the same time the world on righteous fire. I hope it is clear to everyone by now that the US police are an essentially racist institution: they harass, maim, and murder Black people at much higher rates than whites. My heart is with the protestors. Godspeed to you all, from afar. At the same time, all my non-Black readers should reflect on the ways they participate in systems that support white supremacy, and on strategies to tear them down. I know I will be. Stay safe, wear eye protection, and until next time: peace.

by Andy Wingo at June 03, 2020 08:39 PM

June 02, 2020

Programming Praxis

Hidden Squares

We have a simple task today:

A number n may have squares hidden among its digits. For instance, in the number 1625649, the consecutive digits 1, 4, 9, 16, 25, 49, 64, 256 and 625 are all squares.

Your task is to write a program that takes a positive integer n and finds all of its hidden squares. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
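
A sketch of one approach: enumerate every substring of the decimal representation, convert each to a number, and keep the distinct perfect squares (handling of substrings with leading zeros is left as an exercise).

(use-modules (srfi srfi-1))   ; append-map, delete-duplicates, iota

(define (square? n)
  (call-with-values (lambda () (exact-integer-sqrt n))
    (lambda (root rem) (zero? rem))))

(define (hidden-squares n)
  (let* ((s   (number->string n))
         (len (string-length s)))
    (delete-duplicates
     (sort (filter square?
                   (append-map (lambda (i)
                                 (map (lambda (j)
                                        (string->number (substring s i j)))
                                      (iota (- len i) (+ i 1))))
                               (iota len)))
           <))))

(hidden-squares 1625649)   ; => (1 4 9 16 25 49 64 256 625)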

by programmingpraxis at June 02, 2020 09:00 AM

May 29, 2020

Programming Praxis

Decreasing-Increasing Array

Today’s task is somebody’s homework:

Given an array of integers, determine if it is in decreasing-increasing order, with some initial segment of the array in decreasing order and the remainder of the array in increasing order. If the array is in decreasing-increasing order, return the pivot element (the minimum element in the array); otherwise, return an indication the array is not in decreasing-increasing order. Array elements may be duplicated. For example, the array (10 10 10 8 8 6 4 4 3 12 13 22 31 40 59 68) is in decreasing-increasing order, with pivot element 3, but the array (1 2 4 8 12 32 64) is not in decreasing-increasing order.

The student who asked the question suggests a solution using binary search that takes O(log n) time, but can’t get it to work.

Your task is to write a program to determine if an array is in decreasing-increasing order as described above. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
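
The O(log n) binary search is the interesting part of the exercise, but as a reference point here is a hedged linear-time sketch; it reads the examples as requiring at least one actual decrease before the increase.

(define (decreasing-increasing a)
  ;; Return the pivot of A, or #f if A is not decreasing-then-increasing.
  (let skip ((a a) (fell? #f))
    (cond ((null? (cdr a))
           (and fell? (car a)))              ; never rose again
          ((>= (car a) (cadr a))             ; still in the decreasing run
           (skip (cdr a) (or fell? (> (car a) (cadr a)))))
          (fell?                             ; first rise: (car a) is the pivot
           (let ((pivot (car a)))
             (let rise ((a a))
               (cond ((null? (cdr a)) pivot)
                     ((<= (car a) (cadr a)) (rise (cdr a)))
                     (else #f)))))           ; fell again: not valid
          (else #f))))                       ; rose before ever falling

(decreasing-increasing '(10 10 10 8 8 6 4 4 3 12 13 22 31 40 59 68)) ; => 3
(decreasing-increasing '(1 2 4 8 12 32 64))                          ; => #f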

by programmingpraxis at May 29, 2020 09:00 AM

May 22, 2020

Programming Praxis

Prime Power Triples

Today’s exercise is Project Euler Problem 87:

The smallest number expressible as the sum of a prime square, prime cube, and prime fourth power is 28. In fact, there are exactly four numbers below fifty that can be expressed in such a way:

28 = 2² + 2³ + 2⁴
33 = 3² + 2³ + 2⁴
49 = 5² + 2³ + 2⁴
47 = 2² + 3³ + 2⁴

How many numbers below fifty million can be expressed as the sum of a prime square, prime cube, and prime fourth power?

Your task is to solve Project Euler 87; in the spirit of Project Euler, show only your code but not the solution. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
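
In that spirit, a sketch of code only. primes-below is an assumed sieve helper, and the loop bounds follow from 85⁴, 369³, and 7072² all exceeding fifty million:

;; primes-below is a hypothetical helper: (primes-below k) => primes < k.
(define (count-triples limit)
  (let ((seen (make-bitvector limit #f))
        (ps   (primes-below 7072))    ; 7072² > 5·10⁷
        (qs   (primes-below 369))     ; 369³  > 5·10⁷
        (rs   (primes-below 85)))     ; 85⁴   > 5·10⁷
    (for-each
     (lambda (r)
       (for-each
        (lambda (q)
          (let ((base (+ (expt q 3) (expt r 4))))
            (when (< base limit)
              (for-each
               (lambda (p)
                 (let ((n (+ (* p p) base)))
                   (when (< n limit)
                     (bitvector-set! seen n #t))))
               ps))))
        qs))
     rs)
    ;; Count the distinct sums recorded in the bit vector.
    (let loop ((i 0) (count 0))
      (if (= i limit)
          count
          (loop (+ i 1) (if (bitvector-ref seen i) (+ count 1) count))))))

;; (count-triples 50000000)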

by programmingpraxis at May 22, 2020 09:00 AM

May 19, 2020

Programming Praxis

MinStack

[ My apologies that this exercise is so long delayed. We have been rewriting all of our business practices in the light of the pandemic, and work has been madness. The biggest projects are now complete, and we have retreated to merely super-busy, so maybe I will have time to write some more exercises. I hope everyone is well; I am. ]

Today’s task is to implement a minstack data structure that provides the normal stack operations push, pop and top and also the operation least which returns the smallest item in the stack without altering the stack in any way. All four operations must operate in O(1) time and O(n) space, where n is the size of the stack.

Your task is to implement a minstack as described above. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
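
The classic trick is to keep a second stack of running minima alongside the items, so least is just a peek; a sketch using mutable pairs:

;; A minstack is a pair: (items . minima).  The minima stack records,
;; for every item, the smallest value at or below it, so every
;; operation is O(1).
(define (make-minstack) (cons '() '()))

(define (push! s x)
  (let ((m (if (null? (cdr s)) x (min x (cadr s)))))
    (set-car! s (cons x (car s)))
    (set-cdr! s (cons m (cdr s)))))

(define (top s)   (caar s))
(define (least s) (cadr s))

(define (pop! s)
  (let ((x (caar s)))
    (set-car! s (cdar s))
    (set-cdr! s (cddr s))
    x))

;; (define s (make-minstack))
;; (push! s 3) (push! s 1) (push! s 2)
;; (least s) ; => 1    (top s) ; => 2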

by programmingpraxis at May 19, 2020 09:00 AM

May 17, 2020

GNU Guix

GNU Shepherd user services

One of the things which sets Guix apart from other GNU/Linux distributions is that it uses GNU Shepherd instead of the now ubiquitous systemd. A side effect of this is that user systemd units do not work on Guix System. Love, hate or extremely ambivalent toward systemd, this means that users cannot rely on already written systemd unit files for their regular user-level services.

There are a couple of benefits to using GNU Shepherd, and not all of them are due to it already being installed on Guix. Becoming comfortable with using Shepherd and understanding how to write and edit Shepherd service configurations makes the transition from other GNU/Linux distributions to Guix System easier. More complex services with their own logic tree, using the full power of GNU Guile, are also possible. This means you can have one service that behaves differently if it's running on a different system or architecture without needing to call out to shell scripts or using minimally different service definitions.

The GNU Shepherd manual suggests putting all the services inside a monolithic init.scm file, located by default at $XDG_CONFIG_DIR/shepherd/init.scm. While this does make it easy to keep everything in one place, it creates one glaring issue: any change to the file means that all the services need to be stopped and restarted for the change to take effect.

Luckily there's a nice function called scandir hiding in ice-9 ftw which returns a list of all files in a specified directory (with options for narrowing down the list or sorting it). This means that our init.scm can contain a minimum of code and all actual services can be loaded from individual files.

First the minimal init.scm:

(use-modules (shepherd service)
             ((ice-9 ftw) #:select (scandir)))

;; Load all the files in the directory 'init.d' with a suffix '.scm'.
(for-each
  (lambda (file)
    (load (string-append "init.d/" file)))
  (scandir (string-append (dirname (current-filename)) "/init.d")
           (lambda (file)
             (string-suffix? ".scm" file))))

;; Send shepherd into the background
(action 'shepherd 'daemonize)

Let's take a sample service for running syncthing, as defined in $XDG_CONFIG_DIR/shepherd/init.d/syncthing.scm:

(define syncthing
  (make <service>
    #:provides '(syncthing)
    #:docstring "Run `syncthing' without calling the browser"
    #:start (make-forkexec-constructor
              '("syncthing" "-no-browser")
              #:log-file (string-append (getenv "HOME")
                                        "/log/syncthing.log"))
    #:stop (make-kill-destructor)
    #:respawn? #t))
(register-services syncthing)

(start syncthing)

As with any other shepherd service it is defined and registered, and in this case it will start automatically. When the file is loaded by shepherd after being discovered by scandir everything works exactly as though the service definition were located directly inside the init.scm.

Now let's make a change. Since syncthing already has a -logfile flag and built-in log rotation, that sounds better than using shepherd's #:log-file option. First we'll make our changes to the service:

(define syncthing
  (make <service>
    #:provides '(syncthing)
    #:docstring "Run `syncthing' without calling the browser"
    #:start (make-forkexec-constructor
              '("syncthing" "-no-browser"
                "-logflags=3" ; prefix with date & time
                "-logfile=/home/user/log/syncthing.log"))
    #:stop (make-kill-destructor)
    #:respawn? #t))
(register-services syncthing)

(start syncthing)

Now we stop syncthing:

$ herd stop syncthing

And we load the new service:

$ herd load root ~/.config/shepherd/init.d/syncthing.scm

This allows for quickly iterating on services without needing to stop all the services! Let's take a look at another service:

(define fccache
  (make <service>
    #:provides '(fccache)
    #:docstring "Run 'fc-cache -frv'"
    #:start (make-forkexec-constructor
              '("guix" "environment" "--ad-hoc" "fontconfig" "--"
                "fc-cache" "-frv")
              #:log-file (string-append (getenv "HOME")
                                        "/log/fccache.log"))
    #:one-shot? #t))

(register-services fccache)

In this example I want to refresh my font cache but I don't want to actually install fontconfig either system-wide or in my profile.

$ which fc-cache
which: no fc-cache in (/home/user/.config/guix/current/bin:/home/user/.guix-profile/bin:/home/user/.guix-profile/sbin:/run/setuid-programs:/run/current-system/profile/bin:/run/current-system/profile/sbin)
$ herd start fccache
Service fccache has been started.

Of course we can import other modules and leverage the code already written there. In this case, instead of using the string "guix environment --ad-hoc fontconfig -- fc-cache -frv", let's use the guix-environment function already available in (guix scripts environment):

(use-modules (guix scripts environment))

(define fccache
  (make <service>
    #:provides '(fccache)
    #:docstring "Run 'fc-cache -frv'"
    #:start (lambda () ; Don't run immediately when registered!
              (guix-environment "--ad-hoc" "fontconfig" "--" "fc-cache" "-frv"))
    #:one-shot? #t))

(register-services fccache)
$ herd load root ~/.config/shepherd/init.d/fccache.scm
Loading /home/user/.config/shepherd/init.d/fccache.scm.
$ herd start fccache
/gnu/store/hbqlzgd8hcf6ndcmx7q7miqrsxb4dmkk-gs-fonts-8.11/share/fonts: caching, new cache contents: 0 fonts, 1 dirs
/gnu/store/hbqlzgd8hcf6ndcmx7q7miqrsxb4dmkk-gs-fonts-8.11/share/fonts/type1: caching, new cache contents: 0 fonts, 1 dirs
/gnu/store/hbqlzgd8hcf6ndcmx7q7miqrsxb4dmkk-gs-fonts-8.11/share/fonts/type1/ghostscript: caching, new cache contents: 35 fonts, 0 dirs
/home/user/.guix-profile/share/fonts: caching, new cache contents: 0 fonts, 7 dirs
/home/user/.guix-profile/share/fonts/opentype: caching, new cache contents: 8 fonts, 0 dirs
/home/user/.guix-profile/share/fonts/otf: caching, new cache contents: 12 fonts, 0 dirs
/home/user/.guix-profile/share/fonts/terminus: caching, new cache contents: 18 fonts, 0 dirs
/home/user/.guix-profile/share/fonts/truetype: caching, new cache contents: 58 fonts, 0 dirs
/home/user/.guix-profile/share/fonts/ttf: caching, new cache contents: 12 fonts, 0 dirs
/home/user/.guix-profile/share/fonts/type1: caching, new cache contents: 0 fonts, 1 dirs
/home/user/.guix-profile/share/fonts/type1/ghostscript: caching, new cache contents: 35 fonts, 0 dirs
/home/user/.guix-profile/share/fonts/woff: caching, new cache contents: 1 fonts, 0 dirs
/run/current-system/profile/share/fonts: skipping, no such directory
/home/user/.local/share/fonts: skipping, no such directory
/home/user/.fonts: skipping, no such directory
/gnu/store/hbqlzgd8hcf6ndcmx7q7miqrsxb4dmkk-gs-fonts-8.11/share/fonts/type1: skipping, looped directory detected
/home/user/.guix-profile/share/fonts/opentype: skipping, looped directory detected
/home/user/.guix-profile/share/fonts/otf: skipping, looped directory detected
/home/user/.guix-profile/share/fonts/terminus: skipping, looped directory detected
/home/user/.guix-profile/share/fonts/truetype: skipping, looped directory detected
/home/user/.guix-profile/share/fonts/ttf: skipping, looped directory detected
/home/user/.guix-profile/share/fonts/type1: skipping, looped directory detected
/home/user/.guix-profile/share/fonts/woff: skipping, looped directory detected
/gnu/store/hbqlzgd8hcf6ndcmx7q7miqrsxb4dmkk-gs-fonts-8.11/share/fonts/type1/ghostscript: skipping, looped directory detected
/home/user/.guix-profile/share/fonts/type1/ghostscript: skipping, looped directory detected
/var/cache/fontconfig: not cleaning unwritable cache directory
/home/user/.cache/fontconfig: cleaning cache directory
/home/user/.fontconfig: not cleaning non-existent cache directory
fc-cache: succeeded
herd: exception caught while executing 'start' on service 'fccache':
Throw to key `quit' with args `(0)'.

The problem with this approach is that guix-environment returns the exit code of the programs it calls, while #:start expects a constructor that returns #t or #f, so there's some work to be done here.
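
One possible fix, going by the log above: the call escapes by throwing `quit' with the exit status, so the constructor can catch that and translate a zero status into #t. A hedged sketch, not a tested solution:

(use-modules (guix scripts environment))

(define fccache
  (make <service>
    #:provides '(fccache)
    #:docstring "Run 'fc-cache -frv'"
    ;; guix-environment escapes via `quit' with the exit status (see
    ;; the log above), so translate that into the boolean #:start wants.
    #:start (lambda ()
              (catch 'quit
                (lambda ()
                  (zero? (guix-environment "--ad-hoc" "fontconfig" "--"
                                           "fc-cache" "-frv")))
                (lambda (key . args)
                  (equal? args '(0)))))
    #:one-shot? #t))

(register-services fccache)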

This was just a quick peek into what's possible with GNU Shepherd when run as a user. Next time we'll take a look at integrating mcron to replicate some of systemd's timer functionality.

About GNU Guix

GNU Guix is a transactional package manager and an advanced distribution of the GNU system that respects user freedom. Guix can be used on top of any system running the kernel Linux, or it can be used as a standalone operating system distribution for i686, x86_64, ARMv7, and AArch64 machines.

In addition to standard package management features, Guix supports transactional upgrades and roll-backs, unprivileged package management, per-user profiles, and garbage collection. When used as a standalone GNU/Linux distribution, Guix offers a declarative, stateless approach to operating system configuration management. Guix is highly customizable and hackable through Guile programming interfaces and extensions to the Scheme language.

by Efraim Flashner at May 17, 2020 08:00 PM

May 15, 2020

Göran Weinholt

Quasiquote - Literal Magic

While I was writing a manpage for Scheme’s quasiquote, something I saw surprised me and changed my understanding of quasiquote. It turns out that a new language, with semantics that are interesting to PLT enthusiasts, hides behind the innocent backtick character. Starting with R6RS Scheme, quasiquote became total magic.

Background

It is not going to be easy to understand the argument in this article if you lack some background knowledge, so here’s a brief explanation of quasiquote, Scheme’s concept of locations, Scheme’s handling of literal constants, and finally referential transparency.

Briefly on quasiquote

Quasiquote is a language feature in Scheme that lets you write a template for a structure of lists and vectors. These templates are more like web templates than C++ templates; don’t let the terminology confuse you.

Basically you write a backtick character to start a template; the code immediately following the backtick is the template. You write a comma wherever you want to fill in some variable or other expression. (There’s also a list-splicing version of the comma, written ,@, which often comes in handy.)

The expression `(b "Hello " ,x) builds a list with these elements: the symbol b, the string "Hello " and lastly whatever value the variable x happens to hold. So perhaps (b "Hello " "John").

Quasiquote is very useful for building SXML and SHTML. Forget learning a new templating system every week; this one’s a keeper. But this being Scheme, the most popular use is likely writing S-expressions that represent code in some language. It’s used for just that in the nanopass framework.
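For example, here is a small SXML-style sketch (the element names and the item-list procedure are made up for illustration), showing both the plain comma and the splicing ,@:

(define (item-list title items)
  ;; ,title fills in one value; ,@ splices a whole list in place.
  `(ul (li (b ,title))
       ,@(map (lambda (i) `(li ,i)) items)))

(item-list "Fruits" '("apple" "pear"))
;⇒ (ul (li (b "Fruits")) (li "apple") (li "pear"))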

Location, location, location!

Making a language is a difficult job. Everything should ideally work smoothly together as a coherent whole, like pineapples on a pizza. Objects in Scheme programs implicitly refer to locations. There are many details around that which affect the whole language, and they have interesting consequences for quasiquote.

What’s a location? It’s just a place where you can store a value. The vector #(1 2 3) has three locations where values are stored; currently they hold the numbers 1, 2 and 3. If you make a vector using make-vector then the vector is mutable and you can change the values. Later when you see the same vector again it will still contain the new values.
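A minimal illustration:

(define v (make-vector 3 0)) ; three fresh, mutable locations
(vector-set! v 1 42)         ; store a new value in the second one
(vector-ref v 1)             ;⇒ 42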

In practice a location is some address in memory. The garbage collector might move it around, so its address changes, but it is still the same location.

Other objects in Scheme do not have locations. The number 1 does not have any locations that hold values you can change. This is thanks only to the wisdom, kindheartedness and foresight of the language designers; it is quite possible to design things differently.

As a consequence of numbers not having locations, there is also very little point in worrying about which number object you have. Suppose that numbers did have locations and you could store important information in them. You’d be very concerned that the 1 number object where you stored all your passwords is the same 1 that you now have in your secrets variable. (Nobody will ever think to look inside the number 1 for your passwords, so your secrets are safe there.) But number objects do not actually have locations, so it doesn’t matter if the Scheme implementation fiddles with them behind your back and gives you a different 1 object the next time you’re looking.

Literal constants

Pairs and vectors have locations, but the rules for their locations are much more relaxed if they are literal constants in the program code.

Constants in Scheme are allowed to have read-only locations. If you compile a program with Loko Scheme, you will notice that you get an assertion violation if you try to change any of the constants. From R6RS:

It is desirable for constants (i.e. the values of literal expressions) to reside in read-only memory. To express this, it is convenient to imagine that every object that refers to locations is associated with a flag telling whether that object is mutable or immutable. Literal constants, the strings returned by symbol->string, records with no mutable fields, and other values explicitly designated as immutable are immutable objects, while all objects created by the other procedures listed in this report are mutable. An attempt to store a new value into a location referred to by an immutable object should raise an exception with condition type &assertion.
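To make that concrete, here is a sketch; whether the exception is actually raised is implementation-dependent, but Loko is one implementation that enforces read-only constants:

(import (rnrs) (rnrs mutable-pairs))

(define (f) '(1 2 3))

;; F returns a literal constant, so storing into one of its
;; locations should raise an exception with condition &assertion.
(set-car! (f) 99)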

Literal constants can also share locations. If the same constant appears in different places in the program then the compiler is allowed to create a single shared instance of the constant, here explained as it applies to structures, in the section on eqv?:

Implementations may share structure between constants where appropriate. Thus the value of eqv? on constants is sometimes implementation-dependent.

The illustrative examples are:

(eqv? '(a) '(a))         ;⇒ unspecified
(eqv? "a" "a")           ;⇒ unspecified
(eqv? '(b) (cdr '(a b))) ;⇒ unspecified
(let ((x '(a)))
  (eqv? x x))            ;⇒ #t

So when it comes to literal constants, Scheme’s normal storage rules do not apply. A program might find that two different locations have become the same location, so that changing the value in a quoted vector ends up changing the value in another quoted vector. It’s also very likely that the program gets an exception when it tries to change the value in such a location. The last example shows that going the other way is not allowed: the compiler is not allowed to create two different versions of the list (a) in that program.

Referential transparency

Referential transparency is a concept that is important in purely functional programming languages. An expression that is referentially transparent can be replaced by the value it returns.

Some expressions in Scheme are referentially transparent: constants, references to variables that are never mutated, arithmetic in general, type predicates, and so on. A Scheme compiler is allowed to replace (+ 1 2) with 3. It doesn’t matter that the program might actually have returned a “different” 3 each time the expression runs. In the same way it doesn’t matter if the compiler turns two different constants into the same constant.

Most parts of Scheme are not referentially transparent. As an example, a Scheme compiler cannot replace (vector 1 2 3) with '#(1 2 3). The locations created by the vector procedure need to be fresh and mutable. But it can replace (vector-ref '#(1 2 3) 0) with 1, so this expression is referentially transparent. And as we previously saw, it can replace (cdr '(a b)) with '(b).
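Restated as code:

(+ 1 2)                  ; may be replaced by 3
(vector 1 2 3)           ; may not be replaced by '#(1 2 3):
                         ; the locations must be fresh and mutable
(vector-ref '#(1 2 3) 0) ; may be replaced by 1
(cdr '(a b))             ; may be replaced by '(b)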

All of this should be fairly widely known, but now comes the interesting part.

There is a crack in everything

So far I have explained Scheme’s notion of locations, referential transparency and that the rules are different for literal constants.

Behold this hidden gem in R6RS:

A quasiquote expression may return either fresh, mutable objects or literal structure for any structure that is constructed at run time during the evaluation of the expression. Portions that do not need to be rebuilt are always literal. Thus,

(let ((a 3)) `((1 2) ,a ,4 ,'five 6))

may be equivalent to either of the following expressions:

'((1 2) 3 4 five 6)

(let ((a 3))
   (cons '(1 2)
         (cons a (cons 4 (cons 'five '(6))))))

However, it is not equivalent to this expression:

(let ((a 3)) (list (list 1 2) a 4 'five 6))

This part of R6RS originally came from Kent Dybvig’s formal comment #204. The same type of language was adopted in R7RS.

The meaning is that a quasiquote expression can be turned into a literal, or parts of it may be turned into literals. Where there was code in the quasiquote expression, there can now be a literal. Going the other direction is not allowed: literals cannot be turned into code that returns fresh, mutable structure. But as the example '((1 2) 3 4 five 6) shows, a compiler is even allowed to propagate constants into quasiquote.

There is a very deep rabbit hole here! Have a look again: return […] literal structure for any structure that is constructed at run time during the evaluation of the expression.

There is now a way to construct literals from run-time code: code that constructs structure at run time can, in effect, be evaluated ahead of run time and its result frozen into a literal.

Literal magic

Let me demonstrate the power of Scheme’s magic quasiquote. Let --> mean “equivalent to”; the equivalence can be the result of macro expansion or of another compiler pass, such as a partial evaluator. Here is the original, innocuous-looking example:

(let ((a 3)) `((1 2) ,a ,4 ,'five 6))
;; -->
'((1 2) 3 4 five 6)

Can we get literal structure copied into the constant part? Easy:

(let ((a '(3))) `((1 2) ,a ,4 ,'five 6))
;; -->
'((1 2) (3) 4 five 6)

But we’re just getting started. Can we construct a structure at runtime and have that appear as a constant? Of course!

`((1 2) ,(list 3) ,4 ,'five 6)
;; -->
'((1 2) (3) 4 five 6)

Those Schemers who are paying attention will be thinking I’ve gone mad now. Maybe I have, but this example simply returned literal structure for the (list) structure that was constructed at run time during the evaluation of the expression, to paraphrase the quasiquote specification. Let’s increase the volume:

`((1 2) ,(map + '(1 1) '(2 3)) ,'five 6)
;; -->
'((1 2) (3 4) five 6)

Hang on, shouldn’t map return a fresh, mutable list? Not anymore; this is quasiquote. The map function constructs a list at run time during the evaluation of the quasiquote expression, so the structure no longer needs to be fresh. (Besides, the R6RS and R7RS definitions of map do not actually say that the list needs to be fresh and mutable, but everyone probably assumes it does.)

(letrec ((iota
          (lambda (n)
            (let lp ((m 0) (n n))
              (if (>= m n)
                  '()
                  (cons m (lp (+ m 1) n)))))))
  `((1 2) ,(iota 4) ,'five 6))
;; -->
'((1 2) (0 1 2 3) five 6)

Is this valid? I think most would say it isn’t; the list created by iota is constructed with cons, which returns fresh, mutable pairs. But this is happening inside quasiquote where the normal rules of society break down. The iota procedure constructs a list structure at runtime during the evaluation of a quasiquote expression, so a compiler is allowed to return literal structure for that list structure.
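If you accept that reading, it follows that mutating anything that comes out of quasiquote is unsafe, which foreshadows the conclusion below:

(import (rnrs) (rnrs mutable-pairs))

(define row `((1 2) ,(list 3) ,4 ,'five 6))

;; The (3) built by list may now be literal, possibly read-only and
;; shared structure, so this mutation may raise &assertion.
(set-car! (cadr row) 99)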

These go to 11

Let’s crank it up and make quasiquote transform code.

Compilers like Chez Scheme, Loko Scheme, Guile Scheme and many others use partial evaluators to perform source-level optimizations. A partial evaluator runs code symbolically. The output is a new program that is hopefully more efficient than the program that went into it.

The partial evaluators used by Scheme compilers are pretty mild as far as partial evaluators go, mostly because of the semantics of Scheme. Doing more powerful transformations on Scheme programs would require quite powerful static analysis, and that is both slow and difficult.

To get quasiquote to work with code, we need something that enables a partial evaluator to run a given procedure in such a way that it’s always inside a quasiquote expression. If we have such a procedure then the partial evaluator can start using the tricks described above and treat all code that constructs new structures as if those structures were literals. This makes the partial evaluator very happy, so here is happy:

;; Wrap PROC so that its application happens inside a quasiquote
;; expression, where the structures it builds may be treated as literal.
(define (happy proc)
  (lambda x
    `,(apply proc x)))

In a normal Scheme implementation this operator doesn’t do much more than maybe waste a little time and space. But in a Scheme that knows the magic nature of quasiquote, it would enable powerful program transformations on lists and vectors, without the need for as much analysis as would normally be required. In particular, it should no longer be necessary to analyze whether intermediate results are mutated, nor whether programs check results for pointer equality with eq?.

Here is an illustrative example of a potential program transformation, based on Philip Wadler’s 1990 paper Deforestation: Transforming programs to eliminate trees:

(define (upto m n)
  (if (> m n)
      '()
      (cons m (upto (+ m 1) n))))

(define (square x)
  (* x x))

(define (sum xs)
  (let sum* ((a 0) (xs xs))
    (match xs
      [() a]
      [(x . xs) (sum* (+ a x) xs)])))

(define (squares xs)
  (match xs
    [() '()]
    [(x . xs) (cons (square x) (squares xs))]))

(define f
  (happy (lambda (n) (sum (squares (upto 1 n))))))
;; -->
(define f
  (lambda (n)
    (letrec ((h0 (lambda (u0 u1 n)
                   (if (> u1 n)
                       u0
                       (h0 (+ u0 (square u1)) (+ u1 1) n)))))
      (h0 0 1 n))))

After deforestation (or fusion), the intermediate lists used in f have been eliminated. This is beneficial in that you can write high-level code, but still have the compiler produce the efficient loop you would have had to write by hand. Scheme compilers normally don’t do these transformations due to the required analysis.

The happy operator does not completely open the barn doors: the transformation must still preserve the program’s other side effects, such as I/O and exceptions.

The fly in the ointment

Imagine a program written according to this template:

(define (main)
  ...)

(define main* (happy main))

(main*)

What are the consequences for the main program? It would seem that everything in it now follows the rules of quasiquote, so it can’t use cons in the normal way. This is bad news for the main program.

Conclusions

I don’t know where this leads. What is the precise limit for what a compiler can and can’t turn into literal structure in quasiquote? The example with the main program makes it seem that quasiquote actually gives the compiler a bit too much freedom.

Perhaps it’s actually just poor wording, in which case R6RS and R7RS will get a new erratum that clarifies what is and what isn’t allowed. I suspect that this is the most likely outcome.

But that doesn’t stop someone who is working on a partial evaluator or another program transformation from proposing a happy operator as an SRFI, giving it semantics that enable even more powerful transformations without the need to rely on language lawyering.

There is one conclusion I can draw from all this: don’t assume that what comes out of quasiquote can be mutated.

by weinholt at May 15, 2020 12:00 AM