Planet Scheme

January 28, 2020

Programming Praxis

Non-Abundant Sums

It’s been a long time since we did an exercise from Project Euler; here is number 23:

A perfect number is a number for which the sum of its proper divisors is exactly equal to the number. For example, the sum of the proper divisors of 28 would be 1 + 2 + 4 + 7 + 14 = 28, which means that 28 is a perfect number.

A number n is called deficient if the sum of its proper divisors is less than n and it is called abundant if this sum exceeds n.

As 12 is the smallest abundant number, 1 + 2 + 3 + 4 + 6 = 16, the smallest number that can be written as the sum of two abundant numbers is 24. By mathematical analysis, it can be shown that all integers greater than 28123 can be written as the sum of two abundant numbers. However, this upper limit cannot be reduced any further by analysis even though it is known that the greatest number that cannot be expressed as the sum of two abundant numbers is less than this limit.

Find the sum of all the positive integers which cannot be written as the sum of two abundant numbers.

Your task is to solve Project Euler 23; in the spirit of Project Euler, please do not display the solution. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
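
As a starting point (and not the solution itself), here is a small Scheme sketch of the definitions above; the helper names are illustrative:
;; Sum of the proper divisors of n (assumes n > 1), by trial division.
(define (sum-of-proper-divisors n)
  (let loop ((d 2) (sum 1))
    (cond ((> (* d d) n) sum)
          ((not (zero? (remainder n d))) (loop (+ d 1) sum))
          ((= d (quotient n d)) (loop (+ d 1) (+ sum d)))
          (else (loop (+ d 1) (+ sum d (quotient n d)))))))

(define (perfect? n)  (= (sum-of-proper-divisors n) n))
(define (abundant? n) (> (sum-of-proper-divisors n) n))

;; (sum-of-proper-divisors 28) => 28, so (perfect? 28) => #t
;; (sum-of-proper-divisors 12) => 16, so (abundant? 12) => #t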

by programmingpraxis at January 28, 2020 10:00 AM

January 26, 2020

Joe Marshall

The pros and cons of Agile

Agile methodology is currently the popular way of attempting to develop commodity software in a factory-like manner.  It's a little hard to define exactly what is and isn't Agile.  I've worked in several companies, and with several groups within companies, that all claimed to be Agile, yet they did things in very different ways.  But they all seemed to agree on a few things, and this, I suppose, could be the spirit of Agile, if not the letter.

The basic characteristic of Agile is that you break the software down into smaller and smaller tasks until you reach tasks that a small (2-8 person) team of engineers can comfortably complete in a short (1-2 week) time frame.  Then at the end of the time frame, you supposedly have some software that “works” in some sense of the word.  It exhibits basic functionality and the general gist of what the customer wanted, even if it doesn't satisfy all the requirements for the entire project. Then over the next several time periods of development, the software is iteratively improved by fleshing out remaining requirements, adding needed functionality, hooking components together, etc. During this time, the customer is involved to make sure that what is being delivered is actually what is needed.

Some projects I worked on did this formally, by the book, and followed strict guidelines for the methodology; others just winged it; but all had the basic characteristics above.

One of the primary advantages of Agile is its usefulness to management.  By having the large problem broken down into team-sized, biweekly pieces, management can easily allocate resources and track resource usage and progress.  They can treat a team as a black box of software development and assign tasks as they arise.  Management can attempt to measure the performance of a team and see whether it is increasing, decreasing, or remaining steady.  Teams are what managers like to manage.

Another advantage is frequent feedback from the customer.  Since after each time period a somewhat working version of some fragment of the product is available for demonstration, the customer can give feedback as to how and whether it meets his needs.  He can offer suggestions about what might be improved and what features he needs to make the product at least minimally useful to him, and he can keep development from getting off track.

But Agile is not a panacea.  There is still a significant amount of software produced with the traditional “waterfall” methodology, where the specification is completed before coding begins, integration is done as a final step, and only then is the software released to the customer.  There is also a fair amount of software written “artistically”. I would define artistic software as that written by a single developer working alone over a period of several months. Frequently, such a project never gets beyond the hobbyist stage, and as such it is a risky approach to writing production code. But on occasion, an artistic project can turn into something novel and useful. It more often exhibits a unity of vision and coherence that is harder to find in software written by groups of people. (Which isn't to say that small groups cannot write software with unity of vision and coherence; it's just harder. Or they'll have one particular person in the group who has more insight than the others.)

Managers aren't as useful to artistic developers. Artistic developers tend to manage themselves. And you cannot swap out one developer for another without swapping out the entire project with him. A manager can work with an artistic developer as a peer, and help manage the project, but cannot manage the developer.

Frequent customer involvement has its pros and cons as well. Often customers have difficulty imagining anything beyond incremental improvements to the current ways of doing things. They'll want a UI widget that will make some task slightly easier, but not think of automating the task altogether. They'll want to automate a series of inefficient tasks when a different viewpoint of the end result might make those intermediate tasks irrelevant. Customers are not impressed with changes to the code that don't produce visible effects. You may have spent a week refactoring in order to make it trivial to add new commands and new outputs, but customers don't care. Customers don't care about potential use cases; they care about their specific use case to the exclusion of everything else. This can be discouraging to developers.

Because Agile is so useful to managers, big and intermediate-sized companies will continue to use it to develop commodity software in a factory-like style. It isn't going to be replaced any time soon. But there is still ample room in the market for small companies and individuals with vision to carve out niches that Agile methodologies will overlook and find tricky to fill.

(But I'm a romantic at heart, and I like the image of the lone hacker crafting software on his home computer in his room late at night. If only it were easy to make a living that way.)

by Joe Marshall (noreply@blogger.com) at January 26, 2020 01:59 PM

January 24, 2020

GNU Guix

Guile 3 & Guix

Version 3.0 of GNU Guile, an implementation of the Scheme programming language, was released just last week. This is a major milestone for Guile, which gets compiler improvements and just-in-time (JIT) native code generation, leading to significant performance improvements over 2.2. It’s also great news for all the users of Guile, and in particular for Guix!

Guile 3 logo.

This post discusses what it means for Guix to migrate to Guile 3 and how that migration is already taking place.

Guile in Guix

Most users interact with Guix through its command-line interface, and we work hard to make it as approachable as possible. As any user quickly notices, Guix uses the Scheme programming language uniformly for its configuration—from channels to manifests and operating systems—and anyone who starts packaging software knows that package definitions are in fact Scheme code as well.

This is a significant departure from many other tools, and in particular from Nix. While Nix defines several domain-specific languages (DSLs) for these aspects—the Nix language but also specific configuration languages—Guix chooses Scheme as the single language for all this, together with the definition of high-level embedded domain-specific languages (EDSLs).

It goes beyond that: in Guix System, all the things traditionally implemented in C or as a set of Perl or shell scripts are implemented in Scheme. That includes the init system, package builds, the initial RAM disk (initrd), system tests, and more. Because this leads to several layers of Scheme code, executed at different points in time, Guix includes a code staging mechanism built upon the nice properties of Scheme.
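
For a flavor of that staging mechanism, here is a minimal sketch using Guix's G-expressions, assuming the (guix gexp) module: code written with #~ is not evaluated on the spot but carried around as data, to be run later, typically inside an isolated build environment, with #$ references filled in at that later stage:
(use-modules (guix gexp))

;; Staged code: nothing below runs now; it is packaged up to run later,
;; typically inside a build container, where #$output denotes the build's
;; output directory.
(define build-script
  #~(begin
      (mkdir #$output)
      (call-with-output-file (string-append #$output "/greeting")
        (lambda (port)
          (display "built by staged code\n" port)))))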

Why do that? The arguments, right from the start, were twofold: using a general-purpose language allows us to benefit from its implementation tooling, and having interfaces for “everything” in Scheme makes it easy for users to navigate their distro or OS code and to reuse code to build new features or applications. Guix developers benefit from the ease of code reuse every day; demonstrative examples include the use of Guix container facilities in the init system, the development of many tools providing facilities around packages, the implementation of additional user interfaces, and work on applications that use Guix as a library such as the Guix Workflow Language and Guix-Jupyter.

As for the benefits of the host general-purpose language, these are rather obvious: Guix developers benefit from an expressive language, an optimizing compiler, a debugger, a powerful read-eval-print loop (REPL), an interactive development environment, and all sorts of libraries. Moving to Guile 3 should add to that better performance, essentially for free. To be comprehensive, Guile 3 may well come with a set of brand new bugs too, but so far we seem to be doing OK!

Migrating to Guile 3

What does it mean for Guix to migrate to Guile 3? We’ve seen above different ways in which Guix relies on Guile. In short, we can say that migration is threefold:

  1. Guix is a distro that ships Guile-related packages. Like any other distro, it will have to upgrade its guile package to 3.0 and to ensure that packages that depend on it are updated as well.
  2. Guix is a program written in Guile. As such, we need to make sure that all its dependencies (half a dozen of Guile libraries) work with Guile 3 and that Guix itself runs fine with Guile 3.
  3. Guix ties together operating system components. In particular, the init system (the Shepherd) and other boot-time facilities will also migrate.

The packages

Updating the distro is the boring part, but it’s best to get it right. Guix makes it possible to have unrelated versions or variants of packages in different environments or different profiles, which is very nice. We’ll have performed a smooth transition if users and tools see that the packages named guile and guile-ssh (say) transparently move from Guile 2.2 to 3.0, in lockstep.

Put differently, most of the upgrade work upon a programming language version bump deals with conventions, and in particular package names. Currently, guile corresponds to the 2.2 stable series and all the guile-* packages are built against it. In the meantime, the package for Guile 3 is named guile-next and packages built against it are called guile3.0-*. Over the last few weeks we created guile3.0- variants for most Guile packages, something that’s easily achieved with Guix.
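
For instance, a guile3.0- variant can often be written by inheriting from the existing 2.2-based package. The snippet below is a hypothetical sketch (the package names are illustrative rather than actual definitions from the Guix tree, except guile-next, the Guile 3 package mentioned above):
(define-public guile3.0-example-lib
  (package
    (inherit guile-example-lib)          ; hypothetical package built against Guile 2.2
    (name "guile3.0-example-lib")
    (inputs `(("guile" ,guile-next)))))  ; build against Guile 3 instead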

The big switch will consist in renaming all current guile-* packages to guile2.2-* packages, for use with the legacy 2.2 series, and renaming all the guile3.0-* packages to guile-*. We will switch soon, but before getting there, we’re making sure important packages are available for 3.0.

Guix-the-program

A more interesting part is “porting” Guix itself from Guile 2.2 to Guile 3. It seems that developers have become wary of 2-to-3 transitions for programming languages. Fear not! Switching from Guile 2 to Guile 3 turned out to be an easy task. In fact, very little changed in the language itself; what did change—e.g., semantics on fine points of the module system, support for structured exceptions—is either optional or backwards-compatible.

As Guile 2.9 pre-releases trickled in, we started testing all the Guile libraries Guix relies on against 2.9. For the vast majority of them, all we had to do was to update their configure.ac to allow builds with 3.0.

Guix itself was a bit more work, mostly because it’s a rather large code base with a picky test suite. The bit that required most work has to do with the introduction of declarative modules, an optional semantic change in modules to support more compiler optimizations. We had several “white-box tests” where tests would happily peek at private module bindings through the magical-evil @@ operator. Because we chose to enable declarative modules, we also had to adjust our tests to no longer do that. And well, that’s about it!

At that point, we were able to create a guile3.0-guix package variant, primarily for testing purposes. Soon after, we told guix pull to build Guix with 3.0 instead of 2.2. Thus, Guix users who upgrade will transparently find themselves running Guix on Guile 3.0.

The main benefit is improved performance. Guile 3 is known to be up to 32 times faster than Guile 2.2 on some micro-benchmarks. Assessing the performance gains on a “real-world” application like Guix is the real test. What would be a relevant benchmark? At its core, Guix is essentially a compiler from high-level descriptions of packages, operating systems, and the like, to low-level build instructions (derivations). Thus, a good benchmark is a command that exercises little more than this compilation step:

guix build libreoffice ghc-pandoc guix --dry-run --derivation

or:

guix system build config.scm --dry-run --derivation

On x86_64, the guix build command above on Guile 3 is 7% faster than on Guile 2.2, and guix system build, which is more computing-intensive, is 10% faster (heap usage is ~5% higher). This is lower than the skyrocketing speedups observed on some microbenchmarks, but it’s probably no surprise: these guix commands are short-lived (a couple of seconds) and they’re rather I/O- and GC-intensive—something JIT compilation cannot help with.

On 32-bit ARM, we temporarily disabled JIT due to a bug; there we observe a slight slowdown compared to 2.2. This can be explained by the fact that virtual machine (VM) instructions in 3.0 are lower-level than in 2.2 and will hopefully be more than compensated for when JIT is re-enabled.

Gluing it all together

The last part of the Guile 3 migration has to do with how Guix, and in particular Guix System, glues things together. As explained above, Guix manipulates several stages of Scheme code that will run at different points in time.

Firstly, the code that runs package builds, such as the one that runs ./configure && make && make install, is Guile code. Currently that code runs on Guile 2.2, but on the next major rebuild-the-world upgrade, we will switch to Guile 3.

Additionally, Guix produces Scheme code consumed by the Shepherd, by GNU mcron, and for the graphical installer. These will soon switch to Guile 3 as well. This kind of change is made easy by the fact that both the package definitions and the staged code that depends on those packages live in the same repository.

Long live Guile 3!

Migrating Guix to Guile 3 is a bit of work because of the many ways Guix interacts with Guile and because of the sheer size of the code base. For a “2-to-3” transition though, it was easy. And fundamentally, it remains a cheap transition compared to what it brings: better performance and new features. That’s another benefit of using a general-purpose language.

Thumbs up to everyone involved in its development, and long live Guile 3!

About GNU Guix

GNU Guix is a transactional package manager and an advanced distribution of the GNU system that respects user freedom. Guix can be used on top of any system running the kernel Linux, or it can be used as a standalone operating system distribution for i686, x86_64, ARMv7, and AArch64 machines.

In addition to standard package management features, Guix supports transactional upgrades and roll-backs, unprivileged package management, per-user profiles, and garbage collection. When used as a standalone GNU/Linux distribution, Guix offers a declarative, stateless approach to operating system configuration management. Guix is highly customizable and hackable through Guile programming interfaces and extensions to the Scheme language.

by Ludovic Courtès at January 24, 2020 03:00 PM

Programming Praxis

Mertens’ Conjecture

Dr Holly Krieger discussed Mertens’ Conjecture on Numberphile yesterday:

Mertens’ Conjecture: The absolute value of the Mertens function M(n), computed as the sum for k from 1 to n of the Moebius function μ(k), is less than the square root of n. The Moebius function μ(n) is (-1)^k, where k is the number of prime factors of n, but 0 if n has any repeated prime factors.

The conjecture has been proved false, though no counter-examples are known. You can read more about Mertens’ Conjecture at MathWorld or Wikipedia.

Your task is to write a program to compute Mertens’ function M(n) and use it to explore some of the sequences at OEIS. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
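
For a quick start, here is one straightforward Scheme sketch (helper names are illustrative): mu computes the Moebius function by trial division, and mertens sums it from 1 to n:
(define (mu n)
  (let loop ((n n) (d 2) (k 0))
    (cond ((= n 1) (if (even? k) 1 -1))
          ((> (* d d) n) (if (even? k) -1 1))  ; one prime factor remains
          ((zero? (remainder n d))
           (if (zero? (remainder (quotient n d) d))
               0                               ; repeated prime factor
               (loop (quotient n d) (+ d 1) (+ k 1))))
          (else (loop n (+ d 1) k)))))

(define (mertens n)
  (let loop ((k 1) (sum 0))
    (if (> k n) sum (loop (+ k 1) (+ sum (mu k))))))

;; (map mu '(1 2 3 4 5 6))    => (1 -1 -1 0 -1 1)
;; (map mertens '(1 2 3 4 5)) => (1 0 -1 -1 -2)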

by programmingpraxis at January 24, 2020 10:00 AM

January 22, 2020

Joe Marshall

But what if you really want to push a stack frame?

If you really don't want tail recursion, the solution is simple: don't put the call in “tail position”. We define a rather trivial function dont-tail-call and use it like this:
(dont-tail-call
  (foo x))
The semantics of Scheme are that the arguments are evaluated before the call to the function, so the call to foo is required to occur before the call to dont-tail-call, which necessitates the allocation of a continuation.

But what about a really clever compiler that somehow “optimizes” away the call to dont-tail-call? Such an “optimization” is highly questionable because it changes the observable semantics of the program so it is no longer call-by-value. A compiler that performed that transformation wouldn't be a Scheme compiler because Scheme is a call-by-value language.

But what if we had a really clever compiler and we disregard the change in semantics? We can easily thwart the compiler by deferring the definition of dont-tail-call until run time. Even the cleverest compiler cannot see into the future.

The definition of dont-tail-call is left to the reader, as is how to defer its definition until run time.
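
For concreteness, here is one minimal sketch of such a definition; an identity function suffices, because Scheme must evaluate (foo x) before applying dont-tail-call (the run-time deferral really is left to the reader):
;; The argument is evaluated first, so the call to foo is not in tail position.
(define (dont-tail-call x) x)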

by Joe Marshall (noreply@blogger.com) at January 22, 2020 03:28 AM

January 21, 2020

Joe Marshall

Afraid of Tail Recursion

It's well known fact among proponents of tail recursion that some people just don't get it. They view tail recursion at best as a quirky compiler optimization that turns some recursive calls into loops. At worst, they see it as some sort of voodoo, or a debugging pessimization. They see little value in it. Some have outright disdain for it.

Tail recursion isn't just about turning recursive calls into loops. It's about changing how you look at function calling. Tail recursion just happens to fall out of this new viewpoint.

Most programmers, I think, view function calls as if they were akin to a short vacation. You pack up the arguments in your luggage, travel to the destination, unpack your luggage, do some stuff, repack your luggage with some souvenirs, return home, unpack everything and resume life where you left off. Your souvenirs are the return value.

Should you need a vacation from your vacation, you do the same thing: pack up the arguments in your luggage, travel to your new destination, unpack your luggage, do some stuff, repack your luggage with some souvenirs, return to your original vacation spot and resume your original vacation.

Tail recursion aficionados realize that the journey itself is the important part of the function call, and that a vacation includes two journeys. On the first journey you pack up the arguments, including the return ticket, in your luggage, use the outbound ticket to journey to the destination, unpack your luggage, and start doing stuff. When you run out of stuff to do, you make the second journey. You fetch the return ticket, repack your luggage, take the ticket to wherever it leads (presumably back home), unpack everything, and resume doing whatever you were doing there.

But perhaps you want to visit grandma instead of going directly home. Then we change the script slightly. When you run out of things to do on your vacation, you pack up your luggage with your souvenirs and the return ticket, then you journey to grandma's house, where you unpack and start doing stuff. Eventually you are done visiting grandma, so then you fetch the return ticket, repack your luggage, take the ticket to wherever it leads, unpack everything, and resume doing stuff there. It's a three-legged journey. You don't go from grandma's back to the vacation resort — there's nothing left for you to do there. You take the return ticket directly home.

Viewing things this way, a function call involves packaging the arguments in a suitable way, deallocating any temporary storage, and then making an unconditional transfer to the function, where we unpack the arguments and resume execution of the program. It is simply “a goto that passes arguments”.*

A function return is simply “a goto that passes a return value”. It involves packaging the return value in a suitable way, deallocating any temporary storage, and then making an unconditional transfer to the return address, where we resume execution of the program.

A tail recursive function call is simply “a goto that passes arguments”. It involves packaging the arguments in a suitable way, deallocating any temporary storage and then making an unconditional transfer to the function, where we resume execution of the program.

Do we really deallocate temporary storage before every control transfer? Certainly a return pops the topmost stack frame, and as often implemented, a tail recursive function call deallocates its stack frame or replaces it before transferring control. But a non-tail-recursive call? It does so as well; it's just that it also has to pack those values into a new continuation for the return trip. We use an implementation trick to avoid the absurdity of actually moving these values around: we move the base-of-frame pointer instead. Voila, we simultaneously deallocate the stack frame and allocate the continuation with the right values already in place.

Deallocating storage before each control transfer is an important part of the protocol. We're making an unconditional transfer to a destination with the hope, but no guarantee, that we'll come back, so we'd better tidy up before we leave. This ensures that we won't leave a trail of garbage behind us on each transfer, which would accumulate and adversely affect the space complexity of our program.

Once you view a function call and a function return not as a single sequence, but as two separate and virtually identical sequences, then tail recursion becomes a natural consequence. Tail recursion isn't a special case of function call; it is the same thing as a function call, the only difference being whether a new continuation (the "return ticket") is allocated in order to come back. Even function returns are the same thing, the only difference being that the destination is (usually) determined dynamically rather than statically. Tail recursion isn't just another optimization; it's the result of treating inter-procedural control transfer symmetrically.

Another natural consequence is greatly increased options for control flow. Instead of a strict call/return pattern, you can make "three-legged" trips, or however many legs you need. You can make loops that incorporate one, two, or even a dynamically changing number of functions. You can replace compiler-generated returns with user-provided function calls (continuation-passing style) and implement arbitrarily complex control and data flow like multiple return values, exception handling, backtracking, coroutines, and patterns that don't even have names yet. And of course you can mix and match these patterns with the standard call and return pattern as well.
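
Here is a minimal sketch of continuation-passing style in Scheme, where the “return ticket” is an explicit argument and every transfer, including the “return”, is just a tail call (the procedure names are illustrative):
;; safe-divide "returns" by calling one of two continuations.
(define (safe-divide n d win lose)
  (if (zero? d)
      (lose "division by zero")   ; go to the error continuation
      (win (/ n d))))             ; go to the normal continuation

;; average-cps passes its own return ticket k along for the normal case.
(define (average-cps xs k)
  (safe-divide (apply + xs) (length xs)
               k
               (lambda (msg) (k 'undefined))))

;; (average-cps '(1 2 3) (lambda (v) v)) => 2
;; (average-cps '()      (lambda (v) v)) => undefined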

The phrase "tail recursion" is shorthand for this symmetric view of interprocedural control flow and is meant to encompass all these consequences and control flow options that spring from it. It's not about simply turning recursive functions into loops.

People who are afraid of tail recursion seem unable to see any value in taking up the symmetric viewpoint despite the fact that it opens up a whole new set of control flow techniques (in particular continuation-passing style). They find the notion that a procedure call is “a goto that passes arguments” “nonsensical”. A lot of good research has come from taking this nonsense seriously.


*The view that a function call is simply “a goto that passes arguments” was developed by Steele in his “Lambda papers”.

The important point of cleaning up before the control transfer was formalized by Clinger in “Proper Tail Recursion and Space Efficiency”.

Someone — it might have been Clinger, but I cannot find a reference — called tail recursion “garbage collection for the stack”. The stack, being so much more limited in size than the heap, needs it that much more. Indeed Clinger notes the tight connection between tail recursion and heap garbage collection and points out that heap garbage collection is hampered if the stack is retaining pointers to logically dead data structures. If the dead structures are large enough, garbage collection can be rendered useless. Yet many popular languages provide garbage collection but not tail recursion.

The only difference between a call and return is that typically the call is to a statically known location and the return address is dynamically passed as a "hidden" argument. But some compilers, like Siskind's Stalin compiler, statically determine the return address as well.

The only difference between a function call and a tail recursive function call is when you need to return to the caller to complete some work. In this case, the caller needs to allocate a new continuation so that control is eventually returned. If there is no further work to be done in the caller, it doesn't create a new continuation, but simply passes along the one that was passed to it.

Many compilers have been written that handle function calls, tail recursive function calls, and returns identically. They only change what code they emit for handling the continuation allocation. These compilers naturally produce tail recursive code.

Most machines provide special purpose support for a LIFO stack. It is tempting to use the stack for allocation of continuations because they are almost always allocated and deallocated in LIFO order, and a stack gives higher performance when this is the case. Many compilers do in fact use the stack for continuations and argument passing. Some, like Winkelmann's Chicken compiler, follow Baker's suggestion and treat the stack as a "nursery" for the heap. Others avoid using the stack altogether because of the extra complexity it entails. And others cannot use the stack because of constraints placed on stack usage by OS conventions or the underlying virtual machine model.

by Joe Marshall (noreply@blogger.com) at January 21, 2020 12:38 PM

Afraid of Recursion

Here's a trick I learned several years ago for transferring time-series data. In the case in question, I needed to transfer a bunch of timestamped records, but the server had a quirk to it. If you asked for too many records at once, it would simply return an error code and give up on your request. There was no way to know beforehand how many records might exist in any given time span, so you could get an error code on nearly any request, unless it was for a very short time span. On the other hand, many requests for long time spans would succeed because they had few records in them. Despite this quirk, the code was really simple:
List<Record> xfer (Timestamp start, Timestamp end) {
    try {
        return tryFetchRecords(start, end);
    } catch (TooManyRecordsException e) {
        // Too many records: bisect the time span and recurse on each half.
        Timestamp mid = (start + end)/2;   // midpoint of the span (pseudocode for Timestamp arithmetic)
        List<Record> firstHalf = xfer (start, mid);
        List<Record> secondHalf = xfer (mid, end);
        firstHalf.addAll(secondHalf);      // addAll returns a boolean, not the combined list
        return firstHalf;
    }
}
On any request, if the server returned the error code, we would simply bisect the time span, recursively ask for each half separately, and combine the two halves. Should the bisected time span still contain too many records, the time span would be bisected again. The recursion would continue until the time span was small enough that the server could honor the request and return some records. The recursion would then unwind, combining the returned records into larger and larger lists until we had all the records for our original time span. Since the time span would be cut in half on each recurrence, the depth of recursion would be proportional to the logarithm (base 2) of the total number of records, which would be a reasonably small number even with an enormous number of records.

It's certainly possible to avoid recursion and do this iteratively by some form of paging, but the code would be slightly more complex. The server is not very cooperative, so there is no easy way to determine an appropriate page size beforehand, and the server doesn't support a “paging token” to help keep track of progress. The recursive solution finds an appropriate transfer size by trial and error, and keeps track of progress more or less automatically. An iterative paging solution would have to do these things more explicitly and this would make the iterative code a bit more complex. And why add any complexity when it isn't really necessary?

I thought this solution was really cool when I first saw it. I've used this trick for transferring time series data many times. It makes the server very simple to write because the protocol requires so little of it. It simply has to refuse to answer requests that return too many results. The client code is just about the 10 lines above.

But when I first suggest this code to people I usually get “push back” (that is, until they see it work in action, then they usually get on board with it). People seem unsure about the use of recursion and want a more complex protocol where the client and server negotiate a page size or cooperatively pass a paging token back and forth on each request and response. Their eyes glaze over as soon as they see the recursive call. They want to avoid recursion just because it's recursive.

I've seen “aversion to recursion” happen in a lot of circumstances, not just this one. Recursion isn't the solution to everything. No tool solves all problems. But it is an important tool that often offers elegant solutions to many problems. Programmers shouldn't be afraid of using it when it is appropriate.

by Joe Marshall (noreply@blogger.com) at January 21, 2020 11:25 AM

Programming Praxis

Mini Markdown

Today’s exercise is an interview question. The situation is an interviewee at home, sharing his screen with an interviewer at his office. The interviewee has one hour to write the following program while the interviewer watches:

Write a program that takes a text file as input and writes an html file as output. The input consists of one or more paragraphs (maximal lines of text separated by blank lines). The following elements of John Gruber’s Markdown language are supported:

– ###### up to six levels of headings

– unordered lists introduced by a dash (like this paragraph)

– plain text paragraphs

Your task is to write a mini-Markdown processor; you have one hour to complete it. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
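
As a tiny starting sketch for one of the elements above (heading detection only; the procedure name is illustrative, and the rest of the processor is left for the hour):
;; Returns (level . text) for a heading line such as "## Section",
;; or #f if the line is not a heading.
(define (heading-level line)
  (let loop ((i 0))
    (if (and (< i (string-length line))
             (< i 6)
             (char=? (string-ref line i) #\#))
        (loop (+ i 1))
        (and (> i 0)
             (< i (string-length line))
             (char=? (string-ref line i) #\space)
             (cons i (substring line (+ i 1) (string-length line)))))))

;; (heading-level "## Section") => (2 . "Section")
;; (heading-level "plain text") => #f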

by programmingpraxis at January 21, 2020 10:00 AM

January 18, 2020

Joe Marshall

Unsyndicated blog

I've noticed that my blog posts are replicated in Planet Lisp and Planet Scheme, and here I am spamming them with random math stuff. So I'm creating a new blog, Jrm's Random Blog, where I can feel free to post about math, science, computers in general, and whatever else bugs me, without spamming the Lisp and Scheme readers. I'll keep posting to Abstract Heresies, but try to keep it more Lisp and computer language focused.

by Joe Marshall (noreply@blogger.com) at January 18, 2020 01:33 PM

January 17, 2020

Programming Praxis

Coprime Numbers

This exercise comes from CodeForces:

Given two natural numbers lo < hi, find a triple (a, b, c) such that a is coprime to b, b is coprime to c, and a is not coprime to c, or indicate that no such triple exists. If there is more than one such triple, you may return any of them. Two numbers are coprime if their greatest common divisor is 1. Lo and hi will be less than 10^18 and differ by no more than 50.

Your task is to find a triple satisfying the conditions above. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
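
A brute-force sketch, assuming the triple must be drawn from [lo, hi] with a < b < c. Since hi - lo is at most 50, checking every candidate triple with gcd is cheap even though the numbers themselves may be huge:
(define (coprime? a b) (= 1 (gcd a b)))

(define (find-triple lo hi)
  (call-with-current-continuation
   (lambda (return)
     (do ((a lo (+ a 1))) ((> a hi) #f)       ; #f if no triple exists
       (do ((b (+ a 1) (+ b 1))) ((> b hi))
         (do ((c (+ b 1) (+ c 1))) ((> c hi))
           (when (and (coprime? a b)
                      (coprime? b c)
                      (not (coprime? a c)))
             (return (list a b c)))))))))

;; (find-triple 2 4) => (2 3 4)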

by programmingpraxis at January 17, 2020 10:00 AM

Joe Marshall

Groups, semigroups, monoids, and computers

The day after I rant about mathematicians, I make a math post. “Do I contradict myself? Very well, then, I contradict myself, I am large, I contain multitudes.” — Walt Whitman

A group is a mathematical concept. It's pretty simple. It consists of a set, G, and an operation, *, which can be used to combine any two elements of G. What the set contains is not that important. It is the * operation we're interested in, and we can usually swap out G for another set without causing too many problems other than having to change the type signature of *. There are four axioms that * must obey:
  • Closure—combining any two elements of G using * just gives you another element in G.
    Note that this means you can build an arbitrary binary tree of combinations, e.g., (* (* a b) (* (* c d) e)). These trees will always be like a tree of cons cells. In some sense, the closure axiom is equivalent to saying that all the elements of G have the same type and that the * operator operates on values of that type and produces values of that type. The closure axiom along with the binary operation means that we can reduce any tree of combinations to a single value.
  • Associativity—(* (* a b) c) = (* a (* b c)) for any a, b, and c. This implies that you can take any arbitrary tree of combinations, e.g., (* (* a b) (* (* c d) e)), and simply flatten it into a list (* a b c d e); or, given the flat sequence (* a b c d e), we can add parentheses anywhere we like: (* a (* b c) d e). If we stop here and only have the closure and associativity axioms, we have what is called a “semigroup”. You can use the * operation to “fold” a semigroup down to a single value, or to keep an accumulator and incrementally fold elements into the accumulator.
  • Identity element—There has to be an identity element id such that (* id x) = (* x id) = x for all x. It will be unique. If you see the identity object in a combination (* a b id c d), you can simply remove it: (* a b c d). The identity element also comes in handy as an initial value when you are folding a sequence. If you have some concept that would be a group except it doesn't have an identity element, then you can often just make one up and add it to the set G.
  • Inverse element—For every element in G there has to be another element that, when combined with the first, gives you the identity. So if a is an element in G, there has to be some other element, call it b, such that (* a b) = (* b a) = id. The inverse element is usually notated with a little superscript -1: a⁻¹. If you have an element in a combination right next to its inverse: (* a x x⁻¹ c), you can combine the element and its inverse to get the identity: (* a id c), and then remove the identity: (* a c).
Frequently you run into something that obeys all the axioms but the inverse element axiom. This is called a monoid. A monoid is very much like a group except that you can get “stuck” when manipulating it if you run into one of the non-invertible elements because there's no inverse to “undo” it. There are certain things about monoids that are true only “if the appropriate inverses exist”. You run into that qualifier a lot when dealing with monoids. You don't need that qualifier if you are dealing with a group because they do exist by axiom. Or we could say that calling something a group is simply shorthand for adding “if the appropriate inverses exist” everywhere.
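
Here is a concrete sketch of that folding in Scheme, using fold-left as in the examples later in this post. Strings under string-append form a monoid whose identity is the empty string, and integers under + form a group whose identity is 0:
(fold-left string-append "" '("fold" "able" " " "sequence"))
;; => "foldable sequence"

(fold-left + 0 '(1 2 3 4 5))
;; => 15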

What does this have to do with computers? Consider the set of all subroutines with the operation of concatenation. It is closed — concatenating two subroutines gives you a third subroutine. It is associative — you just concatenate them linearly. There is an identity element, usually called no-op. And many, but not all, subroutines have inverses. So we have a monoid.

Consider the set of all strings with the operation of concatenation. It is closed, associative, the empty string is the identity element. It is a monoid.

Consider the set of functions whose input type is the same as the result type with the operation of composition. It is closed, associative, the identity function is the identity element. It is a monoid. If we consider only the subset of functions that also have inverses, we have a group. This particular monoid or group comes in especially handy because composition of functions is so useful.

Consider the set of invertible 2x2 matrices with integer components, a determinant of 1 or -1, and the operation of matrix multiply. It is closed, associative, there is an identity matrix, and I already said just consider the invertible ones. It forms a group. This group comes in handy for implementing arbitrary precision arithmetic. (Thanks to Bradley Lucier for the correction of the condition on the determinant. This makes the matrix continue to have integer components upon inversion, keeping things closed.)
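
Here is a small Scheme sketch of that matrix group (just the group operations; the arbitrary-precision machinery itself is not shown). A matrix ((a b) (c d)) is represented as a nested list, and the inverse stays integral precisely because the determinant is 1 or -1:
(define identity-matrix '((1 0) (0 1)))

(define (matrix-multiply m n)
  (let ((a (caar m)) (b (cadar m)) (c (caadr m)) (d (cadadr m))
        (e (caar n)) (f (cadar n)) (g (caadr n)) (h (cadadr n)))
    (list (list (+ (* a e) (* b g)) (+ (* a f) (* b h)))
          (list (+ (* c e) (* d g)) (+ (* c f) (* d h))))))

(define (determinant m)
  (- (* (caar m) (cadadr m)) (* (cadar m) (caadr m))))

(define (matrix-inverse m)   ; valid when (determinant m) is 1 or -1
  (let ((det (determinant m)))
    (list (list (* det (cadadr m)) (* det (- (cadar m))))
          (list (* det (- (caadr m))) (* det (caar m))))))

;; (matrix-multiply '((1 1) (0 1)) (matrix-inverse '((1 1) (0 1))))
;; => ((1 0) (0 1))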

The permutations of a list form a group. The integers under addition form a group.

These things are everywhere. And it isn't a coincidence. The concepts of a group, monoid, and semigroup are meant to capture the essence of what it is to have a foldable sequence of elements. (Can I complain about mathematicians here? They make up so much terminology and abstraction that it is virtually impossible to get at what they really mean. We're just talking about sequences of elements and trying to find some minimal axioms that you need to have to fold them, but trying to find literature that actually says that's what we're doing is like trying to pull hen's teeth.)

So what good are groups, monoids, and semigroups? Aside from the obvious fact that foldable sequences are ubiquitous and really useful, that is. Not immediately apparent from the axioms is that in addition to folding a sequence, you can transform a sequence into a different, but equivalent one. If the appropriate inverses exist (there's that phrase), you can “unfold” some or all elements of a sequence. So by judicious folding and unfolding, you can transform a sequence.

Here's an unusual abstract example. Consider a pipeline which has a set of nodes and communicates values of the same type between the nodes. Values accumulate at the nodes until they are transmitted to the next node in the pipeline. We start with all the values in the initial node (on the right) and transmit them to the left:
(pipeline (node) (node) (node a b c))  ;; transmit the a
(pipeline (node) (node a) (node b c))  ;; transmit the b
(pipeline (node) (node a b) (node c))  ;; transmit the a
(pipeline (node a) (node b) (node c))  ;; transmit the c
(pipeline (node a) (node b c) (node))  ;; transmit the b
(pipeline (node a b) (node c) (node))  ;; transmit the c
(pipeline (node a b c) (node) (node))  ;; done
If the values we transmit are drawn from a group, we can replace each node with the group's * operator:
(* identity identity (* a b c))  ;; transmit the a
(* identity (* identity a) (* b c))  ;; transmit the b
(* identity (* a b) (* identity c))  ;; transmit the a
(* (* identity a) (* identity  b) (* identity c))  ;; transmit the c
(* (* identity a) (* b c) identity)  ;; transmit the b
(* (* a b) (* identity c) identity)  ;; transmit the c
(* (* a b c) identity identity)  ;; done
The astute reader will notice that all we're doing is making use of the associativity axiom and moving the parenthesis around so that the values seem to move between the different nodes. But we preserve the invariant that the “value” of the entire pipeline doesn't change as the values move. The * operator need not be concatenate, which would give simple queuing behavior, but can be any operator satisfying the axioms giving us much more interesting pipelines. One implementation of arbitrary precision arithmetic transmits Möbius transformations along just such a pipeline to refine the upper and lower limits of a computed approximation. In this implementation, the * operator is the composition of Möbius transformations.

Here's a more concrete example. If you have a series of nested functions: (f (g x)) and both f and g take and return the same type, rewrite it as ((compose f g) x) and use a little group theory on it.
(f (g x))
((compose f g) x)
;; or more explicitly
((fold-left compose identity (list f g)) x)
If the appropriate inverses exist, then there will be another function h such that (compose f g) is equal to (compose h f), essentially allowing you to “slide” g to the left “through” f. It is relatively easy to see that h must be equivalent to (compose f g f⁻¹). Mathematicians say that h is conjugate to g. Conjugates always have a form like aba⁻¹. By finding conjugates, you can take a sequence and slide the elements left and right through other elements. This also allows you to fold things out of order. (Or in the pipeline example, transmit items out of order.) If we were left folding into an accumulator, folding h before f is equivalent to folding g after f. Another way of looking at it is this. Suppose we're standing to the left of f and looking through the “lens” of f at g. h is what g “looks like” when viewed through f.

If we want, we can define slide such that (compose slide (compose f g)) is equivalent to (compose h f). slide is (compose h f g⁻¹ f⁻¹). (This isn't a generic slide sequence, it only works on (compose f g). It ought to be an identity because (compose f g) is equivalent to (compose h f).) I complained that mathematicians provided too few concrete examples, so here is a concrete example using list permutations:
> (reverse (rotate-left '(a b c d)))
(a d c b)

;; rewrite as explicit fold-left of compose
> ((fold-left compose identity (list reverse rotate-left)) '(a b c d))
(a d c b)

;; sliding rotate-left through reverse turns it into rotate-right
> ((fold-left compose identity (list rotate-right reverse)) '(a b c d))
(a d c b)

;; A sequence that when composed with (list reverse rotate-left) turns it into
;; (rotate-right reverse)
> (define slide 
    (fold-left compose identity (list rotate-right reverse rotate-right reverse)))
slide

> ((fold-left compose identity (list slide reverse rotate-left)) '(a b c d))
(a d c b)

;; rewrite back to direct procedure calls
> (rotate-right (reverse '(a b c d)))
(a d c b)

;; and slide ought to be an identity
> ((fold-left compose identity (list slide)) '(a b c d))
(a b c d)

Or suppose you have (f (g x)), but for some reason you want (g (f x)) (which would, in general, be a different value unless f and g happen to commute). Again, rewrite (f (g x)) as ((compose f g) x) and apply a little group theory. If the appropriate inverses exist, there will be a function commute-fg such that (compose commute-fg (compose f g)) is equivalent to (compose g f). With a little thought, you can see that commute-fg is equivalent to (compose g f g⁻¹ f⁻¹). (Again, this isn't a generic commute, it only causes this specific f and g to commute.) commute-fg is called a commutator because it makes f and g commute. Commutators always have the form aba⁻¹b⁻¹. By finding commutators and inserting them in the right place, you can take a sequence and swap adjacent elements. Again, a concrete example with lists:
;; an illustration of what swap-first two does
> (swap-first-two '(a b c d))
(b a c d)

;; we're given
> (reverse (swap-first-two '(a b c d)))
(d c a b)

;; but we want, for some reason to reverse first
> (swap-first-two (reverse '(a b c d)))
(c d b a)

;; rewrite as fold-left of compose
> ((fold-left compose identity (list reverse swap-first-two)) '(a b c d))
(d c a b)

;; define our commutator
;; note that swap-first-two and reverse are their own inverses
> (define commute-fg
    (fold-left compose identity (list swap-first-two reverse swap-first-two reverse)))

;; make f and g commute
;; observe that it returns the desired result
> ((fold-left compose identity (list commute-fg reverse swap-first-two)) '(a b c d))
(c d b a)

There's two interesting things here. First, notice that in both examples I convert (f (g x)) to ((fold-left compose identity (list f g)) x) and then proceed to ignore x and just consider (fold-left compose identity (list f g)) as if x didn't exist. I've abstracted away the x. (Of course I have to eventually supply the x if I want an answer, but it only comes back at the last moment.) Second, notice that although slide and commute-fg are foldable sequences, I use them as if they were higher order functions operating on the foldable sequence (compose f g) to transform it, first into (compose h f), second into (compose g f). This second thing is a neat trick. We're taking a function that operates on lists and treating it as if it were a higher-order function that operates on functions. This is called the “action” of slide and commute-fg because it appears as if elements of the set G of our group can “act” directly on other elements.

Every element in the underlying set G of a group has an action associated with it which operates directly on other elements in G. This is an important concept in group theory. Now earlier I said that the actual elements of G don't matter much, so the action must be more closely tied to the operator *. And if we swap out G for another set we'll still have the same actions, they'll just be associated with the elements of the new set (in an isomorphic way). The actions are pretty abstract.

There's a lot more one could say about the actions. They are a rich source of interesting math. My brain is getting fatigued with all this abstraction, so I'll leave the topic be for now.

If group theory is about the essence of what it means to have a foldable sequence, then category theory is about the essence of composition. They offer two somewhat different approaches to similar material. What do you do with sequences but compose them? What comes from composition but a sequence? Many concepts in group theory carry over into category theory. Naturally a completely different set of terminology is used, but the concepts are there.

But that's enough group theory for today and category theory can wait until later posts.

by Joe Marshall (noreply@blogger.com) at January 17, 2020 09:13 AM

January 15, 2020

Joe Marshall

Math is hard, let's go shopping

I find mathematics, with all its weird terminology and abstraction and equations, hard to understand. That's kind of funny coming from someone like me who makes a living from a branch of mathematics. I find computers and programming to be rather easy to understand — probably because I've had a lot of practice. But computer science is just applied logic and programming is arguably just the study of the computable functions, so you'd think math would come naturally. It doesn't.

One problem I've found is that as much as mathematicians pride themselves on rigor, they tend to be a bit sloppy and leave out important details. Computer scientists don't leave out important details because then the programs won't run. It's true that too much detail can clutter things up, but leaving out the detail and relying on “context” just increases the intellectual burden on the reader.

I will give mathematicians credit for thinking about edge cases perhaps more than a computer scientist would. It can be easy to be a bit complacent with edge cases because the computer will likely do something even if you don't think too hard about what it ought to do. But a good computer scientist tries to reduce the number of edge cases or at least make them coherent with the non-edge cases.*

Mathematicians seem to take perverse pleasure in being obscure. Computer scientists strive to be as obvious as possible because like as not, they are the ones that have to revisit the code they wrote and don't want to have to remember what they were thinking at the time. It's just easier to spell things out explicitly and obviously so that you can get back up to speed quickly when you have to debug your own stupid code. Every time I pick up some literature on category theory, I get hit with a “Wall of Terminology” denser than the “Wall of Sound” on a Phil Spector recording. It's fundamentally simple stuff, but it is dressed up in pants so fancy one has a hard time extracting the plain meaning. What seems to be universal in category theory is my difficulty in getting past page 4.

I once read a mathematical paper that talked about an algorithm with three tuning parameters: α, β, and another α. No decent computer programmer would give the same name to two different variables. Which α was which was supposed to be “obvious” from the context. The brainpower needed to keep track of the different αs was absurd and a complete waste of effort when calling the variable something else, like γ would have done the trick.

And don't ask a mathematician to write computer code. That's the one time they'll leave out all the abstraction. Instead of a nice piece of abstract, functional code, you'll get a mess of imperative code that smashes and bashes its way to a solution with no explanation of how it got there. It's a lot easier to take some abstract, functional code and figure out a more optimal (probably imperative) way to do it than it is to take a more optimal imperative piece of code and figure out the abstract, functional meaning of it.

I've found it to be extremely helpful when a computer paper includes one or two concrete examples of what it is talking about. That way, if I try to go implement code that does what the paper suggests, there's some indication that I'm on the right track. I'm more confident that I understand the paper if I have working code that produces the exact same values the paper's authors got. It's harder to find concrete examples in a math paper, and it is easier to think you know what it says but be far off base if there aren't any examples.

Maybe I shouldn't blame mathematicians so much and look a little closer to home. Perhaps I should study harder instead of demanding to be spoon fed difficult concepts. But then I read Feynman, S&ICP, S&ICM, and Jaynes and discover that maybe I just need a simple explanation that makes sense to me.

Sturgeon's Revelation is “90% of everything is crap”. This is true of both mathematical papers and computer science papers.



*An old joke illustrates the importance of thinking of edge cases: A programmer implements a bar. The test engineer goes in and orders a beer, orders zero beers, orders 999999999 beers, orders -1 beers, orders a lizard, and declares the bar ready for release. The first customer comes in and asks to use the restroom. The bar catches fire and burns down.

by Joe Marshall (noreply@blogger.com) at January 15, 2020 02:55 PM

January 14, 2020

GNU Guix

Reproducible computations with Guix

This post is about reproducible computations, so let's start with a computation. A short, though rather uninteresting, C program is a good starting point. It computes π in three different ways:

#include <math.h>
#include <stdio.h>

int main()
{
    printf( "M_PI                         : %.10lf\n", M_PI);
    printf( "4 * atan(1.)                 : %.10lf\n", 4.*atan(1.));
    printf( "Leibniz' formula (four terms): %.10lf\n", 4.*(1.-1./3.+1./5.-1./7.));
    return 0;
}

This program uses no random element, such as a random number generator or parallelism. It's strictly deterministic. It is reasonable to expect it to produce exactly the same output, on any computer and at any point in time. And yet, many programs whose results should be perfectly reproducible are in fact not. Programs using floating-point arithmetic, such as this short example, are particularly prone to seemingly inexplicable variations.

My goal is to explain why deterministic programs often fail to be reproducible, and what it takes to fix this. The short answer to that question is "use Guix", but even though Guix provides excellent support for reproducibility, you still have to use it correctly, and that requires some understanding of what's going on. The explanation I will give is rather detailed, to the point of discussing parts of the Guile API of Guix. You should be able to follow the reasoning without knowing Guile though, you will just have to believe me that the scripts I will show do what I claim they do. And in the end, I will provide a ready-to-run Guile script that will let you explore package dependencies right from the shell.

Dependencies: what it takes to run a program

One keyword in discussions of reproducibility is "dependencies". I will revisit the exact meaning of this term later, but to get started, I will define it loosely as "any software package required to run a program". Running the π computation shown above is normally done using something like

gcc pi.c -o pi
./pi

C programmers know that gcc is a C compiler, so that's one obvious dependency for running our little program. But is a C compiler enough? That question is surprisingly difficult to answer in practice. Your computer is loaded with tons of software (otherwise it wouldn't be very useful), and you don't really know what happens behind the scenes when you run gcc or pi.

Containers are good

A major element of reproducibility support in Guix is the possibility to run programs in well-defined environments that contain exactly the software packages you request, and no more. So if your program runs in an environment that contains only a C compiler, you can be sure it has no other dependencies. Let's create such an environment:

guix environment --container --ad-hoc gcc-toolchain

The option --container ensures the best possible isolation from the standard environment that your system installation and user account provide for day-to-day work. This environment contains nothing but a C compiler and a shell (which you need to type in commands), and has access to no other files than those in the current directory.

If the term "container" makes you think of Docker, note that this is something different. Note also that the option --container requires support from the Linux kernel, which may not be present on your system, or may be disabled by default. Finally, note that by default, a containerized environment has no network access, which may be a problem. If for whatever reason you cannot use --container, use --pure instead. This yields a less isolated environment, but it is usually good enough. For a more detailed discussion of these options, see the Guix manual.

The above command leaves me in a shell inside my environment, where I can now compile and run my little program:

gcc pi.c -o pi
./pi
M_PI                         : 3.1415926536
4 * atan(1.)                 : 3.1415926536
Leibniz' formula (four terms): 2.8952380952

It works! So now I can be sure that my program has a single dependency: the Guix package gcc-toolchain. I'll leave that special-environment shell by typing Ctrl-D, as otherwise the following examples won't work.

Perfectionists who want to exclude the possibility that my program requires a shell could run each step in a separate container:

guix environment --container --ad-hoc gcc-toolchain -- gcc pi.c -o pi
guix environment --container --ad-hoc gcc-toolchain -- ./pi
M_PI                         : 3.1415926536
4 * atan(1.)                 : 3.1415926536
Leibniz' formula (four terms): 2.8952380952

Welcome to dependency hell!

Now that we know that our only dependency is gcc-toolchain, let's look at it in more detail:

guix show gcc-toolchain
name: gcc-toolchain
version: 9.2.0
outputs: out debug static
systems: x86_64-linux i686-linux
dependencies: binutils@2.32 gcc@9.2.0 glibc@2.29 ld-wrapper@0
location: gnu/packages/commencement.scm:2532:4
homepage: https://gcc.gnu.org/
license: GPL 3+
synopsis: Complete GCC tool chain for C/C++ development  
description: This package provides a complete GCC tool chain for C/C++
+ development to be installed in user profiles.  This includes GCC, as well as
+ libc (headers and binaries, plus debugging symbols in the `debug' output),
+ and Binutils.

name: gcc-toolchain
version: 8.3.0
outputs: out debug static
systems: x86_64-linux i686-linux
dependencies: binutils@2.32 gcc@8.3.0 glibc@2.29 ld-wrapper@0
location: gnu/packages/commencement.scm:2532:4
homepage: https://gcc.gnu.org/
license: GPL 3+
synopsis: Complete GCC tool chain for C/C++ development  
description: This package provides a complete GCC tool chain for C/C++
+ development to be installed in user profiles.  This includes GCC, as well as
+ libc (headers and binaries, plus debugging symbols in the `debug' output),
+ and Binutils.

name: gcc-toolchain
version: 7.4.0
outputs: out debug static
systems: x86_64-linux i686-linux
dependencies: binutils@2.32 gcc@7.4.0 glibc@2.29 ld-wrapper@0
location: gnu/packages/commencement.scm:2532:4
homepage: https://gcc.gnu.org/
license: GPL 3+
synopsis: Complete GCC tool chain for C/C++ development  
description: This package provides a complete GCC tool chain for C/C++
+ development to be installed in user profiles.  This includes GCC, as well as
+ libc (headers and binaries, plus debugging symbols in the `debug' output),
+ and Binutils.

name: gcc-toolchain
version: 6.5.0
outputs: out debug static
systems: x86_64-linux i686-linux
dependencies: binutils@2.32 gcc@6.5.0 glibc@2.29 ld-wrapper@0
location: gnu/packages/commencement.scm:2532:4
homepage: https://gcc.gnu.org/
license: GPL 3+
synopsis: Complete GCC tool chain for C/C++ development  
description: This package provides a complete GCC tool chain for C/C++
+ development to be installed in user profiles.  This includes GCC, as well as
+ libc (headers and binaries, plus debugging symbols in the `debug' output),
+ and Binutils.

name: gcc-toolchain
version: 5.5.0
outputs: out debug static
systems: x86_64-linux i686-linux
dependencies: binutils@2.32 gcc@5.5.0 glibc@2.29 ld-wrapper@0
location: gnu/packages/commencement.scm:2532:4
homepage: https://gcc.gnu.org/
license: GPL 3+
synopsis: Complete GCC tool chain for C/C++ development  
description: This package provides a complete GCC tool chain for C/C++
+ development to be installed in user profiles.  This includes GCC, as well as
+ libc (headers and binaries, plus debugging symbols in the `debug' output),
+ and Binutils.

name: gcc-toolchain
version: 4.9.4
outputs: out debug static
systems: x86_64-linux i686-linux
dependencies: binutils@2.32 gcc@4.9.4 glibc@2.29 ld-wrapper@0
location: gnu/packages/commencement.scm:2532:4
homepage: https://gcc.gnu.org/
license: GPL 3+
synopsis: Complete GCC tool chain for C/C++ development  
description: This package provides a complete GCC tool chain for C/C++
+ development to be installed in user profiles.  This includes GCC, as well as
+ libc (headers and binaries, plus debugging symbols in the `debug' output),
+ and Binutils.

name: gcc-toolchain
version: 4.8.5
outputs: out debug static
systems: x86_64-linux i686-linux
dependencies: binutils@2.32 gcc@4.8.5 glibc@2.29 ld-wrapper@0
location: gnu/packages/commencement.scm:2532:4
homepage: https://gcc.gnu.org/
license: GPL 3+
synopsis: Complete GCC tool chain for C/C++ development  
description: This package provides a complete GCC tool chain for C/C++
+ development to be installed in user profiles.  This includes GCC, as well as
+ libc (headers and binaries, plus debugging symbols in the `debug' output),
+ and Binutils.

Guix actually knows about several versions of this toolchain. We didn't ask for a specific one, so what we got is the first one in this list, which is the one with the highest version number. Let's check that this is true:

guix environment --container --ad-hoc gcc-toolchain -- gcc --version
gcc (GCC) 9.2.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

The output of guix show contains a line about dependencies. These are the dependencies of our dependency, and you may already have guessed that they will have dependencies as well. That's why reproducibility is such a difficult job in practice! The dependencies of gcc-toolchain@9.2.0 are:

guix show gcc-toolchain@9.2.0 | recsel -P dependencies
binutils@2.32 gcc@9.2.0 glibc@2.29 ld-wrapper@0

To dig deeper, we can try feeding these dependencies to guix show, one by one, in order to learn more about them:

guix show binutils@2.32
name: binutils
version: 2.32
outputs: out
systems: x86_64-linux i686-linux
dependencies: 
location: gnu/packages/base.scm:415:2
homepage: https://www.gnu.org/software/binutils/
license: GPL 3+
synopsis: Binary utilities: bfd gas gprof ld  
description: GNU Binutils is a collection of tools for working with binary
+ files.  Perhaps the most notable are "ld", a linker, and "as", an assembler.
+ Other tools include programs to display binary profiling information, list the
+ strings in a binary file, and utilities for working with archives.  The "bfd"
+ library for working with executable and object formats is also included.
guix show gcc@9.2.0
guix show: error: gcc@9.2.0: package not found

This looks a bit surprising. What's happening here is that gcc is defined as a hidden package in Guix. The package is there, but it is hidden from package queries. There is a good reason for this: gcc on its own is rather useless; you need gcc-toolchain to actually use the compiler. But if both gcc and gcc-toolchain showed up in a search, that would be more confusing than helpful for most users. Hiding the package is a way of saying "for experts only".

Let's take this as a sign that it's time to move on to the next level of Guix hacking: Guile scripts. Guile, an implementation of the Scheme language, is Guix' native language, so using Guile scripts, you get access to everything there is to know about Guix and its packages.
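As a first taste, the hidden gcc package that guix show refused to display can be referenced directly from Guile. The following is a sketch that assumes the package module layout of the Guix version used in this post, where the gcc variable is exported by the module (gnu packages gcc) even though the package is hidden from queries:

(use-modules (gnu packages gcc)
             (guix packages))

;; gcc is invisible to `guix show' and to specification->package, but
;; the Scheme variable itself is accessible.  Which default version it
;; refers to depends on your Guix revision.
(format #t "~a@~a\n"
        (package-name gcc)
        (package-version gcc))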

A note in passing: the emacs-guix package provides an intermediate level of Guix exploration for Emacs users. It lets you look at hidden packages, for example. But much of what I will show in the following really requires Guile scripts. Another nice tool for package exploration is guix graph, which creates a diagram showing dependency relations between packages. Unfortunately that diagram is legible only for a relatively small number of dependencies, and as we will see later, most packages end up having lots of them.

Anatomy of a Guix package

From the user's point of view, a package is a piece of software with a name and a version number that can be installed using guix install. The packager's point of view is quite a bit different. In fact, what users consider a package is more precisely called the package's output in Guix jargon. The package is a recipe for creating this output.

To see how all these concepts fit together, let's look at an example of a package definition: xmag. I have chosen this package not because I care much about it, but because its definition is short while showcasing all the features I want to explain. You can access it most easily by typing guix edit xmag. Here is what you will see:

(package
  (name "xmag")
  (version "1.0.6")
  (source
   (origin
     (method url-fetch)
     (uri (string-append
           "mirror://xorg/individual/app/" name "-" version ".tar.gz"))
     (sha256
      (base32
       "19bsg5ykal458d52v0rvdx49v54vwxwqg8q36fdcsv9p2j8yri87"))))
  (build-system gnu-build-system)
  (arguments
   `(#:configure-flags
     (list (string-append "--with-appdefaultdir="
                          %output ,%app-defaults-dir))))
  (inputs
   `(("libxaw" ,libxaw)))
  (native-inputs
   `(("pkg-config" ,pkg-config)))
  (home-page "https://www.x.org/wiki/")
  (synopsis "Display or capture a magnified part of a X11 screen")
  (description "Xmag displays and captures a magnified snapshot of a portion
of an X11 screen.")
  (license license:x11))

The package definition starts with the name and version information you expected. Next comes source, which says how to obtain the source code and from where. It also provides a hash that lets Guix check the integrity of the downloaded files. The next four items, build-system, arguments, inputs, and native-inputs, supply the information required for building the package, which is what creates its outputs. The remaining items are documentation for human consumption, important for other reasons but not for reproducibility, so I won't say any more about them. (See this packaging tutorial if you want to define your own package.)

The example package definition has native-inputs in addition to "plain" inputs. There's a third variant, propagated-inputs, but xmag doesn't have any. The differences between these variants don't matter for my topic, so I will just refer to "inputs" from now on. Another omission I will make is the possibility to define several outputs for a package. This is done for particularly big packages, in order to reduce the footprint of installations, but for the purposes of reproducibility, it's OK to treat all outputs of a package as a single unit.

The following figure illustrates how the various pieces of information from a package are used in the build process (done explicitly by guix build, or implicitly when installing or otherwise using a package): [Figure: diagram of a Guix package]

It may help to translate the Guix jargon to the vocabulary of C programming:

| Guix package | C program        |
|--------------+------------------|
| source code  | source code      |
| inputs       | libraries        |
| arguments    | compiler options |
| build system | compiler         |
| output       | executable       |

Building a package can be considered a generalization of compiling a program. We could in fact create a "GCC build system" for Guix that would simply run gcc. However, such a build system would be of little practical use, since most real-life software consists of more than just one C source code file, and requires additional pre- or post-processing steps. The gnu-build-system used in the example is based on tools such as make and autoconf, in addition to gcc.

Package exploration in Guile

In Guile, Guix represents packages by a record type called <package>, defined in the module (guix packages). There is also a module (gnu packages), which contains the actual package definitions - be careful not to confuse the two (as I always do). Here is a simple Guile script that shows some package information, much like the guix show command that I used earlier:

(use-modules (guix packages)
             (gnu packages)) 

(define gcc-toolchain
  (specification->package "gcc-toolchain"))

(format #t "Name   : ~a\n" (package-name gcc-toolchain))
(format #t "Version: ~a\n" (package-version gcc-toolchain))
(format #t "Inputs : ~a\n" (package-direct-inputs gcc-toolchain))
Name   : gcc-toolchain
Version: 9.2.0
Inputs : ((gcc #<package gcc@9.2.0 gnu/packages/gcc.scm:524 7fc2d76af160>) (ld-wrapper #<package ld-wrapper@0 gnu/packages/base.scm:505 7fc2d306f580>) (binutils #<package binutils@2.32 gnu/packages/commencement.scm:2187 7fc2d306fdc0>) (libc #<package glibc@2.29 gnu/packages/commencement.scm:2145 7fc2d306fe70>) (libc-debug #<package glibc@2.29 gnu/packages/commencement.scm:2145 7fc2d306fe70> debug) (libc-static #<package glibc@2.29 gnu/packages/commencement.scm:2145 7fc2d306fe70> static))

This script first calls specification->package to look up the package using the same rules as the guix command line interface: pick the latest available version if none is explicitly requested. Then it extracts various information about the package. Note that package-direct-inputs returns the combination of package-inputs, package-native-inputs, and package-propagated-inputs. As I said above, I don't care about the distinction here.
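If you are curious how the direct inputs split across the three categories, the individual accessors are available as well. Here is a small variation of the script above; I don't show its output, because the exact split is an implementation detail of the package definition that may change:

(use-modules (guix packages)
             (gnu packages))

(define gcc-toolchain
  (specification->package "gcc-toolchain"))

;; package-direct-inputs returns the concatenation of these three lists.
(format #t "inputs           : ~a\n" (length (package-inputs gcc-toolchain)))
(format #t "native-inputs    : ~a\n" (length (package-native-inputs gcc-toolchain)))
(format #t "propagated-inputs: ~a\n" (length (package-propagated-inputs gcc-toolchain)))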

The inputs are not shown in a particularly nice form, so let's write two Guile functions to improve it:

(use-modules (guix packages)
             (gnu packages)
             (ice-9 match))

(define (package->specification package)
  (format #f "~a@~a"
          (package-name package)
          (package-version package)))

(define (input->specification input)
  (match input
    ((label (? package? package) . _)
     (package->specification package))
    (other-item
     (format #f "~a" other-item))))

(define gcc-toolchain
  (specification->package "gcc-toolchain"))

(format #t "Package: ~a\n"
        (package->specification gcc-toolchain))
(format #t "Inputs : ~a\n"
        (map input->specification (package-direct-inputs gcc-toolchain)))
Package: gcc-toolchain@9.2.0
Inputs : (gcc@9.2.0 ld-wrapper@0 binutils@2.32 glibc@2.29 glibc@2.29 glibc@2.29)

That looks much better. As you can see from the code, a list of inputs is a bit more than a list of packages. It is in fact a list of labelled package outputs. That also explains why we see glibc three times in the input list: glibc defines three distinct outputs, all of which are used in gcc-toolchain. For reproducibility, all we care about is the package references. Later on, we will deal with much longer input lists, so as a final cleanup step, let's show only unique package references from the list of inputs:

(use-modules (guix packages)
             (gnu packages)
             (srfi srfi-1)
             (ice-9 match))

(define (package->specification package)
  (format #f "~a@~a"
          (package-name package)
          (package-version package)))

(define (input->specification input)
  (match input
    ((label (? package? package) . _)
     (package->specification package))
    (other-item
     (format #f "~a" other-item))))

(define (unique-inputs inputs)
  (delete-duplicates
   (map input->specification inputs)))

(define gcc-toolchain
  (specification->package "gcc-toolchain"))

(format #t "Package: ~a\n"
        (package->specification gcc-toolchain))
(format #t "Inputs : ~a\n"
        (unique-inputs (package-direct-inputs gcc-toolchain)))
Package: gcc-toolchain@9.2.0
Inputs : (gcc@9.2.0 ld-wrapper@0 binutils@2.32 glibc@2.29)
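As an aside, if you also want to see which output of each input package is referenced (remember that glibc showed up three times above, once per output), the match pattern can capture the output label as well. Here is a sketch, assuming the labelled-input representation shown earlier, in which a two-element entry refers to the default "out" output:

(use-modules (guix packages)
             (gnu packages)
             (ice-9 match))

(define (input->specification/output input)
  (match input
    ;; Three-element inputs name a specific output, such as the "debug"
    ;; output of glibc.
    ((label (? package? package) output)
     (format #f "~a@~a:~a"
             (package-name package) (package-version package) output))
    ;; Two-element inputs refer to the default "out" output.
    ((label (? package? package))
     (format #f "~a@~a:out"
             (package-name package) (package-version package)))
    (other-item
     (format #f "~a" other-item))))

(format #t "Inputs : ~a\n"
        (map input->specification/output
             (package-direct-inputs
              (specification->package "gcc-toolchain"))))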

Dependencies

You may have noticed the absence of the term "dependency" from the last two sections. There is a good reason for that: the term is used in somewhat different meanings, and that can create confusion. Guix jargon therefore avoids it.

The figure above shows three kinds of input to the build system: source, inputs, and arguments. These categories reflect the packagers' point of view: source is what the authors of the software supply, inputs are other packages, and arguments are what the packagers themselves add to the build procedure. It is important to understand that from a purely technical point of view, there is no fundamental difference between the three categories. You could, for example, define a package that contains C source code in the build system arguments, but leaves source empty. This would be inconvenient, and confusing for others, so I don't recommend you actually do this. The three categories are important, but for humans, not for computers. In fact, even the build system is not fundamentally distinct from its inputs. You could define a special-purpose build system for one package, and put all the source code in there. At the level of the CPU and the computer's memory, a build process (as in fact any computation) looks like this: [Figure: image of a computation]. It is human interpretation that decomposes this into code and data [Figure: code and data], and in a next step into data, program, and environment [Figure: data, program, and environment]. We can go on and divide the environment into operating system, development tools, and application software, for example, but the further we go in decomposing the input to a computation, the more arbitrary it gets.

From this point of view, a software's dependencies consist of everything required to run it in addition to its source code. For a Guix package, the dependencies are thus,

  • its inputs
  • the build system arguments
  • the build system itself
  • Guix (which is a piece of software as well)
  • the GNU/Linux operating system (kernel, file system, etc.).

In the following, I will not mention the last two items any more, because they are a common dependency of all Guix packages, but it's important not to forget about them. A change in Guix or in GNU/Linux can actually make a computation non-reproducible, although in practice that happens very rarely. Moreover, Guix is actually designed to run older versions of itself, as we will see later.

Build systems are (mostly) packages as well

I hope that by now you have a good idea of what a package is: a recipe for building outputs from source and inputs, with inputs being the outputs of other packages. The recipe involves a build system and arguments supplied to it. So... what exactly is a build system? I have introduced it as a generalization of a compiler, which describes its role. But where does a build system come from in Guix?

The ultimate answer is of course the source code. Build systems are pieces of Guile code that are part of Guix. But this Guile code is only a shallow layer orchestrating invocations of other software, such as gcc or make. And that software is defined by packages. So in the end, from a reproducibility point of view, we can replace the "build system" item in our list of dependencies by "a bundle of packages". In other words: more inputs.

Before Guix can build a package, it must gather all the required ingredients, and that includes replacing the build system by the packages it represents. The resulting list of ingredients is called a bag, and we can access it using a Guile script:

(use-modules (guix packages)
             (gnu packages)
             (srfi srfi-1)
             (ice-9 match))

(define (package->specification package)
  (format #f "~a@~a"
          (package-name package)
          (package-version package)))

(define (input->specification input)
  (match input
    ((label (? package? package) . _)
     (package->specification package))
    ((label (? origin? origin))
     (format #f "[source code from ~a]"
             (origin-uri origin)))
    (other-input
     (format #f "~a" other-input))))

(define (unique-inputs inputs)
  (delete-duplicates
   (map input->specification inputs)))

(define hello
  (specification->package "hello"))

(format #t "Package       : ~a\n"
        (package->specification hello))
(format #t "Package inputs: ~a\n"
        (unique-inputs (package-direct-inputs hello)))
(format #t "Build inputs  : ~a\n"
        (unique-inputs
         (bag-direct-inputs
          (package->bag hello))))
Package       : hello@2.10
Package inputs: ()
Build inputs  : ([source code from mirror://gnu/hello/hello-2.10.tar.gz] tar@1.32 gzip@1.10 bzip2@1.0.6 xz@5.2.4 file@5.33 diffutils@3.7 patch@2.7.6 findutils@4.6.0 gawk@5.0.1 sed@4.7 grep@3.3 coreutils@8.31 make@4.2.1 bash-minimal@5.0.7 ld-wrapper@0 binutils@2.32 gcc@7.4.0 glibc@2.29 glibc-utf8-locales@2.29)

I have used a different example, hello, because for gcc-toolchain, there is no difference between package inputs and build inputs (check for yourself if you want!). My new example, hello (a short demo program printing "Hello, world" in the language of the system installation), is interesting because it has no package inputs at all. All the build inputs except for the source code have thus been contributed by the build system.

If you compare this script to the previous one that printed only the package inputs, you will notice two major new features. In input->specification, there is an additional case for the source code reference. And in the last statement, package->bag constructs a bag from the package, before bag-direct-inputs is called to get that bag's input list.

Inputs are outputs

I have mentioned before that one package's inputs are other packages' outputs, but that fact deserves a more in-depth discussion because of its crucial importance for reproducibility. A package is a recipe for building outputs from source and inputs. Since these inputs are outputs, they must have been built as well. Package building is therefore a process consisting of multiple steps. An immediate consequence is that any computation making use of packaged software is a multi-step computation as well.

Remember the short C program computing π from the beginning of this post? Running that program is only the last step in a long series of computations. Before you can run pi, you must compile pi.c. That requires the package gcc-toolchain, which must first be built. And before it can be built, its inputs must be built. And so on. If you want the output of pi to be reproducible, the whole chain of computations must be reproducible, because each step can have an impact on the results produced by pi.

So... where does this chain start? Few people write machine code these days, so almost all software requires some compiler or interpreter. And that means that for every package, there are other packages that must be built first. The question of how to get this chain started is known as the bootstrapping problem. A rough summary of the solution is that the chain starts on somebody else's computer, which creates a bootstrap seed, an ideally small package that is downloaded in precompiled form. See this post by Jan Nieuwenhuizen for details of this procedure. The bootstrap seed is not the real start of the chain, but as long as we can retrieve an identical copy at a later time, that's good enough for reproducibility. In fact, the reason for requiring the bootstrap seed to be small is not reproducibility, but inspectability: it should be possible to audit the seed for bugs and malware, even in the absence of source code.

Reaching closure

Now we are finally ready for the ultimate step in dependency analysis: identifying all packages on which a computation depends, right up to the bootstrap seed. The starting point is the list of direct inputs of the bag derived from a package, which we looked at in the previous script. For each package in that list, we must apply this same procedure, recursively. We don't have to write this code ourselves, because the function package-closure in Guix does that job. These closures have nothing to do with closures in Lisp, and even less with the Clojure programming language. They are a case of what mathematicians call transitive closures: starting with a set of packages, you extend the set repeatedly by adding the inputs of the packages that are already in the set, until there is nothing more to add. If you have a basic knowledge of Scheme, you should now be able to understand the implementation of this function. Let's add it to our dependency analysis code:

(use-modules (guix packages)
             (gnu packages)
             (srfi srfi-1)
             (ice-9 match))

(define (package->specification package)
  (format #f "~a@~a"
          (package-name package)
          (package-version package)))

(define (input->specification input)
  (match input
    ((label (? package? package) . _)
     (package->specification package))
    ((label (? origin? origin))
     (format #f "[source code from ~a]"
             (origin-uri origin)))
    (other-input
     (format #f "~a" other-input))))

(define (unique-inputs inputs)
  (delete-duplicates
   (map input->specification inputs)))

(define (length-and-list lists)
  (list (length lists) lists))

(define hello
  (specification->package "hello"))

(format #t "Package        : ~a\n"
        (package->specification hello))
(format #t "Package inputs : ~a\n"
        (length-and-list (unique-inputs (package-direct-inputs hello))))
(format #t "Build inputs   : ~a\n"
        (length-and-list
         (unique-inputs
          (bag-direct-inputs
           (package->bag hello)))))
(format #t "Package closure: ~a\n"
        (length-and-list
         (delete-duplicates
          (map package->specification
               (package-closure (list hello))))))
Package        : hello@2.10
Package inputs : (0 ())
Build inputs   : (20 ([source code from mirror://gnu/hello/hello-2.10.tar.gz] tar@1.32 gzip@1.10 bzip2@1.0.6 xz@5.2.4 file@5.33 diffutils@3.7 patch@2.7.6 findutils@4.6.0 gawk@5.0.1 sed@4.7 grep@3.3 coreutils@8.31 make@4.2.1 bash-minimal@5.0.7 ld-wrapper@0 binutils@2.32 gcc@7.4.0 glibc@2.29 glibc-utf8-locales@2.29))
Package closure: (84 (m4@1.4.18 libatomic-ops@7.6.10 gmp@6.1.2 libgc@7.6.12 libltdl@2.4.6 libunistring@0.9.10 libffi@3.2.1 pkg-config@0.29.2 guile@2.2.6 libsigsegv@2.12 lzip@1.21 ed@1.15 perl@5.30.0 guile-bootstrap@2.0 zlib@1.2.11 xz@5.2.4 ncurses@6.1-20190609 libxml2@2.9.9 attr@2.4.48 gettext-minimal@0.20.1 gcc-cross-boot0-wrapped@7.4.0 libstdc++@7.4.0 ld-wrapper-boot3@0 bootstrap-binaries@0 ld-wrapper-boot0@0 flex@2.6.4 glibc-intermediate@2.29 libstdc++-boot0@4.9.4 expat@2.2.7 gcc-mesboot1-wrapper@4.7.4 mesboot-headers@0.19 gcc-core-mesboot@2.95.3 bootstrap-mes@0 bootstrap-mescc-tools@0.5.2 tcc-boot0@0.9.26-6.c004e9a mes-boot@0.19 tcc-boot@0.9.27 make-mesboot0@3.80 gcc-mesboot0@2.95.3 binutils-mesboot0@2.20.1a make-mesboot@3.82 diffutils-mesboot@2.7 gcc-mesboot1@4.7.4 glibc-headers-mesboot@2.16.0 glibc-mesboot0@2.2.5 binutils-mesboot@2.20.1a linux-libre-headers@4.19.56 linux-libre-headers-bootstrap@0 gcc-mesboot@4.9.4 glibc-mesboot@2.16.0 gcc-cross-boot0@7.4.0 bash-static@5.0.7 gettext-boot0@0.19.8.1 python-minimal@3.5.7 perl-boot0@5.30.0 texinfo@6.6 bison@3.4.1 gzip@1.10 libcap@2.27 acl@2.2.53 glibc-utf8-locales@2.29 gcc-mesboot-wrapper@4.9.4 file-boot0@5.33 findutils-boot0@4.6.0 diffutils-boot0@3.7 make-boot0@4.2.1 binutils-cross-boot0@2.32 glibc@2.29 gcc@7.4.0 binutils@2.32 ld-wrapper@0 bash-minimal@5.0.7 make@4.2.1 coreutils@8.31 grep@3.3 sed@4.7 gawk@5.0.1 findutils@4.6.0 patch@2.7.6 diffutils@3.7 file@5.33 bzip2@1.0.6 tar@1.32 hello@2.10))

That's 84 packages, just for printing "Hello, world!". As promised, it includes the bootstrap seed, called bootstrap-binaries. It may be more surprising to see Perl and Python in the dependency list of what is a pure C program. The explanation is that the build process of gcc and glibc contains Perl and Python code. Considering that both Perl and Python are written in C and use glibc, this hints at why bootstrapping is a hard problem!
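If the notion of transitive closure still feels abstract, here is a toy version that has nothing to do with Guix: a made-up, drastically simplified dependency graph and a loop that keeps adding direct inputs until nothing new turns up. It illustrates the idea behind package-closure, not its actual implementation:

;; A fictitious graph mapping each item to its direct inputs.
(define graph
  '((hello gcc-toolchain)
    (gcc-toolchain gcc glibc binutils)
    (gcc glibc)
    (glibc)
    (binutils)))

(define (direct-inputs item)
  (or (assq-ref graph item) '()))

(define (closure items)
  ;; Repeatedly add the direct inputs of items we have not seen yet.
  (let loop ((todo items) (seen '()))
    (cond ((null? todo) (reverse seen))
          ((memq (car todo) seen) (loop (cdr todo) seen))
          (else (loop (append (direct-inputs (car todo)) (cdr todo))
                      (cons (car todo) seen))))))

(format #t "~a\n" (closure '(hello)))

Running this prints (hello gcc-toolchain gcc glibc binutils): the whole made-up graph, each item listed once.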

Get ready for your own analyses

As promised, here is a Guile script that you can download and run from the command line to do dependency analyses much like the ones I have shown. Just give it the packages whose combined list of dependencies you want to analyze. For example:

./show-dependencies.scm hello
Packages: 1
  hello@2.10
Package inputs: 0 packages
 
Build inputs: 20 packages
  [source code from mirror://gnu/hello/hello-2.10.tar.gz] bash-minimal@5.0.7 binutils@2.32 bzip2@1.0.6 coreutils@8.31 diffutils@3.7 file@5.33 findutils@4.6.0 gawk@5.0.1 gcc@7.4.0 glibc-utf8-locales@2.29 glibc@2.29 grep@3.3 gzip@1.10 ld-wrapper@0 make@4.2.1 patch@2.7.6 sed@4.7 tar@1.32 xz@5.2.4
Package closure: 84 packages
  acl@2.2.53 attr@2.4.48 bash-minimal@5.0.7 bash-static@5.0.7 binutils-cross-boot0@2.32 binutils-mesboot0@2.20.1a binutils-mesboot@2.20.1a binutils@2.32 bison@3.4.1 bootstrap-binaries@0 bootstrap-mes@0 bootstrap-mescc-tools@0.5.2 bzip2@1.0.6 coreutils@8.31 diffutils-boot0@3.7 diffutils-mesboot@2.7 diffutils@3.7 ed@1.15 expat@2.2.7 file-boot0@5.33 file@5.33 findutils-boot0@4.6.0 findutils@4.6.0 flex@2.6.4 gawk@5.0.1 gcc-core-mesboot@2.95.3 gcc-cross-boot0-wrapped@7.4.0 gcc-cross-boot0@7.4.0 gcc-mesboot-wrapper@4.9.4 gcc-mesboot0@2.95.3 gcc-mesboot1-wrapper@4.7.4 gcc-mesboot1@4.7.4 gcc-mesboot@4.9.4 gcc@7.4.0 gettext-boot0@0.19.8.1 gettext-minimal@0.20.1 glibc-headers-mesboot@2.16.0 glibc-intermediate@2.29 glibc-mesboot0@2.2.5 glibc-mesboot@2.16.0 glibc-utf8-locales@2.29 glibc@2.29 gmp@6.1.2 grep@3.3 guile-bootstrap@2.0 guile@2.2.6 gzip@1.10 hello@2.10 ld-wrapper-boot0@0 ld-wrapper-boot3@0 ld-wrapper@0 libatomic-ops@7.6.10 libcap@2.27 libffi@3.2.1 libgc@7.6.12 libltdl@2.4.6 libsigsegv@2.12 libstdc++-boot0@4.9.4 libstdc++@7.4.0 libunistring@0.9.10 libxml2@2.9.9 linux-libre-headers-bootstrap@0 linux-libre-headers@4.19.56 lzip@1.21 m4@1.4.18 make-boot0@4.2.1 make-mesboot0@3.80 make-mesboot@3.82 make@4.2.1 mes-boot@0.19 mesboot-headers@0.19 ncurses@6.1-20190609 patch@2.7.6 perl-boot0@5.30.0 perl@5.30.0 pkg-config@0.29.2 python-minimal@3.5.7 sed@4.7 tar@1.32 tcc-boot0@0.9.26-6.c004e9a tcc-boot@0.9.27 texinfo@6.6 xz@5.2.4 zlib@1.2.11

You can now easily experiment yourself, even if you are not at ease with Guile. For example, suppose you have a small Python script that plots some data using matplotlib. What are its dependencies? First you should check that it runs in a minimal environment:

guix environment --container --ad-hoc python python-matplotlib -- python my-script.py

Next, find its dependencies:

./show-dependencies.scm python python-matplotlib

I won't show the output here because it is rather long - the package closure contains 499 packages!

OK, but... what are the real dependencies?

I have explained dependencies along these lines in a few seminars. There's one question that someone in the audience is bound to ask: What do the results of a computation really depend on? The output of hello is "Hello, world!", no matter which version of gcc I use to compile it, and no matter which version of python was used in building glibc. The package closure is a worst-case estimate: it contains everything that can potentially influence the results, though most of it doesn't in practice. Unfortunately, there is no way to identify the dependencies that matter automatically, because answering that question in general (i.e. for arbitrary software) is equivalent to solving the halting problem.

Most package managers, such as Debian's apt or the multi-platform conda, take a different point of view. They define the dependencies of a program as all packages that need to be loaded into memory in order to run it. They thus exclude the software that is required to build the program and its run-time dependencies, but which can be discarded once the build is done. Whereas Guix' definition errs on the safe side (its dependency list is often longer than necessary but never too short), the run-time-only definition is both too vast and too restrictive. Many run-time dependencies don't have an impact on most programs' results, but some build-time dependencies do.

One important case where build-time dependencies matter is floating-point computations. For historical reasons, they are surrounded by an aura of vagueness and imprecision, which goes back to their early days, when many details were poorly understood and implementations varied a lot. Today, all computers used for scientific computing respect the IEEE 754 standard that precisely defines how floating-point numbers are represented in memory and what the result of each arithmetic operation must be. Floating-point arithmetic is thus perfectly deterministic and even perfectly portable between machines, if expressed in terms of the operations defined by the standard. However, high-level languages such as C or Fortran do not allow programmers to do that. Their designers assume (probably correctly) that most programmers do not want to deal with the intricate details of rounding. Therefore they provide only a simplified interface to the arithmetic operations of IEEE 754, which incidentally also leaves more liberty for code optimization to compiler writers. The net result is that the complete specification of a program's results is its source code plus the compiler and the compilation options. You can thus get reproducible floating-point results if you include all compilation steps into the perimeter of your computation, at least for code running on a single processor. Parallel computing is a different story: it involves voluntarily giving up reproducibility in exchange for speed. Reproducibility then becomes a best-effort approach of limiting the collateral damage done by optimization through the clever design of algorithms.
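To see why the order of operations matters at all, recall that floating-point addition is not associative: reassociating a sum, which is exactly what optimizing compilers are allowed to do under relaxed settings, can change the result. A two-line illustration, shown here in Guile, though the effect is the same in C:

;; The same three numbers, added in two different orders.  In the first
;; sum the 1.0 is absorbed by the huge intermediate value; in the second,
;; the huge values cancel first, so the 1.0 survives.
(format #t "~a\n" (+ (+ 1e16 1.0) -1e16))   ; prints 0.0
(format #t "~a\n" (+ (+ 1e16 -1e16) 1.0))   ; prints 1.0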

Reproducing a reproducible computation

So far, I have explained the theory behind reproducible computations. The take-home message is that to be sure to get exactly the same results in the future, you have to use the exact same versions of all packages in the package closure of your immediate dependencies. I have also shown you how you can access that package closure. There is one missing piece: how do you actually run your program in the future, using the same environment?

The good news is that doing this is a lot simpler than understanding my lengthy explanations (which is why I leave this for the end!). The complex dependency graphs that I have analyzed up to here are encoded in the Guix source code, so all you need to re-create your environment is the exact same version of Guix! You get that version using

guix describe
Generation 15 Jan 06 2020 13:30:45    (current)
  guix 769b96b
    repository URL: https://git.savannah.gnu.org/git/guix.git
    branch: master
    commit: 769b96b62e8c09b078f73adc09fb860505920f8f

The critical information here is the unpleasant-looking string of hexadecimal digits after "commit". This is all it takes to uniquely identify a version of Guix. And to re-use it in the future, all you need is Guix' time machine:

guix time-machine --commit=769b96b62e8c09b078f73adc09fb860505920f8f -- environment --ad-hoc gcc-toolchain
Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
gcc pi.c -o pi
./pi
M_PI                         : 3.1415926536
4 * atan(1.)                 : 3.1415926536
Leibniz' formula (four terms): 2.8952380952

The time machine actually downloads the specified version of Guix and passes it the rest of the command line. You are running the same code again. Even bugs in Guix will be reproduced faithfully! As before, guix environment leaves us in a special-environment shell which needs to be terminated by Ctrl-D.

For many practical use cases, this technique is sufficient. But there are two variants you should know about for more complicated situations:

  • If you need an environment with many packages, you should use a manifest rather than list the packages on the command line; a minimal manifest sketch is shown after this list. See the manual for details.

  • If you need packages from additional channels, i.e. packages that are not part of the official Guix distribution, you should store a complete channel description in a file using

guix describe -f channels > guix-version-for-reproduction.txt

and feed that file to the time machine:

guix time-machine --channels=guix-version-for-reproduction.txt -- environment --ad-hoc gcc-toolchain
Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
gcc pi.c -o pi
./pi
M_PI                         : 3.1415926536
4 * atan(1.)                 : 3.1415926536
Leibniz' formula (four terms): 2.8952380952
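Regarding the first point above: a manifest is just a small Scheme file that lists the packages of the environment by name. A minimal sketch (the package list is of course yours to adapt) could look like this:

;; manifest.scm -- declare the contents of the environment in a file
;; instead of on the command line.
(specifications->manifest
 '("gcc-toolchain"))

You would then pass this file to guix environment (or to guix time-machine ... -- environment, as above) using the --manifest option instead of --ad-hoc followed by package names.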

Last, if your colleagues do not use Guix yet, you can pack your reproducible software for use on other systems: as a tarball, or as a Docker or Singularity container image. For example:

guix pack            \
     -f docker       \
     -C none         \
     -S /bin=bin     \
     -S /lib=lib     \
     -S /share=share \
     -S /etc=etc     \
     gcc-toolchain
/gnu/store/iqn9yyvi8im18g7y9f064lw9s9knxp0w-docker-pack.tar

will produce a Docker container image, and with the knowledge of the Guix commit (or channel specification), you will be able in the future to reproduce this container bit-for-bit using guix time-machine.

And now... congratulations for having survived to the end of this long journey! May all your computations be reproducible, with Guix.

About GNU Guix

GNU Guix is a transactional package manager and an advanced distribution of the GNU system that respects user freedom. Guix can be used on top of any system running the kernel Linux, or it can be used as a standalone operating system distribution for i686, x86_64, ARMv7, and AArch64 machines.

In addition to standard package management features, Guix supports transactional upgrades and roll-backs, unprivileged package management, per-user profiles, and garbage collection. When used as a standalone GNU/Linux distribution, Guix offers a declarative, stateless approach to operating system configuration management. Guix is highly customizable and hackable through Guile programming interfaces and extensions to the Scheme language.

by Konrad Hinsen at January 14, 2020 04:30 PM

Joe Marshall

Palindromes, redux, and the Sufficiently Smart Compiler

The Sufficiently Smart Compiler is mentioned by authors as shorthand for “a compiler that performs nearly all reasonable optimizations, but in particular this one I want”. Many attempts were made up through the 80's and maybe into the 90's to write a Sufficiently Smart Compiler that would perform all “reasonable” optimizations, and although many impressive results have been obtained, there always seem to be fairly obvious optimizations that remain unoptimized. These days it seems that people realize that there will be good compilers and some very good compilers, but never a Sufficiently Smart Compiler. Nonetheless, it is worth considering a Sufficiently Smart Compiler as a tool for thought experiments.

I was curious what would be necessary for a Sufficiently Smart Compiler to generate optimal code for the palindrome problem given the naive algorithm.

The naive algorithm is inspired by the axioms
  • A zero or one element string is a palindrome.
  • If the first char matches the last char, and the middle is a palindrome, the result is a palindrome.
and gives us this:
(define (palindrome1? string)
  (or (< (string-length string) 2)
      (and (char=? (string-ref string 0)
                   (string-ref string (- (string-length string) 1)))
           (palindrome1? (substring string 1 (- (string-length string) 1))))))

The higher performing algorithm is inspired by the idea of keeping two pointers to each end of a string and comparing the characters at the pointers. If the characters are the same, you move the pointers inward and when they meet, you have seen a palindrome. If at any point the characters differ, you don't have a palindrome:
(define (palindrome2? string)
  (define (scan front-pointer rear-pointer)
    (or (>= front-pointer rear-pointer)
        (and (char=? (string-ref string front-pointer)
                     (string-ref string rear-pointer))
             (scan (+ front-pointer 1) (- rear-pointer 1)))))
  (scan 0 (- (string-length string) 1)))
As you can see, these really aren't very different to start with. Both algorithms are iterative and both work their way in from the outside of the string. There are basically two differences. First, access to the rear of the string is either by a rear pointer, or by using the string-length of the string and subtracting 1. Second, the iterative call either uses substring or moves the pointers closer together.

First, let's assume that our processor can reference through an indexed offset. This would mean we could point at the element one beyond the rear-pointer and not incur overhead. This isn't an unreasonable assumption for a CISC architecture such as an x86, but would probably cost one instruction of overhead on a RISC architecture. So the second algorithm becomes this:
(define (palindrome2? string)
  (define (scan front-pointer rear-pointer)
    (or (< (- rear-pointer front-pointer) 2)
        (and (char=? (string-ref string front-pointer)
                     (string-ref string (- rear-pointer 1)))
             (scan (+ front-pointer 1) (- rear-pointer 1)))))
  (scan 0 (string-length string)))

Now this next assumption is a bit more of a stretch. The implementation of palindrome1? uses substring on each iteration and that's going to result in a lot of string copying. If our implementation used “slices” instead of copying the string, then there would be a lot less copying going on:
(define (palindrome1? string)
  (or (< (- (slice-end string) (slice-start string)) 2)
      (and (char=? (string-ref string (slice-start string))
                   (string-ref string (- (slice-end string) 1)))
           (palindrome1? 
             (substring string (+ (slice-start string) 1) (- (slice-end string) 1))))))

It is not uncommon for a compiler to introduce internal procedures for looping, so we can do that.
(define (palindrome1? string)
  (define (scan slice)
    (or (< (- (slice-end slice) (slice-start slice)) 2)
        (and (char=? (slice-ref slice (slice-start slice))
                     (slice-ref slice (- (slice-end slice) 1)))
             (scan (subslice slice (+ (slice-start slice) 1) (- (slice-end slice) 1))))))
  (scan (make-slice 0 (string-length string))))
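An aside: the slice operations themselves are never defined in this post. One plausible reading, just to make the sketch concrete, is that a slice records the underlying string together with absolute start and end indices, and that slice-ref indexes that underlying string directly. (Under this reading the code above is loose in places: for instance, make-slice would also need the string as an argument, and a bare string would first have to be wrapped in a slice. But the idea should be clear.)
;; Hypothetical slice operations, assuming a slice is the underlying
;; string plus absolute start and end indices.
(define (make-slice string start end) (list string start end))
(define (slice-string slice) (car slice))
(define (slice-start slice) (cadr slice))
(define (slice-end slice) (caddr slice))
(define (slice-ref slice index)
  (string-ref (slice-string slice) index))
(define (subslice slice start end)
  (make-slice (slice-string slice) start end))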

We'll enter fantasy land again and let our compiler be smart enough to “spread” the slice data structure into the argument list of scan. This is no doubt asking too much from our compiler, but the information is available and it could in theory be done:
(define (palindrome1? string)
  (define (scan slice-start slice-end)
    (or (< (- slice-end slice-start) 2)
        (and (char=? (slice-ref string slice-start)
                     (slice-ref string (- slice-end 1)))
             (scan (+ slice-start 1) (- slice-end 1)))))
  (scan 0 (string-length string)))

And now we have palindrome2? (modulo renaming).

This doesn't really prove anything. But with a couple of somewhat unlikely compiler tricks, the naive version could be transformed into the more optimized version. It suggests that it would be surprising, but not a complete shock, for an ambitious compiler writer to attempt.

I wish someone would write that Sufficiently Smart Compiler.

by Joe Marshall (noreply@blogger.com) at January 14, 2020 12:09 PM

Programming Praxis

Newbie Question

We have something different today.

Although most of my readers are accomplished, savvy programmers (or at least those who are confident enough to comment), I will always include content designed for beginners; it’s part of the DNA of the blog, and sometimes brings in new readers. Thus, I spend a lot of time in beginner-programmer online forums. Today’s question at Reddit caught my eye:

Newbie question: What’s the difference between a “loop” and a “function”?

Sorry if this is answered somewhere but I am completely new to programming.

What is the difference between a function and a loop, when both of them can be used more than once?

The posting history of the person who asked that question indicates that he has recently been expelled from college (his GPA wasn’t great) and that

In the meantime I just started a computer programming course online. Was never really interested in it but everyone says it’s a skill you should have so I figured I’ll give it a shot. Plus it’s free so there’s that.

Your task is to answer the newbie question; if you answer on Reddit, please also leave your answer here, with a pointer to your Reddit posting. When you are finished, you are welcome to read my answer or discuss the exercise in the comments below.

by programmingpraxis at January 14, 2020 10:00 AM