Planet Scheme

January 19, 2020

Joe Marshall

Afraid of Recursion

Here's a trick I learned several years ago for transferring time-series data. In the case in question, I needed to transfer a bunch of timestamped records, but the server had a quirk to it. If you asked for too many records at once, it would simply return an error code and give up on your request. There was no way to know beforehand how many records might exist in any given time span, so you could get an error code on nearly any request, unless it was for a very short time span. On the other hand, many requests for long time spans would succeed because they had few records in them. Despite this quirk, the code was really simple:
List<record> xfer (Timestamp start, Timestamp end) {
    try {
        return tryFetchRecords(start, end);
    } catch (TooManyRecordsException e) {
        Timestamp mid = (start + end)/2;
 List<record> firstHalf = xfer (start, mid);
 List<record> secondHalf = xfer (mid, end);
 return firstHalf.addAll(secondHalf);
    }
}
On any request, if the server returned the error code, we would simply bisect the time span, recursively ask for each half separately, and combine the two halves. Should the bisected time span still contain too many records, the time span would be bisected again. The recursion would continue until the time span was small enough that the server could honor the request and return some records. The recursion would then unwind, combining the returned records into larger and larger lists until we had all the records for our original time span. Since the time span would be cut in half on each recurrence, the depth of recursion would be proportional to the logarithm (base 2) of the total number of records, which would be a reasonably small number even with an enormous number of records.

It's certainly possible to avoid recursion and do this iteratively by some form of paging, but the code would be slightly more complex. The server is not very cooperative, so there is no easy way to determine an appropriate page size beforehand, and the server doesn't support a “paging token” to help keep track of progress. The recursive solution finds an appropriate transfer size by trial and error, and keeps track of progress more or less automatically. An iterative paging solution would have to do these things more explicitly and this would make the iterative code a bit more complex. And why add any complexity when it isn't really necessary?

I thought this solution was really cool when I first saw it. I've used this trick for transferring time series data many times. It makes the server very simple to write because the protocol requires so little of it. It simply has to refuse to answer requests that return too many results. The client code is just about the 10 lines above.

But when I first suggest this code to people I usually get “push back” (that is, until they see it work in action, then they usually get on board with it). People seem unsure about the use of recursion and want a more complex protocol where the client and server negotiate a page size or cooperatively pass a paging token back and forth on each request and response. Their eyes glaze over as soon as they see the recursive call. They want to avoid recursion just because it's recursive.

I've seen “aversion to recursion” happen in a lot of circumstances, not just this one. Recursion isn't the solution to everything. No tool solves all problems. But it is an important tool that often offers elegant solutions to many problems. Programmers shouldn't be afraid of using it when it is appropriate.

by Joe Marshall (noreply@blogger.com) at January 19, 2020 11:52 PM

January 18, 2020

Joe Marshall

Unsyndicated blog

I've noticed that my blog posts are replicated in Planet Lisp and Planet Scheme, and here I am spamming them with random math stuff. So I'm creating a new blog, Jrm's Random Blog, where I can feel free to post about math, science, computers in general, and whatever else bugs me, without spamming the Lisp and Scheme readers. I'll keep posting to Abstract Heresies, but try to keep it more Lisp and computer language focused.

by Joe Marshall (noreply@blogger.com) at January 18, 2020 01:33 PM

January 17, 2020

Programming Praxis

Coprime Numbers

This exercise comes from CodeForces:

Given two natural numbers lo < hi, find a triple (a, b, c) such that a is coprime to b, b is coprime to c, and a is not coprime to c, or indicate that no such triple exists. If there is more than one such triple, you may return any of them. Two numbers are coprime if their greatest common divisor is 1. Lo and hi will be less than 1018 and differ by no more than 50.

Your task is to find a triple satisfying the conditions above. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

by programmingpraxis at January 17, 2020 10:00 AM

Joe Marshall

Groups, semigroups, monoids, and computers

The day after I rant about mathematicians, I make a math post. “Do I contradict myself? Very well, then, I contradict myself, I am large, I contain multitudes.” — Walt Whitman

A group is a mathematical concept. It's pretty simple. It consists of a set, G, and an operation, *, which can be used to combine any two elements of G. What the set contains is not that important. It is the * operation we're interested in, and we can usually swap out G for another set without causing too many problems other than having to change the type signature of *. There are four axioms that * must obey
  • Closure—combining any two elements of G using * just gives you another element in G.
    Note that this means you can build an arbitrary binary tree of combinations: e.g.(* (* a b) (* (* c d) e))). These trees will always be like a tree of cons cells. In some sense, the closure axiom is equivalent to saying that all the elements of G have the same type and that the * operator operates on values of that type and produces values of that type. The closure axiom along with the binary operation means that we can reduce any tree of combinations to a single value.
  • Associativity(* (* a b) c) = (* a (* b c)) for any a, b, and c. This implies that you can take any arbitrary tree of combinations: e.g.(* (* a b) (* (* c d) e))) and simply flatten it into a list (* a b c d e), or given the flat sequence (* a b c d e) we can add parenthesis anywhere we like: (* a (* b c) d e). If we stop here and only have the closure and associativity axiom, we have what is called a “semigroup”. You can use the * operation to “fold” a semigroup down to single value, or to keep an accumulator and incrementally fold elements into the accumulator.
  • Identity element—There has to be an identity element id such that (* id x) = (* x id) = x for all x. It will be unique. If you see the identity object in a combination (* a b id c d), you can simply remove it: (* a b c d). The identity element also comes in handy as an initial value when you are folding a sequence. If you have some concept that would be a group except it doesn't have an identity element, then you can often just make one up and add it to the set G.
  • Inverse element—For every element in G there has to be another element, that when combined with the first, gives you the identity. So if a is an element in G, there has to be some other element, call it b, such that (* a b) = (* b a) = id. The inverse element is usually notated with a little -1: a-1. If you have an element in a combination right next to it's inverse: (* a x x-1 c), you can combine the element and it's inverse to get the identity: (* a id c), and then remove the identity: (* a c)
Frequently you run into something that obeys all the axioms but the inverse element axiom. This is called a monoid. A monoid is very much like a group except that you can get “stuck” when manipulating it if you run into one of the non-invertible elements because there's no inverse to “undo” it. There are certain things about monoids that are true only “if the appropriate inverses exist”. You run into that qualifier a lot when dealing with monoids. You don't need that qualifier if you are dealing with a group because they do exist by axiom. Or we could say that calling something a group is simply shorthand for adding “if the appropriate inverses exist” everywhere.

What does this have to do with computers? Consider the set of all subroutines with the operation of concatenation. It is closed — concatenating two subroutines gives you a third subroutine. It is associative — you just concatenate them linearly. There is an identity element, usually called no-op. And many, but not all, subroutines have inverses. So we have a monoid.

Consider the set of all strings with the operation of concatenation. It is closed, associative, the empty string is the identity element. It is a monoid.

Consider the set of functions whose input type is the same as the result type with the operation of composition. It is closed, associative, the identity function is the identity element. It is a monoid. If we consider only the subset of functions that also have inverses, we have a group. This particular monoid or group comes in especially handy because composition of functions is so useful.

Consider the set of invertible 2x2 matrices with integer components, a determinant of 1 or -1, and the operation of matrix multiply. It is closed, associative, there is an identity matrix, and I already said just consider the invertible ones. It forms a group. This group comes in handy for implementing arbitrary precision arithmetic. (Thanks to Bradley Lucier for the correction of the condition on the determinant. This makes the matrix continue to have integer components upon inversion, keeping things closed.)

The permutations of a list form a group. The integers under addition form a group.

These things are everywhere. And it isn't a coincidence. The concepts of a group, monoid, and semigroup are meant to capture the essence of what it is to have a foldable sequence of elements. (Can I complain about mathematicians here? They make up so much terminology and abstraction that it is virtually impossible to get at what they really mean. We're just talking about sequences of elements and trying to find some minimal axioms that you need to have to fold them, but try to find literature that actually says that's what we're doing is like trying to pull hen's teeth.)

So what good are groups, monoids, and semigroups? Aside from the obvious fact that foldable sequences are ubiquitous and really useful, that is. Not immediately apparent from the axioms is that in addition to folding a sequence, you can transform a sequence into a different, but equivalent one. If the appropriate inverses exist (there's that phrase), you can “unfold” some or all elements of a sequence. So by judicious folding and unfolding, you can transform a sequence.

Here's an unusual abstract example. Consider a pipeline which has a set of nodes and communicates values of the same type between the nodes. Values accumulate at the nodes until they are transmitted to the next node in the pipeline. We start with all the values in the initial node (on the right) and transmit them to the left:
(pipeline (node) (node) (node a b c))  ;; transmit the a
(pipeline (node) (node a) (node b c))  ;; transmit the b
(pipeline (node) (node a b) (node c))  ;; transmit the a
(pipeline (node a) (node b) (node c))  ;; transmit the c
(pipeline (node a) (node b c) (node))  ;; transmit the b
(pipeline (node a b) (node c) (node))  ;; transmit the c
(pipeline (node a b c) (node) (node))  ;; done
If the values we transmit are drawn from a group, we can replace each node with the group's * operator:
(* identity identity (* a b c))  ;; transmit the a
(* identity (* identity a) (* b c))  ;; transmit the b
(* identity (* a b) (* identity c))  ;; transmit the a
(* (* identity a) (* identity  b) (* identity c))  ;; transmit the c
(* (* identity a) (* b c) identity)  ;; transmit the b
(* (* a b) (* identity c) identity)  ;; transmit the c
(* (* a b c) identity identity)  ;; done
The astute reader will notice that all we're doing is making use of the associativity axiom and moving the parenthesis around so that the values seem to move between the different nodes. But we preserve the invariant that the “value” of the entire pipeline doesn't change as the values move. The * operator need not be concatenate, which would give simple queuing behavior, but can be any operator satisfying the axioms giving us much more interesting pipelines. One implementation of arbitrary precision arithmetic transmits Möbius transformations along just such a pipeline to refine the upper and lower limits of a computed approximation. In this implementation, the * operator is the composition of Möbius transformations.

Here's a more concrete example. If you have a series of nested functions: (f (g x)) and both f and g take and return the same type, rewrite it as ((compose f g) x) and use a little group theory on it.
(f (g x))
((compose f g) x)
;; or more explicitly
((fold-left compose identity (list f g)) x)
If the appropriate inverses exist, then there will be another function h such that (compose f g) is equal to (compose h f) essentially allowing you to “slide” g to the left “through” f. It is relatively easy to see that h must be equivalent to (compose f g f-1). Mathematicians say that h is conjugate to g. Conjugates always have a form like aba-1. By finding conjugates, you can take a sequence and slide the elements left and right through other elements. This also allows you to fold things out of order. (Or in the pipeline example, transmit items out of order.) If we were left folding into an accumulator, folding h before f is equivalent to folding g after f. Another way of looking at it is this. Suppose we're standing to the left of f and looking through the “lens” of f at g. h is what g “looks like” when viewed through f.

If we want, we can define slide such that (compose slide (compose f g)) is equivalent to (compose h f). slide is (compose h f g-1 f-1). (This isn't a generic slide sequence, it only works on (compose f g). It ought to be an identity because (compose f g) is equivalent to (compose h f).) I complained that mathematicians provided too few concrete examples, so here is a concrete example using list permutations:
> (reverse (rotate-left '(a b c d)))
(a d c b)

;; rewrite as explicit fold-left of compose
> ((fold-left compose identity (list reverse rotate-left)) '(a b c d))
(a d c b)

;; sliding rotate-left through reverse turns it into rotate-right
> ((fold-left compose identity (list rotate-right reverse)) '(a b c d))
(a d c b)

;; A sequence that when composed with (list reverse rotate-left) turns it into
;; (rotate-right reverse)
> (define slide 
    (fold-left compose identity (list rotate-right reverse rotate-right reverse)))
slide

> ((fold-left compose identity (list slide reverse rotate-left)) '(a b c d))
(a d c b)

;; rewrite back to direct procedure calls
> (rotate-right (reverse '(a b c d)))
(a d c b)

;; and slide ought to be an identity
> ((fold-left compose identity (list slide)) '(a b c d))
(a b c d)

Or suppose you have (f (g x)), but for some reason you want(g (f x)) (which would, in general, be a different value unless f and g happen to commute). Again, rewrite (f (g x)) as ((compose f g) x) and apply a little group theory. If the appropriate inverses exist, there will be a function commute-fg such that (compose commute-fg (compose f g)) is equivalent to (compose g f). With a little thought, you can see that commute-fg is equivalent to (compose g f g-1 f-1). (Again, this isn't a generic commute, it only causes this specific f and g to commute.) commute-fg is called a commutator because it makes f and g commute. Commutators always have the form aba-1b-1. By finding commutators and inserting them in the right place, you can take a sequence and swap adjacent elements. Again, a concrete example with lists:
;; an illustration of what swap-first two does
> (swap-first-two '(a b c d))
(b a c d)

;; we're given
> (reverse (swap-first-two '(a b c d)))
(d c a b)

;; but we want, for some reason to reverse first
> (swap-first-two (reverse '(a b c d)))
(c d b a)

;; rewrite as fold-left of compose
> ((fold-left compose identity (list reverse swap-first-two)) '(a b c d))
(d c a b)

;; define our commutator
;; note that swap-first-two and reverse are their own inverses
> (define commute-fg
    (fold-left compose identity (list swap-first-two reverse swap-first-two reverse)))

;; make f and g commute
;; observe that it returns the desired result
> ((fold-left compose identity (list commute-fg reverse swap-first-two)) '(a b c d))
(c d b a)

There's two interesting things here. First, notice that in both examples I convert (f (g x)) to ((fold-left compose identity (list f g)) x) and then proceed to ignore x and just consider (fold-left compose identity (list f g)) as if x didn't exist. I've abstracted away the x. (Of course I have to eventually supply the x if I want an answer, but it only comes back at the last moment.) Second, notice that although slide and commute-fg are foldable sequences, I use them as if they were higher order functions operating on the foldable sequence (compose f g) to transform it, first into (compose h f), second into (compose g f). This second thing is a neat trick. We're taking a function that operates on lists and treating it as if it were a higher-order function that operates on functions. This is called the “action” of slide and commute-fg because it appears as if elements of the set G of our group can “act” directly on other elements.

Every element in the underlying set G of a group has an action associated with it which operates directly on other elements in G. This is an important concept in group theory. Now earlier I said that the actual elements of G don't matter much, so the action must be more closely tied to the operator *. And if we swap out G for another set we'll still have the same actions, they'll just be associated with the elements of the new set (in an isomorphic way). The actions are pretty abstract.

There's a lot more one could say about the actions. They are a rich source of interesting math. My brain is getting fatigued with all this abstraction, so I'll leave the topic be for now.

If group theory is about the essence of what it means to have a foldable sequence, then category theory is about the essence of composition. They offer two somewhat different approaches to similar material. What do you do with sequences but compose them? What comes from composition but a sequence? Many concepts in group theory carry over into category theory. Naturally a completely different set of terminology is used, but the concepts are there.

But that's enough group theory for today and category theory can wait until later posts.

by Joe Marshall (noreply@blogger.com) at January 17, 2020 09:13 AM

January 15, 2020

Joe Marshall

Math is hard, let's go shopping

I find mathematics, with all it's weird terminology and abstraction and equations, hard to understand. That's kind of funny coming from someone like me who makes a living from a branch of mathematics. I find computers and programming to be rather easy to understand — probably because I've had a lot of practice. But computer science is just applied logic and programming is arguably just the study of the computable functions, so you'd think math would come naturally. It doesn't.

One problem I've found is that as much as mathematicians pride themselves on rigor, they tend to be a bit sloppy and leave out important details. Computer scientists don't leave out important details because then the programs won't run. It's true that too much detail can clutter things up, but leaving out the detail and relying on “context” just increases the intellectual burden on the reader.

I will give mathematician's credit for thinking about edge cases perhaps more than a computer scientist would. It can be easy to be a bit complacent with edge cases because the computer will likely do something even if you don't think too hard about what it ought to do. But a good computer scientist tries to reduce the number of edge cases or at least make them coherent with the non-edge cases.*

Mathematicians seem to take perverse pleasure in being obscure. Computer scientists strive to be as obvious as possible because like as not, they are the ones that have to revisit the code they wrote and don't want to have to remember what they were thinking at the time. It's just easier to spell things out explicitly and obviously so that you can get back up to speed quickly when you have to debug your own stupid code. Every time I pick up some literature on category theory, I get hit with a “Wall of Terminology” denser than the “Wall of Sound” on a Phil Spector recording. It's fundamentally simple stuff, but it is dressed up in pants so fancy one has a hard time extracting the plain meaning. What seems to be universal in category theory is my difficulty in getting past page 4.

I once read a mathematical paper that talked about an algorithm with three tuning parameters: α, β, and another α. No decent computer programmer would give the same name to two different variables. Which α was which was supposed to be “obvious” from the context. The brainpower needed to keep track of the different αs was absurd and a complete waste of effort when calling the variable something else, like γ would have done the trick.

And don't ask a mathematician to write computer code. That's the one time they'll leave out all the abstraction. Instead of a nice piece of abstract, functional code, you'll get a mess of imperative code that smashes and bashes its way to a solution with no explanation of how it got there. It's a lot easier to take some abstract, functional code and figure out a more optimal way, probably imperative way to do it than it is to take a more optimal imperative piece of code and figure out the abstract, functional meaning of it.

I've found it to be extremely helpful when a computer paper includes one or two concrete examples of what it is talking about. That way, if I try to go implement code that does what the paper suggests, there's some indication that I'm on the right track. I'm more confident that I understand the paper if I have working code that produces the exact same values the paper's authors got. It's harder to find concrete examples in a math paper, and it is easier to think you know what it says but be far off base if there aren't any examples.

Maybe I shouldn't blame mathematicians so much and look a little closer to home. Perhaps I should study harder instead of demanding to be spoon fed difficult concepts. But then I read Feynman, S&ICP, S&ICM, and Jaynes and discover that maybe I just need a simple explanation that makes sense to me.

Sturgeon's Revelation is “90% of everything is crap”. This is true of both mathematical papers and computer science papers.



*An old joke illustrates the importance of thinking of edge cases: A programmer implements a bar. The test engineer goes in and orders a beer, orders zero beers, orders 999999999 beers, orders -1 beers, orders a lizard, and declares the bar ready for release. The first customer comes in and asks to use the restroom. The bar catches fire and burns down.

by Joe Marshall (noreply@blogger.com) at January 15, 2020 02:55 PM

January 14, 2020

GNU Guix

Reproducible computations with Guix

This post is about reproducible computations, so let's start with a computation. A short, though rather uninteresting, C program is a good starting point. It computes π in three different ways:

#include <math.h>
#include <stdio.h>

int main()
{
    printf( "M_PI                         : %.10lf\n", M_PI);
    printf( "4 * atan(1.)                 : %.10lf\n", 4.*atan(1.));
    printf( "Leibniz' formula (four terms): %.10lf\n", 4.*(1.-1./3.+1./5.-1./7.));
    return 0;
}

This program uses no random element, such as a random number generator or parallelism. It's strictly deterministic. It is reasonable to expect it to produce exactly the same output, on any computer and at any point in time. And yet, many programs whose results should be perfectly reproducible are in fact not. Programs using floating-point arithmetic, such as this short example, are particularly prone to seemingly inexplicable variations.

My goal is to explain why deterministic programs often fail to be reproducible, and what it takes to fix this. The short answer to that question is "use Guix", but even though Guix provides excellent support for reproducibility, you still have to use it correctly, and that requires some understanding of what's going on. The explanation I will give is rather detailed, to the point of discussing parts of the Guile API of Guix. You should be able to follow the reasoning without knowing Guile though, you will just have to believe me that the scripts I will show do what I claim they do. And in the end, I will provide a ready-to-run Guile script that will let you explore package dependencies right from the shell.

Dependencies: what it takes to run a program

One keyword in discussions of reproducibility is "dependencies". I will revisit the exact meaning of this term later, but to get started, I will define it loosely as "any software package required to run a program". Running the π computation shown above is normally done using something like

gcc pi.c -o pi
./pi

C programmers know that gcc is a C compiler, so that's one obvious dependency for running our little program. But is a C compiler enough? That question is surprisingly difficult to answer in practice. Your computer is loaded with tons of software (otherwise it wouldn't be very useful), and you don't really know what happens behind the scenes when you run gcc or pi.

Containers are good

A major element of reproducibility support in Guix is the possibility to run programs in well-defined environments that contain exactly the software packages you request, and no more. So if your program runs in an environment that contains only a C compiler, you can be sure it has no other dependencies. Let's create such an environment:

guix environment --container --ad-hoc gcc-toolchain

The option --container ensures the best possible isolation from the standard environment that your system installation and user account provide for day-to-day work. This environment contains nothing but a C compiler and a shell (which you need to type in commands), and has access to no other files than those in the current directory.

If the term "container" makes you think of Docker, note that this is something different. Note also that the option --container requires support from the Linux kernel, which may not be present on your system, or may be disabled by default. Finally, note that by default, a containerized environment has no network access, which may be a problem. If for whatever reason you cannot use --container, use --pure instead. This yields a less isolated environment, but it is usually good enough. For a more detailed discussion of these options, see the Guix manual.

The above command leaves me in a shell inside my environment, where I can now compile and run my little program:

gcc pi.c -o pi
./pi
M_PI                         : 3.1415926536
4 * atan(1.)                 : 3.1415926536
Leibniz' formula (four terms): 2.8952380952

It works! So now I can be sure that my program has a single dependency: the Guix package gcc-toolchain. I'll leave that special-environment shell by typing Ctrl-D, as otherwise the following examples won't work.

Perfectionists who want to exclude the possibility that my program requires a shell could run each step in a separate container:

guix environment --container --ad-hoc gcc-toolchain -- gcc pi.c -o pi
guix environment --container --ad-hoc gcc-toolchain -- ./pi
M_PI                         : 3.1415926536
4 * atan(1.)                 : 3.1415926536
Leibniz' formula (four terms): 2.8952380952

Welcome to dependency hell!

Now that we know that our only dependency is gcc-toolchain, let's look at it in more detail:

guix show gcc-toolchain
name: gcc-toolchain
version: 9.2.0
outputs: out debug static
systems: x86_64-linux i686-linux
dependencies: binutils@2.32 gcc@9.2.0 glibc@2.29 ld-wrapper@0
location: gnu/packages/commencement.scm:2532:4
homepage: https://gcc.gnu.org/
license: GPL 3+
synopsis: Complete GCC tool chain for C/C++ development  
description: This package provides a complete GCC tool chain for C/C++
+ development to be installed in user profiles.  This includes GCC, as well as
+ libc (headers an d binaries, plus debugging symbols in the `debug' output),
+ and Binutils.

name: gcc-toolchain
version: 8.3.0
outputs: out debug static
systems: x86_64-linux i686-linux
dependencies: binutils@2.32 gcc@8.3.0 glibc@2.29 ld-wrapper@0
location: gnu/packages/commencement.scm:2532:4
homepage: https://gcc.gnu.org/
license: GPL 3+
synopsis: Complete GCC tool chain for C/C++ development  
description: This package provides a complete GCC tool chain for C/C++
+ development to be installed in user profiles.  This includes GCC, as well as
+ libc (headers an d binaries, plus debugging symbols in the `debug' output),
+ and Binutils.

name: gcc-toolchain
version: 7.4.0
outputs: out debug static
systems: x86_64-linux i686-linux
dependencies: binutils@2.32 gcc@7.4.0 glibc@2.29 ld-wrapper@0
location: gnu/packages/commencement.scm:2532:4
homepage: https://gcc.gnu.org/
license: GPL 3+
synopsis: Complete GCC tool chain for C/C++ development  
description: This package provides a complete GCC tool chain for C/C++
+ development to be installed in user profiles.  This includes GCC, as well as
+ libc (headers an d binaries, plus debugging symbols in the `debug' output),
+ and Binutils.

name: gcc-toolchain
version: 6.5.0
outputs: out debug static
systems: x86_64-linux i686-linux
dependencies: binutils@2.32 gcc@6.5.0 glibc@2.29 ld-wrapper@0
location: gnu/packages/commencement.scm:2532:4
homepage: https://gcc.gnu.org/
license: GPL 3+
synopsis: Complete GCC tool chain for C/C++ development  
description: This package provides a complete GCC tool chain for C/C++
+ development to be installed in user profiles.  This includes GCC, as well as
+ libc (headers an d binaries, plus debugging symbols in the `debug' output),
+ and Binutils.

name: gcc-toolchain
version: 5.5.0
outputs: out debug static
systems: x86_64-linux i686-linux
dependencies: binutils@2.32 gcc@5.5.0 glibc@2.29 ld-wrapper@0
location: gnu/packages/commencement.scm:2532:4
homepage: https://gcc.gnu.org/
license: GPL 3+
synopsis: Complete GCC tool chain for C/C++ development  
description: This package provides a complete GCC tool chain for C/C++
+ development to be installed in user profiles.  This includes GCC, as well as
+ libc (headers an d binaries, plus debugging symbols in the `debug' output),
+ and Binutils.

name: gcc-toolchain
version: 4.9.4
outputs: out debug static
systems: x86_64-linux i686-linux
dependencies: binutils@2.32 gcc@4.9.4 glibc@2.29 ld-wrapper@0
location: gnu/packages/commencement.scm:2532:4
homepage: https://gcc.gnu.org/
license: GPL 3+
synopsis: Complete GCC tool chain for C/C++ development  
description: This package provides a complete GCC tool chain for C/C++
+ development to be installed in user profiles.  This includes GCC, as well as
+ libc (headers an d binaries, plus debugging symbols in the `debug' output),
+ and Binutils.

name: gcc-toolchain
version: 4.8.5
outputs: out debug static
systems: x86_64-linux i686-linux
dependencies: binutils@2.32 gcc@4.8.5 glibc@2.29 ld-wrapper@0
location: gnu/packages/commencement.scm:2532:4
homepage: https://gcc.gnu.org/
license: GPL 3+
synopsis: Complete GCC tool chain for C/C++ development  
description: This package provides a complete GCC tool chain for C/C++
+ development to be installed in user profiles.  This includes GCC, as well as
+ libc (headers an d binaries, plus debugging symbols in the `debug' output),
+ and Binutils.

Guix actually knows about several versions of this toolchain. We didn't ask for a specific one, so what we got is the first one in this list, which is the one with the highest version number. Let's check that this is true:

guix environment --container --ad-hoc gcc-toolchain -- gcc --version
gcc (GCC) 9.2.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

The output of guix show contains a line about dependencies. These are the dependencies of our dependency, and you may already have guessed that they will have dependencies as well. That's why reproducibility is such a difficult job in practice! The dependencies of gcc-toolchain@9.2.0 are:

guix show gcc-toolchain@9.2.0 | recsel -P dependencies
binutils@2.32 gcc@9.2.0 glibc@2.29 ld-wrapper@0

To dig deeper, we can try feeding these dependencies to guix show, one by one, in order to learn more about them:

guix show binutils@2.32
name: binutils
version: 2.32
outputs: out
systems: x86_64-linux i686-linux
dependencies: 
location: gnu/packages/base.scm:415:2
homepage: https://www.gnu.org/software/binutils/
license: GPL 3+
synopsis: Binary utilities: bfd gas gprof ld  
description: GNU Binutils is a collection of tools for working with binary
+ files.  Perhaps the most notable are "ld", a linker, and "as", an assembler.
+ Other tools include programs to display binary profiling information, list the
+ strings in a binary file, and utilities for working with archives.  The "bfd"
+ library for working with executable and object formats is also included.
guix show gcc@9.2.0
guix show: error: gcc@9.2.0: package not found

This looks a bit surprising. What's happening here is that gcc is defined as a hidden package in Guix. The package is there, but it is hidden from package queries. There is a good reason for this: gcc on its own is rather useless, you need gcc-toolchain to actually use the compiler. But if both gcc and gcc-toolchain showed up in a search, that would be more confusing than helpful for most users. Hiding the package is a way of saying "for experts only".

Let's take this as a sign that it's time to move on to the next level of Guix hacking: Guile scripts. Guile, an implementation of the Scheme language, is Guix' native language, so using Guile scripts, you get access to everything there is to know about Guix and its packages.

A note in passing: the emacs-guix package provides an intermediate level of Guix exploration for Emacs users. It lets you look at hidden packages, for example. But much of what I will show in the following really requires Guile scripts. Another nice tool for package exploration is guix graph, which creates a diagram showing dependency relations between packages. Unfortunately that diagram is legible only for a relatively small number of dependencies, and as we will see later, most packages end up having lots of them.

Anatomy of a Guix package

From the user's point of view, a package is a piece of software with a name and a version number that can be installed using guix install. The packager's point of view is quite a bit different. In fact, what users consider a package is more precisely called the package's output in Guix jargon. The package is a recipe for creating this output.

To see how all these concepts fit together, let's look at an example of a package definition: xmag. I have chosen this package not because I care much about it, but because its definition is short while showcasing all the features I want to explain. You can access it most easily by typing guix edit xmag. Here is what you will see:

(package
  (name "xmag")
  (version "1.0.6")
  (source
   (origin
     (method url-fetch)
     (uri (string-append
           "mirror://xorg/individual/app/" name "-" version ".tar.gz"))
     (sha256
      (base32
       "19bsg5ykal458d52v0rvdx49v54vwxwqg8q36fdcsv9p2j8yri87"))))
  (build-system gnu-build-system)
  (arguments
   `(#:configure-flags
     (list (string-append "--with-appdefaultdir="
                          %output ,%app-defaults-dir))))
  (inputs
   `(("libxaw" ,libxaw)))
  (native-inputs
   `(("pkg-config" ,pkg-config)))
  (home-page "https://www.x.org/wiki/")
  (synopsis "Display or capture a magnified part of a X11 screen")
  (description "Xmag displays and captures a magnified snapshot of a portion
of an X11 screen.")
  (license license:x11))

The package definition starts with the name and version information you expected. Next comes source, which says how to obtain the source code and from where. It also provides a hash that allows to check the integrity of the downloaded files. The next four items, build-system, arguments, inputs, and native-inputs supply the information required for building the package, which is what creates its outputs. The remaining items are documentation for human consumption, important for other reasons but not for reproducibility, so I won't say any more about them. (See this packaging tutorial if you want to define your own package.)

The example package definition has native-inputs in addition to "plain" inputs. There's a third variant, propagated-inputs, but xmag doesn't have any. The differences between these variants don't matter for my topic, so I will just refer to "inputs" from now on. Another omission I will make is the possibility to define several outputs for a package. This is done for particularly big packages, in order to reduce the footprint of installations, but for the purposes of reproducibility, it's OK to treat all outputs of a package a single unit.

The following figure illustrates how the various pieces of information from a package are used in the build process (done explicitly by guix build, or implicitly when installing or otherwise using a package): Diagram of a Guix package.

It may help to translate the Guix jargon to the vocabulary of C programming:

| Guix package | C program        |
|--------------+------------------|
| source code  | source code      |
| inputs       | libraries        |
| arguments    | compiler options |
| build system | compiler         |
| output       | executable       |

Building a package can be considered a generalization of compiling a program. We could in fact create a "GCC build system" for Guix that would simply run gcc. However, such a build system would be of little practical use, since most real-life software consists of more than just one C source code file, and requires additional pre- or post-processing steps. The gnu-build-system used in the example is based on tools such as make and autoconf, in addition to gcc.

Package exploration in Guile

Guile uses a record type called <package> to represent packages, which is defined in module (guix packages). There is also a module (gnu packages), which contains the actual package definitions - be careful not to confuse the two (as I always do). Here is a simple Guile script that shows some package information, much like the guix show command that I used earlier:

(use-modules (guix packages)
             (gnu packages)) 

(define gcc-toolchain
  (specification->package "gcc-toolchain"))

(format #t "Name   : ~a\n" (package-name gcc-toolchain))
(format #t "Version: ~a\n" (package-version gcc-toolchain))
(format #t "Inputs : ~a\n" (package-direct-inputs gcc-toolchain))
Name   : gcc-toolchain
Version: 9.2.0
Inputs : ((gcc #<package gcc@9.2.0 gnu/packages/gcc.scm:524 7fc2d76af160>) (ld-wrapper #<package ld-wrapper@0 gnu/packages/base.scm:505 7fc2d306f580>) (binutils #<package binutils@2.32 gnu/packages/commencement.scm:2187 7fc2d306fdc0>) (libc #<package glibc@2.29 gnu/packages/commencement.scm:2145 7fc2d306fe70>) (libc-debug #<package glibc@2.29 gnu/packages/commencement.scm:2145 7fc2d306fe70> debug) (libc-static #<package glibc@2.29 gnu/packages/commencement.scm:2145 7fc2d306fe70> static))

This script first calls specification->package to look up the package using the same rules as the guix command line interface: pick the latest available version if none is explicitly requested. Then it extracts various information about the package. Note that package-direct-inputs returns the combination of package-inputs, package-native-inputs, and package-propagated-inputs. As I said above, I don't care about the distinction here.

The inputs are not shown in a particularly nice form, so let's write two Guile functions to improve it:

(use-modules (guix packages)
             (gnu packages)
             (ice-9 match))

(define (package->specification package)
  (format #f "~a@~a"
          (package-name package)
          (package-version package)))

(define (input->specification input)
  (match input
    ((label (? package? package) . _)
     (package->specification package))
    (other-item
     (format #f "~a" other-item))))

(define gcc-toolchain
  (specification->package "gcc-toolchain"))

(format #t "Package: ~a\n"
        (package->specification gcc-toolchain))
(format #t "Inputs : ~a\n"
        (map input->specification (package-direct-inputs gcc-toolchain)))
Package: gcc-toolchain@9.2.0
Inputs : (gcc@9.2.0 ld-wrapper@0 binutils@2.32 glibc@2.29 glibc@2.29 glibc@2.29)

That looks much better. As you can see from the code, a list of inputs is a bit more than a list of packages. It is in fact a list of labelled package outputs. That also explains why we see glibc three times in the input list: glibc defines three distinct outputs, all of which are used in gcc-toolchain. For reproducibility, all we care about is the package references. Later on, we will deal with much longer input lists, so as a final cleanup step, let's show only unique package references from the list of inputs:

(use-modules (guix packages)
             (gnu packages)
             (srfi srfi-1)
             (ice-9 match))

(define (package->specification package)
  (format #f "~a@~a"
          (package-name package)
          (package-version package)))

(define (input->specification input)
  (match input
    ((label (? package? package) . _)
     (package->specification package))
    (other-item
     (format #f "~a" other-item))))

(define (unique-inputs inputs)
  (delete-duplicates
   (map input->specification inputs)))

(define gcc-toolchain
  (specification->package "gcc-toolchain"))

(format #t "Package: ~a\n"
        (package->specification gcc-toolchain))
(format #t "Inputs : ~a\n"
        (unique-inputs (package-direct-inputs gcc-toolchain)))
Package: gcc-toolchain@9.2.0
Inputs : (gcc@9.2.0 ld-wrapper@0 binutils@2.32 glibc@2.29)

Dependencies

You may have noticed the absence of the term "dependency" from the last two sections. There is a good reason for that: the term is used in somewhat different meanings, and that can create confusion. Guix jargon therefore avoids it.

The figure above shows three kinds of input to the build system: source, inputs, and arguments. These categories reflect the packagers' point of view: source is what the authors of the software supply, inputs are other packages, and arguments is what the packagers themselves add to the build procedure. It is important to understand that from a purely technical point of view, there is no fundamental difference between the three categories. You could, for example, define a package that contains C source code in the build system arguments, but leaves source empty. This would be inconvenient, and confusing for others, so I don't recommend you actually do this. The three categories are important, but for humans, not for computers. In fact, even the build system is not fundamentally distinct from its inputs. You could define a special-purpose build system for one package, and put all the source code in there. At the level of the CPU and the computer's memory, a build process (as in fact any computation) looks like Image of a computation. It is human interpretation that decomposes this into Code and data. and in a next step into Data, program, and environment. We can go on and divide the environment into operating system, development tools, and application software, for example, but the further we go in decomposing the input to a computation, the more arbitrary it gets.

From this point of view, a software's dependencies consist of everything required to run it in addition to its source code. For a Guix package, the dependencies are thus,

  • its inputs
  • the build system arguments
  • the build system itself
  • Guix (which is a piece of software as well)
  • the GNU/Linux operating system (kernel, file system, etc.).

In the following, I will not mention the last two items any more, because they are a common dependency of all Guix packages, but it's important not to forget about them. A change in Guix or in GNU/Linux can actually make a computation non-reproducible, although in practice that happens very rarely. Moreover, Guix is actually designed to run older versions of itself, as we will see later.

Build systems are (mostly) packages as well

I hope that by now you have a good idea of what a package is: a recipe for building outputs from source and inputs, with inputs being the outputs of other packages. The recipe involves a build system and arguments supplied to it. So... what exactly is a build system? I have introduced it as a generalization of a compiler, which describes its role. But where does a build system come from in Guix?

The ultimate answer is of course the source code. Build systems are pieces of Guile code that are part of Guix. But this Guile code is only a shallow layer orchestrating invocations of other software, such as gcc or make. And that software is defined by packages. So in the end, from a reproducibility point of view, we can replace the "build system" item in our list of dependenies by "a bundle of packages". In other words: more inputs.

Before Guix can build a package, it must gather all the required ingredients, and that includes replacing the build system by the packages it represents. The resulting list of ingredients is called a bag, and we can access it using a Guile script:

(use-modules (guix packages)
             (gnu packages)
             (srfi srfi-1)
             (ice-9 match))

(define (package->specification package)
  (format #f "~a@~a"
          (package-name package)
          (package-version package)))

(define (input->specification input)
  (match input
    ((label (? package? package) . _)
     (package->specification package))
    ((label (? origin? origin))
     (format #f "[source code from ~a]"
             (origin-uri origin)))
    (other-input
     (format #f "~a" other-input))))

(define (unique-inputs inputs)
  (delete-duplicates
   (map input->specification inputs)))

(define hello
  (specification->package "hello"))

(format #t "Package       : ~a\n"
        (package->specification hello))
(format #t "Package inputs: ~a\n"
        (unique-inputs (package-direct-inputs hello)))
(format #t "Build inputs  : ~a\n"
        (unique-inputs
         (bag-direct-inputs
          (package->bag hello))))
Package       : hello@2.10
Package inputs: ()
Build inputs  : ([source code from mirror://gnu/hello/hello-2.10.tar.gz] tar@1.32 gzip@1.10 bzip2@1.0.6 xz@5.2.4 file@5.33 diffutils@3.7 patch@2.7.6 findutils@4.6.0 gawk@5.0.1 sed@4.7 grep@3.3 coreutils@8.31 make@4.2.1 bash-minimal@5.0.7 ld-wrapper@0 binutils@2.32 gcc@7.4.0 glibc@2.29 glibc-utf8-locales@2.29)

I have used a different example, hello, because for gcc-toolchain, there is no difference between package inputs and build inputs (check for yourself if you want!) My new example, hello (a short demo program printing "Hello, world" in the language of the system installation), is interesting because it has no package inputs at all. All the build inputs except for the source code have thus been contributed by the build system.

If you compare this script to the previous one that printed only the package inputs, you will notice two major new features. In input->specification, there is an additional case for the source code reference. And in the last statement, package->bag constructs a bag from the package, before bag-direct-inputs is called to get that bag's input list.

Inputs are outputs

I have mentioned before that one package's inputs are other packages' outputs, but that fact deserves a more in-depth discussion because of its crucial importance for reproducibility. A package is a recipe for building outputs from source and inputs. Since these inputs are outputs, they must have been built as well. Package building is therefore a process consisting of multiple steps. An immediate consequence is that any computation making use of packaged software is a multi-step computation as well.

Remember the short C program computing π from the beginning of this post? Running that program is only the last step in a long series of computations. Before you can run pi, you must compile pi.c. That requires the package gcc-toolchain, which must first be built. And before it can be built, its inputs must be built. And so on. If you want the output of pi to be reproducible, the whole chain of computations must be reproducible, because each step can have an impact on the results produced by pi.

So... where does this chain start? Few people write machine code these days, so almost all software requires some compiler or interpreter. And that means that for every package, there are other packages that must be built first. The question of how to get this chain started is known as the bootstrapping problem. A rough summary of the solution is that the chain starts on somebody else's computer, which creates a bootstrap seed, an ideally small package that is downloaded in precompiled form. See this post by Jan Nieuwenhuizen for details of this procedure. The bootstrap seed is not the real start of the chain, but as long as we can retrieve an identical copy at a later time, that's good enough for reproducibility. In fact, the reason for requiring the bootstrap seed to be small is not reproducibility, but inspectability: it should be possible to audit the seed for bugs and malware, even in the absence of source code.

Reaching closure

Now we are finally ready for the ultimate step in dependency analysis: identifying all packages on which a computation depends, right up to the bootstrap seed. The starting point is the list of direct inputs of the bag derived from a package, which we looked at in the previous script. For each package in that list, we must apply this same procedure, recursively. We don't have to write this code ourselves, because the function package-closure in Guix does that job. These closures have nothing to do with closures in Lisp, and even less with the Clojure programming language. They are a case of what mathematicians call transitive closures: starting with a set of packages, you extend the set repeatedly by adding the inputs of the packages that are already in the set, until there is nothing more to add. If you have a basic knowledge of Scheme, you should now be able to understand implementation of this function. Let's add it to our dependency analysis code:

(use-modules (guix packages)
             (gnu packages)
             (srfi srfi-1)
             (ice-9 match))

(define (package->specification package)
  (format #f "~a@~a"
          (package-name package)
          (package-version package)))

(define (input->specification input)
  (match input
    ((label (? package? package) . _)
     (package->specification package))
    ((label (? origin? origin))
     (format #f "[source code from ~a]"
             (origin-uri origin)))
    (other-input
     (format #f "~a" other-input))))

(define (unique-inputs inputs)
  (delete-duplicates
   (map input->specification inputs)))

(define (length-and-list lists)
  (list (length lists) lists))

(define hello
  (specification->package "hello"))

(format #t "Package        : ~a\n"
        (package->specification hello))
(format #t "Package inputs : ~a\n"
        (length-and-list (unique-inputs (package-direct-inputs hello))))
(format #t "Build inputs   : ~a\n"
        (length-and-list
         (unique-inputs
          (bag-direct-inputs
           (package->bag hello)))))
(format #t "Package closure: ~a\n"
        (length-and-list
         (delete-duplicates
          (map package->specification
               (package-closure (list hello))))))
Package        : hello@2.10
Package inputs : (0 ())
Build inputs   : (20 ([source code from mirror://gnu/hello/hello-2.10.tar.gz] tar@1.32 gzip@1.10 bzip2@1.0.6 xz@5.2.4 file@5.33 diffutils@3.7 patch@2.7.6 findutils@4.6.0 gawk@5.0.1 sed@4.7 grep@3.3 coreutils@8.31 make@4.2.1 bash-minimal@5.0.7 ld-wrapper@0 binutils@2.32 gcc@7.4.0 glibc@2.29 glibc-utf8-locales@2.29))
Package closure: (84 (m4@1.4.18 libatomic-ops@7.6.10 gmp@6.1.2 libgc@7.6.12 libltdl@2.4.6 libunistring@0.9.10 libffi@3.2.1 pkg-config@0.29.2 guile@2.2.6 libsigsegv@2.12 lzip@1.21 ed@1.15 perl@5.30.0 guile-bootstrap@2.0 zlib@1.2.11 xz@5.2.4 ncurses@6.1-20190609 libxml2@2.9.9 attr@2.4.48 gettext-minimal@0.20.1 gcc-cross-boot0-wrapped@7.4.0 libstdc++@7.4.0 ld-wrapper-boot3@0 bootstrap-binaries@0 ld-wrapper-boot0@0 flex@2.6.4 glibc-intermediate@2.29 libstdc++-boot0@4.9.4 expat@2.2.7 gcc-mesboot1-wrapper@4.7.4 mesboot-headers@0.19 gcc-core-mesboot@2.95.3 bootstrap-mes@0 bootstrap-mescc-tools@0.5.2 tcc-boot0@0.9.26-6.c004e9a mes-boot@0.19 tcc-boot@0.9.27 make-mesboot0@3.80 gcc-mesboot0@2.95.3 binutils-mesboot0@2.20.1a make-mesboot@3.82 diffutils-mesboot@2.7 gcc-mesboot1@4.7.4 glibc-headers-mesboot@2.16.0 glibc-mesboot0@2.2.5 binutils-mesboot@2.20.1a linux-libre-headers@4.19.56 linux-libre-headers-bootstrap@0 gcc-mesboot@4.9.4 glibc-mesboot@2.16.0 gcc-cross-boot0@7.4.0 bash-static@5.0.7 gettext-boot0@0.19.8.1 python-minimal@3.5.7 perl-boot0@5.30.0 texinfo@6.6 bison@3.4.1 gzip@1.10 libcap@2.27 acl@2.2.53 glibc-utf8-locales@2.29 gcc-mesboot-wrapper@4.9.4 file-boot0@5.33 findutils-boot0@4.6.0 diffutils-boot0@3.7 make-boot0@4.2.1 binutils-cross-boot0@2.32 glibc@2.29 gcc@7.4.0 binutils@2.32 ld-wrapper@0 bash-minimal@5.0.7 make@4.2.1 coreutils@8.31 grep@3.3 sed@4.7 gawk@5.0.1 findutils@4.6.0 patch@2.7.6 diffutils@3.7 file@5.33 bzip2@1.0.6 tar@1.32 hello@2.10))

That's 84 packages, just for printing "Hello, world!". As promised, it includes the boostrap seed, called bootstrap-binaries. It may be more surprising to see Perl and Python in the dependency list of what is a pure C program. The explanation is that the build process of gcc and glibc contains Perl and Python code. Considering that both Perl and Python are written in C and use glibc, this hints at why bootstrapping is a hard problem!

Get ready for your own analyses

As promised, here is a Guile script that you can download and run from the command line to do dependency analyses much like the ones I have shown. Just give the packages whose combined list of dependencies you want to analyze. For example:

./show-dependencies.scm hello
Packages: 1
  hello@2.10
Package inputs: 0 packages
 
Build inputs: 20 packages
  [source code from mirror://gnu/hello/hello-2.10.tar.gz] bash-minimal@5.0.7 binutils@2.32 bzip2@1.0.6 coreutils@8.31 diffutils@3.7 file@5.33 findutils@4.6.0 gawk@5.0.1 gcc@7.4.0 glibc-utf8-locales@2.29 glibc@2.29 grep@3.3 gzip@1.10 ld-wrapper@0 make@4.2.1 patch@2.7.6 sed@4.7 tar@1.32 xz@5.2.4
Package closure: 84 packages
  acl@2.2.53 attr@2.4.48 bash-minimal@5.0.7 bash-static@5.0.7 binutils-cross-boot0@2.32 binutils-mesboot0@2.20.1a binutils-mesboot@2.20.1a binutils@2.32 bison@3.4.1 bootstrap-binaries@0 bootstrap-mes@0 bootstrap-mescc-tools@0.5.2 bzip2@1.0.6 coreutils@8.31 diffutils-boot0@3.7 diffutils-mesboot@2.7 diffutils@3.7 ed@1.15 expat@2.2.7 file-boot0@5.33 file@5.33 findutils-boot0@4.6.0 findutils@4.6.0 flex@2.6.4 gawk@5.0.1 gcc-core-mesboot@2.95.3 gcc-cross-boot0-wrapped@7.4.0 gcc-cross-boot0@7.4.0 gcc-mesboot-wrapper@4.9.4 gcc-mesboot0@2.95.3 gcc-mesboot1-wrapper@4.7.4 gcc-mesboot1@4.7.4 gcc-mesboot@4.9.4 gcc@7.4.0 gettext-boot0@0.19.8.1 gettext-minimal@0.20.1 glibc-headers-mesboot@2.16.0 glibc-intermediate@2.29 glibc-mesboot0@2.2.5 glibc-mesboot@2.16.0 glibc-utf8-locales@2.29 glibc@2.29 gmp@6.1.2 grep@3.3 guile-bootstrap@2.0 guile@2.2.6 gzip@1.10 hello@2.10 ld-wrapper-boot0@0 ld-wrapper-boot3@0 ld-wrapper@0 libatomic-ops@7.6.10 libcap@2.27 libffi@3.2.1 libgc@7.6.12 libltdl@2.4.6 libsigsegv@2.12 libstdc++-boot0@4.9.4 libstdc++@7.4.0 libunistring@0.9.10 libxml2@2.9.9 linux-libre-headers-bootstrap@0 linux-libre-headers@4.19.56 lzip@1.21 m4@1.4.18 make-boot0@4.2.1 make-mesboot0@3.80 make-mesboot@3.82 make@4.2.1 mes-boot@0.19 mesboot-headers@0.19 ncurses@6.1-20190609 patch@2.7.6 perl-boot0@5.30.0 perl@5.30.0 pkg-config@0.29.2 python-minimal@3.5.7 sed@4.7 tar@1.32 tcc-boot0@0.9.26-6.c004e9a tcc-boot@0.9.27 texinfo@6.6 xz@5.2.4 zlib@1.2.11

You can now easily experiment yourself, even if you are not at ease with Guile. For example, suppose you have a small Python script that plots some data using matplotlib. What are its dependencies? First you should check that it runs in a minimal environment:

guix environment --container --ad-hoc python python-matplotlib -- python my-script.py

Next, find its dependencies:

./show-dependencies.scm python python-matplotlib

I won't show the output here because it is rather long - the package closure contains 499 packages!

OK, but... what are the real dependencies?

I have explained dependencies along these lines in a few seminars. There's one question that someone in the audience is bound to ask: What do the results of a computation really depend on? The output of hello is "Hello, world!", no matter which version of gcc I use to compile it, and no matter which version of python was used in building glibc. The package closure is a worst-case estimate: it contains everything that can potentially influence the results, though most of it doesn't in practice. Unfortunately, there is no way to identify the dependencies that matter automatically, because answering that question in general (i.e. for arbitrary software) is equivalent to solving the halting problem.

Most package managers, such as Debian's apt or the multi-platform conda, take a different point of view. They define the dependencies of a program as all packages that need to be loaded into memory in order to run it. They thus exclude the software that is required to build the program and its run-time dependencies, but can then be discarded. Whereas Guix' definition errs on the safe side (its dependency list is often longer than necessary but never too short), the run-time-only definition is both too vast and too restrictive. Many run-time dependencies don't have an impact on most programs' results, but some build-time dependencies do.

One important case where build-time dependencies matter is floating-point computations. For historical reasons, they are surrounded by an aura of vagueness and imprecision, which goes back to its early days, when many details were poorly understood and implementations varied a lot. Today, all computers used for scientific computing respect the IEEE 754 standard that precisely defines how floating-point numbers are represented in memory and what the result of each arithmetic operation must be. Floating-point arithmetic is thus perfectly deterministic and even perfectly portable between machines, if expressed in terms of the operations defined by the standard. However, high-level languages such as C or Fortran do not allow programmers to do that. Its designers assume (probably correctly) that most programmers do not want to deal with the intricate details of rounding. Therefore they provide only a simplified interface to the arithmetic operations of IEEE 754, which incidentally also leaves more liberty for code optimization to compiler writers. The net result is that the complete specification of a program's results is its source code plus the compiler and the compilation options. You thus can get reproducible floating-point results if you include all compilation steps into the perimeter of your computation, at least for code running on a single processor. Parallel computing is a different story: it involves voluntarily giving up reproducibility in exchange for speed. Reproducibility then becomes a best-effort approach of limiting the collateral damage done by optimization through the clever design of algorithms.

Reproducing a reproducible computation

So far, I have explained the theory behind reproducible computations. The take-home message is that to be sure to get exactly the same results in the future, you have to use the exact same versions of all packages in the package closure of your immediate dependencies. I have also shown you how you can access that package closure. There is one missing piece: how do you actually run your program in the future, using the same environment?

The good news is that doing this is a lot simpler than understanding my lengthy explanations (which is why I leave this for the end!). The complex dependency graphs that I have analyzed up to here are encoded in the Guix source code, so all you need to re-create your environment is the exact same version of Guix! You get that version using

guix describe
Generation 15 Jan 06 2020 13:30:45    (current)
  guix 769b96b
    repository URL: https://git.savannah.gnu.org/git/guix.git
    branch: master
    commit: 769b96b62e8c09b078f73adc09fb860505920f8f

The critical information here is the unpleasantly looking string of hexadecimal digits after "commit". This is all it takes to uniquely identify a version of Guix. And to re-use it in the future, all you need is Guix' time machine:

guix time-machine --commit=769b96b62e8c09b078f73adc09fb860505920f8f -- environment --ad-hoc gcc-toolchain
Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
gcc pi.c -o pi
./pi
M_PI                         : 3.1415926536
4 * atan(1.)                 : 3.1415926536
Leibniz' formula (four terms): 2.8952380952

The time machine actually downloads the specified version of Guix and passes it the rest of the command line. You are running the same code again. Even bugs in Guix will be reproduced faithfully! As before, guix environment leaves us in a special-environment shell which needs to be terminated by Ctrl-D.

For many practical use cases, this technique is sufficient. But there are two variants you should know about for more complicated situations:

  • If you need an environment with many packages, you should use a manifest rather than list the packages on the command line. See the manual for details.

  • If you need packages from additional channels, i.e. packages that are not part of the official Guix distribution, you should store a complete channel description in a file using

guix describe -f channels > guix-version-for-reproduction.txt

and feed that file to the time machine:

guix time-machine --channels=guix-version-for-reproduction.txt -- environment --ad-hoc gcc-toolchain
Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
gcc pi.c -o pi
./pi
M_PI                         : 3.1415926536
4 * atan(1.)                 : 3.1415926536
Leibniz' formula (four terms): 2.8952380952

Last, if your colleagues do not use Guix yet, you can pack your reproducible software for use on other systems: as a tarball, or as a Docker or Singularity container image. For example:

guix pack            \
     -f docker       \
     -C none         \
     -S /bin=bin     \
     -S /lib=lib     \
     -S /share=share \
     -S /etc=etc     \
     gcc-toolchain
/gnu/store/iqn9yyvi8im18g7y9f064lw9s9knxp0w-docker-pack.tar

will produce a Docker container image, and with the knowledge of the Guix commit (or channel specification), you will be able in the future to reproduce this container bit-to-bit using guix time-machine.

And now... congratulations for having survived to the end of this long journey! May all your computations be reproducible, with Guix.

About GNU Guix

GNU Guix is a transactional package manager and an advanced distribution of the GNU system that respects user freedom. Guix can be used on top of any system running the kernel Linux, or it can be used as a standalone operating system distribution for i686, x86_64, ARMv7, and AArch64 machines.

In addition to standard package management features, Guix supports transactional upgrades and roll-backs, unprivileged package management, per-user profiles, and garbage collection. When used as a standalone GNU/Linux distribution, Guix offers a declarative, stateless approach to operating system configuration management. Guix is highly customizable and hackable through Guile programming interfaces and extensions to the Scheme language.

by Konrad Hinsen at January 14, 2020 04:30 PM

Joe Marshall

Palindromes, redux, and the Sufficiently Smart Compiler

The Sufficiently Smart Compiler is mentioned by authors as shorthand for “a compiler that performs nearly all reasonable optimizations, but in particular this one I want”. Many attempts were made up through the 80's and maybe into the 90's to write a Sufficiently Smart Compiler that would perform all “reasonable” optimizations, and although many impressive results have been obtained, there always seem to be fairly obvious optimizations that remain unoptimized. These days it seems that people realize that there will be good compilers and some very good compilers, but never a Sufficiently Smart Compiler. Nonetheless, it is worth considering a Sufficiently Smart Compiler as a tool for thought experiments.

I was curious what would be necessary for a Sufficiently Smart Compiler to generate optimal code for the palindrome problem given the naive algorithm.

The naive algorithm is inspired by the axioms
  • A zero or one element string is a palindrome.
  • If the first char matches the last char, and the middle is a palindrome, the result is a palindrome.
and gives us this:
(define (palindrome1? string)
  (or (< (string-length string) 2)
      (and (char=? (string-ref string 0)
                   (string-ref string (- (string-length string) 1)))
           (palindrome1? (substring string 1 (- (string-length string) 1))))))

The higher performing algorithm is inspired by the idea of keeping two pointers to each end of a string and comparing the characters at the pointers. If the characters are the same, you move the pointers inward and when they meet, you have seen a palindrome. If at any point the characters differ, you don't have a palindrome:
(define (palindrome2? string)
  (define (scan front-pointer rear-pointer)
    (or (>= front-pointer rear-pointer)
        (and (char=? (string-ref string front-pointer)
                     (string-ref string rear-pointer))
             (scan (+ front-pointer 1) (- rear-pointer 1))))
  (scan 0 (- (string-length string) 1)))
As you can see, these really aren't very different to start with. Both algorithms are iterative and both work their way in from the outside of the string. There are basically two differences. First, access to the rear of the string is either by a rear pointer, or by using the string-length of the string and subtracting 1. Second, the iterative call either uses substring or moves the pointers closer together.

First, let's assume that our processor has can reference through an indexed offset. This would mean we could point at the element one beyond the rear-pointer and not incur overhead. This isn't an unreasonable assumption for a CISC architecture such as an x86, but would probably cause 1 instruction overhead on a RISC architecture. So the second algorithm becomes this:
(define (palindrome2? string)
  (define (scan front-pointer rear-pointer)
    (or (< (- rear-pointer front-pointer) 2)
        (and (char=? (string-ref string front-pointer)
                     (string-ref string (- rear-pointer 1)))
             (scan (+ front-pointer 1) (- rear-pointer 1)))))
  (scan 0 (string-length string)))

Now this next assumption is a bit more of a stretch. The implementation of palindrome1? uses substring on each iteration and that's going to result in a lot of string copying. If our implementation used “slices” instead of copying the string, then there will be a lot less copying going on:
(define (palindrome1? string)
  (or (< (- (slice-end string) (slice-start string)) 2)
      (and (char=? (string-ref string (slice-start string))
                   (string-ref string (- (slice-end string) 1)))
           (palindrome1? 
             (substring string (+ (slice-start string) 1) (- (slice-end string) 1))))))

It is not uncommon for a compiler to introduce internal procedures for looping, so we can do that.
(define (palindrome1? string)
  (define (scan slice)
    (or (< (- (slice-end slice) (slice-start slice)) 2)
        (and (char=? (slice-ref slice (slice-start slice))
                     (slice-ref slice (- (slice-end slice) 1)))
             (scan (subslice slice (+ (slice-start slice) 1) (- (slice-end slice) 1))))))
  (scan (make-slice 0 (string-length string))))

We'll enter fantasy land again and let our compiler be smart enough to “spread” the slice data structure into the argument list of scan. This is no doubt asking too much from our compiler, but the information is available and it could in theory be done:
(define (palindrome1? string)
  (define (scan slice-start slice-end)
    (or (< (- slice-end slice-start) 2)
        (and (char=? (slice-ref string slice-start)
                     (slice-ref string (- slice-end 1)))
             (scan (+ slice-start 1) (- slice-end 1)))))
  (scan 0 (string-length string)))

And now we have palindrome2? (modulo renaming).

This doesn't really prove anything. But with a couple of somewhat unlikely compiler tricks, the naive version could be transformed to the more optimized version. It suggests that a it would be surprising but not a complete shock for an ambitious compiler writer to attempt.

I wish someone would write that Sufficiently Smart Compiler.

by Joe Marshall (noreply@blogger.com) at January 14, 2020 12:09 PM

Programming Praxis

Newbie Question

We have something different today.

Although most of my readers are accomplished, savvy programmers, (or at least those who are confident enough to comment) I will always include content designed for beginners; it’s part of the DNA of the blog, and sometimes brings in new readers. Thus, I spend a lot of time in beginner-programmer online forums. Today’s question at Reddit caught my eye:

Newbie question: What’s the difference between a “loop” and a “function”?

Sorry if this is answered somewhere but I am completely new to programming.

What is the difference between a function and a loop, when both of them can be used more than once?

The posting history of the person who asked that question indicates that he has recently been expelled from college (his GPA wasn’t great) and that

In the meantime I just started a computer programming course online. Was never really interested in it but everyone says it’s a skill you should have so I figured I’ll give it a shot. Plus it’s free so there’s that.

Your task is to answer the newbie question; if you answer on Reddit, please also leave your answer here, with a pointer to your Reddit posting. When you are finished, you are welcome to read my answer or discuss the exercise in the comments below.

by programmingpraxis at January 14, 2020 10:00 AM

January 13, 2020

GNU Guix

Join GNU Guix through Outreachy

We are happy to announce that for the fourth time GNU Guix offers a three-month internship through Outreachy, the inclusion program for groups traditionally underrepresented in free software and tech. We currently propose four subjects to work on:

  1. Implement netlink bindings for Guile.
  2. Improve internationalization support for the Guix Data Service.
  3. Add accessibility support for the Guix System Installer.
  4. Add monitoring support for the Guix daemon and Cuirass.

The initial applications for this round open on Jan. 20, 2020 at 4PM UTC and the initial application deadline is on Feb. 25, 2020 at 4PM UTC.

The final project list is announced on Feb. 25, 2020.

For further information, check out the timeline, information about the application process, and the eligibility rules.

If you’d like to contribute to computing freedom, Scheme, functional programming, or operating system development, now is a good time to join us. Let’s get in touch on the mailing lists and on the #guix channel on the Freenode IRC network, or come chat with us at FOSDEM!

Last year we had the pleasure to welcome Laura Lazzati as an Outreachy intern working on documentation video creation, which led to the videos you can now see on the home page.

About GNU Guix

GNU Guix is a transactional package manager and an advanced distribution of the GNU system that respects user freedom. Guix can be used on top of any system running the kernel Linux, or it can be used as a standalone operating system distribution for i686, x86_64, ARMv7, and AArch64 machines.

In addition to standard package management features, Guix supports transactional upgrades and roll-backs, unprivileged package management, per-user profiles, and garbage collection. When used as a standalone GNU/Linux distribution, Guix offers a declarative, stateless approach to operating system configuration management. Guix is highly customizable and hackable through Guile programming interfaces and extensions to the Scheme language.

by Gábor Boskovits at January 13, 2020 02:30 PM

Joe Marshall

Cons cells vs. Linked Lists

Cons cells and linked lists are the meat and potatoes of Lisp programming. Linked lists are the primary structure that everything operates on and cons cells are the Lego blocks they are made of. For an experienced Lisp programmer, cons cells just fade into the background. You know they are there as the glue holding everything together, but it is the linked list that you keep in mind. One could construct all sorts of weird trees, dags, and graphs out of cons cells, but in general you keep things in nice linear singly-linked lists terminated with a nice, full-stop NIL.

Cons cells are nearly the perfect concrete implementation of an abstract two-tuple. They are first-class objects: you can assign them to variables, stuff them in arrays, pass and return them as values, and check them for identity. They are orthogonal to other data types; only a cons-cell returns 't to consp. They are opaque — except for the defined operations of car and cdr, you cannot access the contents of a cons cell. And while they are usually implemented as adjacent memory locations, they hide their representation and there have been many Lisps that have used unusual concrete representations of cons cells like parallel arrays of the car and cdr parts or bit codes to omit the cdr altogether through “cdr coding”. All operations on cons cells can be reduced to the basic operations cons, consp, car, cdr, (setf car), and (setf cdr). (If we had immutable cons cells, we could even get rid of the last two, but then we'd want some other means for creating circular and semi-circular structure.*)

So I find it somewhat surprising that the standard linked list implementation in Lisp is a just a terrible example of an abstract data type. This no doubt happened because linked lists got standardized well before abstract data types were really understood.

The big problem with linked lists is that instead of being orthogonal to other data types, it is a subdomain of cons-cells. The representation of a singly linked list is completely exposed: it is a cons cell, without even a wrapper object to tell you if you are dealing with the list itself or its representation. It is only by common convention that certain cons cell structures are considered to represent linked lists. And it isn't immediately clear whether the representation is meant to be a pointer to the first pair of the list, or to the entire “spine” of the list. It is often treated both ways. There is little distinction between a list primitive and a cons cell primitive, which usually doesn't get you into trouble, except in those few cases where it can cause major confusion, like when you have to handle “improper” or “dotted” lists.

Lists are mutable because their representation is mutable and not hidden. It is possible to mutate the representation such that it no longer represents a list anymore, “magically” changing any list that includes the mutated structure into something else. This means either a lot of defensive copying must be done if lists are used as arguments or passed as values, or an unenforced convention to avoid mutation of list structure must be developed in the Lisp culture. We've been pretty good at the latter, even documenting when you can and when you cannot rely on lists being mutated by library functions, but there are always a few people who go against the grain for the sake of “efficiency” (or plain orneriness) and write code that is impossible to use because you cannot easily tell what might be mutated behind your back.

With any abstract data type, there are conceptually a pair of functions that are used to transport objects across the abstraction barrier. One, call it abs->rep, takes an abstract object and exposes its representation. It is usually provided automatically by the compiler and called upon entry to the object's methods. In Java, for example, it establishes bindings for the this pointer and the private and protected fields of the object so that the method can use them. The complimentary function, call it rep->abs takes the representation of an object and hides it in an opaque, abstract version for clients of the object to use. The clients have no way to manipulate the representation of the object because they only have access to opaque, abstract version. In Java, for example, the compiler automatically does this after object construction and when the this pointer is returned properly cast to the abstract data type. The this pointer and private and protected fields of the object go out of scope and are no longer accessible.

These functions are usually provided by the compiler and often have no real implementation. The compiler simply ensures that representation comes into scope when the method is called (conceptually calling abs->rep) and that the representation goes out of scope when the method returns (conceptually calling rep->abs). No actual code is generated or exists at run time. It's easy to forget this is happening because the compiler does all the work for you. You just toggle the little bit in your head about whether you are “inside” the object or “outside” the object. If you forget, you can just examine the lexical nesting to see if the representation is in scope.

In Lisp, however, for a singly linked list, not only are these functions omitted, they are completely fictitious. It is only in the programmers head that what was once considered a linked list is now to be considered a pointer to head cell of list (abs->rep) and only probably in the programmers head that the reverse (rep->abs) is happening on the way out. It doesn't matter much if he or she forgets this because the written code is the same either way. It only matters if he or she somewhere down the line uses a cons-cell operation where a list operation is actually what should be used. This can lead to common rookie mistakes like
  • Using cons where list is wanted, yielding (1 . 2) where (1 2) is desired. (The “unwanted dot” problem.)
  • Using list where cons is wanted, yielding (1 (2)) where (1 2) is desired. (The “too many parenthesis” problem.)
  • Confusion about whether ((1 2) 3 4) is meant to be a three-tuple of a list and two integers, or a two-tuple of two lists. (It's both, depending on the unwritten intent of the programmer.)
  • Using cons or list where append is wanted, yielding ((1 2) 3 4) or ((1 2) (3 4)) when (1 2 3 4) is desired. (Again, “too many parenthesis”.)
  • Use of (append ... (list <element>)) to “cons” to the “right end” of a list, leading to O(n2) algorithms rather than O(n).
Now don't get me wrong. I like Lisp and I like linked lists. And I'm not suggesting we avoid using them in favor of some other well-designed abstract data type. I just think they're an awful example of how to implement an abstract data type and perhaps that's why it is difficult for beginners to learn how to use them properly. It might also be worthwhile to implement a Lisp with proper (and immutable) abstract linked lists. It wouldn't make much difference to experienced programmers who are already used to applying the representation/abstraction interface in their heads, but it might make it easier for novices to manipulate linked list and cons cells (and keep them apart).

If you want to be completely contrary, consider Olin Shiver's suggestion: all objects — cons cells, strings, integers, null, etc. — are lists. It's just that every object other than a cons cell is a zero element dotted list. Now rather than being a subtype of cons cells, lists become a supertype of all objects. This viewpoint can probably be made coherent, but it does raise a lot of questions. Here are some that come to mind:
  • Is (length '(1 2 . 3)) the same as (length '(1 2 3))? If not, what is (length '(1 2 . 3))
  • Should lists retain their “dottedness” when passed through functions like memq or map? What is (memq 2 '(1 2 . 3))? What about (memq 3 '(1 2 . 3))?
  • What is (reverse '(1 2 . 3))? Is (compose reverse reverse) an identity?
This was extensively discussed on the SRFI-1 mailing list, so I won't rehash the discussion here. The questions I raised above, and many more, were raised and discussed. Eventually, it was decided that continuing to be backwards compatible was an important consideration. (Personally, I think the notion plays havoc with the group theoretic properties of lists, and that is enough to make it suspect.)

There is a good argument that “dotted” lists are rarely used and almost always a mistake, but they are built in to the grammar of Scheme as an indicator of “rest” arguments, so getting rid of them would require some other way to specify “rest” arguments. Racket takes things further by allowing doubly dotted lists to indicate infix notation: (a . < . b)

Just for kicks, I took things in the other direction and wrote some C# code that implements singly-linked lists as their own abstract data type using special, immutable cons cells that require that their CDR be either an existing singly-linked list or the empty list. “Dotted” lists are not a problem because you simply cannot construct one. The representation of a list is explicitly coded as a pointer to the head cons cell of the list. The code illustrates how the abstract list is turned into a the pointer to the cons cell when it is carried across the abstraction barrier and how it is turned back into an abstract list when carried back out. Again, I'm not suggesting anyone use the code, or take it as a serious proposal. (For one thing, it doesn't address what to do about circular lists, or the dotted lists in the Scheme grammar.) It was just a fun hack for illustrative purposes. It is available here for those interested.

*Many years back, Henry Baker said “C'mon, cons cells should just be immutable.” (if I am remembering the exact quote correctly). I agree with his sentiment. Combine immutable cons cells with “hash consing” and the appropriate equality primitives and you get directed acyclic graphs (and their space properties) “for free”. We'd either have to do without circular structure or use another means to achieve it. Since circular structure often leads to divergent programs I wouldn't consider it a great loss, but some may disagree. Perhaps they might be assuaged by a nice set of primitive procedures for creating and manipulating circular cons cell structure.

by Joe Marshall (noreply@blogger.com) at January 13, 2020 01:48 PM

January 12, 2020

Joe Marshall

Just for fun, transformations on lists

The mathematician in me likes to think about what happens to data and programs when you apply certain transformations to them. Here's a simple example. Imagine the function swap that simply makes a new cons cell by swapping the car and cdr of an existing cons cell:
(define (swap cell) (cons (cdr cell) (car cell)))

> (swap '(1 . 2))
(2 . 1)

> (swap (swap '(1 . 2)))
(1 . 2)
As we'd expect, two swaps are equivalent to no swaps at all. Indeed any even number of swaps are equivalent. Any odd number of swaps is equivalent to one swap.

What if we call swap on a list?
> (swap '(1 2 3))
((2 3) . 1)
That's odd looking. But swapping again returns it to normal
> (swap (swap '(1 2 3)))
(1 2 3)

But swap only swaps the top-level cell. Let's define deep-swap that descends into the car and cdr if possible:
(define (deep-swap cell)
  (cons (if (pair? (cdr cell)) (deep-swap (cdr cell)) (cdr cell))
        (if (pair? (car cell)) (deep-swap (car cell)) (car cell))))

> (deep-swap '((1 . 2) . (3 . 4)))
((4 . 3) 2 . 1)
Wait, what? Oh, the list printer is just interpreting the second cons cell as a part of a top-level list. We can see this by trying this:
> '((4 . 3) . (2 . 1))
((4 . 3) 2 . 1)
So we just have to be aware of list printer eliding the dots for us.

What if we call deep-swap on a list?
> (deep-swap '(1 2 3 4))
((((() . 4) . 3) . 2) . 1)
Fortunately, deep-swap, like swap, undoes itself.
> (deep-swap (deep-swap '(1 2 3 4)))
(1 2 3 4)
It's easy to see that swap and deep-swap should commute.
> (equal? (swap (deep-swap '((a . b) . (c . d))))
          (deep-swap (swap '((a . b) . (c . d)))))
#t
Alternatively, define compose
;; Simple composition of two functions
(define (compose2 outer inner)
  (lambda (x) (outer (inner x))))

;; Composition of arbitrary number of functions
(define (compose f . fs)
  (if (null? fs)
      f
      (compose2 f (apply compose fs))))

> (equal? ((compose swap deep-swap) '((a . b) . (c . d)))
          ((compose deep-swap swap) '((a . b) . (c . d))))
#t
So you can just move all the swaps together and all the deep-swaps together, then remove pairs of each one.

swap and deep-swap don't have very complex behavior, so let's turn to lists. We can define rotate-left as follows:
(define (rotate-left l) (append (cdr l) (list (car l))))

> (rotate-left '(1 2 3 4))
(2 3 4 1)

> (rotate-left (rotate-left '(1 2 3 4)))
(3 4 1 2)
(This is horribly inefficient, so this is just for fun, not production code.) Now what happens when we combine rotate-left with reverse?
> (reverse (rotate-left (reverse '(1 2 3 4))))
(4 1 2 3)

(define rotate-right (compose reverse rotate-left reverse))

> (rotate-right '(1 2 3 4))
(4 1 2 3)
rotate-left becomes rotate-right when used “under” reverse. Of course rotate-left doesn't commute with reverse: (reverse (reverse (rotate-left '(1 2 3 4)))) the reverses cancel each other and we're left with a rotate-left.

We can define “deep” versions of reverse, rotate-left, and rotate-right:
(define (deeply f)
  (lambda (l)
    (if (list? l)
        (f (map (deeply f) l))
        l)))

> ((deeply reverse) '((1 2 3) 4 5 (6 7 (8 9 10))))
(((10 9 8) 7 6) 5 4 (3 2 1))

> ((deeply rotate-left) '((1 2 3) 4 5 (6 7 (8 9 10))))
(4 5 (7 (9 10 8) 6) (2 3 1))

> ((deeply rotate-right) '((1 2 3) 4 5 (6 7 (8 9 10))))
(((10 8 9) 6 7) (3 1 2) 4 5)
Naturally, a (deeply rotate-left) will undo a (deeply rotate-right). You might suspect that the composition of (deeply reverse), (deeply rotate-left), and (deeply reverse) is equivalent to (deeply rotate-right), and you'd be right (I suspected as much, too, but it didn't seem so obvious, so I checked).

Notice that the deeper list structure has 3 elements each, but the topmost list structure has 4 elements, so 3 deep rotations is equivalent to one shallow rotation in the opposite direction, or (compose rotate-left (deeply rotate-left) (deeply rotate-left) (deeply rotate-left)) is an identity. In fact, the shallow rotate-left commutes freely with (compose (deeply-rotate left) (deeply rotate-left) (deeply rotate-left))
;; These are all equivalent identities
(compose rotate-left (deeply rotate-left) (deeply rotate-left) (deeply rotate-left))
(compose (deeply rotate-left) rotate-left (deeply rotate-left) (deeply rotate-left))
(compose (deeply rotate-left) (deeply rotate-left) rotate-left (deeply rotate-left))
(compose (deeply rotate-left) (deeply rotate-left) (deeply rotate-left) rotate-left)

Arbitrary application of these operators will scramble your list structure much like arbitrary rotations will scramble a Rubik's cube. The analogy is more than skin deep: group theory can be used to describe and analyze the combinatorics of both. Group theory tells us that operations of the form F-1GF are likely to be interesting. And indeed:
> ((compose rotate-right reverse rotate-left) '((1 2 3) 4 5 (6 7 (8 9 10))))
(4 (1 2 3) (6 7 (8 9 10)) 5)

(define involute (compose rotate-right reverse rotate-left))
swaps the outside elements with the inside ones. And if we compose a rotate-left with this, we find we've reversed only the last three elements in the list ((1 2 3) (6 7 (8 9 10)) 5 4).

Just using these operators, there seems to be no way to get to get to '((1 2 3) 5 4 (6 7 (8 9 10))) (at least I couldn't find one), so I defined another operator:
(define (call-on-tail f)
  (lambda (x)
    (cons (car x) (f (cdr x)))))
which leaves the head element alone while applying the transformation to the rest.
> ((compose rotate-left reverse (call-on-tail rotate-right) involute)
   '((1 2 3) 4 5 (6 7 (8 9 10))))
((1 2 3) 5 4 (6 7 (8 9 10)))

These functions can move elements up and down the list structure:
(define (leftmost c)
  (if (pair? c)
      (leftmost (car c))
      c))

(define (replace-leftmost c new-value)
  (if (pair? c)
      (cons (replace-leftmost (car c) new-value) (cdr c))
      new-value))

(define (rightmost c)
  (if (pair? c)
      (if (null? (cdr c))
          (rightmost (car c))
          (rightmost (cdr c)))
      c))

(define (replace-rightmost c new-value)
  (if (pair? c)
      (if (null? (cdr c))
          (cons (replace-rightmost (car c) new-value) (cdr c))
          (cons (car c) (replace-rightmost (cdr c) new-value)))
      new-value))

(define (swap-ends l)
  (replace-leftmost (replace-rightmost l (leftmost l)) (rightmost l)))

> (swap-ends '((1 2 3) 4 5 (6 7 (8 9 10))))
((10 2 3) 4 5 (6 7 (8 9 1)))

> ((compose involute swap-ends involute) '((1 2 3) 4 5 (6 7 (8 9 10))))
((1 2 3) 5 4 (6 7 (8 9 10)))

> ((deeply swap-ends) '((1 2 3) 4 5 (6 7 (8 9 10))))
((6 2 1) 4 5 (8 7 (10 9 3)))

> ((compose (deeply swap-ends)
            (deeply swap-ends)
            (deeply swap-ends)
            (deeply swap-ends)
            (deeply swap-ends)) '((1 2 3) 4 5 (6 7 (8 9 10))))
((1 2 3) 4 5 (6 7 (8 9 10)))

There's no real application for all this, except maybe to come up with some puzzles. It's just fun to noodle around with list transformations to see what you can come up with, and to practice your list manipulation skills. You really need a language with a REPL to play around like this. A parsimonious syntax like Lisp helps, too. It would have been a bit more difficult to fool around if I had to put the appropriate commas, curly braces, brackets, and semicolons in just right.

None of these operations work on circular lists, but you can imagine that rotations and reversals could work on fully circular lists, but I'm not sure how they'd make sense on lists with circular tails. It would be challenging to make them work, though. They also don't work on “dotted” lists — they throw an error when they run into the dotted item at the end of the list. But it is fairly easy to imagine how they might be made to work on a dotted list. It would be much less of a challenge to implement.

by Joe Marshall (noreply@blogger.com) at January 12, 2020 03:12 PM

Gendl / SBCL / Ubuntu / WSL / Windows and an experiment in live blogging

I'm helping David Cooper by trying to run a demo of his Gendl software. He's preparing a new release and I get to be the guinea pig. For fun, I'm live blogging as we go along just as an experiment.

I'm running SBCL 1.4.5debian under Ubuntu 18.04.3 under Windows Subsystem for Linux (WSL 1) under Windows 10 Home edition. I have to say that Ubuntu on WSL is remarkably stable and useful. It seems to act just like the Ubuntu I'm used to using and runs ELF executables without modification. The GNU tool chain just seems to work and I used apt-get install to fetch and install SBCL, which hasn't complained either. I've git cloned David's release and I'm now awaiting further instruction.

While waiting, I've installed slime and am running SBCL under Emacs 25.2.2. Quicklisp is used to install and start Gendl. This starts up a web server that provides the UI in another thread.

So far, this entire melange of software is working quite smoothly despite the oddball combination of the parts. Lisp has habit of tickling memory protection bugs and threading bugs. Unix isn't supposed to get along with Windows. Windows isn't known to get along with Unix very well, either, unless you isolate them from each other through a virtual machine. But WSL is doing its job acting as a go-between. I don't know the details of how it works, but I do know that you need to have some virtualization enabled, but it isn't a full virtual machine (Windows 10 Home edition doesn't support Hyper-V). In WSL, the Linux side can see the Windows disk as a mount point, but it doesn't seem that the Windows side can see the Linux disk. WSL gives you a bash shell in a window capable of running Emacs, or you can run an XWindows server like Xming under Windows if you want the full X experience. Performance seems reasonably snappy.



Well live blogging didn't work. It just felt rude to be typing and not paying full attention while someone was demonstrating some software that he had obviously worked very hard on. So I'll give a re-cap of what I understood from the demo. I'm no expert on CAD systems and most likely misunderstood important points, so take this as a novice's view. I asked David to help me correct this write-up. Not surprisingly, there's one important point I misunderstood, so I'll put in David's explanation.

Gendl is inspired by ICAD. Through use of CLOS and some macros, Gendl provides a DSL that allows you to design an object through a tree-like decomposition of its component pieces. Each component has properties, and these can be inherited by subcomponents (so, for example, when you paint a component a certain color, the color propagates down the component tree to the child components and they get painted, too).

(Here I messed up majorly.) In David's words
The technical term for the properties is “messages.” In many ways this is a message-passing object system, inspired by Flavors, which was inspired by SmallTalk (The ICAD System was built with Flavors, not CLOS).

Note there are two, orthogonal, concepts of "inheriting" going on. Normal class mixins provide one type of inheritance -- an object definition will inherit all the messages of its mixins. We usually call this "inheritance."

The passing of values from parent instance to child instance and other descendant instances in the instance tree is called “passing down” rather than inheritance, to avoid confusion with that other, different, inheritance. Messages can be passed down implicitly (by making them trickle-down-slots), or passed down explicitly by including them as keyword/value pairs in the object specification.

Another way to think of this is that the class (or “type”) of a given instance can inherit from multiple mixins, can contain multiple child object instances, but can have at most one Parent object in the instance tree (and it can have zero Parent objects if it's the root)

Also:

The mixin inheritance relationship is an “is a” relationship (essentially the same as normal CLOS inheritance).

The parent-child relationship is a “has a” relationship, and comes along with automatic demand-driven (i.e. lazy) dependency tracking.

The base component contains a co-ordinate system, so all components have a co-ordinate system relative to the root object. One primary module or application of Gendl is the web-based UI Geysr, that allows you to navigate the component tree, inspect components, change their properties, and visualize how they fit together. This module also provides an “exploded” view of the object if desired, where each subcomponent is rendered displaced a small distance from it's actual location. It can also render an exploded view of the assembly processs for the object.

The DSL provides a way of specifying a components location via constraint-like functional messages between components. This means that components can get resized when necessary, angles recomputed when necessary, and even the number of subassemblies recomputed dynamically when the object changes. In the “bench” example David showed me, the number of slats in the bench seat was recomputed dynamically from the seat width, the slat width, and constraints on the slat spacing. The user simply specified the perimeter of the seating area and Gendl figured out how many 2x4's you'd need to cover it. (Maybe I'm easily impressed, but I thought this was pretty neat.)

Another part of Gendl is the Process Planning module which computes a Manufacturing Bill of Processes and Raw Materials. This is much more interesting. Again using the DSL provided by CLOS, you define rules on how to build components from their constituent parts. Then Gendl, starting at the root node, infers a construction plan by recursive descent through the components and combining the rules. When it is done, you have the instructions for building the entire assembly. For the bench example, it started with purchase of sufficient 2x4s for the frame, seat, and back, then cutting the 2x4s to length, some at an angle for the reclining back, then fastening the cut pieces together, finally painting the result. The leaf nodes in the process planning tree represent the Bill of Materials for the object under construction. I thought this was pretty cool, too.

While I know nothing about CAD, I do know a little about computer languages, so I'll opine a little on the DSL. The basic operation in Gendl is define-object which is a macro that extends defclass. Objects have slots (fields) that hold values. One feature of Gendl is the ability to define “trickle down” slots whose values are available in the environment of subcomponents. Thus when you attach a new seat slat to the seat of the bench, it can use the trickle down value to match its paint color with that of the rest of the seat. This use of “environmental” properties isn't unique to Gendl, but it is worth note. David says, “Trickle-down slots are essentially a short-hand way of passing messages from a parent down to its descendants.

The recursive nature of Lisp matches well with the recursive decomposition of objects. The DSL makes it obvious to even the casual observer how the objects are composed. The value of the slots in the object can be defined as any Lisp expression, but drawing from the constraint-like language subset makes it obvious how the pieces relate to each other. You don't specify that the arm rests are 60 inches from the center (although you could), you specify that they butt up against the backrest. Given that constraint, you can move one subassembly and Gendl will automatically recompute the positions of the adjoining assemblies. While this is powerful, I suspect you need a fair amount of experience in setting up the constraints to do it so that it is useful.

Something I completely missed because it is natural to me and other Lisp hackers, but not to CAD designers, is the power of having an Emacs/Slime REPL in parallel with the web-based interface. The web-based interface gives you a natural point-and-click, drag-and-drop, UI with the instant visual feedback while at the same time you have the powerful DSL available in a REPL for experimenting directly with the object tree in a natural text-based way. (Sometimes people forget that text itself is highly abstract visual medium with thousands of years of development behind it.) In David's demo, he would move between the REPL and the UI extremely frequently, keeping both windows open at the same time, sometimes manipulating the object graphically, other times textually. In retrospect, this looks like a huge advantage, but at the time it seemed like an obvious way to use the system. I expect other jaded command-line hackers would have the same experience.

The rules for constructing components are written as CLOS methods that are specialized to the component they are constructing. It is rather obvious what rule applies for a given component. In addition it is obvious what to specialize on for constructing a component. The rule files are straightforward and reasonably terse.

Given the shortness of the demo (there was a lot to grasp), and my own inexperience, I don't think I could write an object description from scratch, but given an existing description I feel confident I could add a small subcomponent. I can see how this product would be useful in design and manufacturing and I think the learning curve doesn't look too steep if one is willing to start small before working up to something hugely complex.

Thanks to David for giving me the demo and for helping correct my blog post. I take credit for all the errors and misconceptions.

by Joe Marshall (noreply@blogger.com) at January 12, 2020 03:32 AM

January 10, 2020

Joe Marshall

Micro-services

All the cool kids are doing micro-services these days. They are either writing their new apps as a micro-service architecture, or are in the process of changing their legacy apps to micro-service architectures. The fact that everyone seems to be doing it surely means there are no drawbacks of any note, or that the advantages vastly outweigh them.

I see these advantages to a micro-service architecture:
  • Architectural impedance match — each architectural feature maps to a few obvious micro-services.  For example, if you are adding a feature that simply surfaces a data field to a user interface, you probably need to modify little more than the component that generates the UI.  If you are adding a richer service, you'll need to add a persistence component and a business logic component in addition to the UI component. But the reason there is such a good match between the architecture and the micro-services is simply because you draw circles around each architectural component in your design and declare it to be a micro-service. Basically, if it is worth drawing as a separate component, you consider it worth implementing it as a micro-service.

    This impedance match works well with Agile/SCRUM planning. At the beginning of each sprint, it is fairly easy to assign resources to handle the changes necessary to the existing components or to write new components. Each service is usually tiny enough that it doesn't exceed the capacity of a single team. If you can assign more than one team to a component, then the component is too large and needs to be broken into more micro-services. For a more complex feature, it is possible to plan to bring the micro-service online over a period of more than one sprint without adding too much to the risk of the component.
  • Deployment impedance match — each architectural feature maps to only a handful of obvious micro-services, each one usually implemented by a stand-alone "plug-in" program.  The micro-services are usually run in a separate container where they can crash and be restarted with little ill effect on the program at large - perhaps some functionality is temporarily lost while the micro-service is restarted, or maybe a hot backup is ready to take over.  The containers are typically something like "docker" and are monitored with "kubernetes" to ensure they are up and running.  Structuring the program as a set of micro-services facilitates this kind of of monitoring and restarting. This works well when you can distribute your micro-services over a fleet of virtual and physical machines.
  • Possible bug reduction — From experience, it seems that two 500 line programs have fewer bugs than a single 1000 line program. In theory, this would scale and a program made of dozens of hundred-line subprograms would have far fewer bugs than if that same program were written as a single subprogram with several thousands of lines of code. In practice, however, I think that it isn't so simple.
    Two bug-free programs might have emergent buggy behavior when they try to co-ordinate their behavior. Emergent bugs are very hard to find and fix.
  • Robustness — it is hard to completely knock down a micro-service architecture. No matter how many subprocesses you manage to disable, at least a few will still remain running. If we are satisfied with occasional missing pieces of our program, this is robustness of a sort.
  • Dependency injection — Back in the 80's, we'd limit the spread of complexity by structuring larger programs to only know about those smaller subprograms that they depended directly upon. This was eventually given a name and declared “a good thing”. You get dependency injection almost for free in a micro-services architecture if not least because the services won't even know about each other unless you tell them. You are virtually forced to follow a reasonable design principle because you have to do something to get the services to know about each other, and this minimal effort also happens to minimize unwanted information sharing.

It is a time-honored tradition in computer programming to take any idea that offers some advantage, give it a name, elevate it to a “principle” and use it to the exclusion of any prior good idea. “Micro-service architecture” is the latest fad, and it has so rapidly replaced “object-oriented programming” that you can often find both in the same job description. This is almost a contradiction because a micro-service architecture often cuts straight through the objects manipulated by the program. What is conceptually a single object might be represented in several completely different ways in various micro-services in the system. It might be reconstituted from a relational mapping by way of an ORM, serialized into JSON format for RPCs, and returned as part of a nested GraphQL query before having its guts evacuated and spread out through the Document Object Model for display. The identity of the object through the system is completely lost.

Despite the plethora of wonderful ideas that come from a micro-service architecture, there seem to be some downsides.
  • Ubiquitous process barriers — A process barrier is thrown up between every functional component of a program. This helps limit the effect of a crash, but greatly hinders cooperation and synchronization. Error handling and try/finally clauses don't easily work across process boundaries, and what before might be handled by a critical section of code now might require a distributed locking protocol.
  • Ubiquitous RPCs — all communication between the micro-services have to take place between the tin-cans and strings we call remote procedure calls. RPCs, besides being slow, introduce failure modes that normal procedure calls don't have. And if the programmer doesn't think hard about these new failure modes, the RPC will be far less robust than its direct counterpart.
  • Anti-object oriented — all objects in the problem domain of a micro-service architecture must be simple enough to marshal into JSON objects so they can be passed as arguments and returned as values. As noted in a previous essay, the more complex the object, the harderit is to marshal across a procedure boundary, and if fields are hard to marshal, methods are virtually impossible. There is a strong incentive to keep "objects" in JSON format as they are passed through the system and only parse the JSON strings as they are needed by the lowest layer. This is the antithesis of object-oriented design which is to try to make objects as similar to their real-world counterparts as possible and encapsulate behavior (methods) along with the data.
  • Non robustness — while it is hard to knock over an entire micro-architecture application, it can be all too easy to knock over parts of it. A program that is 90% functional 100% of the time is also 10% broken 100% of the time. If that 10% is an often used part of the program, it can appear very non robust. So what if the header and footer of each page (each its own micro-service, naturally) is up nearly 100% of the time when the body of each page is missing half its data. A colleague of mine noted that a certain micro-architecture he was working on seemed very fragile for something that is supposed to be so robust.
  • Verbosity — the frameworks used to support a micro-service architecture makes plain old Java look downright terse. Objects and methods have dozens of annotations describing how to marshal the objects and hook up the RPCs. Some frameworks make use of the redundancy of Java to aid in automatically constructing a micro-architecture system, but these often rely on reflection and automated code generation that introduces its own set of problems. And Java's habit of putting every class in a separate file means you often have to follow a tortuous trail through the code to find the path from the request coming in to the method that actually handles the logic that implements the request.

Despite these drawbacks, I expect to see a lot more micro-architectures in the future because they match up so well with “factory style” software production. It really makes it easier to manage teams of developers. I expect enthusiasm to be curbed a bit when it is noticed that rewriting a program as a set of micro-services doesn't really result in a dramatic improvement in the reliability of the program or a dramatic reduction in the complexity of a large body of code.

by Joe Marshall (noreply@blogger.com) at January 10, 2020 02:35 PM

GNU Guix

Meet Guix at FOSDEM

As usual, GNU Guix will be present at FOSDEM on February 1st and 2nd. This year, we’re happy to say that there will be quite a few talks about Guix and related projects!

The Minimalistic, Experimental and Emerging Languages devroom will also feature talks about about Racket, Lua, Crystal, Nim, and Pharo that you should not miss under any circumstances!

Guix Days logo.

For the third time, we are also organizing the Guix Days as a FOSDEM fringe event, a two-day Guix workshop where contributors and enthusiasts will meet. The workshop takes place on Thursday Jan. 30st and Friday Jan. 31st at the Institute of Cultural Affairs (ICAB) in Brussels.

Again this year there will be few talks; instead, the event will consist primarily of “unconference-style” sessions focused on specific hot topics about Guix, the Shepherd, continuous integration, and related tools and workflows.

Attendance to the workshop is free and open to everyone, though you are invited to register (there are only a few seats left!). Check out the workshop’s wiki page for registration and practical info. Hope to see you in Brussels!

About GNU Guix

GNU Guix is a transactional package manager and an advanced distribution of the GNU system that respects user freedom. Guix can be used on top of any system running the kernel Linux, or it can be used as a standalone operating system distribution for i686, x86_64, ARMv7, and AArch64 machines.

In addition to standard package management features, Guix supports transactional upgrades and roll-backs, unprivileged package management, per-user profiles, and garbage collection. When used as a standalone GNU/Linux distribution, Guix offers a declarative, stateless approach to operating system configuration management. Guix is highly customizable and hackable through Guile programming interfaces and extensions to the Scheme language.

by Manolis Ragkousis at January 10, 2020 02:30 PM

Programming Praxis

Consecutive Array Search

Given an array of distinct integers and a target integer, determine if two adjacent integers in the array sum to the target. Solve the problem twice, once assuming the array is unsorted and once assuming the array is sorted.

Your task is to solve the two array search problems described above. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

by programmingpraxis at January 10, 2020 10:00 AM