Data Science with Nix: Parameter Sweeps
Parameter sweeping is a technique often used in scientific computing and HPC settings. In the mainstream software industry the same concept is called a build matrix.
The idea is that you have a task you want to perform with varying input parameters. If the task takes multiple parameters, and you’d like to try it out with multiple values for each parameter, it is easy to end up with a combinatorial explosion.
This blog post gives a practical demonstration showing how Nix is a perfect companion for managing parameter sweeps and build matrices, and how nixbuild.net can be used to supercharge your workflow.
My hope is that this text can interest both readers who don’t know anything about Nix and experienced Nix users.
Use Cases
In scientific computing, it is common to run simulations of physical processes. The list of things simulated is endless: weather forecasting, molecular dynamics, celestial movements, FEM analysis, particle physics etc. A simulation is usually implemented directly as a computer program or as a description for a higher level simulation framework. A simulation generally has a set of input parameters that can be defined. These parameters can describe initial states, environmental aspects or tweak the behavior of the simulation algorithm itself. Scientists are interested in comparing simulation results for a range of different parameter values, and the process of doing so is referred to as a parameter sweep.
Parameter sweeping is often built into simulation frameworks. For simulations implemented directly as specialized programs, scientists will simply run the program over and over again with different parameters, collecting and comparing the results. When supercomputers are used for running the simulations, the job scheduler usually has some support for launching multiple simulation instances with varying parameters.
In the software industry, the term build matrix means basically the same thing as parameter sweeping. Build matrices are commonly used to build different variants of the same deliverable. In the simplest case, a programmer builds and packages a program for a set of different targets (Windows, MacOS, Android etc). But more complex build matrices with (much) higher dimensionality are of course also used.
Benchmarking is another area where build matrices are utilized, and combinatorial explosions are common. The demo below will show how a compression benchmark can be implemented with Nix and nixbuild.net.
Embarrassing Parallelism
Parameter sweeping can be classified as an embarrassingly parallel problem. As long as we have enough CPUs, all simulations or builds can be executed in parallel, since there are (usually) no dependencies between them. This is a perfect workload for nixbuild.net, which is built to be very scalable.
At the same time, it is also easy to get into trouble managing all possible combinations of parameter values. Adding new parameters or parameter values can increase the number of runs exponentially, and the work of managing the runs and their results becomes overwhelming. The next section will show how Nix can help out with this.
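To make the growth concrete: every new parameter multiplies the total number of runs by its number of values. A quick shell sketch, using the value counts (3, 4 and 2) that appear in the example later in this post:

```shell
# Each parameter multiplies the run count: parameters with 3, 4 and 2
# values give 24 runs; adding one more 5-valued parameter gives 120.
runs=1
for values in 3 4 2; do
  runs=$((runs * values))
done
echo "$runs"          # 24 runs
echo "$((runs * 5))"  # 120 runs with one extra parameter
```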
How Nix Helps
One of the aspects of Nix that I find most empowering is that it helps you with the boring and stress-inducing task of managing files. Let me see if I can explain.
Assume you have a program called simulation that takes a parameter file as its only argument. The parameter file contains simulation parameters in a simple INI-format like this:
param1=3
param2=92
param3=0
The program outputs a simulation result on its standard output, in CSV format.
Now we want to run the simulation for some different combinations of input parameters. We could manually author the needed parameter files, or we can write a simple shell script for it:
round=1
for param1 in 3 4 5; do
  for param2 in 10 33 50 92; do
    for param3 in 0 1; do
      echo "param1=$param1" >> "round$round.ini"
      echo "param2=$param2" >> "round$round.ini"
      echo "param3=$param3" >> "round$round.ini"
      round=$((round+1))
    done
  done
done
We now have 24 different parameter files with all possible combinations of the values we are interested in:
$ ls -v
round1.ini round6.ini round11.ini round16.ini round21.ini
round2.ini round7.ini round12.ini round17.ini round22.ini
round3.ini round8.ini round13.ini round18.ini round23.ini
round4.ini round9.ini round14.ini round19.ini round24.ini
round5.ini round10.ini round15.ini round20.ini
$ cat round20.ini
param1=5
param2=33
param3=1
To run the simulations, we simply loop through all parameter files:
for round in round*.ini; do
  simulation "$round" >> results.csv
done
So far, so good. At this point, we might want to tweak our parameter generation a bit. Maybe there are more parameter values we want to explore, different sweeps to do. So we change our parameter generator script and re-run it a few times. We then realise we want to make changes to our simulation program itself. So we do that, and recompile it. Now we want to re-run all the different parameter sweeps we’ve done. Luckily, we saved all the different versions of our parameter generation script, so we can just run the updated simulator on each of the previously generated parameter sets.
At the end of our productive simulation session we have the following mess (with all intermediate parameter files removed):
gen-params.sh results-1.csv results-12.csv simulation-3
gen-params-1.sh results-2.csv results-13.csv simulation-O2-1
gen-params-2.sh results-3.csv results-14.csv simulation-O2-2
gen-params-3.sh results-4.csv results-15.csv simulation-O3
gen-params-3v2.sh results-5.csv results-16.csv simulation-debug
gen-params-4.sh results-6.csv results-17.csv simulation-debug2
gen-params-test.sh results-7.csv simulation simulation-wrong
gen-params-test-2.sh results-8.csv simulation-1
results.csv results-10.csv simulation-2
Admittedly, this is how things usually end up for me when I’m doing any kind of “exploratory” work. I’m sure you all are much more organized. To my rescue comes Nix. It allows me to stop caring entirely about generated files, and only care about how stuff is generated. Additionally, it gives me tools to abstract, parameterize and reuse generators.
Let’s make an attempt at recreating our workflow above with Nix:
{ pkgs ? import <nixpkgs> {} }:

let
  inherit (pkgs) lib callPackage runCommand writeText;

  # Compiles the 'simulation' program, and allows us to provide
  # build-time arguments
  simulation = callPackage ./simulation.nix;

  # Executes the given simulation program with the given parameters
  runSimulation = buildArgs: parameters:
    runCommand "result.csv" {
      buildInputs = [ (simulation buildArgs) ];
      parametersFile = writeText "params.ini" (
        lib.generators.toKeyValue {} parameters
      );
    } ''
      simulation $parametersFile > $out
    '';

  # Merges multiple CSV files into a single one
  mergeResults = results: runCommand "results.csv" {
    inherit results;
  } ''
    cat $results > $out
  '';

in {
  sim_O3_std_sweep = mergeResults (
    lib.forEach (lib.cartesianProductOfSets {
      param1 = [3 4 5];
      param2 = [10 33 50 92];
      param3 = [0 1];
    }) (
      runSimulation {
        optimizationLevel = 3;
      }
    )
  );

  sim_O2_small_sweep = mergeResults (
    lib.forEach (lib.cartesianProductOfSets {
      param1 = [1 3];
      param2 = [20 60 92];
      param3 = [0 1];
    }) (
      runSimulation {
        optimizationLevel = 2;
      }
    )
  );
}
Perhaps the key function above is cartesianProductOfSets, from the library functions in nixpkgs. Given a list of possible values for each parameter, it creates all possible combinations of input parameters. Our build function is then mapped over all these combinations using forEach.
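For readers new to Nix, the effect of cartesianProductOfSets can be mimicked with nested shell loops; for the sim_O3_std_sweep parameters it produces one attribute set per combination, 24 in total:

```shell
# Shell sketch of what cartesianProductOfSets yields for the
# sim_O3_std_sweep parameters: one combination per line.
count=0
for param1 in 3 4 5; do
  for param2 in 10 33 50 92; do
    for param3 in 0 1; do
      echo "{ param1 = $param1; param2 = $param2; param3 = $param3; }"
      count=$((count + 1))
    done
  done
done
echo "combinations: $count"
```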
We can build one of our parameter sweeps like this:
nix-build -A sim_O3_std_sweep
When all 24 simulation runs are done, Nix will create a single result symlink in the current directory, pointing to a results.csv file containing all simulation results. We can add new sweeps to our Nix file and re-run the build at any time. We never have to care about any generated files, since everything needed to re-generate results exists in the Nix file. The Nix file itself can be version-controlled like any source file.
In addition to the demonstrated ability to parameterize builds, Nix provides us with two more things, for free.
No unnecessary re-builds
In the example above, the sim_O3_std_sweep and sim_O2_small_sweep builds have some overlapping parameter sets. If you build both, Nix will only run the overlapping simulations once, and use the same result.csv files to create the two different results.csv files. This happens without any extra effort from the user. The same is true if you make changes that only affect part of your build. Nix also has support for external caches, which makes it easy to share and reuse build results between computers (or you can simply use nixbuild.net to get build sharing without any extra configuration).
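This reuse can be thought of as memoization keyed on a hash of all build inputs. A rough shell analogy (run_once and the cache layout are made up for illustration; Nix’s real store keys on the full dependency graph, not just a parameter string):

```shell
# Toy version of Nix's build reuse: key each run by a hash of its
# inputs and skip the work when the output already exists.
run_once() {
  key=$(printf '%s' "$1" | sha256sum | cut -c1-16)
  out="cache/$key.csv"
  if [ ! -e "$out" ]; then
    mkdir -p cache
    printf 'result-for:%s\n' "$1" > "$out"  # stand-in for the simulation
  fi
  cat "$out"
}
run_once "param1=3 param2=92 param3=0"
run_once "param1=3 param2=92 param3=0"  # second call reuses the cached file
```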
Automatic parallelization
When Nix evaluates an expression, it constructs a build graph that tracks all dependencies between builds. In the example above, the results.csv file depends on the list of result.csv files, which in turn depend on specific builds of simulation. All of these dependencies are implicit; you don’t have to do anything other than simply refer to the things you need to perform your build.
The build graph allows Nix to execute the actual builds with maximum parallelization. However, running as many builds as possible in parallel is often not optimal if your compute resources (usually: your local computer) are limited. Nix has a simplistic build scheduler, which is just a user-configurable limit on the maximum number of concurrent builds. This works in many cases, but quickly becomes suboptimal when you have lots of builds that could run in parallel, or when the builds themselves have varying compute requirements.
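A shell analogue of Nix’s concurrency cap is the -P flag of xargs, which limits how many processes run at once (echo stands in for a real simulation binary here, and sort only makes the output order deterministic):

```shell
# Process parameter files at most 8 at a time; with real Nix builds
# the cap would instead be the max-jobs setting.
printf 'round%s.ini\n' 1 2 3 | xargs -P 8 -n 1 echo | sort
```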
This is where nixbuild.net can step in. It is able to run an “infinite” number of concurrent Nix builds for you, while keeping all builds perfectly isolated from each other (security- and resource-wise). It also selects compute resources intelligently for each individual build.
From the perspective of scientific computing, you can say that Nix provides a generic framework for parallel workloads, and nixbuild.net acts somewhat as a supercomputer, minus the effort of writing submission scripts and explicitly managing compute results.
Demo: Compression Benchmark
I’m now going to show you an example that is similar to the example in the previous section, but instead of an imaginary simulation we will run an actual benchmark this time. The benchmark will compare the compression ratio of a number of different lossless compression implementations. Since this article is about parameter sweeping, we will vary the following parameters during the benchmark:
Compression implementation:
brotli
,bzip2
,gzip
,lz4
,xz
andzstd
.Two different versions of each compression implementation. We’ll use the versions packaged in nixpkgs 16.03 and 20.09, respectively.
Compression level: 1-9.
Corpus type: text, binaries and jpeg files.
Corpus size: small, medium and large.
We’ll try out the Cartesian product of the above parameters, resulting in 972 different builds. There is no particular thought behind the parameter selection; the values are just picked to demonstrate the abilities of Nix and nixbuild.net. If you were to design a proper benchmark you’d likely come up with different parameters, but the concept would be the same.
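The 972 figure is simply the product of the dimension sizes:

```shell
# 6 programs x 2 releases x 9 levels x 3 corpus types x 3 corpus sizes
echo "$((6 * 2 * 9 * 3 * 3))"  # 972
```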
Here is the complete Nix expression implementing the benchmark outlined above. The expression is parameterized over package sets from different releases of nixpkgs. There are different ways of actually importing those package sets, but that is beyond the scope of this example.
{ pkgs, pkgs_2009, pkgs_1603 }:

let
  inherit (pkgs)
    stdenv fetchurl lib writers runCommand unzip gnutar
    referencesByPopularity uclibc hello zig;

  compressionCommand = pkgs: program: level: {
    brotli = writers.writeBash "brotli-compress" ''
      if [ -x ${pkgs.brotli}/bin/brotli ]; then
        ${pkgs.brotli}/bin/brotli --stdout -${toString level}
      else
        ${pkgs.brotli}/bin/bro --quality ${toString level}
      fi
    '';
    bzip2 = "${pkgs.bzip2}/bin/bzip2 --stdout -${toString level}";
    gzip = "${pkgs.gzip}/bin/gzip --stdout -${toString level}";
    lz4 = "${pkgs.lz4}/bin/lz4 --stdout -${toString level}";
    xz = "${pkgs.xz}/bin/xz --stdout -${toString level}";
    zstd = "${pkgs.zstd}/bin/zstd --stdout -${toString level}";
  }.${program};

  corpus = rec {
    txt.small = calgary-text.small;
    txt.medium = calgary-text;
    txt.large = runCommand "enwik8" {
      buildInputs = [ unzip ];
      src = fetchurl {
        url = "http://mattmahoney.net/dc/enwik8.zip";
        sha256 = "1g1l4n9x8crxghapq956j7i4z89qkycm5ml0hcld3ghfk3cr8yal";
      };
    } ''
      unzip "$src"
      mv enwik8 "$out"
    '';

    pkg.small = closure-tar "uclibc-closure.tar" uclibc;
    pkg.medium = closure-tar "hello-closure.tar" hello;
    pkg.large = closure-tar "zig-closure.tar" zig;

    jpg.small = fetchurl {
      url = "https://people.sc.fsu.edu/~jburkardt/data/jpg/charlie.jpg";
      sha256 = "0cmd8wwm0vaqxsbvb3lxk2f7w2lliz8p361s6pg4nw0vzya6lzrg";
    };
    jpg.medium = fetchurl {
      url = "https://cdn.hasselblad.com/samples/x1d-II-50c/x1d-II-sample-02.jpg";
      sha256 = "15pz84f5d34jmp0ljz61wx3inx8442sgf9n8adbgb8m4v88vifk2";
    };
    jpg.large = fetchurl {
      url = "https://cdn.hasselblad.com/samples/Cam_1_Borna_AOS-H5.jpg";
      sha256 = "0rdcxlxcxanlgfnlxs9ffd3s36a05g8g3ca9khkfsgbyd5spk343";
    };

    calgary-text = stdenv.mkDerivation {
      name = "calgary-corpus-text";
      src = fetchurl {
        url = "http://corpus.canterbury.ac.nz/resources/calgary.tar.gz";
        sha256 = "1dwk417ql549l0sa4jzqab67ffmyli4nmgaq7i9ywp4wq6yyw2g1";
      };
      sourceRoot = ".";
      outputs = [ "out" "small" ];
      installPhase = ''
        cat bib book2 news paper* prog* > "$out"
        cat paper1 > "$small"
      '';
    };

    closure-tar = name: pkg: runCommand name {
      buildInputs = [ gnutar ];
      closure = referencesByPopularity pkg;
    } ''
      tar -c --files-from="$closure" > "$out"
    '';
  };

  benchmark = { release, program, level, corpusType, corpusSize }:
    runCommand (lib.concatStringsSep "-" [
      "zbench" program "l${toString level}" corpusType corpusSize release.rel
    ]) rec {
      corpusFile = corpus.${corpusType}.${corpusSize};
      run = compressionCommand release.pkgs program level;
      version = lib.getVersion release.pkgs.${program};
      tags = lib.concatStringsSep "," [
        program version (toString level) corpusType corpusSize
      ];
    } ''
      orig_size="$(stat -c %s "$corpusFile")"
      result_size="$($run < "$corpusFile" | wc -c)"
      percent="$((100 * result_size / orig_size))"
      echo >"$out" "$tags,$orig_size,$result_size,$percent"
    '';

in runCommand "compression-benchmarks" {
  results = map benchmark (lib.cartesianProductOfSets {
    program = [
      "brotli"
      "bzip2"
      "gzip"
      "lz4"
      "xz"
      "zstd"
    ];
    release = [
      { pkgs = pkgs_1603; rel = "1603"; }
      { pkgs = pkgs_2009; rel = "2009"; }
    ];
    level = lib.range 1 9;
    corpusType = [ "txt" "pkg" "jpg" ];
    corpusSize = [ "small" "medium" "large" ];
  });
} ''
  echo program,version,level,corpus,class,orig_size,result_size,ratio > $out
  cat $results >> $out
''
Above, compressionCommand defines the command used for each compression program to compress stdin to stdout with a given level.
The corpus attribute set defines txt, pkg and jpg datasets. For text and jpeg we simply fetch suitable sets, and for the binary (pkg) sets we use Nix itself to create a tar file out of the transitive closure of some different packages. The corpus sizes vary between around 50 kB and 300 MB.
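The closure-tar helper boils down to tar’s --files-from option: archive every path listed in a file. A runnable stand-in, where two throwaway files replace the package closure that referencesByPopularity computes:

```shell
# Archive every path listed in closure.txt, just as closure-tar does
# with the computed Nix closure file.
dir=$(mktemp -d)
cd "$dir"
echo hello > a.txt
echo world > b.txt
printf 'a.txt\nb.txt\n' > closure.txt
tar -c --files-from=closure.txt > closure.tar
tar -tf closure.tar  # lists a.txt and b.txt
```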
benchmark runs a single compression command for one combination of input parameters.
Finally, we again use the cartesianProductOfSets function to create builds of all possible combinations of parameters, and then simply concatenate all individual results into a big CSV file.
Building the complete benchmark takes about 25 minutes on my somewhat old 8-core workstation, with Nix configured to run at most 8 builds concurrently. If I use nixbuild.net instead, time is cut down to 10 minutes due to the parallelization gains possible when running 972 independent Nix builds.
In the end we get a CSV-file with values for each parameter combination. The first ten lines of the file look like this:
program,version,level,corpus,class,orig_size,result_size,ratio
brotli,0.3.0,1,txt,small,53161,19634,36
brotli,1.0.9,1,txt,small,53161,21162,39
bzip2,1.0.6,1,txt,small,53161,16558,31
bzip2,1.0.6.0.1,1,txt,small,53161,16558,31
gzip,1.6,1,txt,small,53161,21605,40
gzip,1.10,1,txt,small,53161,21605,40
lz4,131,1,txt,small,53161,27936,52
lz4,1.9.2,1,txt,small,53161,28952,54
xz,5.2.2,1,txt,small,53161,18416,34
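Once the CSV exists, ordinary tools can slice it. For instance, picking each program’s best (lowest) ratio with awk; results.csv is recreated here from five of the sample rows above:

```shell
# Find, per program, the smallest compression ratio in the sample data.
cat > results.csv <<'EOF'
program,version,level,corpus,class,orig_size,result_size,ratio
brotli,0.3.0,1,txt,small,53161,19634,36
brotli,1.0.9,1,txt,small,53161,21162,39
bzip2,1.0.6,1,txt,small,53161,16558,31
gzip,1.6,1,txt,small,53161,21605,40
xz,5.2.2,1,txt,small,53161,18416,34
EOF
awk -F, 'NR > 1 { r = $8 + 0
                  if (!($1 in best) || r < best[$1]) best[$1] = r }
         END    { for (p in best) print p "," best[p] }' results.csv | sort
```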
To quickly get some sort of visualization of the benchmark data, I dumped the CSV contents into rawgraphs.io and produced the following graph:
From this visualization, we can draw a few conclusions:
There’s little point in compressing (already compressed) JPG data.
xz is the clear winner when it comes to producing small archives of binary data.
bzip2 produces almost the same compression ratio for all level settings. It even looks like level 9 can produce slightly worse compression than level 8.
lz4 makes a very big jump in compression ratio between level 2 and 3.
To further refine our workflow, we could also produce data visualizations directly in our Nix expression, by creating builds that would feed the CSV data into some visualization software.
Remember, this blog post is not about benchmarking compression, but about how you can use Nix and nixbuild.net for such workflows. Hopefully you’ve gained some insights into how Nix can be used in scientific computing and data science workflows. Let’s wrap up with a summary of why I find Nix useful in these situations:
The Nix programming language and standard library provide tools for managing combinatorial problems, and allow us to quickly come up with high level abstractions giving us sensible knobs to turn when exploring parameter sweeps and build matrices.
We don’t have to think about parallelization, Nix takes care of it for us.
Nix makes it very easy to build specific variants of packages. This is helpful if you want to make comparisons between different software versions or patches. nixpkgs is a huge repository of pre-packaged software available to anyone.
nixbuild.net gives you extreme scalability with no adaptation or configuration needed. In the example above we saw build times cut to less than half by sending our Nix builds to nixbuild.net.
Reproducibility and build reuse is first-rate in Nix.
Thank you for reading this rather lengthy blog post! If you have any comments or questions about the content or about nixbuild.net in general, don’t hesitate to contact me.