Data Science with Nix: Parameter Sweeps
Parameter sweeping is a technique often used in scientific computing and HPC settings. In the mainstream software industry the same concept is called a build matrix.
The idea is that you have a task you want to perform with varying input parameters. If the task takes multiple parameters, and you’d like to try it out with multiple values for each parameter, it is easy to end up with a combinatorial explosion.
This blog post gives a practical demonstration showing how Nix is a perfect companion for managing parameter sweeps and build matrices, and how nixbuild.net can be used to supercharge your workflow.
My hope is that this text can interest both readers who don’t know anything about Nix and experienced Nix users.
Use Cases
In scientific computing, it is common to run simulations of physical processes. The list of things simulated is endless: weather forecasting, molecular dynamics, celestial movements, FEM analysis, particle physics etc. A simulation is usually implemented directly as a computer program or as a description for a higher level simulation framework. A simulation generally has a set of input parameters that can be defined. These parameters can describe initial states, environmental aspects or tweak the behavior of the simulation algorithm itself. Scientists are interested in comparing simulation results for a range of different parameter values, and the process of doing so is referred to as a parameter sweep.
Parameter sweeping is often built into simulation frameworks. For simulations implemented directly as specialized programs, scientists will simply run the program over and over again with different parameters, collecting and comparing the results. When supercomputers are used for running the simulations, the job scheduler usually has some support for launching multiple simulation instances with varying parameters.
In the software industry, the term build matrix means basically the same thing as parameter sweeping. Build matrices are commonly used to build different variants of the same deliverable. In the simplest case, a programmer builds and packages a program for a set of different targets (Windows, MacOS, Android etc). But more complex build matrices with (much) higher dimensionality are of course also used.
Benchmarking is another area where build matrices are utilized, and combinatorial explosions are common. The demo below will show how a compression benchmark can be implemented with Nix and nixbuild.net.
Embarrassing Parallelism
Parameter sweeping can be classified as an embarrassingly parallel problem. As long as we have enough CPUs, all simulations or builds can be executed in parallel, since there are (usually) no dependencies between them. This is a perfect workload for nixbuild.net, which is built to be very scalable.
At the same time, it is also easy to get into trouble managing all possible combinations of parameter values. Adding new parameters or parameter values can increase the number of runs exponentially, and the work of managing the runs and their results becomes overwhelming. The next section will show how Nix can help out with this.
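To make the growth concrete: every new parameter multiplies the total number of runs by its number of values. A quick shell sketch, using the value counts (3, 4 and 2) that appear in the example later in this post:

```shell
# Each parameter multiplies the run count: parameters with 3, 4 and 2
# values give 24 runs; adding one more 5-valued parameter gives 120.
runs=1
for values in 3 4 2; do
  runs=$((runs * values))
done
echo "$runs"          # 24 runs
echo "$((runs * 5))"  # 120 runs with one extra parameter
```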
How Nix Helps
One of the aspects of Nix that I find most empowering is that it helps you with the boring and stress-inducing task of managing files. Let me see if I can explain.
Assume you have a program called simulation that takes a parameter file as its only argument. The parameter file contains simulation parameters in a simple INI-format like this:
param1=3
param2=92
param3=0
The program outputs a simulation result on its standard output, in CSV format.
Now we want to run the simulation for some different combinations of input parameters. We could manually author the needed parameter files, or we can write a simple shell script for it:
round=1
for param1 in 3 4 5; do
  for param2 in 10 33 50 92; do
    for param3 in 0 1; do
      echo "param1=$param1" >> "round$round.ini"
      echo "param2=$param2" >> "round$round.ini"
      echo "param3=$param3" >> "round$round.ini"
      round=$((round+1))
    done
  done
done
We now have 24 different parameter files with all possible combinations of the values we are interested in:
$ ls -v
round1.ini round6.ini round11.ini round16.ini round21.ini
round2.ini round7.ini round12.ini round17.ini round22.ini
round3.ini round8.ini round13.ini round18.ini round23.ini
round4.ini round9.ini round14.ini round19.ini round24.ini
round5.ini round10.ini round15.ini round20.ini
$ cat round20.ini
param1=5
param2=33
param3=1
To run the simulations, we simply loop through all parameter files:
for round in round*.ini; do
  simulation "$round" >> results.csv
done
So far, so good. At this point, we might want to tweak our parameter generation a bit. Maybe there are more parameter values we want to explore, different sweeps to do. So we change our parameter generator script and re-run it a few times. We then realise we want to make changes to our simulation program itself. So we do that, and recompile it. Now we want to re-run all the different parameter sweeps we’ve done. Luckily, we saved all the different versions of our parameter generation script, so we can just run the updated simulator on each of the previously generated parameter sets.
At the end of our productive simulation session we have the following mess (with all intermediate parameter files removed):
gen-params.sh results-1.csv results-12.csv simulation-3
gen-params-1.sh results-2.csv results-13.csv simulation-O2-1
gen-params-2.sh results-3.csv results-14.csv simulation-O2-2
gen-params-3.sh results-4.csv results-15.csv simulation-O3
gen-params-3v2.sh results-5.csv results-16.csv simulation-debug
gen-params-4.sh results-6.csv results-17.csv simulation-debug2
gen-params-test.sh results-7.csv simulation simulation-wrong
gen-params-test-2.sh results-8.csv simulation-1
results.csv results-10.csv simulation-2
Admittedly, this is how things usually end up for me when I’m doing any kind of “exploratory” work. I’m sure you all are much more organized. To my rescue comes Nix. It allows me to stop caring entirely about generated files, and only care about how stuff is generated. Additionally, it gives me tools to abstract, parameterize and reuse generators.
Let’s make an attempt at recreating our workflow above with Nix:
{ pkgs ? import <nixpkgs> {} }:

let
  inherit (pkgs) lib callPackage runCommand writeText;

  # Compiles the 'simulation' program, and allows us to provide
  # build-time arguments
  simulation = callPackage ./simulation.nix;

  # Executes the given simulation program with the given parameters
  runSimulation = buildArgs: parameters:
    runCommand "result.csv" {
      buildInputs = [ (simulation buildArgs) ];
      parametersFile = writeText "params.ini" (
        lib.generators.toKeyValue {} parameters
      );
    } ''
      simulation $parametersFile > $out
    '';

  # Merges multiple CSV files into a single one
  mergeResults = results: runCommand "results.csv" {
    inherit results;
  } ''
    cat $results > $out
  '';

in {
  sim_O3_std_sweep = mergeResults (
    lib.forEach (lib.cartesianProductOfSets {
      param1 = [3 4 5];
      param2 = [10 33 50 92];
      param3 = [0 1];
    }) (
      runSimulation {
        optimizationLevel = 3;
      }
    )
  );

  sim_O2_small_sweep = mergeResults (
    lib.forEach (lib.cartesianProductOfSets {
      param1 = [1 3];
      param2 = [20 60 92];
      param3 = [0 1];
    }) (
      runSimulation {
        optimizationLevel = 2;
      }
    )
  );
}
Perhaps the key function above is cartesianProductOfSets, from the library functions in nixpkgs. Given a list of possible values for each parameter, it creates all possible combinations of input parameters. Our build function is then mapped over all these combinations using forEach.
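For readers new to Nix, the effect of cartesianProductOfSets can be mimicked with nested shell loops; for the sim_O3_std_sweep parameters it produces one attribute set per combination, 24 in total:

```shell
# Shell sketch of what cartesianProductOfSets yields for the
# sim_O3_std_sweep parameters: one combination per line.
count=0
for param1 in 3 4 5; do
  for param2 in 10 33 50 92; do
    for param3 in 0 1; do
      echo "{ param1 = $param1; param2 = $param2; param3 = $param3; }"
      count=$((count + 1))
    done
  done
done
echo "combinations: $count"
```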
We can build one of our parameter sweeps like this:
nix-build -A sim_O3_std_sweep
When all 24 simulation runs are done, Nix will create a single result symlink in the current directory, pointing to a results.csv file containing all simulation results. We can add new sweeps to our Nix file and re-run the build at any time. We never have to care about any generated files, since everything needed to re-generate results exists in the Nix file. The Nix file itself can be version-controlled like any source file.
In addition to the demonstrated ability to parameterize builds, Nix provides us with two more things, for free.
No unnecessary re-builds
In the example above, the sim_O3_std_sweep and sim_O2_small_sweep builds have some overlapping parameter sets. If you build both, Nix will only run the overlapping simulations once, and use the same result.csv files to create the two different results.csv files. This happens without any extra effort from the user. The same is true if you make changes that only affect part of your build. Nix also has support for external caches, which makes it easy to share and reuse build results between computers (or you can simply use nixbuild.net to get build sharing without any extra configuration).
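This reuse can be thought of as memoization keyed on a hash of all build inputs. A rough shell analogy (run_once and the cache layout are made up for illustration; Nix’s real store keys on the full dependency graph, not just a parameter string):

```shell
# Toy version of Nix's build reuse: key each run by a hash of its
# inputs and skip the work when the output already exists.
run_once() {
  key=$(printf '%s' "$1" | sha256sum | cut -c1-16)
  out="cache/$key.csv"
  if [ ! -e "$out" ]; then
    mkdir -p cache
    printf 'result-for:%s\n' "$1" > "$out"  # stand-in for the simulation
  fi
  cat "$out"
}
run_once "param1=3 param2=92 param3=0"
run_once "param1=3 param2=92 param3=0"  # second call reuses the cached file
```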
Automatic parallelization
When Nix evaluates an expression, it constructs a build graph that tracks all dependencies between builds. In the example above, the results.csv file depends on the list of result.csv files, which in turn depend on specific builds of simulation. All of these dependencies are implicit; you don’t have to do anything other than simply refer to the things you need to perform your build.
The build graph allows Nix to execute the actual builds with maximum parallelization. However, running as many builds as possible in parallel is often not optimal if your compute resources (usually: your local computer) are limited. Nix has a simplistic build scheduler, which is just a user-configurable limit on the maximum number of concurrent builds. This works in many cases, but quickly becomes suboptimal when you have lots of builds that could run in parallel, or when the builds themselves have varying compute requirements.
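A shell analogue of Nix’s concurrency cap is the -P flag of xargs, which limits how many processes run at once (echo stands in for a real simulation binary here, and sort only makes the output order deterministic):

```shell
# Process parameter files at most 8 at a time; with real Nix builds
# the cap would instead be the max-jobs setting.
printf 'round%s.ini\n' 1 2 3 | xargs -P 8 -n 1 echo | sort
```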
This is where nixbuild.net can step in. It is able to run an “infinite” number of concurrent Nix builds for you, while keeping all builds perfectly isolated from each other (security- and resource-wise). It also selects compute resources intelligently for each individual build.
From the perspective of scientific computing, you can say that Nix provides a generic framework for parallel workloads, and nixbuild.net acts somewhat as a supercomputer, minus the effort of writing submission scripts and explicitly managing compute results.
Demo: Compression Benchmark
I’m now going to show you an example that is similar to the example in the previous section, but instead of an imaginary simulation we will run an actual benchmark this time. The benchmark will compare the compression ratio of a number of different lossless compression implementations. Since this article is about parameter sweeping, we will vary the following parameters during the benchmark:
Compression implementation:
brotli
,bzip2
,gzip
,lz4
,xz
andzstd
.Two different versions of each compression implementation. We’ll use the versions packaged in nixpkgs 16.03 and 20.09, respectively.
Compression level: 1-9.
Corpus type: text, binaries and jpeg files.
Corpus size: small, medium and large.
We’ll try out the Cartesian product of the above parameters, resulting in 972 different builds. There is no particular thought behind the parameter selection; the values are just picked to demonstrate the abilities of Nix and nixbuild.net. If you were to design a proper benchmark you’d likely come up with different parameters, but the concept would be the same.
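The 972 figure is simply the product of the dimension sizes:

```shell
# 6 programs x 2 releases x 9 levels x 3 corpus types x 3 corpus sizes
echo "$((6 * 2 * 9 * 3 * 3))"  # 972
```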
Here is the complete Nix expression implementing the benchmark outlined above. The expression is parameterized over package sets from different releases of nixpkgs. There are different ways of actually importing those package sets, but that is beyond the scope of this example.
{ pkgs, pkgs_2009, pkgs_1603 }:

let
  inherit (pkgs)
    stdenv fetchurl lib writers runCommand unzip gnutar
    referencesByPopularity uclibc hello zig;

  compressionCommand = pkgs: program: level: {
    brotli = writers.writeBash "brotli-compress" ''
      if [ -x ${pkgs.brotli}/bin/brotli ]; then
        ${pkgs.brotli}/bin/brotli --stdout -${toString level}
      else
        ${pkgs.brotli}/bin/bro --quality ${toString level}
      fi
    '';
    bzip2 = "${pkgs.bzip2}/bin/bzip2 --stdout -${toString level}";
    gzip = "${pkgs.gzip}/bin/gzip --stdout -${toString level}";
    lz4 = "${pkgs.lz4}/bin/lz4 --stdout -${toString level}";
    xz = "${pkgs.xz}/bin/xz --stdout -${toString level}";
    zstd = "${pkgs.zstd}/bin/zstd --stdout -${toString level}";
  }.${program};

  corpus = rec {
    txt.small = calgary-text.small;
    txt.medium = calgary-text;
    txt.large = runCommand "enwik8" {
      buildInputs = [ unzip ];
      src = fetchurl {
        url = "http://mattmahoney.net/dc/enwik8.zip";
        sha256 = "1g1l4n9x8crxghapq956j7i4z89qkycm5ml0hcld3ghfk3cr8yal";
      };
    } ''
      unzip "$src"
      mv enwik8 "$out"
    '';

    pkg.small = closure-tar "uclibc-closure.tar" uclibc;
    pkg.medium = closure-tar "hello-closure.tar" hello;
    pkg.large = closure-tar "zig-closure.tar" zig;

    jpg.small = fetchurl {
      url = "https://people.sc.fsu.edu/~jburkardt/data/jpg/charlie.jpg";
      sha256 = "0cmd8wwm0vaqxsbvb3lxk2f7w2lliz8p361s6pg4nw0vzya6lzrg";
    };
    jpg.medium = fetchurl {
      url = "https://cdn.hasselblad.com/samples/x1d-II-50c/x1d-II-sample-02.jpg";
      sha256 = "15pz84f5d34jmp0ljz61wx3inx8442sgf9n8adbgb8m4v88vifk2";
    };
    jpg.large = fetchurl {
      url = "https://cdn.hasselblad.com/samples/Cam_1_Borna_AOS-H5.jpg";
      sha256 = "0rdcxlxcxanlgfnlxs9ffd3s36a05g8g3ca9khkfsgbyd5spk343";
    };

    calgary-text = stdenv.mkDerivation {
      name = "calgary-corpus-text";
      src = fetchurl {
        url = "http://corpus.canterbury.ac.nz/resources/calgary.tar.gz";
        sha256 = "1dwk417ql549l0sa4jzqab67ffmyli4nmgaq7i9ywp4wq6yyw2g1";
      };
      sourceRoot = ".";
      outputs = [ "out" "small" ];
      installPhase = ''
        cat bib book2 news paper* prog* > "$out"
        cat paper1 > "$small"
      '';
    };

    closure-tar = name: pkg: runCommand name {
      buildInputs = [ gnutar ];
      closure = referencesByPopularity pkg;
    } ''
      tar -c --files-from="$closure" > "$out"
    '';
  };

  benchmark = { release, program, level, corpusType, corpusSize }:
    runCommand (lib.concatStringsSep "-" [
      "zbench" program "l${toString level}" corpusType corpusSize release.rel
    ]) rec {
      corpusFile = corpus.${corpusType}.${corpusSize};
      run = compressionCommand release.pkgs program level;
      version = lib.getVersion release.pkgs.${program};
      tags = lib.concatStringsSep "," [
        program version (toString level) corpusType corpusSize
      ];
    } ''
      orig_size="$(stat -c %s "$corpusFile")"
      result_size="$($run < "$corpusFile" | wc -c)"
      percent="$((100 * result_size / orig_size))"
      echo >"$out" "$tags,$orig_size,$result_size,$percent"
    '';

in runCommand "compression-benchmarks" {
  results = map benchmark (lib.cartesianProductOfSets {
    program = [
      "brotli"
      "bzip2"
      "gzip"
      "lz4"
      "xz"
      "zstd"
    ];
    release = [
      { pkgs = pkgs_1603; rel = "1603"; }
      { pkgs = pkgs_2009; rel = "2009"; }
    ];
    level = lib.range 1 9;
    corpusType = [ "txt" "pkg" "jpg" ];
    corpusSize = [ "small" "medium" "large" ];
  });
} ''
  echo program,version,level,corpus,class,orig_size,result_size,ratio > $out
  cat $results >> $out
''
Above, compressionCommand defines the command used for each compression program to compress stdin to stdout with a given level.
The corpus attribute set defines txt, pkg and jpg datasets. For text and jpeg we simply fetch suitable sets, and for the binary (pkg) sets we use Nix itself to create a tar file out of the transitive closure of some different packages. The corpus sizes vary between around 50 kB and 300 MB.
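The closure-tar helper boils down to tar’s --files-from option: archive every path listed in a file. A runnable stand-in, where two throwaway files replace the package closure that referencesByPopularity computes:

```shell
# Archive every path listed in closure.txt, just as closure-tar does
# with the computed Nix closure file.
dir=$(mktemp -d)
cd "$dir"
echo hello > a.txt
echo world > b.txt
printf 'a.txt\nb.txt\n' > closure.txt
tar -c --files-from=closure.txt > closure.tar
tar -tf closure.tar  # lists a.txt and b.txt
```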
benchmark runs a single compression command for one combination of input parameters.
Finally, we again use the cartesianProductOfSets function to create builds of all possible combinations of parameters, and then simply concatenate all individual results into a big CSV file.
Building the complete benchmark takes about 25 minutes on my somewhat old 8-core workstation, with Nix configured to run at most 8 builds concurrently. If I use nixbuild.net instead, time is cut down to 10 minutes due to the parallelization gains possible when running 972 independent Nix builds.
In the end we get a CSV-file with values for each parameter combination. The first ten lines of the file look like this:
program,version,level,corpus,class,orig_size,result_size,ratio
brotli,0.3.0,1,txt,small,53161,19634,36
brotli,1.0.9,1,txt,small,53161,21162,39
bzip2,1.0.6,1,txt,small,53161,16558,31
bzip2,1.0.6.0.1,1,txt,small,53161,16558,31
gzip,1.6,1,txt,small,53161,21605,40
gzip,1.10,1,txt,small,53161,21605,40
lz4,131,1,txt,small,53161,27936,52
lz4,1.9.2,1,txt,small,53161,28952,54
xz,5.2.2,1,txt,small,53161,18416,34
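Once the CSV exists, ordinary tools can slice it. For instance, picking each program’s best (lowest) ratio with awk; results.csv is recreated here from five of the sample rows above:

```shell
# Find, per program, the smallest compression ratio in the sample data.
cat > results.csv <<'EOF'
program,version,level,corpus,class,orig_size,result_size,ratio
brotli,0.3.0,1,txt,small,53161,19634,36
brotli,1.0.9,1,txt,small,53161,21162,39
bzip2,1.0.6,1,txt,small,53161,16558,31
gzip,1.6,1,txt,small,53161,21605,40
xz,5.2.2,1,txt,small,53161,18416,34
EOF
awk -F, 'NR > 1 { r = $8 + 0
                  if (!($1 in best) || r < best[$1]) best[$1] = r }
         END    { for (p in best) print p "," best[p] }' results.csv | sort
```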
To quickly get some sort of visualization of the benchmark data, I dumped the CSV contents into rawgraphs.io and produced the following graph:
From this visualization, we can draw a few conclusions:
There’s little point in compressing (already compressed) JPG data.
xz is the clear winner when it comes to producing small archives of binary data.
bzip2 produces almost the same compression ratio for all level settings. It even looks like level 9 can produce slightly worse compression than level 8.
lz4 makes a very big jump in compression ratio between level 2 and 3.
To further refine our workflow, we could also produce data visualizations directly in our Nix expression, by creating builds that would feed the CSV data into some visualization software.
Remember, this blog post is not about benchmarking compression, but about how you can use Nix and nixbuild.net for such workflows. Hopefully you’ve gained some insights into how Nix can be used in scientific computing and data science workflows. Let’s wrap up with a summary of why I find Nix useful in these situations:
The Nix programming language and standard library provide tools for managing combinatorial problems, and allow us to quickly come up with high level abstractions giving us sensible knobs to turn when exploring parameter sweeps and build matrices.
We don’t have to think about parallelization, Nix takes care of it for us.
Nix makes it very easy to build specific variants of packages. This is helpful if you want to make comparisons between different software versions or patches. nixpkgs is a huge repository of pre-packaged software available to anyone.
nixbuild.net gives you extreme scalability with no adaptation or configuration needed. In the example above we saw build times cut to less than half by sending our Nix builds to nixbuild.net.
Reproducibility and build reuse is first-rate in Nix.
Thank you for reading this rather lengthy blog post! If you have any comments or questions about the content or about nixbuild.net in general, don’t hesitate to contact me.