Thursday, June 26, 2014

Thoughts on developing Docker Containerized Applications

I created a simple set of Docker containers to demonstrate how a scientific workflow could be containerized. The actual Docker container specifications are also open sourced on GitHub.

Here are a couple of design decisions that came up when I was developing the containers.

Good usage statements or help outputs are critical

Containerized applications, like dmlond/bwa_aligner or dmlond/bwa_reference, can be run almost as if they were normal programs on the host machine. The containerized application has very little interaction with other applications on the host. This has some very positive consequences, e.g. yum updates to the host will not affect your process (so long as the container does not run yum update as well). But there are negative consequences too. If your process depends on directories that must be mounted at run time, with -v or --volumes-from, the usage statement or command-line help should say so. If you simply wrap a script designed to run on the host, without modifying it to tell the user about the extra level of indirection created by these mounts, users will undoubtedly get confused and complain when it doesn't run as expected.
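One way to surface the mount requirements is to have the wrapped script check for them before doing any work. The following is only a minimal sketch: the mount points /data and /reference, the usage wording, and the function names are illustrative assumptions, not what the dmlond containers actually use.

```shell
#!/bin/sh
# Hypothetical wrapper for a containerized tool. The paths /data and
# /reference are assumptions made for this example.

usage() {
    cat <<EOF
Usage: docker run -v /host/data:/data -v /host/ref:/reference IMAGE [args]

This container expects the following directories to be mounted at run time
(with -v or --volumes-from):
  /data        input files to process
  /reference   reference files
EOF
}

# Print an error plus the usage text, and return non-zero, if any of the
# directories given as arguments is missing inside the container.
check_mounts() {
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "Error: $dir is not available inside the container." >&2
            usage >&2
            return 1
        fi
    done
    return 0
}

# At the top of the wrapped script one would then call:
#   check_mounts /data /reference || exit 1
```

With a check like this, a user who forgets the -v option sees which directory is missing and how to mount it, rather than an unexplained "file not found" from the underlying tool.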

ENTRYPOINT locks the container

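When you specify an ENTRYPOINT in your container, this locks the container to running a specific command. Anything the user places after the image name on the docker run command line is passed to that command as arguments, so, short of overriding it with docker run's --entrypoint option, a user cannot run the container with a different binary, as in: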


    $ docker run -i -t container /bin/bash
    $ docker run -v /path/to/local:/archive container cp /path/to/container/* /archive

etc.

This is useful for containers that must act like a single application, but it will come as a surprise if you expect to be able to run arbitrary commands like these in your newly built container.
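To make the rule concrete, here is a toy model in plain shell of how the command that finally runs is put together. It is a sketch for illustration, not Docker's own code, and it assumes an exec-form ENTRYPOINT. The function name effective_command, the file /attribution.txt, and the use of bwa as an entry point are all hypothetical.

```shell
#!/bin/sh
# Toy model (not Docker's implementation): print the command a container
# would end up running.
#   $1    the image's ENTRYPOINT ("" if none)
#   $2    the image's default CMD ("" if none)
#   rest  whatever the user typed after the image name on `docker run`
effective_command() {
    entrypoint=$1
    cmd=$2
    shift 2
    # Arguments given on the command line replace the default CMD.
    if [ "$#" -gt 0 ]; then
        rest=$*
    else
        rest=$cmd
    fi
    # An ENTRYPOINT, when set, always stays in front of whatever follows.
    if [ -n "$entrypoint" ]; then
        echo "$entrypoint${rest:+ $rest}"
    else
        echo "$rest"
    fi
}

effective_command "" "cat /attribution.txt"            # prints: cat /attribution.txt
effective_command "" "cat /attribution.txt" /bin/bash  # prints: /bin/bash
effective_command "bwa" "" /bin/bash                   # prints: bwa /bin/bash
```

The last call shows the surprise: with an ENTRYPOINT in place, /bin/bash does not replace the program, it is handed to it as an argument.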

CMD is useful in many unexpected ways

One of the containers that I created, dmlond/bwa_plasmodium_data, is designed solely to be run as a data container, with volumes to be used by the bwa_aligner (or other) container at run time. Another container, dmlond/bwa_samtools_base, is both an intermediate container (i.e. other containers extend it to share its common software install base) and a utility container that can be run by itself to apply its installed software to files in the data or reference volumes. When I initially created these containers, I left them without a CMD. It was not long before I realized several ways that a CMD could be useful.

  1. The dmlond/bwa_plasmodium_data container downloaded and made use of datasets from publicly funded research. You should make every effort to credit the researchers who produced the data when you use it. The CMD for the container gave me a nice way to do this. I added a text file to the container with information about the data and links to URLs describing the research from which it was generated. The CMD cats this file when the container is run, unless the user specifies another command. This is also a win for the person running the container to replicate the workflow. Before I added the CMD, when I wanted to run the container to expose its /data volume, I had to specify a command to run:

    $ docker run --name bwa_plasmodium_data dmlond/bwa_plasmodium_data /bin/true

    Now, I just have to run the container with a --name.

    $ docker run --name bwa_plasmodium_data dmlond/bwa_plasmodium_data

    This data comes from the Genome Epidemiology Network
    http://www.malariagen.net/data
    It was pulled from the Short Read Archive
    Consult the following for more information:
    http://sra.dnanexus.com/runs/ERR022523
    http://sra.dnanexus.com/runs/ERR022523/studies
    The ftp URLs to the actual files added to the data container are:
    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR022/ERR022523/ERR022523_1.fastq.gz
    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR022/ERR022523/ERR022523_2.fastq.gz

    Its default run state is to print out its attribution information, making it a prominent part of the workflow interaction. (If I wanted to lock this down even further, I could make it an ENTRYPOINT, but for now it is useful to keep it as a CMD).

  2. The dmlond/bwa_samtools_base container installed samtools and bwa using the CentOS 6.5 (RHEL) yum repository. Unless it is built again from source at a time when RHEL has updated these packages, it will always contain exactly the same versions of these tools. It is good for users to be able to find out which versions the container provides. I accomplish this by adding a script which simply runs bwa and samtools without arguments in succession, and identifying that script as the CMD. This allows users to override the CMD with, say, /bin/bash to use the container in its utility mode, while also making it possible to run it without any command to find out which versions of bwa and samtools are installed.
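A version-reporting CMD script along the lines of the second item could look roughly like this. It is a sketch rather than the script shipped in dmlond/bwa_samtools_base: the function name report_versions and the fallback message are assumptions, and it relies only on the behaviour described above, namely that running each tool without arguments prints a banner that includes its version.

```shell
#!/bin/sh
# Hypothetical CMD script: run each named tool without arguments so that
# its usage banner, which includes the version, is printed.
report_versions() {
    for tool in "$@"; do
        echo "==== $tool ===="
        if command -v "$tool" >/dev/null 2>&1; then
            # Merge stderr into stdout in case the banner is written there,
            # and ignore the exit status of a bare invocation.
            "$tool" 2>&1 || true
        else
            echo "$tool is not installed in this container"
        fi
    done
}

report_versions bwa samtools
```

Because this is only the default command, a user can still replace it on the docker run command line, for example with /bin/bash, to use the container in its utility mode.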

Overall, my interaction with Docker has been extremely positive. I can see many ways that Docker containerized applications could be utilized in research pipelines.