TL;DR: I couldn’t find a tool to reverse Docker images, so I built one. It is called Whaler, and it is hosted at https://github.com/P3GLEG/Whaler
I was recently on an engagement where I discovered a private Docker registry and was curious about what I could do with it. I have seen a lot more Kubernetes (K8s) and Docker deployments from startups and wanted to figure out how I could pull secrets, source code and other miscellaneous information out of nothing more than an image. It isn’t particularly hard to do on your own; you can run docker history --no-trunc "your image" and you’re done. The problem is that you will see output like this.
You can’t see what the files were, only a random-looking hash, which I assumed was a SHA hash of the file or of the filename. It turned out I was completely wrong in this assumption, and it led me down a rabbit hole.
The hash is derived from the file name together with its path, and it differs depending on the instruction executed, whether the source is a directory, and whether multiple files are being added at one time. The current code for the hash can be found in github.com/docker/docker/builder/dockerfile/internals.go:295. I needed to find an easier way than running the image, brute forcing each file against the hash and somehow accounting for COPY on multiple files.
I found a couple of blog posts that talked about how Docker stores layers. Layers are the building blocks of an image. A layer is a stored .tar file for the instruction that was executed. Before Docker 1.10, every command would create a new layer. After 1.10, the community added a feature called content addressable layers, which you can read about at https://github.com/moby/moby/wiki/Engine-v1.10.0-content-addressability-migration. This made it so only instructions that modified the filesystem would add a layer. Things like ARG, EXPOSE, ENV, CMD, ENTRYPOINT, MAINTAINER, USER, LABEL and VOLUME would never create a layer. RUN is a special case, as it sometimes edits the filesystem and sometimes doesn’t. I found that some developers would run an echo, which doesn’t create a layer after 1.10, while a mkdir/rm does. Accounting for every case where this happens becomes a headache.
When you run docker save IMAGE, you are exporting your Docker image with all the layers that build up that image. At the base of the exported archive there are two files: one is for the current COW (Copy on Write) layer, a JSON file named after a sha256 hash, like ba9ea7b8ba264e0af3b6a33404881e1260e35d3b713dd6d5f783bd41a9b42145.json shown below, and the other is manifest.json. This was the magic I needed.
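To make the two files concrete, here is a minimal Python sketch of reading them out of a saved image. It is not Whaler’s code; the function name read_manifest_and_config is made up for illustration, and it assumes the archive layout described above, where manifest.json is a JSON list whose first entry has a "Config" key naming the sha256 JSON file.

```python
import json
import tarfile


def read_manifest_and_config(image_tar_path):
    """Return (manifest, config) parsed from a tar produced by docker save.

    Assumption: manifest.json is a list and its first entry's "Config"
    key holds the name of the sha256-named JSON file in the same archive.
    """
    with tarfile.open(image_tar_path) as image:
        # manifest.json sits at the base of the archive.
        manifest = json.load(image.extractfile("manifest.json"))
        # The manifest points at the sha256-named JSON file.
        config_name = manifest[0]["Config"]
        config = json.load(image.extractfile(config_name))
    return manifest, config
```

You would call it on the output of something like docker save IMAGE -o image.tar, then inspect the returned dictionaries.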
Within manifest.json are the layers in sorted order! The sha256 hash named file contains all the commands executed to generate the image and whether each one is considered an EMPTY_LAYER. EMPTY_LAYER means precisely that: it’s a command that didn’t create a layer and can be ignored. I used this nugget of information to iterate through the sorted layers and the history together, working out which command created which layer. Each layer, in turn, contains the exact listing of files added by that instruction.
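The pairing of history and layers can be sketched in a few lines of Python. This is a rough illustration rather than Whaler’s actual implementation: the function name is hypothetical, the sample data is made up, and it assumes the history entries use the keys "created_by" and "empty_layer" and that manifest.json lists layers under "Layers".

```python
from typing import List, Optional, Tuple


def map_history_to_layers(manifest: list, config: dict) -> List[Tuple[str, Optional[str]]]:
    """Pair each history command with the layer it produced, or None."""
    # Layers appear in manifest.json in order, base layer first.
    layers = iter(manifest[0]["Layers"])
    pairs = []
    for entry in config.get("history", []):
        command = entry.get("created_by", "")
        if entry.get("empty_layer", False):
            # The command did not create a layer, so it consumes none.
            pairs.append((command, None))
        else:
            # The command owns the next unclaimed layer.
            pairs.append((command, next(layers, None)))
    return pairs


if __name__ == "__main__":
    # Made-up sample data, shaped like the two JSON files.
    manifest = [{"Config": "abc.json", "Layers": ["l1/layer.tar", "l2/layer.tar"]}]
    config = {"history": [
        {"created_by": "/bin/sh -c #(nop) ADD file:1234 in / "},
        {"created_by": "/bin/sh -c #(nop)  ENV FOO=bar", "empty_layer": True},
        {"created_by": "/bin/sh -c #(nop) COPY dir:5678 in /app "},
    ]}
    for command, layer in map_history_to_layers(manifest, config):
        print(layer or "(no layer)", "<-", command)
```

Walking both lists in lockstep and skipping the empty entries is what avoids having to guess which RUN commands touched the filesystem.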
After several coding sessions and banging my head against my desk, I built Whaler. It saves the docker image and goes through each tar stored at each layer to list the files created. I added some quick regex checks too, nothing too serious. If an image is large, please be patient as it has to save the image as a tar file then search through it. I hope it helps you out in your journeys.
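For a feel of what going through each layer’s tar looks like, here is a small Python sketch. It is only an approximation of the idea, not Whaler’s code: the function names are invented, the regex is a hypothetical example of a quick check rather than the patterns Whaler ships with, and the layer name is assumed to come from the "Layers" list in manifest.json.

```python
import re
import tarfile
from typing import List

# Hypothetical pattern for file names worth a second look.
SUSPICIOUS = re.compile(r"(id_rsa|\.pem$|\.env$|credentials)", re.IGNORECASE)


def list_layer_files(image_tar_path: str, layer_name: str) -> List[str]:
    """List regular files inside one layer tar nested in a saved image tar."""
    with tarfile.open(image_tar_path) as image:
        layer_file = image.extractfile(layer_name)
        if layer_file is None:
            raise ValueError(layer_name + " is not a regular file in the archive")
        # The layer is itself a tar archive, so open it in place.
        with tarfile.open(fileobj=layer_file) as layer:
            return [member.name for member in layer.getmembers() if member.isfile()]


def flag_suspicious(file_names: List[str]) -> List[str]:
    """Return the file names that match the quick regex check."""
    return [name for name in file_names if SUSPICIOUS.search(name)]
```

Reading the nested tar straight from the outer archive avoids unpacking the whole image to disk, though a large image still takes a while to scan.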
Here are some examples: