The Build Methodology Decision

There are many different approaches to creating an application build. Obviously I go for full automation of the end-to-end process (and even then there are still multiple approaches to think through), but the sad reality most people live in is that builds are anything but automated, or if they are, only certain steps are automated rather than the whole pipeline.

Hopefully you are either planning how to eliminate the manual steps you already have, or are in the enviable position of starting a green-field project from scratch. Avoiding a massive long-term people cost is always a good thing in my opinion, and 'manual' activities within DevOps almost always carry a higher people cost.

This brings me to the general philosophy I have when it comes to DevOps and servers:

if you are logging in to a production server manually to make changes, then you are doing it wrong

Of course I also have a general philosophy that revolves around automated scaling, redundancy, and great monitoring and alerting, to ensure that all reasonable non-engineering aspects are covered. Fundamental issues with application code are never something that DevOps can be expected to cater for. But if deploying a hotfix is as simple as triggering the automated deploy itself, then the expectation that hotfixes can only be applied manually goes right out the window. You will know you have it 'right' when the question isn't 'how many hours of downtime do you need?' but rather 'can we have the zero-downtime rolling deploy take less than the existing 9 minutes to complete?'.
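
To make that concrete, here is a minimal sketch of a zero-downtime rolling deploy in Python. The helper functions (remove_from_lb, deploy_build, health_check, add_to_lb) are hypothetical stand-ins for whatever your load balancer and provisioning tooling actually expose:

```python
import time

# Hypothetical stand-ins for your real load balancer / provisioning APIs;
# none of these names come from any particular tool.
def remove_from_lb(server): print(f"draining {server}")
def add_to_lb(server): print(f"restoring {server}")
def deploy_build(server, build_id): print(f"deploying {build_id} to {server}")
def health_check(server): return True  # replace with a real HTTP/port check

def rolling_deploy(servers, build_id, timeout=120):
    """Update one server at a time so the rest keep serving traffic."""
    for server in servers:
        remove_from_lb(server)          # drain traffic away first
        deploy_build(server, build_id)  # apply the new build
        deadline = time.time() + timeout
        while not health_check(server):  # wait for the app to come back up
            if time.time() > deadline:
                # Leave the failed server out of rotation and halt the rollout.
                raise RuntimeError(f"{server} failed health check for {build_id}")
            time.sleep(5)
        add_to_lb(server)               # healthy: back into rotation

rolling_deploy(["app1.example.com", "app2.example.com"], "build-1234")
```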

If we eliminate manual interactions, the decision on how to accomplish an automated application build narrows to only a few choices, all of which follow some pretty basic requirements:

  • No human can be involved in an application deploy outside of source control (GitHub in my case)
  • No production environment build can occur without first being tested in a different (and identical) environment
  • Configuration differences between non-production and production must be catered for
  • At a minimum, security patches must be applied
  • Application dependency versions must be kept up to date on a regular schedule
  • Alerts based on monitoring of errors must be reacted to

These requirements work hand-in-hand: if any one of them is missed, a production environment can move from 'maintained by few and stable' to 'maintained by many and unstable' very quickly.
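
As a rough sketch of the second requirement (no production build without first being tested in an identical environment), a pipeline gate might look something like this. The environment names and the in-memory record of passing builds are assumptions standing in for whatever your CI system actually tracks:

```python
# The environment names and the in-memory record are assumptions; a real
# pipeline would query whatever your CI system records about test runs.
PASSED_STAGING_BUILDS = set()

def record_staging_pass(build_id):
    """Called by the pipeline once the staging deploy and tests succeed."""
    PASSED_STAGING_BUILDS.add(build_id)

def deploy(environment, build_id):
    """Refuse any production deploy that was never proven in staging."""
    if environment == "production" and build_id not in PASSED_STAGING_BUILDS:
        raise PermissionError(
            f"{build_id} has not passed an identical staging deploy; refusing"
        )
    print(f"deploying {build_id} to {environment}")

record_staging_pass("build-1234")   # staging deploy + tests succeeded
deploy("production", "build-1234")  # allowed; an untested build would raise
```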

So what are the choices? The most common methodologies I've encountered over the years are:

NoImage
  • Every deploy builds a new OS from scratch and installs the application afterwards
BaseImage
  • A common 'base image' for the OS
  • Application-specific installs, based on server function, applied afterwards
GoldImage
  • A full OS + application install in a single image
  • Each OS + application combination kept up to date separately

The NoImage approach can provide an advantage when an upgrade of an OS component breaks something: because the OS is built from scratch by a script, it is often easier to pinpoint exactly which step went wrong when that script fails (in your test environment, of course!). Years ago, when the OS (and especially the network) was more often an issue than not, this approach made perfect sense. In the last few years I've found that a modern OS on well-supported hardware is very stable, so for me this has become less of an issue.

The BaseImage approach helps standardise which components are rolled out at the OS level, including any common application dependencies that are required. This is particularly good at saving time on updates when you have many different applications using the same OS, but less so when you have fewer applications, or applications that require different OS types. If you are trending towards a 1:1 application-to-OS ratio, then this may not be the best approach.

The GoldImage is perhaps the fastest for deploys, but can also demand the largest ongoing investment of time in keeping all the various images current with updates.

Because I have 5+ applications that all run on the same OS, I personally have gone with the BaseImage approach, as the time a GoldImage would save per deploy is counted in seconds per image (for example, Java SE/JRE 1.7 vs 1.8 is the only inconsistent dependency I have to deal with, and the install script takes only 14 seconds on average to install Java). Your mileage may vary, of course!
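
For illustration, the per-application step on top of the base image can be as small as this sketch. The application names and package names are assumptions (shown here for a yum-based distribution), not a description of my actual setup:

```python
import subprocess

# Hypothetical mapping of application to its one inconsistent dependency;
# package names assume a yum-based distribution and will differ on yours.
APP_JAVA_PACKAGE = {
    "billing":  "java-1.7.0-openjdk",
    "frontend": "java-1.8.0-openjdk",
}

def install_app_dependencies(app):
    """The only per-application step; everything else is in the base image."""
    package = APP_JAVA_PACKAGE[app]
    subprocess.run(["yum", "install", "-y", package], check=True)

install_app_dependencies("frontend")
```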

When choosing our deploy methodology, remember that there are differences between non-production and production environments. The easiest example is 'which database instance am I using?' To maintain one of the most fundamental security boundaries, your non-production environments must never connect to your production systems. Having an operational server that connects to both is a different topic/post for another day, but nevertheless, keeping non-production and production separate is critical. Baking the database connection information into a gold image is therefore not a good idea.
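
Here is a minimal sketch of handling that difference at deploy time, assuming hypothetical environment names and settings. The point is that connection details live outside the image and are applied as the final step of the deploy:

```python
import os

# Hypothetical per-environment settings; in practice these would live
# outside the image (config files, a secrets store, etc.) and be applied
# as a final step of the deploy, never baked into a gold image.
DB_CONFIG = {
    "staging":    {"host": "db.staging.internal", "name": "myapp"},
    "production": {"host": "db.prod.internal",    "name": "myapp"},
}

def load_db_config(environment):
    config = DB_CONFIG[environment]
    # A cheap guard against the classic mistake: a non-production
    # application pointed at a production database host.
    if environment != "production" and "prod" in config["host"]:
        raise ValueError(f"{environment} config points at a production host")
    return config

print(load_db_config(os.environ.get("APP_ENV", "staging")))
```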

There is an inherent danger in having a separate gold image for non-production versus production: if you aren't using the same fundamental image in both, you are setting yourself up for a long-term headache, in my experience. As such, the last few minor configuration pieces need to be applied as part of the deploy, which again erodes some of the advantage that a gold image might otherwise provide over a base image methodology.

Given that I have personally gone with the BaseImage approach, the following posts will of course use that methodology. Hopefully the above gives you some insight into the thinking and experience behind my decision - there is no single 'right' way, just a few 'best for my company' options and a large number of alternatives that aren't quite as good.

Next up, SSH Keys. Because without a way of authenticating to our initial base image server, we wouldn't get very far at all!