Amazon Linux and Upstart/Init

Have you ever added a sleep or pause into a script to resolve a timing issue? I have, and I have to say I feel kinda dirty every time I do.

One of the more entertaining foibles of using Linux within a cloud service, specifically Amazon Linux within AWS in this instance, is that it is not easy to identify what has changed from the more mainstream distributions. Quite often this means that older scripts I've used in the past no longer work. That can be resolved correctly with a bit of effort, or you can do what I found most people ended up doing when they hit an issue with how quickly Node can start.

Node starts very quickly. This is good! Or in the case of hands-free automated enterprise scaling, it is sometimes not so good. In a nutshell, Node starts so quickly during the boot cycle that it is up before the network interfaces are fully enabled.

Roll on to using Upstart to manage a Node application, and you can easily get into a sticky situation where Node has started and bound to lo, the local loopback interface, and will happily ignore eth0 or similar when they come up later in the boot cycle. This means nothing can connect to the running Node application over the network, which is usually the point of running Node.
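To see whether you are in this situation, it can help to record which interfaces were actually up at the moment the job started. The following is only a sketch: up_interfaces is a hypothetical helper of mine, and it assumes the kernel exposes interface state under /sys/class/net, which is normal on Linux (some virtual drivers report "unknown" rather than "up", so treat the result as a hint).

```shell
#!/bin/sh
# List network interfaces, other than lo, whose operstate is "up".
# The optional first argument is a hypothetical override so the function
# can be pointed at a test directory; it defaults to /sys/class/net.
up_interfaces() {
    sysfs_net="${1:-/sys/class/net}"
    for dir in "$sysfs_net"/*; do
        # Skip anything that does not look like an interface entry.
        [ -r "$dir/operstate" ] || continue
        iface=$(basename "$dir")
        [ "$iface" = "lo" ] && continue
        state=$(cat "$dir/operstate")
        [ "$state" = "up" ] && echo "$iface"
    done
    return 0
}

up_interfaces
```

If this prints nothing from inside a pre-start script, Node is about to start with only the loopback interface available.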

The most common resolution I've seen to date? It's to use a pause of ~30 seconds and hope the network card is already running. When you are running at scale, 'hope' is not a good thing at all.

So what does a 'dirty' init script look like, you ask? Well, something like this:

description "An Awesome App node.js server"  
author      "doatt"

start on runlevel [2345]  
stop on runlevel [!2345]

respawn

script  
    echo $$ > /usr/anAwesomeApp/anAwesomeApp.pid
    exec /usr/bin/node /usr/anAwesomeApp/anAwesomeApp.js >> /usr/anAwesomeApp/anAwesomeApp.log 2>&1
end script

pre-start script  
    echo "- - - [`date -u +%Y-%m-%dT%T.%3NZ`] (sys) Starting" >> /usr/anAwesomeApp/anAwesomeApp.log
    sleep 30
end script

post-start script  
    echo "- - - [`date -u +%Y-%m-%dT%T.%3NZ`] (sys) Started" >> /usr/anAwesomeApp/anAwesomeApp.log
end script

pre-stop script  
    echo "- - - [`date -u +%Y-%m-%dT%T.%3NZ`] (sys) Stopping" >> /usr/anAwesomeApp/anAwesomeApp.log
    rm /usr/anAwesomeApp/anAwesomeApp.pid
end script

post-stop script  
    echo "- - - [`date -u +%Y-%m-%dT%T.%3NZ`] (sys) Stopped" >> /usr/anAwesomeApp/anAwesomeApp.log
end script

Yes, console logging to a file for the win :)

There are two main issues with the above. The biggest is the 30 second pause in the pre-start script. That one is the 'dirty' part of the whole piece. The second is the raw approach of starting on any runlevel of 2, 3, 4 or 5. I've seen this approach hundreds of times in examples all around the internet. I was even beginning to think that there was no solution in sight other than pausing.

I spent a number of hours figuring this out (and isn't hindsight awesome!), running through all sorts of attempts to eliminate the 30 second pause by modifying the start on runlevel [2345] component.

Let's focus on the start/stop commands. The stop command above can equally be written as:

stop on runlevel [016]

Personally I prefer using ! to imply not.

For anybody who doesn't already know, on Red Hat style distributions (which Amazon Linux follows) runlevel 2 is multi-user mode without network services, runlevel 3 is full multi-user mode and runlevel 5 is the one that adds a GUI; Debian and Ubuntu treat 2 through 5 alike. Amazon Linux boots into runlevel 3 by default for good reason - you really shouldn't have a GUI when you have a fully automated infrastructure. Again (personally) I avoid applications that require a GUI to be running on a hands-free server. A Node app that exists to serve network traffic has no business starting in runlevel 2, so there is a fairly simple change:

start on runlevel [345]  
stop on runlevel [!345]

The stop can, of course, also be written as:

stop on runlevel [0126]

Yeah I know, I'm mean. :D
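If you want to convince yourself that the two spellings really are equivalent, the bracket patterns can be tried out in plain shell. This is a sketch under the assumption that Upstart matches event arguments with the same glob rules a shell case statement uses; matching_runlevels is a hypothetical helper, not part of Upstart, and it only looks at runlevels 0 to 6.

```shell
#!/bin/sh
# Print every runlevel in 0..6 that a bracket pattern matches.
# matching_runlevels is a hypothetical helper for illustration only.
matching_runlevels() {
    pattern="$1"
    result=""
    for level in 0 1 2 3 4 5 6; do
        # An unquoted $pattern in a case label is treated as a glob.
        case "$level" in
            $pattern) result="$result$level" ;;
        esac
    done
    echo "$result"
}

matching_runlevels '[!2345]'   # prints 016
matching_runlevels '[016]'     # prints 016
matching_runlevels '[!345]'    # prints 0126
matching_runlevels '[0126]'    # prints 0126
```

One difference to keep in mind: the negated form also matches any other value the runlevel event might carry (S for single-user mode, for instance), while the explicit list does not.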

According to the official Upstart documentation (both generic and Ubuntu specific), there are a bunch of awesome start on options that can be used instead of, or in addition to, runlevel.

The best sounding ones are local-filesystems and the even better sounding net-device-up IFACE!=lo. These are awesome on some other variants of Linux, and I've enjoyed having them available in the past. But do they work with Amazon Linux?

No.

Many electrons were annihilated during reboots to figure out that no combination of local-filesystems, net-device-up and the like worked as documented.

The (actual) good news is, there is the ability to detect core services starting. After much trial and error, and again lots of surfing the intrawebs, I found that none of the standard names for networking services are in use within Amazon Linux. In hindsight the simplicity of the name was obvious - you can believe I was kicking myself when I figured it out...

Amazon Linux uses a networking service called.... wait for it..... network. :sigh:
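In hindsight, a quick look at what is installed would have saved some of that trial and error. The sketch below checks a handful of likely names against the init script and Upstart job directories. The candidate list and find_network_services are my own illustrative assumptions, as is the idea that the name Upstart sees matches a file in /etc/init.d or /etc/init.

```shell
#!/bin/sh
# Report which candidate service names have an init script or Upstart job.
# Both directory arguments are optional overrides for testing; the
# candidate names are guesses at what various distributions use.
find_network_services() {
    initd_dir="${1:-/etc/init.d}"
    upstart_dir="${2:-/etc/init}"
    for name in networking network NetworkManager network-manager; do
        if [ -e "$initd_dir/$name" ]; then
            echo "$name (init script)"
        elif [ -e "$upstart_dir/$name.conf" ]; then
            echo "$name (upstart job)"
        fi
    done
}

find_network_services
```

Whatever name turns up is the one to try after started in the start on line.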

So, an awesome script, which I have simplified so it's easy enough to follow, does the following:

ensures the server is running in a full multi-user runlevel (3, 4 or 5)
ensures the server has its network interfaces all running (eth0 and lo for example)
limits the amount of log spam that might occur if something does go wrong (all servers sit behind an Elastic Load Balancer right?!)
prints helpful information to the log file created that includes a standard date/time format

All good things. Well, maybe not the console log to file approach, but I like having both the console log created as well as any application direct logging that may be occurring, just in case...
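The log lines all share one shape, so it is worth seeing it in isolation. This is a minimal sketch: log_event is a hypothetical helper, and the %3N (milliseconds) specifier assumes GNU date, which is what you would normally have on Linux. In a real Upstart job each script stanza runs in its own shell, so a function like this would have to be repeated in every stanza that uses it.

```shell
#!/bin/sh
# Emit one lifecycle line in the same shape the job's hooks write:
# three dashes, a UTC ISO 8601 timestamp with milliseconds, then the event.
# log_event is a hypothetical helper; %3N needs GNU date.
log_event() {
    printf '%s\n' "- - - [$(date -u +%Y-%m-%dT%T.%3NZ)] (sys) $1"
}

log_event "Starting"
```

Using UTC with a trailing Z keeps the lines sortable and comparable across servers in different regions.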

So, without further rambling on my part, here is a fully functional Amazon Linux based Upstart/Init script that handles Node starting before the network card itself completes initialisation.

description "An Awesome App node.js server"  
author      "doatt"

start on (runlevel [345] and started network)  
stop on (runlevel [!345] or stopping network)

# respawn limit only sets the threshold; respawn itself enables restarts
respawn
respawn limit 20 5

script  
    echo $$ > /usr/anAwesomeApp/anAwesomeApp.pid
    exec /usr/bin/node /usr/anAwesomeApp/anAwesomeApp.js >> /usr/anAwesomeApp/anAwesomeApp.log 2>&1
end script

pre-start script  
    echo "- - - [`date -u +%Y-%m-%dT%T.%3NZ`] (sys) Starting" >> /usr/anAwesomeApp/anAwesomeApp.log
end script

post-start script  
    echo "- - - [`date -u +%Y-%m-%dT%T.%3NZ`] (sys) Started" >> /usr/anAwesomeApp/anAwesomeApp.log
end script

pre-stop script  
    echo "- - - [`date -u +%Y-%m-%dT%T.%3NZ`] (sys) Stopping" >> /usr/anAwesomeApp/anAwesomeApp.log
    rm /usr/anAwesomeApp/anAwesomeApp.pid
end script

post-stop script  
    echo "- - - [`date -u +%Y-%m-%dT%T.%3NZ`] (sys) Stopped" >> /usr/anAwesomeApp/anAwesomeApp.log
end script