<![CDATA[DevOps All The Things]]>http://doatt.com/Ghost 0.8Sun, 09 Apr 2017 05:02:14 GMT60<![CDATA[Your very own VPC (warning, boring security conversation as well)]]>
http://doatt.com/2015/05/31/your-very-own-vpc-warning-boring-security-conversation-as-well/16223381-6d7c-4c51-af6b-026837fb7a0dSat, 30 May 2015 19:20:51 GMT

One of the building blocks of running an AWS EC2 instance is called the VPC or Virtual Private Cloud. The VPC system allows you to create shared or segmented networking blocks for the various EC2 instances that you will run.

The really boring Security conversation...

Fundamentally the level of security you can provide between instances and anything else - other instances or the Internet in general - is dictated by how you construct your VPC topology. In general terms, one large shared VPC with no restrictions is bad. The other extreme of a VPC per instance is technically possible, but it is overly complicated and doesn't actually provide more security - just more opportunities to configure things badly.

I did warn in the title that a boring security conversation was going to happen right? ;)

A nice middle-ground between over-simplification and over-complication is where you want to be. If you haven't been architecting networks and implementing servers in them for many years, then I'm going to recommend you put all birds of a feather in common cages so that (in this analogy at least) you can try to achieve some colour coordination. From a DevOps perspective, this means we are going to lump all our development instances together in one VPC, keep that separate from our operations instances, and keep both separate from our production instances. As such, the simplest model I would ever recommend is a minimum of three VPCs:

  • Development
  • Operations
  • Production

In this model, the Development and Production VPCs should (read 'must') never initiate connections to each other. For specific services such as a log aggregator or a yum server holding rpm images, the Operations VPC can have connections inbound. The Operations VPC should (again, read 'must') be the only area that can initiate connections directly to the instances within the Development or Production VPCs.

You will of course have public-facing connections incoming to Development or Production, but unless there is an insurmountable technical reason not to use a load balancer (AWS call them ELBs, or Elastic Load Balancers), you should never have direct connections from the Internet to the instances either. Load balancers provide more than just High Availability (HA); they also allow for some direct control, and for protocol translation between the Internet and the internal instances. This also helps us comply with the fairly strict requirement of only allowing Operations to initiate internal connections to Development or Production.

In case you have already made the logical leap to 'what about SSH for debugging from the office?' as a question, there are two key ways to get connected without necessarily allowing the entire world to have a go at hacking your servers.

VPN

Setting up a VPN server in your Operations VPC is easily seen as the most secure way of achieving access to any server from anywhere that has the right authentication. Therein lies probably the biggest issue with the security of this approach though - anybody that has the right authentication can also mean anybody that compromises that authentication, or even grabs the relevant authentication details from your workstation.

If you do go this route, ensure your VPN requires Multi-Factor Authentication (MFA) using TOTP (Time-based One Time Password) in addition to a public/private key pair. Never store the 'secret' part of the TOTP system on the same device as the private key; do that and you make compromising the VPN authentication nigh on impossible without the hacker and yourself being in the same physical location. It's actually simpler to implement than it sounds (PAM modules on Linux are great for this), but it does add the overhead of having a VPN server (or two!) to care about. Also, if not implemented correctly it provides only the pretence of good security.
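To make the TOTP half of that concrete, here is a minimal Python sketch using the pyotp library (my choice for illustration - any RFC 6238 implementation behaves the same way). The user name and issuer are made up, and in a real VPN setup the verification would live inside the PAM module on the server rather than a script:

import pyotp

# generate a per-user secret ONCE and store it on the VPN server,
# never on the same device as the user's private key
secret = pyotp.random_base32()
totp = pyotp.TOTP(secret)

# the user loads this URI (usually via QR code) into their authenticator app
print(totp.provisioning_uri(name="dev@doatt.com", issuer_name="operations-vpn"))

# at login time the server checks the 6-digit code the user types in
code = input("Enter the 6 digit code: ")
print("Valid!" if totp.verify(code) else "Invalid code")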

If implementing a VPN, I always back this up with a process for implementing the alternative model in a hurry by manipulating Security Groups within the AWS console.

Security Groups

In the end, Security Groups are one of your main controls over access to your servers. They work as an additional layer on top of the ACL (Access Control List) component of the VPC system, and are used to allow access. You can only tell a Security Group to deny traffic by not specifying the port/IP combination. The ACL system, on the other hand, can be used for both allowing and denying connections. EC2 instances only know about Security Groups, so I've always found it best to use the ACL system only for blocking specific IP addresses when absolutely needed, and to use Security Groups for detailing what can connect in general. My general ACL rules as a result look similar to the below (sorry whoever is currently using 1.2.3.4/32!). IP addresses in ACLs and Security Groups are defined using Classless Inter-Domain Routing (CIDR) notation, which is worth reading up on if you want to know more. Rule 1 doesn't always exist (or there is more than one when there are multiple denies).

acl - deny specific IP
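If you prefer to script that deny rule rather than click through the console, a boto3 sketch could look like the below - the ACL ID is a placeholder, and rule number 1 ensures the deny is evaluated before any allow rules:

import boto3

ec2 = boto3.client('ec2', region_name='us-west-2')

# rule 1: deny all inbound traffic from 1.2.3.4/32 before any allow rules apply
ec2.create_network_acl_entry(
    NetworkAclId='acl-12345678',   # hypothetical - use the ACL attached to your VPC
    RuleNumber=1,
    Protocol='-1',                 # -1 means all protocols
    RuleAction='deny',
    Egress=False,                  # inbound rule
    CidrBlock='1.2.3.4/32',
)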

Now that we have digressed somewhat, let's move back to talking about Security Groups. Because a Security Group is designed to allow access, if you don't specifically allow a port and IP combination in a Security Group, then that connection cannot be made. If you set up a block behind that Security Group within the ACL, then even if the Security Group allows the traffic, the connection still cannot be made.

Putting a deny in an ACL is far easier than breaking up an existing Security Group in order to avoid allowing access within a larger range. For example, using only Security Groups to achieve my ACL example above - blocking 1.2.3.4 - would look like:

0.0.0.0/8  
1.0.0.0/15  
1.2.0.0/23  
1.2.2.0/24  
1.2.3.0/30

1.2.3.5/32  
1.2.3.6/31  
1.2.3.8/29  
1.2.3.16/28  
1.2.3.32/27  
1.2.3.64/26  
1.2.3.128/25  
1.2.4.0/22  
1.2.8.0/21  
1.2.16.0/20  
1.2.32.0/19  
1.2.64.0/18  
1.2.128.0/17  
1.3.0.0/16  
1.4.0.0/14  
1.8.0.0/13  
1.16.0.0/12  
1.32.0.0/11  
1.64.0.0/10  
1.128.0.0/9  
2.0.0.0/7  
4.0.0.0/6  
8.0.0.0/5  
16.0.0.0/4  
32.0.0.0/3  
64.0.0.0/2  
128.0.0.0/1  

That blank line? Well, that's 1.2.3.4/32 - deliberately missing. This is overkill when you just want to block one IP. My actual Security Group just uses 0.0.0.0/0 to denote 'the entire Internet', and the ACL blocks the single IP without the need for dozens of Security Group rules.

Alternatively, to allow only 2.3.4.5/32 to connect over HTTPS, keep the ACL rule as-is (1.2.3.4/32 still can't connect, sorry! Neither can the rest of the Internet, so don't feel too bad) and just specify the following in your Security Group:

security group - allow specific ip
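The scripted equivalent is a single call; again a boto3 sketch, with the Security Group ID as a placeholder:

import boto3

ec2 = boto3.client('ec2', region_name='us-west-2')

# allow HTTPS (443) inbound from 2.3.4.5/32 only
ec2.authorize_security_group_ingress(
    GroupId='sg-12345678',   # hypothetical Security Group ID
    IpPermissions=[{
        'IpProtocol': 'tcp',
        'FromPort': 443,
        'ToPort': 443,
        'IpRanges': [{'CidrIp': '2.3.4.5/32'}],
    }],
)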

All this was to say 'Security Groups can be used to specify that only your office IP address can connect directly to instances within your VPCs' - an easy way to put IP restrictions on what can reach the systems running within your VPCs.

It is possible to implement each allow rule in both a Security Group and an ACL. This is 'best practice' in that you would have to make the same mistake twice to allow access where you didn't intend access to be granted, but in reality all it does is create situations where something that looks like it should work (the Security Group is configured correctly) fails anyway, and you blame the instance when the actual problem lies with the ACL. Sometimes people forget they are using both, and that can take a while to diagnose as well.

I mentioned in the VPN answer that you can use Security Groups as a backup for if you lose all connectivity to the VPN server(s). Having a Security Group already implemented with no rules (i.e. nobody can connect) is a good idea in that situation, as you can then easily modify the Security Group to give direct access from your current IP address temporarily. And therein lies the crux of the security issue - it's easy to modify the controls to allow access if you have the authentication details to the system where the controls are configured. Hence, always use MFA on every account with this sort of access.

A final note on potential security issues - accounts with AWS Access Keys with * access to EC2 can also modify Security groups, so be careful where you allow that level of access via AWS Access Keys.

All these words and no actual actions! Put it down to my liking everybody to be on the same page before configuring a security system, as not having an inkling of why you do something can lead to disaster down the road. The best analogy here is teaching somebody how to accelerate a car without telling them there are brakes, or how to use them.

Before we continue, you have probably already noticed that a VPC already existed (172.31.0.0/16) in your account. This is deemed the 'default' VPC and, while it is possible to delete it if you aren't using it, I like leaving it around as you can't recreate the default VPC without direct AWS staff help. Let's just put it down to 'being ignored' from this point on, although you may find you add instances to it by accident in the future as the default VPC is used as just that, a default. A default you should always override, as you want to only ever put production servers in a production VPC, and defaults are bad from that perspective. Don't be that DevOps person with something in the default VPC ;)

Creating a VPC

There are two ways to create a VPC. Within the Dashboard of the VPC console is a 'Start VPC Wizard' button. It gives you 4 options, but unless you want to spend money on hardware or on running a Network Address Translation (NAT) instance, you will end up choosing option 1. The other way is to create each component individually, but where is the fun in that? :)

vpc - create wizard step 1

The other 3 options are all very valid, it's just that they cost real money to implement and are also more advanced than you will probably need outside of a large Enterprise. Feel free to experiment with (or implement!) them as wanted, of course. Just be aware there will be a cost that you will be paying in terms of dollars and/or equipment.

Step 2 dictates the starting point of the structure you will be using, and I only recommend modifying a couple of the pieces at the start. The first is which overall IP range you will use. I like 10.0.0.0/16 for my development ranges, so will stick with that in the first instance. Give your VPC a name (development) and, while it's not strictly needed, I like subnets in groups of ~4000 IP addresses as I never plan to have more than 16 subnets within a single VPC - I usually don't go above the number of Availability Zones, for that matter. This is why I specify my first subnet as 10.0.0.0/20. I also choose for this to be put within Availability Zone A (I use the US West 2 or Oregon region by default, hence us-west-2a for Availability Zone A in the US West 2 region). Next up I give my subnet a name, which is usually specific to its purpose and always contains (in my case at least!) the VPC name and the availability zone. For this example we are just going with development-a, although that could easily be development-awesomeapplication-a if I was implementing something specific to Awesome Application.

Feel free to ignore the endpoints for S3 - you can always add this later if you need it. It's also a fairly new capability from AWS, and pretty awesome from a security perspective in terms of what you can now do with S3-to-EC2 connectivity.

I also choose to allow DNS hostnames for those applications and OS components that prefer them. Be aware that changing your Hardware tenancy from Default to Dedicated will start costing money. Unless your company requires it, it isn't needed for most systems and applications.

vpc wizard - step 2

All done? Why Yes! Yes you are!

vpc wizard - complete

So what joys did you miss out on by using the wizard and not setting up each component individually? To be honest, not much, but you did get some components already configured for you (a scripted equivalent is sketched after the list):

  • an initial Subnet
  • two Route Tables
  • an Internet Gateway
  • an ACL
  • a Security Group
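
For the curious, here is roughly what those pieces look like when created by hand with boto3. This is a sketch only - the names and CIDRs match the wizard run above, and error handling is left out:

import boto3

ec2 = boto3.client('ec2', region_name='us-west-2')

# the VPC itself, with DNS hostnames enabled as chosen in the wizard
vpc_id = ec2.create_vpc(CidrBlock='10.0.0.0/16')['Vpc']['VpcId']
ec2.modify_vpc_attribute(VpcId=vpc_id, EnableDnsHostnames={'Value': True})
ec2.create_tags(Resources=[vpc_id], Tags=[{'Key': 'Name', 'Value': 'development'}])

# the initial subnet in Availability Zone A
subnet_id = ec2.create_subnet(VpcId=vpc_id, CidrBlock='10.0.0.0/20',
                              AvailabilityZone='us-west-2a')['Subnet']['SubnetId']
ec2.create_tags(Resources=[subnet_id], Tags=[{'Key': 'Name', 'Value': 'development-a'}])

# an Internet Gateway plus a Route Table that uses it, associated with the subnet
igw_id = ec2.create_internet_gateway()['InternetGateway']['InternetGatewayId']
ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)
rtb_id = ec2.create_route_table(VpcId=vpc_id)['RouteTable']['RouteTableId']
ec2.create_route(RouteTableId=rtb_id, DestinationCidrBlock='0.0.0.0/0', GatewayId=igw_id)
ec2.associate_route_table(RouteTableId=rtb_id, SubnetId=subnet_id)

# the default ACL and a default Security Group are created along with the VPC automatically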

Is that all you need in the VPC? Well, if you aren't planning on using multiple Availability Zones, then sure. But of course you are always going to plan on using more than one Availability Zone, as without doing so you run the risk of that dreaded concept of 'unplanned maintenance'. AKA 'outage'. AWS provide a guarantee that unless there is a massive hurricane or earthquake, a 'region' such as US West 2 or Oregon will always be available. The fine print of course specifies that a single Availability Zone may not be up 100% of the time, but that within that region at least one of the Availability Zones will be up and running. This means that unless you use multiple Availability Zones you could find yourself in an outage scenario. I always recommend having a minimum of two instances (spread out across two zones) as a result.

Basic High Availability architecture dictates that you must be able to have one segment of your systems removed at any time, with the other remaining segments able to withstand the additional throughput caused by that segment loss. If you have two instances and can't withstand the loss of one, then bring up a third instance. And so on... This can also be augmented by running in multiple regions, but that's a whole different story.

What does that mean for us? Well, given we want more than one Availability Zone, we will have to add some more subnets as a subnet is tied directly to the networking equipment held within a specific Availability Zone. A further digression - while the AWS names are Region and Availability Zone - I believe we would call them massive datacenters that are geographically separated to ensure availability of the region overall. When you see 'Availability Zone A' think 'one large building with many floors full of servers, networking and HVAC equipment'. When you see 'Region' think 'multiple buildings in different parts of the city'. AWS run at a different scale to most companies.

So next up is setting up the remaining subnets.

Adding more Subnets

This is fairly easy to accomplish - switch to the Subnets view in the AWS VPC console:

vpc - subnet view 1

Click the Create Subnet button to start, and fill the dialog in accordingly. For my second development subnet in zone b, I call it development-b. I choose the relevant development VPC, 10.0.0.0/16 | development, and specify the us-west-2b Availability Zone. Choosing the CIDR block or IP address range is the hardest part as it usually involves math (or memory if you work with networking enough). The next range after 10.0.0.0/20 is 10.0.16.0/20, which covers the 10.0.16.0 to 10.0.31.255 IP range.

vpc - subnet create b

Now that we have created our second subnet for zone b, go ahead and do the same, but for zone c instead. The CIDR range for zone c is 10.0.32.0/20.
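If you would rather script the extra subnets than click through the console, a boto3 sketch might look like the below - the VPC ID is a placeholder for your development VPC:

import boto3

ec2 = boto3.client('ec2', region_name='us-west-2')
vpc_id = 'vpc-12345678'   # hypothetical ID of the development VPC

# development-b and development-c, one /20 per Availability Zone
for name, cidr, az in [('development-b', '10.0.16.0/20', 'us-west-2b'),
                       ('development-c', '10.0.32.0/20', 'us-west-2c')]:
    subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock=cidr, AvailabilityZone=az)
    ec2.create_tags(Resources=[subnet['Subnet']['SubnetId']],
                    Tags=[{'Key': 'Name', 'Value': name}])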

This should leave you with 3 subnets with descriptions to match:

vpc - subnet view 2

Route Tables

When you used the VPC Wizard to create the new VPC, AWS created two Route Tables for you. One is configured as the 'main' route table with no subnets associated with it, with the other being more specific and having that initial subnet (development-a) associated with it. What this means is that our two new subnets (development-b and development-c) are not currently associated with a specific Route Table, and therefore will use the 'main' one. Be aware that the 'main' Route Table created via the wizard does not have an Internet Gateway associated with it, so servers in subnets using the 'main' Route Table will not be able to talk to anything other than the instances within that same VPC.

Usually I change this so that the three subnets share the same routing mechanism as the initial subnet and therefore have an Internet Gateway. When I'm setting up servers that should not be able to talk to the Internet I put them in subnets associated with the 'main' route of course. In the examples I'm creating however I will want access to the Internet, so to change the association, select the Route Table (from the Route Table menu option on the left) and click the Edit button:

vpc - route tables edit 1

Tick the boxes next to the subnets not currently associated and click Save.

vpc - route tables edit 2

You can keep each subnet in its own Route Table, but that adds configuration effort later; usually you will want instances in the same category to be able to talk to each other easily, hence sharing the same Route Table is important.
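The same association can be scripted; a small boto3 sketch, with the Route Table and subnet IDs as placeholders:

import boto3

ec2 = boto3.client('ec2', region_name='us-west-2')

route_table_id = 'rtb-12345678'   # hypothetical: the Route Table with the Internet Gateway
for subnet_id in ['subnet-bbbbbbbb', 'subnet-cccccccc']:   # hypothetical development-b and -c IDs
    ec2.associate_route_table(RouteTableId=route_table_id, SubnetId=subnet_id)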

Other VPC Components

There are a few other VPC components that can be configured, but they are usually for more advanced scenarios and I'll only cover them as I get to specific cases that use them. This includes:

  • DHCP Option Sets
  • Endpoints (S3)
  • Peering Connections
  • Customer Gateways
  • Virtual Private Gateways
  • VPN Connections

The Internet Gateway would need further discussion if we hadn't used the VPC Wizard. The Internet Gateway component allows each EC2 instance to talk to the Internet, and also allows the Internet to talk to it (if the appropriate Security Groups and ACLs are configured). Adding one is pretty easy: create the Internet Gateway, attach it to the relevant VPC, then add a route in the relevant Routing Table that targets the Internet Gateway. A VPC can only have one Internet Gateway associated with it, so you only have to do this for an overall VPC once. Routing Tables, as previously discussed, can be configured to see that Internet Gateway or not, as the purpose of the Routing Table dictates.

ACLs also deserve an updated mention. ACLs are stateless, which means that if you are using them in a 'deny all, allow only what is needed' mode (another security 'best practice' concept), then you need to specify both incoming and outgoing port and IP combinations. This can get tricky, especially if the application and protocol being used don't come with good documentation on which ports are used. Security Groups, being stateful, are far easier to work with of course.

Which does bring us on to Security Groups. Rather than cover them further here, Security Groups are more important from an EC2 point of view, and I'll cover them as part of the various EC2 posts.

Last, but not least, are Elastic IPs. These are used to keep the same IP address in use while being able to swap out which specific instance is running in the background. Unless you are using an alternative HA system such as Round-Robin DNS, this implies a SPOF (Single Point of Failure) implementation and, again, unless there is a good technical reason to do so, should be avoided, especially in production and public-facing systems.

SSL certificates for platform (or web) servers should, in my book at least, reside on an ELB, so I've never considered them a good reason for having multiple IP addresses and network interfaces in use on a single instance.

If you can deal with a short-lived DNS A record (most systems can!) then the bulk of the other reasons behind needing an Elastic IP can be avoided completely. Implementing custom DNS names as part of creating an EC2 instance (this is very easy when also using Route53) is the way to go; it avoids the more complicated dance of removing an Elastic IP from a server you are about to decommission and attaching it to a new one without causing an outage or requiring planned maintenance. If you put the effort (and architecture) in place, you too can achieve a zero-downtime system. Elastic IP swaps usually involve some sort of downtime, even if it is limited to only seconds.
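As an illustration of that approach, here is a boto3 sketch that UPSERTs a short-TTL A record for a freshly launched instance. The hosted zone ID, record name and IP address are all placeholders:

import boto3

route53 = boto3.client('route53')

route53.change_resource_record_sets(
    HostedZoneId='Z1234567890ABC',   # hypothetical Route53 hosted zone ID
    ChangeBatch={'Changes': [{
        'Action': 'UPSERT',
        'ResourceRecordSet': {
            'Name': 'app.development.example.com.',   # hypothetical record name
            'Type': 'A',
            'TTL': 60,   # short-lived so a replacement instance takes over quickly
            'ResourceRecords': [{'Value': '10.0.16.25'}],   # the new instance's IP
        },
    }]},
)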

Creating the other VPCs

Well, now that you have what is needed for the first VPC (10.0.0.0/16), you should move on to setting up your Operations and Production VPCs.

There is no hard and fast rule for what should be used as the overall IP ranges, but as we have lots to work with, I usually recommend separating out the other VPCs just in case you do want to have multiple development or production environments.

I'd recommend 10.50.0.0/16 and 10.200.0.0/16 or similar for Production and Operations respectively. What you use isn't that important as you will usually reference the internal IP addresses by the VPC name (or subnet name) and hardly ever as the actual IP addresses themselves.

In the examples going forwards I will presume you have gone with 10.50.0.0/16 and 10.200.0.0/16 of course. After setting them up, you should have a subnet list that looks similar to the below:

vpc - subnets view 3

Don't forget to make sure the subnets are attached to the right route and you will be good to go!

]]>
<![CDATA[Setting up an EC2 SSH User]]>
http://doatt.com/2015/05/11/setting-up-an-ec2-ssh-user/3e87fe99-73d3-4311-b5b6-23e8441d05afSun, 10 May 2015 23:56:01 GMT

There are many approaches to connecting to servers long-term. For Linux, the default is to use SSH without a password. All the examples will be based on Linux servers, so SSH is what we will use.

While access to a server with only one form of authentication (SSH in this case) can be a security nightmare, security is never about just one type of control, but rather a series of controls that (ideally) overlap.

In these examples, we will be relying on SSH without using a password, and will couple this with the use of AWS Security Groups to ensure that access to any instance over SSH is restricted to only specific IP addresses. That's not all of course (any machine I allow to connect is a potential attack vector, including my workstation), but IP restrictions are one of the best secondary protections available. Accordingly, when creating a new instance I ensure it has a restrictive Security Group enabled. This of course is a post for the future, so let's get back to SSH keys.

SSH keys are based on a public and private key pair (for every private key that you need to protect there is a public key you can give away safely), which is quite different to the traditional username/password system you may be used to. A public/private key pair is an example of asymmetric-key cryptography, in that one side of the key can be stored in many locations and shared (hence 'public'), with the private side of the key pair protected. A password, by contrast, is the basic example of symmetric-key encryption, where the password (or key) is the same on both sides of the encryption and decryption.

Knowledge of a password allows authentication no matter where it is used. For SSH, only knowledge of the private key is considered important: anybody can have access to the public key, but you can't use that public key to authenticate against another system that only has the public key. Thus you can put the public key anywhere without creating a massive security vulnerability.

Hopefully the above gives you the background to make the statement 'keep the private key secure' more than just a phrase and something more understandable.

Before I forget to actually say it, 'keep the private key secure' :)

Within the AWS console, navigate to the EC2 panel and the SSH 'Key Pairs' section - the key is either created or imported into the EC2 panel, as this is how AWS enable a key for use against a standard AMI (Amazon Machine Image) when an EC2 instance is created.

ssh keys - menu

If you don't already have an SSH key, one can be created within the AWS console by using the 'Create Key' button:

ssh keys - create

Give your key a good name that will make sense long-term - I'd steer away from using a nonsensical name as knowing what a key is used for can be very important.

As part of creating the key, the private key component is saved either to your default download location, or you will be prompted to save the file. By default, whatever you call the key in the above dialog becomes the file name, with the extension .pem. For example, if you call a key ec2access then you will receive a file called ec2access.pem.
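The same can be scripted; a minimal boto3 sketch (using the ec2access name from above) that creates the key pair and writes the private half to disk might look like this:

import os
import boto3

ec2 = boto3.client('ec2', region_name='us-west-2')

# AWS keeps the public half; the private half is only returned once, right here
key = ec2.create_key_pair(KeyName='ec2access')

with open('ec2access.pem', 'w') as pem:
    pem.write(key['KeyMaterial'])
os.chmod('ec2access.pem', 0o600)   # ssh refuses to use keys that are world-readable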

Alternatively, if you already have an SSH key you would like to use, then you can import the public side of the key pair you already have. Click the 'Import Key Pair' button, and use the 'choose file' button in that dialog to start the import process - just make sure you choose your public key:

ssh keys - import

Now that you have a key listed, you can create a new EC2 instance, and by using that key, are able to access it.

Next up, we will create the needed VPC and subnets required. More on that next post!

]]>
<![CDATA[The Build Methodology Decision]]>
http://doatt.com/2015/04/27/the-build-methodology-decision/d1a79ef4-3b75-4ab2-9da3-9e1cc6c7afdbSun, 26 Apr 2015 20:44:00 GMT

There are many different approaches to creating an application build. Obviously I go for full automation of the end-to-end process (and even then there are still multiple approaches to be thought through), but the sad reality most people live in is that builds are anything but automated, or if they are, it is only certain steps and not end-to-end.

Hopefully you are planning either how to get out of any manual steps you have already, or are in the enviable position of a green-field project and starting from scratch. Not getting bitten by a massive people cost long-term is always a good thing in my opinion, and more often than not, 'manual' activities within DevOps involve a higher people cost.

This brings me to the general philosophy I have when it comes to DevOps and servers:

if you are logging in to a production server manually to make changes, then you are doing it wrong

Of course I also have a general philosophy that revolves around having automated scaling, redundancy, and great monitoring and alerting to ensure that all reasonable non-engineering aspects are covered. Fundamental issues with application code are never something that DevOps can be expected to cater for. If you can mitigate getting a hotfix deployed by automating the deploy itself, then the expectation that hot-fixes can only be applied manually goes right out the window. You will know you have it 'right' when the question isn't 'how many hours of downtime do you need' but instead 'can we have the zero-downtime rolling deploy take less than the existing 9 minutes to complete?'.

If we eliminate manual interactions, the decision on how to accomplish an automated application build is limited to only a few choices, which follow some pretty basic requirements:

  • No human can be involved in an application deploy outside of source control (GitHub in my case)
  • No production environment build can occur without first being tested in a different (and identical) environment
  • Configuration differences between non-production and production must be catered for
  • At a minimum, security patches must be applied
  • Application dependency versions must be kept up to date on a regular schedule
  • Alerts based on monitoring of errors must be reacted to

All of this works hand-in-hand, in that if any one requirement is missed, the people required for a production environment can move from 'is maintained by few and is stable' to 'is maintained by many and is unstable' very quickly.

So what are the choices? The most common methodologies I've encountered over the years are:

NoImage
  • Every deploy creates a new OS from scratch and adds the application afterwards
BaseImage
  • A common 'base image' for OS
  • Specific application installs based on server function applied afterwards
GoldImage
  • A full OS + application install
  • Kept up to date separately for all aspects of that OS + application combination

The NoImage approach can provide some advantages when something breaks during an upgrade of an OS component in that it is sometimes easier to find what went wrong during the OS build when that script breaks (in your test environment of course!). Years ago when the OS (and especially the network) was more often an issue than not, this approach made perfect sense. In the last few years I've found that the modern OS is more often than not very stable when using well-supported hardware, so for me this has become less of an issue.

The BaseImage approach can help on standardising what components are rolled out at an OS level, including any common application dependencies that are required. This is particularly good at saving time on updates when you have many different applications using the same OS, but not so much when you have fewer applications, or even applications that require different OS types. If you are trending towards a 1:1 application-to-OS ratio, then this may not be the best approach.

The GoldImage is perhaps the fastest for deploys, but also can create the largest investment in time keeping all the various images current with updates.

Due to having 5+ applications that all work on the same OS, I personally have gone with the BaseImage approach, as the time saved in a GoldImage deploy is counted in seconds per image (for example, Java SE/JRE 1.7 vs 1.8 is the only non-consistent dependency I have to deal with, and the install script takes only 14 seconds on average to install Java). Your mileage may vary of course!

Remember, when choosing our deploy methodology, that there are differences between non-production and production environments. The easiest example is 'which database instance am I using' - to keep one of the most fundamental security concepts in place, you can never have your non-production environments connect to your production systems. Having an operational server that connects to both is a different topic/post for another day, but nevertheless, keeping non-production and production separate is critical. Therefore encoding the database connection information within a gold image is not a good idea.

There is an inherent danger in having a separate gold image for non-production from your production image - if you aren't using the same fundamental image in both, you are setting yourself up for a long-term headache in my experience. As such, the last few minor configuration pieces need to be applied as part of the deploy, which again removes some of the minimisation that a gold image might otherwise provide over a base image methodology.

Given I have personally gone with the BaseImage approach, the following posts will of course be using that methodology. Hopefully the above gives you some inkling into the thinking and experience behind the decision I have made - there is no 'right' way, just a few 'best for my company' and a large number of alternatives that aren't quite as good.

Next up, SSH Keys. Because without a way of authenticating to our initial base image server, we wouldn't get very far at all!

]]>
<![CDATA[Ghost 6.0 Upgrade]]>
http://doatt.com/2015/04/20/ghost-6-0-upgrade/d8d53f40-df52-40d4-98e6-d950970e980cSun, 19 Apr 2015 18:23:26 GMT

When I first started writing my posts, I was using Ghost 5.5, and intensely missed the ability to spell check. Ghost 5.5 was using a weird and wonderful browser editor that meant the inbuilt spelling checkers within current browsers were unable to actually function.

I've run through the posts many times since I started writing and corrected spelling mistakes where I could find them, but I'm really bad at proof-reading my own work without some virtual help. Hopefully that wasn't too obvious or annoying!

Eventually I moved to using Atom as my editor for all new posts, because at least it had a spell check feature available. It had its own annoyances, as I like writing in markdown (my Atom and the Never Ending Italics on _TARGET post is a good example), but it did give better quality from both the writing and reading perspectives.

Well, Ghost 6.0 was finally released a little while ago, and I've just got around to installing it and seeing what happens. The good news is that, with only a few minor modifications to my website deployment script, all appears to be up and running in the new version without any major issues.

The best news is the new editor within Ghost does allow the browser to identify (and therefore easily fix) spelling mistakes using the in-built spell checker. One more pass through all my original posts and all but one needed minor modifications (yeah, I'm really bad at spelling) - it turns out the spelling checker in Atom wasn't as good as I thought it was either!

Now that I have moved back to typing directly in the Ghost interface I can say I'm much happier - copy/pasting between Atom and Ghost was never ideal even if it did solve a fundamental flaw that Ghost had implemented originally and taken well over a year to fix.

Of course publishing this post will be the ultimate 'tell' that all is really working right? :)

Now, off to run the publish script and see what happens.........

]]>
<![CDATA[Optional Step - Add A Custom Domain ($ Cost Involved)]]>
http://doatt.com/2015/04/06/optional-step-add-a-custom-domain/d3f3a50a-ffb7-436f-bc16-86c87222d496Sun, 05 Apr 2015 18:27:20 GMT

This is an optional step which will cost you real money if you move forward with it. While the free tier is manageable and can truly be 'free', adding a domain in and of itself costs money, but it is also the only way to take full advantage of Route53. Sadly, Route53 also costs money to use above and beyond the free tier, in that each domain costs USD$0.50 per month at a minimum. DNS queries over time can cost a few cents (or if you have a popular domain, dollars, or if you have a really popular domain, many dollars!).

All that said, I'm going to presume that you will have a domain, and that you will use Route53 when I move through using the service. There is a very simple reason for this - AWS services integrate with each other very well, and Route53 is one of those central pieces that really do connect all the dots.

One point to note - it is easy enough to transfer an existing domain from a different registrar to AWS; very similar, in fact, to the steps below. You don't actually need to do this in order to take advantage of the Route53 DNS service, so you can ignore the below and not transfer the domain at all. Instead you could just move your DNS service over to AWS and still take advantage of the integrations.

A key item to know when running a domain migration is that AWS use Gandi as their registrar of choice, so when you migrate one of the more complicated domains such as .co.uk, you need to know that information in advance. Otherwise you may make the same mistake I did when transferring my first complicated domain and choose one of the Amazon-related names in the international list of registrars. Amazon have one, but use it for internal domains and not for the public-facing Route53 service. Best to know before migrating right? :)

So with no further preamble, let's get on with registering a domain!

After switching to the Route53 dashboard in the AWS web console, you will be greeted by a 'get started now' dialog (or if you already have a domain, an actual dashboard). The vanilla greeting dialog (with a focus on registering a domain) looks like so:

gmail - step 3

You can either click the 'Get Started Now' button, or click on the 'Registered Domains' menu item on the left. If you do you can then start registering a domain by clicking the 'Register Domain' button at the top. This second method is how you will register your 2nd domain and beyond:

gmail - step 3

Now for the really hard part. Finding a domain that you like, that is also available. For this example, I'm registering a domain called devopsallthethings.com which ultimately I'll redirect to my main doatt.com domain. I like less typing and was lucky enough to get a nice short domain name.

Amazon are quite clear about what the pre-tax cost of a domain will be, so choose wisely - both so that others can type the domain name in easily, and so that they can remember what the domain was actually called. You know your market better than I, so good luck with the domain name search!

Clicking the 'Check' button will help you find a domain that is available:

gmail - step 3
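If you would rather check availability from a script, the Route 53 Domains API exposes the same check; a small boto3 sketch (note that this particular API is only served out of us-east-1):

import boto3

# the Route 53 Domains API is only available in the us-east-1 region
domains = boto3.client('route53domains', region_name='us-east-1')

result = domains.check_domain_availability(DomainName='devopsallthethings.com')
print(result['Availability'])   # e.g. AVAILABLE or UNAVAILABLE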

Once you have found a domain that you like, click the 'Add to cart' button:

gmail - step 3

Next up, you need to enter contact information. Some (but not all!) domains come with privacy protection for domains registered as a person rather than a company. If that is you, use the 'Person' option when prompted. If you are part of a larger organisation, then you probably have somebody else registering the domain for you, or know who to ask about what to put in for the contact details. Obviously I'm not going to make it easy to find my actual contact details, so I'm skipping showing the full form with all the information listed.

Let me just stress here, that correct information is paramount. Putting in fictitious addresses is a bad idea, and putting in an incorrect email address is worse. The email address will be used in the future for domain related information, so is very important.

gmail - step 3

Once you have entered your contact details, Amazon are going to get you to confirm that you are willing to part with actual money. Domain registrations happen outside of the normal monthly billing cycle with AWS, so you will get an immediate charge, and a yearly one afterwards for a similar amount. They also outline the expected fees you should be aware of - again, Route53 is not free (although USD$0.50 for a month of DNS is pretty good value given the infrastructure AWS use), so be aware you are setting yourself up for some costs here. You, as always, are in control :)

gmail - step 3

Now that you have been asked a few times to confirm all these steps, you have to (finally!) complete your purchase:

gmail - step 3

This brings up a nice 'what happens next' screen. For most standard domains you should expect to have the domain up and running in minutes - Amazon Web Services like to cover their bases and say allow up to 3 days just in case, of course. Some of the more obscure top level domain (TLD) extensions do actually take a few days, so your mileage will vary depending on which TLD you choose.

gmail - step 3

Once Gandi (the registrar AWS use in the background) has moved the domain registration to the next stage, you will receive an email to the primary account contact that you listed earlier. This is a good check to make sure you are receiving the emails related to the domain.

Nothing to do here other than wait a bit for the domain to finalize on the AWS side of things in their console.

gmail - step 3

Once the domain is finalised, and if you used AWS to register the domain, you will see an entry similar to the below:

gmail - step 3

Note that AWS automatically set up Route53 for you - this was detailed earlier on in the process as part of the costs associated with buying the domain, so should not be a surprise. If you are feeling adventurous, you can now click on the Route53 entry and see the NS (nameserver) and SOA (Start of Authority) records that were automatically created for you.

gmail - step 3

We will be back in here many times as we use other AWS services such as S3, CloudFront and even EC2. But that is information for another day...

]]>
<![CDATA[AWS Shortcuts: Sometimes Overlooked But Very Useful]]>
http://doatt.com/2015/04/05/aws-shortcuts-sometimes-overlooked-but-very-useful/3d27c6ee-7a5b-4e10-a8ef-06ff8e5ba0a2Sun, 05 Apr 2015 03:50:00 GMT

So while there are 38 (and counting!) distinct AWS services, I generally find that I only use around a third of them.

That said, with 10 or more frequently used services, navigating between them can be intimidating. AWS do give a way of adding shortcuts to the top menu, however by default they also enable the name to show next to the icon. If you are only using two or three services this isn't a big deal, but after 5 or 6 you start to run out of space.

The default view looks like so:

gmail - step 3

When dragging the icons up to the bar, it's easy to miss the settings menu, which sits on the right-hand side:

gmail - step 3

Switching this to Icons only then minimises the width each option takes up - which shortens the toolbar to look like this:

gmail - step 3

You don't have to worry about the icons being confusing. At first they are, but after using the various services consistently for a while you will not only get used to which color means which service, but also which piece of the puzzle so to speak each icon represents. I've found most complement each other in some shape or pattern - almost feels deliberate really!

]]>
<![CDATA[Automation Server - First Steps]]>
http://doatt.com/2015/03/23/automation-server-first-steps/be8c58f2-a7ca-4c34-ae8b-5915cf9478b9Sun, 22 Mar 2015 18:45:41 GMT

Before you can create a server, you need to have an environment to set that server up in.

Given we are using Amazon Web Services, you will need to have a key ingredient - an email account. You thought I was going to say an AWS account, right? You do need one, but it's not the key ingredient!

I recommend having an email account that has 'plus' addressing. This is where emails that go to user@domain can also be sent as user+text@domain as well. The alternative of course is to have aliases or even multiple email accounts such as text@domain. Plus addressing is a personal preference.

The reason for wanting multiple email accounts is easy to describe:

  • You will have a primary AWS account that you really shouldn't be using once the overall system is set up - this needs a unique (and locked away) email address
  • You will want to have AWS accounts per person working within AWS
  • You will probably want (is that 'may' want?) to use rules within your email client to filter alerts and notifications - unique email addresses are great for this, so plus addressing means fewer actual email accounts in use

I love plus addressing because it means fewer email accounts in use (some email providers charge per account), as well as less configuration in a mail client to get the relevant emails. In larger organisations cost isn't such a big deal - I'm an extremist when it comes to trying to save money though!

For my examples I'm using a gmail account called devopsallthethings@gmail.com. This gives me plus addressing of course. Please note this is an account I have set up for demo purposes, so don't bother emailing it - all emails are ignored. No, my birthday is not the epoch either (and it was changed in the form afterwards!) ;)

I recommend using the Mobile Phone option as well as using the 'Prove you're not a robot' option and enabling multi-factor (2-step as Google call it) authentication for security reasons. For this demonstration the core AWS account is linked fully to this email address, so you should do everything you can to secure the email account.

An example from early 2015 of the gmail account creation screen:

gmail - step 1

As mentioned, you should then go and secure your account - you do this by clicking on your email address in the upper-right of the browser and choosing the 'Account' link like so:

gmail - step 2

Then, enable 2-step verification - scroll down the list to the 'Signing in' section and click the 2-Step Verification row:

gmail - step 3

Follow the prompts - it's a great system and very easy to use.

Right, now that you have an email account (or already had one you were planning to use, and the above was rather boring!), you are ready to sign up for an AWS account - again, if you haven't got one already ;)

First, go to the Amazon Web Services (AWS) home page and click the 'Create an AWS Account' button which is usually in the top right corner of the page.

You should end up seeing something like the following:

AWS - new account step 1

Note that I use +awsmaster as my plus addressing name to denote this is the master account for AWS. I only really want to log on with this account once, and then lock it away and never use it again.

AWS - new account step 2

You then need to put in your contact information. No screenshot for obvious reasons :)

Next, you need to put in a credit card. Don't worry - as long as you stop the servers and only use t2.micro instances, you shouldn't hit any of the limits and should never get charged. Of course, if you do get charged, it will be for reasons that are fully under your control, so just remember you are in control of your own expenses!

This is also why I keep pointing out you need to secure your account - using a weak password and no multi-factor security on both your AWS and email account can easily lead to very large bills as there are people out there who would love to use a 'free' (free to them) account for nefarious purposes...

You will then be prompted to confirm your telephone number. This is easiest when spoken, in my experience; while you would expect the touch-tone numbers to be clearer, I've always had problems and had to speak the code instead.

Once you are past giving tons of personal and financial information over, you are then asked which support plan you would like to use.

Free right? That's my choice for the demo!

AWS - new account step 6

Well, now that is complete, you can log in to your newly minted AWS account. Or log in to the AWS account you already had. Either way works.

Log in to the raw console for the first (and hopefully the last) time via https://console.aws.amazon.com/

The AWS console is pretty scary to the uninitiated - after using it for a while you will have some shortcuts to the services you are actually using, but for now there is only one option you need to click. This is the Identity & Access Management (IAM) option which is under the 'Administration & Security' category:

AWS - IAM

So now that we are in the IAM dashboard, I find that, after creating an easy-to-remember console URL, following most of the steps AWS outline in the Security Status screen is the best bet:

AWS - IAM - Security Steps 1 of 5

Before we do that though, let's customise the IAM URL we will use (a lot!) in the future. Click the 'Customise' option and choose something relevant to you and/or your organisation:

AWS - IAM - Console Link

This gives us a slightly more memorable link than the one AWS gives by default (the account number basically):

AWS - IAM - Console Link

So, on to the first security status step - Activate MFA on your root account. I personally use Authy. Again, this is tied to a secured email account, as the email account itself is a great target to compromise if you are a hacker - and by 'compromise' I basically mean 'gain access to your email account'... Activating MFA can be undertaken in so many different ways I'm not going to walk through it - the steps are nice and simple and outlined clearly as you go. Plus, having screenshots of a valid QR code would just be silly.

Once you have secured your master or root account, the next step is to create a user account for yourself.

AWS - IAM - Security Steps 3 of 5

Creating a new user is very straightforward - with the main caveat that you should choose a forward-thinking name (do you need to prefix by department or even purpose?). For this example I chose a user called adminallthethings. Note that generating an access key is important if you are going to use the API or console tools, but not important for your 'web interface' user account. As a result I recommend unchecking the 'Generate an access key for each user' option.

AWS - IAM - Create User

Now there are three more steps you have to apply to this new user account before you can use it for real. The first is attaching the 'admin' policy by clicking the 'Attach Policy' button inside the Permissions section. Using the in-built Managed Policies is a good idea for this step.

AWS - IAM - Managed Policy - AdministratorAccess

This gives the account full access to everything, which means you need to secure this new account just as well as the master or root account. The only difference that I'm aware of (and I could be wrong!) is that only the master account can delete the entire account. Any account with write privileges to systems such as EC2 can create havoc, so having administrator (or not having administrator) access isn't always such a big deal. Treat every account as if it can destroy everything you are hosting and you should be as good as you can get!

A quick view of what this means to AWS, as it's basically just a JSON blob of text that says 'give this account access to all the things':

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "*",
      "Resource": "*"
    }
  ]
}

We will cover how policies are written later, as we use them a lot within the automation scripts to manage what servers (or accounts) can do.

The penultimate step on the account is to add a website password by using the 'Manage Password' button inside the Security Credentials section at the bottom of the user account. Since I use MFA on all my website accounts the complexity of the password isn't as big a deal as it could be. I still recommend using a password manager and using an initially randomised (and long) password that you never have to try and remember.

Finally for this new account, associate an MFA device using the 'Manage MFA Device' option. This is the same process as the one you just undertook for your master account, just with a second MFA entry.
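Most of those user steps (bar the MFA enrolment, which is easier in the console) can also be scripted. A boto3 sketch - the user name matches the example above, and the password is obviously a placeholder:

import boto3

iam = boto3.client('iam')

iam.create_user(UserName='adminallthethings')

# attach the in-built AdministratorAccess managed policy
iam.attach_user_policy(
    UserName='adminallthethings',
    PolicyArn='arn:aws:iam::aws:policy/AdministratorAccess',
)

# give the user a console password (use a long, randomised one from your password manager)
iam.create_login_profile(
    UserName='adminallthethings',
    Password='replace-with-a-long-random-password',   # placeholder
    PasswordResetRequired=True,
)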

There is another step in the IAM Security Status table - Use groups to assign permissions - and we will cover this in a future post. I use it a lot in how things are configured, so we will come back to it more than once in the future.

The final Security Status table item is 'Apply an IAM password policy'. If you mandate that all AWS website access is undertaken using accounts that have MFA enabled, this is not that important in my opinion. I would still set it to eliminate some of the more moronic passwords such as Password1 that are used so frequently in production systems - just make the minimum 12 or more characters and turn all the options on if you are feeling mean :) Do require MFA though - passwords are too often written down electronically.

Now don't log out too quickly - the last item of business you may want to add to the above is enabling billing access to your equivalent adminallthethings account.

AWS - Billing & Cost Management Menu

Adding this will mean that your administrator account can manipulate billing information - keep that in mind when thinking about adding this option; it will apply to all administrator-level accounts that you may create in the future, so it is a fundamental decision.

To add the ability to see (and manipulate) the billing information from your adminallthethings account, you first choose the 'Account Settings' option from the menu on the left:

AWS - Billing - Account Settings Menu

Then, scroll down to the IAM access section:

AWS - Billing & Cost Management IAM Access

It's as simple as clicking edit, ticking the box that activates IAM access and clicking Update.

There are other options in there that you can also look at configuring (Security Challenge Questions, for example, are important for accounts that have real value). This is a demo though, so I'm not going to go over that in detail today. You may want to unsubscribe from all the marketing emails though ;) I have them enabled on other accounts, so if this is your first account, make sure you get them, as most are quite interesting if you are using AWS long-term.

So we are pretty much finished with creating the account. Before we stop for this post though, log out of your master account and log in as your new user. Use the memorable link (it's visible within the IAM Dashboard view if you can't find it immediately) and log on with that new account (and don't forget to use your MFA!):

AWS - adminallthethings login

Right, done. Ready for the next step. Once I write it up and post it ;)

]]>
<![CDATA[Postgres 9.3 min/max for the cidr data type]]>
http://doatt.com/2015/03/16/postgres-minmax-for-the-cidr-data-type/cfbcb6b4-7c75-4022-8c2e-87f9e3d33296Sun, 15 Mar 2015 22:52:30 GMT

I've become a big fan of PostgreSQL (aka PostgresDB or just Postgres depending on where you look). Over the years they have added some of the best database technology to the point where if I do need a relational database, I will use PostgreSQL without a second thought.

I also believe that you should use a relational database for data that has relationships - you know, when you might want to compare two pieces of data against each other in some way or other.

I've found that all data is relational... ;)

I digress. This post is about the CIDR data type that PostgreSQL makes available, and how (in 9.3 and below) there is no min or max function inbuilt into the data type. That may be coming with 9.5 I think, but it's not what we are using today.

CIDR, or Classless Inter-Domain Routing if you want to be all formal about it, is a common way of detailing out IP subnet ranges. Sadly PostgreSQL in version 9.3 and below does not have a default method for determining the minimum IP address and maximum IP address within a CIDR range. Using a 'trick' it's actually quite easy to determine this for yourself, but it's a) non-obvious, and b) not documented in a straightforward way that I've seen. Hence this post...

So on to my use-case. MaxMind helpfully provide a set of GeoIP databases related to what countries and/or cities any specific IP address is associated with. GeoIP identification is a common thing to use in an application and specifically the free country geoIP db from MaxMind is pretty accurate (higher than 90% I've seen said on the Internets, so it must be true!).

Their geoIP listing is pretty good - for example if you take the first few rows of the 'blocks' data, you get the following:

network,geoname_id,registered_country_geoname_id,represented_country_geoname_id,is_anonymous_proxy,is_satellite_provider  
1.0.0.0/24,2077456,2077456,,0,0  
1.0.1.0/24,1814991,1814991,,0,0  
1.0.2.0/23,1814991,1814991,,0,0  
1.0.4.0/22,2077456,2077456,,0,0  
1.0.8.0/21,1814991,1814991,,0,0  

geoname_id and the like relate to the locations data (country, region, that sort of thing). The key focus here in the blocks table is the network column.

1.0.0.0/24 of course actually means the 256 IP addresses between 1.0.0.0 and 1.0.0.255. In the networking world the 1.0.0.255 address would actually be used for broadcast purposes, but it's still an IP of note - it's just unlikely to ever be seen by a service you are running and wanting to know the geoIP location for.

If you can use PostgreSQL natively, this entire article is pretty irrelevant, as you can use some of the more interesting functions to figure out if an IP address of 1.0.0.57 is within a CIDR range. For example:

SELECT geoname_id FROM geoip_blocks WHERE INET '1.0.0.57' <<= network;  

This will return the geoname_id of 2077456.

Not an issue right? Except if you are using the Java Persistence API (JPA) and therefore using the Java Persistence Query Language (JPQL) you will quickly find that <<= (which means 'is contained within or equals') and the like are not supported. PostgreSQL is just too advanced...

To get around the lack of advanced query operators, you have to fall back to the older BETWEEN query functionality. Except that by default you can't. With no MIN or MAX, it's really hard to use a BETWEEN on a single object that is actually a range of IP addresses.

Ideally you would want to use the MIN or MAX functions to be able to return the minimum or maximum IP address. In PostgreSQL 9.3 and below they do not do this. So you are left with wanting to figure out the MIN or MAX for yourself based on the network CIDR data.

Roll in a little known (or at least not very well documented) trick, and you can do this very easily. Knowing the trick is the majority of the battle. It turns out that if you subtract the address 0.0.0.0 from a CIDR value, it returns the numeric value of the lowest address in the range - in other words, how many IPv4 addresses come before it. In this case:

SELECT CIDR '1.0.0.0/24' - INET '0.0.0.0';  
returns:  
16777216  

Well, that was the easy part right? Getting the minimum is really straight forward. So is the maximum when you use the same technique, but the thinking process you need to go through isn't quite as obvious. If you have worked in the networking field for a while, you will know that the /24 of a CIDR range is the same as the subnet mask 255.255.255.0. You have probably seen this in the ipconfig output of your own network device. What the subnet mask really means is that the first three octets (the 255.255.255. part) are fixed for every address in the range. The last octet is 0, which actually means any number from 0 to 255 can go there - so 256 total possibilities.

This almost helps us right? You could create a CIDR lookup table and do some math alongside the min function to figure all this out. Thankfully PostgreSQL also provides a function called hostmask, which returns the inverse of the subnet mask - 0.0.0.255. This shows us which octets have what number of possible values.

This is helpful because using the trick above, you can do the following to get the maximum number of IP addresses within a single CIDR range:

SELECT (CIDR '1.0.0.0/24' - INET '0.0.0.0') + (hostmask(CIDR '1.0.0.0/24') - INET '0.0.0.0');  
returns:  
16777471  

So now we have a way of getting the minimum or maximum of a specific range, and by translating a single IP address into a number, we can use that information and the BETWEEN clause in a SELECT statement to figure out which network CIDR range an IP address is located within. As an example, our test value of 1.0.0.57 just happens to be the 16,777,273th IP address in the IPv4 world.

SELECT INET '1.0.0.57' - INET '0.0.0.0';  
returns:  
16777273  

16777273 of course sits between 16777216 and 16777471. This is good enough for JPQL as long as the minimum and maximum data is in a table already. Given we have to create and then update the geoIP data tables from time to time (monthly for the free tables, more often for the paid-for subscription from MaxMind), we can easily add a couple of extra columns and update them with the minimum and maximum values.
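
To make the end goal concrete, here is a rough illustration of the final lookup once those columns exist, with the IP already converted to a number (16777273 being 1.0.0.57) the same way JPQL would pass it in as a parameter - the table and column names are the ones created further down in this post:

SELECT geoname_id FROM geoip.geoip_blocks WHERE 16777273 BETWEEN min_ip AND max_ip;  
returns:  
2077456  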

It's all well and good using commands that don't really touch data in tables - so an example of running an update across an entire table to populate the MIN/MAX columns would actually look like this:

UPDATE geoip.geoip_blocks_temp SET min_ip = network - inet '0.0.0.0', max_ip = (network - inet '0.0.0.0') + (hostmask(network) - inet '0.0.0.0');  
returns:  
UPDATE 177615  

Yes, there are 177,615 different network blocks in the MaxMind Country geoIP database as I write this.

So, how best to do this from start to finish? Why, with Ansible of course! Following is the general way I write an Ansible script, and thankfully MaxMind provide an MD5 checksum of their data files, so I use that to decide whether I need to download the data or not.

First in my process, I start with what I need to accomplish in pseudo code:

keep copy last md5 checksum for comparison  
get latest md5 checksum  
compare the md5 files  
if different delete old data files  
if different download new data files  
if different extract new data files  
if different import into database  

This really isn't something I feel a role is good for (it's an infrequent set of commands that we do not need to run very often - maybe once a day at most), so this becomes its very own playbook. I'm using Ansible 1.7.x in this example as there are still a few backwards-incompatibility bugs in 1.8.x as I write this, so just be aware this probably does not work in 1.8.x or higher (specifically the include that uses a variable for the directory). I also don't use some of the core modules (using wget directly is probably a sin in the eyes of the creator of Ansible ;) ) for good reasons - I don't always need to be idempotent, and some of the core modules do bizarre things in my OS of choice.

Then I go through and expand each line (either multiple commands or just one depending on how good my pseudo code skills are on the day). As an example, what originally was

get latest md5 checksum  

as a specific item became:

- name: get latest md5 checksum
  raw: "wget -O {{ geoipfiles }}GeoLite2-Country-CSV.zip.md5 http://geolite.maxmind.com/download/geoip/database/GeoLite2-Country-CSV.zip.md5"
  register: resultingdata
- debug: var=resultingdata

Yes, I used raw - it gives me a consistent result every time I run it, across Ansible version updates (something I've had issues with more than once), and in this case I know I have to run the command - it really can't be idempotent at this point of the script.
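
For comparison, the core-module way of doing the same fetch would look roughly like the sketch below (get_url with force so it always re-downloads). This is just an illustration - as mentioned above, I've had less consistent results with approaches like this across Ansible versions:

- name: get latest md5 checksum (core-module alternative)
  get_url:
    url: "http://geolite.maxmind.com/download/geoip/database/GeoLite2-Country-CSV.zip.md5"
    dest: "{{ geoipfiles }}GeoLite2-Country-CSV.zip.md5"
    force: yes
  register: resultingdata
- debug: var=resultingdata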

Don't worry if the below is confusing - I will eventually explain it all in future posts...

What I ended up with of course was something slightly longer-winded (some lines sanitised of course!):

---
- hosts: localhost
  connection: local
  gather_facts: False
  vars:
    myplaybookname: apb-api-epicdata-prod-geoip-refresh
  roles:
    - globalvars
  tasks:
    - include: "{{ ansiblescripts }}inc-slack.yaml"
      vars:
        slackmessage: "Playbook {{ myplaybookname }} Start"
        slackcolor: "warning"
        aignore_errors_default: no

    - name: ensure oldest md5 checksum is deleted
      file:
        path: "{{ geoipfiles }}GeoLite2-Country-CSV.zip.md5.old"
        state: absent
      register: resultingdata
    - debug: var=resultingdata

    - name: keep copy last md5 checksum for comparison
      copy:
        src: "{{ geoipfiles }}GeoLite2-Country-CSV.zip.md5"
        dest: "{{ geoipfiles }}GeoLite2-Country-CSV.zip.md5.old"
      ignore_errors: true
      register: resultingdata
    - debug: var=resultingdata

    - name: ensure old md5 checksum is deleted
      file:
        path: "{{ geoipfiles }}GeoLite2-Country-CSV.zip.md5"
        state: absent
      register: resultingdata
    - debug: var=resultingdata

    - name: get latest md5 checksum
      raw: "wget -O {{ geoipfiles }}GeoLite2-Country-CSV.zip.md5 http://geolite.maxmind.com/download/geoip/database/GeoLite2-Country-CSV.zip.md5"
      register: resultingdata
    - debug: var=resultingdata

    - name: get old md5 checksum
      raw: "cat {{ geoipfiles }}GeoLite2-Country-CSV.zip.md5.old"
      ignore_errors: true
      register: md5old
    - debug: var=md5old

    - name: get new md5 checksum
      raw: "cat {{ geoipfiles }}GeoLite2-Country-CSV.zip.md5"
      register: md5new
    - debug: var=md5new

    - name: if different delete old data files
      file:
        path: "{{ geoipfiles }}IPByCountry"
        state: absent
      when: md5new.stdout != md5old.stdout
      register: resultingdata
    - debug: var=resultingdata

    - name: if different download new data files
      raw: "wget -O {{ geoipfiles }}GeoLite2-Country-CSV.zip http://geolite.maxmind.com/download/geoip/database/GeoLite2-Country-CSV.zip"
      when: md5new.stdout != md5old.stdout
      register: resultingdata
    - debug: var=resultingdata

    - name: if different extract new data files
      raw: "unzip -j {{ geoipfiles }}GeoLite2-Country-CSV.zip -d {{ geoipfiles }}IPByCountry"
      when: md5new.stdout != md5old.stdout
      register: resultingdata
    - debug: var=resultingdata

    - name: if different import into database
      raw: "time psql -d dbname -h dbhostname.rds.amazonaws.com -p 5432 -U dbusername -t -f {{ ansiblefiles }}geoip-tables.psql"
      when: md5new.stdout != md5old.stdout
      register: psqloutput
    - debug: var=psqloutput

    - include: "{{ ansiblescripts }}inc-slack.yaml"
      vars:
        slackmessage: "Playbook {{ myplaybookname }} SQL stderr\n{{ psqloutput.stderr }}\nSQL stdout\n{{ psqloutput.stdout }}"
      when: md5new.stdout != md5old.stdout

    - include: "{{ ansiblescripts }}inc-slack.yaml"
      vars:
        slackmessage: "Playbook {{ myplaybookname }} End with changed data result {{ md5new.stdout != md5old.stdout }}"
        slackcolor: "good"
        aignore_errors_default: no

If you are used to reading Ansible scripts, you will have already seen that the conditional on whether I should download the bigger data zip is based on a when clause:

when: md5new.stdout != md5old.stdout  

It would be great to do that with a simple 'set a variable so that I can compare the md5 files', but that's not the direction Ansible went, so I improvised and used CAT to accomplish the same concept.

CAT as a command is pretty useful for doing that comparison - there are many other ways of course I'm sure! set_fact is not one of those ways, or at least isn't in all the attempts I've made to use that particular command ( /rant!).

You will also have noticed that there is a SQL file referenced. And for the very astute, no db password :) I use another trick for storing db passwords (a .pgpass file in the home directory of the user running the sql commands) as I hate them being listed in Ansible directly, for what are hopefully obvious reasons.
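
If you haven't used .pgpass before, it is one connection per line in the format hostname:port:database:username:password, and psql will not use the file unless its permissions are locked down to you alone (chmod 0600). A hypothetical entry matching the psql command above (the password is obviously a placeholder) looks like:

# ~/.pgpass - must be chmod 0600
dbhostname.rds.amazonaws.com:5432:dbname:dbusername:NotARealPassword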

The SQL file is a bit more involved, as we have to do a few things (back to pseudo code!):

make sure schema exists  
drop old temporary tables if they still exist from an interrupted run  
create new tables  
go overboard on indexes  
import data  
alter blocks table for min/max values  
update blocks table with min/max values  
delete old data tables  
rename temp tables  
rename temp index and constraint names  

To make sure everything works as intended, and to minimise production interruptions, all this is run in such a way that if it fails, it fails before the table switching occurs. The SQL could also easily be broken up into two consecutive commands - one to create the temporary data, and one that only runs the table switch and renames if the create went well. That would be sensible for data sources where you can't trust that the data will be in the same format (or consistent in quality!) each time. MaxMind in my experience give consistently high quality data, so I didn't feel it was necessary to go to that extreme in this example (my testing did, but that's because I didn't trust my own SQL writing skills!).

The entire script takes ~5 seconds to run, but the really important piece is when the tables are switched, and by changing the method to drop/rename rather than drop/create, our total outage time is less than 0.1 seconds. If desired this can also all be encapsulated within a single transaction, but if your application engineers are good, they will be able to handle a 0.1 second outage within their code. I would never ask them to handle a 5 second outage though - that's just asking for trouble :P

So on to the SQL code:

CREATE SCHEMA IF NOT EXISTS geoip;  
DROP TABLE IF EXISTS geoip.geoip_blocks_temp;  
DROP TABLE IF EXISTS geoip.geoip_locations_temp;

-- GeoLite2-Country-Locations-en.csv
-- geoname_id,locale_code,continent_code,continent_name,country_iso_code,country_name
-- bigint     varchar(2)  varchar(2)     varchar(50)    varchar(2)       varchar(50)

CREATE TABLE geoip.geoip_locations_temp  
(
  geoname_id bigint NOT NULL,
  locale_code                          character varying(2) NOT NULL,
  continent_code                       character varying(2),
  continent_name                       text,
  country_iso_code                     character varying(2),
  country_name                         text,
  CONSTRAINT geoip_locations_pkey_temp PRIMARY KEY (geoname_id)
);

-- GeoLite2-Country-Blocks-IPv4.csv
-- network,geoname_id,registered_country_geoname_id,represented_country_geoname_id,is_anonymous_proxy,is_satellite_provider
-- CIDR    bigint     bigint                        bigint                         int                int

CREATE TABLE geoip.geoip_blocks_temp  
(
  network                           cidr NOT NULL,
  geoname_id                        bigint,
  registered_country_geoname_id     bigint,
  represented_country_geoname_id    bigint,
  is_anonymous_proxy                int NOT NULL,
  is_satellite_provider             int NOT NULL,
  CONSTRAINT geoip_blocks_pkey_temp PRIMARY KEY (network),
  FOREIGN KEY (geoname_id)          REFERENCES geoip.geoip_locations_temp (geoname_id)
);

CREATE UNIQUE INDEX index_geoip_locations_geoname_id_temp       ON geoip.geoip_locations_temp (geoname_id);  
CREATE        INDEX index_geoip_locations_locale_code_temp      ON geoip.geoip_locations_temp (locale_code);  
CREATE        INDEX index_geoip_locations_country_iso_code_temp ON geoip.geoip_locations_temp (country_iso_code);  
CREATE UNIQUE INDEX index_geoip_blocks_network_temp             ON geoip.geoip_blocks_temp (network);  
CREATE        INDEX index_geoip_blocks_geoname_id_temp          ON geoip.geoip_blocks_temp (geoname_id);  
CREATE        INDEX index_geoip_blocks_is_anonymous_proxy_temp  ON geoip.geoip_blocks_temp (is_anonymous_proxy);

-- First the locations table due to the foreign key constraint in blocks
\COPY geoip.geoip_locations_temp FROM '/usrdeploy/IPByCountry/GeoLite2-Country-Locations-en.csv' WITH CSV HEADER;

-- Then the blocks table
\COPY geoip.geoip_blocks_temp FROM '/usrdeploy/IPByCountry/GeoLite2-Country-Blocks-IPv4.csv' WITH CSV HEADER;

-- add min_ip and max_ip columns to blocks
ALTER TABLE geoip.geoip_blocks_temp ADD COLUMN min_ip bigint;  
ALTER TABLE geoip.geoip_blocks_temp ADD COLUMN max_ip bigint;

-- populate min/max IPs
UPDATE geoip.geoip_blocks_temp SET min_ip = network - inet '0.0.0.0', max_ip = (network - inet '0.0.0.0') + (hostmask(network) - inet '0.0.0.0');

-- create indexes for min/max IPs
CREATE UNIQUE INDEX index_geoip_blocks_min_ip_temp ON geoip.geoip_blocks_temp (min_ip);  
CREATE UNIQUE INDEX index_geoip_blocks_max_ip_temp ON geoip.geoip_blocks_temp (max_ip);

-- drop old tables
DROP TABLE IF EXISTS geoip.geoip_blocks;  
DROP TABLE IF EXISTS geoip.geoip_locations;  
-- rename temp tables
ALTER TABLE geoip.geoip_locations_temp RENAME TO geoip_locations;  
ALTER TABLE geoip.geoip_blocks_temp RENAME TO geoip_blocks;

ALTER INDEX geoip.index_geoip_locations_geoname_id_temp       RENAME TO index_geoip_locations_geoname_id;  
ALTER INDEX geoip.index_geoip_locations_locale_code_temp      RENAME TO index_geoip_locations_locale_code;  
ALTER INDEX geoip.index_geoip_locations_country_iso_code_temp RENAME TO index_geoip_locations_country_iso_code;  
ALTER INDEX geoip.index_geoip_blocks_network_temp             RENAME TO index_geoip_blocks_network;  
ALTER INDEX geoip.index_geoip_blocks_min_ip_temp              RENAME TO index_geoip_blocks_min_ip;  
ALTER INDEX geoip.index_geoip_blocks_max_ip_temp              RENAME TO index_geoip_blocks_max_ip;  
ALTER INDEX geoip.index_geoip_blocks_geoname_id_temp          RENAME TO index_geoip_blocks_geoname_id;  
ALTER INDEX geoip.index_geoip_blocks_is_anonymous_proxy_temp  RENAME TO index_geoip_blocks_is_anonymous_proxy;  
ALTER TABLE geoip.geoip_blocks                                RENAME CONSTRAINT geoip_blocks_pkey_temp    TO geoip_blocks_pkey;  
ALTER TABLE geoip.geoip_locations                             RENAME CONSTRAINT geoip_locations_pkey_temp TO geoip_locations_pkey;  

It's also possible to run the above as a full schema swap (i.e. tables/indexes/constraints keep their names; all you do is drop the old schema and rename the temp schema). From a timing perspective this takes around the same time, and in some of my testing was actually 10 to 20 milliseconds slower. Of course some of the time it was slightly faster, so I'll write that off as networking variance within AWS.
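
As a rough sketch of that variant (assuming everything was built in a hypothetical geoip_temp schema instead of using the _temp table names above), the switch itself boils down to:

-- drop the live schema and everything in it, then promote the freshly built one
DROP SCHEMA IF EXISTS geoip CASCADE;  
ALTER SCHEMA geoip_temp RENAME TO geoip;  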

Well there you have it - a real world example of creating MIN/MAX for JPQL usage, while keeping all the awesome functionality of the CIDR type within PostgreSQL for when it can actually be used natively. One day we'll get MIN/MAX as functions within PostgreSQL. One day soon I think, but we needed geoIP queries via JPQL and given the above, we already have that.

]]>
<![CDATA[Amazon Linux and Upstart/Init]]>Have you ever added a sleep or pause into a script to resolve a timing issue? I have, and I have to say I feel kinda dirty every time I do.

One of the more entertaining foibles of using linux within a cloud service, specifically Amazon Linux within AWS in

]]>
http://doatt.com/2015/03/04/amazon-linux-and-upstart-init/3199392e-ef29-4279-a269-57469764d928Wed, 04 Mar 2015 03:15:43 GMTHave you ever added a sleep or pause into a script to resolve a timing issue? I have, and I have to say I feel kinda dirty every time I do.

One of the more entertaining foibles of using linux within a cloud service - specifically Amazon Linux within AWS in this instance - is not easily being able to identify what has changed from the more mainstream forks available. This quite often means that older scripts I've used in the past no longer work. It is resolvable correctly with a bit of effort, or you can do what I found most people who came across this issue with how quickly Node can start ended up doing.

Node starts very quickly. This is good! Or in the case of hands-free automated enterprise scaling, it is sometimes not so good. In a nutshell Node is starting so quickly during the boot cycle, that it starts before the network interfaces are fully enabled.

Roll on to using Upstart to encapsulate handling a Node application, and you can easily get into a sticky situation where Node has started and bound to lo (the local loopback interface) and will happily ignore eth0 or similar when they start up later in the boot cycle. This means nothing can connect to a running Node application over the network, which is usually the point of running Node.

The most common resolution I've seen to date? It's to use a pause of ~30 seconds and hope the network card is already running. When you are running at scale, 'hope' is not a good thing at all.

So what does a 'dirty' init script look like you ask? Well, something like this:

description "An Awesome App node.js server"  
author      "doatt"

start on runlevel [2345]  
stop on runlevel [!2345]

respawn

script  
    echo $$ > /usr/anAwesomeApp/anAwesomeApp.pid
    exec /usr/bin/node /usr/anAwesomeApp/anAwesomeApp.js >> /usr/anAwesomeApp/anAwesomeApp.log 2>&1
end script

pre-start script  
    echo "- - - [`date -u +%Y-%m-%dT%T.%3NZ`] (sys) Starting" >> /usr/anAwesomeApp/anAwesomeApp.log
    sleep 30
end script

post-start script  
    echo "- - - [`date -u +%Y-%m-%dT%T.%3NZ`] (sys) Started" >> /usr/anAwesomeApp/anAwesomeApp.log
end script

pre-stop script  
    echo "- - - [`date -u +%Y-%m-%dT%T.%3NZ`] (sys) Stopping" >> /usr/anAwesomeApp/anAwesomeApp.log
    rm /usr/anAwesomeApp/anAwesomeApp.pid
end script

post-stop script  
    echo "- - - [`date -u +%Y-%m-%dT%T.%3NZ`] (sys) Stopped" >> /usr/anAwesomeApp/anAwesomeApp.log
end script  

Yes, console logging to a file for the win :)

There are two main issues with the above. The biggest is the sleep 30 line. That one is the 'dirty' part of the whole piece. The second is the raw approach to starting based on a runlevel of 2, 3, 4 or 5. I've seen this approach hundreds of times in examples all around the internet. I was even beginning to think that there was no solution in sight outside of using a sleep command.

I've spent a number of hours figuring this out (and isn't hindsight awesome!), running through all sorts of attempts to eliminate the 30 second pause by modifying the start on runlevel [2345] component.

Let's focus on the start/stop commands, of which the stop command can also be written as:

stop on runlevel [016]  

Personally I prefer using ! to imply not.

For anybody who doesn't already know, on Red Hat derived distributions (which Amazon Linux is) runlevel 2 is multi-user mode without NFS/network services - not a state I (personally) would consider fit for the enterprise, and certainly not one a network-dependent Node app should be starting in. Amazon Linux boots to runlevel 3 by default for good reason - you really shouldn't need a GUI when you have a fully automated infrastructure, and (again, personally) I avoid applications that require a GUI to be running on a hands-free server. So to make sure I don't allow my Node apps to run in a runlevel that isn't fully multi-user, there is a fairly simple change:

start on runlevel [345]  
stop on runlevel [!345]  

This of course, can have the stop also written as:

stop on runlevel [0126]  

Yeah I know, I'm mean. :D

According to the official Upstart documentation (both generic and Ubuntu specific), there are a bunch of awesome start on options that can be used outside of, or in addition to runlevel.

The best sounding ones are local-filesystems and the even better sounding net-device-up IFACE!=lo. These are awesome on some other variants of Linux, and I've enjoyed having them available in the past. But do they work with Amazon Linux?

No.

Many electrons were annihilated during reboots to figure out that no combination of local-filesystems or net-device-up and the like worked as documented.

The (actual) good news is, there is the ability to detect core services starting. After much trial and error, and again lots of surfing the intrawebs, I found that none of the standard names for networking services are in use within Amazon Linux. In hindsight the simplicity of the name was obvious - you can believe I was kicking myself when I figured it out...

Amazon Linux uses a networking service called.... wait for it..... network. :sigh:

So, an awesome script that I happen to have simplified so it's easy enough to follow does the following:

  • ensures the app only starts when the server is in a full multi-user runlevel (3, 4 or 5)
  • ensures the server has its network interfaces all running (eth0 and lo for example)
  • limits the amount of log spam that might occur if something does go wrong (all servers sit behind an Elastic Load Balancer right?!)
  • prints helpful information to the log file created that includes a standard date/time format

All good things. Well, maybe not the console log to file approach, but I like having both the console log created as well as any application direct logging that may be occurring, just in case...

So, without further rambling on my part, here is a fully functional Amazon Linux based Upstart/Init script that handles Node starting before the network card itself completes initialisation.

description "An Awesome App node.js server"  
author      "doatt"

start on (runlevel [345] and started network)  
stop on (runlevel [!345] or stopping network)

respawn limit 20 5

script  
    echo $$ > /usr/anAwesomeApp/anAwesomeApp.pid
    exec /usr/bin/node /usr/anAwesomeApp/anAwesomeApp.js >> /usr/anAwesomeApp/anAwesomeApp.log 2>&1
end script

pre-start script  
    echo "- - - [`date -u +%Y-%m-%dT%T.%3NZ`] (sys) Starting" >> /usr/anAwesomeApp/anAwesomeApp.log
end script

post-start script  
    echo "- - - [`date -u +%Y-%m-%dT%T.%3NZ`] (sys) Started" >> /usr/anAwesomeApp/anAwesomeApp.log
end script

pre-stop script  
    echo "- - - [`date -u +%Y-%m-%dT%T.%3NZ`] (sys) Stopping" >> /usr/anAwesomeApp/anAwesomeApp.log
    rm /usr/anAwesomeApp/anAwesomeApp.pid
end script

post-stop script  
    echo "- - - [`date -u +%Y-%m-%dT%T.%3NZ`] (sys) Stopped" >> /usr/anAwesomeApp/anAwesomeApp.log
end script  
]]>
<![CDATA[The Hubot msg Object]]>I've been playing with Hubot lately as part of allowing certain commands to be available to the engineers for deploying environments within AWS. One of the puzzles I have had is the lack of a good reference as to what all the various pieces within the msg object are within

]]>
http://doatt.com/2015/02/19/the-hubot-msg-object/b1380e05-78a3-4f26-9bff-414fa2117a0dThu, 19 Feb 2015 02:22:17 GMTI've been playing with Hubot lately as part of allowing certain commands to be available to the engineers for deploying environments within AWS. One of the puzzles I have had is the lack of a good reference as to what all the various pieces within the msg object are within the Hubot system. I'm sure my google-fu is just really bad, but for the life of me I couldn't find a good and clear listing.

In the end (as I'm still learning coffeescript) I made a really basic Hubot module and used it to write out debug information to the console. This meant I could then view the logs created (Hubot is a node application, so I wrap it up in an Init script so that it runs on a server 24x7) and from those logs determine what the objects within msg are.

My main objective is to ensure that commands are only honoured when called by specific individuals, and occasionally only when they are issued from specific rooms within Slack. This will be implemented using either hard-coded switch statements or, if I really get a round tuit, stored within a database. Hubot is pretty awesome, so I can easily populate Hubot's 'brain' (or just directly check authentication) using a REST API endpoint. The world is my oyster or somesuch!

The script itself turned out to be a lot easier than I thought it would be, so of course I tried writing a recursive object logging function to start with (that worked, but not 100%, and was actually very messy to read). In the end, my script is really basic. You of course get to miss out on the lengthy Google searching and general coding iteration I went through to get (what at times was a good 20 lines of code) down to this:

module.exports = (robot) ->
  robot.respond /debug/, (msg) ->
    console.log Object(msg)

And the output? Well, that is pretty awesome for what I needed (the below is mostly sanitised of course):

{ robot:
   { name: 'hubot',
     events: { domain: null, _events: [Object], _maxListeners: 10 },
     brain:
      { data: [Object],
        autoSave: true,
        saveInterval: [Object],
        _events: [Object] },
     alias: false,
     adapter:
      { message: [Function],
        close: [Function],
        open: [Function],
        userChange: [Function],
        brainLoaded: [Function],
        loggedIn: [Function],
        error: [Function],
        robot: [Circular],
        _events: [Object],
        options: [Object],
        client: [Object],
        self: [Object] },
     Response: [Function: Response],
     commands:
      [ 'hubot help - Displays all of the help commands that Hubot knows about.',
        'hubot help <query> - Displays all help commands that match <query>.' ],
     listeners: [ [Object], [Object], [Object], [Object], [Object], [Object] ],
     logger: { level: 7, stream: [Object] },
     pingIntervalId: null,
     version: '2.11.1',
     server:
      { domain: null,
        _events: [Object],
        _maxListeners: 10,
        _connections: 0,
        connections: [Getter/Setter],
        _handle: [Object],
        _usingSlaves: false,
        _slaves: [],
        allowHalfOpen: true,
        httpAllowHalfOpen: false,
        timeout: 120000,
        _connectionKey: '4:0.0.0.0:8080' },
     router:
      { [Function: app]
        use: [Function],
        handle: [Function],
        listen: [Function],
        setMaxListeners: [Function: setMaxListeners],
        emit: [Function: emit],
        addListener: [Function: addListener],
        on: [Function: addListener],
        once: [Function: once],
        removeListener: [Function: removeListener],
        removeAllListeners: [Function: removeAllListeners],
        listeners: [Function: listeners],
        route: '/',
        stack: [Object],
        init: [Function],
        defaultConfiguration: [Function],
        engine: [Function],
        param: [Function],
        set: [Function],
        path: [Function],
        enabled: [Function],
        disabled: [Function],
        enable: [Function],
        disable: [Function],
        configure: [Function],
        get: [Function],
        post: [Function],
        put: [Function],
        head: [Function],
        delete: [Function],
        options: [Function],
        trace: [Function],
        copy: [Function],
        lock: [Function],
        mkcol: [Function],
        move: [Function],
        purge: [Function],
        propfind: [Function],
        proppatch: [Function],
        unlock: [Function],
        report: [Function],
        mkactivity: [Function],
        checkout: [Function],
        merge: [Function],
        'm-search': [Function],
        notify: [Function],
        subscribe: [Function],
        unsubscribe: [Function],
        patch: [Function],
        search: [Function],
        connect: [Function],
        all: [Function],
        del: [Function],
        render: [Function],
        request: [Object],
        response: [Object],
        cache: {},
        settings: [Object],
        engines: {},
        _events: [Object],
        _router: [Object],
        routes: [Object],
        router: [Getter],
        locals: [Object],
        _usedRouter: true },
     adapterName: 'slack',
     errorHandlers: [],
     onUncaughtException: [Function] },
  message:
   { user:
      { id: 'ABCDEFGHI',
        name: 'phillip',
        email_address: 'phillip@domainname',
        room: 'hubottest' },
     text: 'hubot debug',
     id: '1234567890.123456',
     done: false,
     room: 'hubottest' },
  match:
   [ 'hubot debug',
     index: 0,
     input: 'hubot debug' ],
  envelope:
   { room: 'hubottest',
     user:
      { id: 'ABCDEFGHI',
        name: 'phillip',
        email_address: 'phillip@domainname',
        room: 'hubottest' },
     message:
      { user: [Object],
        text: 'hubot debug',
        id: '1234567890.123456',
        done: false,
        room: 'hubottest' } } }

Now that I can see the bulk of the objects, I can easily reference the channel (called room and referenced as msg.message.user.room, msg.message.room, msg.envelope.room, msg.envelope.user.room or msg.envelope.message.room - so many choices!). I can reference the user.name and if I want to add some obfuscated security, I could reference the user.id. All these objects are returned in some shape or other to a Slack client, so using user.id isn't actually adding any real security...
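
As a hedged sketch of where this is heading (the room and user names are just the test values from the debug output above, and the actual Rundeck call is left as a placeholder):

module.exports = (robot) ->
  robot.respond /deploy (.+)/i, (msg) ->
    # only honour the command from a known user in a known room
    if msg.message.room is 'hubottest' and msg.message.user.name is 'phillip'
      msg.send "Deploying #{msg.match[1]}... (Rundeck API call would go here)"
    else
      msg.reply "Sorry, you can't run that from here."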

All that said, off I go to create my commands that will call my Rundeck API and run the relevant environment jobs without even needing to log in to Rundeck itself.

]]>
<![CDATA[Atom and the never ending italics on _target]]>So something that has been annoying me within markdown is that the default for an html reference ( <a href=... ) is to use the same window/tab.

My preference for giving out links to other sites is to have them open in a new window/tab, so as a result I

]]>
http://doatt.com/2015/02/16/atom-and-the-never-ending-italics-on-_target/dc60cdb6-0f3d-47c7-a9f2-fb21ee4e20e6Sun, 15 Feb 2015 18:33:58 GMTSo something that has been annoying me within markdown is that the default for an html reference ( <a href=... ) is to use the same window/tab.

My preference for giving out links to other sites is to have them open in a new window/tab, so as a result I use html references directly within markdown more often than not. It's a minor annoyance as it's not actually that hard to type out the syntax. After all, I've done that for many years now in other editors!

One thing that has annoyed me to the point where I have 'fixed' it is that Atom has a bug where anything that starts with an underscore is treated as the start of a long stretch of text that should be displayed in italics. Any newly opened ticket about this gets re-pointed at issue #44 of the language-gfm module and closed. Thankfully I've learned to check closed issues previously, so I did find that out before opening a new ticket myself...

After spending a bit of time trying to remember how regex worked, I managed to figure out a fix. I modified line 40 of my local copy of the gfm.json file within the language-gfm module (located at /Applications/Atom.app/Contents/Resources/app/node_modules/language-gfm/grammars/gfm.json on my system) from:

"begin": "(?<=^|[^\\w\\d_\\{\\}])_(?!$|_|\\s)",

to:

"begin": "(?<=^|[^\\w\\d_\\{\\}])_(?!$|_|\\s|blank['\"])",

In more words, this means that if an underscore ( _ ) is found, and it isn't immediately followed by blank' or blank", then it will continue as the start of italics in markdown. This means that target='_blank' and all the words following it within the Atom editor will stop showing in a different colour and as italics while words are being electronically inked to the virtual paper. _blank_ and _blank blanky blank_ will still work of course. This is an edge case, and my implementation is a nasty hack. A hack that works, so I'm happy.

I then thought I'd fork the language-gfm module and submit the change for review, as I'm sure I'm not alone in being annoyed by this bug - only to find out that this particular Atom module is written in coffeescript, not the javascript I see in the local compiled version when I open the source for that module.

I created the fork and updated it, but haven't submitted a pull request - coffee is written slightly differently to javascript (a ' is used instead of a " and escaping both highlights a syntax error in the editor). As I'm still trying to learn coffeescript (I'm writing a specific Rundeck module for our Hubot implementation within Slack) I'm not confident the regex will work after being compiled, so I'm hoping a more up-to-speed person makes a better (non-hack!) fix and submits it...

]]>
<![CDATA[bash: pipe error: Too many open files in system]]>So I came across the following error the other day while trying to run the find command on my Mac:

bash: pipe error: Too many open files in system  

I also then went into one of those 'must close all the things' scenarios where all the applications started screwing up

]]>
http://doatt.com/2015/02/13/bash-pipe-error-too-many-open-files-in-system/19bc5508-5853-4443-9c63-b9f7349658e4Fri, 13 Feb 2015 01:23:00 GMTSo I came across the following error the other day while trying to run the find command on my Mac:

bash: pipe error: Too many open files in system  

I also then went into one of those 'must close all the things' scenarios where all the applications started screwing up, and in the end I resorted to rebooting - something I have hardly ever had to do since starting to use the Mac.

Long story short, it ended up being a bad update to BetterTouchTool (BTT), which I use to enable keyboard commands for moving windows to the left or right half of a screen, between monitors and the like. The author of BTT had put out a bad update, and as I had auto-updates turned on I got some bad code. It didn't crash the application, but (after further diagnosis) it was opening handles and leaving them open whilst opening new handles and so forth - after ~7000 handles were in use by the one application the system ran out (the total limit I think is 10,000) and crashes commenced, with catastrophic consequences for all running applications.

Finding out which app was causing the issue turns out not to be as obvious as you might think. Activity Monitor, while an excellent system application, does not actually show this info in an easy to digest (or compare) way - you can double-click an application in one of its views and see which files are open for that application, but you cannot easily see which application is using the most file handles.

A quick search on Google pointed me at either lsof (a command-line tool accessed via bash) or a pretty neat looking application called Sloth. Sloth is also a fine application - but it too does not give a nice clear listing of just the number of handles per application.

The solution in my case was to use lsof, combine it with awk, uniq, sort and head and get a nice clear listing of the top applications hogging all the handles.

The command is nice and simple right? :)

lsof | awk '{print $1}' | uniq -c | sort -rn | head  

This generates the following output:

user@desktop ~/AllTheFiles  
$lsof | awk '{print $1}' | uniq -c | sort -rn | head
3271 com.apple  
2978 Google  
 914 Atom\x20H
 505 Skype
 476 Mail
 375 Atom\x20H
 304 Finder
 292 Dock
 277 Atom\x20H
 270 Atom\x20H

com.apple in the above example is actually Safari - Google is the reference to Chrome.

It would be entirely possible to automate polling for an application going above 6000 handles and terminating the process (hidden behind the output is the pid for each process) - I didn't go that far myself, and thankfully the BTT author updated the app within a couple of days so I'm back to normal.
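
If you did want to go that far, a rough (untested) sketch along these lines would handle the detection part - the first column of the uniq -c output is the count, followed by the command name and pid taken from lsof:

#!/usr/bin/env bash
# print any process holding more than 6000 open files (lsof: $1=command, $2=pid)
lsof | awk '{print $1, $2}' | sort | uniq -c | sort -rn | \
  awk '$1 > 6000 {print "High handle count:", $1, "files -", $2, "pid", $3}'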

]]>
<![CDATA[The tools for the technology]]>So a long time ago (even as far back as last year!) I used Microsoft Windows on my workstations exclusively. I still do have a windows machine, but mainly just for games - I have however moved over to Mac OSx for my work computer. As a result the tools

]]>
http://doatt.com/2015/02/09/the-tools-for-the-technology/c53ce35d-eb3b-452c-bdf2-b871882f0f5bSun, 08 Feb 2015 18:29:36 GMTSo a long time ago (even as far back as last year!) I used Microsoft Windows on my workstations exclusively. I still do have a windows machine, but mainly just for games - I have however moved over to Mac OSx for my work computer. As a result the tools I use have changed from Microsoft centric to more generic/cross-platform.

The main workstation I currently use - and will do for many years to come - is a MacBook Pro (15" with Retina - the screen is awesome!). I'll no doubt update the underlying hardware regardless of the expense as I've found it more than worth the cost. As a result I have moved away from my old favourite editor Notepad++ as it is Windows specific, even though I have Parallels installed. Don't get me wrong - Parallels is a great system, but it creates a lot of heat due to CPU usage, and I don't like hearing my laptop fans going ;)

I have three main software tools I use for automation work:

  • iTerm 2
  • Atom
  • Google Chrome

Sounds almost too simple right? The end result of course is a ton of applications are accessed from within each.

iTerm 2

iTerm 2 is probably the best terminal application out there. The ability to select multiline text (and not as a block - as a consecutive set of characters that can be across multiple lines that can also be different lengths per line) as well as easily start up new sub-sectioned consoles (or 'panes' as they call them) within the same 'tab' makes this one of the most powerful consoles out there. Add in the easy to use toggle for broadcasting commands into multiple sections within a tab and you have the ideal DevOps terminal.

For example, if I would like to tail a log across multiple machines, all I have to do is start up a set of console panes in the same tab and SSH to the machines in question. Then I can enable broadcast (command-shift-T in my configuration), type the single command once, and have all the consoles start outputting their log data:

tail -f /usr/someApp/someLog.log  

Obviously a simple example, but very easy to use and very handy if you need to do something more involved than just tail...

iTerm 2 also has the capability to automatically log the console locally. Ever needed to remember what command you used to do something a month or so ago and can't for the life of you find it again in a Google search? grep your output directory or even trawl the logs manually if you wish - all good things (and bad!) are kept around for as long as you want them. This is especially useful if you need the output from a server that has since been decommissioned, where that output would never make it to your logging solution otherwise.

Yes, there are many other features (command suggestions, paste history, etc), so if you haven't tried out iTerm2 already, I suggest you go do so now.

iTerm 2 of course gives me access to bash. Bash (on the Mac) gives me access to ssh. Which in turn gives me access to bash (on Linux), echo, tail, cat and vi, all of which I use extensively when I'm forced into looking at remote machines. I don't allow GUIs on the Linux servers as the overhead is totally unnecessary. Occasionally I use openssl as well on my local machine, which then usually ends up with echo and/or vi, as copy/paste from iTerm is just awesome and pretty much eliminates the need for a GUI text editor for simple text manipulation.

Atom

Atom is one of the best 'free' editors out there that works across platforms. It's one of the best text editors out there regardless of platform :)

I regularly open up a directory in an Atom window (the concept of a project space) and, when that directory is a local (or synced) git repo, can see what I've changed easily due to the core 'git-diff' module Atom comes with. I work almost exclusively with git repo based text files of course, so as with other git-aware editors, I benefit a lot from knowing what is going on within a file.

Being a highly customisable editor, you can theme the look and feel of the editor in many, many ways. I'm partial to light text on a dark background (iTerm is green on black of course!), and working the way you want to work is important in my opinion.

Atom is also a really good markdown editor once you add in a couple of markdown specific plugins. Combine that with a spell-check facility and it won't be surprising to know that this is how I write these blog posts. On a side note, Ghost (my current blog platform) doesn't have a spell-check.

Google Chrome

Google Chrome is possibly the most work-centric browser out there. In my opinion it has the best set of tools for debugging websites, and that, combined with the ability to have multiple 'people' windows, means I can switch between sets of accounts (home and work for example), easily see which main account I'm working as, and keep a good level of separation between those sets of accounts.

If you haven't come across the 'People' concept, open up Chrome settings and scroll down while in the 'settings' view until you find it - you may need to enable it to get it to show, but once you do, you will have a person name showing at the top-right corner of every Chrome browser window.

If like me you have multiple Amazon accounts being accessed constantly during the day (everybody has a home, dev and production account right?) then being able to tell easily which account you are in while you are working is paramount.

Google Chrome is where I get to use the AWS console (outside of iTerm 2 and the AWS CLI of course!), GitHub (outside of git commands or the GitHub GUI client), Google Apps for mail and calendar, Rundeck, Pivotal for project management, and possibly most importantly, Google Search. I'm also quite partial to sending people lmgtfy.com/ links as well of course :)

Well, enough text about tools, even if in the end there are only three actual pieces of software in that list. Next I'll be talking about the process for creating the first automation server, which no doubt will lead us directly into the AWS console...

]]>
<![CDATA[A note on Security of Accounts for Tools]]>Before I start on some examples, a quick note on security on accounts and a reason why I've chosen some of the tools I use:

Multi-Factor Authentication (MFA) is a big deal. Use it.

All accounts that I use that can have a MFA, have one. This is something I

]]>
http://doatt.com/2015/02/04/a-note-on-security-of-accounts-for-tools/2052b57b-f314-4f0e-b6a4-0cb5ffa2b125Wed, 04 Feb 2015 02:03:44 GMTBefore I start on some examples, a quick note on security on accounts and a reason why I've chosen some of the tools I use:

Multi-Factor Authentication (MFA) is a big deal. Use it.

All accounts that I use that can have MFA, have it enabled. This includes my email, AWS, and GitHub (and therefore CircleCI).

The really important tools (GitHub and AWS) are partially chosen because they support MFA, and you can be sure I use it to protect those accounts!

Personally, for MFA-controlled accounts I don't care too much about what password I use - sure, I still use a password manager that dictates a random set of characters I'll never be able to remember, but the password is only one factor of an MFA-controlled account, and passwords on their own are too easy to brute-force in most cases.

OK, back off I roam into the ether once more...

]]>
<![CDATA[Fourth - The Build System]]>So the engineers have written a bunch of code. Now what?

We have to build it so it can be deployed!

This brings me on to a very important sub-topic - namely 'do you need humans to test most code?'.

The best software engineers I have worked with always

]]>
http://doatt.com/2015/02/02/fourth-the-build-system/a0351d9d-6d8f-428c-a069-761181978372Sun, 01 Feb 2015 18:16:15 GMTSo the engineers have written a bunch of code. Now what?

We have to build it so it can be deployed!

This brings me on to a very important sub-topic - namely 'do you need humans to test most code?'.

The best software engineers I have worked with always write tests for the pieces of code they are writing. Of course they also have a peer review system going, where another good engineer will ensure they have actually written the appropriate tests before the code is submitted for a build/deploy. By doing this they enable two things:

  • no need for Quality Assurance (QA) staff for mindless testing of positive/negative/false positive testing of systems such as REST APIs
  • testing can be automated within the build system for positive/negative/false-positive testing...

Personally I'm a great fan of only putting actual people in positions where the human brain is still capable of being better than a 'dumb' computer. Artificial Intelligence (AI) can sound like a big deal, but in reality we aren't anywhere near being able to replace a human being able to look at a website and see formatting issues. So yes, I still believe in QA staff, just not in areas where we can automate their jobs ;)

OK, enough of a sidetrack. Given the engineers have written some awesome code that doesn't have any bugs, and have implemented their testing, we now have to get that code ready for the deploy system - truly the bridge between engineering and DevOps.

There are quite a few build engines out there - some are old, and some are new. Most of the old ones use an older way of thinking and therefore don't always mesh well with newer ways of working, or for that matter, deploying. That can be said about some of the newer ones as well of course!

Functionality of course is king, so having a list of requirements your build system must give you for how you want to run your deploys is very important - my list looks (at a high level) similar to this:

  • must be able to take code from github using a commit trigger
  • must be able to automatically configure a build environment based on packaging instructions within the code (i.e. package.json for a npm install)
  • must be able to install any needed tools within the build environment
  • must be able to notify both a chat system (I use Slack for example) as well as the deploy system (a custom node app in my case)
  • must have an interface for abstracting any private information needed for triggering the deploy (API key or similar) so it is not located in the code
  • must be easily configurable
  • must have testing components available, and must fail to continue if a test fails during the build (important for deploys!)
  • must have the ability to add in custom actions (bash for example) as part of the build config
  • must have a debug interface (i.e. can SSH to the build server part way through a build if needed for debugging)
  • must have a good GUI for interacting with jobs and the ability to quickly and easily cancel a running job in an emergency
  • must be cheap or free - after all, no humans are involved so the cost savings are huge and must be passed on as exactly that - cost savings
  • must be able to run a build without a deploy being triggered based on who just committed code or where the code is being committed to (a feature branch for example)
  • optional: ideally no management of the build servers needed (just say no to the legacy 'run this server 24x7 just in case you need to run a build' approach)

Sounds like a horrible list of requirements right? I know of many systems that do some, or even most of those, but so far have only come across one that fits every requirement I have wanted, and then some.

There is one major name that I've found most companies use by default: Jenkins. If they aren't using Jenkins, I've found most of the remaining companies end up using Bamboo, just because they are already tied into the Atlassian suite of tools, or Travis, which isn't free (hence the plans link) but is hosted and is therefore the main competitor for what I use personally.

I've also seen a ton of custom written build systems (bunches of perl scripts are not uncommon) - these always end up in the same scenario though: the person that wrote them is invaluable for the consistent building of the applications involved, and replacing that person is always more painful than it's worth.

My ambition in this area? Try to use systems that any good DevOps person can pick up in a matter of minutes (ok, maybe hours).

Jenkins

So it turns out Jenkins is Free. But only if you ignore two important pieces - the systems you need in order to have Jenkins running, and the people you require in order to keep it running (any '24x7' server requires DevOps people - that's just life!). If you use Jenkins, you will have an additional overhead beyond what is required for just maintaining your deploy system.

Other than that, it's a very good system, albeit one that, due to its more complex nature (1,000+ plugins available and counting!), requires care and feeding that in my opinion could be better spent on other activities - on top of also having to keep the servers themselves up to date.

Bamboo

Bamboo is really great if you are already heavily into Atlassian products. This does however mean that the company doesn't make it easy to work with other systems. Why would they do that you ask? Well it's obvious - they want everybody to use their entire suite, and from a business perspective I totally agree with that as a concept. I just don't use them in reality, as the 'suite' approach is not always (and hardly ever is) the right one.

Travis

Travis is a great tool, albeit quite expensive if you aren't involved in open source software. I think it's great that they give back to the open source community, but in the end I'm not always involved in open source, so I would have to pay to use the service, and I don't believe the expense outweighs the benefits when compared to my product of choice.

So what do I use (and therefore recommend)? CircleCI.

CircleCI

First and foremost, CircleCI meets every one of my requirements. Including price. This is no small accomplishment as I'm very demanding when it comes to my services, and I really don't like paying passed-on costs such as for people where people are not necessary.

CircleCI also bring another hidden requirement that we all have, but usually forget about:

  • must have good customer support if the system is unreliable/unstable, or alternatively optional good customer support if the system is reliable/stable

Let's just say CircleCI have a very reliable/stable system, and somehow also manage to have great customer support. Nay, let's instead say they have great customer service. They really do.

Can you tell I'm a fan? :)

In the end, we have chosen to use GitHub for our source control, and while there are many other code control systems out there, and there are many other build systems out there that support GitHub, CircleCI hits the mark on all the important stuff that we require, and then some.

For the overall develop/build/deploy flow, whenever any branch of a repo in GitHub has a commit, CircleCI will run a build. With our configuration, if that build is into a specific branch (development or production), then a deploy will also happen.
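
As a rough, hedged illustration only (the deploy trigger script is a placeholder, and the exact keys depend on the circle.yml format of the day), the branch-based build and deploy hook looks something like this:

test:
  override:
    - npm test
deployment:
  production:
    branch: production
    commands:
      - ./trigger-deploy.sh production
  development:
    branch: development
    commands:
      - ./trigger-deploy.sh development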

Regardless of if a deploy is going to happen, CircleCI provides a great interface and alerting system for if (and why!) a build fails, and for the engineers they get to see that their code (and tests!) are working as planned. Failed builds never result in a deploy, and therein lies the beauty of a good code, build and then deploy system.

On the pricing side, originally CircleCI charged a token amount (~$20/month) for the capability to run private repo builds. Last year they removed that charge for the first container and only charge if you would like parallelism or two simultaneous builds (or more if you want to buy more). I find I'm more than happy to buy multiple containers as more concurrent builds are required, although this is tempered by how long a build takes and how often those commits happen. On most of our projects it is actually quite hard for the engineers to write enough code to keep even a single build container consistently busy, given how quickly the build itself completes.

Deploys are a different issue of course. Given the nature of High Availability (HA) in the 'cloud', rolling deploys (update 1 of x sets of servers, then the 2nd set, then the 3rd set and so on) are mandatory, and while a build may take 3 minutes, a deploy can easily take 15 - so running multiple deploys at once for most cloud based HA scenarios is a big no.

Well, after four main articles, we are getting out of theory, and I'll start to get into practical examples. Now that we have a way for engineers to submit code (GitHub), a build system (CircleCI) that triggers deploys (Rundeck which uses Ansible), we have the overall flow covered and hopefully the actual examples will make more sense.

]]>