We’ve used Puppet (system configuration management software) to automate our cluster of physical servers for a couple of years. But it wasn’t until about a month ago that we managed to get our cloud servers (hosted on Amazon EC2) under the control of Puppet. This blog post explains how (and why) we did it!
First some context: Puppet is systems configuration management software developed by Puppet Labs (formerly Reductive Labs). Puppet provides a framework to enforce the desired state of a system from configuration files that can be read as a live picture of the “recipes” for configuring and running that system. It provides a simplified declarative language (or DSL, domain-specific language) for common requirements, as well as hooks to execute arbitrary scripts or programs and evaluate their return codes.
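To give a feel for the DSL, here is a minimal sketch (the ntp example and the /tmp path are purely illustrative, not taken from one of our real manifests): a manifest that keeps the ntp package installed and its service running, written out and applied locally from the shell.

# Illustrative sketch only: write a tiny Puppet manifest and apply it locally.
cat > /tmp/example.pp <<'EOF'
package { "ntp":
  ensure => installed,
}
service { "ntp":
  ensure  => running,
  require => Package["ntp"],
}
EOF
# "puppet apply" on newer Puppet releases; very old releases take the manifest
# directly, as in "puppet /tmp/example.pp".
puppet apply /tmp/example.pp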
Puppet is awesome because it turns your dispersed, heterogeneous system configuration into a library of auditable source files that you can manage via your favorite source control — with the implicit knowledge that they are ‘live’ because Puppet is making it so! So if you want to see how and why the configuration of your web servers has changed in the last six months, you can just look in source control and see the change history of the puppet file! No more guessing what changed, and no more troubleshooting obscure differences between servers that are supposed to be identical: Puppet enforces the configuration you specify, so a manual change to a server will be auto-magically reverted and file-bucketed if it conflicts with the specification file that your team wrote.
Puppet in the cloud, however, introduces some complications. The trouble is that Puppet was developed before cloud computing was common, and its configuration depends heavily on an accurate, precise enumeration of the environment’s nodes. In more static environments, a human edits these files as you (occasionally) bootstrap new physical servers or retire old ones. But when leveraging cloud servers, it’s a mistake to assume human intervention! We create and destroy cloud machine instances on a frequent and automated basis: we spawn and tear down machines automatically in response to load, and we also tear down and replace all servers every 24 hours for overall system hygiene.
So we needed to write some code that would make Puppet work in a more dynamic environment, where machines appear and disappear frequently and somewhat unpredictably. Specifically, we needed to:
1) Get a PuppetMaster server running on EC2 (easy). This machine runs the cluster, and is not created or deleted frequently. It is the sun around which all the other servers orbit.
2) Add an entry in the /etc/hosts file of newly created servers, telling them where our Puppet server is (easy, although if our Puppet server ever changed locations, this script would have to change).
3) Generate a new entry for the new node in the appropriate .pp file on the PuppetMaster (tricky). (Steps 2 and 3 are sketched in miniature right after this list; the full script below adds the error handling.)
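Stripped of error handling, steps 2 and 3 boil down to two one-liners. The variable names here ($NEW_NODE, $PUPPET_MASTER_IP, $NEW_NODE_HOSTNAME, $NODE_CLASS) are placeholders for illustration, and the hosts entry assumes the agents look the master up under the conventional name “puppet”:

# Step 2 (sketch): point the new node at the PuppetMaster via /etc/hosts
ssh root@$NEW_NODE "echo '$PUPPET_MASTER_IP puppet' >> /etc/hosts"

# Step 3 (sketch): register the node and its class in nodes.pp on the PuppetMaster
ssh root@$PUPPET_MASTER "echo 'node $NEW_NODE_HOSTNAME { include $NODE_CLASS }' >> /etc/puppet/manifests/nodes.pp"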
So what’s the complication to adding an entry for each new server in the PuppetMaster’s listing of nodes? Well, it has to do with the nitty-gritty of how DNS works on EC2. EC2 will reassign previously-used hostnames to new live instances automatically and silently, which can cause all kinds of unanticipated errors if the PuppetMaster’s node list doesn’t reflect the new reality! In addition, EC2’s API call for retrieving a new instance’s hostname can be unreliable (we simply retry the call until we get the value successfully … eventually it works).
Below is a code snippet that runs when a new server is created. It runs from our physical servers when we create a new instance, fires up the new machine, and then makes the appropriate entries on the new server and in the .pp file listing the nodes.
#!/bin/bash
# Expects $AMI, $NUM_INSTANCES, $PRICE, $TYPE, $KEY_PAIR, $SECURITY_GROUP,
# $PUPPET_MASTER, $DB_FILE, $SSH and $SCP to be set by the calling environment;
# $1 is the Puppet class to assign to the new node.

#Place request for spot instance
ec2-request-spot-instances $AMI -n $NUM_INSTANCES -p $PRICE -t $TYPE -k $KEY_PAIR --group $SECURITY_GROUP > /tmp/ec2_spot_instance_request
SIR_REQUEST=`cat /tmp/ec2_spot_instance_request | cut -f 2`
rm -f /tmp/ec2_spot_instance_request

#Capture status of request. Initially request has STATUS=open and we need it to be active in order to continue
STATUS=`ec2-describe-spot-instance-requests | grep "$SIR_REQUEST" | cut -f 6`

#We don't want to wait till infinity for the instance to spawn up. Our threshold is 5 minutes.
COUNT=1
#This variable tracks whether the instance is spot or regular
IS_SPOT=1

#Wait for spot instance request to succeed.
REQUIRED_STATUS="active"
while [ "$STATUS" != "$REQUIRED_STATUS" ]
do
    sleep 60
    STATUS=`ec2-describe-spot-instance-requests | grep "$SIR_REQUEST" | cut -f 6`
    if [ $COUNT -gt 5 ]
    then
        IS_SPOT=0
        break
    fi
    COUNT=`expr $COUNT + 1`
done

if [ $IS_SPOT -eq 1 ]
then
    INSTANCE_ID=`ec2-describe-spot-instance-requests | grep "$SIR_REQUEST" | cut -f 12`
else
    #Kill spot instance request we made earlier
    ec2-cancel-spot-instance-requests $SIR_REQUEST
    #Spawn up a regular instance instead
    ec2-run-instances $AMI -n $NUM_INSTANCES -t $TYPE -k $KEY_PAIR --group $SECURITY_GROUP > /tmp/ec2_instance_request
    INSTANCE_ID=`cat /tmp/ec2_instance_request | tail -1 | cut -f2`
    STATUS=`cat /tmp/ec2_instance_request | tail -1 | cut -f6`
    rm -f /tmp/ec2_instance_request
    REQUIRED_STATUS="running"
    while [ "$STATUS" != "$REQUIRED_STATUS" ]
    do
        sleep 60
        STATUS=`ec2-describe-instances $INSTANCE_ID | tail -1 | cut -f6`
    done
fi
sleep 120

#Instance is now active. Capture data associated with instance like instance-id, external and internal dns.
INSTANCE_EXTERNAL_DNS=`ec2-describe-instances $INSTANCE_ID | tail -1 | cut -f 4`
INSTANCE_INTERNAL_HOSTNAME=`ec2-describe-instances $INSTANCE_ID | tail -1 | cut -f 5 | cut -f 1 -d'.'`

#We need to do the record keeping
echo "`date`: $INSTANCE_ID: $INSTANCE_EXTERNAL_DNS: $INSTANCE_INTERNAL_HOSTNAME: FOOBAR" >> $DB_FILE

#If we are not able to get the internal hostname within the next minute for some reason then quit
COUNT=0
while [ -z "$INSTANCE_INTERNAL_HOSTNAME" ]
do
    COUNT=`expr $COUNT + 1`
    sleep 10
    INSTANCE_INTERNAL_HOSTNAME=`ec2-describe-instances $INSTANCE_ID | tail -1 | cut -f 5 | cut -f 1 -d'.'`
    INSTANCE_EXTERNAL_DNS=`ec2-describe-instances $INSTANCE_ID | tail -1 | cut -f 4`
    if [ $COUNT -ge 6 ]
    then
        exit 1
    fi
done

#Configure puppetmaster to associate relevant class with node.
#If this hostname has been seen before (EC2 reuses internal hostnames), drop the stale entry first.
$SCP root@$PUPPET_MASTER:/etc/puppet/manifests/nodes.pp /tmp/nodes.pp
grep "$INSTANCE_INTERNAL_HOSTNAME" /tmp/nodes.pp
if [ $? -eq 0 ]
then
    sed "/$INSTANCE_INTERNAL_HOSTNAME/d" /tmp/nodes.pp > /tmp/newnodes.pp
    mv /tmp/newnodes.pp /tmp/nodes.pp
fi
echo "node $INSTANCE_INTERNAL_HOSTNAME { include $1 }" >> /tmp/nodes.pp
$SCP /tmp/nodes.pp root@$PUPPET_MASTER:/tmp
$SSH root@$PUPPET_MASTER "mv /tmp/nodes.pp /etc/puppet/manifests/nodes.pp"
rm -f /tmp/nodes.pp

#Tell the new node where its PuppetMaster lives
$SSH root@$INSTANCE_EXTERNAL_DNS "cat /tmp/puppet_host >> /etc/hosts"
...
#Install puppet on newly built ec2 host
$SSH root@$INSTANCE_EXTERNAL_DNS "apt-get -y install puppet"
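One thing the script doesn’t cover (and the “...” above elides some glue) is the very first agent run: after the puppet package is installed, the new node still has to check in with the PuppetMaster and have its certificate signed before it picks up its configuration. How you handle that is site-specific; here’s a minimal sketch, assuming auto-signing is acceptable for your internal domain and that the hosts entry names the master “puppet”. It uses the older puppetd-era command (newer Puppet spells the agent run “puppet agent”), and the autosign glob is an assumption about EC2’s internal naming, so adjust it for your region.

# One-time on the PuppetMaster: auto-sign certificate requests from the EC2
# internal domain (assumption: internal hostnames end in .ec2.internal).
$SSH root@$PUPPET_MASTER "echo '*.ec2.internal' >> /etc/puppet/autosign.conf"

# On the new node: run the agent once against the master named in /etc/hosts.
# With autosign in place the certificate exchange needs no manual step.
$SSH root@$INSTANCE_EXTERNAL_DNS "puppetd --server puppet --waitforcert 60 --test"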
As you can see, the combination of cloud computing and system automation is extremely powerful. Systems automation encapsulates your ‘secret sauce,’ and the cloud lets you scale it automatically, seamlessly to the nth degree!
We’re still in the early days of cloud computing, so you still need to roll your own solutions and figure out the “gotchas” yourself. For us that’s half the fun! I hope these scripts make it easier for other operations folk to control their cloud infrastructure using Puppet. If you like thinking about this kind of stuff, maybe you should consider working at SlideShare!
{ 3 comments }
Good script, that one! However, I’m not very comfortable with modifying the /etc/hosts file. Won’t modifying the puppet.conf file to point to the correct puppetmaster server achieve the same thing?
Also, it’s not clear in the script how the file /tmp/puppet_host got into the instance. I don’t see any previous reference to that file in the script.
This is god speaking “Mayank is a rockstar”