Launch an AWS EC2 AMI instance with an EBS disk larger than the default

When you launch an EBS-backed instance on EC2, the root disk is the same size as the one the machine image was created with.  For example, the Ubuntu AMIs default to 8GB.  Typically, the best approach is to leave the root volume at this size and add additional volumes provisioned with the size needed for particular data (mysql data files or whatever).

However, sometimes you just need a bigger root disk than 8GB.  It IS possible to launch such an instance.  Use this command:

# Launch an instance with a specific root disk size instead of the default
 
UBUNTU_EBS_IMAGE=ami-d0f89fb9
# New volume size in GB
VOLUME_SIZE=30
 
SNAPSHOT_ID=`ec2-describe-images $UBUNTU_EBS_IMAGE | egrep "^BLOCKDEVICEMAPPING" | cut -f4`
 
# Don't forget all your other normal parameters
ec2-run-instances $UBUNTU_EBS_IMAGE \
  --block-device-mapping "/dev/sda1=$SNAPSHOT_ID:$VOLUME_SIZE:true:standard"
 
# Ubuntu images on boot will auto-resize their root disk to match the available EBS size
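
The Ubuntu images handle this resize for you, but if you launch an image that doesn't, here's a rough sketch of checking and growing the root filesystem by hand.  The device name below is an assumption (it varies by AMI and instance type); use whatever device df reports for /.

# On the running instance, check whether the root filesystem already spans the new volume
df -h /

# If it still shows the old 8GB size, grow the filesystem to fill the volume
# (/dev/xvda1 is an assumption -- substitute the device that df reports for /)
sudo resize2fs /dev/xvda1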

Reading between the lines of the AWS outage – Why you should be worried Part 2

TL;DR: The real problem is there isn’t enough capacity.  The us-east-1 region is too big, and during an outage there simply aren’t enough resources to allow users to recover their sites.

In part 1 I discussed how the bug rate on AWS doesn’t seem to be getting better.  A few bugs aren’t necessarily a big deal.  For one, they’re expected given the trailblazing that AWS is doing and the incredibly hard & complex problems they’re solving.  This is a recipe for bad things to happen sometimes.  To account for this, the true selling point of AWS has always been “if something goes wrong, just get more instances from somewhere else and keep on running”.

In the early days (pre-EBS), good AWS architecture dictated that you had to be prepared for any instance to disappear at any time for any reason.  Had important data on it?  Better have four instances with copies of that data in different AZs & regions along with offsite backups, because your data could disappear at any time.  Done properly, you simply started a new instance to replace it, copied in your data, and went about your merry way.

What has unfortunately happened is that nearly all customers are centralized in us-east-1.  This has many consequences for the architecture model described above.

Traffic Load

A very common thread in all of the us-east-1 outages over the last two years is that any time there is trouble, the API & management console become overloaded.  Every user will be trying to move and/or restore their services.  All at once.  And the API/console has proven to be extremely dependent on us-east-1.  PagerDuty went so far as to move to another region to de-correlate us-east-1 failures from their own failures.

Competition for Resources

Once again, by virtue of us-east-1 being the largest region, whenever there is an outage every customer will start trying to provision new capacity in other AZs.  But there is seldom enough capacity.  Inevitably in each outage there is an entry in the status updates that says “We’re adding more disks to expand EBS capacity”, or “We’re bringing more systems online to make more instances available”, and so forth.  You can’t really blame Amazon for this one: they can’t keep the prices they have and always be running below 50% capacity.  But when lots of instances fail, or lots of disks fill up, or lots of IP addresses get allocated, there just aren’t enough left.

This is a painful side effect of forcing everyone to be centralized in the us-east-1 region.  us-west is split into us-west-1 & us-west-2 because those datacenters are too far apart to maintain the low-latency connections needed to put them under the same regional designation.  us-east has a dozen or more datacenters, and thanks to them being so close together, Amazon has been able to keep them all in a single ‘us-east-1’ region instead of splitting them into ‘us-east-1’ and ‘us-east-2’.

But what happens when a bug affects multiple AZs in a region?  Suddenly, having all the AZs in a single region becomes a liability.  Too many people are affected at once and they have nowhere to go.  And all those organizations that have architected under the assumption that they can “just launch more instances somewhere else” are left with few options.

P.S. I know things are sounding a little negative, but stay tuned.  My goal here is first to identify the truly dangerous issues facing AWS, and then to describe the best ways to deal with them, as well as why I still think AWS is the absolute best cloud provider available.

Reading between the lines of the AWS outage – Why you should be worried Part 1

TL;DR: There are many very real problems with AWS that are being ignored.  #1 is that they appear to have no plan for dealing with their ongoing software bugs.

There has been a surprising amount of talk about the most recent AWS outage (Oct 22, 2012). In truth, I was busy coding and didn’t even hear about the outage until it was almost over (my company runs partially on AWS, but in another region). From what I read on the Amazon status site, the scope sounded pretty limited; I didn’t see it as a major event. But the talk since then says otherwise.

Amazon has now released their complete post-mortem, and in reading it I was struck by several hidden truths that I think many people will miss.  I was an early closed-beta tester of AWS (when you basically needed a personal invite to get in) and did over 30 top-to-bottom builds of complete production app stacks while I was with RoundHouse.  So I hope to provide some additional insight into what makes the last few AWS outages especially interesting.

What you should really worry about Part 1 - The bugs aren’t getting better

Widespread issues occurred because of (by my reading) 6 different software bugs.  That’s a lot.  This fact can be spun as both a positive and a negative.   Here’s what we would assume are the positives:

  • The bugs can (and will) be fixed.  Once they are, they won’t happen again.
  • By virtue of AWS’s advancing age, they have “shaken out” more bugs than any of their lesser competitors.  Similar bugs surely lie dormant in every other cloud provider’s code as well.  No provider is immune and every single one will experience downtime because of it.  In this regard AWS is ahead of the game.

But this line of thinking fails to address the underlying negatives:

  • Any codebase is always growing & evolving, and that means new bugs.  The alarming part is that each outage seems to uncover new underlying bugs.  The incidence of Amazon letting new bugs through seems to be disconcertingly high.  It does no good to fix the old ones if other areas of the service have secretly added twice as many new bugs.  If things were really being “fixed” we would expect new bugs to show up less often, not more often.  After a certain point, we have to start assuming that the demonstrated error rate will continue.
  • When bugs like this happen, they often can’t be seen coming.  Their impact and scope can’t be anticipated, but they are typically very large.

So far, we have not yet heard of any sort of comprehensive plan intended to address this issue.  Ironically, AWS is consistently dropping the ball on their ‘root-cause’ analysis by failing the “Five Whys” test.  They’re stopping at about the 2nd or 3rd why without addressing the true root cause.

In the case of AWS, anecdotal evidence suggests they have not yet succeeded in building a development process that is accountable for the bugs it produces.  They’re letting too many through.  Yes, these bugs are very difficult to detect and test for, but they need to hold themselves to a higher standard or outages like this will continue to occur.

Fixing Passenger error: PassengerLoggingAgent doesn’t exist

While doing a new install of Passenger & nginx, I ran into some strange errors:


2012/09/25 20:09:54 [alert] 2593#0: Unable to start the Phusion Passenger watchdog because it encountered the following error during startup: Unable to start the Phusion Passenger logging agent because its executable (/opt/ruby-enterprise-1.8.7-2011.12/lib/ruby/gems/1.8/gems/passenger-3.0.12/agents/PassengerLoggingAgent) doesn't exist. This probably means that your Phusion Passenger installation is broken or incomplete. Please reinstall Phusion Passenger (-1: Unknown error)

Our environment is using Ubuntu 12.04 LTS & Chef with the nginx::source & nginx::passenger_module recipes from the opscode cookbook. It turns out there were two root causes here that needed to be resolved:

  1. Even though the config explicitly specified version 3.0.12, the 3.0.17 passenger gem was also getting installed.  Some files were going to one location, some to another.
    • SOLUTION: I figured it’d just be easier to stick with the latest release, so I changed the setting to use 3.0.17 and uninstalled the old version.
  2. The PassengerLoggingAgent was not being installed (and it was failing silently).
    • SOLUTION: It turned out that we were missing some libraries.  Building the passenger package manually showed the details:

root@w2s-web01:/opt/ruby-enterprise-1.8.7-2011.12/lib/ruby/gems/1.8/gems/passenger-3.0.17# rake nginx RELEASE=yes
g++ ext/common/LoggingAgent/Main.cpp -o agents/PassengerLoggingAgent -Iext -Iext/common -Iext/libev -D_REENTRANT -I/usr/local/include -DHASH_NAMESPACE="__gnu_cxx" -DHASH_NAMESPACE="__gnu_cxx" -DHASH_FUN_H="<hash_fun.h>" -DHAS_ALLOCA_H -DHAS_SFENCE -DHAS_LFENCE -Wall -Wextra -Wno-unused-parameter -Wno-parentheses -Wpointer-arith -Wwrite-strings -Wno-long-long -Wno-missing-field-initializers -g -DPASSENGER_DEBUG -DBOOST_DISABLE_ASSERTS ext/common/libpassenger_common.a ext/common/libboost_oxt.a ext/libev/.libs/libev.a -lz -lpthread -rdynamic
In file included from ext/common/LoggingAgent/LoggingServer.h:46:0,
from ext/common/LoggingAgent/Main.cpp:43:
ext/common/LoggingAgent/RemoteSender.h:31:23: fatal error: curl/curl.h: No such file or directory
compilation terminated.
rake aborted!
Command failed with status (1): [g++ ext/common/LoggingAgent/Main.cpp -o ag...]

So the libcurl development headers were the true issue.

apt-get install libcurl4-openssl-dev

was the solution.  Or in our case it was to add:

package 'libcurl4-openssl-dev'

to the nginx recipe in Chef.
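
If you hit this outside of Chef, the same fix works by hand.  This is just a sketch based on our setup; the passenger path matches our 3.0.17 install and will differ on other machines.

# Install the missing libcurl development headers
apt-get install libcurl4-openssl-dev

# Rebuild passenger's nginx support (the same rake task as above) and confirm the agent now exists
cd /opt/ruby-enterprise-1.8.7-2011.12/lib/ruby/gems/1.8/gems/passenger-3.0.17
rake nginx RELEASE=yes
ls agents/PassengerLoggingAgent

I hope this helps someone else out there!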

AWS VPC Error: Client.InvalidParameterCombination

When launching an instance into a VPC with ec2-run-instances, you need to specify both the subnet & the security group it should belong to, so the natural command looks like this:

ec2-run-instances ami-abc123 \
 --group sg-abc123 \
 --subnet subnet-abc123 \
 --private-ip-address 10.0.1.10 \
 .... your other params

However, doing so generates this error:

Client.InvalidParameterCombination: Network interfaces and an instance-level security groups may not be specified on the same request

I even found one lowly report of someone else with this issue: https://forums.aws.amazon.com/message.jspa?messageID=368030

Luckily, my company has premium AWS support, and a quick 10-minute chat got me the answer I needed.  You must use the --network-attachment param, which takes the place of --group, --private-ip-address, and --subnet.

The resulting command looks like this:

ec2-run-instances ami-abc123 \
  --network-attachment :0:subnet-abc123::10.0.1.10:sg-abc123:: \
  .... your other params
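
For reference, here's how I read the colon-separated fields in that value.  The subnet, private IP, and security group positions come straight from the working command; the rest (the device index and the trailing empty fields) is my interpretation, so double-check it against the current docs.

# Annotated reading of the attachment string used above (interpretation, not official docs):
#   field 1   (empty)         network interface ID -- left blank so a new ENI is created
#   field 2   0               device index on the instance (my reading)
#   field 3   subnet-abc123   the subnet, replacing --subnet
#   field 4   (empty)         description -- left blank
#   field 5   10.0.1.10       the private IP, replacing --private-ip-address
#   field 6   sg-abc123       the security group, replacing --group
#   fields 7+ (empty)         remaining optional fields, left blank
ec2-run-instances ami-abc123 \
  --network-attachment :0:subnet-abc123::10.0.1.10:sg-abc123::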

Good luck, I hope this helps!

A new mentorship program

I recently returned from RailsConf 2012 in Austin.  I had a terrific time, learned a ton, and most importantly: I was inspired.  I was inspired to get back to contributing to Open Source and to help introduce others (kids and adults) to the computer world.

RailsConf had three double-sided whiteboards that (as far as I could tell) had nothing but “We’re hiring” ads on them.  I also stopped by a job fair where there were more companies hiring than applicants.  And finally, when all the attendees were asked if they worked for a company that was hiring Rails developers, at least two-thirds raised their hands!

So the first idea to take wing is a new mentorship program I’m starting in my town.  This is specifically a locally-oriented program.  I’m currently searching for 1-3 high school or college students who are interested in learning how to be the kind of well-rounded & highly knowledgeable developers these businesses are looking for.

Over about the next four months, we’ll cover the types of topics not typically covered at university, such as: git, TDD, setting up your own servers, and a complete understanding of the Rails stack.  This isn’t an internship (they’re doing their own work, not work for some other company), and it’s not a class (I’m not getting paid).  It’s just a chance to get some young new developers off on the right foot.  There will be a ton of material to cover, and there definitely will not be a speed limit!  Along the way, I hope to have each student come up with their own web app idea and fully implement it, start to finish.  In the end, I hope they’ll wind up with something even more impressive than a portfolio to show potential employers.

I’m pretty excited and hope to find some equally excited students.  I’ll be periodically offering updates on how the program is going.  I also hope to open source my lesson plan as well for use by others, so check back for more!

New Ruby Gem: AWS-SES

I’ve released a new gem: aws-ses.  This gem provides a Ruby wrapper for accessing Amazon SES, allowing you to easily send e-mail, verify e-mail addresses, and monitor your SES activity.

It’s still very new (give me a break, SES was announced yesterday!!), but I plan to get the functionality down solid and continue to support it.  Of course, feedback, help, and patches would be greatly appreciated!

https://github.com/drewblas/aws-ses