After covering AWS's IAM, EC2, EBS, S3 and Blue/Green Deployments, we now turn our attention to AWS's Route 53 service. As a DNS service, it plays one of the most critical roles, if not the most critical, in connecting us all together on the Internet. Take out DNS and the Internet will grind to a halt within mere hours, if not immediately.
But Route 53 does far more than provide a DNS service for registering your `A`, `NS` or `MX` records. This service goes to great lengths to provide the tools necessary for keeping your application highly available and resilient to infrastructure or application failures. How so?
We suggest you start by watching the AWS Re:Invent 2014 session "Amazon Route 53 Deep Dive: Delivering Resiliency, Minimizing Latency" by Lee-Ming Zen to learn the best ways to utilize all the power of Route 53.
What do we learn from this session? What special niceties does Route 53 bring to the table? We'll discuss below.
It is not static!
We may think of traditional DNS systems as a yellow pages book full of IP addresses and name servers. And like a phone book it's kind of static. DNS records are heavily cached on all levels (your browser, your OS, your proxy, your ISP and your DNS server) to optimize performance. To adapt to possible DNS changes, each record has a TTL (time-to-live) field so it's refreshed every once in a while. But within the boundaries of TTL, the DNS picture of the world is pretty much static (hence all the caches).
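To see TTL in action, you can query a record and inspect the TTL the resolver returns; against a caching resolver, repeated queries show the value counting down until the record is refreshed. Below is a minimal sketch using the third-party dnspython library (the domain is just a placeholder):

```python
# pip install dnspython
import dns.resolver

# Resolve an A record and inspect its TTL. Against a caching resolver,
# running this repeatedly shows the TTL counting down until the answer
# is refreshed from the authoritative name servers.
answer = dns.resolver.resolve("example.com", "A")
for record in answer:
    print("A record:", record.address)
print("TTL (seconds):", answer.rrset.ttl)
```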
As it turns out, this is not the case with Route 53. While Route 53 provides all the usual means of defining Resource Record sets (RRsets) of all major types (`A`, `AAAA`, `CNAME`, `MX`, `NS`, `PTR`, `SOA`, `SPF`, `SRV`, `TXT`) with proper TTLs, it also provides much more. If you want it to be just a yellow pages book of your resources, it'll happily do so. But here's the catch: when a traditional DNS system is queried, it returns the answer it knows without thinking too hard. Route 53, however, may give a reply that reminds me of that old saying: "It depends". Where is my web application? It depends. What endpoints should I use? It depends. Depends on what? Glad you asked! Most likely it depends on your application's health status, your network latency, your geographical location, or all of them together.
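If all you want is that plain yellow pages behavior, defining a simple record set is a one-call affair. Here is a minimal boto3 sketch; the hosted zone ID, record name and IP address are placeholders you would swap for your own:

```python
import boto3

route53 = boto3.client("route53")

# Create (or update) a plain A record with a 60-second TTL -- the "static
# yellow pages" case, with no health checks or routing policies involved.
route53.change_resource_record_sets(
    HostedZoneId="Z1EXAMPLE",  # placeholder hosted zone ID
    ChangeBatch={
        "Comment": "Plain A record for the web endpoint",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "www.domain.com.",
                "Type": "A",
                "TTL": 60,
                "ResourceRecords": [{"Value": "203.0.113.10"}],
            },
        }],
    },
)
```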
Health Checks and DNS Failover
Health checks and DNS failover are the major tools in Route 53's arsenal for making your application highly available and resilient to failures. If your application is deployed in multiple availability zones and multiple AWS regions, with Route 53 health checks attached to every endpoint, Route 53 can send back a list of healthy endpoints only, allowing your clients to switch without even noticing there has been a problem. This automatic recovery scenario can be used in active-active or active-passive setups, depending on whether your additional endpoints always receive live traffic or only take over after all primary endpoints have failed. Route 53 health checks and failover are therefore essential tools for guaranteeing your service's uptime, compared to the traditional monitor-alert-restart way of addressing failures. Nothing beats being able to sleep through a failure while the system heals itself automagically.
Route 53 health checks are not triggered by DNS queries; they are run periodically by AWS and results are published to all DNS servers. This way name servers can be aware of an unhealthy endpoint and route differently within, say, 30 seconds of a problem (after 3 failed tests in a row) and new DNS results will be known to clients a minute later (assuming your TTL is 60 seconds), bringing complete recovery time to about a minute and a half in total.
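As a rough illustration, a health check matching that scenario (a probe every 10 seconds, marked unhealthy after 3 consecutive failures, i.e. roughly 30 seconds to detection) could be created with boto3 like this; the endpoint name and path are placeholders:

```python
import uuid
import boto3

route53 = boto3.client("route53")

# A health check probing the endpoint every 10 seconds and marking it
# unhealthy after 3 consecutive failures -- roughly the "unhealthy within
# 30 seconds" scenario described above.
response = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),  # any unique string, makes the call idempotent
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "prod.domain.com",  # placeholder endpoint
        "Port": 443,
        "ResourcePath": "/health",                      # placeholder path
        "RequestInterval": 10,
        "FailureThreshold": 3,
    },
)
print("Health check ID:", response["HealthCheck"]["Id"])
```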
However, DNS caching can still be a problem here (see our previous post where the "long tail" problem is covered) if the TTL is not respected by all layers between your client and Route 53. You could then apply a "cache busting" technique: send a request to a unique domain (`http://<unique-id>.<your-domain>`) and define a wildcard Resource Record (`*.<your-domain>`) to match it.
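As a sketch of how those two pieces fit together, the wildcard record is created once and the client then hits a unique hostname per request; the hosted zone ID, domain and IP are placeholders:

```python
import uuid
import boto3
import urllib.request

route53 = boto3.client("route53")

# One-time setup: a wildcard record so that any <unique-id>.domain.com
# resolves to the same endpoint.
route53.change_resource_record_sets(
    HostedZoneId="Z1EXAMPLE",  # placeholder hosted zone ID
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "*.domain.com.",
                "Type": "A",
                "TTL": 60,
                "ResourceRecords": [{"Value": "203.0.113.10"}],
            },
        }],
    },
)

# Client side: every request uses a never-before-seen hostname, so no
# intermediate resolver can serve a stale cached answer.
url = "http://%s.domain.com/" % uuid.uuid4().hex
urllib.request.urlopen(url)
```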
Conditional Routing
Basing routing decisions on health check status is just one example of Route 53's conditional routing. Other routing policies are WRR (weighted round robin), LBR (latency-based routing) and geolocation routing. When specified, Route 53 evaluates a resource's relative weight, the client's network latency to the resource, or the client's geographical location when deciding which resource to send back in a DNS response. As I mentioned before, Route 53's habit of answering "It depends" when queried for a service or domain's IP address comes from it trying to provide the best, fastest and healthiest results possible, rather than dumping all known IP addresses regardless of their health status.
All routing policies mentioned above (WRR, LBR, Geolocation) can be associated with health checks, so resource health status is considered before it even becomes a candidate in a conditional decision tree.
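For example, a weighted round robin setup with health checks attached might look like the boto3 sketch below, splitting traffic roughly 70/30 between two endpoints; the hosted zone ID, IP addresses and health check IDs are placeholders:

```python
import boto3

route53 = boto3.client("route53")

def weighted_record(name, ip, identifier, weight, health_check_id):
    """Build a weighted A record that is only returned while its health check passes."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            "SetIdentifier": identifier,
            "Weight": weight,
            "TTL": 60,
            "ResourceRecords": [{"Value": ip}],
            "HealthCheckId": health_check_id,
        },
    }

route53.change_resource_record_sets(
    HostedZoneId="Z1EXAMPLE",  # placeholder hosted zone ID
    ChangeBatch={"Changes": [
        weighted_record("www.domain.com.", "203.0.113.10", "primary-stack", 70,
                        "11111111-1111-1111-1111-111111111111"),
        weighted_record("www.domain.com.", "203.0.113.20", "canary-stack", 30,
                        "22222222-2222-2222-2222-222222222222"),
    ]},
)
```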
Aliases and Root (Naked) Domains
One special feature Route 53 provides that is specific to AWS is aliasing. While traditional DNS records store an IP address or a hostname in their `RDATA` field, aliased RRsets point to other AWS resources, such as an S3 bucket, an ELB instance, a CloudFront distribution, or another RRset in the same hosted zone. This can be used for user-friendly AWS URLs, hosting a static website in an S3 bucket, or root domain hosting.
Root domain (aka naked domain) hosting allows `ALIAS`ing your root `domain.com` to the `domain.com` S3 bucket, which redirects all requests to `www.domain.com`, which in turn is `ALIAS`ed to an AWS resource or `CNAME`d to an external resource, such as a Heroku app.
There are several recommendations or requirements regarding root domain hosting:
- Do not `CNAME` your root domain. It's just not allowed; Route 53 makes it impossible to do so, but an `ALIAS` can be used instead.
- Do not use `A` records for `www.domain.com`, as that means hardcoded IPs (something we all know isn't fun). Instead, a Route 53 `ALIAS` or traditional `CNAME` record allows `www.domain.com` to always point to the right resource, wherever your site is hosted, even when the physical server has changed its IP address.
- Always redirect your naked `domain.com` to `www.domain.com`. However, this may not be necessary if you run your entire site on AWS, because Route 53's ability to translate an alias to an `A` record on the fly avoids most of the issues of a naked domain - one of the big advantages of using Route 53 with AWS as your frontend.
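Putting those recommendations together, the sketch below creates an `ALIAS` record at the zone apex that points straight at an ELB; Route 53 then synthesizes the `A` records on the fly. The hosted zone IDs and the ELB DNS name are placeholders (each ELB exposes its own hosted zone ID via the API or console):

```python
import boto3

route53 = boto3.client("route53")

# An alias record at the zone apex (the "naked" domain.com) pointing to an ELB.
# Route 53 resolves the alias to A records on the fly, which is why a CNAME
# is neither needed nor allowed at the apex.
route53.change_resource_record_sets(
    HostedZoneId="Z1EXAMPLE",  # placeholder: your hosted zone
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "domain.com.",
                "Type": "A",
                "AliasTarget": {
                    "HostedZoneId": "Z3ELBEXAMPLE",  # placeholder: the ELB's hosted zone ID
                    "DNSName": "my-elb-1234567890.us-east-1.elb.amazonaws.com.",
                    "EvaluateTargetHealth": True,
                },
            },
        }],
    },
)
```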
Building blocks of a resilient application with Route 53
Now, let's build an app that is highly available and resilient to failures. What building blocks is it composed of?
- In every AWS region an ELB (Elastic Load Balancer) is set up with cross-zone load balancing and connection draining enabled - this distributes the load evenly across all instances in all availability zones and ensures requests "in flight" are fully served before an EC2 instance is disconnected from an ELB for any reason.
- Each ELB delegates requests to EC2 instances running in multiple AZs (Availability Zones) and grouped in an Auto Scaling Group - this protects the app from AZ outages, ensures a minimum number of instances is always running, and responds to changes in load, properly scaling each group's EC2 instances out and in.
- Each ELB has health checks defined - this ensures it only delegates requests to healthy instances.
- Each ELB also has a Route 53 health check associated with it - this ensures ELBs running out of healthy EC2 instances are not routed to.
- The application's `prod.domain.com` Route 53 records are `ALIAS`ed to the ELBs with a Latency routing policy and associated with the ELB health checks - this ensures requests are routed to a healthy ELB providing minimal latency to the client.
- The application's `fail.domain.com` Route 53 record is `ALIAS`ed to a CloudFront (Amazon's content delivery service) distribution of an S3 bucket hosting a static "fail whale" version of the application.
- The application's `www.domain.com` Route 53 record is `ALIAS`ed to `prod.domain.com` (as the primary target) and `fail.domain.com` (as the secondary target) with a Failover routing policy - this ensures `www.domain.com` routes to the production ELBs if at least one of them is healthy, or to the "fail whale" if all of them appear to be unhealthy (see the failover sketch after this list).
- The application's `domain.com` Route 53 record is redirected to `www.domain.com` using an S3 bucket of the same name.
- (Optionally) The application's content (both static and dynamic) is served using CloudFront - this ensures the content is delivered to clients from AWS edge locations spread all over the world with minimal latency. Serving dynamic content from the CDN, cached for short periods of time (several seconds), takes load off the application and further improves its latency and responsiveness.
- (Optionally) The application is deployed in multiple AWS regions - this protects the app from an AWS region outage.
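As a rough sketch of the failover piece of that setup, the boto3 snippet below defines `www.domain.com` with a primary alias to the production record and a secondary alias to the CloudFront distribution serving the static "fail whale"; all zone IDs and DNS names are placeholders:

```python
import boto3

route53 = boto3.client("route53")

def failover_alias(role, target_dns, target_zone_id):
    """Build a PRIMARY or SECONDARY failover alias record for www.domain.com."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "www.domain.com.",
            "Type": "A",
            "SetIdentifier": "www-" + role.lower(),
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "AliasTarget": {
                "HostedZoneId": target_zone_id,
                "DNSName": target_dns,
                "EvaluateTargetHealth": True,  # only route here while the target is healthy
            },
        },
    }

route53.change_resource_record_sets(
    HostedZoneId="Z1EXAMPLE",  # placeholder: your hosted zone
    ChangeBatch={"Changes": [
        # Primary: the latency-routed production record in the same hosted zone.
        failover_alias("PRIMARY", "prod.domain.com.", "Z1EXAMPLE"),
        # Secondary: the CloudFront distribution hosting the static "fail whale".
        failover_alias("SECONDARY", "d111111abcdef8.cloudfront.net.", "Z2CLOUDFRONTEXAMPLE"),
    ]},
)
```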
Failure cases and how they are handled
Using the building blocks above, we are now equipped to handle various failure cases.
- AWS region outage - handled by Route 53. Requests are routed to other regions right after the ELB in a failed region fails its Route 53 health checks.
- AWS ELB outage - handled by Route 53. Requests are routed away from the unhealthy ELBs.
- AWS AZ outage - handled by ELB. Requests are delegated to instances in the remaining AZs right after the instances in a disconnected AZ fail their ELB health checks.
- EC2 instance failure - handled by Auto Scaling Groups. A new EC2 instance is started immediately after an EC2 instance terminates. It is reconnected to an ELB after it passes ELB health checks.
- Load spike - handled by Auto Scaling Group. Additional EC2 instances are started to handle an increased load. Remember to call your AWS contact point about pre-warming ELBs if a spike is expected.
- Global clients connecting from various parts of the world - handled by Route 53 and CloudFront. Requests are routed to an ELB with minimal latency to a client. Static and cached dynamic content is served from the closest AWS edge location.
- Total zombie apocalypse with all application instances failing their health checks (or the slightly more likely case of pushing bad application code globally) - handled by Route 53. Requests are routed to a static "fail whale" version of the application served from AWS edge locations and hosted on S3.
In the diagram below, we can see many of our building blocks and failure scenarios at play.
- A client asks Route 53 for the IP address of our website.
- Route 53 knows that while the client is closest to us-east, all the load balancers are down. It sends back the IP addresses of the ELBs in us-west.
- The client then randomly selects one of the ELB nodes, in this case the node in Zone B.
- The ELB knows that the instances in its own zone are down, but because cross-zone load balancing is turned on, it will send the request to an instance in Zone A, which then sends back the response.
As you can see, Route 53, together with other AWS services, will go to great lengths to keep your application available to your clients and accessible globally with the lowest latency possible.
Additional resources
Here are some additional Route 53 resources for you to play with or dig deeper:
- The Route 53 category of the AWS Architecture Blog, where the team shares some great details about how the Route 53 service operates in a highly available, scalable and distributed manner. Our favorite posts, on deployment strategies and Shuffle Sharding, are definitely must-reads (ever wondered how and why Route 53 assigns four name servers with strange-looking names to each hosted zone?)
- Ansible Route 53 module adding or deleting Route 53 entries
- Chef Route53 cookbook for updating Route 53 records
- AWS CLI - route53 service support
- ruby_route_53 - Route 53 Ruby gem and a CLI tool
- python-route53 - Python Route 53 module
- cli53 - command line script to administer Route 53
- route53d - a DNS frontend to Route 53 allowing you to use standard DNS tools to make changes to Route 53 zones