Twitter Tales

Glossary

Twitter had a few internal services whose names are familiar to anyone who’s spent time in the insides of the bird website. They also used a bunch of open source tools in creative ways. Here are some of the Twitter-specific things we’ll be talking about:

Audubon
Software that maintained the machine database: associations from a host’s name to various facts and grouping relationships.
atla
Atlanta A, the production environment in the Atlanta data center facility.
configbus
A service that distributed configuration-oriented git repositories internally.
Loony
The command-line interface to Audubon.
Servitor
An orchestration framework to manage retries and failures in Audubon and Wilson.
smf1
SMF 1, the production environment in the Sacramento data center facility.
TCC
Twitter Command Center, the Slack channel for coordinating the response to incidents.
Wilson
The installation stack for the bare-metal servers.

Shell prompts

When we got there, Twitter’s shell prompts were customized from the defaults you’d normally find on a Linux distribution, adding the machine’s Audubon role to the usual username and hostname. This is incredibly helpful during troubleshooting, as it keeps the relevant information close together, for ease of copy/pasting.

lbird@sfo2-aaa-02-sr1 [hwlab.shared] /etc $

Additionally, the role section was color-coded depending on which Puppet branch the machine was on. But you can’t copy/paste color, and we got tired of having to ask “okay and what branch is your machine on,” or otherwise go look it up, and so we dug into the Puppet code and changed it.

Production branches remained the same, color-coded red, with square brackets:

lbird@sfo2-aaa-02-sr1 [hwlab.shared] /etc $

The official canary branch, where changes had to sit before going to production, color-coded yellow, got parentheses:

lbird@sfo2-aab-02-sr1 (hwlab.shared) /etc $

Anything else, color-coded green, got angle brackets:

lbird@sfo2-aac-02-sr1 <hwlab.shared> /etc $

Depending on how your brain is shaped, you might be able to infer this, but we chose the symbols specifically to represent how the prompt should feel. Square brackets represent a solid environment, with walls at right angles and everything in its place. Parentheses are meant to call to mind a softness, because of the quicker rate of change of the code there. And the angle brackets are pointy; we used that very word in the eng@ email about the change.

This is a very subtle accessibility change that we are still incredibly proud of.
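
If you want something similar in your own prompt, the effect can be approximated with a few lines of bash. This is only a sketch, not Twitter’s actual Puppet template; PUPPET_BRANCH and AUDUBON_ROLE stand in for wherever your own setup keeps that information, and the branch names are illustrative:

# Pick a color and a bracket pair for the role based on the machine's Puppet branch.
case "$PUPPET_BRANCH" in
    production) color='\[\e[31m\]'; role_open='['; role_close=']' ;;   # red, square brackets
    canary)     color='\[\e[33m\]'; role_open='('; role_close=')' ;;   # yellow, parentheses
    *)          color='\[\e[32m\]'; role_open='<'; role_close='>' ;;   # green, angle brackets
esac
reset='\[\e[0m\]'
PS1="\u@\h ${color}${role_open}${AUDUBON_ROLE}${role_close}${reset} \w \$ "

The important part is that the bracket choice survives a copy/paste, which the color never did.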

The iptables Story

With apologies to the character of Roy Batty from Blade Runner:

we’ve seen things you people wouldn’t believe — shell scripts piping openssl smime through an HTTP tunnel. push-mode kerberos propagation across 7 zones worth of replicas. iptables rules that, over 24 hours, dropped 96 gigabytes of DNS traffic. all these moments will be lost, like tweets in prod.

So, one day, while we’re primary on call, our secondary messages us: “Hey, Alex, I’m getting a weird message on the name servers in atla, mind taking a look?”

This sentence is classic Alex-bait. We log in, notice that the name servers are taking rather a long time to respond, and also that the entire atla environment seems to be completely nonfunctional.

Once we manage to get a look at the log files, our jaw drops: we have never seen BIND, the industry standard name server, say that it cannot write to a client IP because the write would block. When the error is one that we haven’t seen before, that’s when you know it’s gonna be good.

By this point, we’ve joined the TCC call, and they’ve failed over as much as they can out of atla. We note, after a lot of waiting on the machines to respond, that the problem seems to be mainly focused on the first and second name servers in atla. The other replicas in the fleet also aren’t responding correctly, but ns1 and ns2 are practically melting down — massive amounts of network traffic, high load average.

Also, none of the machines in atla can resolve their own hostnames. A bit of work in tcpdump and Wireshark, and we notice that instead of serving their own name from their own local cache, as we would expect, the machines are trying to do an actual DNS query for, say, atla-aaa-02-sr1.prod.twttr.net, and then atla-aaa-02-sr1.prod.twttr.net.prod.twttr.net, atla-aaa-02-sr1.prod.twttr.net.atla.twttr.net, and atla-aaa-02-sr1.prod.twttr.net.twitter.biz. This is the search domain feature of the DNS client in action. It’s meant to let you type a short name, atla-aaa-02-sr1, and have .prod.twttr.net, .atla.twttr.net, and .twitter.biz automatically searched. But if atla-aaa-02-sr1.prod.twttr.net looks up its own name and cannot find it, that failure triggers search-domain resolution too. Those expanded names will never exist, so every lookup turns into four queries, each earning a “nonexistent” response, and the name servers were spending all their time saying no, effectively performing an amplified denial-of-service attack on the entire fleet.
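
For reference, this behavior comes from the search line in each machine’s /etc/resolv.conf, something shaped like this (an illustration of the shape, not the exact production file):

search prod.twttr.net atla.twttr.net twitter.biz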

The light-bulb above our head switches on, and here’s where we open two browser tabs: one to RFC 1035, the DNS specification, and another to Stack Overflow, where we grab an example command that uses iptables to match strings in network traffic, paste it three times into a scratch buffer, and edit the copies to look like this, one per search domain:

iptables -A INPUT -p udp -m udp --dport 53 -m u32 --u32 "28 & 0xF8 = 0" -m string --algo bm --from 40 --hex-string '|04|prod|05|twttr|03|net|04|prod|05|twttr|03|net' -j DROP

Not being one to simply copy/paste from Stack Overflow, we cross-check ourselves against the RFC and against the packet in Wireshark, making sure that the 28 & 0xF8 = 0 match catches DNS questions only, and that the string match really should start at 40 bytes into the packet.
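
Where does that 40 come from? Assuming an IPv4 header with no options, the headers sitting in front of the question name add up neatly:

20 (IPv4 header) + 8 (UDP header) + 12 (DNS header) = 40 bytes

so the encoded name, the only part worth matching on, starts at byte 40.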

We pipe up on the incident call: “I think this will let us get a handle on things, should we run this on the name servers?” and we paste those commands.

The call falls silent, for a good solid minute.

Then, the Incident Manager On Call, the top of the food chain here, breaks the silence: “Uh, Alex, nobody here has seen those options before. So, uh, proceed, but be ready to roll back?”

Cool.

We take a deep breath, paste the commands into our terminals, which are logged in as root, and watch the load average on ns1 and ns2 start to drop.

This gives us some breathing room, and we confirm that generate_audubon_zones, the process that generates DNS records from the machine database, is at fault. It got an empty list from the Audubon server, but its built-in safety checks only applied when the response from Audubon contained a nonzero number of records. So when atla Audubon hiccuped and returned zero results (due to some transient error that we never could reproduce, of course), generate_audubon_zones happily wrote out a skeletal DNS zone file containing no entries.
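
A minimal sketch of that failure mode, in shell, with entirely made-up file and function names (the real tool was considerably more involved):

# Hypothetical shape of the bug: sanity checks that only run when there is data.
records=$(wc -l < audubon_records.txt)
if [ "$records" -gt 0 ]; then
    sanity_check_records audubon_records.txt   # never reached when Audubon returns nothing
fi
write_zone_file audubon_records.txt            # so an empty zone file goes out the door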

Additionally, Puppet had hard-coded ns1 and ns2 into /etc/resolv.conf. This hadn’t proved to be an issue before, because each machine normally got its DNS through a local instance of BIND configured with a list of all of the DNS servers in that machine’s environment. But another failsafe was designed to remove the local BIND instance if it was not functioning. So when the local resolver was determined to be bad because it couldn’t resolve the machine’s hostname, that failsafe ended up pointing the entire atla fleet at those two poor DNS servers.
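
The resulting /etc/resolv.conf looked something like this; the addresses here are placeholders, not the real ones:

# the local BIND instance (dropped by the failsafe when it looked unhealthy)
nameserver 127.0.0.1
# atla ns1 and ns2, hard-coded by Puppet (placeholder addresses)
nameserver 10.0.0.10
nameserver 10.0.0.11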

The incident-ending fix required us to manually run generate_audubon_zones and ferry those zones by hand to both of the affected name servers, and also to the configbus repository. At that point the system began to recover: other machines could look up their own names again, and then configbus could distribute the correct DNS records to the rest of the name servers, making them healthy again. That’s where our immediate involvement in the incident ended, but the fallout for all of the other services took weeks to fully work out.

The Puppet code was fixed to evenly distribute nameserver statements in /etc/resolv.conf, the same as it already did in the BIND config. By the end of the next day, those iptables rules had dropped 96 gigabytes of traffic in total. Also, generate_audubon_zones got an update to handle the case of “Audubon returned an empty result set.”