Twitter had a few internal services whose names are familiar to anyone who’s spent time in the insides of the bird website. They also used a bunch of open source tools in creative ways. Here are some of the Twitter-specific things we’ll be talking about:
- Audubon, the software which maintained the machine database: associations from a host’s name to various facts and grouping relationships.
- atla, the production environment in the Atlanta data center facility.
- A service that distributed configuration-oriented files.
- The command-line interface to Audubon.
- An orchestration framework to manage retries and failures in Audubon and Wilson.
- smf1, the production environment in the Sacramento data center facility.
- Twitter Command Center (TCC), the Slack channel for coordinating the response to incidents.
- Wilson, the installation stack for the bare-metal servers.
When we got there, Twitter’s shell prompts were already customized from the defaults you’d normally find on a Linux distribution, adding the machine’s Audubon role to the usual username and hostname. This is incredibly helpful during troubleshooting, as it keeps relevant information close together for ease of copy/pasting.
lbird@sfo2-aaa-02-sr1 [hwlab.shared] /etc $
Additionally, the role section was color-coded depending on which Puppet branch the machine was on. But you can’t copy/paste color, and we got tired of having to ask “okay and what branch is your machine on,” or otherwise go look it up, and so we dug into the Puppet code and changed it.
Production branches remained the same, color-coded red, with square brackets:
lbird@sfo2-aaa-02-sr1 [hwlab.shared] /etc $
The official canary branch, where changes had to sit before going to production, color-coded yellow, got parentheses:
lbird@sfo2-aab-02-sr1 (hwlab.shared) /etc $
Anything else, color-coded green, got angle brackets:
lbird@sfo2-aac-02-sr1 <hwlab.shared> /etc $
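We no longer have the exact Puppet template handy, but the shape of the logic was roughly this. Consider it a minimal bash sketch: PUPPET_BRANCH and AUDUBON_ROLE are hypothetical stand-ins for however the real code wired those values through.

# A minimal sketch of the prompt logic, not Twitter’s actual Puppet template.
# PUPPET_BRANCH and AUDUBON_ROLE are hypothetical stand-ins.
case "$PUPPET_BRANCH" in
  production) role="\[\e[31m\][$AUDUBON_ROLE]\[\e[0m\]" ;;  # red, square brackets
  canary)     role="\[\e[33m\]($AUDUBON_ROLE)\[\e[0m\]" ;;  # yellow, parentheses
  *)          role="\[\e[32m\]<$AUDUBON_ROLE>\[\e[0m\]" ;;  # green, angle brackets
esac
PS1="\u@\h $role \w \$ "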
Depending on how your brain is shaped, you might be able to infer this: we chose the symbols specifically to represent how the prompt should feel. The square brackets represent a solid environment, with walls at right angles and everything in its place. The parentheses are meant to call to mind a softness, the quicker rate of change of code in canary; and the angle brackets are pointy, which is the very word we used about that kind of change.
This is a very subtle accessibility change that we are still incredibly proud of.
With apologies to the character of Roy Batty from Blade Runner: we’ve seen things you people wouldn’t believe. Shell scripts piping openssl smime through an HTTP tunnel. Push-mode Kerberos propagation across 7 zones’ worth of replicas. iptables rules that, over 24 hours, dropped 96 gigabytes of DNS traffic. All these moments will be lost, like tweets in prod.
So, one day, while we’re primary on call, our secondary messages us: “Hey, Alex, I’m getting a weird message on the name servers in atla, mind taking a look?”
This sentence is, in retrospect, quite the understatement. We log in; the name servers are taking rather a long time to respond, and the entire atla environment seems to be completely nonfunctional.
Once we manage to get a look at the log files, our jaw drops: we have never seen BIND, the industry-standard name server, say that it cannot write to a client IP. When the error is one that we haven’t seen before, that’s when you know it’s gonna be good.
By this point, we’ve joined the TCC call, and they’ve failed over as much as they can out of atla. After a lot of waiting on the machines to respond, we work out that the problem seems to be mainly focused on the first and second name servers. The other replicas in the fleet also aren’t responding correctly, practically melting down: massive amounts of network traffic, high load average. And none of the machines can resolve their own hostnames.
A bit of work with tcpdump and Wireshark, and we noticed that instead of serving their own name from their own local cache, as we would expect, the machines were trying to do an actual DNS query for it. This is the search domain feature of the DNS client, which exists to let you type a short name instead of a fully qualified one: atla-aaa-02-sr1.prod.twttr.net looks up its own name, cannot find it, and that failure triggers search-domain resolution, appending each search domain in turn to a name that is already fully qualified. These particular names will never exist, and the name servers were spending all their time responding to each request with four “nonexistent” responses: an amplified denial-of-service attack on the entire fleet.
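To make that concrete, here is an illustrative resolver configuration and the query pattern it produces; the search list and address below are guesses at the shape, not the real values:

# /etc/resolv.conf (illustrative, not the real search list)
search prod.twttr.net smf1.twttr.net atla.twttr.net twttr.net
nameserver 10.0.0.53

# A machine that cannot resolve its own fully qualified name retries it
# with each search domain appended, producing names that can never exist:
#   atla-aaa-02-sr1.prod.twttr.net.prod.twttr.net.  -> NXDOMAIN
#   atla-aaa-02-sr1.prod.twttr.net.smf1.twttr.net.  -> NXDOMAIN
#   atla-aaa-02-sr1.prod.twttr.net.atla.twttr.net.  -> NXDOMAIN
#   atla-aaa-02-sr1.prod.twttr.net.twttr.net.       -> NXDOMAIN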
The light-bulb above our head switches on, and here’s where we open two browser tabs: one to RFC 1035, the DNS specification, and another to Stack Overflow, where we grab an example iptables command which matches strings in network traffic, paste it three times into a scratch buffer, and make the copies look like this, except with all of the search domains:
iptables -A INPUT -p udp -m udp --dport 53 -m u32 --u32 "28 & 0xF8 = 0" -m string --algo bm --from 40 --hex-string '|04|prod|05|twttr|03|net|04|prod|05|twttr|03|net' -j DROP
Not being one to simply copy/paste from Stack Overflow, we cross-check ourselves against the RFC and the packet in Wireshark, making sure that the 28 & 0xF8 = 0 match catches DNS questions only, and that the string match starts in the right place, 40 bytes into the packet.
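For posterity, here is our reading of that rule, annotated; the offsets assume a plain 20-byte IPv4 header with no options:

iptables -A INPUT -p udp -m udp --dport 53 \
  -m u32 --u32 "28 & 0xF8 = 0" \
  -m string --algo bm --from 40 \
  --hex-string '|04|prod|05|twttr|03|net|04|prod|05|twttr|03|net' -j DROP
# offset 28 = 20-byte IP header + 8-byte UDP header, so the u32 match reads
#   the DNS ID and flags; masking the flags with 0xF8 and requiring zero is
#   the part that filters for questions rather than responses.
# --from 40 = 28 + 12-byte DNS header, the start of the question section,
#   where the string match looks for the search-domain suffix in DNS wire
#   format (each label prefixed by its length byte) appearing twice in a row.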
We pipe up on the incident call: “I think this will let us get a handle on things, should we run this on the name servers?” and we paste those commands.
The call drops silent, for a good solid minute.
Then, the Incident Manager On Call, the top of the food chain here, breaks the silence: “Uh, Alex, nobody here has seen those options before. So, uh, proceed, but be ready to roll back?”
We take a deep breath, paste the commands into the sessions that are logged in as root on the name servers, and watch the load average start to drop.
This gives us some breathing room, and we confirm that generate_audubon_zones, the process that generates DNS records from the machine database, was at fault. It got an empty list from the Audubon server, but the safety checks built into it only ran if the response produced a nonzero number of records. atla’s Audubon hiccuped and returned 0 results (due to some transient error that we never could reproduce, of course), and the generator happily wrote out a skeletal DNS zone file, containing no entries.
Puppet had hard-coded the first and second name servers as a fleet-wide fallback. This hadn’t proved to be an issue before, because each machine normally got its DNS through a local instance of BIND configured with a list of all of the DNS servers in that machine’s environment. But another failsafe bypassed the local BIND instance if it was not functioning. So when the local resolver was determined to be bad, because it couldn’t resolve the machine’s hostname, that failsafe ended up pointing at those two poor DNS servers.
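We never did read the failsafe’s source during the incident, but its observable behavior was equivalent to something like this sketch; the addresses are made up, and the real mechanism may well have lived elsewhere in the stack:

#!/bin/bash
# A sketch of the failsafe’s observable behavior, not Twitter’s actual code.
# 10.0.1.1 and 10.0.1.2 stand in for the two hard-coded name servers.
if [ -z "$(dig +short "$(hostname -f)" @127.0.0.1)" ]; then
  # The local BIND can’t resolve this machine’s own name: declare it bad
  # and fall back to the same two hard-coded servers, fleet-wide.
  cat > /etc/resolv.conf <<'EOF'
search prod.twttr.net smf1.twttr.net atla.twttr.net twttr.net
nameserver 10.0.1.1
nameserver 10.0.1.2
EOF
fi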
The incident-ending fix ended up requiring us to regenerate the zones and ferry them by hand to both of the affected nameservers, and also to the master, at which point the system began to recover: other machines could look up their own names again, and the correct DNS records propagated to the rest of the nameservers, making them healthy again.
That’s where our involvement in the incident ended, but the fallout for all of the other services took weeks to fully work out. The Puppet code was updated to evenly distribute the fallback name servers across the fleet, same as it did in the BIND config; by the end of the next day, those iptables rules had dropped 96 gigabytes of DNS traffic. And generate_audubon_zones got an update to explicitly handle the case of “Audubon returned an empty result set.”
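We don’t have that patch handy anymore, but the spirit of the fix is easy to sketch: treat an empty result set as an error rather than as a valid, empty zone. The audubon-query command and the paths below are placeholders, not the real names:

#!/bin/bash
set -euo pipefail
# Placeholder for however generate_audubon_zones actually queried Audubon.
records=$(audubon-query --environment atla --format zone-records)

# The new safety check: zero records from Audubon is a failure, not a zone.
# Exit nonzero and leave the last known-good zone file in place.
if [ -z "$records" ]; then
  echo "Audubon returned an empty result set; refusing to write zone" >&2
  exit 1
fi

printf '%s\n' "$records" > /var/named/prod.twttr.net.zone.new
mv /var/named/prod.twttr.net.zone.new /var/named/prod.twttr.net.zone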