Twitter Tales
Glossary
Twitter had a few internal services whose names are familiar to anyone who’s spent time in the insides of the bird website. They also used a bunch of open source tools in creative ways. Here are some of the Twitter-specific things we’ll be talking about:
- Audubon: Software which maintained the machine database: associations from the host’s name to various facts and grouping relationships.
- atla: Atlanta A, the production environment in the Atlanta data center facility.
- configbus: A service that distributed configuration-oriented git repositories internally.
- Loony: The command-line interface to Audubon.
- Servitor: An orchestration framework to manage retries and failures in Audubon and Wilson.
- smf1: SMF 1, the production environment in the Sacramento data center facility.
- TCC: Twitter Command Center, the Slack channel for coordinating the response to incidents.
- Wilson: The installation stack for the bare-metal servers.
Shell prompts
When we got there, Twitter’s shell prompts were customized from the defaults that you’d normally find on a Linux distribution, adding the machine’s Audubon role to the usual username and hostname that one finds in a shell prompt. This is incredibly helpful during troubleshooting, as it keeps relevant information close together, for ease of copy/pasting.
lbird@sfo2-aaa-02-sr1 [hwlab.shared] /etc $
Additionally, the role section was color-coded depending on which Puppet branch the machine was on. But you can’t copy/paste color, and we got tired of having to ask “okay and what branch is your machine on,” or otherwise go look it up, and so we dug into the Puppet code and changed it.
Production branches remained the same, color-coded red, with square brackets:
lbird@sfo2-aaa-02-sr1 [hwlab.shared] /etc $
The official canary branch, where changes had to sit before going to production, color-coded yellow, got parentheses:
lbird@sfo2-aab-02-sr1 (hwlab.shared) /etc $
Anything else, color-coded green, got angle brackets:
lbird@sfo2-aac-02-sr1 <hwlab.shared> /etc $
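If you want to approximate the effect yourself, it comes down to a case statement in the shell profile. What follows is a from-memory sketch rather than the actual Puppet-managed code; the role value and the branch lookup are placeholders:
# rough approximation; the role would really come from Audubon, and the branch
# lookup here is a stand-in for however the machine records it
role="hwlab.shared"
branch="$(cat /etc/puppet_branch 2>/dev/null)"
case "$branch" in
  production) role_ps="\[\e[31m\][${role}]\[\e[0m\]" ;;  # red, square brackets
  canary)     role_ps="\[\e[33m\](${role})\[\e[0m\]" ;;  # yellow, parentheses
  *)          role_ps="\[\e[32m\]<${role}>\[\e[0m\]" ;;  # green, angle brackets
esac
PS1="\u@\h ${role_ps} \w "'\$ '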
Depending on how your brain is shaped, you might be able to infer this, but we chose the symbols specifically to represent how the prompt should feel. Square brackets represent a solid environment, with walls at right angles and everything in its place. The parentheses are meant to call to mind a softness, because of the quicker rate of change of code. And the angle brackets are pointy, and we used that very word in the eng@ email about the change.
This is a very subtle accessibility change that we are still incredibly proud of.
The iptables Story
With apologies to the character of Roy Batty from Blade Runner: we’ve seen things you people wouldn’t believe. Shell scripts piping openssl smime through an HTTP tunnel. Push-mode Kerberos propagation across seven zones’ worth of replicas. iptables rules that, over 24 hours, dropped 96 gigabytes of DNS traffic. All these moments will be lost, like tweets in prod.
So, one day, while we’re primary on call, our secondary messages us: “Hey, Alex, I’m getting a weird message on the name servers in atla, mind taking a look?” This sentence is classic Alex-bait. We log in, notice that the name servers are taking rather a long time to respond, and also that the entire atla environment seems to be completely nonfunctional.
Once we manage to get a look at the log files, our jaw drops: we have never seen BIND, the industry standard name server, say that it cannot write to a client IP, because it would block. When the error is one that we haven’t seen before, that’s when you know it’s gonna be good.
By this point, we’ve joined the TCC call, and they’ve failed over as much as they can out of atla. We note, after a lot of waiting on the machines to respond, that the problem seems to be mainly focused on the first and second name servers in atla. The other replicas in the fleet also aren’t responding correctly, but ns1 and ns2 are practically melting down: massive amounts of network traffic, high load average. Also, none of the machines in atla can resolve their own hostnames.
A bit of work in tcpdump and Wireshark, and we noticed that instead of serving their own name from their own local cache, as we would expect, the machines were instead trying to do an actual DNS query for, say, atla-aaa-02-sr1.prod.twttr.net, and then atla-aaa-02-sr1.prod.twttr.net.prod.twttr.net, atla-aaa-02-sr1.prod.twttr.net.atla.twttr.net, and atla-aaa-02-sr1.prod.twttr.net.twitter.biz. This is the search domain feature of the DNS client in action.
It’s meant to let you type a short name, atla-aaa-02-sr1, and have .prod.twttr.net, .atla.twttr.net, and .twitter.biz automatically searched. But if atla-aaa-02-sr1.prod.twttr.net is looking up its own name, and cannot find it, then this triggers search-domain resolution. These particular names will never exist, and the name servers were spending all their time responding to each request with four “nonexistent” responses, effectively performing an amplified denial-of-service attack on the entire fleet.
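To make the mechanics concrete, the behavior comes down to one line in /etc/resolv.conf. The line below is reconstructed from the lookups we saw, not copied from the production file:
# reconstructed from the queries above; not the production file verbatim
search prod.twttr.net atla.twttr.net twitter.biz
#
# with that in place, one failed lookup of an already-qualified name fans out to:
#   atla-aaa-02-sr1.prod.twttr.net
#   atla-aaa-02-sr1.prod.twttr.net.prod.twttr.net
#   atla-aaa-02-sr1.prod.twttr.net.atla.twttr.net
#   atla-aaa-02-sr1.prod.twttr.net.twitter.biz
# and every single one of them comes back "nonexistent."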
The light-bulb above our head switches on, and here’s where we open two browser tabs: one to RFC 1035, the DNS specification, and another to Stack Overflow, where we grab an example command which uses iptables to match strings in network traffic, paste it three times into a scratch buffer, and make them look like this, except with all of the search domains:
iptables -A INPUT -p udp -m udp --dport 53 -m u32 --u32 "28 & 0xF8 = 0" -m string --algo bm --from 40 --hex-string '|04|prod|05|twttr|03|net|04|prod|05|twttr|03|net' -j DROP
Not being one to simply copy/paste from Stack Overflow, we cross-check ourselves against the RFC and the packet in Wireshark, making sure that the 28 & 0xF8 = 0 match catches DNS questions only, and that the string match starts at 40 bytes into the packet.
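For anyone who wants to follow the cross-check, here is roughly how the rule lines up with the DNS wire format. The offset arithmetic is standard for IPv4 plus UDP with no IP options; the reading of the flag bits is ours, not anything out of an internal runbook:
# 20-byte IPv4 header + 8-byte UDP header = 28, where the DNS header begins
# 28 + 12-byte DNS header                 = 40, where the question name begins
#
# "28 & 0xF8 = 0": u32 reads the four bytes at offset 28 (the DNS ID and flags)
#                  and requires the masked flag bits to be zero, as they are in
#                  an ordinary query
# "--from 40":     the string match starts at the question name itself
#
# DNS encodes names as length-prefixed labels, so |04|prod|05|twttr|03|net
# appearing twice in a row only matches questions for names like
# something.prod.twttr.net.prod.twttr.net, i.e. an already-qualified name with
# a search domain glued onto the end again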
We pipe up on the incident call: “I think this will let us get a handle on things, should we run this on the name servers?” and we paste those commands.
The call drops silent, for a good solid minute.
Then, the Incident Manager On Call, the top of the food chain here, breaks the silence: “Uh, Alex, nobody here has seen those options before. So, uh, proceed, but be ready to roll back?”
Cool.
We take a deep breath, paste the commands into our terminals, which are logged in as root, and watch the load average on ns1 and ns2 start to drop.
and we confirm that
generate_audubon_zones
,
the process that generates DNS records
from the machine database,
was at fault.
It got an empty list from the Audubon server,
but the safety checks built into it
only applied
if the response
from Audubon
produced a nonzero number of records.
When atla
Audubon hiccuped,
and returned 0 results,
(due to some transient error that we never could reproduce, of course)
generate_audubon_zones
happily wrote out a skeletal DNS zone file,
containing no entries.
Additionally, Puppet had hard-coded ns1 and ns2 into /etc/resolv.conf. This hadn’t proved to be an issue before, because each machine normally got its DNS through a local instance of BIND configured with a list of all of the DNS servers in that machine’s environment. But another failsafe was designed to remove the local BIND instance if it was not functioning. So when the local resolver was determined to be bad because it couldn’t resolve the machine’s hostname, that failsafe ended up pointing the entire atla fleet at those two poor DNS servers.
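To picture the combination, here is a sketch of the resolver config as we understand it; the addresses are invented, and the exact layout is our assumption rather than a copy of the managed file:
# assumed layout: the local BIND first, then the two hard-coded fallbacks
nameserver 127.0.0.1      # the local BIND instance
nameserver 10.64.0.10     # atla ns1 (address invented for this example)
nameserver 10.64.0.11     # atla ns2 (address invented for this example)
# once the failsafe pulled the local BIND, every machine in atla fell
# through to the same two servers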
The incident-ending fix for this ended up requiring us to manually run generate_audubon_zones, ferrying those zones by hand to both of the affected nameservers and also to the configbus repository, at which point the system began to recover: other machines could look up their own names again, and then configbus could distribute the correct DNS records to the rest of the nameservers, making them healthy again.
That’s where our immediate involvement in the incident ended, but the fallout for all of the other services took weeks to fully work out. The Puppet code was fixed to evenly distribute nameserver statements in /etc/resolv.conf, the same way it already did in the BIND config.
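One way to get that kind of even distribution, sketched here with invented addresses rather than the real Puppet template:
# hypothetical sketch: rotate the environment's full resolver list by a stable
# per-host offset, so no single nameserver is first in line on every machine
# (glibc only consults the first three nameserver lines, so ordering is what
# actually spreads the load)
servers=(10.64.0.10 10.64.0.11 10.64.0.12 10.64.0.13)
offset=$(( $(hostname | cksum | cut -d' ' -f1) % ${#servers[@]} ))
for i in "${!servers[@]}"; do
  echo "nameserver ${servers[$(( (offset + i) % ${#servers[@]} ))]}"
done > /etc/resolv.conf.new   # then swapped into place by the real tooling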
By the end of the next day, those iptables rules had dropped 96 gigabytes of traffic in total.
Also, generate_audubon_zones got an update to handle the case of “Audubon returned an empty result set.”
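A minimal sketch of that kind of guard, with invented helper names standing in for the real internals:
# hypothetical guard: never publish a zone built from zero records
records=$(fetch_records_from_audubon)            # stand-in for the real query
count=$(printf '%s\n' "$records" | grep -c .)    # count non-empty lines
if [ "$count" -eq 0 ]; then
  echo "Audubon returned zero records; keeping the previous zone file" >&2
  exit 1
fi
write_zone_file "$records"                       # stand-in for the real generator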