Anycast TCP

  1. What is Anycast Routing?
  2. Requirements for Anycast TCP
  3. Breaking down the problem
  4. Anycast TCP architecture
  5. Optimizations
  6. Issues and Criticisms
  7. Implementation notes

What is Anycast Routing?

There are four general kinds of routing: unicast, broadcast, multicast and anycast.

Unicast: I send a packet to an IP address and it routes to that one unique computer configured with that IP address.

Broadcast: I send a packet to the broadcast IP address. All computers within the broadcast domain receive my packet.

Multicast: Multiple computers within some domain subscribe to a multicast channel described by a multicast IP address. I send a packet to the multicast IP address. All of the subscribers receive the packet.

Anycast: Multiple computers around the world are configured with the same IP address. I send a packet to that IP address. The closest of the configured computers in the network (and only that one computer) receives my packet.

Requirements for Anycast TCP

Experiments with Anycast DNS show that in 99%+ cases Anycast devolves to unicast. That is, the packets from the client computer will always go to the same anycast node. A successful Anycast TCP solution must be as efficient or nearly as efficient as unicast in this normal case. In particular, a solution which relies on advising every anycast node about the connection state of every TCP connection on any of the nodes is unsustainable.

There are two corner cases for anycast TCP: Split Path Routing and Network Topology Change.

Split Path Routing: the client computer is equidistant to two or more of the anycast nodes and network load balancing causes packets transmitted to the anycast address to alternate between those nodes.

Network Topology Change: the client computer starts communicating with one of the anycast nodes. A network link goes up or down somewhere and a different anycast node becomes the closest one, causing the client's packets to shift to the other node.

A successful Anycast TCP solution may be less efficient in the corner cases but must still function correctly.

Breaking Down the Problem

The problem decomposes into four components:

  1. Handle packets which have reached the correct Anycast node without increasing the TCP protocol overhead.
  2. Efficiently (without additional state) detect packets which have reached the wrong Anycast node so that they may be rerouted.
  3. Efficiently (as statelessly as possible) determine which Anycast node can handle the packet.
  4. Send the packet to the correct Anycast node.

Handling correctly routed packets is easy: just dump them into the ordinary TCP stack. They'll match a TCP Control Block and be handled normally.

Detecting packets which have reached the wrong Anycast node turns out to be easy as well. Any packet which fails to match a local TCP Control Block would normally cause the OS to generate a TCP RST packet back to the sender. In the Anycast case, this means that a different node owns the connection. So don't send a RST; send the original packet to the correct Anycast node instead.

Sending packets to the correct Anycast node is also easy. You just need a VPN between the anycast nodes. A stateless multipoint GRE tunnel works particularly well for this.

This leaves the challenge of figuring out which Anycast node handles the packet for which the local node has no TCP Control Block. In the rest of this paper I'll discuss several semi-stateless mechanisms which allow the local Anycast node to send the packet to the correct Anycast node without needing to know the TCP state of the entire Anycast cluster.

The primary approach offered encodes information about the Anycast node handling each TCP connection into the TCP sequence number. It's stateless except for connections which transmit large quantities (many tens of megabytes) of data and requires state sharing only on those specific connections which exceed the data limit. If combined with modest application-layer restrictions (such as sending an HTTP redirect to a unicast-hosted address for files that are gigabytes long) the technique can be made fully stateless.

Anycast TCP Architecture

An Anycast TCP system can be achieved without any modifications to the TCP protocol. It does require the computer's network stack to treat some TCP packets differently than it ordinarily would. This design proposal will present a split user-space/kernel architecture in which hooks are placed in the kernel TCP stack which inform a user-space program. The user-space program resolves the corner cases.

First hook: TCP Reset

A computer ordinarily sends a TCP RST packet either when a connection has been attempted (via TCP SYN) to a TCP port on which no program is listening, or when a TCP ACK, FIN or payload packet (not SYN) arrives which does not correspond to any connection in the local computer's TCP state table (no local TCP control block).

For the second case only (not SYN), the hook will prevent a TCP RST from being sent. Instead, the received packet is forwarded to a user-space program. The local node does not own the associated connection, so presumably a different anycast node does. The user space program determines which of the other nodes owns the connection and forwards the packet to that node via a tunneling protocol such as GRE.

More on how the program makes that decision after the hooks.

Second hook: TCP Sequence Number Start Range

A computer accepting a TCP connection ordinarily randomizes its starting sequence number across the entire 32-bit space. Nodes in an anycast cluster will not do so.

Each node in the anycast cluster is assigned a non-overlapping "home range" of sequence numbers. When the user-space program receives an unknown packet from the TCP stack, it looks up which node's home range the sequence number (technically the ACK number in this case) falls into. It then tunnels the packet to the corresponding node.

TCP sequence numbers increment with the number of bytes sent. We want the connections' sequence numbers to stay in the home range for as long as possible. So, we hook into the TCP stack and set a start range of sequence numbers which starts at the beginning of the home range and is a fraction of the size of the home range. The computer randomizes the starting sequence number within the start range instead of across the entire 32 bits. This typically allows an anycast node to transmit hundreds of megabytes on a connection before the connection's sequence number leaves the home range.
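
To make the arithmetic concrete, here is a small Python sketch. The cluster size and the start-range fraction are example values, not prescribed by the design:

  import random

  SEQ_SPACE = 2 ** 32
  NODE_COUNT = 16          # nodes in the anycast cluster (example value)
  START_FRACTION = 16      # start range is 1/16th of the home range (example value)

  HOME_SIZE = SEQ_SPACE // NODE_COUNT        # 268,435,456 sequence numbers per node
  START_SIZE = HOME_SIZE // START_FRACTION   # 16,777,216 possible starting points

  def home_range(node):
      """Return (low, high) of the node's non-overlapping home range."""
      low = node * HOME_SIZE
      return low, low + HOME_SIZE - 1

  def initial_sequence_number(node):
      """Randomize the ISN within the node's start range instead of all 32 bits."""
      low, _ = home_range(node)
      return low + random.randrange(START_SIZE)

  def owning_node(ack_number):
      """Map a sequence/ACK number back to the node whose home range contains it."""
      return ack_number // HOME_SIZE

With these example values a node can send roughly 250 megabytes on a connection before the sequence number leaves its home range.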

Third hook: TCP Sequence Number High Water Mark

We can't force the TCP sequence number to wrap around and stay in a node's home range. It will wrap to zero at the top of the 32-bit space, as always. So, before a connection climbs out of the home range, the other anycast nodes must be notified of the specific TCP connection ID so that any packets they receive can be rerouted to the node which owns the connection. The TCP connection ID is the combination of the source and destination IP addresses and the source and destination TCP ports. This ID is unique for every TCP connection.

Set a high water mark near but not at or beyond the end of the anycast node's home range. Hook the TCP stack so that when transmission of a TCP packet brings the sequence number across the high water mark, the user-space program is notified. The user-space program then notifies the other anycast nodes that this node owns the TCP connection with the given connection ID.

In addition to notifying the user-space program, set a "high-water" flag in the TCP control block indicating that a notification was sent. The TCP connection starts with this flag clear. Once set it will remain set until the connection closes no matter how many times the sequence number rolls around through the home range.
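
A minimal sketch of the transmit-side check follows. The control block field names are illustrative; the real logic would live in the kernel's transmit path:

  def on_transmit(tcb, bytes_to_send, high_water_mark, notify_userspace):
      """Advance snd_nxt and fire a one-time notification when it crosses the mark."""
      next_seq = (tcb.snd_nxt + bytes_to_send) % 2 ** 32
      if tcb.snd_nxt < high_water_mark <= next_seq and not tcb.high_water_flag:
          tcb.high_water_flag = True       # stays set until the connection closes
          notify_userspace(tcb.connection_id)
      tcb.snd_nxt = next_seq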

Fourth hook: Connection close

If enough megabytes were sent over a connection to hit the high water mark then state information was transferred to the other anycast nodes. When that same connection closes (normally or abnormally) the other anycast nodes must delete the state information too.

Hook the TCP stack so that when connections with the high-water flag are fully closed, the user-space program is notified. The user-space program then informs the other anycast nodes that the given TCP connection ID is no longer active and should be removed from their databases.

User-space program: receive packet which would have caused a RST

Look up the connection ID in the database of long-lived connections handled by other nodes. If found, send the packet across a configured tunnel (e.g. GRE) to a unicast address of the anycast node which owns it. The other node will receive it on its tunnel interface and process it through its normal TCP stack without further intervention by the user-space program.

If no connection ID is found, examine the sequence number (ACK number) and do a static lookup against local configuration to determine which node owns that home range. If the home range belongs to the local node, emit a TCP RST packet via an IP socket.

If the home range belongs to a remote node, send the packet across a configured tunnel (e.g. GRE) to the anycast node which owns it. The other node will receive it on its tunnel interface and process it through its normal TCP stack without further intervention by the user-space program.
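
Sketched in Python, the decision might look like the following. The cluster size, tunnel names and callback functions are assumptions for the example:

  HOME_SIZE = 2 ** 32 // 16                          # 16-node cluster, as in the earlier sketch
  NODE_TUNNEL = {n: "gre%d" % n for n in range(16)}  # node id -> tunnel interface
  LOCAL_NODE = 3

  long_lived = {}   # connection ID -> owning node, learned from high-water notices

  def handle_rst_candidate(packet, connection_id, ack_number,
                           send_via_tunnel, send_rst):
      owner = long_lived.get(connection_id)
      if owner is None:
          owner = ack_number // HOME_SIZE            # static home-range lookup
      if owner == LOCAL_NODE:
          send_rst(packet)                           # genuinely stale: answer with a TCP RST
      else:
          send_via_tunnel(NODE_TUNNEL[owner], packet)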

That's the entire rerouting logic. Everything else deals with transfer of state knowledge in the unusual cases.

User-space program: high water notification

Tell each of the other anycast nodes that this node owns the given TCP connection ID. This only happens after a particular TCP connection has sent enough megabytes of data to approach leaving the local home range of sequence numbers. There are multiple implementations. Here's one:
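
As a sketch, each node could send a small datagram to its peers announcing ownership. The control port, message format and peer list below are illustrative choices, not part of the design:

  import json, socket

  PEERS = ["192.0.2.11", "192.0.2.12"]   # unicast addresses of the other anycast nodes
  CTRL_PORT = 7900                       # assumed control-plane port
  LOCAL_NODE = 3

  def announce_ownership(connection_id):
      """Tell every other node that this node owns the given connection ID."""
      msg = json.dumps({"op": "own", "node": LOCAL_NODE,
                        "conn": list(connection_id)}).encode()
      with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
          for peer in PEERS:
              s.sendto(msg, (peer, CTRL_PORT))

A production version would want acknowledgements or a periodic refresh so that a lost datagram does not leave a peer ignorant of the connection.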

User-space program: connection close

Tell each of the other anycast nodes to delete the given connection ID. This only happens for connections which previously generated a high-water-mark notification. Adding on to the implementation discussed there:
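
The close notification is the inverse of the announcement, using the same illustrative message format:

  import json, socket

  PEERS = ["192.0.2.11", "192.0.2.12"]
  CTRL_PORT = 7900

  def withdraw_ownership(connection_id):
      """Tell every other node that the given connection ID is no longer active."""
      msg = json.dumps({"op": "del", "conn": list(connection_id)}).encode()
      with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
          for peer in PEERS:
              s.sendto(msg, (peer, CTRL_PORT))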

User-space program: resync

When an anycast node first boots, it needs to receive the current set of cross-cluster connection IDs from the other nodes. Likewise, if it has been out of contact with another node for too long, it will have to refresh its knowledge instead of just receiving updates. Continuing with the implementation discussed in high water notification:
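
A booting node could ask each peer for a full dump of the connection IDs that peer owns; the request and reply formats below are illustrative assumptions:

  import json, socket

  PEERS = ["192.0.2.11", "192.0.2.12"]
  CTRL_PORT = 7900

  def resync():
      """Rebuild the database of long-lived connections owned by other nodes."""
      table = {}
      with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
          s.settimeout(2.0)
          for peer in PEERS:
              s.sendto(json.dumps({"op": "sync"}).encode(), (peer, CTRL_PORT))
              try:
                  data, _ = s.recvfrom(65535)
              except socket.timeout:
                  continue                     # peer unreachable; retry later
              reply = json.loads(data)         # expected: {"node": N, "conns": [...]}
              for conn in reply.get("conns", []):
                  table[tuple(conn)] = reply["node"]
      return table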

Optimizations

Multipath Flag

Add a multipath flag to the transmission control block for each TCP connection. On initiation, the multipath flag is clear. The multipath flag is set if any packet for the connection is received on the tunnel from another node. Once set it remains set for the duration of the connection.

When the connection reaches the sequence number high water mark, only notify the user-space program if the multipath flag is set.
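
A sketch of the gating logic, with illustrative control block fields:

  def on_tunnel_receive(tcb):
      tcb.multipath_flag = True      # a packet for this connection arrived via a peer

  def should_notify_cluster(tcb):
      # Spread state only for connections which exhibit split-path behavior
      # and have not already generated a notification.
      return tcb.multipath_flag and not tcb.high_water_flag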

This improves efficiency of the normal case up to the same level as unicast. No state information must be transmitted to the cluster except for connections which both exhibit split-path behavior and transmit enough data to leave the sequence number home range. Connections which do exhibit split-path behavior continue to work 100% correctly. However with this optimization, topology change events now have a small chance of killing the connection so are no longer 100% covered.

Broadcast Flood A

Instead of transmitting connection ID state information between anycast nodes, identify packets in the local sequence number home range which are not known to the local TCP state table. Broadcast these packets out via the tunnels to all of the other anycast nodes. If one of the nodes holds the connection, it processes the packet. The other nodes drop it: if the packet is received from another anycast node instead of directly from the client and it is not known in the local TCP state table then the packet is discarded.

TCP RST messages are suppressed.

This eliminates state management between the servers at the price of additional data transfer. Because packets in this state (not known to the local server but in its home range) are expected to be rare, the additional traffic may be an acceptable trade off for eliminating state management.

This also eliminates the ability to send authoritative resets for stale TCP connections. Clients with stale connections will continue sending retries (which continue being flooded) until they time out. RST packets for connection attempts to non-listening ports work as expected and are not impaired.

This requires more complex logic in the TCP RST hook where the user-space program must consider whether a packet arrived on a tunnel interface from another anycast node and if it did, discard it instead of rebroadcasting it.
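
The whole Flood A decision fits in a few lines. Here is an illustrative sketch; the tunnel names are assumptions:

  PEER_TUNNELS = ["gre1", "gre2", "gre4"]    # tunnels to the other anycast nodes

  def handle_unknown_packet(packet, received_interface, send_via_tunnel):
      if received_interface in PEER_TUNNELS:
          return                             # came from another node: silently drop
      for tunnel in PEER_TUNNELS:            # came from the client: flood to all peers
          send_via_tunnel(tunnel, packet)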

Broadcast Flood B

Same as A but the node which owns the connection responds to the broadcast by telling the sending node that it owns the connection. This allows the node first receiving packets from the client to cache which node owns the connection and avoid broadcasting the packet when the owning node is already known.

This trades some discardable state information to reduce the impact of the packet broadcasts.

This would require complicated packet transfer logic at the recipient. An anycast node would have to examine packets received from another node, know which node it came from, know whether the transmission was broadcast or unicast and evaluate the packet against the local TCP state table before releasing it to the normal TCP stack.

Broadcast Flood C

Same as Flood A or B but do not manipulate the TCP sequence number. Any connection not locally owned must be broadcast to the other anycast nodes. This has all the problems of the other two flood approaches and can be expected to result in many multiples of the amount of broadcast traffic. However, many multiples may still be a tiny amount of the total TCP traffic handled by the anycast system and avoiding sequence number manipulation restores full entropy in the sequence number space for all security solutions which rely on it.

Flood C should be the easiest of all the Anycast TCP approaches to implement: it requires only the TCP Reset hook in the TCP stack and, when built on Flood A, trivial logic in the user-space program: if the packet was not received on a tunnel interface from another node, broadcast it to all other nodes. Stop.

DNS

TCP connections associated with the DNS are known to be very small, typically under 1 MB. Because they are small, they always hit the best case scenario: no state transfer is needed with the constrained sequence number model. By dispensing with state transfer based on the foreknowledge that no connection can continue long enough to need it, efficiency is raised to identity with unicast for the normal case.

Large Transfers

When a data transfer is larger than can fit in the home sequence number range, state information about the node which owns the connection has to be shared with the other anycast nodes so that packets can be redirected to the correct node. With protocols that support redirects (like HTTP) this can be mitigated or eliminated. When a data transfer is predicted to be large (such as a video file) use the initial anycast connection to determine which site is closest to the client and then send a redirect instructing the client to connect to a unicast address at the same site instead of continuing the transfer via Anycast TCP.
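
For HTTP, the redirect is ordinary application code. A sketch using Python's standard library, where the unicast hostname and the list of large paths are assumptions:

  from http.server import BaseHTTPRequestHandler

  SITE_UNICAST_HOST = "site3.example.com"     # unicast name for this anycast site
  LARGE_PATHS = {"/video.mp4", "/image.iso"}  # transfers predicted to be large

  class Handler(BaseHTTPRequestHandler):
      def do_GET(self):
          if self.path in LARGE_PATHS:
              self.send_response(302)         # finish this transfer over unicast
              self.send_header("Location",
                               "https://%s%s" % (SITE_UNICAST_HOST, self.path))
              self.end_headers()
              return
          self.send_response(200)
          self.end_headers()
          self.wfile.write(b"small response served over Anycast TCP\n")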

The added startup duration for this would be undesirable for small transfers but may in some circumstances be fine for large ones.

Because the initial anycast request comes directly from the client, this may do a better job selecting the best site for the subsequent unicast request than a DNS-based technique. DNS requests come not from the client but from the DNS server the client is using. That DNS server can be and often is on an entirely different network than the client.

New TCP Option

Development of a new TCP option facilitating Anycast TCP could eventually eliminate the need to encode the anycast node's identity in the sequence number. Supporting clients would send option XX,2 in the SYN packet where XX is a to-be-assigned TCP option number. If the server knows the communication involves an Anycasted IP address, it would respond in the SYN/ACK with, for example, XX,3,N where N is the identity of the anycast server handling the connection. All subsequent packets from the client to the anycast server would include option XX,3,N. Whichever server receives such a packet can reroute it statelessly to server N.

Particularly large anycast clusters could expand the option length from 3 to 4 and provide a 16-bit anycast node identifier.
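
The option encoding itself is trivial. In the sketch below the option number is a placeholder (253, one of the experimental option numbers from RFC 4727) since XX has not been assigned:

  import struct

  KIND = 253   # placeholder for the to-be-assigned option number "XX"

  def capability_option():
      """The two-byte form a client sends in its SYN: kind, length."""
      return struct.pack("!BB", KIND, 2)                # XX,2

  def anycast_option(node_id):
      """Build kind, length, node identity using an 8- or 16-bit identifier."""
      if node_id < 256:
          return struct.pack("!BBB", KIND, 3, node_id)  # XX,3,N
      return struct.pack("!BBH", KIND, 4, node_id)      # XX,4,N (16-bit N)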

Communications using the enhanced TCP stack would not need the anycast nodes to share any state with each other since each packet contains the identity of the connection's home anycast node. Only legacy TCP implementations would need to rely on the anycast server encoding the anycast node identity in the sequence number.

Enhanced TCP Options

Anycast capability could be enhanced by more complex TCP options as well. For example, an anycast node might use a TCP option to inform the client of its unicast IP address.

Issues and Criticisms

Why do anything in user-space?

Cluster maintenance is a potentially complex cross-server coordination activity. User-space can be easily changed and customized to address unexpected problems or meet unusual needs. Let the kernel do what the kernel is good at and let user-space do the rest.

Implications for SYN flood protection

SYN flood is a denial of service attack on a server where the attacker overwhelms the server's buffers for TCP connection state by sending large numbers of synthesized TCP SYN packets.

Syncookies are the primary defense against a SYN flood. Instead of initiating TCP connection state upon receipt of the TCP SYN packet, the server encodes a rough current time in 5 bits of the TCP sequence number and a cryptographic hash of the connection ID in 24 bits of the sequence number returned in the SYN-ACK packet. TCP connection state is only initiated when the server receives the client's first ACK packet containing a reasonable time stamp and a correct cryptographic hash.

Because Anycast TCP limits the starting sequence number, it limits the effectiveness of syncookies as well. Even if mitigated by stealing two bits from the timestamp and using a larger Anycast TCP start range, the bits available to the cryptographic hash will be reduced. This will be exacerbated if there are a large number of servers in the Anycast TCP cluster, requiring smaller home and starting ranges for each node.
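
Rough arithmetic illustrates the squeeze. Assuming the classic layout of 5 timestamp bits, 3 MSS-index bits and 24 hash bits, and assuming the start range is widened to fill the node's entire home range while under attack:

  def hash_bits_remaining(cluster_nodes, timestamp_bits=5, mss_bits=3):
      # Bits pinned by the home range are unavailable to the cookie.
      node_bits = max(1, (cluster_nodes - 1).bit_length())
      return 32 - node_bits - timestamp_bits - mss_bits

  print(hash_bits_remaining(16))    # 20 bits of hash instead of 24
  print(hash_bits_remaining(100))   # 17 bits of hash instead of 24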

Thus Anycast TCP reduces the defensive capacity syncookies provides in the face of SYN flood attacks.

Split Brain Implications

Any geographically diverse computer cluster faces a condition known as "Split Brain." In Split Brain, multiple cluster nodes are running, each is actively serving clients, but due to a network problem between them, members of the cluster are unable to talk to each other.

Nodes currently encountering Split Brain conditions are unable to inform some or all of the other nodes about TCP connection IDs and cannot move the client's packets between them even if the correct node is known. Should a client suffer from split-path routing at the same time as the nodes with which he communicates suffer Split Brain, the client's communication will fail: the node which does not own the connection won't be able to reroute packets to the node which does, resulting in a packet loss rate too high for the TCP connection to survive.

What about HTTP/2 and QUIC?

Anycast TCP should be fully compatible with HTTP/2 over TCP.

QUIC is a UDP-based transport protocol originally developed to carry HTTP/2. To use both Anycast TCP and QUIC on the same nodes and same IP address, QUIC would need to be compatible with anycast routing as well.

Making QUIC compatible with anycast is beyond the scope of this document, however since QUIC is still an emerging protocol it might be smart to bake it in directly. One potential approach would be for the initial QUIC packets returned from the server to include a unicast address or addresses to which the client should direct further communication associated with that QUIC connection. By shifting the client to the node's unicast address, the server need not deal with any of the rerouting and state management complexities described for Anycast TCP.

DOS attacks on the Broadcast Flood variants

The Broadcast Flood versions of the protocol will amplify forged TCP ACK traffic by re-sending such packets to all servers in the anycast cluster. This multiplies a denial of service attack by the number of servers in the cluster.

DOS attacks on the sequence number variants

A predictable initial sequence number would allow an attacker to forge open TCP connections with spoofed source addresses using small traffic flows, resulting in consumption of server resources and eventually a denial of service. For such an attack to be effective with small traffic flows, the attacker would need to successfully guess the initial sequence number nearly 100% of the time. Although Anycast TCP reduces the entropy available in the initial sequence number, it does not reduce it to a point where guessing for denial of service purposes becomes practical.

Information leak due to the sequence number

Anycast TCP encodes the identity of the anycast node in each TCP connection's server-side initial sequence number. As a result, external entities may make reasonable inferences about which anycast node is handling a given connection by examining the sequence number.

Because an attacker can guess the anycast node based on the sequence number, he can force the nearest anycast node to forward crafted packets to any other anycast node in the cluster. Because this offers no amplification of the attacker's packets and can forward them only to anycast nodes in the cluster, vulnerability is not significantly increased compared to ordinary DOS and DDOS attacks.

No other known attacks are enhanced or enabled by the availability of this information.

Security breach via forged TCP connections

If an attacker can guess the server's initial sequence number, he can forge a TCP connection using a spoofed IP address by blindly sending packets with sequence numbers the server will accept into the flow. This can lead to Morris 1985 style attacks against the server.

Because Anycast TCP greatly reduces the randomness involved in the initial sequence number selection, it increases vulnerability to such attacks. Implementations should mitigate this risk by assuring at least 16 bits of entropy (and more whenever practical) remain in the initial sequence number selection.

As usual, Morris style attacks are defeated by never relying on the source IP address alone for authentication or access control purposes.

A discussion of initial sequence number selection and attacks based on predictable initial sequence numbers can be found in RFC 6528.

Firewalls and Anycast TCP

An Anycast TCP node will not function correctly if placed behind a stateful firewall. The reason is obvious: an Anycast TCP node intentionally receives packets for which it does not have connection state and forwards them to the anycast node which does have connection state. If the firewall blocks those packets for which it has no connection state, the packets will be lost and Anycast TCP will malfunction.

Client side firewalls (including stateful firewalls) should not impact Anycast TCP servers.

Load Balancers and Anycast TCP

Load balancing traffic between multiple servers at an anycast site requires additional attention. Anycast packets will find the closest site on the network to the client but anycasting provides no inherent way to distribute that traffic between servers once the site is reached. Classic load balancers are stateful in nature so placing anycast nodes behind them runs afoul of the stateful firewall problem described above.

Option 1: Standard Protocol Proxy

A standard TCP or HTTP reverse proxy load balancer will work as normal with Anycast TCP. Only the load balancer participates in the Anycast cluster. All servers behind the load balancer are simple unicast. As a consequence, each Anycast site has only one active anycast node with only one home sequence number range.

Standard proxies are effective at offloading encryption and spreading load between complex dynamic content back ends. They do not scale satisfactorily for static content back ends.

Option 2: Algorithmic or Lookup Table Routing

With a unicast system, Linux Virtual Server keeps TCP state information about the connections flowing through it and assigns each to one of the many backend servers. These backend servers all have the same content and all share the same load balanced IP address. This scales very well since the load balancer need only consider packets from the client to the server (return packets bypass the load balancer) and need only evaluate a few bytes to determine where to send the packet next without unraveling it into a larger connection or handling a higher level protocol.

This will not provide a satisfactory result with Anycast TCP. As discussed in Firewalls, you can't have a device that respects TCP state sitting in front of the Anycast TCP nodes. It won't work.

Next, there's a practical limit to how many servers can participate in an Anycast TCP cluster while leaving enough bits in each one's home sequence number range to transmit a useful amount of data before spreading state information to the other nodes. With 100 sites and 100 servers at each site, the sequence number space is reduced to allowing transfers of only a couple hundred kilobytes. Moreover, each time a connection climbed out of its home space it would have to share the connection state information with 10,000 other nodes. That doesn't scale!

We need Anycast TCP itself to scale before we worry about the load balancer. And Anycast TCP itself doesn't load-balance once at the site. So, first assign a home sequence number range to each Anycast site and configure all of the servers at that site to use the same range. The 100 sites in the previous example fit in 7 bits, leaving 25 bits (32 megabytes) for the home range (transmission limit) before state has to be spread.
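
The arithmetic, using the example numbers from the text:

  SITES = 100
  SITE_BITS = (SITES - 1).bit_length()           # 7 bits identify 100 sites
  print(32 - SITE_BITS, 2 ** (32 - SITE_BITS))   # 25 bits -> 33,554,432 (~32 MB)

  NODES = 100 * 100                              # per-node ranges, for comparison
  NODE_BITS = (NODES - 1).bit_length()           # 14 bits identify 10,000 nodes
  print(32 - NODE_BITS, 2 ** (32 - NODE_BITS))   # 18 bits -> 262,144 (~256 KB)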

Next up, when a transmission exceeds the home range, we have to be smarter about the data spread. Instead of sending the information to every anycast node, send it to a designated Coordinator node at each site. The coordinator keeps track of only the local anycast nodes and the remote coordinator nodes. It does not keep track of all anycast nodes in the system. In addition, when a connection isn't known to a local anycast node, that node forwards the packet to the coordinator for further disposition instead of working out the owner itself. The individual anycast TCP nodes don't keep track of what any others are doing at all.

Finally, we want the load balancer sitting in front of all these anycast nodes to make a stateless per-packet decision about which local back-end anycast TCP server handles the connection. Either the connection should be handled by that server or it should be handled by a server at a remote site. We want to avoid the situation where the load balancer forwards a packet to a local server, the server forwards it to the coordinator and the coordinator sends it to a different local server or worse, can't figure out which local server to send it to.

For this, use other stable characteristics of the packet such as the client IP address or client TCP port number. Use a hash or a lookup table to assign particular bit patterns to particular backend nodes.
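
An illustrative sketch of such a stateless assignment; the backend list and hash function are arbitrary choices so long as every device at the site computes them identically:

  import hashlib

  LOCAL_BACKENDS = ["10.0.0.11", "10.0.0.12", "10.0.0.13", "10.0.0.14"]

  def pick_backend(client_ip, client_port):
      """Deterministic per-packet choice from fields that never change mid-connection."""
      key = ("%s:%d" % (client_ip, client_port)).encode()
      digest = hashlib.sha256(key).digest()
      index = int.from_bytes(digest[:4], "big") % len(LOCAL_BACKENDS)
      return LOCAL_BACKENDS[index]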

Implementation notes

TCP Reset Hook

Create a /dev/tcprst or comparable communication mechanism such that:

For both IPv4 and IPv6,
If a user-space program has /dev/tcprst open for reading AND
a received TCP packet would cause an RST to be generated by the kernel (i.e. no matching TCP control block exists or is created) AND
ACK flag = set
THEN
DO NOT originate an RST packet
Send the received TCP packet and information to the user-space socket open on /dev/tcprst.

The information sent to user-space on /dev/tcprst should include:
The packet including the link layer if the receiving interface has one
Offsets to the starts of layer 3 (ip header) and layer 4 (tcp header)
The interface on which the packet was received

A separate /dev/tcprst4 and /dev/tcprst6 for IPv4 and IPv6 respectively is acceptable but not required.

Create a compile-time flag to include or exclude the code in the stack.
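
For illustration only, a user-space consumer of such a device might look like the sketch below. The device does not exist today and the message framing (a fixed header of interface index plus the two offsets, each a 32-bit big-endian integer, followed by the raw packet) is purely an assumption:

  import struct

  HEADER = struct.Struct("!III")   # ifindex, L3 offset, L4 offset (assumed layout)

  def read_loop(handle_packet):
      with open("/dev/tcprst", "rb", buffering=0) as dev:
          while True:
              data = dev.read(65535 + HEADER.size)
              if not data:
                  break
              ifindex, l3_off, l4_off = HEADER.unpack_from(data, 0)
              handle_packet(ifindex, data[HEADER.size:], l3_off, l4_off)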