Summary of Routing Architectures discussed in the RRG
THIRD DRAFT

This document attempts to look at abstract architectural issues associated with the BGP routing table growth problem while avoiding engineering details as requested by the Routing Research Group chair.

 

Item 1: Routing table size problem, root cause:

The root cause of the routing table growth problem is that we're using the same element (IP address) to express three distinct concepts:

  1. the globally unique identity (GUID) needed by the originating client system to find the destination server,
  2. a component of the session identity (SID) used by the layer 4 and higher protocols to keep track of communications with other hosts, and
  3. the node's present location or locations (LOC) within the network topography.

Changing a host's GUID is often hard. Client systems which provide no services to the network can change because nothing needs to find them but changing a server's identity requires changes in many other hosts' references which are unwieldy or impractical to make.

Changing a SID terminates the communications session. This makes address changes unsuitable for use in multihoming, even where the multihoming is trivial, such as a home user with two DSLs.

As a result, the GUID strongly resists all change while the SID resists change that must complete shorter than a typical communication session. Because the LOC is bound to the GUID and SID, the LOC strongly resists change even when you desire to move the affected host to a new location in the network.

Also, hosts find it very difficult to present two LOCs from separate parts of the topography graph. In addition to the problems above, packets from such hosts would need to be "policy routed" by the source address until they reach a transit-free network level in order to preserve security-required source filtering, a function not inherent to modern dynamic routing protocols. As a result, multihomed systems generally connect distant portions of the network topography instead of presenting a small number of aggretatable stubs.

Thus we end up with a great deal of duplication where more than one LOC holds the same spot in the topography (disaggregation and fragmentation) and trivial multihoming forces us into a general graph topography with very limited hierarchical aggregation of the LOCations that routers route by.

 

Item 2: Solution strategies

Strategy A.

Do local routing by the address (which functions as GUID, SID component and local LOC) but have each packet flow through an encoder which attaches a second Remote LOC (RLOC) before the packet enters the internetwork core. Do not route by the address in the core; route by RLOC. Limit RLOC routing in the core so that only service providers (ISPs) with significant interconnection have their own RLOCs. Fewer than 10,000 such "core ISPs" exist today and the number is growing much more slowly than the routing table overall. Once the packet reaches the network identified by the RLOC, local routing by address takes over for final delivery. Distribute RLOCs through the core via a typical distance-vector or link-state routing protocol.

Variants include:

A1a. Each core ISP has one RLOC. The RLOC's existence and reachability is flood-propagated to the rest of the core.

A1b. Each core ISP has a small number of RLOCs for traffic engineering (TE). The RLOCs' existence and reachability is flood-propagated to the rest of the core.

A1c. Each core ISP has an aggregated set of RLOCs which it may hierarchically assign to customers downstream and/or disaggregate for TE. The aggregated RLOC's existence and reachability is flood-propagated to the rest of the core.

Methods for mapping the address to one or more RLOCs include:

A2a. GUIDs statically mapped to each RLOC are periodically pushed towards a central or distributed registry. The full list is periodically downloaded to the encoders which add RLOCs to the packets.

A2b. GUIDs dynamically mapped to each RLOC are pushed towards a central or distributed registry as they change. The registry  pushes all incremental changes in near-real time to all encoders which add RLOCs to the packets.

A2c. GUIDs dynamically mapped to each RLOC are pushed towards a central or distributed registry as they change. Encoders request and briefly cache individual mappings from the registry as needed.

Failure handling approaches include:

Link failures in the Internet core cause the RLOCs to be rerouted with no change to the address-RLOC map.

A3a. RLOC encoders detect when particular RLOCs are no longer reachable at all and fall back on secondary RLOCs for a particular address. Encoders rely on active failure messages from some system in the RLOC-specified network to indicate that a host is no longer available via that RLOC, causing them to fall back on secondary RLOCs for that host.

A3b. Link failures which prevent parts of the RLOC's network from reaching a destination host or set of hosts it serves cause a dynamic change to the address-RLOC map, depreferencing the affected RLOC.

Compatibility approaches include:

A4a. New IP protocol. Not compatible with IPv4 and IPv6.

A4b. Modified IP protocol. Not compatible with deployed IPv4 and IPv6.

A4c. Standard IPv4 and IPv6 packets are encapsulated in a tunnel packet while they transit the Internet core. Path-MTU issues are addressed by setting an Internet-wide maximum packet size enforced by the encoders and assuring that all core links support that size.

A4d. Standard IPv4 and IPv6 packets are encapsulated in a tunnel packet while they transit the Internet core. Path-MTU issues are addressed by returning packets which breach the MTU while in the core back to the encoder who must act as a proxy by returning a sensible packet-too-big message to the originating host.

A4e. The IPv6 address space is partitioned into end-user address space and Internet core address space. The address to RLOC map is symmetric. Part of the IPv6 end-user address is swapped for the RLOC when the packet enters the Internet core and then restored when it leaves the Internet core. IPv4 is abandoned.

Major criticisms:

There don't appear to be any genuinely clean ways of implementing strategy A. Handling path-MTU is a problem since the packets in the core are different than the origin host would recognize. Extra bandwidth is consumed by the ITR figuring out whether the ETR is still available and functioning. Border filtering of source addresses becomes problematic. It's a mess.

Deployment may require heavy weight "for the public good" relays in the non-upgraded part of the Internet to facilitate migration.

 

Strategy B.

Assign heirarchically aggregatable LOCs to every host. Assign multiple LOCs to each host such that in the network topography hosts appear as stubs in multiple locations instead of forming distant connections in the graph. Assign one aggregated set of LOCs to each core ISP where a core ISP is one which has at least half a dozen major transit or peering links. Flood-propagate the aggregated LOC's existence and reachability to the rest of the core.

Having reduced the network topology to something relatively close to a hierarchy, perform plain old hierarchical aggregation on the LOCs. Add and remove LOCs to each host dynamically during operation as needed to reflect changes in the nearby network hierarchies.

Attach source and destination LOCs when the packet leaves the host. Route by first source then destination LOC: move up the source network hierarchy until you can move laterally toward the destination LOC in a permissioned manner.

LOCs dynamically mapped to each host are pushed towards a distributed registry as they change. Hosts request and temporarily cache individual mappings from the registry as needed.

During a link failure, the host abandons the impacted LOC since it is no longer reachable that way. If the link failure is distant enough that the host still has a viable path within the same LOC hierarchy, the host may acquire a new LOC representative of its new path through the nearby network.

Variants include:

B1a. Each host has a single numeric GUID to which the LOCs are attached. This GUID is used by the layer-4 and higher protocols to compose the SID..

B1b. Each service provided by a host has a globally unique, hierarchic character-string GUID to which the LOCs are attached. Clients initiating communication with that service negotiate a numeric SID which is unique only within the scope of that service.  

Major criticisms:

1. Probably not compatible with UDP or TCP though could be compatible with IPv6's layer 3 and the replacement layer-4 protocols should be coaxable to run on top of IPv4's layer 3 in the not-yet-upgraded part of the network.

2. How do firewalls work if the LOCs are constantly in flux?

 

Strategy C.

Suppress distant routes by aggregating them into sets expected to be available in a given direction. Because LOC reachability info is not flooded, the routing tables each router must deal with are relatively small.

Variants include:

C1. Geographic aggregation. All nodes within some geographic boundary are assigned the same LOC. Routers move packets to any adjacent router deemed to be "closer" to the LOC in question.

Major criticisms:

No one has been able to construct a protocol under strategy C without introducing constraints that are fundamentally incompatible with the Internet's economic model. For example, Geoag has been shown to have uncorrectable theft-of-service anomalies in networks as small as 8 autonomous systems and two geographic areas.

 

Strategy D.

Use plain old BGP for the RIB. Algorithmically compress the FIB in each router.

Variants include:

D1a. Aggregate any adjacent routes that have the same next hop.

D1b. Insert a /0 route into the FIB which goes to the most popular next hop for all the routes in the RIB. Step to the /1 level. For each /1, if most of the routes in the RIB within that /1 go to a different next hop than the longest route above (the /0 route), add that /1 route to the FIB. Step to the /2 level. Repeat until all routes in the RIB go to the correct next hop in the FIB. Unrouted space is treated as "don't care": it will route wherever the algorithm happens to drop it and will rely on the TTL to take packets off the network.

Major criticisms:

1. The RIB can perhaps (maybe!) grow to up to an order of magnitude larger than the FIB before it hits the wall too. One order of magnitude doesn't gain us multihoming for SOHOs.

2. FIBs towards the edge should aggregate well with this strategy but there's no evidence to support a conclusion that they'd aggregate well deep in the core.

 

Strategy E.

Make no routing architecture changes. Instead, create a billing system through which the folks running core routers are paid by the folks announcing each prefix to carry those prefixes. Let economics suppress growth to a survivable level.

Variants include:

E1a. Everybody pays the RIRs. the RIRs pay the router operators.

E1b. Private negotiation between parties.

E1c. Assisted private negotiation where router ops can offer standardized contracts to carry prefixes and prefix announcers can accept groups of identical contracts via an automated third-party payment system moving funds between the two easily.

Major criticisms:

1. Smells like pie in the sky. If it could be done without creating massive boondoggle, why hasn't it been done already?

2. This means giving up on a solution that genuinely enables users and accepting one that merely keeps the Internet viable.

 

Strategy F.

Do nothing.

Major criticisms:

It costs "everybody else" a grand total of at least $6000 per year for each prefix you announce. [Herrin BGP Cost] When we give away that $6000 of consumables for free, of course we're gonig to have a "tragedy of the commons" problem.

 

Addendum:

Item 1. Additional causative  factors worth mentioning:

Because even a distant route may present a better priority in one coreward direction than another, it's not possible to suppress distant routes. As a result, every router in the core must carry all routes.