This document attempts to look at abstract architectural issues associated with the BGP routing table growth problem while avoiding engineering details as requested by the Routing Research Group chair.
The root cause of the routing table growth problem is that we're using the same element (IP address) to express three distinct concepts:
Changing a host's GUID is often hard. Client systems which provide no services to the network can change because nothing needs to find them but changing a server's identity requires changes in many other hosts' references which are unwieldy or impractical to make.
Changing a SID terminates the communications session. This makes address changes unsuitable for use in multihoming, even where the multihoming is trivial, such as a home user with two DSLs.
As a result, the GUID strongly resists all change while the SID resists change that must complete shorter than a typical communication session. Because the LOC is bound to the GUID and SID, the LOC strongly resists change even when you desire to move the affected host to a new location in the network.
Also, hosts find it very difficult to present two LOCs from separate parts of the topography graph. In addition to the problems above, packets from such hosts would need to be "policy routed" by the source address until they reach a transit-free network level in order to preserve security-required source filtering, a function not inherent to modern dynamic routing protocols. As a result, multihomed systems generally connect distant portions of the network topography instead of presenting a small number of aggretatable stubs.
Thus we end up with a great deal of duplication where more than one LOC holds the same spot in the topography (disaggregation and fragmentation) and trivial multihoming forces us into a general graph topography with very limited hierarchical aggregation of the LOCations that routers route by.
Do local routing by the address (which functions as GUID, SID component and local LOC) but have each packet flow through an encoder which attaches a second Remote LOC (RLOC) before the packet enters the internetwork core. Do not route by the address in the core; route by RLOC. Limit RLOC routing in the core so that only service providers (ISPs) with significant interconnection have their own RLOCs. Fewer than 10,000 such "core ISPs" exist today and the number is growing much more slowly than the routing table overall. Once the packet reaches the network identified by the RLOC, local routing by address takes over for final delivery. Distribute RLOCs through the core via a typical distance-vector or link-state routing protocol.
A1a. Each core ISP has one RLOC. The RLOC's existence and reachability is flood-propagated to the rest of the core.
A1b. Each core ISP has a small number of RLOCs for traffic engineering (TE). The RLOCs' existence and reachability is flood-propagated to the rest of the core.
A1c. Each core ISP has an aggregated set of RLOCs which it may hierarchically assign to customers downstream and/or disaggregate for TE. The aggregated RLOC's existence and reachability is flood-propagated to the rest of the core.
Methods for mapping the address to one or more RLOCs include:
A2a. GUIDs statically mapped to each RLOC are periodically pushed towards a central or distributed registry. The full list is periodically downloaded to the encoders which add RLOCs to the packets.
A2b. GUIDs dynamically mapped to each RLOC are pushed towards a central or distributed registry as they change. The registry pushes all incremental changes in near-real time to all encoders which add RLOCs to the packets.
A2c. GUIDs dynamically mapped to each RLOC are pushed towards a central or distributed registry as they change. Encoders request and briefly cache individual mappings from the registry as needed.
Failure handling approaches include:
Link failures in the Internet core cause the RLOCs to be rerouted with no change to the address-RLOC map.
A3a. RLOC encoders detect when particular RLOCs are no longer reachable at all and fall back on secondary RLOCs for a particular address. Encoders rely on active failure messages from some system in the RLOC-specified network to indicate that a host is no longer available via that RLOC, causing them to fall back on secondary RLOCs for that host.
A3b. Link failures which prevent parts of the RLOC's network from reaching a destination host or set of hosts it serves cause an external analysis element to make a dynamic change to the address-RLOC map, depreferencing or removing the affected RLOC. The external analysis element may be under the control of the end-user destination network, the RLOC network or a third party under contract to one of them.
Compatibility approaches include:
A4a. New IP protocol. Not compatible with IPv4 and IPv6.
A4b. Modified IP protocol. Not compatible with deployed IPv4 and IPv6.
A4c. Standard IPv4 and IPv6 packets are encapsulated in a tunnel packet while they transit the Internet core. Path-MTU issues are addressed by setting an Internet-wide maximum packet size enforced by the encoders and assuring that all core links support that size.
A4d. Standard IPv4 and IPv6 packets are encapsulated in a tunnel packet while they transit the Internet core. Path-MTU issues are addressed by returning packets which breach the MTU while in the core back to the encoder who must act as a proxy by returning a sensible packet-too-big message to the originating host.
A4e. The IPv6 address space is partitioned into end-user address space and Internet core address space. The address to RLOC map is symmetric. Part of the IPv6 end-user address is swapped for the RLOC when the packet enters the Internet core and then restored when it leaves the Internet core. Use a different A4 variant for IPv4.
A4f. The IPv6 flow label or some other component(s) of the IPv6 header are used to contain the RLOC. The flow label is set before the packet enters the core. Non-local packets are routed based on the flow label. Use a different A4 variant for IPv4.
A4g. Steal bits from other functions in the IPv4 header (e.g. checksum) to make space for an RLOC. Discard those components and set the RLOC when the packet enters the core. Restore the original bits when the packet leaves the core. Use a different A4 variant for IPv6.
Core routing methods:
A5a. Distribute RLOCs through the Internet core via BGP.
A5b. Distribute RLOCs through the Internet core via a new distance-vector protocol.
A5c. Distribute RLOCs through the Internet core via a link-state protocol.
There don't appear to be any genuinely clean ways of implementing strategy A. Handling path-MTU is a usually problem since the packets in the core are different than the origin host would recognize. Extra bandwidth is consumed by the ITR figuring out whether the ETR is still available and functioning. Border filtering of source addresses becomes problematic. It's a mess.
Deployment may require heavy weight "for the public good" relays in the non-upgraded part of the Internet to facilitate migration.
Assign hierarchically aggregatable LOCs to every host. Assign multiple LOCs to each host such that in the network topography hosts appear as stubs in multiple locations instead of forming distant connections in the graph. Assign one aggregated set of LOCs to each core ISP where a core ISP is one which has at least half a dozen major transit or peering links. Flood-propagate the aggregated LOC's existence and reachability to the rest of the core.
Having reduced the network topology to something relatively close to a hierarchy, perform plain old hierarchical aggregation on the LOCs. Add and remove LOCs to each host dynamically during operation as needed to reflect changes in the nearby network hierarchies.
Attach source and destination LOCs when the packet leaves the host. Route by first source then destination LOC: move up the source network hierarchy until you can move laterally toward the destination LOC in a permissioned manner.
GUID to LOC maps are pushed from the host towards a distributed registry as they change. Hosts request and temporarily cache individual mappings from the registry as needed.
LOC variants include:
B1a. A hierarchically aggregated numeric LOC is dynamically assigned to each host from each upstream path.. Each router receives a supernet from upstream and assigns a subnet downstream. Link state changes in the coreward path are satisfied by renumbering instead of rerouting: the host abandons the LOC hierarchically associated with the old path. If a new path is available, the host acquires a LOC hierarchically associated with the new path.
B1b. A LOC is an administratively-assigned loose source route instead of a single address. The first address in the loose source route is a universally-known waypoint router. The last address is the final destination. Link state changes in the coreward path are satisfied by reroutng in the appropriate routing domain when possible. If rerouting is in the affected domain is not possible, the host abandons the impacted LOC.
B1c. Semi-hierarchical numeric LOCs are administratively assigned. Local reconnection during link state changes is accomplished with rerouting instead of renumbering.
GUID variants include:
B2a. Each host has a single numeric GUID to which the LOCs are attached. This GUID is used by the layer-4/5 and higher protocols to compose the SID.
B2b. Each service provided by a host has a globally unique, hierarchic character-string GUID to which the LOCs are attached. Clients initiating communication with that service negotiate a numeric SID which is unique only within the scope of that service.
1. Probably not compatible with UDP or TCP though B1a/c could be compatible with IPv6's layer 3. The replacement layer-4/5 protocols should also be coaxable to run on top of IPv4's layer 3 in the not-yet-upgraded part of the network.
2. How do firewalls work if the LOCs are constantly in flux in B1a?
3. How is theft of service avoided in B1b?
Suppress distant routes by aggregating them into sets expected to be available in a given direction. Because LOC reachability info is not flooded, the routing tables each router must deal with are relatively small.
C1. Geographic aggregation. All nodes within some geographic boundary are assigned the same LOC. Routers move packets to any adjacent router deemed to be "closer" to the LOC in question.
No one has been able to construct a protocol under strategy C without introducing constraints that are fundamentally incompatible with the Internet's economic model. For example, Geoag has been shown to have uncorrectable theft-of-service anomalies in networks as small as 8 autonomous systems and two geographic areas.
Use plain old BGP for the RIB. Algorithmically compress the FIB in each router.
D1a. Aggregate any adjacent routes that have the same next hop.
D1b. Insert a /0 route into the FIB which goes to the most popular next hop for all the routes in the RIB. Step to the /1 level. For each /1, if most of the routes in the RIB within that /1 go to a different next hop than the longest route above (the /0 route), add that /1 route to the FIB. Step to the /2 level. Repeat until all routes in the RIB go to the correct next hop in the FIB. Unrouted space is treated as "don't care": it will route wherever the algorithm happens to drop it and will rely on the TTL to take packets off the network.
1. The RIB can perhaps (maybe!) grow to up to an order of magnitude larger than the FIB before it hits the wall too. One order of magnitude doesn't gain us multihoming for SOHOs.
2. FIBs towards the edge should aggregate well with this strategy but there's no evidence to support a conclusion that they'd aggregate well deep in the core.
Make no routing architecture changes. Instead, create a billing system through which the folks running core routers are paid by the folks announcing each prefix to carry those prefixes. Let economics suppress growth to a survivable level.
E1a. Everybody pays the RIRs. the RIRs pay the router operators.
E1b. Private negotiation between parties.
E1c. Assisted private negotiation where router ops can offer standardized contracts to carry prefixes and prefix announcers can accept groups of identical contracts via an automated third-party payment system moving funds between the two easily.
1. Smells like pie in the sky. If it could be done without creating massive boondoggle, why hasn't it been done already?
2. This means giving up on a solution that genuinely enables users and accepting one that merely keeps the Internet viable.
Do nothing. (RFC 1887 § 4.4.1)
It costs "everybody else" a grand total of at least $6000 per year for each prefix you announce. [Herrin BGP Cost] When we give away that $6000 of consumables for free, of course we're going to have a "tragedy of the commons" problem.
Make everybody attach to only one ISP using IPv6 and the ISP's single set of PA addresses. (Actual result of RFC 1887 § 4.4.3)
Tried it. Wasn't accepted by the operations community because the IPv6 architecture makes renumbering every bit as hard as in IPv4 and the multihoming described in RFC 1887 § 4.4.3 does not appear to actually work.
Item 1. Additional causative factors worth mentioning:
Because even a distant route may present a better priority in one coreward direction than another, it's not possible to suppress distant routes. As a result, every router in the core must carry all routes.