next_inactive up previous


Department of Electrical Engineering and Computer Science Doctor of Philosophy in Computer Science and Engineering August 2002 August 30, 2002

M. Frans KaashoekProfessor of Computer Science and Engineering

Arthur C. SmithChairman, Department Committee on Graduate Students

Network Layer Support for Overlay Networks

John Jannotti


\begin{abstractpage}
<tex2html_file> ...

Acknowledgments

Many people have contributed substantially to this thesis -- technically, emotionally, or both.

I would like to thank my thesis committee, Hari Balakrishnan, Ion Stoica, and my advisor, Frans Kaashoek. Their interest and insight has proven invaluable. I would especially like to thank Frans for his encouragement throughout my years at MIT. His faith in me may be misplaced, but it has encouraged me nonetheless.

The PDOS group have shaped my views considerably, and I am richer for it. I would like to thank Robert Morris, Dawson Engler, Eddie Kohler, and David Mazières in particular. I've learned more from them than I can ever repay.

I would like to thank Cisco Systems. As a company, Cisco has been extremely understanding of my commitment to my work at MIT while I have also worked for there. As individuals, they have been be friends, and technically helpful. I would especially like to thank Jim O'Toole and Suchitra Raman for discussions related to this thesis.

I would like to thank Ann Hintzman. Although she served as proofreader for this thesis, to describe her contribution in such limited terms would be a profound understatment. She has supported me emotionally more than I have deserved, and I remain eternally grateful.

I would like to thank my family. They have loved me, supported me, and encouraged me. I hope I have made them proud.

Finally, I would like to thank all of the people that have distracted me, making this thesis take longer to write, but making my life brighter for their presence.


Contents


List of Figures


1 Introduction

Distributed applications would be well served by richer semantics than the network layer supplied by the Internet. Today's distributed applications have only one primitive from which they may build services: Internet Protocol (IP) Unicast, the best-effort, single source and destination delivery of datagrams. Unicast delivery allows a single network node to ask for a single packet of data to routed through the Internet to a single destination node. Additional semantics have been built on top of this single primitive. For example, Transmission Control Protocol (TCP) provides reliability and flow control.

Many applications, however, would benefit from additional services that are more difficult to build on top of IP Unicast delivery. Mission critical applications would like control over the way their packets are routed -- perhaps trading off resource usage for reliability by using multi-path routing [35]. Teleconferencing applications [20], chat rooms, and Internet broadcasting systems would benefit from efficient group communication. A stock ticker application might like to perform latency measurements over many paths to find a low latency path undetected by normal IP routing. A content distribution network would like to distribute and store data throughout the network [21,16].

One approach to addressing these needs is to build new network services into routers across the Internet. Generally this approach has two drawbacks. First, it may be inappropriate to add the necessary functions to routers that should remain fast and simple to ensure their continued availability as an important shared resource. Second, adding an important network service to routers is likely to support only those applications envisioned during the design of the service. For example, IP Multicast provides a single service model that is inappropriate for a number of multicasting applications. Efforts to revamp IP Multicast for reliability or for secure admission control require yet more modifications to routers.

1 Overlay networks

Overlay networks completely sidestep these two drawbacks. Overlay networks avoid the issue of ``dumb'' IP routers by performing packet routing and duplication in edge nodes. Many examples are described in Section 2.2. In these systems, cooperating servers throughout the Internet act as routers in an overlay network.

Figure 1-1 demonstrates an overlay network. Just as a physical network has a topology consisting of the nodes of the network and the links between them, an overlay network has a virtual topology, which exists by the agreement of the overlay nodes. Packets are transmitted only along the virtual links between the overlay nodes using the underlying unicast mechanism provided by IP.

Figure 1-1: An overlay network. Rectangular end-hosts and dashed links form an overlay network over the physical network of round routers and solid links.
\begin{figure}\centering\epsfig{file=overlay.eps, width=\graphwd}\end{figure}

In contrast to the Internet, in which routers are a shared resource that cannot be specialized for a particular purpose, the members of an overlay network may provide specialized services specific to the application at hand. An overlay-based multicast system can duplicate packets in the servers, a content distribution network can cache gigabytes of data, RON provides resilient routing by constant performance measurements among participating nodes.

We will use three metrics to evaluate how efficiently an overlay network is operating. Stress indicates the number of times that a semantically identical packet traverses a given link. In IP Multicast, stress never exceeds one. On the other hand, overlay networks could not hope to achieve such efficiencies because packets being forwarded by an edge node will traverse (at least) the node's local link twice. Stretch indicates the ratio of latency in an overlay network compared to some baseline, generally IP unicast or multicast. Misfires measures the number of times that duplicate packets are mistakenly sent to the same destination. These metrics are used to evaluate overlay performance with the primitives proposed in this thesis in Chapter 7.

2 Problem

The overlay network approach faces two important challenges. First, overlay networks operate at a disadvantage to router-based systems because of the physical location of their computational elements. The components of an overlay network are servers located at the edge of the network. This drawback is both a performance problem, packets going in and out of external servers increases stress and stretch; as well as a functional problem, overlay nodes are not in a position to observe network traffic that is not explicitly directed to them. For example, IP Multicast's mechanism for joining groups relies on the ability of routers to observe passing messages.

Second, it can be difficult to build virtual topologies that resemble the topology of the underlying network. It is beneficial for the virtual links of an overlay network to connect nodes that are well-connected in the underlying network. Choosing well-connected virtual links is akin to supplying a physical network with a higher bandwidth link-layer. It is also common to prefer virtual links that share as few underlying links as possible with other virtual links. This property leads to independent failures, and less duplicate traffic on underlying links. Unfortunately, it is very difficult to determine these characteristics today. Overlays have fallen back on wasteful and error-prone techniques such as continual bandwidth probes to learn about the underlying network.

3 Design goals

A single set of simple extensions to IP that support overlay networks would, at once, address the needs of a large variety of applications that would otherwise each require separate network support. Just as the ability to support multiple processes through virtual memory and supervisor mode is considered a first priority in processor design, the ability to support multiple virtual networks as overlays should be a design goal in wide-area network design.

The goal of this thesis is to propose simple extensions to the network layer that would allow the construction of simple, efficient overlay networks. The extensions should be beneficial to the large variety of research projects and service providers that are investigating systems based on overlay networks.

A number of design criteria can be drawn from the diverse needs of existing overlay systems. First, and most directly, overlay networks would benefit from pushing work toward the core of the network. End System Multicast exhibits this need most acutely because its participants are expected to be desktop machines -- no nodes are expected to be located at strategic network points. Yet every overlay system would benefit if routing and duplication could be directed by the end hosts, but performed by the appropriate routers, thereby allowing their packets to follow a more optimal path through the underlying network.

Second, all of these systems have difficulties constructing overlay topologies. RMX uses hand configuration. End System Multicast lacks scalability due, in part, to its topology generation algorithm. X-Bone and Yoid need help pruning an initially quadratic number of possible links down to a manageable size. Overcast consumes bandwidth to conduct constant network measurements.

A third, more subtle conclusion, is that the overlay nodes must retain complete control of forwarding when necessary. While it is important to provide a mechanism that allows overlay networks to obtain efficient forwarding, that mechanism must be fine-grained enough to allow the overlay to handle a particular link in a specialized way when necessary. For example, systems such as RMX would like to take advantage of automatic forwarding when possible, but when a link is congested they must retain the ability to perform their own forwarding so that they may transcode to a thinner format. Similarly, Overcast must be able to interpose its nodes in the forwarding mesh so that they may cache the forwarded data and replay transmission for new nodes.

In addition to these criteria, there are lessons to be drawn from the challenges that have faced other proposals for router modification, particularly IP Multicast. It is critical that new functions are incrementally deployable -- there must be benefits to deployment even when only a small portion of routers have been modified to support the new functionality. In addition, the benefits from deployment should be concentrated around the portion of the network in which deployment occurred. These benefits offer an economic incentive to early adopters.

In order to provide for incremental deployment, this thesis proposes mechanisms that are for optimization only. Overlay networks must be prepared to operate as if the primitives do not exist. When the primitives are available, the network will provide explicit signaling to the application, allowing it to avoid work that has been performed in the network.

Finally, it is important to keep proposed enhancements small so that future modifications to their behavior are unlikely to be required, and generally useful so that greater utility might someday be obtained by using then in novel combinations.

4 Contributions

The contributions of this dissertation are:

5 Outline

The remainder of this thesis is divided into three parts. The first part consists of Chapters 34, and 5. These chapters describe the Packet Reflection and Path Painting primitives, with Chapter 5 serving to provide detailed implementation information. The second part is Chapter 6, which presents realistic applications of those primitives in various overlay networks. Finally, Chapter 7 examines the performance of the primitives in many scenarios, including limited deployment and asymmetric networks, evaluating the ability of the primitives to reduce stress and stretch while creating few misfires and consuming limited resources on the participating routers.


2 Related Work

Related work can broadly be divided into three areas. First, IP Multicast and systems built on IP Multicast. IP Multicast provides a group communication primitive for IP, and a number of systems have attempted to build additional semantics on top of IP Multicast. The efforts have seen limited success, in part because of IP Multicast's limited deployment, and in part because IP Multicast provides a difficult base upon which to build -- IP Multicast is a monolithic primitive combining all needed mechanisms to provide a single high-level abstraction.

A second set of related projects are overlay networks. These systems seek to avoid the deployment issues of IP Multicast by providing network abstractions purely in end hosts. This thesis complements these systems by providing protocols to allow these systems to approach the efficiency of router based network services.

The most closely related work is an approach to network-layer extension that is quite similar to the approach advocated in this thesis. In Section 2.3 we examine a active networks based system that builds multicast from unicast forwarding and ephemeral state. That approach yields a primitive that resembles reflection, as well as a few other primitives that may also be of use in general overlay networks.

1 Multicast specific

Multicast, with various semantics, is a fertile area of research in the networking community. Numerous proposals for enhancing the network layer to support multicast in one form or another have been made over the years. This section examines the most important of these proposals. In each case, the single greatest difference between the goal of these ideas and the purpose of this thesis is breadth. The primitives presented here are intended to allow overlay networks to accomplish much more than a single form of multicast communication.

1 IP Multicast

IP Multicast [12] (IPM) is a low-level network primitive that provides efficient communication between multiple nodes. The basic unit for communication is that of the group, which corresponds to an IP address chosen from a particular reserved range of all IP addresses. In IP Multicast, nodes register their interest in a particular group by sending graft messages toward a designated rendezvous point. As graft messages propagate to the rendezvous, the protocol builds a spanning tree which includes all interested nodes and the routers that connect those nodes. Graft messages may be suppressed when they reach the existing tree, they need not reach the rendezvous. When a node sends a packet to an IP Multicast group, routers forward it to all interfaces that correspond to edges in the spanning tree.

IP Multicast, taken as a whole, suffers from a number of drawbacks that have hampered its widespread acceptance. First, it requires router support. The spanning tree for a group cannot cross regions of the network that are not running IPM enabled routers. This hurdle creates little incentive to be ``the first on the block'' to roll out IPM support because IPM's utility is greatly diminished without widespread adoption.

Second, IP Multicast's admittance control and, by extension, its security, are weak. Any Internet node (assuming deployment of IPM routers) can send or receive packets for any IP Multicast group. Proponents argue for end-to-end encryption and authentication, but such a solution would do little to avoid denial of service attacks that rely only on the volume of data, not its authenticity. Even if unauthenticated packets were disposed of at end-hosts, IP Multicast's willingness to send copies of ``junk'' packets to all members of a group presents a ripe target for Internet trouble makers. In fact, recent experience with the Ramen Worm [41] shows that IP Multicast is so susceptible to bandwidth wasting attacks that it can be taken advantage of by mistake. The Raman Worm probed IP addresses at random, some of which were IP Multicast addresses. The Worm was first detected by the enormous growth in Multicast traffic it accidentally consumed as it spread.

Third, and most subtly, IP Multicast is an awkward primitive on which to build other network services. For example, attempts have been made for years to describe a simple, scalable reliable multicast protocol on top of IP Multicast [15,24,25,26,29], yet there is no consensus that any single protocol is appropriate for deployment in the world's routers. One problem plaguing such attempts is that there is no single agreed upon semantics for reliable multicast. One application might wish to use flow control to slow the data stream down to the bandwidth of the slowest link. Another might want to stream data as quickly as possible to nodes that have fast connections, while trickling in to those that do not. Yet another might wish to relax ordering constraints in favor of receiving most of the data in near real-time and the rest later. Attempting to support all of these semantics in router based software calls for a consensus that each is important enough to standardize upon, and risks destabilizing core routing functionality as increasingly complex software is added. Instead, more complex semantics should be provided by end-hosts, just as TCP provides flow-control and reliability.

2 SSM and Express

Source-specific Multicast [18] (SSM) and Express [19] avoid some of the problems of standard IP Multicast by offering a simplified, single-source service model. In both systems, end-hosts subscribe to a group that is, in part, named by its source. This approach greatly simplifies the join protocol because it implicitly names a rendezvous point for join messages -- the source.

Besides this simplification, these single-source approaches have additional benefits compared to standard IP Multicast. First, the namespace of possible group names is vastly expanded and managed more easily. In standard IP Multicast, 28 bits of address space have been set aside to be used as Multicast group addresses. Management of that space, however, is a difficult problem. In single-source approaches, the IP address of the source is a portion of group identifier, which allows each Internet host to manage its own namespace.

In many applications, the limited power of a single-source approach is also a benefit. Applications that are fundamentally one-to-many benefit from the simpler admission control and security of a hard-wired single source. Only the source of the group may send packets to the group, so the flooding attacks of standard IP Multicast are impossible. Additionally, with only one source, simple cryptographic schemes for admission control are possible. The sender encrypts and signs all packets. Only those nodes possessing the group key may decrypt and authenticate the packets.

Finally, as we will see in the next section, a single-source models lends itself more easily to reliability extensions.

3 Repliers, PGM, and BCFS

Papadopoulos' Replier scheme [31], PraGmatic Multicast [14], and the BreadCrumb Forwarding Service [47] all offer support for reliable multicast applications. As each system slightly extends or modifies the previous, they work in very similar ways.

All three schemes support reliability by allowing ``downstream'' receivers to contact ``upstream'' receivers for lost packets. It remains the responsibility of the application residing at the receivers to buffer and retransmit the lost packets. At each router, a particular interface is specially designated to receive request packets when they are sent ``upstream''. This interface is the replier link. In a router that is not acting as a branch point, the replier link is the upstream link. In a router that is a branch point, one of the downstream interfaces is chosen. When an end-host sends a request packet upstream, it will be forwarded along these replier links (a packet coming from a replier link is sent upstream). As a result, all upstream packets will flow through a hierarchy of repliers.

In this way, multiple acknowledgments, sent upstream, may be coalesced when they arrive at a designated replier. That replier may respond to the entire set of receivers using a directed multicast down the distribution tree.

These schemes have the usual drawbacks of network-layer multicast schemes in comparison to the primitives presented in this thesis -- they are not incrementally deployable nor may they be used outside of a multicast context.

4 REUNITE

REUNITE [38] is a multicast protocol that multicasts using recursive unicast distribution trees. In other words, as in an overlay network, packets are transmitted from point to point using traditional IP unicast. Unlike overlay networks, REUNITE uses point to point unicast transmissions mainly between routers, only involving end-hosts at the edges of the tree.

REUNITE accomplishes a number of things that the primitives proposed in this thesis also hope to achieve. It is a fairly simple protocol, it is incrementally deployable, and state requirements can be manages explicitly by overloaded routers. REUNITE, however, is aimed strictly at supporting multicast. Path Painting and Packet Reflection are intended to support a much broader range of applications.

5 Heterogeneous multicast

Many researchers have attempted to extend basic IP Multicast to support heterogeneous receivers in a multicast group. In a large IP Multicast group there is likely to be a wide distribution of available bandwidth between any two members of the group. This diversity presents an unenviable tradeoff for senders. On the one hand, a source might decrease its transmission bandwidth to the level of the most poorly connected receiver. This choice would unnecessarily decrease the quality of transmission for nearby receivers, but would result in less packet loss and congestion in the rest of the network. Alternatively, senders might ignore the problem, and transmit at full speed. Nearby receivers would receive high-quality feeds, but the volume of traffic would cause congestion at slow links. Furthermore, though poorly connected receivers might receive the same number of bytes in either scenario, the reconstituted signal is likely to be of higher quality in the first because the source carefully selected the best information to send to bandwidth-starved hosts rather than allowing the network to drop packets at random.

Receiver-driven Layered Multicast (RLM) [27] presents one solution to the problem and Thin Streams [46] refines that idea. In RLM, senders encode transmissions in a number of layers. The lowest layer is a low-fidelity encoding of the transmission. Higher level layers contain extra data which allows a more faithful reproduction of the stream at the receiver. Sends transmit each layer in a separate IP Multicast group, allowing receivers to independently subscribe to each layer.

RLM receivers determine the appropriate number of layers to subscribe to by conducting join experiments. In a join experiment a receiver subscribes to the next higher layer and determines whether its subscription creates congestion by monitoring loss rate over a short period called the decision time. If it does, it unsubscribes. Failed join experiments contribute to network congestion though, so RLM scales back the frequency of an individual node's experiments with the size of the multicast group. To allow nodes to learn the appropriate level of subscription rapidly in spite of less frequent experiments, nodes learn by observing the network during other nodes' experiments. If the network becomes congested during another nodes experiment on say, level three, of a presentation, then the observing node can act as though it conducted a failed level three experiment itself.

Thin Streams offers a number of improvements over RLM while retaining the same basic idea of allowing receivers to determine the appropriate level of traffic for a given presentation by subscribing to the correct number of streams in a presentation. First, presentations of Thin Streams consist of many more layers than those advocated by RLM. In an effort to prevent packet loss during join experiments, Thin Streams pares down both the bandwidth of individual layers and the length of the decision time. This design allows the network to buffer an entire join experiment if it causes congestion. Join experiments can now be run without causing packet loss in unrelated traffic.

Obviously, packet loss can no longer be used as metric to measure the success of join experiments. Instead, Thin Streams uses a technique developed in TCP Vegas [8]. A node compares the expected bandwidth of a new stream to the actual bandwidth being observed. If the new layer is arriving slowly it reflects growing buffering in the network. The join experiment can be judged a failure before packet loss occurs.

RLM and Thin Streams have a few significant drawbacks. First, they implicitly limit the multi-source IP Multicast model to a single source. They would not allow two groups of researchers, at MIT and Berkeley, to teleconference in a way that allows MIT researchers to see each other in high-fidelity and Berkeley researchers to see each other in high-fidelity. Such scenarios would require a set of IP Multicast Groups for each source, which would greatly compound the remaining problems.

Second, they require encoder research to develop codecs that decompose well into a layered architecture. Third, they use extra IP Multicast groups. Using extra groups contributes to an existing problem area for IP Multicast, the large size of the IP Multicast routing tables that must be present in every router. Large tables require large amount of expensive fast memory and slow all lookups when using a hash-based lookup scheme. In addition, IP Multicast's small address space is strained even more by using additional groups for each presentation.

Video Gateways [1] are designed to solve the same problems as layered multicast, while avoiding the problems of the layered approaches. What makes Video Gateways even more interesting though, is that they address an additional problem. On the Internet today, and for the foreseeable future, IP Multicast cannot be assumed to reach all hosts. Some networks have deployed IP Multicast, some have not. This situation has existed for some time and it is difficult to predict when it might change.

Video Gateways are application-level proxies that connect two separate IP Multicast groups using unicast. Each pool of nodes contains a Video Gateway that subscribes to the local IP Multicast group. The Video Gateways then communicate using IP unicast to bridge the content of the two groups. Because such bridging is under application control it may perform additional functions if desired. For example, a Video Gateway connecting a group using Motion-JPEG with a group using H.261 by transforming data between the two formats has been demonstrated. The Video Gateways perform the necessary transcoding in each direction so that all members may source and sink data with all other nodes in both groups.

Video Gateways address all of the problems of layered IP Multicast, and even allow some new capabilities that layered systems do not even attempt to provide (such as the ability to transcode data formats and bridge unicast networks). They have a downside, however, that is likely to prevent widespread use of Video Gateways for bridging large groups with many segregated IP Multicast sessions. Video Gateways are statically configured entities. Each pair of gateways connects exactly two IP Multicast groups, in exactly the way that a human operator has configured the two gateways. This manual setup becomes untenable as the desire to connect more groups, on an ad-hoc basis, is considered; the human effort involved in configuring tens of gateways in order to video conference would be considerable. Nonetheless, the fundamental architecture is appealing. A network of forwarding nodes, similar to Video Gateways, that organize themselves to bridge between segregated multicast groups would be a logical extension.


2 Overlay networks

A number of research groups and service providers are investigating services that could be described as a network of Video Gateways. Often referred to as ``Overlay networks'', these services consists of many application-level nodes using unicast to move data to multiple recipients.

Some overlays are extremely generic, such as RON [3], the Dynabone [40], and Yoid [16]. These systems exists solely to provide an overlay network with ``better'' properties than the underlying network, such as lower latency or greater resistance to DOS attacks. In the following sections, more specialized overlays are examined.

1 Application-level multicast

Application-level multicast (ALM) systems include RMX [10], End System Multicast [20], and Overcast [21]. All share the goal of providing the benefits of IP Multicast without requiring direct router support or the presence of a physical broadcast medium.

RMX focuses on real-time reliable multicast. As such, its focus is on reconciling the heterogeneous capabilities and network connections of various clients with the need for reliability. Therefore RMX focuses on semantic rather than data reliability. For instance, RMX can change high resolution images into progressive JPEGs before transmittance to underprovisioned clients. In that sense, RMX is the closest relative to Video Gateways. Transcoding is an excellent example of function that is best performed by application-level servers, providing direct evidence that an architecture that involves external servers may be preferable to enhancing the network layer.

End System Multicast provides small-scale multicast groups for teleconferencing applications using only the group members to duplicate packets. End System Multicast's chief drawback is its limitation to small scale groups. The primitives proposed here could provide a significant benefit to ESM because reflection can duplicate packets in the core of the network, rather than relying on the receivers alone.

Overcast is an ALM system designed to support high-bandwidth, single-source applications. Overcast's contribution is two-fold. First, it demonstrates the benefits of an ALM system by providing features that IP Multicast cannot support, such as on-demand access to content. Second, simulated evaluations of Overcast's tree building technique provide evidence that reasonable trees can be built under application-level control. Inefficiencies remain, however, because packet duplication cannot occur in routers and online measurements must be used extensively and continuously in order build and maintain the distribution tree.

2 Peer-to-peer

Many Peer-to-peer (P2P) systems can also be considered overlay networks. In these systems, large numbers of end-hosts cooperate toward some end. For many such systems, such as Gnutella [17], and Freenet [11], that goal is the widespread availability of content. For others, such as CAN [32], Chord [37], Pastry [33], and Tapestry [22], the goal is a more generic lookup service. Still others, such as Mix Nets [9], attempt to provide an overlay network in which cryptographic techniques provide anonymity for participants (a secondary goal in many of the previously mentioned systems as well).

Painting and Reflection are complementary to these systems, though some P2P systems will be unable to derive benefits from both primitives. For example, in systems such as Chord, important properties of the system are based on the random path a lookup request takes before reaching its destination. Further, once the lookup is completed, no more packets will be expected to take that path. In such cases, Packet Reflection cannot be used to optimize the lookup.

On the other hand, Gnutella might take advantage of Packet Reflection when sending responses to requests. These responses may be quite large, and they are pipelined through the set of nodes through which the request came. The nodes along that path could use Packet Reflection to speed the delivery of the responses.


3 Active networks

In active networks [39,45,34,44], applications can download new protocols and code into routers, allowing for rapid innovation of network services. This thesis avoids many of the hard problems of active networks by focusing on specific functionality; it does not need to address the problems created by dynamic downloading of code, sharing resources among multiple competing applications, or standardizing a programming platform. Despite this major philosophical difference, some researchers have used an active networks to describe a set of primitives [43] with similar power to the primitives proposed here.

Therefore, a detailed comparison to this approach is appropriate. The Kentucky approach presents a primitive dup that closely resembles packet reflection, and a number of primitives (in an active networks framework) that accomplish goals similar to packet reflection.

A strength of their approach is that it handles asymmetric routes better than painting and reflection. However, this appears to come at the cost of some scalability - part of their join mechanism involves an echo packet reaching a central rendezvous point before being returned to the sender. Very large groups could overwhelm the rendezvous point's network.

A strength of the primitives presented here is that they have been designed to operate correctly in the face of route changes in the underlying network. dup is insufficient to handle these cases. For example, if a route change causes the router maintaining node A's dup request to stop receiving the group's packets (because they are now taking a different path), A will lose the packets. With reflection, duplications are explicitly confirmed to a node's parent, using success tags. If the duplication point is bypassed, the parent knows it. It can then perform the duplication on its own and make a new reflection request.

A final difference is largely philosophical. Dup and friends are presented as a few of the infinity of possible primitives that could be constructed with an active networks approach. While this approach brings the promise of arbitrary programmability, it carries the baggage of complexity and difficult to address security problems. The primitives presented in this thesis are simple, have enough power, and the security implications can be well understood.


3 Packet Reflection

In an overlay network each node carries out explicit unicast communication with its neighbors in the topology. When one overlay node forwards packets between two other nodes, that packet is transmitted on the same link multiple times as it reaches the overlay router and is re-emitted toward the final destination. The links near the overlay router will have a stress of two, and the stretch of the packet will exceed one as time is wasted while the packet approaches and then leaves the overlay router.

When multicasting on an overlay network, the stress problem is exacerbated. The forwarding node duplicates packets, forcing semantically equivalent packets to be transmitted on the same link, in the same direction, multiple times. In such cases, some links will have a stress equal to the number of packet duplications plus one (one packet arrives, each duplicate leaves). For example, Figure 3-1 shows a simple application-level multicasting tree in one link, $R_4 E_1$, has a stress of three another link, $R_3 R_4$, has a stress of two.

Stretch is also a problem in Figure 3-1. $E_2$ receives packets only after they have traversed eight links, rather than the four of a direct unicast. $E_3$ must wait for six traversals instead of four. Assuming unit latencies, they would experience a stretch of 2.0 and 1.5 respectively.

In the remainder of this chapter, we will see how packet reflection can reduce stress and stretch in these situations. Most examples will describe a multicasting overlay system both because multicast is a common application of overlay networks and because unicasting overlay networks can be seen as a degenerate case of a multicasting system.

Figure 3-1: An application-level multicast distribution tree. Packets are sent from the source $S_1$ to end host $E_1$ through routers $R_1$, $R_3$, and $R_4$. $E_1$ sends the packets on to $E_2$ and $E_3$.
\begin{figure}\centering\epsfig{file=forwarding.eps, width=\figwd}\end{figure}

1 Reducing stress with reflection

IP routers perform a simple operation on most packets: when a packet arrives, lookup the destination IP address in a routing table, and use the resulting entry to choose an interface on which to emit the packet. An overlay node, acting as an overlay router, performs a similar operation. Overlay routers determine the overlay address and forward the packet, using IP unicast, to the next overlay node. The next node is, again, determined by consulting a routing table. In order to perform multicasts, these routing tables may contain multiple next hops for a single overlay destination address.

The stylized nature of this operation suggests that it could be succinctly described so that another node could perform the operation instead. An end host may ask ``the network'' to perform packet reflection on its behalf. In Figure 3-2 end host $E_1$ directs a reflection request toward $S_1$, which takes it to router $R_4$. This optimization alleviates the excess stress on link $R_4 E_1$. In addition to performing requested reflections, routers continue to forward packets using their normal forwarding rules. Thus, $E_1$ will continue to receive all packets addressed to it.

Figure 3-2: End host $E_1$ avoids overloading link $R_4 E_1$ by sending $reflect(S_1\rightarrow E_1, 1, \{(E_1\rightarrow E_2,0),(E_1\rightarrow E_3\},0))$ to $R_4$. $R_4$ will now duplicate packets for $E_1$ from $S_1$, sending copies to $E_2$ and $E_3$. In both duplicates the source will be $E_1$.
\begin{figure}\centering\epsfig{file=split-request.eps, width=\figwd}\end{figure}

The format of a reflection request is denoted $reflect(S\rightarrow D, T,
\{(S_i\rightarrow D_i, t_i\})$. Such a request will be addressed to $S$, and rely on routing symmetry to direct it to routers that can fulfill the request. This notation should be read as, ``When a reflectable (IP Protocol = REFLECT) packet arrives matching the inbound flow identifier $S\rightarrow D$, duplicate it once for each outbound flow identifier $S_i\rightarrow D_i$. Rewrite the source and destination in each duplicate and emit each, tagged with the associated $t_i$. Emit the original packet tagged with $T$.'' Tags are used to ensure that nodes know when their reflection requests have been honored and are described in detail in Section 3.3. The operation of a router receiving a reflection request and handling packets that match the request are formalized in Rules 1 and 2.

Rule 1   Upon agreeing to a reflection request, the router shall install a reflection table entry as per the request. An entry is keyed by source and destination addresses and ports. A table entry contains a success tag, and a number of copy entries. Each copy entry contains new source and destination addresses and ports, and a tag for each copy. (In future rules, ``address'' will be shorthand for ``address and port''.)

Rule 2   When forwarding an IP packet, if a reflectable packet matches a reflection table key, make one copy for each copy entry in the table entry. Each copy receives new source and destination addresses and a tag. The original packet is tagged with the success tag of the reflection table entry. All packets, including the original, are forwarded by normal unicast rules.

Figure 3-3: Router $R_4$ avoids overloading link $R_3 R_4$ by sending $reflect(S_1\rightarrow E_1, {\bf 2}, \{(E_1\rightarrow E_2,0))$ to $R_3$. Note that the tag has been incremented, and one copy has been eliminated from the original request from $E_1$.
\begin{figure}\centering\epsfig{file=save-return.eps, width=\figwd}\end{figure}

As seen so far, packet reflection allows end hosts to avoid wasted packets on the link between themselves and the nearest router to them. Although this optimization is useful, greater utility is achieved when routers themselves make reflection requests. A router that has been asked to reflect a packet out the same interface on which it is received may pass on a similar reflection request. In Figure 3-3, router $R_4$ takes advantage of packet reflection by propagating part of its responsibility to reflect packets. By pushing a request similar to $E_1$'s original request on to $R_3$, $R_4$ avoids work and (more importantly) alleviates the stress on link $R_3 R_4$.

Rule 3   Upon receiving a reflection request, a router shall mark each copy entry NORMAL. NORMAL entries shall be treated as in Rule 2. The router may make a new reflection request that asks for some copies to made on its behalf. Requested copies are marked DEMANDED. The success tag associated with the new request is recorded in the reflection table entry for the request. The tag is the entry's expected tag, which is distinct from the success tag to be written into the original packet before forwarding. When a packet arrives with the expected tag, DEMANDED copies are not made.

Of course, if $R_3$ performs the reflection to $E_2$, $R_4$ should not. Tags allow $R_4$ to know whether a given packet has already been reflected by $R_3$. This mechanism is formalized in Rule 3 and described in more detail in Section 3.3.

2 Anatomy of a reflection request

Outside of this section, this thesis refers to reflection ``requests''. A request, however, actually consists of a three packet handshake. An ASK packet initiates the request. It contains a list of copies that the requester would like made on its behalf. Each copy represents a packet that should be generated in response to the observation of a copy meeting the reflection's match criteria. When the ASK reaches a router that supports reflection, it responds with an OFFER. An OFFER contains a subset of the copies requested in the ASK, and a cryptographically generated nonce. The subset of copies are those copies that the router is actually willing to make on the requester's behalf. Finally, the original requester finishes the request with a DEMAND packet which contains some subset of the copies from the OFFER, and echoes the nonce back to the router that made the OFFER.

All reflection packets have the structure depicted in Figure 3-4. Many fields have the same meaning in all packets. For example, the Source Address/Port and Destination Address/Port always refer to IP address and ports of the packets that will match the reflection request. The Success Tag is always the value to write into packets that match the reflection request after fulfilling the request. Tags are described in detail in Section 3.3. The Copy Count is the number of five-tuples to follow. Those tuples describe the copies that should be created in response to observing a packet that matches the Source and Destination fields. The tuples contain new source and destinations for the copies as well as a tag to write into those copies.

Figure 3-4: Detailed contents of reflect packets.
\begin{figure}\centering\begin{tabular}{\vert l\vert}
\hline
Reflect Request\\
...
...py Destination Port 1\\
Copy Tag 1 \\
... \\
\hline
\end{tabular}\end{figure}

1 Ask

A reflection request begins with an ASK packet. The IP Destination of an ASK will always match the Source Address, because ASK packets are sent toward the source of packets they are intended to match. The ASK is generally acted upon by a node other than the node named by the IP Destination. Instead, the first reflection-capable router along the way will intercept it.

The Nonce is unused in an ASK. As seen in the following sections, the Nonce is a challenge that is created by the router forming an OFFER.

In an ASK the copy information is speculative. It contains all copies that the asker would like handled for it. The offerer will decide which of those requests it will handle.

2 Offer

When the ASK arrives at a router capable of reflection, an OFFER packet is calculated and returned. The OFFER packet differs from the ASK in three places: IP Source and Destination, Nonce, and Copy Information.

The IP Source and Destination in an OFFER indicate the IP address of the router making the offer, and the IP address from the Destination Address field. An OFFER packet is not necessarily addressed (at the IP level) to the asker. Instead, the asker must intercept the OFFER on its return path. This complication is introduced in order to increase the security value of the Nonce.

The Nonce field contains a cryptographically generated integer. The nonce will be echoed back in the DEMAND packet, confirming that the demander controls the IP address in question. The nonce should be the result of a one-way hash function run on the match criteria of the request and a router secret. This technique, based on the idea behind SYN cookies [7], will allow the router to confirm a nonce without storing it, preventing a denial of service attack. The nonce prevents the use of reflection to intercept communications that could not be intercepted by other means. That is, an accurate nonce in a forged DEMAND implies that the attacker can already intercept packets addressed to the victim.

The copy information in an OFFER lists the subset of copies from the ASK that the router is willing to service. The router might determine this subset in any way it chooses, though Rule 4 is a guideline. Generally, a router should be willing to make a copy if, when consulting its own IP routing table, it determines that the copy would not be emitted on the same interface as a packet that meets the matching criteria of the ASK. Rule 4 means that a router would make a copy if doing so would decrease stress on one of its own links.

Rule 4   A router shall offer to perform all copies in a reflection request which will require that the copy and the original packet be emitted on the same interface.

Alternatively, Rule 5 is a recursive approach to determining what copies a router should offer. In this formulation, all ASK packets would recursively propagate to the source, then a series of OFFER packets would propagate back to original asker. Finally, DEMAND packets would proceed toward the sender. The recursive approach will push requests further into the network at the cost of more network traffic during setup.

Rule 5   After receiving an ASK, a router shall begin to pass along the reflection request (as per Rule 3) before offering a response. The router shall then offer to perform all copies implied by Rule 4 or offered by the next router.


3 Demand

A DEMAND is the final phase of a reflection request, and is made by the same node that sent the ASK. A DEMAND is sent in response to an OFFER, and contains the nonce of the OFFER. It will also usually contain the same copy information as the OFFER. However, this property is not a hard and fast rule. The demander may choose to eliminate some copy requests and, in some cases, must do so in order to maintain the correctness of success tags. A rule for constructing DEMAND packets is deferred until tags are described.


3 Tags confirm reflection

Tags are a network feedback mechanism that allows end-hosts to determine when their reflection requests have been honored. They are crucial for correctness, as they allow end-hosts to perform necessary duplications when the network does not.

The propagation of route reflection requests toward the source of the match criteria assumes symmetry in IP routing. Under this assumption, the request, wherever it ends up, will lie on the path from the source to the destination of the inbound flow identifier. This assumption may not hold, however, in some situations. In those cases, asymmetric routing paths exist between two hosts. This asymmetry could allow a packet to arrive at an end-host without passing through the router that would be expected to perform reflection on it.

A similar concern is that routes in the underlying network change as a result of broken links or configuration changes. A reflection request may have propagated to a router that, after a route change, no longer sees the packets that are to be reflected. Figure 3-5 demonstrates this problem.

For correctness in these scenarios, packet reflectors must signal the original destination node when a packet has been successfully reflected. In the absence of such confirmation, the requesting node would perform the reflection on its own, as if packet reflection had not been requested. In addition, the requester might try to reestablish the reflection request. If the problem was caused by a new route, a router on the new path might accept the request.

Figure 3-5: A new physical route is brought online between $R_1$ and $R_4$, bypassing the reflection request in $R_3$. $R_4$ notices that it receives untagged packets and performs both copies on its own. $E_1$ is unaffected.
\begin{figure}\centering\epsfig{file=route-change.eps, width=\figwd}\end{figure}

To implement this signaling mechanism, packet reflection requests contain a tag, as do all packets forwarded by the reflection mechanism. When a router performs a reflection, it writes the value of the tag for that reflection request to the original packet, which continues on its way to the original destination. If a packet is received without the appropriate tag, it is clear that duplication did not occur, so the receiver performs the duplication as if it had never made the reflection request.

There are two important phases in the use of success tags in reflection requests. In the first phase, a requester must choose appropriate success tags for reflection requests. We will see that as a request propagates from router to router the success tag must be changed to avoid ambiguity. The second phase begins once reflection requests have been established. When a router receives a packet for which it has agreed to perform reflection, it must either perform the entire reflection itself or determine that some part of it has already been done, and perform the rest.

1 Establishing tags

The most important goal to keep in mind with respect to the choice of success tags is that the meaning of particular tag must be unambiguous. A secondary goal is that the success tags in successive requests remain unchanged. As we will soon see, certain situations allow the same success tag to be used in successive requests without creating ambiguity. The advantage of doing so is that the effects of route asymmetry and route changes are mitigated.

We begin by assuming that edge hosts use a success tag of one when initiating a request, codified in Rule 6. For example, in Figure 3-2, the first request, $reflect(S_1\rightarrow E_1, {\bf 1}, \{(E_1\rightarrow
E_2,0),(E_1\rightarrow E_3\},0))$, requests that success be indicated by writing a 1 to the original $S_1\rightarrow E_1$ packet. Zero is reserved to indicate that no reflection has occurred yet. Reflectable packets begin with their success tag set to zero. The important question is how routers should choose success tags when propagating their reflecting responsibility.

Rule 6   An end-host shall use a success tag of $one$ in its reflection requests.

Changing success tags is sometimes a necessity. In Figure 3-3, $E_1$ has requested that two copies be made whenever it receives a packet from $S_1$. $R_4$ agreed to perform those copies, but went on to request that $R_3$, in fact, should make one of the copies. The two requests must be:

\begin{displaymath}reflect(S_1\rightarrow E_1, 1, \{(E_1\rightarrow E_2,0), (E_1\rightarrow E_3,0)\})\end{displaymath}


\begin{displaymath}reflect(S_1\rightarrow E_1, {\bf 2}, \{(E_1\rightarrow E_2,0)\})\end{displaymath}

Following this request, the $E_1\rightarrow E_2$ copy entry in $R_4$ will be marked DEMANDED by Rule 3.

The second request must change the success tag in case the $S_1\rightarrow E_1$ packet emitted by $R_3$ ever makes it to $E_1$ without passing through $R_4$ (due to a route change or a new asymmetric path). $R_4$ is ensuring that $R_3$ will not confuse $E_1$ with a claim that is not true. A success tag of 1 would indicate that both copies have been sent, so $R_3$ must use a different success tag after making only one copy.

Figure 3-6: A slightly different underlying topology allows $R_4$ to propagate its entire responsibility to $R_3$. In this situation, $R_3$ can make make its request using the same success tag that it has been asked to use.
\begin{figure}\centering\epsfig{file=pass-through.eps, width=\figwd}\end{figure}

Changing success tags, however, is not always a requirement. Compare Figure 3-6 to Figure 3-3. In Figure 3-6 $R_4$ has agreed to the same request from $E_1$, but was able to go on to request that $R_3$ perform both copies on its behalf. In this case, there is no chance of confusion. If the $S_1\rightarrow E_1$ packet emitted by $R_3$ ever makes it to $E_1$ without passing through $R_4$ with a success tag of 1, $E_1$ will not be confused. The claim is accurate: both copies requested by $E_1$ have been made.

The difference between these two cases is that in Figure 3-6, $R_3$ has agreed to perform the exact same reflection request that $R_4$ had previously been assigned. Under these circumstances, it is reasonable for $R_4$ to go a step further and ask that $R_3$ perform $R_4$'s tagging operation as well. If $R_4$ had used a new tag, then when such a packet arrived, its only required operations would have been to rewrite the success tag to its own value. By reusing its own value in the new request, $R_4$ can simply forward the packet. $R_4$ has arranged matters so that its operation under Rule 3 has led to a degenerate case: the incoming packet's tag is the same as the expected tag, and no copy entries are NORMAL, so IP forwarding of the original packet is all that remains to be done.

Rule 7   When propagating the request associated with a reflection table entry, a router shall use a new tag unless all copy entries of the reflection table entry are offered by the next router. The new tag shall be chosen by incrementing the success tag of the reflection table entry. In such a case, the expected tag will be one greater than the success tag for that entry.

There is a significant advantage to avoiding unnecessary tag changes, beyond the ease with which $R_4$ may now forward packets. If a new link were brought up connecting $R_3$ directly to $E_1$, reflection would proceed without any difficulty. The packet would be tagged at $R_3$ in exactly the way that $E_1$ expects, so $E_1$ would correctly detect that its request had been fulfilled. The fact that $R_3$, rather than $R_4$, performed the duplications is irrelevant. If, instead, routers always changed success tags when propagating requests, then $R_4$'s presence would be required between $R_3$ and $E_1$ so that the tag could be transformed.

By the same reasoning that allows packets to skip $R_4$, we can also conclude that $R_4$ may safely throw out the reflection table entry associated with the request. This optional space optimization is codified in Rule 8. Routers should not eliminate this state without cause, however. If, for example, a new route should be added to the network that skips $R_3$ but not $R_4$, it would be beneficial for $R_4$ to still have enough information to perform the necessary duplications. If the state has been eliminated, the packet will not be duplicated until it reaches the application in $E_1$.

Rule 8   A router that has successfully passed on an entire reflection request, thereby avoiding the creation of a new success tag by Rule 7, may throw away the state associated with the request without decreasing the efficiency of the overlay under normal circumstances.

Route asymmetry, however, presents the same difficulty in a more persistent form. For example, if the previously described new link were a one-way link toward $E_1$, then $E_1$'s next request would still go through $R_4$. Only by keeping the success tag unchanged can the network avoid an extra reflection. Of course, sometimes route asymmetry will occur around a router that was forced to change the success tag in its request because it was unable to forward the entire reflection. In these cases, missing the router in question causes a misfire. Although the same duplication may already have been performed, when the packet arrives at the next router the tag will not be correct so the entire reflection must be performed. This problem is described in more depth in Section 3.5.

2 Using tags during reflection

Once a router has agreed to service a reflection request, it is expected to make the appropriate copies or ensure that another router has done so, and then write its success tag to the original packet. In the simplest case the router simply performs the reflection and writes the success tag. The router will always perform the reflection when it has not made a reflection request to any other router. The packet will arrive with no success tag (the field will be set to zero), the router will make the copies, and then forward the original packet to the next hop with the success tag filled as previously requested.

If a router has made a reflection request, then it might face a second case. A packet may arrive with its success tag correctly set to the value requested by the router in its reflection request. In this case, the local router knows that its reflection request has been honored, and certain copies (those labeled DEMANDED by Rule 3) have already been made by another router. In some cases, this knowledge will allow the local router to avoid making any copies. No copies will be required when the router was able to make a reflection request to offload its entire responsibility. Other times, the next router will have offered to perform only a portion of the local router's work. In those cases, the local router will still need to make some number of copies (those marked NORMAL). Once the router makes any remaining copies, it tags and forwards the original packet.

A final case is possible. A non-zero success tag appears in a packet, but the success tag does not correspond to the value of the router's previous reflection request. For example, the router has requested that a reflection be performed and that 11 be written to the forwarded packet, but the packet arrives with a 13. As we will see later, this mismatch can happen when further routers have made requests to yet more routers, but some router that is expected to perform a reflection has been skipped by the path of the original packet. The value 13 indicates that some reflection request was performed on the packet, but the local router cannot know what has occurred, so the entire reflection must be performed, followed by writing the success tag. The local router always writes the same success tag because it is confirming that its own reflection request has been completed.

In summary, Rule 9 refines Rule 2 to account for tags.

Rule 9   To reflect a packet that does not have a success tag corresponding to the success of its reflection table entry, a router shall make all copies, then tag and forward the original. To reflect a packet that does have a matching success tag, make all copies that are marked NORMAL, then tag and forward the original.


3 Tags in reflection copies

In addition to the tag associated with the request as a whole, reflection requests also contain a tag associated with each outbound flow identifier. This feature is necessitated by the interaction of multiple reflection requests. Suppose that a router has accepted a request: $reflect(A\rightarrow B, 1, \{(B\rightarrow C, 0)\})$. Now, when a router observes a packet going from $A\rightarrow B$, it sends a copy from $B\rightarrow C$ and 1 will be written to the $A\rightarrow B$ packet. Now suppose that the router receives another request: $reflect(B\rightarrow
C, 5, (\{C\rightarrow D,0)\})$. The router now determines that when it receives a packet from $A\rightarrow B$, it must send two packets in addition to the original: $B\rightarrow C$ and $C\rightarrow D$. The router tags the $B\rightarrow C$ packet with a 5 so that it is clear to the originator of the second request that its request was honored.

Finally, suppose that the router emits $B\rightarrow C$ packets on the same interface that it receives $A\rightarrow B$ packets. In that case, it makes a request to its upstream router using the tag associated with the $B\rightarrow C$ flow identifier: $reflect(A\rightarrow B, 2,
\{(B\rightarrow C, {\bf 5})\})$. By doing so, it insures that it is upholding its contract to write a 5 into $B\rightarrow C$ packets. The need to associate success tags with copy entries is codified in Rule 10.

Rule 10   If a router contains a reflection table entry, $R_1$, in which a copy entry, $C_1$, matches another reflection table entry $R_2$, the copies associated with $R_2$ should be added to the copy entries of $R_1$. The success tag of $R_2$ should be written to $C_1$. The newly added copy entries of $R_1$ should be marked NORMAL, although a subsequent reflection request could change them to DEMANDED.

4 Tags complicate DEMAND

Rule 10 is the first rule that places non-zero tags in the copy entries of a reflection table entry. Those non-zero tags complicate matters because they must be used only when the copies they signify are also made. A node must be careful not to make a reflection request that moves the copy entry (and non-zero success tag) to a new router unless it also moves the copies associated with that tag. In order to maintain the correctness of tags, a router may be forced to drop some copies during the demand phase of a request if dependent copies were dropped during the offer phase.

Suppose that a router, R, has agreed to two reflection requests:

\begin{displaymath}reflect(A\rightarrow B, 1, \{(B\rightarrow C,0)\})\end{displaymath}


\begin{displaymath}reflect(B\rightarrow C, 1, \{(C\rightarrow D,0), (C\rightarrow E,0)\})\end{displaymath}

From these two requests, R uses Rule 10 to determine that it must make three copies when it sees an $A\rightarrow B$ packet. First, it must send a $B\rightarrow C$ copy. Next, it must send $C\rightarrow D$ and $C\rightarrow E$ copies in response to the $B\rightarrow C$ packet it just generated. Furthermore, the $B\rightarrow C$ packet must be tagged with a 1 to confirm that the two extra copies were sent (as per the second reflect request).

Now R decides to save itself work by propagating a reflection request toward B. It intends to request:


\begin{displaymath}reflect(A\rightarrow B, 2, \{(B\rightarrow C,1),(C\rightarrow D,0),(C\rightarrow E,0)\})\end{displaymath}

R builds an ASK for this request. If the OFFER comes back indicating that the offerer is unwilling to make the $C\rightarrow D$ copy, R cannot go ahead with the obvious DEMAND that includes the other two copies. If $(B\rightarrow C,1)$ is demanded, those packets will go out with a success tag equal to 1. That tag would erroneously indicate that $(C\rightarrow D,0)$ was generated.

R has two choices to assure correctness. It may demand only the $B\rightarrow C$ copy, but ask that it be sent out tagged with a 0, not 1. In this case R should remove the $C\rightarrow E$ copy from its request. If the $C\rightarrow E$ is not removed, it will only cause a misfire when a later router creates the same copy because no success tag has indicated that it has already been done. Alternatively, R may simply drop the request for $B\rightarrow C$ completely. In this case, that would mean dropping the request completely, but a larger request could still have valid copies to be made. The key requirement is to drop the copy whose tag would erroneously indicate that additional copies were also made. This requirement is summarized as Rule 11.

Rule 11   A router shall only demand a tagged copy if it also demands the copies that are implied by that tag.

4 Multiple reflection requests

It is relatively easy to understand the effects of a single reflection handshake in isolation. The router that services the handshake will begin duplicating some subset of the copies that the end-host originally asked for, and tagging the original packet when it does so. Complications arise as we begin to consider the effects of multiple reflection requests.


1 Chain of requests

As each router passes a reflection request toward the source using Rule 3, a smaller and smaller subset of the copies will be serviced by the accepting routers. As per Rule 4, the subset chosen will relate to branch points in the underlying topology. A router will refuse to perform copies that require emitting the copy on the same interface as the original, so as the request moves toward the source it will contain fewer and fewer copies.

Under these circumstance, one might expect a chain of reflection requests that cause copies to be made as close as possible to the source. This expectation is not always correct, however, due to ``local maxima''. A copy may stop at router because the next router in the chain refuses to perform the copy, even though yet another router, closer to the source, would be willing to make the copy.

Imagine that a router, $R$, has been asked to reflect a packet to end-host, $E$. The router has three interfaces, and $E$ lies on a separate interface from both the source and destination of the original packet. $R$ will attempt to pass the reflection request further up toward the source, but the next router may consider $R$ to be the next hop for $E$. In such a case, the next router would refuse the request, and $R$ would be left performing the copy. This situation is a reasonable because stress is reduced, but it is certainly possible that there is a ``higher'' branchpoint, closer to the source that would be preferable in some way, because, for example, it decreases stretch. In that case, the recursive Rule 5 may be preferable to Rule 4.

2 Multiple, separate requests

The interactions of multiple, separate reflection requests can be subtle. If the requests contain common endpoints, they may interact under Rule 10 when their request chains meet at a single router. For example, consider these requests:


\begin{displaymath}reflect(A\rightarrow B, 1, \{(B\rightarrow C,0)\})\end{displaymath}


\begin{displaymath}reflect(B\rightarrow C, 1, \{(C\rightarrow D,0)\})\end{displaymath}

When these requests meet at a single router, the router should set up its reflection tables to act as though it received the following two requests:


\begin{displaymath}reflect(A\rightarrow B, 1, \{(B\rightarrow C,1)\}, \{(C\rightarrow D,0)\})\end{displaymath}


\begin{displaymath}reflect(B\rightarrow C, 1, \{(C\rightarrow D,0)\})\end{displaymath}

The important rule to keep in mind is that reflection is always an optimization of an existing overlay relay -- two unicast transmissions. The interactions of separate requests will never change the character of those transmissions. This property simplifies predictions about the final result of multiple reflection requests. For example, if there are no cycles in the overlay network, no cycles can be induced by using reflection.

Consider that an acyclic overlay network can be seen as a tree, rooted at the origin, $O$, of a packet. In each reflection request made by a node, $X$, in the tree, the match criteria will name a source node that is closer to the origin than the destination of the copies. For example, consider:


\begin{displaymath}reflect(S\rightarrow X,\{(X\rightarrow D_1)\},\{(X\rightarrow D_2),\dots\})\end{displaymath}

Each destination ($D_1$, $D_2$, ...), is guaranteed to be further from $O$ in the overlay topology than $S$ or $X$. Every time that a copy is made, its new destination is further from the origin in the overlay topology. The tree must have finite depth, so the number of copies is bounded. No cycles are possible.

On the other hand, malicious overlays can also be ``optimized''. For example, three nodes might make the following requests:


\begin{displaymath}reflect(A\rightarrow B, 1, \{(B\rightarrow C,0)\})\end{displaymath}


\begin{displaymath}reflect(B\rightarrow C, 1, \{(C\rightarrow A,0)\})\end{displaymath}


\begin{displaymath}reflect(C\rightarrow A, 1, \{(A\rightarrow B,0)\})\end{displaymath}

These requests would create a cycle among reflecting routers, just as there is a cycle among the three nodes in the overlay. The duplicate suppression techniques described in Section 3.6 would be expected to mitigate the effects of cycles, however.


5 Misfires

Misfires are caused when the ``success tag'' system is unable to convince a node that a packet has already been duplicated. In such cases, the node must perform the duplication again, on its own. One common reason for such a failure is network asymmetry, as seen in Figure 3-7. Others causes include route changes and multipath routing.

Figure 3-7: Network asymmetry causes a misfire. A one-way link and a new router have been added to the previous topology. $S_1\rightarrow E_1$ packets now skip $R_4$, arriving at $R_5$ tagged only by $R_3$. $R_5$ recognizes that its request was not properly fulfilled and makes the copies on its own. $E_2$ will receive packets created at $R_3$ and $R_5$.
\begin{figure}\centering\epsfig{file=asymmetry.eps, width=\figwd}\end{figure}

In the case of network asymmetry changing success tags in the core of the network, the tag mechanism cannot prevent misfires, but it can address the missed reflection request inside the network. Stress and stretch will still be reduced at the point the asymmetry ``comes back together''. For example, in Figure 3-7 $R_3 R_6$ is a one-way link toward $R_6$. The routes between $S_1$ and $E_1$ are now asymmetric. Although the reflect request in $R_4$ will never be used in this scenario, $R_6$ will detect that it should continue to perform both copies on $E_1$'s behalf because packets will arrive without the success tag that $R_6$ asked $R_4$ to use. Once $R_6$ performs the reflection, it will write its success tag as usual. From the standpoint of $E_1$, it does not matter who performed the reflection. $R_6$'s success tag assures $E_1$ that no more work is required.


6 Preventing repeated reflection

Due to misfires, routers and edge nodes must expect that, in some situations, they will receive duplicate packets. These nodes might be expected to reflect all matching packets that they receive, but it is pointless to reflect the same packet multiple times. A node receiving a packet that it has already reflected should not perform reflection for the packet again.

Two possible designs would allow duplicate detection. In the first, packets contain a sequence number. When a packet is reflected, its sequence number is associated with its reflection table entry. Further packets with equal of lower sequence numbers are duplicates. As usual in designs with sequence numbers of finite size, ``greater than'' is defined to be circular comparison. This technique faces difficulty in the face of network reorderings, however. If a higher numbered packet arrives before lower numbered packet, the lower numbered packet would not be reflected. Although this is not a problem for correctness because the packet would not receive its success tag and the end-host could perform the reflection, it would result in a loss of efficiency.

For the sake of foiling denial of service attacks, a different approach is possible. Instead of using sequence numbers, each packet could be tagged with a large random number. A router would associate a small cache of recently seen packets with each entry in its reflection table. Again, duplicates would not be reflected. This scheme also faces problems in the face of reorderings, if the number of packets reordered exceeds the size of the cache. Again, this is not a correctness issue, but for a different reason. In this scheme, when there is a cache miss on a duplicate packet, it will be reflected a second time. This addition duplication is unnecessary, but not a problem for correctness.

The random number scheme has an additional benefit compared to the sequence number scheme. It permits a strengthening of design choice that states that duplicates should not be reflected. Instead, packets that have been detected as duplicates may be dropped. The random number scheme produces false negatives, which would allow safe dropping. The sequence number scheme permits occasional false positives which should not be dropped.

7 Soft state

Routers should maintain reflection table entries for a finite period of time. It is the responsibility of end hosts to repeat reflection requests on a periodic basis in order to maintain the state in each router. When routers receive a refreshing request, they should repeat their own attempt to pass on the request by Rule 3. This thesis does not explore appropriate timeout intervals, though it is expected that timeouts on the scale of minutes would be appropriate.

8 Security

The security implications of reflection seem, at first, difficult to predict. One wonders if malicious users might use reflection to snoop packets from afar. However, a Rule 12 allows routers to reject reflection requests that might be intended to ``reflect'' sensitive data to prying eyes.

Rule 12   Routers should accept a reflection request on a given interface only if control of that interface would have been sufficient to implement the behavior requested by the reflection request. For example, if packets addressed to $A$ would be emitted through a router's first interface, then requests to reflect packets intended for $A$ should only be accepted on the first interface. Further, the source to be written in the copies for such reflections must be $A$.

In addition, the Nonce field ensures that a request is being made by a node that, at the least, already has access to packets destined for the destination. A malicious node can reflect $A$'s packets only if it can intercept $A$'s packets. In such a case, the malicious node already has access to $A$'s data, so reflection would merely optimize the theft.

9 Deployment

Packet reflection is suited to incremental deployment because there is an immediate gain wherever it is deployed. Even if only a single router implements the primitive, application-level multicast nodes attached to that router can immediately take advantage of packet reflection and save bandwidth on their LAN as well as trimming latency to their neighbors. For example, in the previous example of Figure