Network Working Group                                            A. Falk
Expires: April 17, 2005                                        D. Katabi
                                                        October 17, 2004

Specification for the Explicit Control Protocol (XCP)


Status of this Memo

This document is an Internet-Draft and is subject to all provisions of section 3 of RFC 3667. By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she become aware will be disclosed, in accordance with RFC 3668.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at

The list of Internet-Draft Shadow Directories can be accessed at

This Internet-Draft will expire on April 17, 2005.

Copyright Notice

Copyright (C) The Internet Society (2004).

Abstract


This document contains an initial specification for the Explicit Control Protocol (XCP), an experimental congestion control protocol. XCP is designed to deliver the highest possible end-to-end throughput over a broad range of network infrastructure, including links with very large bandwidth-delay products, which are not well served by the current control algorithms. XCP is potentially applicable to any transport protocol, although initial testing has applied it to TCP in particular. XCP routers are required to perform a small calculation on congestion state carried in each data packet. XCP routers also periodically recalculate the local parameters required to provide fairness. On the other hand, there is no per-flow congestion state in XCP routers.

Table of Contents

1.  Introduction
    1.1  XCP Protocol Overview
2.  The Congestion Header
    2.1  Header placement
    2.2  Congestion Header Formats
    2.3  IPsec issues
    2.4  NAT, middlebox issues
    2.5  MPLS/Tunneling Issues
3.  XCP Functions
    3.1  End-System Functions
        3.1.1  Sending Packets
        3.1.2  Processing Feedback at the Receiver
        3.1.3  Processing Feedback at the Sender
    3.2  Router functions
        3.2.1  Calculations Upon Packet Arrival
        3.2.2  Calculations Upon Control Interval Timeout
        3.2.3  Calculations Upon Packet Departure
        3.2.4  The Control Interval
        3.2.5  Obtaining the Persistent Queue
4.  Unresolved Issues
    4.1  XCP With Non-XCP Routers
    4.2  XCP as a TCP PEP
    4.3  Sharing resources between XCP and TCP
    4.4  A Generalized Router Model
    4.5  Host back-to-back operation
    4.6  Alternate Responses to Packet Loss
    4.7  Alternate XCP Data Representations
5.  Security Considerations
6.  IANA Considerations
7.  Acknowledgements
8.  References
    Authors' Addresses
    Intellectual Property and Copyright Statements


1. Introduction

The Van Jacobson congestion control algorithms [Jacobson88][RFC2581] are used by the Internet transport protocols TCP [RFC0793] and SCTP [RFC2960]. The Jacobson algorithms are fundamental to stable and efficient Internet operation, and they have been highly successful over many orders of magnitude of Internet bandwidth and delay.

However, the Jacobson congestion control algorithms have begun to reach their limits. Gigabit-per-second file transfers, lossy wireless links, and high latency connections are all driving current TCP congestion control outside of its natural operating regime. The resulting performance problems are of great concern for important applications of the network.

The original Jacobson algorithm was a purely end-to-end solution, requiring no congestion-related state in routers. More recent modifications have backed off from this purity. Active queue management (AQM) in routers (e.g., RED) [RFC2309] improves performance by keeping queues small, while Explicit Congestion Notification (ECN) [RFC3168] passes one bit of congestion information back to senders. These measures do improve performance, but there is a limit to how much can be accomplished without more information from routers. The requirement of extreme scalability together with robustness has been a difficult hurdle to passing more information.

This document concerns the Explicit Control Protocol (XCP) developed by Dina Katabi of MIT [KHR02]. XCP represents a significant advance in Internet congestion control: it extracts congestion information directly from routers, without any per-flow state. XCP should be able to deliver the highest possible application performance over a broad range of network infrastructure, including extremely high speed and very high delay links that are not well served by the current control algorithms. XCP achieves fairness, maximum link utilization, and efficient use of bandwidth. XCP is novel in separating the efficiency and fairness policies of congestion control, enabling routers to put available capacity to work quickly while conservatively managing the allocation of capacity to flows. XCP is potentially applicable to any transport protocol, although initial testing has applied it to TCP in particular.

XCP's scalability is built upon the principle of carrying per-flow congestion state in packets. XCP packets carry a congestion header through which the sender requests a desired throughput. Routers make a fair per-flow bandwidth allocation without maintaining any per-flow state. This enables the sender to learn the bottleneck router's allocation to a particular flow in a single round trip.

The gains of XCP come with some pain. XCP is more difficult to deploy than other proposed Internet congestion control improvements, since it requires changes in the routers as well as in end systems. It will be necessary to develop and test XCP with real user traffic and in real environments, to gain experience with real router and host implementations and to collect data on performance. Providing specifications is an important step towards enabling experimentation which, in turn, will lead to deployment. XCP deployment issues will be addressed in more detail in a subsequent version of this document.

This document contains an initial specification of the protocol and algorithms used by XCP, as an experimental protocol. The XCP algorithms defined here are based upon Katabi's SIGCOMM paper [KHR02], her MIT thesis [Katabi03], and her ns simulation. However, this document includes algorithmic modifications and clarifications that have arisen from early experience with implementing and testing XCP. This document is intended to provide a baseline for further engineering and testing of XCP.

This document is organized as follows. The remainder of Section 1 provides an overview of the XCP protocol, Section 2 discusses the format of the congestion header, Section 3 describes the functions occurring in the end-systems and routers, and Section 4 lists some unresolved issues.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

1.1 XCP Protocol Overview

The participants in the XCP protocol include sender hosts, receiver hosts, and intermediate nodes in which queuing occurs along the path from the sender to the receiver. The intermediate nodes are generally routers, but link-layer switches may also contain packet queues.

XCP supplies feedback from the network to the sender on the maximum rate (throughput) for injecting data into the network. XCP feedback is acquired through the use of a congestion header on each packet that is sent. Routers along the path may update the congestion header as it moves from the sender to the receiver. The receiver copies the network feedback into outbound packets of the same flow. An end-system may function as both a sender and a receiver in the case of a bidirectional flow.

The figure below illustrates four entities participating in XCP. The sender initializes the congestion header, two routers along the way may update it, and the receiver copies the feedback from the network into a returning packet in the same flow.

+----------+        +--------+     +--------+        +----------+
|          |------->| Router |---->| Router |------->|          |
|  Sender  |        +--------+     +--------+        | Receiver |
|          |<----------------------------------------|          |
+----------+                                         +----------+

The congestion header contains four pieces of data: the sender's round-trip time estimate (RTT), the flow's current throughput (Throughput), the desired or allocated change in throughput (Delta_Throughput), and the feedback echoed by the receiver (Reverse_Feedback). These fields are defined in Section 2.2.

An XCP-capable router calculates a fair capacity re-allocation for each packet. A flow only receives this re-allocation from a particular router if that router is the bottleneck for that flow. For XCP, a bottleneck router is defined as a router that has insufficient capacity to accept a flow's current or desired throughput.

Based on current conditions, an XCP-capable router generates positive or negative feedback that is allocated to each packet and compared to the packet's Delta_Throughput field. Delta_Throughput is reduced if the current value exceeds this calculated feedback allocation. Each XCP-capable router along the path from sender to receiver performs this processing. A packet reaching the receiver therefore contains the minimal feedback allocation from the network, i.e., the capacity reallocation from the bottleneck router.

The receiver copies this value into the Reverse_Feedback field of a returning packet in the same flow (e.g., an ACK or DATA-ACK for TCP) and, in one round-trip, the sender learns the flow's per-packet throughput allocation.
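As an illustrative sketch (not part of the specification; all names here are hypothetical), the running minimum carried from sender to receiver can be modeled as:

```python
def router_process(delta_throughput, router_allocation):
    # Each XCP-capable router reduces Delta_Throughput to its own
    # feedback allocation if that allocation is smaller.
    return min(delta_throughput, router_allocation)

def path_feedback(requested_delta, allocations):
    # The packet carries the running minimum from sender to receiver;
    # the receiver echoes the final value back as Reverse_Feedback.
    delta = requested_delta
    for allocation in allocations:   # routers in path order
        delta = router_process(delta, allocation)
    return delta

# A flow requesting +8000 bytes/s through routers willing to grant
# +12000, +3000, and +9000 bytes/s learns the bottleneck's allocation:
bottleneck_grant = path_feedback(8000, [12000, 3000, 9000])  # 3000
```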

The sender uses the reverse feedback information to adjust its allowed sending rate. For the transport protocol TCP [RFC0793], for example, this may be accomplished by adjusting the congestion window, or cwnd, that limits the amount of unacknowledged data in the network. (Cwnd is defined for Van Jacobson congestion control in [RFC2581].)

Additionally, it is possible to use XCP's explicit notification of the bottleneck capacity allocation for other types of applications. For example, XCP may be implemented to support multimedia streams over DCCP [I-D.ietf-dccp-spec] or other transport protocols.

An XCP-capable router maintains two control algorithms on each output port: a congestion controller and a fairness controller. The congestion controller is responsible for making maximal use of the outbound link while at the same time draining any standing queues. The fairness controller is responsible for fairly allocating bandwidth to flows sharing the link. These two algorithms are executed only periodically, at an interval known as the "control interval". The algorithms defined below set this interval to the RTT averaged across all flows. Further work on choosing an appropriate value for the control interval may be required.

Each port-specific instance of XCP is independent of every other, and references to an "XCP router" should be considered an instance of XCP running on a particular output port.

Actually, it is an oversimplification to say that congestion in routers only appears at output ports. Routers are complex devices that may experience resource contention in many forms and locations. Correctly expressing congestion that does not occur at the router output port is a topic for further study. Even so, it is important to correctly identify where the queue will build up in a router: the XCP algorithm will drain a standing queue, but that queue must be measured for correct operation. For more discussion of this issue see Section 4.4.

More context, analysis, and background can be found in [KHR02] and [Katabi03].


2. The Congestion Header

The congestion control data required for XCP are placed in a new header which is called the Congestion Header.

2.1 Header placement

The Congestion Header is located between the IP and transport headers. This is consistent with the fact that XCP is neither hop-by-hop communication -- as in IP -- nor end-to-end communication -- as in TCP or UDP -- but is rather end-system-to-network communication. It should allow a router to "easily" locate the congestion header on a packet with no IP options.

Other choices were considered for header location. For example, making the Congestion Header a TCP option was suggested. This made sense as the congestion information is related to the transport protocol. However, it requires that routers be aware of the header format for every new transport protocol that might ever use XCP. This seemed like an unreasonable burden to place on the routers and would impede deployment of new transport protocols and/or XCP.

It has also been suggested that the Congestion Header be an IPv4-style option. While this proposal is transport protocol independent, it would generally force XCP packets to take the slow path on non-XCP routers along a path. This could severely impact performance.

The XCP protocol uses protocol number [TBD], assigned by IANA. IP packets containing XCP headers will use this protocol number in the IP header's Protocol field [RFC0791]Postel, J., Internet Protocol, September 1981. to indicate to routers and end-systems that an XCP congestion header follows the IP header.

2.2 Congestion Header Formats

This section defines the XCP Congestion Header format for IPv4; the corresponding header for IPv6 is a subject for further study.

XCP-capable IPv4 packets carry the following Congestion Header:

  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |Version|Format |   Protocol    |    Length     |    unused     |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                              RTT                              |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                           Throughput                          |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                        Delta_Throughput                       |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                        Reverse_Feedback                       |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+


Version: 4 bits
This field indicates the version of XCP that is in use. The version of XCP described in this document corresponds to a value of 0x1. Future values will be assigned by IANA (see the note in the IANA Considerations section).
Format: 4 bits
This field contains a code to indicate the congestion header format. The current format codes are defined below.
           Format            Code
           ---------------   ----
           Standard Format   0x1
           Minimal Format    0x2

                   Table 1

Standard Format
The standard format includes the Throughput, Delta_Throughput, and RTT fields shown above. This format is used by XCP in data packets flowing from sender to receiver.
Minimal Format
The Throughput, Delta_Throughput, and RTT fields are unused and SHOULD be set to zero. A router should not perform any processing on a minimal format header. This format is intended for use in empty ACK packets, to return congestion information from receiver to sender.

Other formats may be defined in the future, for example to define different representations for the Throughput, Delta_Throughput, and/or RTT fields. The corresponding format values will be assigned by IANA. See the IANA Considerations section below.

Protocol: 8 bits
This field indicates the next-level protocol used in the data portion of the packet. The values for various protocols are specified by IANA.
Length: 8 bits
This field indicates the length of the congestion header, measured in bytes. Length in this version of XCP will always be 20 bytes, or 0x14.
unused: 8 bits
This field is unused and MUST be set to zero in this version of XCP.
RTT: 32 bits
This field indicates the round-trip time measured by the sender, in milliseconds. This field is an unsigned integer.
The minimum value expressible in this field is 1 ms (values should be rounded up). A value of zero in the RTT field indicates that the sender does not yet know the round-trip time. The maximum value expressible in this field is 4.3E9 ms, or approximately 49 days.
Throughput: 32 bits
This field indicates the current throughput for the flow, as measured by the sender, in bytes per millisecond.
Throughput values should be rounded up. The maximum value expressible in this field is 4.3E9 bytes/ms (approximately 34 Tbps), in steps of 1 byte/ms (8000 bits/sec).
Delta_Throughput: 32 bits
This field indicates the desired or allocated change in throughput. It is set by the sender to indicate the amount by which the sender would like to adjust its throughput, and it may be subsequently reduced by routers along the path (see Section 3.2). It is measured in bytes per second and is a signed, 2's complement value.
The minimum throughput change expressible in this field is -17 Gbps; the maximum is +17 Gbps, in steps of 1 byte per second (8 bits per second).
Reverse_Feedback: 32 bits
This field indicates the value of Delta_Throughput received by the data receiver. The receiver copies the field Delta_Throughput into the Reverse_Feedback field of the next outgoing packet in the same flow. See Section 3.1.2.
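As an illustrative sketch, the five 32-bit words shown in the diagram above can be packed into 20 bytes. The `pack_header`/`unpack_header` helpers here are our own, not part of the specification:

```python
import struct

# First word: Version(4) | Format(4) | Protocol(8) | Length(8) | unused(8).
# RTT and Throughput are unsigned; Delta_Throughput and Reverse_Feedback
# are signed 2's complement, per the field definitions above.
HEADER_FMT = "!BBBBIIii"   # network byte order, 20 bytes total

def pack_header(version, fmt_code, protocol, rtt_ms,
                throughput, delta_tp, reverse_fb):
    first = ((version & 0xF) << 4) | (fmt_code & 0xF)
    length = struct.calcsize(HEADER_FMT)
    return struct.pack(HEADER_FMT, first, protocol, length, 0,
                       rtt_ms, throughput, delta_tp, reverse_fb)

def unpack_header(data):
    first, protocol, length, _unused, rtt, tput, delta, rev = \
        struct.unpack(HEADER_FMT, data)
    return {"version": first >> 4, "format": first & 0xF,
            "protocol": protocol, "length": length, "rtt": rtt,
            "throughput": tput, "delta_throughput": delta,
            "reverse_feedback": rev}

# A Standard Format header for a TCP flow requesting a reduction:
hdr = pack_header(version=1, fmt_code=1, protocol=6, rtt_ms=120,
                  throughput=5000, delta_tp=-2500, reverse_fb=0)
```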

2.3 IPsec issues

IPsec [RFC2401] must be slightly modified to accommodate use of XCP. The specifications for the IP Authentication Header (AH) [RFC2402] and IP Encapsulating Security Payload (ESP) [RFC2406] state that the IPsec headers immediately follow the IP header. This would be a problem for XCP in that a) it would make the XCP headers harder to find by the routers, b) ESP encryption would make it impossible for routers along the path to read and write congestion header information, and c) AH authentication would fail if any router along the path had modified a congestion header. Therefore, the XCP congestion header should immediately follow the IP header and precede any AH.

2.4 NAT, middlebox issues

Middleboxes that attempt to perform actions invisibly on flows must preserve the congestion header. Middleboxes that terminate the TCP connection should terminate the XCP connection. Middleboxes that insert queues into the forwarding path should participate in XCP.

2.5 MPLS/Tunneling Issues

When a flow enters an IP tunnel [RFC2003], IPsec ESP tunnel [RFC2406], or MPLS network [RFC3031] at an ingress point, the congestion header should be replicated on the "front" of the outer IP header. For example, when a packet enters an IP tunnel, the following transformation should occur:

             [IP2]  \_ outer header
      __-->  [XCP]  /
[IP1]/       [IP1]  \
[XCP] --->   [XCP]   |_ inner header
[TCP]        [TCP]   |
...          ...    /

Here the XCP header appended to the front of the outer header is copied from the inner header, with the appropriate change to the Protocol field to indicate that the next protocol is IP.

When the packet exits the tunnel, the congestion header, which may have been modified by routers along the tunneled path, is copied from the outer header into the inner header.
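The header copies at tunnel ingress and egress can be sketched as follows. The packet model and names are ours, and restoring the inner header's next-protocol value at egress is an assumption, not something the text above specifies:

```python
# Packets are modeled as lists of headers (dicts); illustrative only.

def encapsulate(packet, outer_ip):
    # Copy the inner congestion header to the front of the outer IP
    # header; the protocol after the outer XCP header is IP (IP-in-IP).
    outer_xcp = dict(packet[1])
    outer_xcp["protocol"] = 4
    return [outer_ip, outer_xcp] + packet

def decapsulate(packet):
    # Copy the (possibly router-modified) outer congestion header back
    # into the inner header, keeping the inner next-protocol value.
    outer_xcp, inner = packet[1], packet[2:]
    inner[1] = dict(outer_xcp, protocol=inner[1]["protocol"])
    return inner

pkt = [{"hdr": "IP1"},
       {"hdr": "XCP", "protocol": 6, "delta_throughput": 8000},
       {"hdr": "TCP"}]
tunneled = encapsulate(pkt, {"hdr": "IP2"})
tunneled[1]["delta_throughput"] = 3000   # reduced by a router in the tunnel
out = decapsulate(tunneled)              # inner header now carries 3000
```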


3. XCP Functions

XCP is concerned with the sender and receiver end-systems and the routers along the packet's path. This section describes the XCP-related algorithms to be implemented in each of these entities. The specific case of TCP as transport protocol is also described. The emphasis in this section is on explicit and detailed definition of the XCP algorithms. The theoretical derivation and analysis of the algorithms can be found in [Katabi03].

3.1 End-System Functions

3.1.1 Sending Packets

The sender is responsible for maintaining four parameters, or their equivalent: (1) a desired throughput value, (2) a current estimate of the actual throughput, (3) the maximum throughput allowed by XCP, and (4) a current estimate of the round-trip time.

A sender may choose to use any reasonable value, i.e., any achievable value, for desired throughput. This value may be supplied by an application via an API, or it might be the speed of the local interface.

When sending a packet, the sender fills in the fields of the congestion header as follows: the RTT field is set to the sender's current round-trip time estimate (or zero if no estimate is yet available), the Throughput field is set to the sender's current estimate of its actual throughput, and the Delta_Throughput field is set to the change the sender would like in order to move its current throughput toward its desired throughput.

3.1.2 Processing Feedback at the Receiver

An XCP receiver is responsible for copying the Delta_Throughput data it sees on arriving packets into the Reverse_Feedback field of outgoing packets. In TCP, outgoing packets would normally be ACK-only segments.

In some cases returning packets are sent less frequently than arriving packets, e.g., with delayed acknowledgments [RFC1122]. The receiver is responsible for calculating the sum of the arriving Delta_Throughput fields for placement in outgoing Reverse_Feedback fields.

Feedback from the Receiver

The receiver end-system returns XCP congestion feedback from the network to the sender, by copying the Delta_Throughput information from arriving packets into the Reverse_Feedback field of Congestion Headers in outgoing packets (possibly aggregating data as described in Section 3.1.2Processing Feedback at the Receiver).

It is possible that even empty ACK packets may create or encounter congestion in the reverse direction. Although TCP implementations are generally not prepared to do any kind of congestion-based pacing of empty ACK segments, some transport protocols (e.g., DCCP) may be. Such a transport protocol may choose to do XCP congestion control on the returning ACKs as well as on the data.

In the normal case of a unidirectional data flow with XCP applied only to that data flow, the feedback can be sent in a Minimal format Congestion Header, in which the RTT, Throughput, and Delta_Throughput fields are set to zero.

3.1.3 Processing Feedback at the Sender

When packets arrive back to the sender carrying reverse feedback, the sender must adjust its sending rate accordingly.

As noted earlier, this throughput adjustment may be made in TCP by updating the sender's congestion window cwnd. This should use the formula:

        cwnd = max(cwnd + feedback * RTT / 1000, MSS)

            cwnd     = current congestion window (bytes)
            feedback = Reverse_Feedback field from received packet,
                       (bytes/sec, may be +/-)
            RTT      = Sender's current round-trip time estimate (ms)
            MSS      = maximum segment size (bytes)
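As an illustrative sketch (not normative), the update can be written as follows, converting the RTT from milliseconds to seconds so that its product with the feedback rate is in bytes:

```python
def update_cwnd(cwnd, feedback, rtt_ms, mss):
    # feedback is in bytes/sec (may be negative); rtt_ms is in
    # milliseconds, so divide by 1000 to get the adjustment in bytes.
    return max(cwnd + feedback * rtt_ms / 1000.0, mss)

# Positive feedback of 8000 bytes/s over a 100 ms RTT grows a
# 10000-byte window by 800 bytes:
new_cwnd = update_cwnd(10000, 8000, 100, 1460)  # 10800.0
```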

The value of cwnd has a minimum of MSS to avoid the "Silly Window Syndrome" [RFC0813].

Aging the Allowed Throughput

When a sending application does not send data fast enough to fully utilize the allowed throughput, XCP should reduce the allowed throughput as time passes, to avoid sudden bursts of data into the network if the application starts to send data later.

We present a slight modification of the algorithm for aging the allowed throughput below, based on Section 4.5 of [Katabi03]. In each RTT in which the sender's actual throughput is less than the allowed throughput, the allowed throughput MUST be reduced by the following exponential averaging formula:

	Allowed_Throughput = Allowed_Throughput*(1-p) + 
			     Actual_Throughput * p

            where: p is a parameter controlling the speed of aging,
                   ranging between 0 and 1.
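The aging rule above can be sketched as:

```python
def age_allowed_throughput(allowed, actual, p=0.5):
    # Exponentially average the allowed throughput toward the actual
    # throughput; p controls the speed of aging (0 < p < 1).
    return allowed * (1 - p) + actual * p

# An idle sender (actual throughput 0) sees its allowance halve
# each RTT with p = 0.5:
a = age_allowed_throughput(10000, 0)   # 5000.0
a = age_allowed_throughput(a, 0)       # 2500.0
```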

Using p = 0.5 is suggested. Consideration of other values of p, or of other aging algorithms, is a research topic.

Response to Packet Loss

When the transport protocol is TCP, a packet drop or detection of an ECN notification [RFC3168] should trigger a transition to standard TCP congestion control behavior [RFC2581]. In other words, cwnd should be halved and Jacobson's fast retransmission/fast recovery, slow start, and congestion avoidance algorithms should be applied for the remainder of the connection. The assumption is that the packet drop reveals the presence of a congested non-XCP router in the path. Transitioning to standard TCP behavior is a conservative response.

RTT Estimates

Having a good estimate of the round-trip time is more important in XCP than in Van Jacobson congestion control. There has been some experimental evidence that small errors in the RTT estimate can result in larger errors in the throughput value. The current cwnd divided by SRTT is only an approximation of the actual throughput. The ns-2 code in [KHR02] used a smoothed RTT estimator rather than instantaneous measurements. Additional research is needed to develop recommendations for RTT estimation.

3.2 Router functions

The router calculations for XCP are divided into those that occur upon packet arrival, those that occur upon control interval timeout, those that occur upon packet departure, and the assessment of the persistent queue, which uses a separate timer. The calculations are presented in the following sections as annotated pseudocode.

3.2.1 Calculations Upon Packet Arrival

When a packet arrives at a router, several parameters used by XCP need to be updated. The steps are described in the following pseudocode.

On packet arrival do: 

1. input_traffic += Pkt_size 

2. sum_inv_throughput += Pkt_size / Throughput

3. if (Rtt < MAX_INTERVAL) then 

4.   sum_rtt_by_throughput += Rtt x Pkt_size / Throughput

5. else 

6.   sum_rtt_by_throughput += MAX_INTERVAL x Pkt_size / Throughput

Line 1:
The variable input_traffic accumulates the volume of data that have arrived during a control interval. When a packet arrives, the packet size is taken from the IP header and is added to the ongoing count.
Line 2:
The variable sum_inv_throughput is used in the control interval calculation (see equation 4.2 of [Katabi03]) and in capacity allocation. For each packet, the ratio of packet size (from the IP header) to advertised Throughput (from the XCP header) is accumulated.
Lines 3 and 5:
A test is performed to check whether the round-trip time of the flow exceeds the maximum allowable control interval. If so, MAX_INTERVAL, the maximum allowable control interval, is used in the subsequent calculations. Too large a control interval would delay new flows from acquiring their fair allocation of capacity.
Lines 4 and 6:
As in Line 2, the variable sum_rtt_by_throughput is used in the control interval calculation.
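The arrival-time accounting in lines 1-6 can be sketched as follows. The state container and the MAX_INTERVAL value are illustrative assumptions, not prescribed by this document:

```python
MAX_INTERVAL = 1000.0   # ms; assumed value for illustration

class PortState:
    """Per-output-port accumulators, reset each control interval."""
    def __init__(self):
        self.input_traffic = 0             # bytes arrived this interval
        self.sum_inv_throughput = 0.0
        self.sum_rtt_by_throughput = 0.0

    def on_packet_arrival(self, pkt_size, rtt, throughput):
        self.input_traffic += pkt_size                          # line 1
        self.sum_inv_throughput += pkt_size / throughput        # line 2
        capped_rtt = min(rtt, MAX_INTERVAL)                     # lines 3/5
        self.sum_rtt_by_throughput += \
            capped_rtt * pkt_size / throughput                  # lines 4/6

state = PortState()
state.on_packet_arrival(1500, 100.0, 3000.0)
state.on_packet_arrival(1500, 2000.0, 3000.0)  # RTT capped at 1000 ms
```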

3.2.2 Calculations Upon Control Interval Timeout

When the control timer expires, several variables need to be updated as shown below.

On estimation-control timeout do:

 7. avg_rtt = sum_rtt_by_throughput / sum_inv_throughput

 8. input_bw = input_traffic / ctl_interval 

 9. F = a * (capacity - input_bw) - b * queue / avg_rtt

10. shuffled_traffic = shuffle_function(...)

11. residue_pos_fbk = shuffled_traffic + max(F,0)

12. residue_neg_fbk = shuffled_traffic + max(-F,0)

13. Cp = residue_pos_fbk / sum_inv_throughput

14. Cn = residue_neg_fbk / input_traffic

15. input_traffic = 0

16. sum_inv_throughput = 0

17. sum_rtt_by_throughput = 0

18. ctl_interval = max(avg_rtt, MIN_INTERVAL)

19. timer.reschedule(ctl_interval)

Line 7:
Update avg_rtt by taking the ratio of the two sums maintained in the previous section. This value is used to determine the control interval below (line 18).
Line 8:
The average bandwidth of arriving traffic is calculated by dividing the bytes received in the previous control interval by the duration of the previous control interval.
Line 9:
The aggregate feedback, F, is calculated. The variable `capacity' is the ability of the outbound link to carry IP packets, in bytes/second. The variable avg_rtt was calculated in line 7. The variable queue is the persistent queue and is defined in Section 3.2.5. The values a and b are constant parameters. The constant a may be any positive number such that a < pi/(4*sqrt(2)), approximately 0.555. A nominal value of 0.4 is recommended. The constant b is defined as b = a^2 * sqrt(2). (If the nominal value of a is used, the value for b would be 0.226.) Note that F may be positive or negative.
Line 10:
This line establishes the amount of capacity that will be shuffled in the next control interval through the use of the shuffle_function. Shuffling takes a small amount of the available capacity and redistributes it by adding it to both the positive and negative feedback pools. This allows new flows to acquire capacity in a fully loaded system.
The recommended shuffle_function is as follows:

                 shuffled_traffic = max(0, 0.1 * input_bw - |F|)

The variable input_bw is defined above in Line 8. Implementors may choose other functions. It is important to consider that more shuffled traffic decreases the time for new flows to acquire capacity and converge to fairness. Too much shuffling may impede flows from acquiring their fair share of available capacity. (For example, consider a setup of N flows bottlenecked downstream from the given router and another flow, not limited as those, trying to acquire its fair share. In this case shuffling leads to under-utilization of the available bandwidth and impedes the unlimited flow.) Shuffled_traffic is always a positive value.

The objective of the feedback calculations is to obtain a per-packet feedback allocation from the router. Lines 11 and 12 obtain factors in this calculation which, however, have no physical meaning. One might view them as per-flow capacity allocations that receive some additional processing to prepare them for per-packet allocation. Note that, with the use of shuffled_traffic, a non-idle router will always start a control interval with non-zero values for both Cn and Cp.

Line 11:
This line calculates the positive feedback scale factor, Cp. The variables F, shuffled_traffic, and sum_inv_throughput are defined above.
Line 12:
This line calculates the negative feedback scale factor, Cn. This is a positive value. The definitions for F, shuffled_traffic, and input_traffic are given above.
Line 13:
The variable residue_pos_fbk keeps track of the pool of available positive capacity a router has to allocate. It is initialized to the positive aggregate feedback.
Line 14:
The variable residue_neg_fbk keeps track of the pool of available negative capacity a router has to allocate. It is initialized to the negative aggregate feedback. This variable is always positive.
Line 15-17:
Reset various counters for the next control interval.
Line 18:
Set the next control interval. The use of MIN_INTERVAL is important to establish a reasonable control interval when the router is idle.
Line 19:
Set timer.

3.2.3 Calculations Upon Packet Departure

An XCP router processes each packet using the feedback parameters calculated above. As stated earlier, each packet indicates the current Throughput and a throughput adjustment, Delta_Throughput. The router calculates a per-packet capacity change which will be compared to the Delta_Throughput field in the packet header. Using the AIMD rule, positive feedback is applied equally per-flow, while negative feedback is made proportional to each flow's capacity.

To accommodate high-speed routers, XCP uses a fixed-point numeric representation for the Congestion Header fields. This means that the per-packet calculations defined below result in some residual error < 1 Bps. These errors accumulate across all the packets in a control interval, resulting in an inaccuracy in XCP's allocation of available bandwidth to flows. Further work is needed to understand whether this will be a significant problem and, if so, whether there is any solution short of using 64-bit precision or floating point.

Processing should be done according to the pseudocode below.

On packet departure:

20. pos_fbk = Cp * Pkt_size / Throughput

21. neg_fbk = Cn * Pkt_size

22. feedback = pos_fbk - neg_fbk

23. if(Delta_Throughput > feedback) then

24.   Delta_Throughput = feedback

25. else

26.   neg_fbk = min(residue_neg_fbk, neg_fbk + 
                         (feedback - Delta_Throughput))

27.   pos_fbk = Delta_Throughput + neg_fbk

28. residue_pos_fbk = max(0, residue_pos_fbk - pos_fbk)

29. residue_neg_fbk = max(0, residue_neg_fbk - neg_fbk)

30. if (residue_pos_fbk <= 0) then Cp = 0

31. if (residue_neg_fbk <= 0) then Cn = 0


Line 20:
The contribution of positive feedback for the current packet is calculated using Cp, defined in line 11, Pkt_size from the IP header, and Throughput (the flow's advertised throughput) from the Congestion Header.
Line 21:
The contribution of negative feedback for the current packet is calculated using Cn, defined in line 12, and Pkt_size from the IP header. This value of neg_fbk is positive.
Line 22:
The router's allocated feedback for the packet is the positive per-packet feedback minus the negative per-packet feedback. This value may be positive or negative.
Line 23-24:
Line 23 tests whether the packet is requesting more capacity (via the packet's Delta_Throughput field) than the router has allocated. If so, the sender's desired throughput must be reduced to the router's allocation. In line 24 the Delta_Throughput field in the packet header is updated with the router's feedback allocation.
Line 25:
This branch is executed when the packet is requesting less throughput than the router's allocation. In this branch, and in the rest of this pseudocode, the packet header is not updated; the remaining code simply keeps the feedback pool variables correct.
Line 26:
In this line, the packet's negative feedback contribution, neg_fbk, is set to the smaller of two terms. The first term, residue_neg_fbk, is the pool of negative feedback; i.e., this drains the remaining negative feedback in the pool. The second term increases the nominal negative feedback from the router by the amount by which Delta_Throughput is less than the router's net allocation. This allows the router to capture feedback which was allocated by an upstream bottleneck.
Line 27:
The positive allocation, pos_fbk, is adjusted to be the sum of Delta_Throughput and neg_fbk from Line 26. This ensures that pos_fbk minus neg_fbk equals Delta_Throughput.
Line 28-29:
In these two lines, the feedback pools, residue_pos_fbk and residue_neg_fbk, are reduced by the values of pos_fbk and neg_fbk accordingly, but prevented from going negative.
Line 30-31:
When a feedback pool becomes empty, set the scale factor to zero, i.e., stop handing out associated feedback.
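The per-packet processing of lines 20-31 can be transcribed as follows. This is an illustrative sketch, not normative code: the state dictionary holding Cp, Cn, and the residue pools is a hypothetical representation of the per-interval variables computed in lines 11-14, and all quantities are assumed to be in bytes/second.

```python
def on_packet_departure(state, pkt_size, throughput, delta_throughput):
    """Sketch of pseudocode lines 20-31.

    state -- hypothetical dict holding the per-interval variables:
             Cp, Cn, residue_pos_fbk, residue_neg_fbk
    Returns the value to write back into the packet's Delta_Throughput field.
    """
    pos_fbk = state["Cp"] * pkt_size / throughput          # line 20
    neg_fbk = state["Cn"] * pkt_size                       # line 21
    feedback = pos_fbk - neg_fbk                           # line 22
    if delta_throughput > feedback:                        # line 23
        delta_throughput = feedback                        # line 24: reduce request
    else:                                                  # line 25
        # line 26: capture feedback allocated by an upstream bottleneck
        neg_fbk = min(state["residue_neg_fbk"],
                      neg_fbk + (feedback - delta_throughput))
        pos_fbk = delta_throughput + neg_fbk               # line 27
    # lines 28-29: drain the pools, but never below zero
    state["residue_pos_fbk"] = max(0.0, state["residue_pos_fbk"] - pos_fbk)
    state["residue_neg_fbk"] = max(0.0, state["residue_neg_fbk"] - neg_fbk)
    if state["residue_pos_fbk"] <= 0.0:                    # line 30
        state["Cp"] = 0.0
    if state["residue_neg_fbk"] <= 0.0:                    # line 31
        state["Cn"] = 0.0
    return delta_throughput
```

For example, a packet requesting far more throughput than the router's allocation simply has its Delta_Throughput field rewritten to the router's per-packet feedback, and the residue pools are reduced accordingly.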

3.2.4 The Control Interval

The capacity allocation algorithm in an XCP router updates several parameters every Control Interval. The Control Interval is currently defined to be the average RTT of the flows passing through the router, i.e., avg_rtt calculated in Line 7 above. Other possible choices for the control interval are under study.

Notes on avg_rtt:

3.2.5 Obtaining the Persistent Queue

In Section 3.2.2 (Calculations Upon Control Interval Timeout) the variable 'queue' contains the persistent queue over the control interval. This is intended to be the minimum standing queue over the (queue estimation) interval.

The following pseudocode describes how to obtain the minimum persistent queue:

On packet departure do:

32. min_queue = min(min_queue, inst_queue)


When the queue-computation timer expires do:

33. queue = min_queue

34. min_queue = inst_queue

35. Tq = max(ALLOWED_QUEUE, (avg_rtt - inst_queue/capacity)/2)

36. queue_timer.reschedule(Tq)


Line 32:
The current instantaneous queue length is checked each time a packet departs, to compute the minimum queue size.

If avg_rtt is being used as the Control Interval, it MUST NOT be used as the interval for measuring the minimum persistent queue. Doing so can result in a feed-forward loop. For example, if a queue develops the average RTT will increase. If the avg_rtt increases, it takes longer to react to the growing queue and the queue gets larger, leading to instability.

Line 33:
Upon expiration of the queue estimation timer, Tq, the variable queue, the persistent queue, is set to be the minimum queue occupancy over the last Tq.
Line 34:
Upon expiration of the queue estimation timer, reset the running estimate of the minimum queue to be the current queue occupancy.
Line 35:
The first term in the max function, ALLOWED_QUEUE, is the standing queue size that the implementor is willing to maintain. (A nominal value of 2 ms worth of queuing is recommended, but this may be tuned by implementors.) The second term is an estimate of the propagation delay. In other words, the persistent queue is a queue that does not drain within a propagation delay. The division by 2 is a conservative factor to avoid overestimating the propagation delay.
Line 36:
The queue computation timer is set.
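Lines 32-36 can be sketched as follows. Again this is illustrative, not normative: the qstate dictionary holding min_queue and queue is a hypothetical representation of the router's state, times are assumed to be in seconds, queue sizes in bytes, and capacity in bytes/second.

```python
def note_queue_on_departure(qstate, inst_queue):
    # line 32: track the minimum instantaneous queue since the timer last fired
    qstate["min_queue"] = min(qstate["min_queue"], inst_queue)

def on_queue_timer(qstate, inst_queue, avg_rtt, capacity, ALLOWED_QUEUE=0.002):
    """Sketch of lines 33-36; ALLOWED_QUEUE of 2 ms is the nominal value.

    Returns Tq, the interval (seconds) at which to reschedule the timer.
    """
    qstate["queue"] = qstate["min_queue"]                            # line 33
    qstate["min_queue"] = inst_queue                                 # line 34
    Tq = max(ALLOWED_QUEUE, (avg_rtt - inst_queue / capacity) / 2.0) # line 35
    return Tq  # line 36: caller reschedules queue_timer for Tq
```

The second argument of the max in line 35 subtracts the current queueing delay (inst_queue/capacity) from avg_rtt to approximate the propagation delay, then halves it as the conservative factor described above.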


4. Unresolved Issues

This section will be expanded to include more detailed text on the topics below.

4.1 XCP With Non-XCP Routers

Non-XCP routers will exist in networks before XCP becomes commonly deployed and we expect other non-XCP systems to continue to be in the network indefinitely. Long term non-XCP network elements include any sort of link-level switches with queueing, e.g. ATM switches and sophisticated Ethernet switches. Even simple multiplexors are non-XCP queues with very little buffering.

Sources and the network care about these non-XCP elements because any one of them can be a site of network congestion, and if an XCP endpoint is bottlenecked at one of these non-XCP elements, no router feedback will inform the endpoint to slow down. If nothing is done, such an element will probably collapse under congestion.

Although exactly how XCP sources will operate in this environment is an open issue, a current promising direction is for endpoints to run a traditional end-to-end congestion detection algorithm in parallel with the XCP algorithm and switch over to using that algorithm for control when congestion is detected that XCP is not controlling. For example, an XCP source that detects 3 duplicate acknowledgments would fall back to TCP Reno behavior.

An endpoint that is limited by its end-to-end congestion algorithm would indicate so to XCP routers by setting a bit in the packet header. A router may process such packets differently from packets from endpoints that are being controlled by XCP. For example, the router might allocate end-to-end controlled packets less feedback or not reduce its feedback pools by the full amount when assigning feedback to those packets.

Though using its end-to-end algorithm to control its sending rate, an endpoint will also monitor the XCP feedback and if the source discovers that the XCP feedback would be more restrictive than the end-to-end control over a round trip time, the endpoint will revert to following XCP feedback. XCP feedback that is more restrictive over a round trip time is an indication that the endpoint's bottleneck is once again at an XCP router and the endpoint should take advantage of the more precise XCP information.

Evaluation of these algorithms is ongoing work.

4.2 XCP as a TCP PEP

In the Internet today, TCP performance-enhancing proxies (PEPs) are sometimes used to improve application performance over certain networks. TCP PEPs, and the issues surrounding their use, are described in [RFC3135]. A common mechanism used in TCP PEPs is to split a TCP connection into three parts, where the first and last run TCP and a more aggressive transport protocol is run in the middle (across the path which generates poor TCP performance). This improves performance over the "problematic" portion of the path and does not require changing the protocol stacks on the end systems. For example, if a high-speed satellite link were used to connect a LAN to the Internet, a TCP PEP might be placed on either side of the satellite link.

It is not unusual today to find TCP PEPs which, to get high data rates, do not use congestion control at all. Of course, this limits the environments in which they can be used. However, XCP may be used in between two TCP PEPs to get high transfer rates and still respond to congestion in a correct and scalable way.

Work on using XCP as a TCP PEP is just beginning. Objectives for such a mechanism would be:

A system which met the above objectives could also be used for incremental deployment of XCP. A network operator could deploy XCP-capable routers and use PEPs at the periphery of the network to convert from traditional TCP to XCP congestion control. This may result in smaller queues and improved link utilization within the XCP network. (This may be of more interest to wireless network providers than to over-provisioned fiber backbones.)

4.3 Sharing resources between XCP and TCP

Katabi describes a system for sharing router bandwidth between XCP and TCP flows, based on dividing the output capacity according to the average throughputs of the XCP and TCP sources using the router [Katabi03]. Two queues are installed at the router and are served with different weights by the forwarding engine. The algorithm is work-conserving; the forwarding engine is never idle when either queue has packets in it.

A TFRC-like algorithm estimates the average congestion window of TCP sources and the XCP algorithm estimates the average throughput of XCP sources. These averages are used to dynamically weight the time the processor spends on each queue. Initial experiments indicate that this system can provide fairness between the two flow classes (XCP/TCP) [Katabi03].

Open issues remain, however. For example, there are questions about how well the TFRC-like algorithm can estimate TCP throughput with access only to local drop rates, the convergence time of the weighting algorithm has never been explored, and no system for buffer allocation to complement the link capacity allocation has been put forward. These open issues are under study.

4.4 A Generalized Router Model

The XCP algorithm described here and in [Katabi03] manages congestion at a single point in a router, most likely an output queue. However, resource contention can occur at many points in a router: input queues, backplanes, and computational resources can 'congest' in addition to output buffers. There is a need to develop a general model and a variety of mechanisms to identify and manage resource contention throughout the router.

4.5 Host back-to-back operation

XCP hosts should be capable of back-to-back operation, i.e., with no router in the path. Nominally, this should not be a problem. A sender initializes delta_throughput to the desired value, no router modifies it and, thus, it is automatically granted. However, it has not yet been decided whether an XCP receiver should be capable of (or require) adjusting the delta_throughput to request flow control from the receiver to the sender.

At this point, XCP offers no mechanism for flow control. (Open question: Should it?) It is believed that running XCP on the output queue of a host would solve this problem. However, it isn't clear that the complexity is justified by the need to solve this situation.

4.6 Alternate Responses to Packet Loss

The correct response by an XCP end-system when faced with a packet drop is a research question. The solution sketched out in the section on response to packet loss is fairly conservative, and other solutions have been suggested. [Katabi03] proposed the use of recent drops to estimate a TCP-fair rate, as described in [Padhye98]. Another suggestion has been to observe Van Jacobson-style behavior for one RTT following a drop, then return to XCP behavior.

4.7 Alternate XCP Data Representations

Alternative representations of Congestion Header values may further reduce router processing. In particular, the divisions are expensive and should be performed at the sender if possible. Katabi has noted that the Throughput value in the Congestion Header is actually used only as the denominator of Pkt_size/Throughput (see Lines 2 and 20 above). Thus, the divisions performed on every packet can be eliminated by sending the ratio Pkt_size/Throughput, rather than Throughput, in the Congestion Header. Further research on this possibility is in progress.


5. Security Considerations

The presence of a header which may be read and written by entities not participating in the end-to-end communication opens some potential security vulnerabilities. This section describes them and tries to give enough context so that users can understand the risks.

Man-in-the-Middle Attacks
There is a man-in-the-middle attack in which a malicious user can force a sender to stop sending by inserting negative feedback into a flow. This is little different from a malicious user discarding packets belonging to a flow using VJ congestion control, or setting ECN bits. One question worth investigating further is whether the XCP attack is harder to diagnose.
Covert Data Channels
IPsec needs to be modified, as discussed in Section 2.3 (IPsec Issues), to allow routers to read the entire Congestion Header and write the delta_feedback field. This could become a covert data channel, i.e., a way for a compromised end-system to make data viewable to observers in the network.
Malicious Sources
The XCP algorithms rely on senders to advertise information about their current RTT and Throughput and to respond correctly to feedback delivered from the network. Naturally, the possibility arises that a sender will not perform these functions correctly. Chapter 7 of [Katabi03] and [Katabi04] examine these issues.
A source which lies about its throughput cannot affect the link utilization and, in the worst case, can unfairly acquire capacity. However, this is equivalent to a sender opening multiple TCP flows, so there is an incentive to lie about throughput. Because throughput is explicitly stated in each packet header, however, it is a simpler matter to police it at the edge of the network than it is for, say, TCP.
A source which lies about its RTT can disrupt the router control algorithm, particularly when a large number of sources lie about their RTT and the router control interval is adaptive and uses the average RTT. However, there is little incentive to lie as it will not affect the fair allocation of capacity and the liar will experience the same degradation as the non-lying flows. Lying about RTT should be considered a weak denial-of-service attack.
A flow may also ignore negative feedback from the router. Such a flow can obtain unfair throughput at a congested router. However, as with lying, the explicit nature of XCP makes it possible to verify that flows are responding to feedback. For example, a policing function in the path (presumably near the edge, so that the load is manageable and it can be expected to see packet flow in both directions) may inspect congestion headers for a flow in both directions. If the policer sees negative feedback heading towards a source and no reduction in throughput, it may, e.g., punish the flow by severely restricting its throughput. Note that this can be applied on a probabilistic basis, sampling flows only occasionally.


6. IANA Considerations

XCP requires the assignment of an IP protocol number. Once this value has been assigned, the number may be inserted (by the RFC Editor) into Section 2.1 (Header Placement) and this paragraph may be removed prior to publication.


7. Acknowledgements

The authors would like to acknowledge the many contributors who have assisted in this work. Bob Braden applied his usual sage guidance to the project and to the spec, in particular. Ted Faber wrote the initial implementation framework and provided much wisdom on kernel development and congestion control. Yuri Pryadkin performed significant analysis on the algorithm and developed much of the code. John Wroclawski advised on project priorities and strategy. Eric Coe developed the initial implementation and testbed. Aman Kapoor performed supporting simulations and debugged kernel code. Padma Haldar ported ns-2 simulation code to ns-2 distribution. Jasmeet Bagga and Anuraag Mittal conducted simulations on various aspects of XCP performance. On the XCP mailing list, Tim Shepherd, Tom Henderson, and Matt Mathis made valuable contributions to the effort. To all the above go our sincere thanks.


8. References

[I-D.ietf-dccp-spec] Kohler, E., "Datagram Congestion Control Protocol (DCCP)", draft-ietf-dccp-spec-05 (work in progress), October 2003.
[Jacobson88] Jacobson, V., "Congestion Avoidance and Control", ACM Computer Communication Review Proceedings of the Sigcomm '88 Symposium, August 1988.
[KHR02] Katabi, D., Handley, M. and C. Rohr, "Internet Congestion Control for Future High Bandwidth-Delay Product Environments", ACM Computer Communication Review Proceedings of the Sigcomm '02 Symposium, August 2002.
[Katabi03] Katabi, D., "Decoupling Congestion Control and Bandwidth Allocation Policy With Application to High Bandwidth-Delay Product Networks", MIT PhD. Thesis, March 2003.
[Katabi04] Katabi, D., "XCP's Performance in the Presence of Malicious Flows", Second International Workshop on Protocols for Fast Long-Distance Networks, Presentation, February 2004.
[Padhye98] Padhye, J., Firoiu, V., Towsley, D. and J. Kurose, "Modeling TCP throughput: A simple model and its empirical validation", ACM SIGCOMM '98 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, 1998.
[RFC0791] Postel, J., "Internet Protocol", STD 5, RFC 791, September 1981.
[RFC0793] Postel, J., "Transmission Control Protocol", STD 7, RFC 793, September 1981.
[RFC0813] Clark, D., "Window and Acknowledgement Strategy in TCP", RFC 813, July 1982.
[RFC1122] Braden, R., "Requirements for Internet Hosts - Communication Layers", STD 3, RFC 1122, October 1989.
[RFC2003] Perkins, C., "IP Encapsulation within IP", RFC 2003, October 1996.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC2309] Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G., Partridge, C., Peterson, L., Ramakrishnan, K., Shenker, S., Wroclawski, J. and L. Zhang, "Recommendations on Queue Management and Congestion Avoidance in the Internet", RFC 2309, April 1998.
[RFC2401] Kent, S. and R. Atkinson, "Security Architecture for the Internet Protocol", RFC 2401, November 1998.
[RFC2402] Kent, S. and R. Atkinson, "IP Authentication Header", RFC 2402, November 1998.
[RFC2406] Kent, S. and R. Atkinson, "IP Encapsulating Security Payload (ESP)", RFC 2406, November 1998.
[RFC2581] Allman, M., Paxson, V. and W. Stevens, "TCP Congestion Control", RFC 2581, April 1999.
[RFC2960] Stewart, R., Xie, Q., Morneault, K., Sharp, C., Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M., Zhang, L. and V. Paxson, "Stream Control Transmission Protocol", RFC 2960, October 2000.
[RFC3031] Rosen, E., Viswanathan, A. and R. Callon, "Multiprotocol Label Switching Architecture", RFC 3031, January 2001.
[RFC3135] Border, J., Kojo, M., Griner, J., Montenegro, G. and Z. Shelby, "Performance Enhancing Proxies Intended to Mitigate Link-Related Degradations", RFC 3135, June 2001.
[RFC3168] Ramakrishnan, K., Floyd, S. and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, September 2001.


Authors' Addresses

  Aaron Falk
  USC Information Sciences Institute
  4676 Admiralty Way
  Suite 1001
  Marina Del Rey, CA 90292
  Phone: 310-448-9327

  Dina Katabi
  Massachusetts Institute of Technology
  200 Technology Square
  Cambridge, MA 02139
  Phone: 617-324-6027


Intellectual Property Statement

Disclaimer of Validity

Copyright Statement