Network Working Group                                            A. Falk
Internet-Draft                                                       ISI
Expires: August 12, 2004                                       D. Katabi
                                                                     MIT
                                                       February 12, 2004


         Specification for the Explicit Control Protocol (XCP)
            ___________[DRAFT -- DO NOT FORWARD]___________
                            xcp-spec-04.txt

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on August 12, 2004.

Copyright Notice

   Copyright (C) The Internet Society (2004).  All Rights Reserved.

Abstract

   This document specifies the Explicit Control Protocol (XCP), a
   congestion control protocol where packets carry information from
   routers to inform end-systems of the congestion state of the
   network.  Packet header, end-system and router behavior, and issues
   regarding deployment are described.

Falk & Katabi           Expires August 12, 2004                 [Page 1]

Internet-Draft   XCP Spec *DRAFT -- DO NOT FORWARD*        February 2004

Table of Contents

   1.      Requirements notation
   2.      Introduction
   2.1     Protocol Overview
   3.      The Congestion Header
   3.1     Header placement
   3.2     Header definition
   3.3     IPsec issues
   3.4     NAT, middlebox issues
   3.5     MPLS/Tunneling Issues
   4.      XCP Functions
   4.1     End-System Functions
   4.1.1   Sending Packets
   4.1.2   Processing Feedback at the Receiver
   4.1.3   Processing Feedback at the Sender
   4.2     Router functions
   4.2.1   Calculations which must be done upon packet arrival
   4.2.2   Calculations which must be done upon Control Interval
           timeout
   4.2.3   Calculations which must be done upon packet departure
   4.2.4   The Control Interval
   4.2.5   Obtaining the persistent queue
   5.      Unresolved Issues
   5.1     Probing for XCP Capability
   5.2     XCP Within a Cloud
   5.3     Sharing resources between XCP and TCP
   5.4     A Generalized Router Model
   5.5     Host back-to-back operation
   5.6     Adaptively Aging the Congestion Window
   5.7     Alternate Responses to Packet Loss
   5.8     Alternate Representation Formats
   6.      Transport Protocol Issues
   6.1     TCP is the Canonical Application of XCP
   6.2     XCP for other Transport Protocols
   7.      Security Considerations
   8.      IANA Considerations
           References
           Authors' Addresses
           Intellectual Property and Copyright Statements

1.  Requirements notation

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2.  Introduction

   TCP [RFC0793] is the basic end-to-end transport protocol of the
   Internet, and the Van Jacobson congestion control algorithm used by
   TCP (proposed and widely implemented based on [Jacobson88] and
   standardized in [RFC2581]) is fundamental to efficient,
   high-performance, and stable network operation.  Jacobson's
   congestion control algorithm has been highly successful over many
   orders of magnitude in Internet growth, but recently has begun to
   reach its limits.

   Making a significant advance in TCP congestion control to overcome
   these limits is a strategic technological problem for the Internet.
   Gigabit-per-second file transfers, lossy wireless links, and
   high-latency connections are all driving TCP congestion control
   outside of its natural operating regime.  The resulting performance
   problems are of great concern for important scientific applications
   of the network.

   While there has been substantial research on this issue, most
   proposals are either tweaks to Jacobson's algorithm or modified
   queue management techniques, or both.  The requirement of extreme
   scalability together with robustness has inhibited serious proposals
   using explicit feedback from congested routers, until now.

   The recently-developed Explicit Control Protocol (XCP) [KHR02]
   represents a major advance in Internet congestion control.
   XCP delivers the highest possible application performance over a
   broad range of network infrastructure, including extremely
   high-speed and very high-delay links that are not well served by
   TCP's current control algorithms.  In so doing, it achieves maximum
   link utilization and wastes no bandwidth due to packet loss.  XCP is
   novel in separating the efficiency and fairness policies of
   congestion control, enabling routers to put available capacity to
   work quickly while conservatively managing the allocation of
   capacity to flows.  XCP's scalability is built upon a new principle:
   carrying per-flow congestion state in packets.  XCP packets carry a
   congestion header through which the sender requests a desired
   throughput.  Routers make a fair per-flow bandwidth allocation
   without maintaining any per-flow state.  Thus, the sender learns of
   the bottleneck router's allocation in a single round trip.

   XCP represents a significant change in Internet congestion control
   since it requires changes in the routers as well as in end systems.
   It will be necessary to develop and test XCP with real user traffic
   and in real environments, to gain experience with real router and
   host implementations and to collect data on performance.  Providing
   specifications is an important step towards enabling experimentation
   which, in turn, will lead to deployment.  Deployment issues will be
   addressed in more detail in subsequent documents.

   This document is organized as follows: the remainder of Section 2
   provides an overview of the XCP protocol, Section 3 discusses the
   format of the congestion header, Section 4 describes the functions
   occurring in the end-systems and routers, Section 5 discusses some
   unresolved issues, and Section 6 contains some application-specific
   issues.
2.1  Protocol Overview

   The participants in the XCP protocol include the sender, receiver,
   and intermediate nodes, often routers, where queueing occurs along
   the path from the sender to the receiver.  The sender uses feedback
   from the network to determine the maximum sending rate, or
   throughput, at which data can be injected into the network.  This
   feedback is acquired through the use of a congestion header on each
   packet it sends.  Routers along the path may update the congestion
   header as it moves from the sender to the receiver.  The receiver
   copies the network feedback into outbound packets of the same flow.
   An end-system may function as both a sender and a receiver at the
   same time.

   The figure below illustrates the entities participating in XCP.  The
   Sender initializes the congestion header, the Routers along the way
   may update it, and the Receiver copies the feedback from the network
   into a returning packet in the same flow.

   +----------+        +--------+     +--------+        +----------+
   |          |------->| Router |---->| Router |------->|          |
   |  Sender  |        +--------+     +--------+        | Receiver |
   |          |<----------------------------------------|          |
   +----------+                                         +----------+

   In the congestion header there are four pieces of data:

   o  RTT: the sender's current estimate of the round-trip time.

   o  Throughput: the sender's current send-rate or throughput.

   o  Delta_Throughput: initialized to the amount by which the sender
      would like to change its throughput; it may be updated by the
      routers along the path to be the network's allocated change in
      throughput.  This value may be a negative number if a router
      along the path wants the sender to slow down.

   o  Reverse_Feedback: the network's throughput allocation is returned
      to the Sender by having the Receiver copy Delta_Throughput into
      the Reverse_Feedback field of a congestion header in an outgoing
      acknowledgement.
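   As a rough illustration of the exchange described above, the
   following Python sketch models the four header fields and the three
   roles.  It is not part of the protocol definition: the names
   CongestionHeader, router_update, and receiver_echo are hypothetical,
   units are left abstract, and the router rule is simplified to "lower
   Delta_Throughput whenever the request exceeds this router's
   allocation".

```python
from dataclasses import dataclass


@dataclass
class CongestionHeader:
    """The four data items named in the overview (hypothetical model)."""
    rtt: int = 0                # sender's RTT estimate, 0 if unknown
    throughput: int = 0         # sender's current throughput
    delta_throughput: int = 0   # requested, then allocated, change
    reverse_feedback: int = 0   # allocation echoed back by the receiver


def router_update(hdr: CongestionHeader, allocation: int) -> None:
    """Simplified router step: a router may only reduce the request."""
    if hdr.delta_throughput > allocation:
        hdr.delta_throughput = allocation


def receiver_echo(data_hdr: CongestionHeader,
                  ack_hdr: CongestionHeader) -> None:
    """Receiver copies Delta_Throughput into Reverse_Feedback."""
    ack_hdr.reverse_feedback = data_hdr.delta_throughput


# The sender asks for +500; three routers allocate 800, 200 and 350.
data = CongestionHeader(rtt=80, throughput=1000, delta_throughput=500)
for allocation in (800, 200, 350):
    router_update(data, allocation)

ack = CongestionHeader()
receiver_echo(data, ack)
print(ack.reverse_feedback)   # the smallest allocation along the path
```

   In this sketch the value that reaches the sender is the smallest
   allocation made along the path, which corresponds to the bottleneck.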
   XCP routers calculate a per-packet capacity allocation based on a
   number of factors.  The router will compare its calculated
   allocation to the sum of Throughput and Delta_Throughput in each
   congestion header and, if the sum is less than the current
   allocation, the packet is not modified.  Otherwise, the router
   updates the Delta_Throughput field with a value corresponding to the
   router's allocation.  As the packet travels from sender to receiver,
   each router along the path performs this processing until, when the
   packet reaches the receiver, it contains the minimal capacity
   allocation from the network; in other words, the bottleneck
   bandwidth allocation.  The receiver copies this value into the
   Reverse_Feedback field of a returning packet in the same flow (e.g.,
   an ACK or DATA-ACK for TCP) and, in one round-trip, the sender
   receives the network's throughput allocation.

   The initial motivation for XCP has been for use as a congestion
   control algorithm for TCP [RFC0793].  To control TCP throughput, XCP
   relies on the notion that the sender maintains a congestion window,
   or cwnd, which controls the amount of unacknowledged data in the
   network.  (cwnd is defined for Van Jacobson congestion control in
   [RFC2581].)

   Additionally, it is possible to use XCP's explicit notification of
   the bottleneck capacity allocation for other types of applications.
   For example, XCP may be implemented to support multimedia streams
   over DCCP [I-D.ietf-dccp-spec] or other transport protocols.

   More context, analysis, and background can be found in [KHR02] and
   [Katabi03].

3.  The Congestion Header

   The data required for XCP to function are placed in a new header
   which is called the Congestion Header.

3.1  Header placement

   The Congestion Header is located between the IP and transport
   headers.
   This is because XCP is neither hop-by-hop communication -- as in IP
   -- nor end-to-end communication -- as in TCP or UDP -- but
   end-system-to-network communication.

   Other choices were considered for header location.  For example,
   making the congestion header a TCP option was suggested.  This made
   sense as the congestion information is related to the transport
   protocol.  However, it requires that routers be aware of the header
   format for every new transport protocol that might ever use XCP.
   This seemed like an unreasonable burden to place on the routers and
   would impede deployment of new transport protocols and/or XCP.

   It has also been suggested to make the congestion header an
   IPv4-style option.  While this proposal is transport protocol
   independent, it would force XCP packets to take the slow path even
   on non-XCP routers.  While XCP performance when non-XCP routers are
   in the path is unclear, it seemed that performance would be worse
   should non-XCP routers have to inspect every XCP header.  (Are there
   other reasons?)

   Matt Mathis has suggested that the congestion header should be
   placed immediately after the IP header, where it would be "easy" to
   find by routers on packets with no IP options.  [XXXXX: Is this
   true?  Perhaps Cisco can offer some feedback.]

   The XCP protocol uses protocol number [TBD], assigned by IANA.  IP
   packets containing XCP headers will use this protocol number in the
   IP header's Protocol field [RFC0791] to indicate to routers and
   end-systems that an XCP congestion header follows the IP header.

3.2  Header definition

   This section defines the XCP congestion header.
   All XCP-capable packets carry the following header:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |version|format |   protocol    |    length     |    unused     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                              rtt                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          throughput                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       delta_throughput                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       reverse_feedback                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Version: 4 bits

      This field indicates the version of XCP which is in use.  The
      version of XCP described in this document corresponds to a value
      of 0x01.  Future values will be assigned by IANA
      (http://www.iana.org).  See note in Section 8.

   Format: 4 bits

      This field indicates the congestion header format.  Two formats
      are defined at this time: a standard format and a minimal format.
      (The format field may be used in the future to define different
      representation formats for the throughput, delta_throughput,
      and/or rtt fields.)

      Standard Format.  The standard format indicates that the
      throughput, delta_throughput, and RTT fields are in use, allowing
      for XCP congestion control of the packets carrying the standard
      format congestion header.  The format field should be set to 0x1
      to indicate use of the Standard Format.

      Minimal Format.  The throughput, delta_throughput, and RTT fields
      are unused and SHOULD be set to zero.  Routers seeing a packet
      using Minimal Format MUST treat these packets as non-XCP packets,
      i.e., a router should not perform any XCP processing on these
      packets.  The minimal format allows a receiver to return feedback
      to the sender without congestion controlling the packets carrying
      the congestion header.
      TCP acknowledgements MAY use the minimal format.  The use of the
      minimal format should be considered experimental.  The format
      field should be set to 0x2 to indicate use of the Minimal Format.

      Future format values will be assigned by IANA.  See note in
      Section 8.  The table below summarizes format values:

                     +-----------------+-------+
                     | Format          | Value |
                     +-----------------+-------+
                     | Standard Format |  0x1  |
                     |                 |       |
                     | Minimal Format  |  0x2  |
                     +-----------------+-------+

                               Table 1

   Protocol: 8 bits

      This field indicates the next level protocol used in the data
      portion of the packet.  The values for various protocols are
      specified by IANA.

   Length: 8 bits

      This field indicates the length of the congestion header,
      measured in bytes.

   Unused: 8 bits

      This field is unused and MUST be set to zero in this version of
      XCP.

   RTT: 32 bits

      This field indicates the round-trip time measured by the sender,
      in milliseconds.  This field is an unsigned integer.

      The minimum value expressible in this field is 1ms (values should
      be rounded up).  The maximum value expressible in this field is
      4.3E9ms or approximately 49 days.  A value of zero in the RTT
      field indicates that the sender does not yet know the round-trip
      time.

   Throughput: 32 bits

      This field indicates the current throughput, i.e., cwnd/RTT, as
      measured by the sender.  The units are kilobytes per second.
      Throughput values should be rounded up.  The maximum value
      expressible in this field is 34 Tbps, in steps of 8 kilobits per
      second.

   Delta_Throughput: 32 bits

      This field indicates the desired or allocated change in
      throughput, measured in bytes per second.  This is a signed
      value.  A '0' in bit 0 indicates a positive value and a '1' in
      bit 0 indicates a negative value.  The minimum value expressible
      in this field is -17 Gbps.  The maximum value expressible in this
      field is 17 Gbps, in steps of 8 bits per second.
      This field is set by the sender to indicate the amount by which
      the sender would like to adjust its throughput.  The value may be
      subsequently reduced by routers along the path (see Section 4.2).

   Reverse_Feedback: 32 bits

      This field indicates the value of Delta_Throughput received by
      the data receiver.  The receiver copies the field
      Delta_Throughput into the Reverse_Feedback field of the next
      outgoing packet in the same connection.  If a receiver receives
      multiple packets containing Delta_Throughput fields before
      sending a Reverse_Feedback field, the receiver is responsible for
      aggregating the feedback (see Section 4.1.2).

3.3  IPsec issues

   IPsec [RFC2401] must be slightly modified to accommodate use of XCP.
   The specifications for the IP Authentication Header (AH) [RFC2402]
   and IP Encapsulating Security Payload (ESP) [RFC2406] state that the
   IPsec headers immediately follow the IP header.  This is a problem
   for XCP in that a) it will make the XCP headers harder for routers
   to find, b) ESP encryption will make it impossible for routers along
   the path to read and write congestion header information, and c) AH
   authentication will fail if any router along the path has modified a
   congestion header.

3.4  NAT, middlebox issues

   Middleboxes which attempt to perform actions invisibly on flows must
   preserve the congestion header.  Middleboxes which terminate the TCP
   should terminate the XCP.  Middleboxes containing queues should
   participate in XCP.

3.5  MPLS/Tunneling Issues

   When a flow enters an IP tunnel [RFC2003], IPsec ESP tunnel
   [RFC2406], or MPLS [RFC3031] network ingress point, the congestion
   header should be replicated on the "front" of the inner IP packet.
   For example, when a packet enters an IP tunnel, the following
   transformation should occur:

                         [IP2]  \_ outer header
                         [XCP]  /
      [IP1]              [IP1]  \
      [XCP]     --->     [XCP]  |_ inner header
      [TCP]              [TCP]  |
       ...                ...   /

   Note that the XCP header placed behind the outer IP header is copied
   from the inner header (with the appropriate change to the Protocol
   field to indicate that the next protocol is IP).

   When the packet exits the tunnel, the congestion header (which may
   have been modified by routers along the tunneled path) is copied
   from the outer header into the inner header.

4.  XCP Functions

   XCP is concerned with the sender and receiver end-systems and the
   routers along the packet's path.  This section describes the
   function of each of these entities.  The emphasis in this section is
   on describing how the algorithm and header processing should work.
   The derivation and analysis of the algorithms can be found in
   [Katabi03].

4.1  End-System Functions

4.1.1  Sending Packets

   The sender is responsible for maintaining two parameters: a current
   estimate of the round-trip time to the receiver and a congestion
   window, cwnd.  The sender may also maintain a desired throughput and
   a desired increase in throughput.  The desired throughput may be
   supplied by an application via an API or may be the speed of the
   local interface.  A sender may choose to use any reasonable value,
   i.e., any achievable value, for desired throughput.  The desired
   increase in throughput will normally be the difference between the
   current throughput and the desired throughput.  However, if the
   sender does not have sufficient data to send to use up the available
   cwnd, the desired increase in throughput SHOULD be zero.

   When sending a packet, the sender fills in the fields of the
   congestion header as follows:

   o  The Throughput field is set to be the current throughput.  For
      TCP this value can be estimated using cwnd (in bytes) divided by
      RTT (in milliseconds).

      (Throughput SHOULD carry a value of zero before RTT is known.
      This means that the first packets in a flow will use capacity
      that will not be tracked in the router control algorithms.  Since
      it requires only a single round trip of one packet for each flow
      to gain an estimate of RTT, this is probably negligible.)

   o  The RTT field is set to be the current round-trip time (or zero
      if the round-trip time is not yet known).

   o  The Delta_Throughput field is set to be the desired per-packet
      throughput increase.

      The desired change in throughput is distributed over each packet
      to be sent in a round-trip time because... (XXXX: I think there's
      a stability and fairness argument here but need to check with
      Dina.)

      To obtain the per-packet value from the overall desired
      throughput, first subtract the current throughput (which may be
      estimated as cwnd/RTT) from the desired throughput.  Then
      calculate the per-packet desired throughput increase by dividing
      that difference by the number of packets in the current
      congestion window.  Packets per window may be estimated by
      dividing cwnd by the Maximum Segment Size (MSS).  So, the
      Delta_Throughput (in bytes/second) can be expressed as:

                          desired_throughput - cwnd*1024/RTT
      Delta_Throughput = -----------------------------------
                                     cwnd/MSS

      where:

         desired_throughput is measured in bytes/second
         RTT is measured in milliseconds
         cwnd is measured in bytes
         MSS is measured in bytes

      As stated above, if there is insufficient data to fill the
      available cwnd, Delta_Throughput should be set to zero.
      Delta_Throughput should also be set to zero if, for any reason,
      no additional capacity is needed.

4.1.2  Processing Feedback at the Receiver

   An XCP receiver is responsible for copying the Delta_Throughput data
   it sees on arriving packets into the Reverse_Feedback field of
   outgoing packets.  In TCP, outgoing packets would normally be bare
   acknowledgements.
   In some cases returning packets are sent less frequently than
   arriving packets, e.g., with delayed acknowledgements [RFC1122].
   The receiver is responsible for calculating the sum of the arriving
   Delta_Throughput fields for placement in outgoing Reverse_Feedback
   fields.

4.1.2.1  TCP Acknowledgements

   XCP allows end-systems to use congestion control for TCP
   acknowledgements.  This means an end-system TCP MAY reduce cwnd when
   an ACK is sent and otherwise perform XCP congestion control on an
   outgoing ACK stream.

   However, an end-system MAY choose not to apply congestion control to
   TCP ACKs.  The receiver end-system MUST still return XCP congestion
   feedback from the network to the sender.  This may be accomplished
   through the use of the Minimal Format XCP header.  The receiver
   copies the Delta_Throughput information from arriving packets into
   the Reverse_Feedback field in outgoing packets (possibly aggregating
   data as described in Section 4.1.2).  The RTT, Throughput, and
   Delta_Throughput fields MUST be set to zero when using the Minimal
   Format XCP header.

4.1.3  Processing Feedback at the Sender

   When packets arrive back at the sender, e.g., TCP acknowledgements,
   the sender's cwnd is updated using the information in the
   Reverse_Feedback field according to this formula:

      cwnd = max(cwnd + feedback * Rtt * g1, MSS)

   where:

      cwnd     = current congestion window (bytes)
      feedback = Reverse_Feedback field from received packet
                 (bytes/sec, may be +/-)
      Rtt      = current round trip time estimate (ms)
      g1       = 0.001 (sec/ms), conversion factor
      MSS      = maximum segment size (bytes)

   The value of cwnd has a minimum of MSS to avoid the "Silly Window
   Syndrome" [RFC0813].

4.1.3.1  Aging the Congestion Window

   When an available cwnd is not utilized, XCP should reduce it as time
   passes to avoid sudden bursts of data into the network if the
   application starts to send data later.
   Section 4.5 of [Katabi03] proposes the following algorithm for aging
   the congestion window.  Consideration of other algorithms is a
   research topic.

   In each RTT in which the sender does not send a cwnd of data, cwnd
   MUST be reduced according to the following formula:

      cwnd = max(0.5 * (cwnd - K), MSS)

   where:

      K = the number of outstanding bytes

4.1.3.2  Response to Packet Loss

   If a packet drop or ECN notification [RFC3168] is detected, an XCP
   sender should transition to Van Jacobson-like behavior as specified
   in [RFC2581].  In other words, cwnd should be halved and traditional
   fast retransmission/fast recovery, slow start and congestion
   avoidance should be applied.

   Some implementation notes:

   o  The change in congestion control algorithm should be delayed
      until the three DUPACKs have arrived, in accordance with the Fast
      Retransmission/Fast Recovery algorithm.

   o  Once the change to VJ dynamics has occurred, cwnd should be
      managed using the RFC2581 algorithm except that negative feedback
      from arriving packets should not be ignored.  I.e., cwnd should
      be reduced by the value of the Reverse_Feedback field, if the
      value is negative.

   o  The Throughput field in outgoing packets should continue to
      reflect the current cwnd/RTT.  If cwnd exceeds the allocation of
      a bottleneck router running XCP, negative feedback will be sent.
      If the router is not running XCP, eventually a packet will drop
      and VJ dynamics will be invoked.

4.2  Router functions

   An XCP router maintains two control algorithms on each output port:
   a congestion controller and a fairness controller.  The congestion
   controller is responsible for making maximal use of the outbound
   link while at the same time draining any standing queues.  The
   fairness controller is responsible for fairly allocating bandwidth
   to flows sharing the link.
   Each port instance of XCP is independent of every other, and
   references to an "XCP router" should be read as an instance of XCP
   running on a particular output port.

   [Actually, it is an oversimplification to say that congestion in
   routers only appears at output ports.  Routers are complex devices
   which may experience resource contention in many forms and
   locations.  Correctly expressing congestion which doesn't occur at
   the router output port is a topic for further study.]

   The router calculations are divided into those which occur upon
   packet arrival, those which occur upon control interval timeout,
   those which occur upon packet departure, and the assessment of the
   persistent queue, which uses a separate timer.  The calculations are
   presented in the following sections as annotated pseudocode.

4.2.1  Calculations which must be done upon packet arrival

   When a packet arrives at a router, several parameters used by XCP
   need to be updated, as described in the following pseudocode.

   ========================================================
   On packet arrival do:

   1.  input_traffic += Pkt_size
   2.  sum_inv_throughput += Pkt_size / Throughput
   3.  if (Rtt < MAX_INTERVAL) then
   4.     sum_rtt_by_throughput += Rtt * Pkt_size / Throughput
   5.  else
   6.     sum_rtt_by_throughput += MAX_INTERVAL * Pkt_size / Throughput
   ========================================================

   Line 1:  The variable input_traffic accumulates the volume of data
      that have arrived during a control interval.  When a packet
      arrives, the packet size is taken from the IP header and is added
      to the ongoing count.

   Line 2:  The variable sum_inv_throughput is used in the control
      interval calculation (see equation 4.2 of [Katabi03]) and in
      capacity allocation.  For each packet, the ratio of packet size
      (from the IP header) to advertised Throughput (from the XCP
      header) is accumulated.
   Lines 3 and 5:  A test is performed to check whether the round trip
      time of the flow exceeds the maximum allowable control interval.
      If so, MAX_INTERVAL, the maximum allowable control interval, is
      used in the subsequent calculations.  Too large a control
      interval will delay new flows from acquiring their fair
      allocation of capacity.

   Lines 4 and 6:  As in Line 2, the variable sum_rtt_by_throughput is
      used in the control interval calculation.

4.2.2  Calculations which must be done upon Control Interval timeout

   When the control timer expires, several variables need to be updated
   as shown below.

   ========================================================
   On estimation-control timeout do:

   7.   avg_rtt = sum_rtt_by_throughput / sum_inv_throughput
   8.   F = a * (capacity - input_traffic / ctl_interval)
            - b * queue / avg_rtt
   9.   shuffled_traffic = 0.1 * input_traffic / ctl_interval
   10.  Cp = (max(F,0) + shuffled_traffic) / sum_inv_throughput
   11.  Cn = (max(-F,0) + shuffled_traffic) / input_traffic
   12.  residue_pos_fbk = (max(F,0) + shuffled_traffic)
   13.  residue_neg_fbk = (max(-F,0) + shuffled_traffic)
   14.  input_traffic = 0
   15.  sum_inv_throughput = 0
   16.  sum_rtt_by_throughput = 0
   17.  ctl_interval = max(avg_rtt, MIN_INTERVAL)
   18.  timer.reschedule(ctl_interval)
   ========================================================

   Line 7:  Update avg_rtt by taking the ratio of the two sums
      maintained in the previous section.  This value is used to
      determine the control interval below (line 17).

   Line 8:  The aggregate feedback, F, is calculated.  The variable
      capacity is the outbound link capacity.  The variable avg_rtt is
      calculated in line 7.  The variable queue is the persistent queue
      and is defined in section XXXX.  The values a and b are constant
      parameters.  The constant a may be any positive number such that
      a < pi/(4*sqrt(2)).  A nominal value of 0.4 is recommended.
      The constant b is defined to be b = a^2*sqrt(2).  (If the nominal
      value of a is used, the value for b would be 0.226.)  Note that F
      may be positive or negative.

   Line 9:  This line establishes the amount of capacity that will be
      shuffled in the next control interval.  Shuffling takes a small
      amount of the available capacity (no more than 10% of the
      input_traffic) and redistributes it by adding it to both the
      positive and negative feedback pools.  This allows new flows to
      acquire capacity in a fully loaded system.  The variable
      input_traffic is defined in line 1 and the value of ctl_interval
      (defined on line 17) is used from the previous control interval.

      Implementors may shuffle a fraction other than 10% of the
      input_traffic by choosing a value other than 0.1 in this line.
      More shuffled traffic decreases the time for new flows to acquire
      capacity (and converge to fairness).  However, more traffic
      shuffling adds variation to the capacity an individual flow
      receives and may disturb some applications, although not TCP.
      Shuffled_traffic is always a positive value.

   The objective of the feedback calculations is to obtain a per-packet
   feedback allocation from the router.  The following two lines obtain
   factors in this calculation which have no physical meaning.  One
   might view them as per-flow capacity allocations that have some
   additional processing to prepare them for per-packet allocation.
   Note that, with the use of shuffled_traffic, a non-idle router will
   have non-zero values for both Cn and Cp.

   Line 10:  This line calculates the positive feedback scale factor,
      Cp.  The variables F, shuffled_traffic, and sum_inv_throughput
      are defined above.

   Line 11:  This line calculates the negative feedback scale factor,
      Cn.  This is a positive value.  The definitions for F,
      shuffled_traffic, and input_traffic are given above.
Line 12: The variable residue_pos_fbk keeps track of the pool of available positive capacity a router has to allocate. It is initialized to the positive aggregate feedback.

Line 13: The variable residue_neg_fbk keeps track of the pool of available negative capacity a router has to allocate. It is initialized to the negative aggregate feedback. This variable is always positive.

Lines 14-16: Reset various counters for the next control interval.

Line 17: Set the next control interval. The use of MIN_INTERVAL is important to establish a reasonable control interval when the router is idle.

Line 18: Set timer.

4.2.3 Calculations which must be done upon packet departure

An XCP router processes each packet using the feedback parameters calculated above. As stated earlier, each packet indicates the current Throughput and a throughput adjustment, Delta_Throughput. The router calculates a per-packet capacity change which will be compared to the Delta_Throughput field in the packet header. Using the AIMD rule, positive feedback is applied equally per-flow while negative feedback is made proportional to each flow's capacity. Processing should be done according to the pseudocode below.

========================================================
On packet departure:

19. pos_fbk = Cp * Pkt_size / Throughput
20. neg_fbk = Cn * Pkt_size
21. feedback = pos_fbk - neg_fbk
22. if (Delta_Throughput > feedback) then
23.    Delta_Throughput = feedback
24.    residue_pos_fbk -= pos_fbk
25.    residue_neg_fbk -= neg_fbk
26. else
27.    if (Delta_Throughput >= 0)
28.       residue_pos_fbk -= Delta_Throughput
29.       residue_neg_fbk -= (feedback - Delta_Throughput)
30.    else
31.       residue_neg_fbk += Delta_Throughput
32.       if (feedback >= 0) then
33.          residue_neg_fbk -= feedback
34. if (residue_pos_fbk <= 0) then Cp = 0
35. if (residue_neg_fbk <= 0) then Cn = 0
========================================================

Line 19: The contribution of positive feedback for the current packet is calculated using Cp, defined in line 10, Pkt_size from the packet header, and Throughput (the flow's advertised throughput) also from the packet header.

Line 20: The contribution of negative feedback for the current packet is calculated using Cn, defined in line 11, and Pkt_size from the packet header. This value of neg_fbk is positive.

Line 21: The router's allocated feedback for the packet is the positive per-packet feedback minus the negative per-packet feedback. This value may be positive or negative.

Lines 22-25: In line 22 there is a test to see whether the packet is requesting more capacity (via the packet's Delta_Throughput field) than the router has allocated. If so, the sender's desired throughput needs to be reduced to the router's allocation. In line 23 the Delta_Throughput field in the packet header is updated with the router feedback allocation. In lines 24 and 25 the pool of feedback is reduced by the amount allocated to the packet.

Lines 27-29: Line 27 tests that the flow would like to increase its throughput by less than the router's allocation. (The router has made a positive allocation here.) In this case, the packet header is unchanged, i.e., the router grants the flow's request. The positive feedback pool is reduced by the amount the flow plans to increase in line 28. In line 29 the negative feedback pool is decreased by the difference between the router's allocation and the flow's request, as though a negative allocation of that amount was made to the flow.

Lines 30-33: In this block of code, Delta_Throughput is negative and less than feedback, the router's allocation. In line 31 the negative feedback pool is reduced by Delta_Throughput, similar to line 29.
In line 33 the negative feedback pool is reduced because the flow is receiving negative feedback, even though it was another router that allocated it.

Lines 34-35: When a residual pool becomes empty, set the corresponding scale factor to zero.

4.2.4 The Control Interval

The capacity allocation algorithm in an XCP router updates several parameters every Control Interval. The Control Interval MAY be any "reasonable" value. The Control Interval used above, and in [KHR02], is the average RTT of the flows passing through the router.

Notes on avg_rtt:

o When avg_rtt is used in this document, it refers to the last calculated avg_rtt, in other words, the avg_rtt calculated based on packets arriving in the previous control interval.

o The avg_rtt calculation should ignore packets with an RTT of zero in the header.

o avg_rtt MUST have a minimum value. This is to allow flows to acquire bandwidth from a previously idle router. The default minimum value, MIN_INTERVAL, should be max(5-10ms, propagation delay on attached link).

o avg_rtt MUST have a maximum value. The default maximum value, MAX_INTERVAL, should be max(0.5-1 sec, propagation delay on attached link).

4.2.5 Obtaining the persistent queue

In Section 4.2.2 the variable queue contains the "persistent queue" over the control interval. This is intended to be the minimum standing queue over the interval. The following pseudocode describes how to obtain the minimum persistent queue:

========================================================
On packet departure do:

36. if (inst_queue < min_queue) then min_queue = inst_queue

========================================================
When the queue-computation timer expires do:

37. queue = min_queue
38. Tq = max(MAX_QUEUE, (avg_rtt - inst_queue/capacity)/2)
39. queue_timer.reschedule(Tq)
========================================================

Line 36: The current instantaneous queue length is checked each time a packet departs to see if the minimum requires updating.

If avg_rtt is being used as the Control Interval, it MUST NOT be used as the interval for measuring the minimum persistent queue. Doing so can result in a feed-forward loop. For example, if a queue develops, the average RTT will increase. If the avg_rtt increases, it takes longer to react to the growing queue and the queue gets larger, leading to instability.

Line 37: Upon expiration of the queue estimation timer, Tq, the variable queue, the persistent queue, is set to be the minimum queue occupancy over the last Tq.

Line 38: The first term in the max function is the amount of queuing that the router is willing to maintain regardless. (A nominal value of 2ms worth of queuing is recommended but this may be tuned by implementors.) The second term is an estimate of the propagation delay. In other words, the persistent queue is a queue that does not drain in a propagation delay. The division by 2 is a conservative factor to avoid overestimating the propagation delay.

Line 39: The queue computation timer is set.

5. Unresolved Issues

This section will be expanded to include more detailed text on the topics below.

5.1 Probing for XCP Capability

[Describe Dina's idea about how XCP routers can discover if their adjacent router is XCP capable.]

[Tim asks: What will that discovery mechanism cost (in terms of round trips)? Should we have a mechanism to remember what endpoints do and do not have XCP capability? To configure when XCP should and should not be used?
]

5.2 XCP Within a Cloud

[Describe the functions required of a router at the edge of an XCP cloud in the middle of the network.]

5.3 Sharing resources between XCP and TCP

[Describe the need for separate queues for XCP and TCP flows. (Really should be XCP and everything else.)]

5.4 A Generalized Router Model

The XCP algorithm described here and in [Katabi03] manages congestion at a single point in a router, most likely an output queue. However, resource contention can occur at many points in a router. Input queues, backplanes, and computational resources can 'congest' in addition to output buffers. There is a need to develop a general model and a variety of mechanisms to identify and manage resource contention throughout the router.

5.5 Host back-to-back operation

XCP hosts should be capable of back-to-back operation, i.e., with no router in the path. Nominally, this should not be a problem. A sender initializes delta_throughput to the desired value, no router modifies it and, thus, it is automatically granted. However, it has not yet been decided whether an XCP receiver should be capable of (or require) adjusting the delta_throughput to request flow control from the receiver to the sender. At this point, XCP offers no mechanism for flow control. (Open question: Should it?)

5.6 Adaptively Aging the Congestion Window

The constant 0.5 in the above equation is somewhat arbitrary. It has been suggested that it be made a function of cwnd so that a small value of cwnd ages more slowly than a large value of cwnd. This would protect the network from large flows starting up with no probing and allow small flows to start quickly. Values other than 0.5, however, should be considered experimental.

5.7 Alternate Responses to Packet Loss

The correct response by an XCP end-system when faced with a packet drop is a research question.
The solution sketched out in Section 4.1.3.2 is fairly conservative and other solutions have been suggested as well. [Katabi03] proposes that an appropriate response is to use recent drops to estimate a TCP fair rate as described in [Padhye98]. Another suggestion has been made to observe Van Jacobson-style behavior for one RTT following a drop, then return to XCP behavior.

5.8 Alternate Representation Formats

Alternate representations of header values need to be considered to reduce router processing. In particular, the divisions are costly for routers and should be done at the sender if possible.

6. Transport Protocol Issues

This section addresses some of the details of how XCP, as a congestion control protocol, should be used by particular transport protocols.

6.1 TCP is the Canonical Application of XCP

XCP congestion control was developed with TCP in mind as a primary application. This section describes additional considerations when using XCP as congestion control for TCP.

6.2 XCP for other Transport Protocols

XCP may be used for more than non-real time, reliable file transfer. For example, XCP may be implemented as a CCID for DCCP [I-D.ietf-dccp-spec] for non-reliable transport. This section will discuss considerations when using XCP for transport protocols other than TCP.

7. Security Considerations

The presence of a header which may be read and written by entities not participating in the end-to-end communication opens some potential security vulnerabilities. This section describes them and tries to give enough context so that users can understand the risks.

Man-in-the-Middle Attacks

There is a man-in-the-middle attack where a malicious user can force a sender to stop sending by inserting negative feedback into a flow.
This is little different from a malicious user discarding packets belonging to a flow using VJ congestion control or setting ECN bits. One question worth investigating further is whether the XCP attack is harder to diagnose.

Covert Data Channels

IPsec needs to be modified, as discussed in Section 3.3, to allow routers to read the entire congestion header and write the Delta_Throughput field. This could become a covert data channel, i.e., a way in which a compromised end-system can make data viewable to observers in the network.

8. IANA Considerations

XCP requires the following registries:

o XCP version number, 4 bits.

o XCP header format, 4 bits.

XCP also requires the assignment of a protocol number. Once this value has been assigned, the number may be inserted (by the RFC Editor) into Section 3.1 and this paragraph may be removed prior to publication.

References

[I-D.ietf-dccp-spec] Kohler, E., "Datagram Congestion Control Protocol (DCCP)", draft-ietf-dccp-spec-05 (work in progress), October 2003.

[Jacobson88] Jacobson, V., "Congestion Avoidance and Control", ACM Computer Communication Review, Proceedings of the Sigcomm '88 Symposium, August 1988.

[KHR02] Katabi, D., Handley, M. and C. Rohr, "Internet Congestion Control for Future High Bandwidth-Delay Product Environments", ACM Computer Communication Review, Proceedings of the Sigcomm '02 Symposium, August 2002.

[Katabi03] Katabi, D., "Decoupling Congestion Control and Bandwidth Allocation Policy With Application to High Bandwidth-Delay Product Networks", MIT PhD. Thesis, March 2003.

[Padhye98] Padhye, J., Firoiu, V., Towsley, D. and J. Kurose, "Modeling TCP throughput: A simple model and its empirical validation", ACM SIGCOMM '98 conference on Applications, technologies, architectures, and protocols for computer communication, 1998.

[RFC0791] Postel, J., "Internet Protocol", STD 5, RFC 791, September 1981.

[RFC0793] Postel, J., "Transmission Control Protocol", STD 7, RFC 793, September 1981.

[RFC0813] Clark, D., "Window and Acknowledgement Strategy in TCP", RFC 813, July 1982.

[RFC1122] Braden, R., "Requirements for Internet Hosts - Communication Layers", STD 3, RFC 1122, October 1989.

[RFC2003] Perkins, C., "IP Encapsulation within IP", RFC 2003, October 1996.

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC2401] Kent, S. and R. Atkinson, "Security Architecture for the Internet Protocol", RFC 2401, November 1998.

[RFC2402] Kent, S. and R. Atkinson, "IP Authentication Header", RFC 2402, November 1998.

[RFC2406] Kent, S. and R. Atkinson, "IP Encapsulating Security Payload (ESP)", RFC 2406, November 1998.

[RFC2581] Allman, M., Paxson, V. and W. Stevens, "TCP Congestion Control", RFC 2581, April 1999.

[RFC3031] Rosen, E., Viswanathan, A. and R. Callon, "Multiprotocol Label Switching Architecture", RFC 3031, January 2001.

[RFC3168] Ramakrishnan, K., Floyd, S. and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, September 2001.
Authors' Addresses

Aaron Falk
USC Information Sciences Institute
4676 Admiralty Way
Suite 1001
Marina Del Rey, CA 90292

Phone: 310-448-9327
EMail: falk@isi.edu
URI: http://www.isi.edu/~falk

Dina Katabi
Massachusetts Institute of Technology
200 Technology Square
Cambridge, MA 02139

Phone: 617-324-6027
EMail: dk@mit.edu
URI: http://www.ana.lcs.mit.edu/dina/

Intellectual Property Statement

The IETF takes no position regarding the validity or scope of any intellectual property or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; neither does it represent that it has made any effort to identify any such rights. Information on the IETF's procedures with respect to rights in standards-track and standards-related documentation can be found in BCP-11. Copies of claims of rights made available for publication and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementors or users of this specification can be obtained from the IETF Secretariat.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights which may cover technology that may be required to practice this standard. Please address the information to the IETF Executive Director.

Full Copyright Statement

Copyright (C) The Internet Society (2004). All Rights Reserved.
This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assignees.

This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgment

Funding for the RFC Editor function is currently provided by the Internet Society.