ALN – Active Learning Networks

Making TCP Smarter

ALN applies experience with previous network connections to help tune future network connections.

Current Internet hosts open new connections that are initialized with a number of default parameters. These defaults are intended to be conservative, such as ‘start with one packet at a time’ and ‘assume you know nothing about the round trip time’.

Experience with TCP has shown that past connections can be used to help tune some of these parameters for future connections.

ALN assumes:

  • TCP does a good job converging on TCB state over time,
    but a lousy job of guessing initial conditions
  • TCP experiences a stable network over the connection,
    and a stable offered load

ALN does the following:

  • RECORDS TCP end-of-connection state, as well as ‘kitchen sink’ data (weather, endpoint location, etc.)
  • TRAINS an adaptive learning module on the state data (TCB state:kitchen sink state)
  • APPLIES predictions to the TCBs of new connections (i.e., look up kitchen sink state and retrieve expected TCP initial state)
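
The record/train/apply cycle above can be sketched in a few lines of Python. This is a minimal illustration, not the project's learning module: the class name `ContextPredictor`, the context tuple, and the state field names are all hypothetical, and simple per-context averaging stands in for whatever adaptive learner is actually trained.

```python
from collections import defaultdict
from statistics import mean


class ContextPredictor:
    """Toy stand-in for the adaptive learning module (hypothetical)."""

    def __init__(self):
        self._records = defaultdict(list)  # context -> list of final states
        self._model = {}                   # context -> averaged state

    def record(self, context, final_state):
        """RECORD: store end-of-connection state under its context."""
        self._records[context].append(final_state)

    def train(self):
        """TRAIN: average every recorded state variable per context."""
        self._model = {
            ctx: {key: mean(s[key] for s in states) for key in states[0]}
            for ctx, states in self._records.items()
        }

    def apply(self, context, defaults):
        """APPLY: predicted initial state, or the defaults if unseen."""
        return {**defaults, **self._model.get(context, {})}
```

A connection whose context has never been seen simply starts from the conservative defaults, so under these assumptions a missing prediction costs nothing beyond the lookup.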

Parts of the ALN Project

Data Collection
Collect TCP state information with associated ‘real-world’ context and look for possible correlation axes. Also examine the start/end state and determine potential bounds of performance benefit.
Predictors
Predictors are trained off-line using measured data and then run on-line to predict future TCP parameter initial conditions.
Integration
ALN plans to deploy learning in a real OS. This includes hooks into the kernel to allow initialization of real code, as well as user-level off-line training and on-line prediction.

The testbed is intended to enable statistics gathering, when run as a proxy cache for conventional endsystems. It also enables experiments using artificially inserted errors (e.g., bad prediction) and delays (e.g., in the prediction algorithm).

Evaluation
Evaluation includes integrating the system in a real OS and measuring the performance benefit and overhead.

Effort sponsored by the Defense Advanced Research Projects Agency (DARPA) and Air Force Research Laboratory, Air Force Materiel Command, USAF, under agreement number FA8750-05-1-0051, order # T981. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Defense Advanced Research Projects Agency (DARPA), the Air Force Research Laboratory, or the U.S. Government.

Why ALN?

ALN applies experience with previous network connections to help tune future network connections. Current Internet transport protocols do not recover context or experience between sessions, and so restart anew for each new connection. ALN applies knowledge about past context to help tune TCP, the dominant transport protocol, to provide enhanced performance.

About TCP

TCP connections maintain state that describes the properties of a connection, in order to provide reasonable throughput for a set of network conditions. During a connection, this state evolves and tunes, and the algorithms involved are well understood.

Unfortunately, TCP does not reuse this acquired knowledge for new connections. New connections, even those started during existing connections to the same site, do not apply this knowledge; instead, all connections are started with the same initial state.

There are some cases where TCP has tried to reuse state. Round trip times and maximum packet size are sometimes cached in a table. RFC2140 proposed explicitly sharing state between concurrent connections and reusing state to recurring sites or networks, and the Congestion Manager tried to integrate the state of multiple connections in a single mechanism.

However, much of this state cannot simply be reused. Networks experience daily and weekly patterns of congestion. Some of these patterns also vary with holidays, weather, and other external context. Unfortunately, this context can be difficult to capture from within a network alone.

ALN tries to capture and discover these relationships, to allow TCP to start new connections with more appropriate initial conditions. This is based on a few observations about TCP connections:

  1. TCP does a good job converging on TCB state over time,
    but a poor job (actually, none!) of guessing initial conditions
  2. TCP experiences a stable network over the connection,
    and a stable offered load

Item #1 means that we’re not using smart predictors to replace TCP’s closed-loop feedback system. Experience and analysis show that it works very well, and the stability of the network requires that connections play the game by the same rules.

Item #2 is a simplifying assumption; we could instead try to find a predictor for the appropriate initial conditions for each connection. This assumption allows us to treat the final conditions of a connection as reasonably appropriate initial conditions.

Networks have limited learning

Networks do learn. The Internet as a whole ‘learns’ reachability: ICMP (Internet signalling) redirects packets, blocks dead-ends, and reports MTU (maximum transmission unit, i.e., packet size) issues. Connection protocols such as TCP, SCTP, and DCCP learn during the lifetime of a connection. Other protocols, such as UDP, do not learn, but they are used very little.

There have been a number of attempts to share the learning of connection protocols. Transaction TCP (T/TCP) [RFC1644], TCP Control Block Sharing [RFC2140], and the Congestion Manager [RFC3124] all explored ways of sharing state across connections. This sharing was limited: it considered only connections to the same address or LAN.

Unfortunately, external context about connections has not been reused, even though it can substantially affect how well previous connections predict future ones. This is the dimension ALN explores, by applying external context and previous state experience to predict future initial state.

The goal of ALN is to help tune these connection protocols, to make them work better. We already know how to tune particular protocols to particular environments, but the challenge is that we don’t always know when to use each optimization; ALN helps tune these protocols so that the right ‘variant’ is effectively used when needed, based on applying past experience.

Importance of a Solution

TCP is the dominant Internet protocol, affecting 90% of current applications, and an even higher percentage of current network traffic. Future network uses will be dominated by connection-oriented protocols like TCP, some variants of which are emerging (SCTP, DCCP), but all of which operate on the same principles.

Most of the Internet traffic is TCP. Below are (in order) the number of packets and bytes (in bandwidth) on a large Internet backbone.

Most applications use TCP as well. The diagrams below show the same traffic where non-TCP is shown in red and dark red; orange through blue are all TCP. Again, the first graph is by packets, and the second is by bytes.

Approach and Goals

ALN uses learning to continuously accumulate and integrate the context and history of past sessions, and applies that learning to the initial conditions of new sessions. It integrates session-oriented learning with established closed-loop feedback mechanisms. The specific goals are to apply learning, show the benefit, and measure the costs.

TCP Background

TCP uses a number of state variables to describe a connection. Some are independent – the endpoints, the date/time, and the ports that describe the service or application. Others are dependent, based on properties of the network at that time, between those points, for that service, and describe the network and endpoint capabilities. These include:

  • CWND: sender-side congestion window size
  • RCV_WND: receiver advertised window size
  • SSTHRESH: threshold for congestion avoidance
  • RTT: round trip time
  • MTU: maximum transmission unit (packet) size
  • Buffer sizes
  • Options for the connection or endpoint, e.g., NODELAY, SACK, etc., that describe TCP variations
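
To make the split between independent and dependent variables concrete, the following sketch groups them into two Python records. The class names, field names, and units (segments, bytes, milliseconds) are assumptions chosen for illustration; they are not the kernel's TCB layout or the project's schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ConnectionKey:
    """Independent variables: who is talking, for which service, and when."""
    local_addr: str
    remote_addr: str
    local_port: int
    remote_port: int
    started: datetime


@dataclass
class TcbState:
    """Dependent variables, shaped by the network and endpoints (hypothetical units)."""
    cwnd: int        # sender-side congestion window, in segments
    rcv_wnd: int     # receiver advertised window, in segments
    ssthresh: int    # threshold for congestion avoidance, in segments
    rtt_ms: float    # round trip time estimate, in milliseconds
    mtu: int         # maximum transmission unit, in bytes
    snd_buf: int     # send buffer size, in bytes
    rcv_buf: int     # receive buffer size, in bytes
    options: frozenset = frozenset()  # e.g. {"NODELAY", "SACK"}

    def effective_window(self) -> int:
        """A sender may have at most min(cwnd, rcv_wnd) segments in flight."""
        return min(self.cwnd, self.rcv_wnd)
```

In this framing, the `ConnectionKey` fields are known when a connection opens, while the `TcbState` fields are what TCP normally has to discover during the connection.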

Consider how some of these variables affect how TCP decides how to use the network, notably the ‘windowing algorithm’. This algorithm determines how many packets are in-transit between the sender and receiver, i.e., pipelined and awaiting acknowledgement:

TCP starts with a very small initial window (a few packets) and learns what the network can tolerate over time, ramping up during each new connection:

For most of the connection, TCP underutilizes the network. Each new connection starts from the same initial conditions. A perfect TCP, however, would already know the appropriate window size and use it from the beginning of the connection; the transfer would finish more quickly as a result (perfect in green, and current TCP in red).

ALN will try to approach the profile of perfect TCP, but we expect some differences. ALN will take some time to determine the appropriate initial conditions, so the connection will start a bit later. Further, the prediction may not be perfect; TCP will continue to converge, perhaps on a better value for the window. Shown below are these differences, with ALN’s TCP in blue:
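
The same comparison can also be approximated numerically. The sketch below counts the round trips needed to send a fixed number of segments when the window doubles each round trip up to a fixed cap. It is a deliberately simplified model under stated assumptions: no loss, no congestion avoidance phase, a constant cap standing in for what the path can tolerate, and the prediction delay expressed in whole round trips. The function and parameter names are illustrative.

```python
def round_trips(total_segments, initial_window, max_window, startup_delay=0):
    """Round trips needed to send total_segments segments.

    The window starts at initial_window and doubles every round trip
    until it reaches max_window (assumed path capacity). startup_delay
    adds round trips spent waiting for a prediction before sending.
    """
    if initial_window < 1 or max_window < initial_window:
        raise ValueError("need 1 <= initial_window <= max_window")
    sent, window, rtts = 0, initial_window, startup_delay
    while sent < total_segments:
        sent += window                         # one window per round trip
        rtts += 1
        window = min(window * 2, max_window)   # ramp up, capped
    return rtts
```

For a 100-segment transfer with a cap of 32 segments, this model gives 8 round trips for a connection that starts with one segment, 4 for a perfect TCP that starts at 32, and 5 for a hypothetical ALN connection that starts at an imperfect prediction of 16 after a one-round-trip prediction delay.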

Resources

The data and information on variables are available for download.

Papers

Bannister, J., Shen, W., Touch, J., Hou, F., and Pingali, V., “Applied Learning Networks,” ISI Technical Report 637, April 2007.

People

Joe Bannister – architecture, integration

Wei-Min Shen – predictors, evaluation

Joe Touch – data collection, software

Graduate students: Feili Hou – predictors; Venkata Pingali – data collection

Data Collection

The ALN data is based on a set of variables, stored in an SQL database.  Below is a description of the ALN data collected per TCP connection, grouped by category and annotated.
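
As an illustration of per-connection storage in an SQL database, the sketch below uses Python's built-in `sqlite3` module. The table name, the columns, and the choice of SQLite are assumptions for the example only; the actual ALN schema contains many more variables and is not reproduced here.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a store with one row per finished TCP connection."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS connections (
               id INTEGER PRIMARY KEY,
               remote_addr TEXT NOT NULL,
               start_hour INTEGER NOT NULL,
               is_holiday INTEGER NOT NULL,
               final_cwnd INTEGER NOT NULL,
               final_rtt_ms REAL NOT NULL
           )"""
    )
    return db


def record_connection(db, remote_addr, start_hour, is_holiday,
                      final_cwnd, final_rtt_ms):
    """Store end-of-connection state alongside its external context."""
    db.execute(
        "INSERT INTO connections (remote_addr, start_hour, is_holiday,"
        " final_cwnd, final_rtt_ms) VALUES (?, ?, ?, ?, ?)",
        (remote_addr, start_hour, int(is_holiday), final_cwnd, final_rtt_ms),
    )
    db.commit()


def mean_final_cwnd_by_hour(db, remote_addr):
    """Average final congestion window per starting hour for one endpoint."""
    rows = db.execute(
        "SELECT start_hour, AVG(final_cwnd) FROM connections"
        " WHERE remote_addr = ? GROUP BY start_hour ORDER BY start_hour",
        (remote_addr,),
    )
    return dict(rows.fetchall())
```

Grouping by starting hour is one simple way to look for the kind of daily pattern the project is interested in; any other context column could be used in the same way.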

Connection Variables

(Includes TCP-KIS variables, geographical location, and timing information; incorporates variable information from Web100.)

Legend:

  • Instantaneous
    • Instantaneous values are appropriate for real-time decisions rather than for off-line processing of the kind ALN is currently expected to do.
  • Cumulative
    • These lines by themselves may not be as interesting as derived values such as average and deviation.
  • RunTimeConfig
    • These are variables that really are interfaces to edit/set other variables. These are by themselves not very interesting.
  • Debug
    • Debug/experimentation-related variables (may be interesting)
  • WAD
    • Variables used by Work Around Daemon of Web100 to tune variables at run time.
  • Sampled
    • These are variables that are sampled at regular intervals. The variables typically contain sums of values of the other sampled variables.
  • Parameter
    • These are one time settings/exchanges/computed values. These variables are typically set at startup time.
  • Location
    • These variables map an IP address to a location (city, country, latitude/longitude, etc.).
  • Time
    • These variables convert the starting time into usable information such as day of the week, hour etc. The learning network may be able to predict at varying time and information granularities.
  • Holidays
    • These variables (actually only one is relevant) identify whether the starting time of the connection falls within a holiday.
  • Weather
    • Weather information collected from wunderground.
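
The Time and Holidays categories can be illustrated with a short sketch that derives context values from a connection's starting time. The function name, the output keys, and the sample holiday set are hypothetical; the project's actual variable names are not shown here.

```python
from datetime import date, datetime

# Hypothetical sample holiday calendar, for illustration only.
EXAMPLE_HOLIDAYS = frozenset({date(2007, 1, 1), date(2007, 7, 4)})


def time_features(start, holidays=EXAMPLE_HOLIDAYS):
    """Convert a connection start time into coarse context values."""
    return {
        "day_of_week": start.weekday(),     # 0 = Monday ... 6 = Sunday
        "hour": start.hour,                 # 0..23
        "is_weekend": start.weekday() >= 5,
        "is_holiday": start.date() in holidays,
    }
```

Coarser or finer values (for example, part of day instead of hour) could be produced the same way, which is one way to let a predictor work at different time granularities.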

Notes:
There is some redundancy in the variables between sheets, especially in the non-connection data (time, location, and holidays). There is also a discrepancy between the documentation (the variables specified here and in tcp-kis.txt) and the implemented variables. The implemented variables are a subset; the remainder will probably be supported at a later date.

ALN-specific information:

  • Priority
    • Hint on how important the prediction of this value is. Low: the utility of this information for tuning purposes is unclear and/or the information is difficult to predict. Med: great if it can be done. High: it is important to predict this value, and to predict it accurately.
  • Condition/Output
    • Condition: Value is used at the beginning of the connection for prediction. Output: This value should be predicted by the Learning Network
  • Min/Max
    • Range of values for the variable
  • Kernel/Application Support
    • What is the status of support? Variables in the kernel: supported, supported but under a different name, or not supported. Variables in the application space: supported, or needs more work.
  • Nature
    • Continuous – integer range; Discrete – from a set of values; Binary – true/false, enabled/disabled
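
The annotations above could be carried alongside each variable as a small record. The sketch below is one possible encoding, assuming hypothetical names (`VariableSpec`, `prediction_targets`) and illustrative example entries; it is not the project's data dictionary.

```python
from dataclasses import dataclass

PRIORITY_ORDER = {"High": 0, "Med": 1, "Low": 2}


@dataclass(frozen=True)
class VariableSpec:
    name: str
    priority: str    # "Low", "Med", or "High"
    role: str        # "Condition" (input) or "Output" (to be predicted)
    minimum: float   # Min of the allowed range
    maximum: float   # Max of the allowed range
    nature: str      # "Continuous", "Discrete", or "Binary"

    def clamp(self, value):
        """Force a predicted value back into the variable's Min/Max range."""
        return max(self.minimum, min(self.maximum, value))


def prediction_targets(specs):
    """Output variables only, most important first."""
    outputs = [s for s in specs if s.role == "Output"]
    return sorted(outputs, key=lambda s: PRIORITY_ORDER[s.priority])
```

Clamping a prediction to the recorded Min/Max range is a cheap guard against a bad prediction producing an out-of-range initial value.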