USC / Information Sciences Institute
touch@isi.edu
ABSTRACT: PC-ATOMIC is a PC interface for the ATOMIC LAN. PC-ATOMIC is implemented as a VL-Bus (VESA) short-form card for Intel i486 PCs, providing an interface for low-cost workstations to a 640 Mbps LAN. This document describes the PC-ATOMIC interface, its design, capabilities, and performance. The board design is public, and a small number of boards are available as government-furnished equipment for research projects.
This document describes the design and use of the PC-ATOMIC board. It also reviews its capabilities, and reports some preliminary performance measurements. The board design is public, and small numbers of the board are available as government-furnished equipment for research purposes.

ATOMIC is a source-routed, cut-through packet-switched LAN based on the components of the CalTech Mosaic supercomputer [4]. The Mosaic supercomputer uses custom VLSI 16-bit processors with integrated 2-dimensional communication channels, arranged in a mesh. The Mosaic channels are 8 bits wide and clocked at 100 MHz, for a channel rate of 800 Mbps. Packet routing is performed in simple on-chip hardware. ISI developed ATOMIC to use Mosaic chips to implement an inexpensive high-speed LAN. The preliminary ATOMIC LAN components were designed by CalTech, and programmed by ISI.
In 1994, members of ISI's ATOMIC group and CalTech's Mosaic group left their respective organizations to form Myricom to develop the ATOMIC LAN as a commercial product called Myrinet [9]. Myrinet components are not compatible with the prototype ATOMIC LAN, but are based on the same design principles. Myrinet is 8 bits wide at 80 Mbps per line, for a channel rate of 640 Mbps. Myricom currently produces host interfaces for Sun SPARC workstations, cables, and switches. PC-ATOMIC is compatible with Myrinet hardware, and can be programmed for use in a production Myricom LAN. Myrinet is based on the LANai chip, a descendant of the Mosaic chip.
The PC-ATOMIC network interfaces are designed to be compatible with the emerging Myricom hardware. In addition, the PC-ATOMIC board design examines some general host-interface design issues, such as zero-cost IP checksum hardware, DMA and board control by both on- and off-board processors, and interrupt signalling. The result is a host interface whose programmed I/O rate exceeds that of comparable prototypes from Myricom.
A general, fast host interface requires Direct Memory Access (DMA) transfer capability. The PC-ATOMIC card is DMA-capable, although time constraints did not permit complete testing of the current PLD programming; DMA can be enabled via firmware upgrades. In addition, the board design allows the LANai to access the DMA and board-level registers, which can be enabled in a future firmware release. All board-level interrupts are maskable, and some can be triggered under register control.
The board includes a zero-overhead IP checksum capable of 1.2 Gbps. The LANai 1.2 supports clock speeds up to 20 MHz, but in the PC-ATOMIC card it is clocked at 1/2 the VL-Bus frequency, i.e., at 17.5 MHz. Using a half-rate VL-Bus clock simplifies the design significantly.

The board has two sets of configuration jumpers: one for setting the board's hardware base address, and the other for setting the Myricom link interface [7]. The link interface jumpers are specified in the Myricom literature, and come pre-configured by ISI. The address jumpers specify a block of 2^24 bytes (16 Mbytes), in the 00xx xxxx - 0Fxx xxxx range (i.e., they set the lower nibble of the high byte of the 32-bit address). The board is preconfigured to a base address of 0A00 0000.
The board's address space consists of 128 K bytes of dual-access RAM, configured as 32K of 32-bit words, and a set of eight 32-bit board-level registers that repeats in the next 128 K byte range. The overall space of 256 K bytes repeats throughout the 16 M byte block. The host and on-board interface processor share access to this RAM on alternate on-board clock cycles, emulating dual-ported RAM. This RAM is accessed through the on-board processor, which maps the top 64 bytes of the space as a shadow of its internal processor registers; the bottom 8 K bytes are write-accessible only from off-board. The base of RAM is the start location for the on-board processor following a reset.
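The address layout above can be illustrated with a small software model. This is only a sketch of one reading of the description: it assumes the RAM occupies the lower 128 K bytes of each 256 K byte window and the registers the upper 128 K bytes, and the function name decode and its return format are hypothetical, not part of the PC-ATOMIC software release.

```python
RAM_BYTES = 128 * 1024       # dual-access RAM (32K words of 32 bits)
WINDOW_BYTES = 256 * 1024    # RAM range plus register range; repeats in the block
BLOCK_BYTES = 1 << 24        # 16 Mbyte block selected by the address jumpers
NUM_REGISTERS = 8            # eight 32-bit board-level registers


def decode(host_address, jumper_nibble=0xA):
    """Classify a host address as RAM or register (assumed layout, see lead-in).

    jumper_nibble models address bits 27 through 24; 0xA matches the
    preconfigured base address of 0A00 0000.
    """
    base = jumper_nibble << 24
    if not base <= host_address < base + BLOCK_BYTES:
        return None                           # not decoded by this board
    offset = (host_address - base) % WINDOW_BYTES
    if offset < RAM_BYTES:
        return ("ram", offset)                # byte offset into the RAM
    # The register set repeats every 8 words within the upper 128 K bytes.
    return ("register", (offset // 4) % NUM_REGISTERS)


if __name__ == "__main__":
    for address in (0x0A000000, 0x0A020004, 0x0A040000, 0x0B000000):
        print(hex(address), decode(address))
```

In this model, any address inside the 16 M byte block aliases onto the same 256 K byte window, which is what "repeats throughout" is taken to mean here.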

2.1.1 Board-level configuration
The board contains a LANai 1.2 subsystem, nearly identical to that on the prototype Myricom 1.2 Sun SPARC SBus host interfaces [7]. The LANai uses a separate on-board crystal to control the link clocking, which is set to 40 MHz for compatibility with the current Myrinet switches.
There are 4 programmable LEDs, connected directly to the LANai processor, as documented in the Myricom literature. The board also contains 4 AMD MACH 435 PLDs. These are socketed for firmware upgrades.
The board also has two sets of jumpers. One set of three, located near the link crystal, is used to specify the link clock offset, as specified in the Myricom literature. The other set of jumpers is used to specify the base address of the board. They indicate the values of bits 27 through 24, i.e., the low nibble of the high byte of the board base address.
This board also has loopback capability. The loopback can be performed after the cable drivers, or before. When used after the cable drivers, a standard Myricom DB-37 D-connector loopback is used on the Myricom cable interface. Use of pre-driver loopback requires removal of the AT&T 41MM cable driver chips, and installation of a half-twisted ribbon connecting two 26-pin headers in the pre-driver header sockets.
2.1.2 Board registers
There are five main board registers which provide for Internet checksum, board-level reset, interface-processor reset-and-hold/release, maskable interrupts, and DMA control (DMA can be supported in future firmware releases) [8].
The control register provides general board-level management. Bits in the register can be used to:
The IP checksum register maintains a partial Internet checksum [1]. All data accesses are incrementally summed into this register when checksumming is enabled (as per a bit in the control register). This includes both data reads and writes, over both programmed I/O and DMA. The checksum value is maintained as a pair of 16-bit ones-complement sums, one each for high and low half-words. The sum can be read at any time, and its halves folded to yield the actual IP checksum. The register is cleared by writing any value to it.
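The register behavior described above can be sketched as a toy software model. The class and function names are hypothetical, and the packing of the two partial sums into one 32-bit value (high sum in the upper half) is an assumption; the actual register layout is documented in the register information [8], not here.

```python
class ChecksumRegister:
    """Toy model: a pair of 16-bit ones-complement partial sums."""

    def __init__(self):
        self.high = 0          # sum of the high half-words
        self.low = 0           # sum of the low half-words
        self.enabled = False   # models the enable bit in the control register

    @staticmethod
    def _add16(a, b):
        # 16-bit ones-complement add: the carry out is added back in.
        total = a + b
        return (total & 0xFFFF) + (total >> 16)

    def observe(self, word32):
        """Called for every 32-bit data access (read or write)."""
        if self.enabled:
            self.high = self._add16(self.high, (word32 >> 16) & 0xFFFF)
            self.low = self._add16(self.low, word32 & 0xFFFF)

    def read(self):
        return (self.high << 16) | self.low   # assumed packing

    def write(self, _value):
        self.high = self.low = 0              # any write clears the register


def fold_to_ip_checksum(register_value):
    """Fold the two partial sums together and invert the result."""
    folded = (register_value >> 16) + (register_value & 0xFFFF)
    folded = (folded & 0xFFFF) + (folded >> 16)
    return ~folded & 0xFFFF


if __name__ == "__main__":
    reg = ChecksumRegister()
    reg.enabled = True
    # A 20-byte IP header with its checksum field zeroed, as 32-bit words.
    for word in (0x45000073, 0x00004000, 0x40110000, 0xC0A80001, 0xC0A800C7):
        reg.observe(word)
    print(hex(fold_to_ip_checksum(reg.read())))
```

Because the partial sums are kept per half-word, the host only performs the final fold and inversion; the per-word additions are what the hardware takes off the data path.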
The two DMA registers are used to initiate DMA transfers to and from the host. One register contains a bit indicating the direction of the data transfer, the LANai-side transfer base address, and the number of 32-bit words to be transferred. The other register indicates the host-side base address. DMA operation has not been fully tested at this time. DMA capability has been disabled in the PLDs on the board, as distributed.
The entire board can be reset by writing to a phantom "reset" register. A reset re-initializes the board-level registers, and places the LANai processor in stasis to allow for the host to load the LANai control program into the board RAM.

All graphs have been computed with 90% confidence intervals which are near +/- 2 Mbps for each data point. The i486 PC-ATOMIC system performance degrades at 8K byte packets, due to page-boundary crossing.
In comparing the PC-ATOMIC and Myricom interfaces, it is useful to keep the following information in mind:

Table 1 indicates that the PC-ATOMIC tests use a faster backplane than the Sun SBus tests, but use a slower CPU (see Note) and slower LANai processor. Even so, Figure 4 indicates that the local RAM access performance is nearly identical for the bottleneck read rate. This may be an indication that the backplane speed plays a critical role in the performance of the host interface.
The application-application bandwidth was also measured, both between PC-ATOMIC interfaces, and to Myricom interfaces. Figure 5 indicates the memory-memory bandwidth for native ATOMIC packets sent directly from a user process. For reference, the Myricom SBus PI/O bcopy performance and kernel-based TCP measurements are also included. These performance graphs indicate that the PC-ATOMIC interface performs as well as the Myricom interface for programmed I/O.

The PC is a little-endian host, but network-standard byte-order is big-endian. This problem is compounded by the use of shared memory for communication, and a big-endian 16-bit on-board processor (the LANai). We considered the use of a dual address space, where byte-order conversion was performed in the data path via wire routing from two sets of data buffers. This was not implemented in the PC-ATOMIC interface due to space and complexity limitations. We are not sure of the utility of such a mechanism.
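The byte-order mismatch can be illustrated in software. This is a generic sketch, not code from the PC-ATOMIC release: it shows how a 32-bit word stored by a little-endian host appears to a big-endian reader of the same shared memory, and the swap the host would apply in the absence of a hardware conversion path. The function name swap32 is hypothetical.

```python
import struct


def swap32(word):
    """Reverse the four bytes of a 32-bit word."""
    return int.from_bytes((word & 0xFFFFFFFF).to_bytes(4, "little"), "big")


if __name__ == "__main__":
    value = 0xC0A80001                      # e.g. an IP address in host form

    # Bytes as a little-endian host lays them out in shared memory.
    shared = struct.pack("<I", value)

    # The same bytes interpreted by a big-endian reader.
    seen_by_big_endian = struct.unpack(">I", shared)[0]
    print(hex(seen_by_big_endian))

    # Swapping before the store makes the memory image big-endian.
    corrected = struct.pack("<I", swap32(value))
    print(corrected.hex())
```

A wire-routed dual address space, as considered above, would perform the equivalent of swap32 in the data path at no software cost.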
The Internet checksum was implemented in a very inexpensive part at gigabit rates (Appendix A). This provided a zero-overhead checksum during any DMA and programmed I/O operations. Subsequent research considered the replacement of this checksum with an IPv6 header authentication algorithm [10].
DMA capability is part of the PC-ATOMIC design. It has not been fully tested, due to OS limitations. The PLDs can be reprogrammed to enable DMA, and other data paths that are part of the board design (Appendix B). This includes LANai control of the board-level registers.
The choice of interface bus has proven the major limitation to the PC-ATOMIC interface. At the time the project was initiated, the VL-Bus and PCI bus were in development. The VL-Bus hosts were available at the time the project was underway, and the bus specification was stable enough to implement an interface. There were no standard interface chips for the VL-Bus at the time. The PCI bus has subsequently become the de facto standard for PC host platforms. PCI interface chips are now available, making host interface development much simpler.
[2] Boden, N., et al., "Myrinet - A Gigabit-per-Second Local-Area Network," IEEE Micro, Vol. 15, No. 1, Feb. 1995, pp. 29-36.
[3] Felderman, R., DeSchon, A., Cohen, D., and Finn, G., "ATOMIC: A High-Speed Local Communication Architecture," Journal of High Speed Networks, Vol. 3, No. 1, 1994, pp. 1-29.
[4] Information Sciences Institute, ATOMIC Web site, http://www.isi.edu/div7/atomic.
[5] Information Sciences Institute, PC-ATOMIC Web site, http://www.isi.edu/div7/pcatomic
[6] Information Sciences Institute, "PC-ATOMIC (overview)," part of the PC-ATOMIC Software Release, Nov. 1994, available separately via ftp://ftp.isi.edu/pub/hpcc-papers/touch/pca_overview.txt.
[7] Information Sciences Institute, "PC-ATOMIC Board Information," part of the PC-ATOMIC Software Release, in docs/board.info, Nov. 1994.
[8] Information Sciences Institute, "PC-ATOMIC Register Information," part of the PC-ATOMIC Software Release, in docs/register.info, Nov. 1994.
[9] Myricom, Inc., Myricom Web site, http://www.myri.com
[10] Touch, J., "Performance Analysis of MD5," to appear in SIGCOMM '95.
[11] Touch, J., and Parham, B., "Computing the Internet Checksum in Hardware," (paper in progress).
The PC-ATOMIC interface required an inexpensive and fast implementation of the Internet Checksum in hardware [1]. Various designs were considered, including MSI 16-bit fast-adders. The final solution used a $40 AMD MACH 435 PLD to implement the checksum at 1.23 Gbps, with one 32-bit word accumulated every 26 ns [11].
The Internet checksum is computed as a 16-bit ones-complement sum. A ones-complement sum is equivalent to the twos-complement sum, where carries are summed back into the accumulation. It can also be designed `natively' as a twos-complement adder where every bit includes the carry-in of the ring of bits to its right (wrapped around to the left, stopping at the bit to its left). It is this "toroidal" native property that we exploit.
The PC-ATOMIC Internet Checksum is computed as a pair of 16-bit ones-complement sums, over the high and low half-words of the data. The pair of partial sums is folded together in a single ones-complement sum, which is then inverted to yield the Internet Checksum.
The implementation of this checksum in the AMD MACH 435 PLD uses input latching of 32-bit words, one per clock. The data is then summed into the accumulator on the next clock, such that the latch is pipelined. The summation is composed of groups of 2- and 3-bit fast carry-lookahead adders with pipelined carries between the adder stages in a ring (Figure 6). The carries are propagated during all clocks, and when data is not present on the latch, a zero is added in (i.e., as a null operation). The resulting pipeline settles in 6 clock cycles.
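The ring of small adders with pipelined carries can be approximated by a toy software model. This is a sketch under stated assumptions, not the PLD design: the split of 16 bits into groups (GROUP_WIDTHS) is hypothetical, only one 16-bit ring is modeled, and all function names are invented for illustration. Each clock, every group adds its slice of the input plus the carry its right-hand neighbor latched on the previous clock; the top group's carry wraps around to the bottom group. When no data is present, zeros are clocked in until no carries remain.

```python
GROUP_WIDTHS = [3, 3, 3, 3, 2, 2]   # hypothetical split of 16 bits, LSB group first


def split(word16):
    """Cut a 16-bit word into per-group slices, least significant first."""
    parts, shift = [], 0
    for width in GROUP_WIDTHS:
        parts.append((word16 >> shift) & ((1 << width) - 1))
        shift += width
    return parts


def join(parts):
    """Reassemble per-group slices into a 16-bit word."""
    value, shift = 0, 0
    for part, width in zip(parts, GROUP_WIDTHS):
        value |= part << shift
        shift += width
    return value


def clock(sums, carries, word16):
    """Advance one clock: add input slices and last clock's carries."""
    n = len(GROUP_WIDTHS)
    new_sums, new_carries = [], []
    for g, (width, piece) in enumerate(zip(GROUP_WIDTHS, split(word16))):
        total = sums[g] + piece + carries[(g - 1) % n]   # ring: group 0 takes top carry
        new_sums.append(total & ((1 << width) - 1))
        new_carries.append(total >> width)               # latched for the next clock
    return new_sums, new_carries


def ring_sum(words):
    """Accumulate words, then clock in zeros until the carries settle."""
    n = len(GROUP_WIDTHS)
    sums, carries = [0] * n, [0] * n
    for word in words:
        sums, carries = clock(sums, carries, word)
    flush_clocks = 0
    while any(carries):
        sums, carries = clock(sums, carries, 0)
        flush_clocks += 1
    return join(sums), flush_clocks


def reference_sum(words):
    """Conventional ones-complement sum with end-around carry."""
    total = 0
    for word in words:
        total += word
        total = (total & 0xFFFF) + (total >> 16)
    return total


if __name__ == "__main__":
    data = [0x4500, 0x0073, 0x4000, 0x4011, 0xC0A8, 0x0001, 0xC0A8, 0x00C7]
    print(hex(ring_sum(data)[0]), hex(reference_sum(data)))
```

The point of the model is that no carry ever has to ripple across more than one small group within a clock, which is what allows the adder to keep up with one word per clock; the cost is a short settling period after the last word.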

The PC-ATOMIC host interface incorporates a Myricom LANai 1.2 processor and communication subassembly, and an interface with registers and control (Figure 7). The LANai interface is very similar to that on the Myricom Sun SBus interface [9]. Logically, the entire LANai processor and communication subassembly appears as RAM to the interface, and thus to the host. The LANai uses dual-access RAM for communication between the network and host.

The remainder of the interface converts between VL-Bus and LANai interface signals, and provides board-level registers (Figure 8).

The host accesses the board-level registers through the interface, which recognizes a VL-Bus access, decodes the board-level register address, and enables the appropriate register and data path and direction (Figure 10).
The host can access the LANai dual-access RAM via dual-ported data registers on the interface assembly (Figure 11). These data registers hold data to adjust the clocking between the board (at 33 MHz) and LANai subsystem (17.5 MHz). This also permits the RAM accesses to occur during the appropriate phase of the LANai subsystem clock. For host access of LANai RAM, the address is passed on through the interface. Data is asynchronously propagated through the registers when received from the LANai, because the LANai is clocked more slowly. Data sent to the LANai must be clocked through the data port registers to latch it long enough for the LANai.
DMA operation is permitted by using the board-level interface registers to drive the VL-Bus and LANai addresses independently as the data is clocked through the data port register (Figure 12). Data written to the LANai is clocked through the data port registers, as in host access of LANai RAM. Data read from the LANai is similarly unclocked (as in host access), because data is held stable longer over the slower LANai clock.
The LANai is permitted to access the board-level registers by passing LANai addressing through, and using the data port registers as a data bus (Figure 13). The data port registers use asynchronous signal propagation in both directions, because the LANai read and write cycles are synchronous, where the data is stable until acknowledged.



