Chapter 6

m-Net under a m-Scope

Evaluation of m‑Net requires measurement of cache miss (M), prefetch cache miss (F), and m‑Net miss (P) probabilities. Cache performance parameters are available in the literature [Sm82]. Some measurements of opcode statistics have also been published [He90], but some of the statistics required to evaluate m‑Net are not available, and were made by direct measurement of opcode execution. The measured design is called m‑Net (MicroNet), so this measurement method is called m‑Scope (MicroScope). A description of m‑Scope appears in Appendix H.

Required measurements which were not available in the literature include statistics of opcode distributions according to the partitioning indicated by m‑Net. m‑Net also requires measurements of the average number of non-control transfer opcodes between control transfer opcodes; the definition of control transfer differs for the various levels of design discussed (Chapter 5). m‑Scope was developed to perform these measurements. Other tools for opcode distribution measurement were considered (as discussed in Appendix H), but were either insufficient for the measurements desired (Pixie) or proprietary and not available to us.

These measurements are made on the dynamic trace of opcode executions of various benchmarking programs. The applicability of benchmarks for general analysis is not advocated here. This selection represents widely available benchmarks which compiled and ran without explicit error in m‑Scope. Other benchmarks were considered, but omitted because they were not available in either SPARC Assembler or C language source code (which m‑Scope is limited to), or because of system limitations during the preparation of this dissertation (i.e., the benchmark was of limited interest due to its specificity, and there was insufficient disk space and insufficient processing capability to examine all available benchmarks).

 The GNU C, TeX, Linpack, and Dhrystone benchmarks were chosen for these measurements. GNU C is a freely distributed C language compiler; our version (1.35) and the test set weightings used were extracted from the SPEC Benchmark Release 1.0 [Wa90]. TeX is an embedded text typesetting program; our version was contained in the benchmark distribution of [He90][1]. The C language versions of both Dhrystone and Linpack were obtained from standard Internet libraries[2]. Each benchmark has particular characteristics, listed below:

¥ GNU C - a large program, complex, recursive; uses many language features

¥ TeX - a typical large frequently-used program

¥ Linpack - a set of routines collected from a mathematical software library

¥ Dhrystone - a contrived program claimed to represent intense integer computing

Both Linpack and Dhrystone represent intense mathematical computing, but lack sophisticated use of compiler and language constructs. They are often used for estimating measurements of theoretical CPU performance, such as FLOPS or MIPS.

TeX represents a common but complex application program, exhibiting significant levels of recursion and opcode proportions in general purpose systems. The GNU C compiler represents a typical upper-bound (or at least a reasonable bound) of complexity, both in its static structure and dynamic execution.

Some of the measurements performed here have also been presented elsewhere. They are repeated here because it is important to compare the measurements of various characteristics on a common set of benchmarks, and because some necessary measurements were not found in the literature, notably the variability in limb length with various treatments of jump, call, and return instructions, as discussed in detail below.

6.1. Performance gains

The performance gain of m‑Net is measured by the extent to which P is less than either M or F. We need to compare actual measurements in order to make these measurements. As noted before, there is no incompatibility between conventional caches and m‑Net, so the comparison is simplified by the assumption that m‑Net also has a conventional cache inside the Filter Cache. The Filter Cache then behaves like the conventional cache, with the exception of the one entry which represents the latest opcode/address pair as received from the incoming communication stream. Further, if this ÔstreamingÕ entry is hit, it is copied into a conventional entry, just as when a memory reply occurs after a conventional cache miss.

In prefetching caches, a miss of a particular datum causes a sequence of data to be fetched. This is equivalent to the assumption that all opcodes are of type OTHER (as described in Chapter 5) (i.e., simply increment the PC), and that the memory will send a number of opcodes equivalent to the line size or lookahead of the cache upon a miss.

Communication can be modeled as a branching stream (as described in Chapter 2). The branching of this stream is 2 because BRANCH opcodes accommodate 2 alternates. The limb length of the stream describes the extent to which a particular image value (PC) remains predictable by transforms. In m‑Net, limb length is the average number of non-control flow opcodes between control flow opcodes. Opcodes which alter the flow of control are those that cause branching in the stream (BRANCHes), or that cause a round trip latency penalty because they are not anticipated (INDIRECT, and opcodes not anticipated in a given implementation).

Dynamic opcode traces (i.e., execution traces) indicate the following distributions of opcode classes. I-CallÕ and ÔI-JumpÕ denote indirect CALL and JUMP opcodes, respectively; SPARC CPUs have no indirect branches (Figure 6.1). Other published measurements [Ka91] of troff (GK troff) and cc (GK cc) are included for comparison.

Figure 6.1

Dynamic control opcode distributions

In the following discussion, INDIRECT refers to the aggregate of all indirect opcode types. These measurements were made on the actual execution of the benchmarks on a SPARC (SUN-4), which has no indirect branch opcode. The vast majority of the indirect opcodes were indirect JUMPs, due to a preference in the C compiler used[3].

The code was measured by compiling the benchmarks (using optimization), and generating SPARC assembler code. The assembler was modified by an AWK script, which inserted additional instructions to count the various classes of opcodes during execution, as well as measuring the number of OTHER opcodes between control of flow instructions, such a JUMP, BRANCH, CALL, and RETURN. Existing tools, such as PIXIE and PIXIESTATS for the MIPS processor, were not used because the method for determining basic blocks causes errors in the measurement of opcodes due to the effects of indirect JUMPs and CALLs. The SUN equivalents, SPIX and SPIXTOOLS, as well as the SUN SPARC emulator SHADOW, were requested but not available for public use. See Appendix H for a further discussion on the measurement techniques used here.

The variance in these measurements is large, due to the arbitrary choice of canonical code examples, and the variability between benchmarks. Estimates of the measurements are summarized in Table 6.1. These are verified by similar, albeit more limited published measurements [Ka91][4].

 

OPCODE

%

Other

84

JUMP

2

CALL

2

RETURN

2

Indirect

0.3

Table 6.1

Approximate dynamic opcode distribution

Comparing m‑Net to an architecture without a cache, m‑Net reduces the effect of latency by the percent of instructions which can be accommodated by the Code Pump (Figure 6.1, Table 6.2).

These comparisons assume that the majority of time spent in execution is due to round trip latency, i.e., that  by at least one order of magnitude, preferably at least two. This necessitates a latency of at least 100 instruction execution times, which in a 1 gigabit/sec pipe, assuming 32-bit instructions, corresponds to a latency of 3,200 bits, or 3.2 msec, or 960 meters, which is about 6 city blocks. For example, these results apply where the processor and memory are separated by at least 3 city blocks, which is completely reasonable for a client/server system located on a small college campus.


 

Opcode groups anticipated

% not predicted

Speedup

Caveats

none (conventional)

100

1

 

Other

16

6.3 x

 

Other, JUMP

14

7.1 x

 

Other, JUMP, CALL

12

8.3 x

 

Other, JUMP, CALL, RETURN

10

10 x

unlimited stack

all but Indirect

0.3

333 x

unlimited TreeStack

Table 6.2

Approximate speedup in various degrees of implementation

Measurements of cache utilization have appeared often in the literature [He90] [Sm82]. The miss rate M is approximately 3%, which with prefetching (F) drops to 1.5%. These assume extremely large (or infinite) caches. A 1K cache has a miss rate of about 20%, a 4K about 12%, a 128K about 5%, and the miss rate approaches 3% only for caches larger than 256K.

Large (infinite) caches miss mostly due to first time code use, whereas smaller caches miss both from first time code use and from most control opcodes, because the cache is so small that changes in control flow are likely to require opcodes not in the cache. A prefetching cache misses only from control flow opcodes because first time code use is predicted by the prefetch.

Changes in control flow occur from JUMPs, CALLs, RETURNs, and INDIRECT opcodes, as well as from half of the BRANCHes, the latter assuming that BRANCHes are taken about 50% of the time [He90]. Such control opcodes comprise 11.3% of the instruction stream (JUMP, CALL, RETURN, INDIRECT, and half of the BRANCHes), so control opcode misses comprise 11.3% of the cache misses, i.e., prefetching caches are estimated to miss 11.3% of the time.

This estimate can be further refined because branches can be separated into backward and forward branches, where forward branches are more likely to cause misses. About 25% of branches are backward (empirically [He90]), because backward branches are generated for loops, whereas forward branches are generated for IF and CASE statements, which are more common in source code. Backward branches are more likely to hit existing cache entries, so a revised formula for misses in small prefetching caches is 10% (JUMP, CALL, RETURN, INDIRECT, and half of 3/4 of the BRANCHes).

Prefetching reduces the miss rate up to 50% for large caches, compared with non-prefetching caches [Sm82]. Misses occur due to CALLS and RETURNS more than BRANCHES, because most local code remains in the cache. A small cache has a 20% miss rate and 50% of these misses are predictable (due to the opcode distribution difference), so the miss rate of a small prefetching cache is expected to be 10%, equivalent to our approximation based on opcode distributions.

Cacheing and cache anticipation are compared to m‑Net anticipation, in Table 6.3, assuming nothing about branch outcome. Opcodes are called JUMP, CALL, RETURN, BRANCH, INDIRECT, and OTHER for all other types, and labelled respectively J, C, R, B, I, & O [He90]. The table notes the cache size required to approximate various versions of m‑Net, based on which sets of opcodes are anticipated. In the Recursion case, storage is required in the Code Pump of 100 addresses; justification of this figure is discussed later.

 

m‑Net Version

Opcodes anticipated

miss rate

equivalent cache

m‑Net storage

Unit Linear

O

16%

2K bytes

none

Linear

O+J+C

12%

4K bytes

none

Recursion

O+J+C+R

10%

8K bytes

100 addresses

Table 6.3

m‑Net implementations and cache equivalents (no branch assumption)

Assuming branches are taken 50% of the time, i.e., that half the branches are predicted, more dramatic comparisons result (Table 6.4).

A naive comparison of these numbers results in the conclusion that most degrees of implementation of m‑Net are far worse than either caches or prefetching caches. Although cache miss rates are often quoted as 3% (1.5% for prefetching caches), these rates apply only to very large caches. Current microprocessors have small caches (8K for an Intel 80486, 256 bytes for a Motorola 68030, 4K for a Motorola 68040 and Intel 80860 [Mo89]). In comparison, m‑Net beats most of these on-chip caches, if transmission latency is large.

 

m‑Net Version

Opcodes anticipated

miss rate

equivalent cache

m‑Net storage

Unit Linear

O

11%

8K bytes

none

Linear

O+J+C

7%

16K bytes

none

Recursion

O+J+C+R

5%

50K bytes

100 addresses

Table 6.4

m‑Net implementations and cache equivalents (equiprobable branching)

 

Opcode

% of control

estimated cache miss distribution

OTHER

-

50 %

JUMP

12 %

6 %

CALL

12 %

6 %

RETURN

12 %

6 %

BRANCH

62 %

31 %

INDIRECT

2 %

1 %

Table 6.5

Adjusted dynamic opcode distributions (approx. a cache miss stream)

These results can be extrapolated to a system where m‑Net complements cacheing, rather than replacing it. Opcode distributions change with the inclusion of a cache, and measurements including a cache implementation were beyond the scope of this dissertation. One estimate assumes that the proportions of control opcodes (relative to each other) remains the same, between a conventional instruction stream and the stream of cache misses. Assuming that control opcodes increase to a total of 50% of the cache miss stream [He90], keeping all other relative proportions the same, this results in an adjusted opcode distribution (Table 6.5), and expected speedup (Table 6.6). The combined entries (cache with m‑Net) indicate that a resulting miss is the result of the combination of a cache miss and a failure of m‑Net to have anticipated the request for that opcode (i.e., the miss and failure rates are multiplied).

 

Implementation

Failure rate

Speedup (ratio to null)

cache alone (large- 256K)

3%

33 x

cache alone (small- 4K)

10%

10 x

prefetch cache (large- 256K)

1.5%

67 x

small cache + m‑Net(O)

10% * 50% = 5%

20 x

small cache + m‑Net(O,J)

10% * 44% = 4.4%

22 x

small cache + m‑Net(O,J,C)

10% * 38% = 3.8%

26 x

small cache + m‑Net(O,J,C,R)

10% * 32% = 3.2%

32 x

small cache + m‑Net(O,J,C,R,B)

10% * 1% = 0.1%

1000 x

large cache + m‑Net(O,J,C,R,B) [BEST]

3% * 1% = 0.03%

3,333 x

Table 6.6

Performance increases where latency dominates design parameters

According to this table, a small cache with a Recursion m‑Net (using 400 bytes of internal stack space) uses a total of 4.4K bytes of space, but achieves the miss rate equivalent to a 256K byte cache. Note that the Recursion m‑Net uses the same memory bandwidth as a system without a cache; only the Branching and Total m‑Nets require bandwidth higher than the Null design.

These are overestimates of the expected speedup, because the distributions of cache miss code are different from distributions of overall code use. Misses will be generated disproportionately by JUMPs, CALLs, RETURNs, and INDIRECTs, because these kind of control flow changes are not likely to be contained in code already accessed, due to the principle of locality. Measurements of code distributions in cache misses were unfortunately beyond the scope of this dissertation, but we do not expect they would significantly change the conclusion that m‑Net/cache exceeds the speedup capability of cache/prefetch or cacheing alone.

More sophisticated experiments are required to extend this analysis to the Branching and Total m‑Net cases, as described later.

6.2. On the feasibility of implementations as architectures

The feasibility of these implementations as realizable architectures depends on the real size of the stack or TreeStack data structures in the Code Pump.

For example, the size of the stack structure in the Recursion m‑Net is equal to the largest number of pending RETURN opcodes during the entire code execution (i.e., the depth of recursion). Cumulative percentages of the CALL opcodes at each level of recursion are plotted in Figure 6.2. For a particular stack size, the plot indicates the probability of not being able to store a CALL address, for use by its corresponding RETURN. For example, 85% of CALL/RETURN pairs can be accommodated in the TeX benchmark by a stack of size 17, whereas the stack would have to be increased to a size of 50 to handle the same likelihood of overflow in the GCC benchmark.

Figure 6.2

Percent of CALLs occurring at or above a given depth of recursion

Some of these depth measurements are not very accurate because the depth of pending calls could not be traced within the operating system with m‑Scope (Figure 6.3). This analysis is similar to that presented in [He90], where their results indicate that C compiled code has 5% overflow at a depth of 6, LISP at a depth of 7, and SmallTalk at a depth of 9, so the results depend on the program type as well as language and compiler characteristics.

Figure 6.3

Percent of CALLS not depth-traced (system calls)

Because only RETURN addresses need to be stored, a stack as small as 10 items will handle all of the Dhrystone and Linpack RETURNs, and miss only 10% of the TeX RETURNS, although it would fail in 85% of RETURNS in a highly recursive program, such as the GCC compiler. Increasing the stack to as little as 100 values allows the Recursion m‑Net to anticipate 100% of the user-code RETURN opcodes. Thus storage for a stack structure in the Code Pump which is large enough to accommodate all pending levels of recursion should be trivial to implement.

The size of the TreeStack structure also limits the feasibility of implementations. The TreeStack represents not only the information in the stack of pending recursions, but also the information on the expansion of the state space image. In order to estimate the size of this structure, there are two options. The first involves emulating the Code Pump, and determining the maximum extent of the structure as the benchmarks were executed. This level of experimentation was beyond the scope of this dissertation, but may be considered as a future research area.

The second involves some assumptions. Rather than directly measuring the size of the TreeStack, an estimate of its size can be made using the same simplifying assumptions as the branching stream analysis (Chapter 2). Assuming that the state begins in a finite set of values (i.e., a single PC, in this case), and that the sequence of PCs can be expressed as a tree, a snapshot of the image of the remote PC at any given time is expressed by the set of PCs in a level of the tree thus formed.

The branch degree of the tree is 2, modeling only BRANCH opcodes, because INDIRECT opcodes expand the space infinitely (actually, they expand the space by a degree of the size of the PC, at which point that level in the tree covers all possible PCs, as described before). The other relevant tree characteristic is the average limb length, defined as the number of opcodes between those which cause the tree to either branch or terminate (due to a terminated modeling).

For example, in the Total m‑Net, limb length is the number of opcodes between BRANCH or INDIRECT opcodes. In the Branching m‑Net, limb length is measured between BRANCH, INDIRECT, and RETURN opcodes. Figure 6.4 shows the average limb length in each benchmark, as measured for different degrees of implementation. In these graphs, linearity refers to limb length, i.e., linearities in the flow of control in the opcode stream.

Figure 6.4

Average limb length (linearity)

For this graph (Figure 6.4), ALL refers to the Unit Linear m‑Net (i.e., all control opcodes delimit limb lengths). BCRi refers to the Linear m‑Net (BRANCH, CALL, RETURN, and INDIRECT), and Bi refers to the Total m‑Net. Other combinations were not measured, but would be interpolations of these values (e.g., Branching m‑Net and Recursion m‑Net), because the Total m‑Net exhibits the longest possible limb length, and Unit Linear exhibits the smallest. Modeling JUMP instructions (ALL/Unit Linear vs. BCRi/Linear) increases the average length by 10-20% from around 6 to around 7 opcodes, but that modeling CALLs and RETURNs (Recursion) increases the averages by about 40% (or as much as 100%), to around 9 (Figure 6.5).

Figure 6.5

Percent increase in limb length (linearity), adding calls and returns

The distributions of limb lengths was also measured because the variance in these averages was very large (greater than 100%). These figures plot cumulative probabilities, i.e., for a given length, the ratio of the number of limbs at that length or less to the total is plotted, so that the point indicates Òthe probability that a random limb is shorter or equal to the limb length indicatedÓ (Figures 6.6, 6.7, 6.8, 6.9). The mode of these distributions can be read directly from these plots, as the 50% value (i.e., 50% of the arm lengths are less than or equal to the plotted value).

In this set of plots, ALL indicates the limb length distribution of the Unit Linear m‑Net, BCRi indicates the limb length distribution of the Linear m‑Net, and Bi indicates the limb length distribution of the Total m‑Net. Linpack exhibits a jump in limb length statistics, presumably because it consists primarily of a set of nested FOR loops, and data from the inner portion of these loops overwhelms the statistics. The scientific programs (Linpack and Dhrystone) exhibit large limb length increases if CALLs and RETURNs are accommodated within the limbs, because the repetitious structure of the code ensures interleaving of CALL/RETURN opcodes with BRANCHes; if CALLs and RETURNs are then included within the limb, limb length increases.

Figure 6.6

Dhrystone limb length distribution

Figure 6.7

GCC (weighted) limb length distribution

Figure 6.8

Linpack limb length distribution

Figure 6.9

TEX limb length distribution

Cumulative branch arm length distributions of various benchmark programs

Rather than using the complete distributions shown above to evaluate the effects of branching on channel utilization, the mean and median of these distributions are used (Table 6.7). These measurements can be used to compare the effects of limb length increases on the Total m‑Net implementation. In effect, the performance of the Total m‑Net is compared to a Total m‑Net without recursive accommodation, called Total-Recursive, and to a Total m‑Net without either recursive or linear accommodation (i.e., Branching m‑Net without linear accommodation), called Branching-Linear.

Now that reasonable values for limb length (Table 6.7), and branch degree (2)have been measured, m‑NetÕs channel utilization can be evaluated. m‑Net utilization is determined by the miss rate of the m‑Net anticipation mechanism; Equation 6.1 describes the performance measure, and Equation 6.2 is the miss probability component which has been simplified for a branch degree of 2.

 

m‑Net implementation

Limbs delimited by

Mean L

Median L

Branching-Linear

BRANCH, CALL, RETURN, JUMP, INDIRECT

7

4.3

Total-Recursive

BRANCH, CALL, RETURN, INDIRECT