Chapter 5

m-Net

5.1. Preface to Chapters 5 and 6

The following two chapters comprise a discussion of the application of Mirage to processor-memory interaction as a protocol. This, Chapter 5, is a description of the process of modeling processor-memory interaction as a protocol (Sections 5.2-5.3.), and an elaboration of the various degrees of implementation of the design that the process suggests (Sections 5.4.-5.5).

Chapter 6 compares the feasibility of the degrees of implementation, based on measured opcode statistics (Sections 6.1.-6.3). That chapter also contains the discussion of prior work specific to the processor-memory domain; more general prior work regarding Mirage is contained in Chapter 3, in Section 6.4. Chapter 6 concludes with an evaluation of the utility of this modeling process, and the value of the results it generated (Section 6.5).

A more concise version of the material in these two chapters, describing m-Net and its results, can be found by reading the following sections:

                 
                  5.2. (inclusive)                      Introduction

                  5.4. (inclusive)                      m-Net - Design

                  5.5.6.2.                                    The Total implementation of m-Net

                  5.6-5.7. (inclusive)             Implications

                  6.1.-6.2. (inclusive)           Performance, Feasibility

                  6.3.2.                                         Other Observations

                  6.5. (inclusive)                      Conclusions

As a preview, m‑Net is a processor-memory interface design derived using Mirage as a model of the interaction as a protocol. The processor is augmented with a filtering device, to accept only messages (opcodes) with guards (addresses) that match the processorÕs state (program counter).

The memory is augmented with a mechanism, called a Code Pump, to model the state of the processor (PC) as a set of possible states, using a data structure (we call it the TreeStack). The mechanism sends data messages to the processor ([address, opcode] pairs) that increase the modeled state (i.e., send transformations), using a component called a Diverger. The mechanism receives data from the processor (PC values) that collapse the modeled state (i.e., receive transformation), using a component called a Converger.

This design is very similar to, and a superset of, current research trends in memory anticipation mechanisms, notable that of proactive memory. Performance increases are achieved by using idle bandwidth during round-trip latencies to send anticipatory information. m-Net sender anticipation as suggested by Mirage.

Measurements indicate that a simple, partial implementation would reduce opcode pipeline gaps and increase processor performance by a factor of 10; a complete implementation would increase performance by a factor of 300 to 3,000, when coupled with a cache. Feasible implementations require a simple stack mechanism and as little as 400 bytes of storage to achieve the same performance over high latencies as a 50 Kilobyte cache alone.

5.2. Introduction

Mirage provides methods for mitigating the effects of latency on communication, provided that the communication is well described by the protocol. In Mirage, the protocol doesnÕt merely police the information stream, it is the information stream. Evaluating Mirage requires the selection of a protocol in which shared state is the goal of the communication, rather than merely a means for connection management. In Chapter 4, NTP was used to describe the abstract components of the Mirage model in real terms, but many aspects of Mirage did not have analogs in NTP. For further analysis of the model, a new domain was chosen so as to exhibit some features of Mirage that existing protocols do not possess. We call the design that results from this analysis m‑Net (pronounced ÔMicroNetÕ).

There are two common interpretations of a protocol - a mechanism to maintain shared state (i.e., the protocol is the communication), and a mechanism to manage communication (i.e., the protocol provides bit-transmission). Existing protocols manage only connection information as shared state; most of the data communicated is not part of this shared state. For the shared state to comprise the actual communication, a large state is not required; in NTP as few as 2 variables (current time and rate factor) are shared. In TCP a similarly small number of variables are shared[1], but this state information is used to facilitate other data transmission. Most current protocols are not currently used to manage state spaces directly (as Mirage requires in order to show advantage).

A more complete investigation of Mirage requires the selection of a domain in which communication protocols may not be normally applied, yet one in which state sharing is the goal of the communication. We have chosen a domain in which communication protocols are traditionally not used, that of processor-memory communication. Existing processor-memory communication architectures are shown equivalent to existing protocols, and application of the Mirage model yields novel communication architectures that improve performance where transmission latency is high.

As discussed in Chapters 1 and 2, Mirage adds a parameter of transit time to the conventional parameters of a communication channel, those of bandwidth and channel width (parallelization) (Figure 5.1). Consider an analogous structure, in which the communicating nodes are that of processor/RAM and program memory (read-only) (Figure 5.2); this example is justified by its similarity to the client-server model of computation, in which code is stored on shared remote nodes, and utilized locally. This structure is a variant of the Harvard-style architecture, in which the data bus is local to the CPU and the code bus is a latency communication channel. The benefits of this architecture are not discussed here; it was chosen only as a domain in which to demonstrate the benefits of Mirage. The application of the Mirage model to this domain provides methods for reducing the performance effects of latency.

Figure 5.1

Mirage extends the channel model to include latency

 

Figure 5.2

Communication channel analog of processor/memory interaction

In this domain, performance is defined as the time of execution of a fixed code measure. In some cases reducing the execution time is cosmetically desirable, i.e., the computational outcome remains the same, but a speedup is desired; it is in this way that performance is usually considered. Execution time is as valid a measure of code as any other (i.e., correctness, completeness), and performance can be considered a correctness criterion, especially in time critical environments (see Chapter 3, on Real Time systems).

The characteristics of this architecture can be measured as the channel latency increases. The channel is a logical interaction between the processor/RAM and program memory, where the address of the desired code and the code itself are the communication exchanged across the interface.

5.2.1. The Domain - the processor / memory interface

A processor communicates with memory for two uses - retrieving data with which to operate and retrieving code to direct these operations. In the Harvard architecture, these two communication streams are both logically and physically independent[2]. The domain considered here is a version of this architecture, where the code stream communicates across a distance, but the data stream is local. This resembles a client-server model of computation, where the data (and perhaps I/O devices) are located at the CPU, but the code is archived at remote stations.

The advantages of this example are not of prime importance to us here, although they have been discussed elsewhere (as RPCs, or simply shared code on mounted disks in NFS). In this domain the shared state is concise and the temporal and communicative transformation functions on the state space are well understood. The shared state is the address whose code is being retrieved (i.e., the program counter), and the state space transformations are indicated in the opcode (e.g., jump, jump-to-subroutine), in some structure in the data space (return), or implied (the default for most opcodes). The nature of these transformation functions will be elaborated later; initially current architectures are described, together with their behavior as the code storage area is moved away from the CPU/data area.

The following discussion focuses only on communication between the CPU/data storage (noted as CPU) and the code storage area (noted as Memory). The effects on the communication of program code are measured as the program memory is moved away from the site of its utilization.

5.2.2. Description of current architectures

The processor/memory interface can be considered as a communication channel. In conventional computer designs, a processor communicates with program memory by either a request/response or a timed-response protocol, usually across a bus. The processor places the address of the desired code on a bus, and indicates a memory request by a pulse or level on a signal line (e.g., read/write, memory request, etc.). Memory responds to this request by placing the opcode at the desired address on a bus, either setting a reply line (memory ready, etc.) or assuming that the processor will wait a fixed number of clock cycles for the reply. These two protocols are shown below (Figures 5.3, 5.4, 5.5, 5.6), both as conventionally shown as a bus interaction (time/wire label) (Figures 5.3, 5.5), and as a protocol time line (time/space) (Figures 5.4, 5.6). We denote the former protocol as explicit; and the latter as timer- based.

Explicit and timer-based protocols differ from the conventional designations of synchronous and asynchronous. Synchronous protocols ensure (and rely) on signals being aligned with the clock pulses, and asynchronous protocols permit clock signal independence. Timer-based protocols are synchronous, but explicit protocols can be either synchronous or asynchronous, depending on whether the reply signal must be aligned with the clock pulse.

Figure 5.3

Explicit processor-memory protocol (voltage/time diagram)

Figure 5.4

Explicit processor-memory protocol (as a protocol time line)

Figure 5.5

Timer based processor-memory protocol (voltage/time diagram)

Figure 5.6

Timer-based processor-memory protocol (as a protocol time line)

There are usually few or no provisions for a failure of this request-response type of protocol; it is assumed that if this communication fails, little can be accomplished anyway (i.e., if an arbitrary opcode cannot be retrieved, neither can the interrupt handler code)[3]. If an explicit protocol is used, the processor would wait indefinitely; if a timer-based protocol is used, the processor would read invalid data, because it would not otherwise know that the memory had failed to reply correctly.

Consider the characteristics of this communication as the processor and program memory are moved farther apart. In this description, CPU refers to the processor and local RAM memory (read/write data), and Memory refers to program memory (i.e., read only). The two components communicate by a high bandwidth path (Figure 5.7), whose latency increases as a parameter of this investigation.

Figure 5.7

Processor-memory interaction across a distance

This configuration is usually augmented with a program cache (Figure 5.8). The cache is located with the processor and communicates to it via a low latency, high-bandwidth path. The CPU initiates a request for code to the cache, and the cache either replies directly (in the case of a cache hit) or forwards the request to the program memory over the high latency path. The memory reply is forwarded back to the processor and also copied into the cache for later reuse.

Figure 5.8

Processor-memory interaction via a cache

There are extensions to this architecture which support prefetching, where the cache requests program code in anticipation of its use by the processor. The receiver of the data (the cache, ÔspeakingÕ for the processor) asks for the data before it is needed; this is called a lookahead cache.

The Mirage model, when applied to this domain, indicates a new way to design this architecture. In this new design, the sender (memory) anticipates the needs of the receiver (processor). This has been alluded to as proactive memory, although we refer to a specific architecture here (i.e., suitable for implementation), whereas the prior work refers only to a Òpossible future directionÓ [Kr91].

In the new architecture, called m‑Net [To91b], a Code Pump manages the anticipation at the site of the memory, and a Filter Cache emulates the function of a conventional cache, isolating the processor from the details of the mechanism (so that the system appears to the processor to be identical to Figure 5.9 (CPU-memory only), except in performance).

Figure 5.9

m‑Net processor-memory interaction

In terms of the earlier protocol figures (Figures 5.4, 5.6), the protocol timelines of the three architectures can be shown symbolically (Figure 5.10). The conventional and cache architectures exhibit the same protocols, except that the cache can omit communication when a hit occurs. A prefetching cache can retrieve multiple replies with a single request. m‑Net permits memory to send replies before requests are received, i.e., the sender anticipates the receiver requests (the gray line in the m-Net version in Figure 5.10).

 

Conventional/cache

Cache with prefetch

m‑Net

 

Figure 5.10

Protocol timeline comparisons of processor-memory protocols

A conventional architecture with a cache incurs a round trip latency expense (miss penalty) whenever a miss occurs. A prefetching cache periodically requests blocks of code in advance, whereas m‑Net is a self-adapting sender-based version of prefetching where the memory sends code that the processor might need and concurrently receives updates of the processors actual state.

Cacheing is complementary both to prefetching and to anticipatory replies as implemented in m‑Net. Cacheing reduces access bandwidth to memory. Prefetching increases memory use because fetched code may not be utilized; m‑Net similarly increases memory use by sending sets of codes (isopotent sets), but only one member of each set is used. Cacheing is a way of looking into the past (of code use), making assumptions that code is reused in the future, to reduce memory bandwidth, whereas both prefetching and m‑Net are ways of anticipating future requests (beyond just code reuse) which subsequently increase memory bandwidth.

5.2.3. Effect of latency on existing architectures

The effect of latency on these architectures can be considered by describing the time it takes to execute some number of instructions (N). Assume that the processor executes 1 instruction per time unit (t) (thus defined). A conventional architecture requires Nt time units to execute N instructions. This is the optimal case, with respect to the communication latency (i.e., latency is not considered, or zero) (Equation 5.1).

Equation 5.1:   

This formula is augmented to include latency (i.e., as latency increases beyond a negligible amount), in Equation 5.2, where r denotes the round trip latency. Negligible latency is defined as any time at least one order of magnitude smaller than the execution time of a single instruction, in which case Equation 5.2 reduces to Equation 5.1 (i.e., latency contributions are negligible in comparison to instruction execution).

Equation 5.2:   

Performance is described as the time required to execute N instructions vs. the time required in the optimal case. The ratio of particular execution time to optimal time is defines the slowdown (Equation 5.3). This ration in conventional architecture analysis is usually called the speedup, but we consider cases where the instances examined are slower than optimal. In the case of a conventional architecture, this is Equation 5.4.

Equation 5.3:   

Equation 5.4:   

In the case of the conventional architecture augmented by a cache, the effects of the latency are reduced proportional to the effectiveness of the cache utilization (Equation 5.5). M denotes the probability of a miss in the cache. On a cache miss, both the round trip and the execution time are incurred. A cache hit costs only one instruction execution time. Equation 5.6 reduces to Equation 5.4 where the cache is 100% effective. In this case, no communication costs are incurred, because all code accesses are intercepted by the (local) cache.

Equation 5.5:   

Equation 5.6:   

The time to execute N instructions in an architecture where the cache prefetches can also be described. Let k denote the prefetch length, in instructions. Further characterization of the communication stream will be required to estimate the probability of a prefetch occurring (described later), as used in the equation.

Prefetching is usually implemented in a linear fashion only. When an opcode at address i is accessed, address i+1 though i+k are prefetched, where k is the linear lookahead, and usually k is the number of opcode words in a cache line[4]. A miss occurs when the prefetch fails to anticipate the next opcode used, i.e., when the linear anticipation assumed by most implementations is breached. Misses also occur when the destination of a control transfer instruction (e.g., BRANCH, JUMP, CALL, RETURN) is not in the cache.

At most, prefetches occur at the miss rate of the cache alone (i.e., the same as without prefetching). The actual prefetch miss rate is also at most the rate of occurrence of control instructions, because the opcode after a control transfer may not be in the cache (Equation 5.8).

Control instructions are assumed always to prevent the prefetch of the next opcode. This assumption is valid for JUMP, CALL, and RETURN opcodes, but it is not always valid for BRANCHES. In the case where a BRANCH is not taken, the next opcode is successfully anticipated. Because branches are taken about 50% of the time [He90], the formula can be further refined (Equation 5.9).

For the moment, the probability of a fetch occurring is denoted as F (Equations 5.7, 5.10). F is always less than M, because control instructions are only part of the cause of misses in a conventional cache, so Equation 5.7 is always at least as good as Equation 5.5. Empirical values of F and M are discussed in Chapter 6.

Equation 5.7:   

Equation 5.8:   

               where opcode occurrences are denoted by percentages:

                           J = % JUMPS (direct only)

                           B = % BRANCHES (direct only)

                           C = % CALLS (direct only)

                           R = % RETURNS[5]

                           I = % other indirect opcodes (JUMPS, CALLS, BRANCHES)

Equation 5.9:   

Equation 5.10: 

In m‑Net, the equations are a little more complicated. The time to execute the N instructions depends on the ability of the Code Pump to predict correctly the opcodes desired by the processor. The probability of an incorrect prediction is denoted as P, and one round trip time will be required for each misprediction, to allow the processor to fetch the desired data which is not already available (Equations 5.11, 5.12). This case reduces to the optimal (Equation 5.1) where P = 0, i.e., when the prediction is perfect. An estimate of P will be discussed later, with the conditions under which it is smaller than F and M.

Equation 5.11: 

Equation 5.12: 

5.3. m‑Net

m‑Net [To91b] refers to our choice of the processor-memory interaction as being analyzed by a communication model, Mirage. This is in keeping with the tradition that all protocol research, at one time, must coin a network name suffixed in ÔNetÕ. This was not done with the model name (MirageNet?), so the tradition is upheld in the name of the example.

The m‑Net domain is easily modeled in Mirage, by suitably defining the shared state space and by defining the state transformations of Mirage in terms of this space. In this example, communication is affected by classes of opcodes in the instruction stream. The usual techniques of cache prefetching, widening cache line sizes, and branch prediction are confirmed from the Mirage model of m‑Net. Further, m‑Net exhibits new techniques for active opcode anticipation, which are only recently being proposed as new solutions to latency issues in architectural design. Also, some simple data structures in the memory interface can reduce the latency for some kinds of memory access operations; these structures implement isopotent anticipation as described in Mirage.

In order to consider the Mirage model of m‑Net, the components of m‑Net need to be defined in terms of the model components in Mirage. The first and most fundamental of these is the shared state between the sender and receiver.

In the processor/memory interface, what state is shared? Usually, this is minimally the program counter (PC) value. The program counter denotes the address of the next instruction desired, to be placed on the address lines when a memory request is made. During the memory request, that address is received by memory, after which the desired opcode is replied. The state space is thus the space of all PC values. Conventional protocols communicate by understanding the state of the remote node as a point in this space, and moving this point explicitly. Therefore, until the memory receives the new value of the program counter, the previous value remains in effect, as far as the memory is concerned (Figure 5.11).

Figure 5.11

CPU state and Memory image

In terms of the Mirage model, the state is a PC value, and an image is a set of PC values, i.e., a subset of the entire PC space. Transformations map PC values onto new PC values, or sets onto sets.

The value of the state space changes with the progression of time, the sending of a message, and the receipt of a message. The state space, in this case, is the program counter, because the memory models the processorÕs current PC value.

This analysis considers the memoryÕs model of the processor, not the processorÕs model of the memory. This is appropriate because the communication is essentially unidirectional, although there are components in both directions (control in one direction vs. data in the other).

5.3.1. Time transformations

As time progresses, two things happen at the processor. The processor is executing the current instruction, then it must wait for opcodes to be sent from memory, sitting idle until its request for the subsequent opcode is serviced. As a result, the state of the processor (i.e., the PC, the part of the state that the memory is concerned with) is stable during the execution of the instruction, and changes only afterwards.

This assumes, however, that there is no cache at the processor. When there is, the processor may use opcodes already in the cache during the time lapse. The opcodes in the cache have already been sent from memory, so they are already accounted for in the memoryÕs image of the processorÕs PC.

The passage of time then, in this domain, denotes the scheduling of the need for opcodes by the processor. During the time in which the current opcode is executing, the PC image does not change. At the time when the execution changes, the PC image changes to reflect the need for the next opcode.

The simplified model assumes a RISC architecture with a single opcode execution time. Variability in execution times (e.g., in CISC architectures) can be accommodated by tables in the sender when opcode execution times are static, or by a dynamic structure, if execution times vary, as in pipelining. This latter dynamic structure emulates the pipeline activity, in order to predict the need for new opcodes at the pipeline input.


5.3.2. Receive transformations

When messages are received, the state space collapses accordingly. There are cases (in Table 5.1) in which the state space of the program memory can grow, by modeling both arms of a conditional, or by expanding to the limit of the state space (in the case of indirect opcodes). The messages received by the program memory are processor PC values, which collapse the PC value set in the program memory to a single value. In fact, in this case, the PC set is collapsed to the value received, and then expanded to account for the transit time of the message. In this way, the PC set in the program memory always models the PC of the processor, out of sync by the transit time of a message.

5.3.3. Send transformations

When information is sent, the state space expands through the unioning of the sets of state space before the message and that state space as affected by the message. The state space consists only of PC values (and possible PC values), so the actual opcodes sent are ignored, except in the way in which they affect the current PC. The memory has some set of possible PC values; it sends the opcodes as these PCs indicate (i.e., it sends the opcodes at those addresses), and transforms each PC by a function indicated by the opcode. After transmission, the new PC set becomes the union of the prior set and the transformed set. For example, if the opcode is ordinary (OTHER, i.e., not otherwise distinguished hereafter), the PC would increment when the opcode is sent. The resulting PC set is the union of the PC and PC+1. The transformation depends on the opcode sent; some opcodes increment the current PC, some transform it, some expand it to a set of two PCs (i.e., two possible PCs), and some expand the PC to the set of all PCs (Table 5.1).

In the most general case, the PC would be arbitrarily transformed with each opcode, but this is not effective in modeling the evolution of the imprecision of the knowledge in the PC as known by the program memory. Because the program memory knows the contents of the message (i.e., the opcode), it can determine how the message will potentially affect the PC at the processor (i.e., the transformation).

The image of the PC in the program memory is transformed differently for various classes of opcodes sent. The opcodes are distinguished only by the way in which they transform the current PC over time (Table 5.1).

 

OPCODE

transform: PC Ý

new PC values are computed based on...

other[6]

PC+1

PC

direct jump

PCÕ

PC, opcode

indirect jump

{all PCs}

PC, opcode, any CPU register, program or data memory value

direct call

PCÕ

PC, opcode (prior PC is stored in a known structure in RAM )

indirect call

{all PCs}

PC, opcode, any CPU register, program or data memory value (prior PC is stored in a known structure in RAM )

direct branch

{PC+1,PCÕ}

PC, opcode

indirect branch

{PC+1,PCÕ}

PC, opcode, any CPU register, program or data memory value

return

PCÕ

stored in a known structure in RAM (previously saved)

Table 5.1

Opcode time transformations

For the majority of the opcodes (denoted as OTHER), the PC is incremented. In the case of a JUMP, the program counter becomes a new value based on the destination. In the case of a BRANCH, the program counter becomes one of two new possible program counter values, one being the incremented PC, the other being the newly indicated branch destination, thus the model of the PC becomes a set of PCs. This is the expansion of the state space referred to in the Mirage model.

At some later time, at the receipt of a message from the processor, this set is collapsed down to a single member. CALLs behave similarly to JUMPs, except that in addition to performing the transformation indicated, some local state is maintained (in a stack). A RETURN utilizes this local state to perform a JUMP back to the opcode following the origin of the corresponding CALL.

The table distinguishes between direct and indirect types of jumps, branches, and calls, because in the former case the resultant PC can be computed, whereas in the latter the state space becomes completely unpredictable[7] (i.e., it expands completely to encompass the entire state space). Opcodes differ in the way in which they affect the expansion of the state space over time; this is the criterion for partitioning them as described.

In the case where the PC set expands to the limits of the state space (for indirect opcodes), further temporal transformations become impossible to compute. The memory must wait for a message from the processor of the actual PC value chosen, before it can proceed further. Indirect opcodes necessarily cause Ôbubbles in the communication loopÕ, where sender anticipation cannot be accommodated. Indirect opcodes, are, in effect, too unpredictable to model.

5.3.4. Guarded messages

In the conventional processor/memory architecture, there is no need to label the opcodes which are sent from memory, because only one memory request is outstanding per unit time. Even in the cache prefetch case, the cache either requests each memory value independently, or an initial request is made and the resulting values are assumed to arrive sequentially.

The opcodes being sent from memory must be labelled, in order to permit the processor to receive them conditionally. Consider the case where the PC modeled in memory contains a BRANCH instruction. Memory sends the next instructions, but more than one instruction is sent (i.e., more than one possible next request, as discussed before). Replies to these requests must be differentiated, so that the processor (which knows its own state) can select the appropriate one. Although there are more efficient labellings, the opcode can be labelled with the part of the state space to which it applies (i.e., with the PC it is located at). A single bit could be used to indicate the two arms of a branch,, but any label thus chosen is limited to some finite partitioning of the state space (i.e., only 1 branch lookahead per bit).

5.3.5. Partitioning the state space (stability)

The state space is not partitioned into sets of PC values, which would be computationally prohibitive. The PC image may indicate two PC values which request the same opcode, which could be sent only once with a guard indicating both PCs. Such an implementation would be excessive, because the resulting reduction in bandwidth would be at the expense of excessive encoding.

Instead, the state space is partitioned into its individual points (single PC values), so each message (opcode) requires a single PC value as a guard. The state space is not separated into Òlast timestep / this timestepÓ, as in the union of the two spaces in the Mirage send transformations. m‑Net assumes that messages are not lost, so the prior state (before the sent message) need not be retained. This prevents having to resend opcodes, under the assumption that the communication channel is lossless. If messages always arrive, in some fixed time delay, the part of the state space which corresponds to the prior state can be ignored because the probability density highly favors the send-transformed state.

5.3.6. Isopotent sets

There are two analogs of isopotent sets in this domain. In the first, isopotency occurs when two PC image values recombine, i.e., when instruction streams reconverge. In the second, isopotency indicates the satisfaction of both paths of a conditional by the set of both destination opcodes. This latter isopotency is accommodated in m‑Net by the expansion of the state space when a BRANCH occurs, because subsequent message sets furnish both arms of the branch.

The reconvergence case occurs either during a single timestep (simultaneously), or at differing timesteps. For example, when two current PC image values JUMP to the same location, the streams converge simultaneously. This convergence is managed as a side-effect of our implementation; if two PCs in the image at time t become the same at time t+1, only one remains.

When a forward branch occurs, the streams converge at different time values, because the PC of the branch-taken stream is the same as the PC of the branch-not-taken stream at some earlier time. This is not handled in m‑Net. In this latter case, isopotency also exists where two PCs in the image become one, but only under message sequences of different length. If information were maintained on the past state in the model (it isnÕt - see the discussion on Partitioning, above), this recombination would be recognizable. m‑Net cannot take advantage of the case where one stream becomes a time-shifted image of another stream. This inability is not corrected because the complexity in modeling past time images would be prohibitive, and a corresponding benefit may not exist.

As a result, the way in which a set is isopotent depends on the ways in which the members of the transformed set are considered equivalent, i.e., the way in which the messages are equivalent.

5.4. m‑Net - Design

Although Mirage is an abstract model whose direct implementation was argued against, m‑Net can be implemented directly. Models are not usually intended to be implemented, but the domain of this problem has been sufficiently constrained to permit such an implementation to be reasonable, both to give a real impetus to the abstract model and to permit a better understanding of the modelÕs advantages.

Communication in m‑Net is unidirectional - although the processor needs to communicate its state to the memory to specify the desired opcode location, the processor is modeled in the memory, and not the converse. Only the memory is concerned with the state of the remote entity (processor) here. The model of the processor can be implemented by putting a PC in the memory. In fact, a structure is placed near memory which will permit the modeling of more than one possible PC.

5.4.1. The Code Pump

The Code Pump manages the image of the processor for the memory. It also contains the send and receive mechanisms, insofar as they affect the state space modeled within the Code Pump. The Code Pump implements the time, receive, and send state space transformations, and the state space image (albeit limited) (see detail, Figure 5.12).

Figure 5.12

Detail of Filter Cache and Code Pump of m‑Net

The Code Pump contains three main components: a TreeStack data structure, a Converger, and a DMA/Diverger. The TreeStack structure maintains the image of the processorÕs PC; its tree-like attribute manages bifurcations in the PC image caused by sending BRANCH opcodes, and its stack-like attribute manages the information of saved and restored PC images which are caused by CALLs and RETURNs, respectively.

The Converger manages the collapse of the PC image in the TreeStack structure, as indicated by the receipt of messages from the processor (i.e., addresses). The DMA/Diverger manages the preparation and sending of messages anticipating the processorÕs request (DMA), and the expansion of the PC image in the TreeStack, as indicated by the send transformations.

5.4.2. The Filter Cache

The Filter Cache serves only to make the Code Pump mechanism appear invisible. It buffers data coming in from memory, and passes through only opcodes whose label (PC value, i.e., address) match the requested address whose data the processor is waiting for. All other data, for addresses not requested, is thrown away. It thus implements the receiver requirements for guarded message use.

The Filter Cache needs only a small amount of space to operate. If the rate of the processor is known(i.e., the rate at which opcode requests are issued, which is nearly trivial in a RISC processor), the Filter Cache needs only one opcode/address unit of space. The opcode arrives with the address as a conditional label (guard), so that the opcode is used only if the address is pending a request, which is the purpose of the comparator in the filter. The Code Pump knows this rate, so it sends the messages only when needed (see the time transformation).

The Filter Cache is also completely compatible with a conventional cache in parallel at the same location. The Filter CacheÕs purpose is to manage the incoming stream of new information, and pass the relevant parts on to the processor, whereas the conventional cache passes old information back to the processor. The two caches are complementary.

5.4.3. Degrees of design

The model is implemented directly in m‑Net, so there are some variations to the level of implementation which can be performed. For example, a TreeStack structure cannot be effectively implemented which models the PC after an indirect opcode,  because the state space grows to its limits, and further partitioning of the state space to predict the next desired opcode is impractical. The design can be further simplified by not implementing any branch transformations, or not implementing RETURNs,. The following is an enumeration of the various levels of implementation, in increasing order of complexity (Table 5.2).

In the Code Pump implementation, the effects of the sent message on the PC image must be modeled, as well as modeling the old image (i.e., unioning of the unaffected and affected PCs). One simplifying implementation decision is to ignore the possibility of message loss, and omit the old PC image, replacing it with the new one as soon as the message (i.e., opcode) is sent. This works under the assumption that messages are not lost, simply delayed by a fixed time, in a way that simplifies the design to account for a reduction in the probability of states in the old image remaining valid.

The Code Pump can be implemented as a null device, which reduces to a conventional or conventional plus cache architecture. Prediction can be limited to only OTHER opcodes, where the Code Pump requires only a single PC image and an incrementer. This is analogous to an opcode prefetch in current CPUs (680x0, 80x86, MIPS, SPARC, etc.), but differs in that the PC and incrementer are placed in the memory, rather than in part of the processor (i.e., the PC is on both sides of the interface), thus reducing the comparable latency by half. At this point, the Filter Cache needs only 1 element of space to operate - to hold the opcode/address pair as it is received, and to compare it to the current desired address.

 

Level of opcodes done

Filter size

Code Pump Components

Storage required (in Code Pump, in PCs)[8]

(none)

(null)