AVID: A Complexity Effective Front-End Architecture for Fast
Memory Disambiguation and Early Data Prefetching
Report Submitted To: Professor Todd Austin
EECS 573 Project
Submitted By: Seokwoo Lee, Allen Cheng
Department of Electrical Engineering and Computer Science
University of Michigan
Ann Arbor, MI
{seokwoo,accheng}@eecs.umich.edu
(734) 214-2988
April 22, 2002
AVID: A Complexity Effective Front-End Architecture for Fast Memory Disambiguation and Early Data Prefetching
Allen Cheng and Seokwoo Lee
Department of Electrical Engineering and Computer Science
University of Michigan
Ann Arbor, Michigan 48109-2122
{accheng, seokwoo}@eecs.umich.edu
Abstract
A false dependency between a load and a store prevents an independent load from executing. Without speculation, an ambiguous load must wait until no preceding store instruction can conflict with its address, which causes unnecessary latency. Alternatively, the load can be executed speculatively, which incurs a penalty whenever the back-end later detects a misprediction. We propose AVID, a complexity-effective runahead front end that clears false memory dependencies and hoists load instructions in a safe and fast manner. Early load address resolution, which lets independent loads run ahead of stores, relies on register tracking: it safely computes the effective addresses of memory references in the front end, right after the decode stage. It is also complexity effective because its storage cost is one entry of register state per architectural register, represented by a base register, an offset, and a bit indicating whether the entry is trackable. In addition to disambiguation, AVID uses a small adder to compute the effective address of a load from the value of a 'known'-state register. After safely computing the effective address, AVID prefetches the data to hide the memory latency caused by cache misses. We simulated AVID with SimpleScalar and analyze its performance results as well as its potential in the front end.
1. Introduction
Problem:
- A false dependency between a load and a store prevents independent loads from executing.
- The long miss latency of loads results in less effective utilization of execution resources.

Solution:
- A complexity-effective runahead processor that clears false memory dependencies and hoists load execution in a fast and safe fashion.
One of the key factors influencing microprocessor performance is preventing the unnecessary latency of independent loads. With higher microprocessor frequencies, the relative performance cost of memory access increases.
This paper introduces a new technique to decrease unnecessary load latency by eliminating false load dependencies. The technique is similar to earlier early-address-resolution schemes in that it determines the address of a load instruction during an early pipeline stage in order to initiate the memory access sooner. However, the proposed scheme is safe and does not rely on the history of previous accesses. The proposed decoupled AVID safely computes the memory address at decode time using a register tracking algorithm: AVID tracks the state of each register, resolves ambiguous memory addresses in a safe manner, and pre-computes the actual effective address whenever possible.
The remainder of this paper is organized as follows. Section 2 presents the motivation for register tracking. Section 3 describes the design of the complexity-effective AVID front end. Section 4 provides the detailed implementation of AVID and the register tracking algorithms for memory disambiguation and early data prefetching. Section 5 describes the simulation methodology, briefly describes the benchmarks, and presents the experimental results with AVID. Discussion and analysis are given in Section 6. Related work and concluding remarks are provided in Sections 7 and 8. Finally, we propose future directions for AVID.
2. Motivation
In any RISC instruction set architecture, for any integer general-purpose register that is modified by an integer ALU immediate operation, its dependency can be represented by a base register and an offset value, i.e., Reg = (Rbase, offset). For example, if R2 ← R1 + 100 and R3 ← R2 + 100, then R3 can be rewritten as R3 ← (R1 + 100) + 100, or R3 ← R1 + 200, where R1 is the base register of R3 and 200 is the offset.
Thus, we could express R3 as R3=(R1, 200). This type of representation can be used to facilitate the execution of ALU instructions with an immediate operand or memory operations with displaced addressing mode.
In a conventional superscalar architecture, the value of the immediate operand, if any, is explicitly provided by the instruction's immediate field after the instruction decode stage. This gives us the inspiration that we could resolve the effective address of loads and stores with displaced addressing mode earlier than in the current architectural scheme. That is, as long as the value of the base register is known, we can start calculating the effective address of a load or a store as soon as its immediate operand is available. This means that we no longer have to wait for a load or a store instruction to enter the execution stage before accessing memory, only to pay the miss penalty if the data cache does not contain the requested data. We can start data prefetching from the data memory as soon as the effective address of a load or a store becomes available. Using the early-resolved effective addresses of loads and stores, we can also perform memory disambiguation on the fly and remove false memory dependencies between loads and stores, so that independent loads can bypass preceding stores.
We define the 'known' state of a register to mean that the value held by the tracker for that register is consistent with the register file. When a register receives its value, it enters the known state. The state of a register, represented by a base register and an offset, can be changed by subsequent instructions in the stream. AVID dynamically analyzes the subsequent instruction stream and validates or invalidates the state of each register. Simultaneously, AVID can compute the effective addresses of memory references with a small adder and prefetch the referenced data from the lower levels of the memory hierarchy in order to hide memory latency.
Figure 1. State transition of a register between the known and unknown states (arc 2: invalidate; arc 3: validate)
Figure 1 shows the address tracking state diagram with the register-known and register-unknown states; the arc numbers correspond to the operations that trigger each state transition. Whenever the current register value is known and a subsequent operation uses only sources whose base registers are known, AVID validates the destination register's state, immediately marks the instruction as tracked, computes the effective address of the load, and prefetches the data from memory. When an untrackable modification of a register is recognized, AVID switches that register's state from known to unknown, and no early address computation or prefetching is performed in this state. When a register is invalidated, the other registers that use it as a base must also be invalidated.
To maintain consistency between AVID's tracker and the register file, AVID needs a communication port to the execution core in the back-end. Whenever the back-end commits a new value to a register, the corresponding tracker entry should be updated from the register file and marked as known.
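To make these two transitions concrete, the following minimal C sketch (our own illustration, not the report's code; names such as avid_invalidate and avid_commit_update are hypothetical) shows a tracker invalidating a register together with its dependents, and re-validating an entry when the back-end commits a value:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_REGS 32               /* integer general-purpose registers $r0-$r31 */

static bool    known[NUM_REGS];   /* trackability bit per register              */
static uint8_t base[NUM_REGS];    /* base-register label (5 bits suffice)       */
static int32_t offset[NUM_REGS];  /* accumulated immediate offset               */
static int32_t value[NUM_REGS];   /* tracker's copy of the register value       */

/* Known -> unknown: an untrackable write to 'r' invalidates it, and every
 * register currently using 'r' as its base must be invalidated as well.    */
static void avid_invalidate(int r)
{
    known[r] = false;
    for (int i = 0; i < NUM_REGS; i++)
        if (known[i] && base[i] == r)
            known[i] = false;
}

/* Unknown -> known: when the back-end commits a value for 'r', the tracker
 * copies it from the register file; the entry becomes its own base with a
 * zero offset.                                                              */
static void avid_commit_update(int r, int32_t committed_value)
{
    value[r]  = committed_value;
    known[r]  = true;
    base[r]   = (uint8_t)r;
    offset[r] = 0;
}
```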
3. Design
This section explains the design of non-speculative register tracking. Section 3.1 presents the tracking algorithm and its advantages. Sections 3.2 and 3.3 describe two immediate applications of register tracking, namely dynamic memory disambiguation and early data prefetching. Finally, Section 3.4 illustrates the AVID architecture.
3.1 Non-speculative register tracking
3.1.1 Algorithm
Since the type of the instruction and the value of the immediate operand are explicitly provided once the instruction is decoded, we place the register tracking hardware right after the instruction decode stage. The tracker inspects each decoded instruction, in program order, to detect instructions that modify the values of architectural registers as well as memory operations that use these architectural registers for address calculation. Note that the tracker currently tracks only integer instructions and the integer general-purpose registers ($r0-$r31). Furthermore, the trackable integer instructions are single-cycle integer ALU operations with one immediate source operand, and the trackable memory operations are only loads and stores with displaced addressing mode. An instruction is trackable if its source register is trackable, i.e., the value of the source register's base register is known. If a trackable load or store is detected, the tracker computes its effective address, dynamically disambiguates the memory dependency, and initiates a data prefetch. If a trackable ALU operation is detected,
the tracker validates and updates the entry indexed by the trackable operation's destination register. If an untrackable integer operation with a destination register is detected, the tracker invalidates the entry indexed by that destination register. The tracker will update and re-validate this entry from the core's register file when the untrackable operation commits.
3.1.2 Advantages

The advantages of this register tracking scheme are that it is fast, simple, and safe. It is simple because no complex ALU is required: all the ALU needs to do is add an immediate value, signed or unsigned, to a register. To track all the architectural registers, the tracker requires only a copy of the register file with one extra status bit to indicate a register's trackability and one extra 5-bit field to label an entry's base register. It is fast because the tracker only tracks single-cycle integer ALU operations, so all of the tracker's entries can be validated and updated within one cycle. The scheme is completely safe because it does not require any speculation. A tracker entry is validated and updated only if the tracker knows the value of the operation's source register. If an operation is detected to be untrackable, the tracker invalidates the corresponding entry first and waits until the operation commits to re-validate and update the entry. Since register tracking is non-speculative, no verification or recovery mechanism is required. Because the tracking mechanism is safe, the resolved effective address is correct; therefore, the tracker-initiated early data prefetch and dynamic memory disambiguation are safe and valid, and they do not waste additional scheduling or memory bandwidth.
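To make the storage cost concrete, the per-register state just described can be laid out as in the short C sketch below (field names are ours; the report does not give a concrete layout), recasting the arrays of the earlier sketch as a single entry type:

```c
#include <stdint.h>

/* One tracker entry per architectural integer register ($r0-$r31): the copied
 * register value plus one trackability bit, a 5-bit base-register label, and
 * the accumulated offset, as described above. */
struct avid_entry {
    int32_t  value;           /* copy of the architected register value  */
    int32_t  offset;          /* immediate offset relative to the base   */
    unsigned base      : 5;   /* index of the base register (0-31)       */
    unsigned trackable : 1;   /* 1 = known state, 0 = unknown state      */
};

static struct avid_entry tracking_table[32];
```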
3.2 Dynamic memory disambiguation
As mentioned in the previous section, a register can be propagated to other registers as a base register, and a base register transitions between the known and unknown states as the instruction stream proceeds. AVID keeps track of the state of each register by inspecting the instructions in program order. AVID can tell which address a memory operation references by looking at the state of the register, without actually computing the effective address itself. This can be referred to as symbolic disambiguation: two memory references are guaranteed to be different when they share the same base register in the known state but have different offsets. AVID validates or invalidates the state of each register based on subsequent instructions and on communication with the execution core in the back-end. The propagation of base registers is illustrated in Figure 2.
$R6 ← 0x1000 (known state)
$R7 ← $R6 + 0x10BA
$R8 ← $R7 + 0x4434 = {$R6 + 0x10BA} + 0x4434
$R9 ← $R8 + 0x1123 = {{$R6 + 0x10BA} + 0x4434} + 0x1123

$R6 is the base register and is propagated to the other registers.

Figure 2. Propagation of base registers
AVID looks up the tracker entries, marks the address of each load/store instruction as tracked, disambiguates the addresses, and sends signals to the decode logic so that independent loads can be issued to run ahead of stores.
3.3 Early data prefetching
After the tracker has computed the effective address of a load or a store, it initiates a data prefetch: it accesses the data cache and data TLB to prefetch the data needed by subsequent dependent instructions. The type of access, either read or write, initiated by the tracker depends on whether the instruction is a load or a store. For a store, we do not need the actual store data in order to initiate a write access to the data memory; all we care about is pre-allocating an entry for the store in the D-cache so that there will be no write miss when the store is actually executed by the core. For a load, if it does not collide with any previous store due to a true data dependency, this technique can effectively reduce the L1 D-cache miss rate and hide load miss latency by pre-executing the load as early as the end of the tracking stage.
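A minimal sketch of the tracker-initiated access is shown below (our own illustration; dcache_touch is a hypothetical front-end port into the D-TLB and D-cache, not the SimpleScalar API):

```c
#include <stdbool.h>
#include <stdint.h>

enum touch_kind { TOUCH_READ, TOUCH_WRITE };

/* Hypothetical front-end probe of the D-TLB and D-cache: it only brings the
 * translation and the line in (starting a fill on a miss); no architectural
 * state is modified.  Stubbed out here. */
static void dcache_touch(uint32_t addr, enum touch_kind kind)
{
    (void)addr;
    (void)kind;
}

/* Once the tracker has resolved an effective address, it issues a read touch
 * for a load and a write touch (pre-allocating the line) for a store. */
static void avid_prefetch(uint32_t effective_addr, bool is_store)
{
    dcache_touch(effective_addr, is_store ? TOUCH_WRITE : TOUCH_READ);
}
```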
3.4 AVID architecture

Figure 3 illustrates the architectural scheme of AVID. AVID partitions the conventional multi-issue superscalar processor into two parts: the front-end and the back-end. The front-end consists of instruction fetch, instruction decode, and the register tracking hardware, or simply the tracker. The back-end consists of the remaining superscalar scheduling, issuing, executing, and committing logic. After instruction decode, the decoded operations are sent to both the renaming logic and the tracker, and the instructions in both streams are processed at the same time. After processing an instruction, if the instruction is a trackable memory operation, the tracker sends disambiguation information to the scheduler so that any independent
load can bypass earlier stores. Meanwhile, the tracker also initiates an early data prefetch to perform a read or write access to the data memory, depending on whether the memory operation is a load or a store.
Figure 3: The Block Diagram for AVID Architecture
4. Implementation
To implement our design, we chose PISA as our target ISA because it is uncomplicated and clean, and it is the native ISA of SimpleScalar. The registers under track are all of the integer general-purpose registers, $r0 to $r31. The trackable instructions are integer instructions only. For load and store memory operations, only those with displaced addressing mode are tracked. For ALU operations, only those with one immediate operand are tracked. To implement the tracker, we use a small tracking table plus a simple ALU. For PISA, the ALU only performs add immediate (addi) and add unsigned immediate (addiu). The tracking table stores the tracker's state. Figure 4 depicts the organization of the tracking table used by the tracker to keep track of all architectural registers. When a decoded instruction arrives, the tracker inspects whether it is a trackable operation. A trackable operation is of the form Rdes ← Rsrc ± immediate. If the instruction is not a trackable type but has a destination register, the tracker invalidates the corresponding entry. If the instruction is a trackable type, the tracker first checks the trackability of the entry indexed by Rsrc. If Rsrc is trackable, the tracker uses Rsrc's base register to update Rdes's base register. Rdes's offset is the sum of
Rsrc's offset and the instruction's immediate. The entry indexed by Rdes is then marked trackable, i.e., validated. If Rsrc is untrackable, the entry indexed by Rdes is marked untrackable, i.e., invalidated.
Figure 4: The Organization of the Tracking Table
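The update rule described above can be sketched in C as follows (a simplified illustration under our own naming; a real tracker would sit in the decode stage and would also trigger the disambiguation and prefetch steps of Sections 3.2 and 3.3):

```c
#include <stdbool.h>
#include <stdint.h>

struct entry {
    bool    trackable;    /* known (valid) or unknown (invalid) state          */
    uint8_t base;         /* 5-bit base-register label                         */
    int32_t offset;       /* accumulated offset relative to the base           */
    int32_t value;        /* tracker's copy of the architected value, if known */
};

static struct entry table[32];   /* one entry per PISA integer register */

/* Trackable ALU operation: Rdes <- Rsrc + imm (addi/addiu in PISA). */
static void track_alu_imm(int rdes, int rsrc, int32_t imm)
{
    if (table[rsrc].trackable) {
        table[rdes].trackable = true;
        table[rdes].base      = table[rsrc].base;          /* propagate base  */
        table[rdes].offset    = table[rsrc].offset + imm;  /* fold offsets    */
        table[rdes].value     = table[rsrc].value + imm;
    } else {
        table[rdes].trackable = false;                     /* invalidate Rdes */
    }
}

/* Trackable load/store with displaced addressing: reports the effective
 * address when the base register is in the known state. */
static bool track_mem(int rbase, int32_t disp, uint32_t *ea)
{
    if (!table[rbase].trackable)
        return false;
    *ea = (uint32_t)(table[rbase].value + disp);
    return true;
}
```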
5. Experimental Evaluation
This section analyzes the performance of the register tracking and early load address resolution scheme and evaluates its impact on overall processor performance. Section 5.1 presents our simulation methodology.
5.1. Simulation Methodology
Results provided in this paper were obtained using SimpleScalar built for PISA (little-endian). In this study, the baseline is the configuration given in the attached configuration file. Register tracking, as well as early load address resolution for memory disambiguation and data prefetching, is performed right after the decode stage.
Table 1 shows the baseline features used in this project; details are described in the ss8.cfg file submitted with the SimpleScalar sources.
Baseline Configuration                      Size
Instruction fetch queue size                16
BTB configuration                           4-way, 512 sets
Return address stack size                   16
Decode width                                8
Issue width                                 8
Commit width                                8
RUU size                                    128
LSQ size                                    64
Memory latency                              50, 5 cycles
I-TLB / D-TLB                               32:4096:4:1 / 64:4096:4:1
TLB miss latency                            50 cycles
Integer ALUs                                8
Integer multipliers/dividers                8
Floating-point ALUs                         4
FP multipliers/dividers                     2
Memory system ports                         4
D-L1 cache                                  32 KB, 2-way
D-L2 cache                                  512 KB, 4-way
Cache hit latency                           12 cycles

Table 1. Baseline configuration

We simulated AVID on top of this baseline. To obtain fine-grained results, we skip the first 0.5 billion instructions and simulate only the second 0.5 billion instructions. We used the SPEC95 integer benchmarks for PISA, shown in Table 2, for evaluation.

Benchmark        Input data set
Compress         40000 2 2231
Go               9 9
Gcc              -O 1stmt.i
Perl             perl.pl perl-test.pl
Li               au.lsp boyer.lsp

Table 2. Benchmarks with input data sets

Figure 5 shows the performance evaluation with and without AVID.

Figure 5. Performance results (IPC) of AVID versus the baseline for perl, go, cc1, li, and compress95

6. Discussion and Analysis

6.1 The results of simulation

As shown in Figure 5, AVID gains performance improvements of about 5%, except on go and vortex. According to our survey of vortex, it consists of a setup phase followed by a teardown phase, and most of the time is spent in the main loop, which is about 4B instructions per iteration and contains many indirect references. We consider that these characteristics affect the performance. Even though AVID achieves a performance gain, it is less than we expected. From close investigation and review of our work, we conclude that the limited performance improvement results from the fact that AVID only considers register-and-immediate instruction formats. In this project, AVID manages only 1) integer instructions, 2) loads and stores with base-and-displacement format, and 3) ALU operations with an immediate value. Hence, if AVID is extended to handle more instruction formats when dynamically disambiguating memory dependences, higher performance improvements can be achieved in the future.

6.2 Trackability of Individual Registers

In this project, we tracked the state of every register in the front end; AVID inspects the instruction stream in program order. Not surprisingly, some registers are more trackable than others. The stack pointer and the base registers of arrays are propagated more widely to other registers; that is, those registers are more frequently used as the base register of others.

6.3 Limitations of AVID scheme

The gain of early load address resolution comes from issuing loads to the cache in the front-end part of the pipeline ahead of other memory operations. However, a load cannot be advanced ahead of a preceding store, in program order, to the same address. There is also communication overhead between the tracker and the back-end when the tracker (1) prefetches, (2) disambiguates a memory dependency, and (3) updates its state at commit.

7. Related work

Several techniques have been proposed to filter hazards in the front end. Some of them rely on variations of SMT techniques. Rotenberg et al. propose slipstream processors. A slipstream processor reduces the length of a running program by dynamically skipping computation non-essential for correct forward progress. It concurrently runs two copies of the program: one short program always runs ahead of the other and passes its control and data flow outcomes to the full program for checking.
The main, long program fetches and executes more efficiently because it has an accurate picture of the future. Both programs are sped up, since the runahead stream filters the hazards and provides experience of the future. The Dynamic Branch Decoupled Architecture proposes an alternative approach to branch resolution based on decoupling. Branch decoupling splits a single instruction stream into two streams: one stream is solely dedicated to resolving branches as early as possible, and the other, computing stream consumes the resolved branch targets. To hide memory latency, Mudge et al. propose pre-executing instructions under cache misses, which they refer to as runahead processing. When a miss occurs, the processor makes use of idle resources to prefetch the data under the miss.

In register tracking, Bekerman et al. suggested ESP-based register tracking algorithms in the front end. They track only the ESP stack pointer and pre-compute ESP-based address calculations, implementing an early stack-load resolution scheme by classifying the instructions that potentially change the state of ESP. Previously, Austin and Sohi proposed caching the SP and GP registers as well as a number of general registers recently used as base or index components of memory addresses. The cached register values are used to speculatively generate a virtual address at the end of the decode stage and to access the cache speculatively.

This project combines symbolic disambiguation with register tracking. We disambiguate memory addresses using the register state, represented as a base register and an offset. Memory disambiguation and prefetching are performed dynamically and non-speculatively in the front end. We also extend the tracking algorithms to all general-purpose registers.

8. Conclusion

In this project, we present a register tracking algorithm for early memory address resolution and for prefetching the contents of the effective address from low-level memory in the front-end part of the processor pipeline. The proposed scheme provides safe and fast memory disambiguation, not by computing the actual effective address of a memory reference, but by looking at the state of the register. AVID classifies the state of each register as known or unknown and keeps track of register state while communicating with the back-end processor. Register tracking may be performed right after the decode stage. AVID checks every instruction in program order and validates or invalidates the state of registers by classifying instructions. Symbolic disambiguation is achieved through register tracking: if two memory addresses are represented by the same known-state base register with different offsets, AVID can immediately recognize them to be different. As opposed to address prediction, the proposed scheme is not speculative, so it requires no storage for recovery and does not waste power and cycles correcting speculated data. AVID, a complexity-effective, fast, and safe runahead front end, clears false memory dependences and hoists load/store execution. In addition to dynamic memory disambiguation, AVID computes the effective address of a memory reference early and can hide memory latency by prefetching the data early; this is achieved by adding a small adder to AVID. The simulations were performed using SimpleScalar.
To implement AVID, we added suitable data structures to keep track of the state of the registers and modified RUU_dispatch() and lsq_refresh() so that AVID and the back-end processor can communicate. Results from the simulation are provided in Figure 5; the overall performance gain is about 5%, which is less than we expected. We conclude that this is due to restricting the scope to integer instructions. In the future, AVID should manage a wider variety of instructions.

9. Future Work

We had to restrict the scope of AVID considerably in this project. However, we are convinced that AVID can become more powerful with continued research. Architects increasingly need deeper pipelines to achieve high clock speeds, and as machines adopt deeper pipelines AVID becomes more valuable, because register tracking can alleviate load-to-use latency. First, as mentioned before, AVID can improve performance further if it is extended to handle more classes of instructions, such as loads/stores with indexed addressing mode and logical ALU operations. Second, we can extend our tracking scheme from 'known' to 'const'. If we can identify effective addresses that are based on constant registers, we can disambiguate memory dependencies safely and quickly; this extends the scope of memory disambiguation in the front end, and the processor can save the cycles that would otherwise be wasted by independent loads. Third, AVID could combine address prediction and register tracking with dynamic memory disambiguation. Unlike speculation in the back-end, AVID has ample information about the registers; hence, it might provide a more powerful and more accurate way of predicting memory addresses. Finally, if a register has a life cycle, AVID can identify where in that cycle the register's state lies. This could be used for predicting or tracking the pattern of memory references, or for guiding the cache replacement policy.

References

[1] M. Bekerman, A. Yoaz, F. Gabbay, S. Jourdan, M. Kalaev, and R. Ronen, "Early Load Address Resolution Via Register Tracking."
[2] D. C. Burger and T. M. Austin, "The SimpleScalar Tool Set, Version 2.0," Technical Report CS-TR-97-1342, University of Wisconsin-Madison, June 1997.
[3] A. Yoaz, M. Erez, R. Ronen, and S. Jourdan, "Speculation Techniques for Improving Load Related Instruction Scheduling," in Proceedings of the 26th Annual International Symposium on Computer Architecture, May 1999.
[4] S. Palacharla, N. P. Jouppi, and J. E. Smith, "Complexity-Effective Superscalar Processors," in Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 206-218, June 1997.
[5] K. Sundaramoorthy, Z. Purser, and E. Rotenberg, "Slipstream Processors: Improving both Performance and Fault Tolerance."
[6] Z. Purser, K. Sundaramoorthy, and E. Rotenberg, "A Study of Slipstream Processors."
[7] T. M. Austin and G. S. Sohi, "Zero-Cycle Loads: Microarchitecture Support for Reducing Load Latency," in Proceedings of the 28th Annual International Symposium on Microarchitecture, November 1995.
[8] T. M. Austin, D. N. Pnevmatikatos, and G. S. Sohi, "Streamlining Data Cache Access with Fast Address Calculation," in Proceedings of the 22nd International Symposium on Computer Architecture, pages 360-380, 1995.