# Title:


A Next-Generation Superscalar Uniprocessor Architecture

### Author:

Dr. sc. techn. Wolfgang Matthes, Franz-Mehring-Straße 22, 9006 Chemnitz

#### Abstract

A uniprocessor superscalar architecture is proposed which comprises four universal operation units arranged according to a tree-shaped dataflow graph. The control principles are based on VLIW, microprogramming, and dataflow concepts. Each of the operation units is an ensemble of high-performance processing resources comparable to state-of-the-art processors (e. g. i860). The whole processor may be implemented with 10 to 50 million transistors, thus being a suitable implementation target for IC technologies of the 90's.

# 1. The Bases of Experience

Present superscalar architectures have been developed on the basis of comprehensive analytical work, especially relying on instruction level statistics. Such a measurement-oriented approach (MH86) may lead to considerably good machines, but has the obvious drawbacks of a disappointingly low rate of usable inherent parallelism and of reflecting current programming habits, leaving opportunities for further innovation untouched. Hence our approach is not to study programs, but to study the underlying mathematical structures of important application problems. To obtain initial data we have simply browsed some collections of formulas. Figure 1 shows two frequently needed mathematical operations together with the corresponding dataflow graphs.

# 2. The Proposed Structure

A more elaborate bookkeeping of the resources needed will show that four universal operation units may be used efficiently (MA91). Calculations whose dataflow graph comprises more than four nodes are to be executed in more processing steps. We realize only two essential interconnection structures: (1) none (i. e. independently operating units) and (2) tree-shaped structures. The basic arrangement is shown in Figure 2. The units form a tree structure with four operand data paths from memory and one result path to memory. The fourth unit is connected with a stack-organized accumulating memory which is used as the runtime data stack. Independent operation of four units requires some bypass provisions. This cost/performance tradeoff will cause efficiency losses if vectorized or unrolled code is to be executed.

Figure 3 shows an extension of the proposed structure which avoids this drawback. Each of the operation units has a multipurpose memory (MPM) which can be used as an accumulator, a stack, a collection of vector registers, and a control storage. It has two independent ports for read and write accesses, respectively. Its capacity should be at least 8 kBytes, organized as 1024 buckets of 128 bits (if used as a vector register, it could hold two vectors of 1024 64-bit elements). The whole structure is connected to the memory subsystem via four read-only and four read/write ports (the latter are used to provide the paths of the tree-shaped structure as well). This scheme allows the MPMs to be loaded and stored at maximum speed.

Each of the operation units can execute even triadic operations (e. g. SAXPY) with two of the operands delivered from memory and one from the MPM. Results will be stored in the MPMs. They can be moved to memory at maximum speed after the operations have been completed.
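The MPM organization and the triadic operation described above can be illustrated with a small Python model. This is only a sketch under stated assumptions: the class `OperationUnit`, the functions `mpm_capacity_bytes`, `vectors_per_mpm` and `saxpy`, and the choice of which SAXPY operand resides in the MPM (here the vector `y`, with the scalar `a` and the element `x[i]` coming from memory) are hypothetical and not specified in the text. Only the bucket count, bucket width and element width are taken from the description.

```python
BUCKETS = 1024        # buckets per MPM (from the text)
BUCKET_BITS = 128     # width of one bucket in bits (from the text)
ELEMENT_BITS = 64     # width of one vector element in bits (from the text)


def mpm_capacity_bytes():
    """Capacity implied by 1024 buckets of 128 bits each."""
    return BUCKETS * BUCKET_BITS // 8


def vectors_per_mpm(vector_length=1024):
    """Number of vectors of 64-bit elements that fit into one MPM."""
    elements = BUCKETS * (BUCKET_BITS // ELEMENT_BITS)
    return elements // vector_length


class OperationUnit:
    """Hypothetical model of one operation unit with its MPM."""

    def __init__(self):
        # The MPM is modelled as a flat array of 64-bit elements.
        self.mpm = [0.0] * (BUCKETS * BUCKET_BITS // ELEMENT_BITS)

    def triadic(self, index, a, x):
        # Two operands (a, x) arrive from memory, the third is read
        # from the MPM; the result is written back to the MPM.
        self.mpm[index] = a * x + self.mpm[index]


def saxpy(unit, a, xs, ys):
    """Compute a*x + y element by element (assumed operand placement)."""
    for i, y in enumerate(ys):      # preload y into the MPM
        unit.mpm[i] = y
    for i, x in enumerate(xs):      # stream a and x from memory
        unit.triadic(i, a, x)
    return unit.mpm[:len(xs)]       # results stay in the MPM until moved
```

In this model, 1024 buckets of 128 bits amount to 16384 bytes, which satisfies the stated lower bound of 8 kBytes and holds exactly two vectors of 1024 64-bit elements.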

# 3. The Internal Structure of an Operation Unit

Each of the operation units can process both numerical and nonnumerical data. The structure of a processing kernel is shown in Figure 4. Some of the

Multiplication of complex numbers: $(a_1, b_1) \cdot (a_2, b_2) = (a_1 a_2 - b_1 b_2,\; a_1 b_2 + a_2 b_1)$



Addition/subtraction of rational numbers



Figure 1: Dataflow graph examples.
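The two operations of Figure 1 can be written down as small dataflow graphs and grouped into processing steps of at most four nodes each. The Python sketch below is illustrative only. The node names, the helpers `schedule` and `evaluate`, and the unreduced textbook form a/b + c/d = (ad + cb)/(bd) used for the rational addition are assumptions, not taken from the figure, and the scheduler ignores the interconnection constraints between the units.

```python
import operator

OPS = {"mul": operator.mul, "add": operator.add, "sub": operator.sub}

# Each node maps to (operation, left input, right input). An input is
# either an operand name or the name of another node.
COMPLEX_MUL = {
    "p1": ("mul", "a1", "a2"),
    "p2": ("mul", "b1", "b2"),
    "p3": ("mul", "a1", "b2"),
    "p4": ("mul", "a2", "b1"),
    "re": ("sub", "p1", "p2"),   # real part
    "im": ("add", "p3", "p4"),   # imaginary part
}

RATIONAL_ADD = {
    "n1": ("mul", "a", "d"),
    "n2": ("mul", "c", "b"),
    "num": ("add", "n1", "n2"),  # numerator
    "den": ("mul", "b", "d"),    # denominator
}


def schedule(graph, units=4):
    """Group nodes into steps of at most `units` nodes whose inputs are ready."""
    done, steps, pending = set(), [], dict(graph)
    while pending:
        ready = [
            name for name, (_, left, right) in pending.items()
            if all(x not in graph or x in done for x in (left, right))
        ]
        if not ready:
            raise ValueError("graph contains a cycle")
        step = ready[:units]
        steps.append(step)
        done.update(step)
        for name in step:
            del pending[name]
    return steps


def evaluate(graph, operands):
    """Execute the graph step by step and return all computed values."""
    values = dict(operands)
    for step in schedule(graph):
        # All nodes of one step read only values from earlier steps.
        results = {
            n: OPS[graph[n][0]](values[graph[n][1]], values[graph[n][2]])
            for n in step
        }
        values.update(results)
    return values
```

Under these assumptions the complex multiplication has six nodes and needs two steps (four multiplications, then one subtraction and one addition). The rational addition has only four nodes but still needs two steps, because the numerator depends on two of the products.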



*Figure 2:* The proposed structure of four operation units.
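One processing step of the tree-shaped arrangement might be modelled as follows. This is a hedged sketch, not the author's definition: the wiring (units 1 and 2 each take two operands from memory, unit 3 combines their results, unit 4 combines the result of unit 3 with the top of the stack-organized accumulating memory) is an assumption based on the description in Section 2, and the names `tree_step` and `dot` are hypothetical.

```python
from operator import add, mul


def tree_step(op1, op2, op3, op4, a1, b1, a2, b2, stack):
    """One pass through the assumed tree of four operation units."""
    r1 = op1(a1, b1)             # unit 1: two operands from memory
    r2 = op2(a2, b2)             # unit 2: two operands from memory
    r3 = op3(r1, r2)             # unit 3: combines the results of units 1 and 2
    top = stack.pop()            # unit 4: second operand from the data stack
    stack.append(op4(r3, top))   # accumulated result goes back onto the stack
    return stack[-1]


def dot(xs, ys):
    """Dot product of two equal-length vectors with an even number of elements."""
    stack = [0.0]
    for i in range(0, len(xs), 2):
        tree_step(mul, mul, add, add, xs[i], ys[i], xs[i + 1], ys[i + 1], stack)
    return stack.pop()
```

Each call consumes four operands from memory and leaves a single accumulated value, which mirrors the four operand data paths and the one result path of the basic arrangement.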

[Figure 3 labels: read-only ports (1...4); read/write ports (5...8); operation units 1 to 4, each with operand inputs A and B and result output R; read/write paths "store 1, load 3A", "store 2, load 3B", "store 3, load 4A", "store 4, load 4B".]

[Figure 3 inset, internal details of an operation unit: Multi-Purpose Memory (MPM), Processing Kernel, operand inputs A and B, MA, MB, result R, SEL, /SEL, bypass.]

*Figure 3:* The proposed structure extended.



Figure 4: Internal structure of a processing kernel.




# 32 bit Resource Selection Word (RSW)

