





# Kilo Instruction Processors

Adrián Cristal

**YALE 80** 

#### **Processor-DRAM Gap (latency)**



D.A. Patterson, "New Directions in Computer Architecture", Berkeley, June 1998



#### Integer, 8-way, L2 1MB



**ROB Size / Branch Predictor** 

- Research proposal to Intel (July 2001) and presentation to Intel-MRL (March 2002)
- Cristal et al., "Large Virtual ROBs by Processor Checkpointing", TR UPC-DAC, July 2002
- M. Valero, NSF Workshop on Computer Architecture, ISCA Conference, San Diego, June 2003



#### Floating-point, 8-way, L2 1MB



**ROB Size / Branch Predictor** 




#### **Execution Locality**

```
void smvp(int nodes, double ***A, int *Acol,
          int *Aindex, double **v, double **w) {
  [...]
  sum0 = A[Anext][0][0]*v[i][0] + A[Anext][0][1]*v[i][1] + A[Anext][0][2]*v[i][2];
  [...]
}
```

Figure: equake::smvp() decode-to-issue distribution (number of instructions versus decode-to-issue latency in cycles), distinguishing cache-dependent code from miss-dependent instruction clusters.
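The distribution above suggests one way to think about execution locality: instructions that issue shortly after decode depend only on cache hits, while instructions that wait hundreds of cycles depend on a miss. The following is a minimal sketch of that classification, not the method used in the talk. The threshold `LOCALITY_THRESHOLD_CYCLES` and all function names are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cutoff between short waits (cache hits) and long waits
   (main-memory misses). The value is an assumption, not from the slides. */
#define LOCALITY_THRESHOLD_CYCLES 100

typedef enum { HIGH_LOCALITY, LOW_LOCALITY } locality_t;

/* Classify one dynamic instruction by its decode-to-issue latency. */
locality_t classify_locality(unsigned decode_to_issue_cycles) {
    return decode_to_issue_cycles < LOCALITY_THRESHOLD_CYCLES ? HIGH_LOCALITY
                                                              : LOW_LOCALITY;
}

/* Fraction of a latency trace that falls in the high-locality group. */
double high_locality_fraction(const unsigned *latencies, size_t n) {
    assert(latencies != NULL || n == 0);
    size_t high = 0;
    for (size_t i = 0; i < n; i++) {
        if (classify_locality(latencies[i]) == HIGH_LOCALITY) {
            high++;
        }
    }
    return n == 0 ? 0.0 : (double)high / (double)n;
}
```

Feeding a trace of per-instruction decode-to-issue latencies to `high_locality_fraction` gives the share of instructions that would fall in the high-locality group under the assumed threshold.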



#### **Mapping Clusters to Processors**

An execution cluster is a partition of the dynamic data dependence graph (DDG) whose instructions all belong to the same locality group.

#### High Locality Clusters:

- large share of instructions (70% in SpecFP, even more in SpecINT)
- need to tolerate L2 cache hit latencies
- advance as fast as possible (prefetching effect!)
- thus the Cache Processor can be small, but must be Out-of-Order

#### Low Locality Clusters:

- small share of instructions (<30%)
- generally not in critical path (Karkhanis, WMPI'02)
- thus the **Memory Processor** can be even smaller, and probably **In-Order**
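One way to picture the split between the two processors is a per-register "poison" bit that marks values depending on an outstanding miss. The sketch below is an illustration under that assumption, not the mechanism described in the talk. The types `instr_t` and `target_t`, the function `steer`, and the register count are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_LOGICAL_REGS 32 /* assumed register-file size */

typedef enum { CACHE_PROCESSOR, MEMORY_PROCESSOR } target_t;

typedef struct {
    int  src1, src2;      /* source logical registers, -1 if unused */
    int  dst;             /* destination logical register, -1 if none */
    bool is_missing_load; /* true if this load misses in the L2 */
} instr_t;

/* poisoned[r] is true while register r depends on an outstanding miss.
   An instruction with a poisoned source belongs to a low-locality cluster. */
target_t steer(const instr_t *in, bool poisoned[NUM_LOGICAL_REGS]) {
    assert(in->src1 < NUM_LOGICAL_REGS && in->src2 < NUM_LOGICAL_REGS &&
           in->dst < NUM_LOGICAL_REGS);
    bool low_locality = (in->src1 >= 0 && poisoned[in->src1]) ||
                        (in->src2 >= 0 && poisoned[in->src2]);
    /* A missing load is issued with ready operands, but its result arrives
       at memory latency, so its destination becomes miss-dependent. */
    if (in->dst >= 0) {
        poisoned[in->dst] = low_locality || in->is_missing_load;
    }
    return low_locality ? MEMORY_PROCESSOR : CACHE_PROCESSOR;
}
```

In this sketch, overwriting a poisoned register with a miss-independent value clears the bit, so later consumers go back to the Cache Processor.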




#### A different view: D-KIP





Miquel Pericas et al, "A decoupled KILO-instruction processor", HPCA06

#### Flexible Heterogeneous MultiCore (I)




Miquel Pericas et al, "A Flexible Heterogeneous Multi-Core Architecture", PACT07

#### Flexible Heterogeneous MultiCore (II)





#### Kilo, Runahead and Prefetching

- Prefetching
  - Anticipates memory requests
  - Reduces the impact of misses in the memory hierarchy
- Runahead Mechanism
  - Executes speculative instructions under a LLC miss
  - Prevents the processor from stalling when the ROB is full
  - Generates useful data prefetches
- Kilo-instruction Processors
  - Exploit more ILP by keeping thousands of instructions in flight while long-latency loads are outstanding in memory (implicit prefetching)
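A back-of-the-envelope calculation shows why "kilo" is the relevant scale. To keep issuing during one miss, the window must hold roughly the issue width times the miss latency. The function name below is hypothetical, and the result is an upper bound, since sustained IPC is usually below the issue width.

```c
#include <assert.h>

/* Instructions that must be in flight to keep a machine of the given
   width busy for the whole duration of one long-latency miss. */
unsigned long window_to_cover_miss(unsigned issue_width,
                                   unsigned miss_latency_cycles) {
    assert(issue_width > 0);
    return (unsigned long)issue_width * miss_latency_cycles;
}
```

As an illustration with assumed numbers, a 4-way core facing a 400-cycle miss gives `window_to_cover_miss(4, 400)`, which is 1600 instructions. That is well beyond a 64-entry or 256-entry ROB.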



## Performance versus RunAhead and Stride Prefetching

- OoO and RunAhead are 4-way with 64/256-entry ROBs
- Cache Processors are 4-way with 64-entry ROB
- Memory Processor/Memory Engines are 2-way in-order processors
- A Memory Engine can hold up to 128 long-latency instructions and 128 loads/stores
- RunAhead features ideal runahead cache
- Stream prefetcher can hold up to 64KB of prefetched data





## "Kilo-processor" and multiprocessor systems



M. Galluzzi et al., "A First Glance at Kilo-instruction Based Multiprocessors", invited paper, ACM Computing Frontiers Conference, Ischia, Italy, April 10-12, 2004



#### What we wanted to do

- Can we extend a Big-Little multicore to implement the Flexible Heterogeneous MultiCore (FMC)?
- Are the Memory Engines (MEs) used all the time, or are they waiting for long-latency loads?
- Can we do something to avoid discarding all the MEs on a branch misprediction?
- What does a practical kilo-vector processor look like?



## Some ideas "stolen" from "Edge Processors" and "Decoupled Architectures"





### **Waiting Queue**

- Instructions + Logical Registers
  - Wait until all used logical registers are ready
  - Assign a Memory Engine
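The two steps above can be sketched as a readiness check followed by engine allocation. This is a minimal illustration under stated assumptions, not the design from the talk. The bitmask encoding (at most 32 logical registers), the number of Memory Engines, and the names `waiting_queue_t` and `try_dispatch` are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_MEMORY_ENGINES 4 /* assumed number of Memory Engines */

typedef struct {
    uint32_t used_regs; /* bit r set: queued instructions read logical register r */
    int      engine;    /* assigned Memory Engine, -1 while still waiting */
} waiting_queue_t;

/* ready_regs: bit r set if logical register r already holds its value.
   Returns the assigned engine, or -1 if the queue has to keep waiting. */
int try_dispatch(waiting_queue_t *wq, uint32_t ready_regs,
                 bool engine_busy[NUM_MEMORY_ENGINES]) {
    assert(wq->engine == -1); /* only queues that are still waiting */
    if ((wq->used_regs & ~ready_regs) != 0) {
        return -1; /* at least one used logical register is not ready */
    }
    for (int e = 0; e < NUM_MEMORY_ENGINES; e++) {
        if (!engine_busy[e]) {
            engine_busy[e] = true;
            wq->engine = e;
            return e;
        }
    }
    return -1; /* inputs are ready but no Memory Engine is free */
}
```

Calling `try_dispatch` each time a register becomes ready or an engine is released is one simple policy. A real design would need to decide priority among several ready queues.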







## Where to start a waiting queue

- Loop:
  - ...
  - If: Br ...
    - ...
  - Else: ...
    - ...
  - Fi:
    - ...
- Endloop










#### **Problems and more problems**

- What to do if the addresses of loads and stores are modified?
- Fetching instructions and partial re-execution

- Pointer Chasing
  - Start a new waiting queue or suspend the execution of a waiting queue?
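Pointer chasing is hard for any large-window scheme because each load address is produced by the previous load. A minimal C illustration follows. The `node_t` layout and the function `chase` are hypothetical and not taken from the talk.

```c
#include <assert.h>
#include <stddef.h>

typedef struct node {
    struct node *next;
    long         value;
} node_t;

/* Follow `hops` links starting from head. Each iteration loads p->next from
   an address produced by the previous load, so the misses form a serial
   chain that cannot be overlapped however many instructions are in flight. */
const node_t *chase(const node_t *head, size_t hops) {
    const node_t *p = head;
    for (size_t i = 0; i < hops; i++) {
        assert(p != NULL); /* caller must not walk past the end of the list */
        p = p->next;
    }
    return p;
}
```

In a loop like this, the question on the slide is whether each dependent load should open a new waiting queue or suspend the current one until the previous load returns.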



### "Kilo-vector" processor









## Thank you

Adrian.cristal@bsc.es