In 1984, three of my students (Wen-mei Hwu, Steve Melvin, and Mike Shebanow) and I conceived of what was then a revolutionary idea for microprocessor implementation: HPS. We advocated serious branch prediction, wide issue, breaking complex instructions into micro-ops (we called them nodes), executing them out of order, and retiring instructions in program order so as to achieve precise exceptions. The objective then, as it is now, was to maximize performance by maximizing instruction and data supply while preventing unnecessary delays in the processing core. Our first papers were published at MICRO-18 in 1985, followed soon after at ISCA in 1986. The first part of this talk takes the audience through the thought processes we went through then, since the world really has not changed much since... although tomorrow we will have 10 billion transistors on a chip, not one million, operating at frequencies in excess of 10 GHz, not 20 MHz. Still, the objective is to make instructions and data available as quickly as possible, and to operate without unnecessary delays. The second part of the talk points to what I believe will be some of the characteristics of the microprocessor ten years from now.
A multinomial is a (non-recursive) mathematical expression in several variables consisting of a sum of product terms. While such multinomial evaluations are typically done in software on a general-purpose processor, there are situations where a custom hardware implementation may be called for. The main motivation for the current research is Bayesian Network Multinomial (BNM) evaluation. In recent proposals, Bayesian Networks are represented using a characteristic multinomial, and the most important operation performed on a Bayesian Network - probabilistic inference - requires evaluation of this multinomial. BNMs are exponential in size, and many real-time Bayesian Networks cannot be practically solved on a general-purpose processor. The method to be presented in this talk, named MOLE (Multinomial On-Line Evaluation), mechanizes the procedure of building an arithmetic circuit (a hardware implementation consisting of interconnected arithmetic modules) starting from the E-graph of a multinomial (evaluation graphs are efficient representations of multinomials in factored form). MOLE consists of the following two steps: 1. application of various linear-system operators to an E-graph representing a multinomial to convert it into a network of linear systems, and 2. solving of the linear systems in an online fashion with a cycle time independent of the precision. A linear-system operator (LSO) transforms a subgraph of an E-graph into an equivalent linear system. By equivalence, we mean that one of the unknowns of the linear system (typically the unknown whose index is zero) evaluates to the same value (or a power-of-two-scaled value) as the expression represented by the subgraph.
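The payoff of the factored E-graph form can be seen with a toy evaluator. The sketch below is purely illustrative (the node encoding and `evaluate` function are my own, not MOLE's operators): it evaluates a multinomial from a factored graph, where the factored form a*(b + c) needs one multiplication while the flat form a*b + a*c needs two.

```python
# Toy evaluator for a factored multinomial graph (illustrative only;
# node encoding is an assumption, not MOLE's E-graph format).

def evaluate(node, env):
    """Recursively evaluate a graph node.
    node: ('var', name) | ('const', c) | ('add', children) | ('mul', children)
    """
    kind = node[0]
    if kind == 'var':
        return env[node[1]]
    if kind == 'const':
        return node[1]
    vals = [evaluate(child, env) for child in node[1]]
    if kind == 'add':
        return sum(vals)
    if kind == 'mul':
        out = 1
        for v in vals:
            out *= v
        return out
    raise ValueError(kind)

# Factored form of a*b + a*c = a*(b + c): one multiply instead of two.
graph = ('mul', [('var', 'a'),
                 ('add', [('var', 'b'), ('var', 'c')])])
print(evaluate(graph, {'a': 2, 'b': 3, 'c': 4}))  # 2*(3+4) = 14
```

A hardware arithmetic circuit follows the same graph shape, with one module per node; MOLE's contribution is replacing such subgraphs with linear systems solved in online arithmetic.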
Over the last decade, we have developed and implemented a compiler - called Compaan - for multi-core embedded systems, and a mapper - called Laura - for reconfigurable platforms. The compiler translates sequentially specified nested-loop programs into equivalent parallel specifications. The mapper transforms the parallel specification into FPGA specifications. In this talk I will briefly sketch the methods and tools, and go through some appealing examples.
The WISCONSIN MULTIFACET PROJECT, which I co-lead with David Wood, seeks to improve the multiprocessor servers that form the computational infrastructure for Internet web servers, databases, and other demanding applications. Work focuses on using the transistor bounty provided by Moore's Law to improve multiprocessor performance, cost, and fault tolerance, while also making these systems easier to design and program. This talk quickly summarizes our work and then focuses on one result: a Flight Data Recorder. Debugging is easier when one can faithfully replay a buggy execution. This is not the default for multithreaded workloads, due to race non-determinism. To facilitate deterministic debugging, we have developed a practical low-overhead hardware recorder for cache-coherent multiprocessors, called the Flight Data Recorder (FDR). Like an aircraft flight data recorder, FDR continuously records selected execution events, even on deployed systems, logging the execution for post-mortem analysis. FDR piggybacks on cache-coherence hardware and can capture the last one second of execution with modest (less than 2%) slowdown.
Fast and effective memory management is crucial for many applications, including web servers, database managers, and scientific codes. However, current memory managers do not provide adequate support for these applications on modern architectures, severely limiting their performance, scalability, and robustness. In this talk, I describe how to design memory managers that support high-performance applications. First, I show how our Heap Layers infrastructure addresses the software engineering challenges of building efficient memory managers. Next, I show why current general-purpose memory managers do not scale on multiprocessors, cause false sharing of heap objects, and systematically leak memory. I then describe Hoard, a fast, provably scalable general-purpose memory manager that solves these problems. As an example of its effectiveness, British Telecom uses Hoard to improve real application performance fivefold. Finally, I show how to support server applications that must deal with runaway modules, terminated connections, or aborted transactions. I present a generalization of regions and heaps called "reaps" that supports these applications with higher performance and lower memory consumption than current approaches.
As computer systems become increasingly present in our lives, more human activities depend on their availability. Human intervention is too costly and unacceptably slow when computer systems must be monitored and repaired quickly and reliably, regardless of scale, network availability, or system impairment. In this talk, I propose to build defensive computer systems by augmenting their hardware or software architecture with trusted intelligent backdoors. An intelligent backdoor can be programmed to perform automated observation of and intervention on a computer system without involving its operating system, and can communicate with other backdoors over private networks. Backdoors can be realized in hardware using a programmable network interface, or in software over a virtual machine monitor. I will present our results in prototyping a backdoor-based architecture for remote repair and recovery, and I will outline our plans to further exploit its potential.
The evolution of high-performance microprocessors is fast approaching several significant inflection points. First, the marginal utility of additional single-core complexity is now rapidly diminishing due to a number of factors. The increase in instructions per cycle from increases in the sizes and numbers of functional units has plateaued. Meanwhile, the increasing sizes of functional units and cores are beginning to have significant negative impacts on pipeline depths and the scalability of processor clock cycle times. Second, the power of high-performance microprocessors has increased rapidly over the last two decades, even as device switching energies have been significantly reduced by supply voltage scaling. However, future voltage scaling will be limited by minimum practical threshold voltages. Current high-performance microprocessors are already near market limits of acceptable power dissipation. Thus, scaling microprocessor performance while maintaining or even reducing overall power dissipation, without the benefit of appreciable further voltage scaling, will prove especially challenging. In this keynote talk we will discuss these issues and propose likely scenarios for the future evolution of high-performance microprocessors.
We present RegionScout, a technique for reducing power and bandwidth demand in small-scale, snoopy-coherent symmetric multiprocessors (SMPs). We will briefly discuss the technological and market tradeoffs that make snoopy-coherent SMPs increasingly attractive for a variety of applications, ranging from general-purpose servers to embedded systems. Memory requests in SMPs are expensive in terms of power and bandwidth, since they are broadcast to all nodes. RegionScout exploits common program behavior to dynamically eliminate many broadcasts, thus reducing power and bandwidth demand. We define a region to be a contiguous, aligned memory area whose size is a power of two, and observe that many requests find that no other node caches a block in the same region, even for regions as large as 16 KB. RegionScout is a family of simple filter mechanisms that dynamically detect most accesses to non-shared regions. A node with a RegionScout filter can determine in advance that a request will miss in all remote nodes and thus avoid broadcasting the request. RegionScout filters are implemented as a layered extension over existing snoop-based coherence, and they utilize very simple structures. Their operation is completely transparent to software and the operating system. These characteristics are made possible by utilizing imprecise information about the regions cached in each node.
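The region abstraction itself is simple enough to sketch. The code below is an illustrative software model, not the paper's hardware design (the 16 KB region size and 256-entry counting filter are assumptions): region numbers are just high-order address bits, and a small imprecise filter conservatively tracks which regions a node caches.

```python
# Illustrative software model of a region filter (parameters are
# assumptions, not RegionScout's hardware configuration).
REGION_BITS = 14               # regions are aligned 16 KB areas
FILTER_SIZE = 256

def region_of(addr):
    # Power-of-two-sized, aligned regions: the region number is
    # simply the high-order address bits.
    return addr >> REGION_BITS

class RegionFilter:
    """Imprecise summary of the regions a node caches: it may report
    a region as present when it is not (false positive), but never
    misses one it holds -- which is safe for snoop filtering."""
    def __init__(self):
        self.counts = [0] * FILTER_SIZE
    def on_fill(self, addr):
        self.counts[region_of(addr) % FILTER_SIZE] += 1
    def on_evict(self, addr):
        self.counts[region_of(addr) % FILTER_SIZE] -= 1
    def may_cache(self, addr):
        return self.counts[region_of(addr) % FILTER_SIZE] > 0

remote = RegionFilter()
remote.on_fill(0x12345678)
# A request to a region no remote node caches can skip the broadcast:
print(remote.may_cache(0x9ABC0000))  # False
```

Because the filter never produces false negatives, skipping a broadcast when `may_cache` is false is always coherence-safe; false positives merely cost an unnecessary broadcast.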
In this paper, we present two approaches to improving the execution of OpenMP applications on the IBM Cyclops multithreaded architecture. The two solutions are independent, and both aim to obtain better performance through better management of cache locality. The first solution is based on software modifications to the OpenMP runtime library to balance stack accesses across all data caches. The second solution is a small hardware modification that changes the data cache mapping behavior, with the same goal. Both solutions help parallel applications improve scalability and obtain better performance on this kind of architecture. In fact, they could also be applied to future multi-core processors. We have executed (using simulation) some of the NAS benchmarks to evaluate these proposals. They show how, with small changes in both the software and the hardware, we achieve very good scalability in parallel applications. Our results also show that standard execution environments oriented to multiprocessor architectures can easily be adapted to exploit multithreaded processors.
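The software solution can be illustrated with a toy model. The bank geometry, bit positions, and base addresses below are assumptions for illustration only: by staggering each thread's stack base by one bank-sized stride, consecutive threads' stacks map to different data caches instead of all colliding in one.

```python
# Illustrative stack-staggering sketch (bank geometry and addresses
# are assumptions, not Cyclops parameters).
NUM_BANKS = 4
BANK_SHIFT = 12          # assume the bank is selected by address bits above 4 KB

def bank_of(addr):
    return (addr >> BANK_SHIFT) % NUM_BANKS

def staggered_stack_base(thread_id, base=0x4000_0000, stride=1 << BANK_SHIFT):
    # Offset each stack by one bank-sized stride per thread, so
    # consecutive threads land in different banks.
    return base + thread_id * stride

banks = [bank_of(staggered_stack_base(t)) for t in range(8)]
print(banks)  # [0, 1, 2, 3, 0, 1, 2, 3]
```

Without the stagger, every stack base would map to bank 0 and stack accesses from all threads would contend for one data cache; the hardware variant achieves the same spreading by changing which address bits select the cache.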
In this paper, we present a system capable of improving I/O performance in an automatic way. The system is able to learn the behavior of the applications running on top of it and find the best data placement on the disk in order to improve I/O performance. The system is built from three independent modules. The first one learns the behavior of a workload so that it can reproduce that behavior later on, without a new execution. The second module is a drive modeler that learns how a storage drive works, treating it as a black box. Finally, the third module generates a set of placement alternatives and uses the aforementioned models to predict the performance each alternative will achieve. We tested the system with five benchmarks; it was able to find better alternatives in most cases and improve performance significantly (up to 225%). Most importantly, the predicted performance was always very accurate (less than 10% error).
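The third module's selection step can be sketched as follows. This is a hedged illustration, not the authors' system: the "workload model" is reduced to a block-access trace and the "drive model" to a seek-cost function of head movement.

```python
# Hedged sketch of placement selection: replay the learned workload
# against a learned drive model for each candidate placement and keep
# the cheapest. Names and model shapes are illustrative assumptions.

def best_placement(alternatives, trace, seek_cost):
    """Rank candidate placements by predicted replay time."""
    def predicted_time(placement):
        cost, head = 0, 0
        for block in trace:
            lba = placement[block]
            cost += seek_cost(abs(lba - head))  # drive model: cost of this seek
            head = lba
        return cost
    return min(alternatives, key=predicted_time)

trace = ['a', 'b', 'a', 'b']                    # learned workload behavior
alternatives = [
    {'a': 0, 'b': 90_000},                      # hot blocks far apart
    {'a': 0, 'b': 8},                           # hot blocks co-located
]
print(best_placement(alternatives, trace, seek_cost=lambda d: 1 + d))
# {'a': 0, 'b': 8}
```

The key property mirrored here is that candidates are ranked entirely from models, so no alternative ever has to be physically deployed and re-run to be evaluated.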
SMT processors increase performance by executing instructions from several threads simultaneously. These threads make better use of the processor's resources by sharing them, but at the same time they compete for those resources. How critical resources are distributed among threads determines the final performance. Currently, processor resources are distributed among threads as determined by the fetch policy, which decides which threads enter the processor to compete for resources. However, current fetch policies use only indirect indicators of resource usage in their decisions, which can lead to resource monopolization by a single thread or to resource waste when no thread can use them. Both situations can harm performance and occur, for example, after an L2 cache miss.
Traditional branch predictors exploit correlations between pattern history and branch outcome to predict branches, but there is a stronger and more natural correlation between path history and branch outcome. I exploit this correlation with piecewise linear branch prediction, an idealized branch predictor that develops a set of linear functions, one for each program path to the branch to be predicted, that separate predicted-taken from predicted-not-taken branches. Taken together, all of these linear functions form a piecewise linear decision surface. Disregarding implementation concerns modulo a 64.25 kilobit hardware budget, I presented this idealized branch predictor for the first Championship Branch Prediction competition. I describe the idea behind the algorithm, as well as the tricks used to squeeze it into 64.25 kilobits while maintaining good accuracy.
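A simplified software sketch of the idea follows. The history length, training threshold, and dictionary-based weight storage are illustrative, not the idealized predictor's exact organization: each weight is selected by the branch address, an address from the path history, and its position in that history, so each distinct path to a branch effectively gets its own linear function.

```python
# Simplified path-based linear branch predictor (perceptron-style
# training; sizes and storage are illustrative assumptions).
H = 8           # path/outcome history length
THRESHOLD = 16  # train only while confidence is low or on a mispredict

weights = {}    # (branch_pc, path_pc, position) -> integer weight
bias = {}       # branch_pc -> bias weight
ghist = [0] * H  # outcome history: +1 taken, -1 not taken
path = [0] * H   # addresses of the last H branches

def predict(pc):
    # Dot product of path-selected weights with the outcome history;
    # predict taken when the output is non-negative.
    total = bias.get(pc, 0)
    for i in range(H):
        total += weights.get((pc, path[i], i), 0) * ghist[i]
    return total

def update(pc, taken):
    out = predict(pc)
    t = 1 if taken else -1
    if (out >= 0) != taken or abs(out) < THRESHOLD:
        bias[pc] = bias.get(pc, 0) + t
        for i in range(H):
            key = (pc, path[i], i)
            weights[key] = weights.get(key, 0) + t * ghist[i]
    ghist.pop(); ghist.insert(0, t)   # shift in the new outcome
    path.pop(); path.insert(0, pc)    # shift in the new path element
```

Because the first index of each weight key is a path address, the same branch consults different weight sets along different paths, which is what makes the decision surface piecewise linear rather than a single hyperplane.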
In recent years, processor manufacturers have converged on two types of register file architectures. Both IBM with its POWER series and Intel with its Pentium series use central storage for all in-flight values, which offers high performance potential. AMD, on the other hand, uses an optimized implementation of the Future File for its line of Opteron processors. Both approaches have limitations that may preclude their application in future processor implementations. The centralized register file scales poorly in terms of power-performance. The Future File may be limited by the requirement of distributed reservation stations and by its branch misprediction recovery scheme. This paper proposes to give processor design teams another choice by combining a traditional Future File architecture with the concept of a central physical register file. This new register file is used in the "front end" in combination with value storage in the instruction queue. Further, it is shown that, contrary to what happens with a centralized R10k-like architecture, the proposed architecture can use banking that scales well with the size of structures in the processor. Only a 0.3% IPC degradation was observed when using this technique. The energy savings due to banking are fully utilized in our proposal. Further, the implementation is shown to integrate well with a technique for the early release of short-lived values that we call writeback filtering. Overall, the resulting energy-delay product is lower than in previous proposals. Keywords: Register File, Reservation Stations, Register Distribution, IPC, Low Power
The performance of current parallel and distributed systems continues to disappoint. This is due partially to the inability of applications, compilers, and computer systems to adapt to their own dynamic changes. It is also due to the current compartmentalized approach to optimization: applications, compilers, OS, and hardware configurations are designed and optimized in isolation. No global optimization is attempted, and the needs of each running application do not constitute the primary optimization consideration.
To address these problems, we propose application-centric computing, or Smart Applications (SmartApps). In the SmartApps executable, the compiler embeds most run-time system services and an optimizing feedback loop that monitors the application's performance and adaptively reconfigures the application and the OS/system platform. At run-time, after incorporating the code's input and determining the system's resources and state, the SmartApp performs an instance-specific optimization, which is more tractable than a generic global optimization across application, OS, and system. The overriding philosophy of SmartApps is ``measure, compare, and adapt if beneficial.''
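That philosophy can be caricatured in a few lines. The sketch below is not the SmartApps run-time interface (function names and the timing probe are placeholders); it only shows the shape of instance-specific selection: measure each variant on the actual input, compare, and keep the winner.

```python
# Minimal "measure, compare, adapt if beneficial" sketch.
# Names are placeholders, not the SmartApps API.
import time

def smart_select(variants, workload):
    """Measure each code variant on this specific input instance,
    compare the measurements, and keep the winner."""
    best, best_time = None, float('inf')
    for name, run in variants.items():
        start = time.perf_counter()
        run(workload)                        # measure
        elapsed = time.perf_counter() - start
        if elapsed < best_time:              # compare; adapt if beneficial
            best, best_time = name, elapsed
    return best

# Two interchangeable implementations of a membership scan:
variants = {
    'linear': lambda xs: [x in xs for x in range(50)],
    'set':    lambda xs: [x in set(xs) for x in range(50)],
}
print(smart_select(variants, list(range(1000))))
```

The point is that the choice is made per input instance at run time, with real measurements, rather than once at compile time for a generic input.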
SmartApps is being developed in the STAPL infrastructure. STAPL (the Standard Template Adaptive Parallel Library) is a framework for developing highly-optimizable, adaptable, and portable parallel and distributed applications. It consists of a relatively new and still evolving collection of generic parallel algorithms and distributed containers and a run-time system (RTS) through which the application and compiler interact with the OS and hardware.
SmartApps are being developed for several important computational science applications, including a discrete-ordinates code simulating subatomic particle transport, a molecular dynamics code, and a program using a novel method to simulate protein folding and ligand binding. We use current ASCI shared- and distributed-memory machines, networks of workstations running the Linux and K42 operating systems and, as of late, the Blue Gene machine.
This talk will focus on the genesis of System X at Virginia Tech and the need to support the research and development of software for clusters, supercomputers and grids.
Silicon technology will continue to provide an exponential increase in the availability of raw transistors. Effectively translating this resource into application performance, however, is an open challenge. Ever-increasing wire delay relative to switching speed and the exponential cost of circuit complexity make simply scaling up existing processor designs futile. This talk presents an alternative to superscalar design: WaveScalar. WaveScalar is a dataflow instruction set architecture and execution model designed for scalable, low-complexity/high-performance processors. WaveScalar is unique in efficiently providing unified support both for single- and multi-threaded imperative programming models that require traditional memory semantics and for traditional dataflow models that can exploit fine-grained dataflow parallelism.
The WaveScalar ISA is designed to run on an intelligent memory system. Each instruction in a WaveScalar binary executes in place in the memory system and explicitly communicates with its dependents in dataflow fashion. WaveScalar architectures cache instructions and the values they operate on in a WaveCache, a simple grid of ``alu-in-cache'' nodes. By co-locating computation and data in physical space, the WaveCache minimizes long-wire, high-latency communication.
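The dataflow firing rule underlying this execution model can be illustrated with a toy interpreter (purely illustrative; this is not the WaveScalar ISA or its tag scheme): an instruction fires as soon as all of its operands have arrived, and forwards its result directly to its dependents rather than through a shared register file.

```python
# Toy dataflow interpreter illustrating execute-in-place firing.
# The program encoding is an assumption for illustration.
import operator

# instruction id -> (operation, number of inputs, dependent ids)
program = {
    'add1': (operator.add, 2, ['mul1']),
    'add2': (operator.add, 2, ['mul1']),
    'mul1': (operator.mul, 2, []),
}
operands = {name: [] for name in program}
results = {}

def send(dest, value):
    """Deliver an operand; fire the instruction once all have arrived."""
    op, arity, deps = program[dest]
    operands[dest].append(value)
    if len(operands[dest]) == arity:   # all inputs present: fire
        result = op(*operands[dest])
        results[dest] = result
        for d in deps:                 # forward directly to dependents
            send(d, result)

# Evaluate (1 + 2) * (3 + 4) purely by operand arrival:
send('add1', 1); send('add1', 2)
send('add2', 3); send('add2', 4)
print(results['mul1'])  # 21
```

In the WaveCache each `program` entry would live in a grid node next to its operand storage, so the `send` arrows become short physical wires between neighboring nodes.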
Results for single-threaded SPEC and Mediabench applications demonstrate that the WaveCache outperforms an aggressively configured superscalar design by 2-7 times, with ample opportunities for future optimizations. The WaveCache achieves up to 110 IPC on Splash2 parallel programming workloads and between 100 and 370 IPC on dataflow kernels, improving performance by a factor of 2-20 over optimized, parallelized imperative-language versions.
I will also present preliminary area and delay results for a synthesizable RTL model of the WaveCache's key components.
By converting thread-level parallelism to instruction-level parallelism, Simultaneous Multithreaded (SMT) processors are emerging as an effective way to utilize the resources of modern superscalar architectures. However, the full potential of SMT has not yet been reached, as most modern operating systems use existing single-threaded or multiprocessor algorithms to schedule threads, neglecting contention for resources between threads. To date, even the best SMT scheduling algorithms simply try to group threads for co-residency based on each thread's expected resource utilization, but do not take into account variance in thread behavior. We therefore introduce architectural support that enables new thread scheduling algorithms to group threads for co-residency based on fine-grained memory system activity information. The proposed memory monitoring framework centers on the concept of a cache activity vector, which exposes runtime cache resource information to the operating system to improve job scheduling. Using this scheduling technique, we experimentally evaluate the overall performance improvement of workloads on an SMT machine against the most recent Linux job scheduler. This work is first motivated with experiments in a simulated environment, then validated on a Hyper-Threading-enabled Intel Pentium 4 Xeon microprocessor running a modified version of the latest Linux kernel.
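The scheduling idea can be sketched in a few lines. The vector layout and the greedy pairing heuristic below are my assumptions for illustration, not the paper's exact design: each thread's activity vector counts its recent activity per cache set, and threads whose vectors overlap least are chosen for co-residency.

```python
# Illustrative activity-vector co-scheduling sketch (vector layout
# and pairing heuristic are assumptions, not the paper's design).

def conflict(v1, v2):
    # Estimate contention as the overlap of per-cache-set activity.
    return sum(min(a, b) for a, b in zip(v1, v2))

def pair_threads(vectors):
    """Greedily pair threads so that co-resident threads touch
    mostly disjoint cache sets."""
    remaining = list(vectors)
    pairs = []
    while len(remaining) > 1:
        t = remaining.pop(0)
        mate = min(remaining, key=lambda u: conflict(vectors[t], vectors[u]))
        remaining.remove(mate)
        pairs.append((t, mate))
    return pairs

# A and B hammer the low cache sets; C and D hammer the high ones.
vectors = {'A': [9, 9, 0, 0], 'B': [9, 9, 0, 0],
           'C': [0, 0, 9, 9], 'D': [0, 0, 9, 9]}
print(pair_threads(vectors))  # [('A', 'C'), ('B', 'D')]
```

The hardware contribution is making such vectors cheaply visible to the OS at runtime; the scheduler change is then a heuristic like the one above instead of utilization averages.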
The BlueGene/L supercomputer will consist of 65,536 dual-processor compute nodes interconnected by two high-speed networks: a three-dimensional torus network and a tree-topology network. Each compute node can address only its own local memory, making message passing the natural programming model for BlueGene/L. In this talk I will present our implementation of MPI for BlueGene/L. I will discuss how we leveraged the architectural features of BlueGene/L to arrive at an efficient implementation of MPI on this machine.
The presentation starts by introducing the Microprocessor Technology Labs at Intel. Moore's Law has guided Intel microprocessors successfully for decades. Transistor densities will continue to increase. However, device speed scaling will not follow historical trends, and leakage power remains an issue. Meanwhile, new usage models and workloads will demand greater performance. We can't run business as usual: we need to develop uArchitecture features more effectively under power constraints. Software will play an increasingly important role in microprocessor design. Emerging applications such as recognition, mining, and synthesis will become the next generation of dominant applications. System software has been changing its landscape as well. These are challenges and opportunities for microprocessor designers and researchers to explore new uArchitecture features that meet the needs of the new application and system software. The presentation will give a couple of examples showing how system software enables HW design. It will show a potential direction for uArchitecture for such emerging applications. The talk will also describe the emerging paradigms of programming systems and their impact on uArchitecture design.
That's all folks!!!!!