Modern computers increasingly use multi-core architectures, ranging from clusters of homogeneous cores for high-performance computing to the heterogeneous architectures typically found in embedded systems. To program such architectures efficiently, it is important to be able to partition and map programs onto the cores of the architecture. We believe that communication patterns need to become explicit in the source code to make parallel programs easier to analyze and partition. Extracting these patterns is difficult to automate due to limitations of compiler techniques in determining the effects of pointers.
We have proposed an OpenMP extension that allows programmers to explicitly declare the pointer-based data sharing between coarse-grain program parts. We present a dependency directive, expressing the input and output relations between program parts and pointers to shared data, as well as the set of runtime operations necessary to enforce the declarations made by the programmer. The cost and scalability of the runtime operations are evaluated using micro-benchmarks and a benchmark from the NAS Parallel Benchmark suite. The measurements show that the overhead of the runtime operations is small; in fact, no performance degradation is observed when using the runtime operations in the NAS benchmark.
Numerous schemes for extracting the secret key from cryptographic devices using side-channel attacks have been developed. One of the most effective side-channel attacks consists of maliciously injecting faults into the device and observing the erroneous results it produces. In some extreme cases, a single fault-injection experiment has been shown to be sufficient for retrieving the secret key. In this talk we describe several fault-injection attacks on symmetric-key and public-key ciphers and outline countermeasures that have been developed to protect cryptographic devices against such attacks. We then show that some of these countermeasures do not provide the desired protection and, even worse, may make other side-channel attacks easier to mount.
In this talk I will present two recent works on power- and thermal-aware load-balancing techniques for massive multi-core architectures.
(a) Hardware-based load balancing for massive multi-core architectures implementing power gating (To appear in the IEEE Transactions on Computer-Aided Design)
Many-core architectures provide a computation platform with high execution throughput, enabling them to efficiently execute workloads with a significant degree of thread-level parallelism. The burst-like nature of these workloads allows large power savings by power gating the idle cores. In addition, the load balancing of threads to cores also impacts the power and thermal behavior of the processor.
Processor implementations of many-core architectures may choose to group several cores into clusters that share the area overhead of the power-gating circuitry, so that a whole cluster is power gated rather than individual cores. However, this coarser granularity of power gating reduces the potential for power savings.
In this work, several hardware-based, stateless load-balancing schemes are evaluated for these clustered homogeneous multi-core architectures in terms of their power and thermal behavior. All these methods can be unified into a parameterized technique that dynamically adjusts to obtain the desired goal (lower power, higher performance, lower hotspot temperature).
(b) Trading off higher execution latency for increased reliability in tile-based massive multi-core architectures (Published at the 2009 IEEE International Symposium on Quality Electronic Design)
Massive multi-core architectures provide a computation platform with high execution throughput, enabling the efficient execution of workloads with a significant degree of thread-level parallelism, such as networking, DSP and e-commerce. The burst-like nature of these workloads leaves most of the cores idle most of the time, so there is a large potential for power savings by power gating these idle cores. The ideal scenario from a power-dissipation point of view is to execute requests as fast as possible so that the cores can be power gated for the longest time. However, due to the exponential dependence of (static) leakage power on temperature, a cluster of spatially close cores may consume more power than the same cores placed farther apart. Tight clustering may be best for performance (since cores are closer to their neighbors' caches), but when spare cores are available on the die, executing requests on distant cores may maintain overall throughput while reducing both power and hot spots, thus increasing the processor's reliability.
In this work, the power, performance and thermal behavior of a tile-based massive multi-core architecture is modeled and evaluated under different workload scenarios. Under a low ingress rate of requests or low inter-core communication traffic, both higher power savings and more uniform chip wear are obtained by assigning requests to physically distant cores.
Power efficiency has constrained the growth of single-threaded performance, but will soon also constrain the scaling of multicore chips. In this talk, I will project how Moore's Law will affect multicore designs, and show that energy efficiency will determine the number of cores that we can fit on a chip, leading to a model that I call "pinhole processing." To address the efficiency of individual cores, I will describe the TFlex microarchitecture, a class of ultra-adaptive EDGE-based cores that can enable dynamic heterogeneity through composability, subsuming many of the heterogeneous multicore design points. Finally, I will offer some thoughts on what comes after multicore.
In this talk we present several recent results obtained in the design of parallel algorithms for dense and sparse linear algebra. The overall goal of this research is to reformulate and redesign linear algebra algorithms so that they are optimal in the amount of communication they perform, while retaining numerical stability. The work involves both theoretical investigation and practical coding on diverse computational platforms. On the theoretical side, we identified lower bounds on communication for different operations in linear algebra, where communication refers to data movement between processors in the parallel case and to data movement between levels of the memory hierarchy in the sequential case. The results obtained to date concern the LU and QR factorizations of dense matrices. We present new algorithms that attain the communication lower bounds (up to polylogarithmic factors), and thus greatly reduce communication relative to conventional algorithms as implemented in the widely used LAPACK and ScaLAPACK libraries. Implementations of the new algorithms on distributed-memory computers lead to important speedups over the algorithms in ScaLAPACK. Our current research focuses on adapting these algorithms to the emerging hierarchical model of clusters of multi-core processors, as expected in future petascale machines.
This is joint work with J. Demmel and M. Hoemmen from UC Berkeley, J. Langou from CU Denver and H. Xiang from University Paris 6.
A key challenge in the field of computer architecture is "balanced" system design, in which computational capability is well-adjusted to the supply of data. In both academe and industry, computer architects are increasingly drawing system roadmaps which predict many-fold increases in raw computational throughput per chip -- hundreds of cores within the next three technology generations. At the same time, CMOS technologists have been warning of the "end of scaling," particularly for six-transistor SRAM. This is a disturbing forecast, since easily 50% of microprocessor silicon area is commonly occupied by SRAM caches. Reconciling these two divergent paths is the topic of this talk.
A particularly long-standing debate has surrounded one dense, resilient, on-chip storage alternative: embedded DRAM. This talk will shed light on the technology causes of the infamous memory wall, provide a tutorial on the technology behind eDRAM, and abstract use of SRAM replacements into the system-level metrics of performance, capacity, and availability.
The parallelization of non-DOALL loops requires explicit synchronization between threads of execution. In this regard, efficient placement of the synchronization primitives (say, post and wait) plays a key role in achieving a high degree of thread-level parallelism (TLP). We propose novel compiler techniques to enhance this placement. Specifically, given a control flow graph (CFG), the proposed techniques place a post as early as possible and a wait as late as possible in the CFG, subject to dependences. We present evidence of the efficacy of our techniques on a real machine, using real code kernels from the SPEC CPU benchmarks, the Linux kernel and other widely used open-source codes. Our results show that the proposed techniques yield significantly higher levels of TLP than the state of the art.
Leakage power dissipation continues to be a problem in L2 caches. Many circuit and architectural techniques have been proposed to mitigate it; in particular, memory-cell leakage has been dealt with quite successfully. However, considerable leakage power is still dissipated in the so-called SRAM peripheral circuits, e.g., decoders, wordline drivers and I/O drivers. This talk will discuss peripheral leakage and techniques to reduce it based on stacked sleep transistors. Two "static" architectural techniques to control this circuit mechanism are described. An adaptive mechanism is then proposed on top of the static techniques and shown to yield high leakage reduction: a 52% average L2 leakage reduction is achieved for the SPEC2K benchmarks.
Multi-Processor Systems-on-Chip (MPSoCs) are increasingly penetrating the consumer electronics market as a powerful, yet commercially viable, solution to the strong and steadily growing demand for scalable, high-performance systems at limited design complexity. Nevertheless, MPSoCs are prone to alarming temperature variations on the die, which seriously decrease their expected reliability and lifetime. Thus, it is critical to develop dedicated design methodologies for multi-core architectures that seamlessly address their thermal modeling, analysis and management. In this seminar, I present modeling and analysis tools for MPSoC architectures. In particular, I describe a novel thermal exploration framework based on a combined HW-SW emulation approach exploiting Field-Programmable Gate Arrays (FPGAs), which enables accurate characterization of the thermal behavior of MPSoCs while being three orders of magnitude faster than state-of-the-art architectural and system simulators. Then, using this thermal exploration framework, I will introduce different HW-based policies for controlling thermal runaway in MPSoCs, based on dynamic frequency and voltage scaling. Finally, I will show how thermal balancing policies can be developed for MPSoCs, combining HW-based temperature control mechanisms with task migration at the operating system level.
HP Labs' COTSon simulator, based on AMD's SimNow, is a full-system simulation infrastructure. It makes it possible to simulate complete systems, ranging from multicore nodes up to full clusters of multicore nodes with complete network simulation. It has a pluggable architecture in which most features can be replaced with your own modules, allowing researchers to use it as their simulation platform. There are tons of simulators; why a new one? COTSon is not just another simulator: it is a simulation infrastructure into which you can plug your own simulation modules. Our holistic approach simulates the whole system at once, because we believe that the multicore, multithreaded architectures of the future cannot be understood without taking into account the whole system, including devices and the complete operating system. Something similar can be said about disk and network research. As a design principle, COTSon trades off accuracy for speed and vice versa, dynamically allowing researchers to focus on the interesting parts of their application as well as to perform large design-space explorations at higher speeds. Why use many tools if one suffices? We hope COTSon becomes the de facto standard simulation infrastructure for next-generation systems simulation, and that is why we are making it freely available upon request. If you belong to any kind of research lab or university and you are interested in microarchitecture, disk, network or system simulation, COTSon may be perfect for you. In this talk, we provide a general description of COTSon and explain the different research challenges and solutions behind the development of our simulation infrastructure. More information about COTSon can be found at http://sites.google.com/site/hplabscotson.
Shrinking transistor sizes and a trend toward low-power processors have caused increased leakage, higher per-device variation and a larger number of hard and soft errors. Maintaining precise digital behavior on these devices grows more expensive with each technology generation. In some cases, replacing digital units with analog equivalents allows similar computation to be performed at higher speed and lower power. The units that can most easily benefit from this approach are those whose results do not have to be precise, such as various types of predictors. We introduce the Scaled Neural Predictor, a highly accurate prediction algorithm that is infeasible in a purely digital implementation but can be implemented using analog circuitry. Our predictor uses current summation to replace the expensive digital dot-product computation required in perceptron predictors. We show that the analog predictor can outperform digital neural predictors because of the reduced cost, in power and latency, of the key computations. The analog neural predictor circuit is able to produce an accuracy equivalent to an infeasible digital neural predictor that requires 128 additions per prediction. The analog version, however, can run in 200 picoseconds, with the analog portion of the prediction computation requiring less than 0.4 milliwatts in a 45 nm technology, which is negligible compared to the power required for the table lookups in this and conventional predictors.
The computer architecture world is currently shifting toward multi-core and many-core systems. This "new" trend already appeared in the past, with only partial success. As we discuss in the talk, a major part of that past failure was the inability to use parallel systems efficiently. In order to avoid repeating past mistakes, the research community needs to provide solutions to several critical issues. In my talk I will give a short historical perspective on similar past trends and highlight a few critical directions that research must address in order to make the new trend a greater success.
The Blue Gene/P system is the current generation of IBM's massively parallel Blue Gene supercomputers, architected for large system scale and significant power efficiency. BG/P succeeds BG/L in the Blue Gene supercomputer line and comes with many enhancements to the machine design as well as new architectural features at the hardware and software levels. In this talk, I will give an overview of the BG/P messaging software stack, with a focus on the Deep Computing Messaging Framework (DCMF) and the Component Collective Messaging Interface (CCMI). DCMF and CCMI have been designed to easily support several programming paradigms, such as the Message Passing Interface (MPI), the Aggregate Remote Memory Copy Interface (ARMCI), Charm++ and others. Besides the production message-passing runtime system designed for HPC applications, I will also discuss some research explorations into using BG/P in non-HPC domains such as financial streaming applications.
The shift from single-core to multi-core architectures means that, in order to increase application performance, programmers must write concurrent, multithreaded programs. Unfortunately, multithreaded applications are susceptible to numerous errors, including deadlocks, race conditions, atomicity violations, and order violations. These errors are notoriously difficult for programmers to debug. This talk presents Grace, a runtime system for safe and efficient multithreading. Grace replaces the standard pthreads library with a new runtime system that eliminates concurrency errors while maintaining good scalability and high performance. Grace works with unaltered C/C++ programs, requires no compiler support, and runs on standard hardware platforms. I will show how Grace can ensure the correctness of otherwise-buggy multithreaded programs and, at the same time, achieve high performance and scalability.