Seminar 2009-10 abstracts

Seminar 09/10 - Abstracts

Stochastic Model of Robust Resource Management for Heterogeneous Parallel Computing Systems
H. J. Siegel
Department of Electrical and Computer Engineering and Department of Computer Science, Colorado State University

What does it mean for a computer system to be robust? How can robustness be described? How does one determine if a claim of robustness is true? How can one decide which of two systems is more robust? Parallel computing systems are often heterogeneous mixtures of machines, used to execute collections of tasks with diverse computational requirements. A critical research problem is how to allocate resources to tasks to optimize some performance objective. However, systems frequently have degraded performance due to uncertainties, such as inaccurate estimates of actual workload parameters. It is important for system performance to be robust against uncertainty. To accomplish this, we present a stochastic model for deriving the robustness of a resource allocation. This model assumes that stochastic (experiential) information is available for a parameter whose actual values are uncertain. The robustness of a resource allocation is quantified as the probability that a user-specified level of system performance can be met. We show how to use this stochastic model to evaluate the robustness of resource assignments and to design resource management heuristics that produce robust allocations. The stochastic robustness analysis approach can be applied to a variety of computing and communication system environments, including parallel, distributed, cluster, grid, Internet, cloud, embedded, multicore, content distribution networks, wireless networks, and sensor networks. Furthermore, the robustness model is generally applicable to design problems throughout various scientific and engineering fields.

Scaling the Memory Wall with Phase Change Memories
Moinuddin Qureshi
IBM T.J. Watson Research Center

DRAM has been the building block for main memory systems for several decades. However, with each technology generation, significant portion of the total system power and the total system cost is spent in the DRAM memory system, and this trend continues make DRAM a less desirable choice for future large-scale memory systems. Therefore, architects and system designers must look at alternative technologies for growing memory capacity. Phase-Change Memory (PCM) is an emerging technology which is expected to be denser than DRAM and can boost memory capacity in a scalable and power-efficient manner. However, PCM has it own unique challenges such as higher read latency (than DRAM), much higher write latency, and limited lifetime due to write endurance. In this talk I will focus on architectural solutions that can leverage the density and power-efficiency of PCM while addressing its challenges. I will propose a Hybrid Memory system that combines PCM-based main memory with a DRAM buffer, thereby obtaining the capacity benefits of PCM and latency benefits of DRAM. I will then describe a simple, novel, and efficient wear leveling technique for PCM memories that obtains near-perfect lifetime while incurring a storage overhead of less than 13 bytes. Finally, I will present extensions to PCM devices than can adaptively cancel or pause write requests to reduce latency of read requests when there is significant contention from the (slow) write requests.

HPC in Phase Change: Towards a New Execution Model
Thomas Sterling
Louisiana State University

HPC is entering a new phase in system structure and operation driven by a combination of technology and architecture trends. Perhaps foremost are the constraints of power and complexity that as a result of the flat-lining of clock rates relies on multicore as the primary means by which performance gain is being achieved with Moores Law. Indeed, for all intense and purposes, multicore is the new Moores Law with steady increases in the number of cores per socket. Added to this are the highly multithreaded GPU components moving HPC into the heterogeneous modality for additional performance gain. These dramatic changes in system architecture are forcing new methods of use including programming and system management. Historically HPC has experienced five previous phase changes involving technology, architecture, and programming models. The current phase of two decades is exemplified by the communicating sequential model of computation replacing previous vector and SIMD models. HPC is now faced with the need for new effective means of sustaining performance growth with technology through rapid expansion of multicore with anticipated structures of hundreds of millions of cores by the end of this decade delivering Exaflops performance. This presentation will discuss the driving trends and issues of the new phase change in HPC and will discuss the ParalleX execution model that is serving as a pathfinding framework for exploring an innovative synthesis of semantic constructs and mechanisms that may serve as a foundation for computational systems and techniques in the Exascale era. This talk is being given just as DARPA is initiating its UHPC program and DOE is launching additional programs such as their X-stack all aimed at catalyzing research in to the challenging area.

Computing as the Third Mode of Scientific and Mathematical Discovery
David H. Bailey
Lawrence Berkeley National Lab, USA

The latest state-of-the-art scientific computer systems have achieved over 1 "Pflop/s" (one million billion floating-point arithmetic operations per second). Scientists have capitalized on this computational power by developing a wide range of sophisticated programs that are becoming so effective that scientific computing is now widely regarded as the third mode of scientific discovery, after theory and experiment.
In other words, the computer has become a virtual laboratory, wherein "experiments" can be performed to explore phenomena that are too complicated, expensive or dangerous to explore by ordinary empirical experiment. Many examples of this methodology will be described, including studies in climate and environmental science, astrophysics, biology, engineering, and mathematics.
Among the examples of such computations, the author will describe some recent research wherein new formulas of mathematics have been discovered, using high-precision numerical computations on state-of-the-art computers. Perhaps the best-known such discovery is a new formula for the mathematical constant pi, which has the curious property that it permits one to calculate digits of pi beginning at an arbitrary starting position in the binary expansion.

A cost-effective load-balancing policy for tile-based, massive multi-core packet processors
Enric Musoll
Consentry Networks

Massive multi-core architectures provide a computation platform with high processing throughput, enabling the efficient processing of workloads with a significant degree of thread-level parallelism found in networking environments.

Communication-centric workloads, like those in LAN and WAN environments, are fundamentally composed of sets of packets, named flows. The packets within a flow usually have dependencies among them, which reduce the amount of parallelism. However, packets of different flows tend to have very few or no dependencies among them, and thus can exploit thread-level parallelism to its fullest extent.

Therefore, in massive tile-based multi-core architectures, it is important that the processing of the packets of a particular flow takes place in a set of cores physically close to each other to minimize the communication latency among those cores. Moreover, it is also desirable to spread out the processing of the different flows across all the cores of the processor in order to minimize the stress on a reduced number of cores, thus minimizing the potential for thermal hotspots and increasing the reliability of the processor. In addition, the burst-like nature of packet-based workloads render most of the cores idle most of the time, enabling large power savings by power gating these idle cores.

This work presents a high-level study of the performance, power and thermal behavior of tile-based architectures with a large number of cores executing flow-based packet workloads, and proposes a load-balancing policy of assigning packets to cores that minimizes the communication latency while featuring a hotspot-free thermal profile.

Minimum intrusion Grid
Brian Vinter
University of Copenhagen

This talk introduces the philosophy behind a new Grid model, the "Minimum intrusion Grid", MiG. The idea behind MiG is to introduce a `fat' Grid infrastructure which will allow much `slimmer' Grid installations on both the user and resource side. The talk presents the ideas behind MiG, some initial designs and finally a status report of the implementation. Grid computing has just topped the hype-curve, and while large demonstrations of Grid middleware exists, including Globus toolkit and NorduGrid ARC, the tendency in Grid middleware these days is towards a less powerful model -- Grid services -- than what was available previously. This reduction in sophistication is driven by a desire to provide more stable and manageable Grid systems. While striving for stability and manageability is obviously right, doing so at the cost of features and flexibility is not so obviously correct. The "Minimum intrusion Grid", MiG, is an attempt to design a new platform for Grid computing which is driven by a stand-alone approach to Grid, rather than integration with existing systems. The goal of the MiG project is to provide a Grid infrastructure where the requirements on users and resources alike, to join Grid, is as small as possible -- thus the minimum intrusion part. While striving for minimum intrusion, MiG will still seek to provide a feature rich and dependable Grid solution.

Value Prediction in Parallel Architectures
Jean-Luc Gaudiot
Department of Electrical Engineering and Computer Science , University of California

The newly emerging many-core-on-a-chip designs have renewed an intense interest in parallel processing. By applying Amdahls formulation to the programs in the PARSEC and SPLASH-2 benchmark suites, we find that most applications may not have sufficient parallelism to efficiently utilize modern parallel machines. The long sequential portions in these application programs are caused by computation as well as communication latency. However, value prediction techniques may allow the parallelization of the sequential portion by predicting values before they are produced. In conventional superscalar architectures, the computation latency dominates the sequential sections. Thus value prediction techniques may be used to predict the computation result before it is produced. In many-core architectures, since the communication latency increases with the number of cores, value prediction techniques may be used to reduce both the communication and computation latency. We extend these ideas by using GPUs to accelerate programs that contain limited parallelism and those that are hard to parallelize.

Adaptive routing in irregular topologies (virtual) in the presence of multiple node and links failures. Analysis and Validation.
Dimiter Avresky
International Research Institute on Autonomic Network Computing (IRIANC)

The talk will discuss the problems related to adaptive routing in irregular topologies in the presence of multiple node and links failures. The proposed techniques will allow to design networks with predictable behavior and to restore functionality of computer system in the presence of anomalies (multiple node and link failures, intrusions, DoS). By implementing an entire software-based virtual network in middleware, we prevent failures in the physical layer from propagating up toward the application layer. We will analyze different virtual network topologies and determine the maximum number of anomalies that can be tolerated without partitioning the network. For lesser anomalies, the system will dynamically reconfigure itself. By monitoring the failure margin, we can also pro-actively save the ongoing computation when the network degrades beyond a certain threshold. Virtual network has been implemented in high-speed clusters and Ethernet environment. The proposed algorithm show improved routing success and performance characteristics (throughput and latency) in comparison with existing techniques.

Clouds and Science: Opportunities and Obstacles
Daniel S. Katz
Computation Institute at the University of Chicago

Clouds offer a number of opportunities for science, but current applications will need to change to take advantage of these opportunities. This talk with use the author's experiences in graduate school, where parallel programming was starting to become mainstream in HPC, and at JPL, where grids were starting to become mainstream for some HPC science, to look at the issues in today's context where clouds are becoming popular, but are not currently used for a high fraction of large-scale science.

Optimizing Application Performance on Petascale Systems with Scalasca
Felix Wolf
German Research School for Simulation Sciences, RWTH Aachen University

Driven by application requirements and accelerated by current trends in microprocessor design, the number of processor cores on modern supercomputers grows from generation to generation. This imposes scalability demands not only on applications but also on the software tools needed for their development. The open-source toolset Scalasca has been designed to analyze the performance behavior of parallel applications specifically on such large-scale systems. A distinctive feature is its ability to identify program wait states, which often present a major challenge in achieving satisfactory performance especially when trying to scale communication-intensive applications to large processor counts.
In this talk, we review the current toolset architecture, emphasizing its scalable design and the role of the different components in transforming raw measurement data into useful knowledge of execution behavior. The scalability and effectiveness of Scalasca are surveyed from experience in measuring and analyzing real-world applications on a range of computer systems.

Portable, Scalable, per-Core Power Estimation for Intelligent Resource Management
Sally A. McKee
Chalmers University of Technology

Power and temperature have joined performance as first-order design constraints. Balancing power efficiency, thermal constraints, and performance requires some means to convey data about real-time power consumption and temperature to intelligent resource managers.
Resource managers can use this information to meet performance goals, maintain power budgets, and obey thermal constraints. Unfortunately, obtaining the required machine introspection is challenging. Most current chips provide no support for per-core power monitoring, and when support exists, it is not exposed to software.
We present a methodology for deriving per-core power models using sampled performance counter values and temperature sensor readings. We develop accurate, application-independent models for several CMPs, and show how they can be used to guide scheduling decisions in power-aware resource managers enforcing a specified power envelope.

Towards energy-aware autonomic distributed systems
Jean-Marc Pierson
University Paul Sabatier

In this presentation, we will explore some approaches for reducing the energy consumption of distributed infrastructure. We will detail a decision algorithm for task allocation taking into account energy consumption of tasks and hosts and tasks resources requirements. Second, we will show how an autonomic approach can help to dynamically adjust the system to deliver with less energy consumption still a good quality of services, and its potential integration in an autonomic computing platform. Finally we will briefly describe the other directions of research held in the research group around mathematical modeling, virtual machines manipulation, and so on.

Studying Microarchitectural Structures with Object Code Reordering
Daniel Á. Jiménez
Department of Computer Science, The University of Texas at San Antonio

Modern microprocessors have many microarchitectural features. Quantifying the performance impact of one feature such as dynamic branch prediction can be difficult. On one hand, a timing simulator can predict the difference in performance given two different implementations of the technique, but simulators can be quite inaccurate. On the other hand, real systems are very accurate representations of themselves, but often cannot be modified to study the impact of a new technique.
We demonstrate how to develop a performance model for branch prediction using real systems based on object code reordering. By observing the behavior of the benchmarks over a range of branch prediction accuracies, we can estimate the impact of a new branch predictor by simulating only the predictor and not the rest of the microarchitecture.
We also use the reordered object code to validate a reverse-engineered model for the Intel Core 2 branch predictor. We simulate several branch predictors using Pin and measure which hypothetical branch predictor has the highest correlation with the real one.
This study in object code reordering points to way to future work on estimating the impact of other structures such as the instruction cache, the second-level cache, instruction decoders, indirect branch prediction, etc.

Parallel Computing in Pisa: Structured Parallel Programming, Fault Tolerance, Adaptivity and Dynamicity
Carlo Bertolli
University of Pisa

In this talk I will describe the research results obtained in the last decade by the "High-Performance Parallel Programming Research Group" at the University of Pisa. The keypoint of the research has been based on structured parallel programming (e.g. algorithmic skeletons), which has been used to: derive innovative and optimized checkpointing and rollback recovery techniques; introduce dynamicity mechanisms in parallel application support, allowing their fast re-configuration; introduce adaptivity strategies to support complex High-Performance applications on pervasive grids. In the talk I will describe the hypotheses on which these research work are based, and I will show how these are used to derive the mentioned results.

Chapel, the Cascade High-Productivity Language
Brad Chamberlain
Cray

Chapel is a new programming language being developed by Cray Inc. as part of the DARPA-led High Productivity Computing Systems program (HPCS). Chapel strives to increase productivity for supercomputer users by supporting higher levels of abstraction compared to current parallel programming models while also supporting the ability to optimize to performance that meets or surpasses current techn ologies. Chapel is designed for portability -- from desktop multicore workstations to commodity clusters to the high-end machines developed by Cray and our competitors. In this talk, I will provide an overview of the Chapel language, including motivating philosophies and recent work on user-defined data distributions. I'll also mention several opportunities for collaboration and future work.

Efficient Resource Management for Large Scale Parallelism
Christos Kozyrakis
Stanford University

Multi-core chips will soon include hundreds of cores, support thousands of hardware threads, and feature deep memory hierarchies with non-uniform latency characteristics. To maximize efficiency from such systems, we must carefully manage resources (cores, memory, and interconnect) in a manner that balances performance and power consumption, improves both load balance and locality, and minimizes management overheads. This talk will present early work on resource management for large-scale multi-core systems at the Pervasive Parallelism Lab (PPL) in Stanford University. PPL is investigating software and hardware techniques for pervasive parallelism based on programs written in domain specific languages (DSLs). First, we will discuss how to scale a user-level runtime environment for a DSL to hundreds of cores and NUMA latencies. We will show that by carefully reconsidering the algorithms and data-structures, we can improve speedup by up to 19x for highly parallel applications. Second, we will discuss separation of functions and interfaces between user-level runtimes and the operating system. We will also show how to overcome scalability issues in virtual memory operations for shared-memory operating systems like Linux. Finally, we will describe simple hardware support for fine-grain parallelism that allows for the development of low-overhead, software-mostly runtime systems that scale efficiently to hundreds of hardware threads. We will show that the proposed runtimes can exceed the performance of hardware-only scheduling by up to a factor of 2x.

That's all folks!!!!!