Workshop on Challenges for Parallel Computing

October 20, 2005
Sheraton Parkway Toronto North Hotel and Convention Centre,
Richmond Hill, Ontario, Canada
Associated with CASCON 2005

Chairs:	Kit Barton, University of Alberta ( cbarton at cs.ualberta.edu )
	Alejandro Duran, Universitat Politècnica de Catalunya ( aduran at ac.upc.edu )

Workshop abstract

Parallel computing has evolved significantly over the last few years and is now being used in many non-traditional environments. These new environments include, but are not limited to, desktop computers containing more than one processor and new processor technology that supports the ability to run more than one thread of execution (hyperthreading, simultaneous multithreading). At the other end of the scale, massively parallel machines are being built, posing new challenges of their own.

This workshop discussed the new challenges facing parallel computing. Parallel computing users presented problems they found in new parallel systems. Leading researchers presented the work they were doing to address these problems.

Program

Parallelization Challenges for the Cell Broadband Engine. Kevin O'Brien, IBM Research
A Multi-Level Adaptive Loop Scheduler for Power 5 Architecture. Yun Zhang, University of Toronto
Controlling parallelization in the IBM XL Fortran and C/C++ parallelizing compilers. Priya Unnikrishnan, IBM Toronto Lab
Limits of Parallel/Distributed Computing. Erik Dirkx, Vrije Universiteit Brussel
Threading Maya. Martin Watt, Alias Systems Corp.

Presentations

Parallelization Challenges for the Cell Broadband Engine
Kevin O'Brien, IBM Research

Programming the SPE processor is enhanced by the availability of an optimizing compiler which supports SIMD intrinsics and automatic simdization. But it is through the exploitation of heterogeneous parallelism at the system level that the full performance potential of the machine can be realized. However, automatic code generation for the CELL Processor, the coupled PPE and 8 SPE processors, is a complex task, requiring partitioning of an application to accommodate the memory constraints of the SPE, parallelization across multiple SPEs, orchestration of the data transfer through DMA, and compiling for two ISAs. We have developed a research prototype compiler which manages this complexity and exploits this parallelism by means of OpenMP directives. The OpenMP standard defines a shared memory model for parallelization and we have successfully implemented this on the complex memory subsystem of the CELL processor. Our parallel implementation currently uses OpenMP pragmas to guide parallelization decisions, but the techniques we describe can be extended to support other parallel programming paradigms such as UPC or automatic parallelization.

The performance measurements produced with this compiler demonstrate that this is a promising approach for programming the CELL architecture. In this talk we will discuss the feasibility of developing a fully or semi-automatic approach to compiling for the CELL architecture, drawing on our experience with the single-source OpenMP compiler.
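The partitioning step mentioned in the abstract (fitting work into the SPE's limited memory and spreading it over several SPEs) can be illustrated with a small sketch. This is not the prototype compiler's algorithm; the function `plan_tiles`, its parameters, and the round-robin assignment are all assumptions made for illustration, and the byte budget is an input rather than a statement about real hardware.

```python
def plan_tiles(n_elements, element_bytes, local_store_bytes, n_spes, buffers=2):
    """Split a 1-D loop into tiles that fit a software-managed local memory.

    Hypothetical helper: `buffers=2` models double buffering, where one
    tile is being computed on while the next one is being transferred.
    Returns {worker_index: [(start, end), ...]} with half-open ranges.
    """
    tile_elems = local_store_bytes // (buffers * element_bytes)
    if tile_elems == 0:
        raise ValueError("budget too small for one element per buffer")
    tiles = [(start, min(start + tile_elems, n_elements))
             for start in range(0, n_elements, tile_elems)]
    # Round-robin assignment of tiles to workers (an assumed policy).
    return {spe: tiles[spe::n_spes] for spe in range(n_spes)}


if __name__ == "__main__":
    for spe, tiles in plan_tiles(1000, 4, 1024, 2).items():
        print(spe, tiles)
```

Each tile corresponds to one transfer into local memory, so the tile size bounds the data resident on a worker at any moment.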

A Multi-Level Adaptive Loop Scheduler for Power 5 Architecture
Yun Zhang, University of Toronto

Simultaneous Multi-Threading (SMT) processors are now available in commodity workstations and servers. This technology is designed to increase throughput by executing multiple concurrent threads on a single physical processor. IBM introduced SMT technology into their Power 5 architecture. The multiple threads executing on an SMT share the processor's functional units and on-chip memory hierarchy in an attempt to make better use of idle resources. Most OpenMP applications have been written assuming a Symmetric Multiprocessor (SMP), not an SMT model.

Threads executing on the same physical processor interact through data locality and resource sharing in ways that do not occur on traditional SMPs. This work focuses on tuning the behavior of OpenMP applications executing on the Power 5 architecture with SMT processors. We propose a multi-level hierarchical OpenMP loop scheduler that can be configured to apply different schedulers at each architectural level.

Using this approach, load balance can be maintained across multi-chip modules and cores, while encouraging better data locality among co-located threads. We will first present experiments performed on Intel Xeon HyperThreaded processors, and then discuss our initial experiences with an IBM Power 5 system.
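As a rough illustration of scheduling at two architectural levels, the sketch below splits a loop into contiguous blocks per core and then deals small chunks of each block cyclically to the threads sharing that core, so co-located threads touch neighboring iterations. The choice of a static policy at the core level and a cyclic policy at the thread level is an assumption, as are the names `split_evenly` and `two_level_schedule`; the scheduler described in the talk is configurable and may differ.

```python
def split_evenly(start, end, parts):
    """Split [start, end) into `parts` contiguous ranges, sizes differing by at most one."""
    base, extra = divmod(end - start, parts)
    ranges, lo = [], start
    for p in range(parts):
        hi = lo + base + (1 if p < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def two_level_schedule(n_iters, n_cores, threads_per_core, chunk):
    """Return {(core, thread): [(start, end), ...]} for a loop of n_iters iterations.

    Level 1 (assumed static): one contiguous block of iterations per core.
    Level 2 (assumed cyclic): chunks of the block alternate between the
    threads that share the core.
    """
    schedule = {(c, t): [] for c in range(n_cores) for t in range(threads_per_core)}
    for core, (c_lo, c_hi) in enumerate(split_evenly(0, n_iters, n_cores)):
        for i, lo in enumerate(range(c_lo, c_hi, chunk)):
            thread = i % threads_per_core
            schedule[(core, thread)].append((lo, min(lo + chunk, c_hi)))
    return schedule


if __name__ == "__main__":
    for key, chunks in sorted(two_level_schedule(16, 2, 2, 2).items()):
        print(key, chunks)
```

Swapping the policy at either level (for example, handing out core-level blocks dynamically) changes only the corresponding loop, which is the sense in which such a scheduler is hierarchical.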

Controlling parallelization in the IBM XL Fortran and C/C++ parallelizing compilers
Priya Unnikrishnan, IBM Toronto Lab

The compiler plays a significant role in the performance and scalability of parallel applications. The IBM XL Fortran and VisualAge C/C++ parallelizing compilers implement the OpenMP standards and also perform automatic parallelization. The two factors that most affect parallel performance are the appropriate selection of parallel code and the controlled parallelization of the selected code. Selecting parallel code in an application has been discussed widely. Most compilers apply two broad classes of tests for selecting parallel code: a) parallelism analysis (Is it safe to parallelize?) and b) cost-based analysis (Is it worthwhile to parallelize?). However, appropriate selection alone does not ensure good parallel performance. Parallel performance of an application is largely dependent on the amount of work available as well as the number of threads/processors used to execute it. Controlled parallelization aims to restrict parallelization by determining the optimal number of processors required for parallelization depending on the amount of work and other compile-time and runtime values. Controlled parallelization does not apply just to automatic parallelization. It can be applied to user-parallel code as well. This talk discusses some of the issues involved in controlling parallelization and the performance benefits obtained by applying it to the SPEC2000FP and SPECOMP2001 benchmark suites.
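The idea of limiting the thread count to what the available work can justify can be sketched with a deliberately simple cost model. The function `choose_num_threads` and the threshold `min_work_per_thread` are hypothetical and are not taken from the XL compilers; a real compiler would combine compile-time estimates with runtime values in ways the abstract does not detail.

```python
def choose_num_threads(work_units, min_work_per_thread, max_threads):
    """Pick a thread count so that each thread gets at least `min_work_per_thread`.

    Assumed model: parallel overhead is only worth paying when every thread
    receives enough work. Falls back to 1 (serial execution) when there is
    too little work to share.
    """
    affordable = work_units // min_work_per_thread
    return max(1, min(max_threads, affordable))


if __name__ == "__main__":
    for work in (100, 5_000, 1_000_000):
        print(work, choose_num_threads(work, 1_000, 8))
```

In OpenMP terms, a decision of this kind resembles what a programmer can express by hand with the `if` and `num_threads` clauses on a parallel construct.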

Limits of Parallel/Distributed Computing
Erik Dirkx, Vrije Universiteit Brussel

The ubiquity and proliferation of parallel and distributed computer systems result in a waste of human and financial resources due to the duplication and compartmentalization of work. Granularity variability in hardware and software is one of the key tools to address these issues. A general theoretical model and experimental results from a number of application types are presented.

Threading Maya
Martin Watt, Alias Systems Corp.

Maya is a 3D modelling, animation, effects and rendering package aimed at film and video artists, game developers, visualization professionals, and Web and print designers. Maya customers always push the software to its limits and continually demand higher performance. With the recent release of multi-core desktop processors, and the lack of further improvements in single processor speeds, there is now a requirement to thread key parts of the application in order to continue to raise the performance levels.

This presentation discusses the work done in threading some parts of Maya for the most recent release of the software, and covers some of the problems encountered and the solutions applied. Compiler issues and cross-platform difficulties are also discussed. The main work focused on threading the Fluids component of Maya. This is a Navier-Stokes solver used to simulate natural effects such as clouds, smoke, fire and explosions.
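Grid-based solvers are often threaded by giving each thread a band of rows. The sketch below shows that decomposition on a generic four-neighbor relaxation step, which is an assumed stand-in and not the algorithm used in Maya Fluids; the function names are hypothetical. Reading from one grid and writing to a separate one lets the bands be updated independently.

```python
from concurrent.futures import ThreadPoolExecutor


def relax_rows(src, dst, row_lo, row_hi):
    """Write the 4-neighbor average of src into dst for rows [row_lo, row_hi)."""
    width = len(src[0])
    for r in range(row_lo, row_hi):
        for c in range(1, width - 1):
            dst[r][c] = 0.25 * (src[r - 1][c] + src[r + 1][c]
                                + src[r][c - 1] + src[r][c + 1])


def threaded_relax_step(src, n_threads):
    """Run one relaxation step with interior rows split into bands, one task per band."""
    height = len(src)
    dst = [row[:] for row in src]  # boundary cells keep their values
    interior = height - 2
    rows_per_band = max(1, -(-interior // n_threads))  # ceiling division
    bands = [(lo, min(lo + rows_per_band, height - 1))
             for lo in range(1, height - 1, rows_per_band)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(relax_rows, src, dst, lo, hi) for lo, hi in bands]
        for f in futures:
            f.result()  # re-raise any exception from a worker
    return dst


if __name__ == "__main__":
    grid = [[0.0] * 5 for _ in range(5)]
    grid[2][2] = 4.0
    for row in threaded_relax_step(grid, 3):
        print(row)
```

In standard CPython the global interpreter lock prevents a real speedup for pure-Python loops like this one, so the sketch illustrates the data decomposition rather than the performance gain.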