Difference between revisions of "DistOS-2011W General Purpose Frameworks for Performance-Portable Code"

From Soma-notes
Jump to navigation Jump to search
Line 52: Line 52:
==STAPL==
==STAPL==


[[File:stapl.png|300px|thumb|left|An overview of the framework for adaptive algorithm selection in STAPL.]]
Thomas ''et. al'' introduced a general framework for adaptive algorithm selection called the Standard Template Adaptive Parallel Library (STAPL) [5]. They realized that often times there are many algorithms that can solve a problem, all of which perform differently depending on many factors and deciding which one to use was often a difficult problem. This problem only gets worse in distributed settings. These issues are direct contributors to the problem that we have been trying to address in this paper, the problem of writing portable code that performs well on a variety of hardware architectures and for various input sizes. Thomas ''et. al'' thought that the programmer should only indicate which operation to be performed and it should be up to the system to decide which algorithm to use. The main focus of STAPL is on the high-level selection of an algorithm to solve a given problem from a set of functionally equivalent algorithms. This section will give a overview of how the STAPL framework works as well as a brief summary of some performance results.
Thomas ''et. al'' introduced a general framework for adaptive algorithm selection called the Standard Template Adaptive Parallel Library (STAPL) [5]. They realized that often times there are many algorithms that can solve a problem, all of which perform differently depending on many factors and deciding which one to use was often a difficult problem. This problem only gets worse in distributed settings. These issues are direct contributors to the problem that we have been trying to address in this paper, the problem of writing portable code that performs well on a variety of hardware architectures and for various input sizes. Thomas ''et. al'' thought that the programmer should only indicate which operation to be performed and it should be up to the system to decide which algorithm to use. The main focus of STAPL is on the high-level selection of an algorithm to solve a given problem from a set of functionally equivalent algorithms. This section will give a overview of how the STAPL framework works as well as a brief summary of some performance results.



Revision as of 22:47, 28 February 2011

PDF available here

Introduction

The overall computer industry has changed drastically in the last 5-10 years. With multicore computing we no longer see the pattern of single chips with increasing clock speeds being released, we see the number of cores on a given chip increase as time goes on. In spite of this, Moore's law persists which lets us expect machines with hundreds if not thousands of individual cores in the not so distant future. Because of this we can also expect the physical makeup of individual machines to vary significantly and possibly be made up of heterogeneous smaller computational components. As described in the Berkeley View [2] this poses an enormous problem to the performance portability of code, as performance and portability are classically conflicting objectives when it comes to software generation. This is obviously a major concern in distributed computing as well since it is essential to have efficient code throughout the system, regardless of what physical architecture it resides on. For this reason, having code that can 'auto-tune' itself to achieve near optimal performance on any given hardware architecture will be a key issue moving forward within the distributed computing community and will be absolutely essential if the dream of a true distributed operating system running at internet scale is ever to be realized.

Many problems within the world of linear algebra have auto-tuned libraries which achieve near-optimal performance on a wide range of machines. One such system is ATLAS [9], which is presented here as an example. The problem with these libraries are that they are not general enough to service all of the high performance computing needs of the scientific community and the optimizations they employ are too specific to be applied to other problems. To combat this, a new area of general purpose frameworks for performance-portable code is emerging. The purpose of this literature review is to give the reader an overview of the most popular such frameworks that exist today, give an idea of how they work and briefly discuss their overall performance.

Motivation

The main motivation behind this work was presented in the The Landscape of Parallel Computing Research: A View from Berkeley [2]. This document discusses the recent switch of the computer hardware world to multicore processors and the drastic changes that need to be implemented in the software world to be able to adapt to the next generation of computers. The scope of this paper is much more than can be covered here, so the only important aspects that motivate our later discussion will be looked at.

Moore's law describes a trend in computer hardware where the number of transistors that could affordably be put on a chip doubled every 18 months to two years. In terms of processors, this would mean that one could expect faster and faster sequential computers to come out on a regular basis. This lead to the conventional wisdom that there was no real need for major optimizations or to parallelize sequential programs because they would simply run faster once a new generation of processors came out. However, due to many factors, the performance of individual processors have not been meeting the exponential improvements that they have in the past. In spite of this, Moore's law persists. Instead of seeing faster and faster individual cores, we now see the number of processor cores on a chip going up. It is currently not uncommon for a personal computer to have 2, 4 or even 8 processor cores. As this trend continues we can expect chips with thousands of individual cores on a single chip in the not so distant future. On top of this, it is not hard to imagine that the physical makeup of individual machines will vary significantly and possibly be made up of heterogeneous smaller computational components as well.

This obviously introduces some major problems into the world of computer software. Software vendors can no longer rely on faster individual cores to improve the performance of their sequential programs. Many already parallelized programs will not be able to scale as the number of cores increase dramatically, meaning they will effectively reach a performance barrier as well. Another equally, if not more urgent, problem is the issue of code performance portability. Currently, we write code and compile it to be efficient on a given machine. When this code needs to be run on a different machine, it is possible that significant changes must be made to the code and compilation process to get near optimal performance. Obviously this is undesirable as it demands a significant amount of time to adjust and tune the code by hand by highly trained experts. As the variety and complexity of heterogeneous systems increase, the time needed to hand tune code for specific machines becomes unmanageable. This is a key concern within distributed systems as overall performance is largely influenced by how quickly individual machines can execute code. If only a small number of machines have optimal code, then either the operating system will always try to use these machines for specific tasks, or it will run the tasks on machines with suboptimal implementations of the given code. Either way this results in poor overall performance.

One potential solution to these problems are new programming models and auto-tuning frameworks. Programming models need to allow the programmer to balance their productivity with implementation efficiently. One key way to do this is to abstract the underlying architecture from the programmer. This will allow the programmer to worry about implementing their program and leave the low level details to the system. However, regardless of the programming model used, the overall performance will be determined by the quality of the generated code, which is currently done by the compiler. The compiler is responsible for the optimization selection and then choosing the parameters for these optimizations as well as choosing between different alternative implementations of library kernels. This search space is often too large for a compiler to be able to make all of the right choices all of the time on every different computer architecture. Also, it is very difficult to add new optimizations into existing compilers and is often not worth the time needed to implement them. Because of this, most optimal solutions have to be manually tuned for each different system, taking a significant human investment. This is where auto-tuners step in. An auto-tuned algorithm is one that effectively searches through the optimization search space (either at compile time or run time) to produce efficient implementations on a wide array of hardware architectures. Auto-tuners have become popular in the last few years to produce highly portable, efficient scientific code. Many of these auto-tuners do this by generating many different algorithm variations of a given problem and then benchmark each one on a specific machine. The search also tries many combinations of the optimizations available through the compiler. This is a lengthy process, sometimes lasting several hours, but it only needs to be performed once at install time. The resulting implementation is often significantly faster than a naive implementations and sometimes even produces code faster than vendor supplied libraries. This is accomplished because the auto-tuner searches the optimization space in ways that a person would not, often finding unusual and unexpected optimization combinations. In the next section, an example of an auto-tuner implementing linear algebra libraries will be presented.

An Example of an Auto-tuned Algorithm: ATLAS

ATLAS (Automatically Tuned Linear Algebra Software) [9] is a software package that was produced to provide a portable and efficient implementation of the Basic Linear Algebra Subprograms (BLAS), a widely used linear algebra library. ATLAS is regarded as some of the earlier pioneering work in the field of auto-tuning algorithms, still being used by the scientific community as well as other researchers as a performance benchmark today.

BLAS are building block methods for performing basic linear vector and matrix operations. It has three logical levels. Below these levels are described as well as a brief summary of what ATLAS tries to do to make their code more efficient at each level.

  • Level 1 BLAS
This level performs vector-vector operations. At this level it is not possible to process the vectors to be able to reuse memory (as it is with matrix operations), and is expected to only achieve a roughly 0-15% speedup over untuned code. Generally the compiled can do a good job of optimizing this code so not many extra optimizations are performed by ATLAS.
  • Level 2 BLAS
This level performs matrix-vector operations. At this level using appropriate cache and register blocking can drastically reduce the references that must be satisfied out of main memory. By ensuring that certain segments of vectors and matrices remain in registers or in cache for a significant amount of time, one can expect a modest 10-300% speedup at this level.
  • Level 3 BLAS
This level performs matrix-matrix operations. Using major-block format for the matrices (with the correct size blocks) allows for a minimization of cache thrashing as well as maximizing the amount of cache that gets reused. Another important aspect of these operations is deciding on the order in which the matrices are computed in, especially if you are dealing with relatively small matrices. If not, a relatively complex partitioning of the matrices can ensure maximum memory reuse, which will in-turn produce significant performance improvements. A lot of significant details are being omitted here, but suffice to say that ATLAS achieves orders of magnitude of speedup at this level.

At install time ATLAS runs a series of benchmarks to calculate the various cache and main memory sizes as well as to determine various latency timings within the given system. ATLAS also probes the system to gain information regarding the floating point units of the system. On top of this, ATLAS explores other optimizations such as loop unrollings and blocking factors and applies the best ones found.

Although many of the details of ATLAS have been omitted here, it is hoped that this gives the reader an idea of what an auto-tuned algorithm tries to accomplish. Auto-tuning algorithms have been very successfully in the area of linear algebra (see [3, 4, 7] for other examples), however little has been done on non-numerical problems. One potential reason for this is the investment it takes to produce an efficient, portable auto-tuned algorithm. ATLAS and systems like it are certainly not trivial and are the result of many years of research and development. Another issue is that many other problems do not have the regular and pre-known data qualities that are common to many linear algebra problems. It is for these reasons that recently more general purpose approaches have taken off.

General Purpose Frameworks for Performance-Portable Code

As stated in the previous section, there has been a significant amount of work put in to auto-tuning algorithms related to linear algebra problems but not much work done outside of this. To combat this, some researchers looked at the problem of portable performance from another angle. Instead of focusing on a single problem and building an auto-tuned algorithm to solve that problem efficiently on a wide variety of computer architectures, we are starting to see the emergence of general purpose frameworks for performance-portable code. Generally speaking, these frameworks allow the programmer to define the problem that they are trying to solve without explicitly worrying about specific performance details. The framework would then have some kind of auto-tuner to then tune the code to allow for an efficient implementation on a given machine. Although this area is relatively small, there are already a handful of very powerful frameworks that allow for performance-portable code. In the following subsections an overview of the available frameworks will be given.

ADAPT

An overview of the ADAPT framework for high-level adaptive program optimization.

ADAPT [8] represents one of the earlier general purpose frameworks we will look at. It was realized that with the convergence of modern computing technology that it was possible to produce a single executable program for a wide range of compatible but still wildly varied computer systems. These differences can come from different memory sizes and hierarchies, processors and networking capabilities, all of which can have a significant impact on performance if not taken into consideration by the running program. To overcome this issue, Voss et. al proposed APAPT, a system that supports dynamic code compilation with parameterization while allowing the user to search the the optimization space to achieve near optimal performance.

ADAPT has two main components, one being compile-time and the other being a run-time component. ADAPT provides its own language (ADAPT Language, or AL for short) for the users to define their optimization heuristics to be applied by the ADAPT run-time system. An example of an optimization heuristic written in AL was given in the paper, and it was for a dynamic loop unroller. Using this language it is possible to code a very wide range of optimization heuristics to help optimize the application code. The ADAPT compiler reads the application code as well as user-defined optimization heuristics and generates a run-time system for applying these these heuristics dynamically to the application. To do this, the compiler first identifies code pieces that are candidates for optimization. The compiler then generates a run-time system based on the candidate code sections as well as the user supplied optimization heuristics. This run-time system consists of the original application code with new control paths for each optimization candidate. For each candidate, the first path represents the best known version of the code (using whatever optimizations have been applied in the past) and the second path represents an experimental version. Every time a new experimental version is created, it is run at the next opportunity (the next time that piece of code is called). The experimental version is monitored and then its performance data is fed back into the optimizer. This allows the ADAPT optimizer to successively swap in better and better performing pieces of code and represents a key component of the overall ADAPT framework.

The ADAPT optimizer consists of two parts, a local optimizer as well as a remote optimizer. The local optimizer identifies hotspots within the code, representing the most time consuming sections. The local optimizer will then communicate this information to the remote optimizer which has the source code of each candidate, a description of the target machine, the user specified optimization heuristics and tools and compilers that can be used to optimize the code. The remote optimizer then optimizes the code with all of the tools it has available to it and when a new code variant is produced, it is sent back to the local optimizer to be dynamically linked into the running application to be run and evaluated at the next opportunity. The more that a section candidate code is run, the faster it gets over time due to this process. ADAPT relies on this fact to achieve peak performance in its applications. An overview of the ADAPT framework can be seen provided figure.

The ADAPT framework was then tested on two different computer systems by running six different experiments. These experiments covered different applications, including floating point benchmarks, kernels for matrix-matrix multiplication and a finite differencing solver. These experiments give an idea of the variety of techniques that can be implemented using the ADAPT framework. It was found that, on average, the ADAPT system offers a better solution than static optimization tools. In general, ADAPT was found to offer the best performance with the least risk of degradation in all of the tests performed.

STAPL

An overview of the framework for adaptive algorithm selection in STAPL.

Thomas et. al introduced a general framework for adaptive algorithm selection called the Standard Template Adaptive Parallel Library (STAPL) [5]. They realized that often times there are many algorithms that can solve a problem, all of which perform differently depending on many factors and deciding which one to use was often a difficult problem. This problem only gets worse in distributed settings. These issues are direct contributors to the problem that we have been trying to address in this paper, the problem of writing portable code that performs well on a variety of hardware architectures and for various input sizes. Thomas et. al thought that the programmer should only indicate which operation to be performed and it should be up to the system to decide which algorithm to use. The main focus of STAPL is on the high-level selection of an algorithm to solve a given problem from a set of functionally equivalent algorithms. This section will give a overview of how the STAPL framework works as well as a brief summary of some performance results.

STAPL is, at it's core, a library of ISO Standard C++ components with interfaces similar to the regular ISO C++ standard library. It's underlying infrastructure consists of platform independent and platform dependent components. The core library consists of pAlgorithms (parallel algorithms, parallel equivalent of an STL algorithm) and pContainers (distributed data structures). STAPL programs can adapt to the underlying architecture through the use of a number of different algorithms that have the same interface and functionality but perform differently depending on the input data and underlying hardware. When STAPL is installed on a target system it collects information about the system from various standard header files and vendor system calls. Information regarding available memory, cache size, number of processors, etc. is gathered and stored in an internal data repository. At this point, a series of installation benchmarks for the algorithmic options available in STAPL are run and the resulting data is also stored in the data repository. The next step involves using machine learning techniques to analyze the data. Three different empirical learning approaches were implemented within the STAPL framework: a decision tree learner (used most often), a standard back propagation neural network and a naive Bayes classifier. All of the learners share a common input format and interface for querying for predictions, so they can be easily interchanged at any time. Once the prediction models are created they can be used, in conjunction with some small run-time tests, to predict which implementation of a given algorithm (with respect to a given input) should be used. These decisions are then monitored and fed back into the data repository so they can influence decisions made in the future. An overview of the STAPL framework is given in the figure provided.

One of the tests done on STAPL was on parallel sorting. In this test, they considered three popular yet very different parallel sorting algorithms: sample sort, radix sort and column sort. The details of each individual sorting algorithm are omitted here, but suffice to say that each one behaves very differently when the input and underlying hardware are varied. It was found that metrics such as the number of processors, cache size, number of elements in the input and the data type of these elements were important attributes of a given problem instance. Another metric that was also calculated at run-time was the degree of input presortedness, which can have a significant effect on the running time of the different sorting algorithms. All of this information was used by the learners to predict which algorithm to use. This test was done on three very different architectures and in all instances the decision tree prediction model performed best. Overall, STAPL was found to use the correct algorithm over 95% of the time on average. They also compared the adaptive solution provided by STAPL to a standalone version of the fastest algorithm for each input and found that the overhead incurred by STAPL was relatively small.

POET Transformation Engine

PetaBricks

Active Harmony and CHiLL

Conclusion

As the pace of hardware evolution continues to grow, the problem of portable, efficient code will grow as well. This will become a key challenge to the distributed computing community, who generally want to build very large systems made up of very diverse, heterogeneous parts. For these systems to be successful, they need to ensure that performance critical kernels will be efficiently implemented across their system to allow for the high level of performance expected from the user. The promise of auto-tuning algorithms has been realized for a small number of numerical problems, such as ATLAS for the BLAS problems. The main issue with these auto-tuners is that the optimizations they implement are generally domain specific and therefore difficult to generalize to other problems. Presented here was an overview of the emerging field of general purpose frameworks for performance-portable code. Although this area is relatively small, there are already some very mature frameworks which can allow for very versatile and efficient implementations of a wide range of algorithms. Although these systems all work slightly differently, they all strive for the same thing: the ability to allow the programmer to describe the algorithm they wish to implement in an abstract way to allow it to adapt to any given physical architecture. This, in turn, will result in an efficient implementation of a given algorithm anywhere. The overall importance of these frameworks will only grow as our computer architectures continue to evolve and become more complex.

References

[1] Jason Ansel, Cy Chan, Yee Lok Wong, Marek Olszewski, Qin Zhao, Alan Edelman, and Saman Amarasinghe. Petabricks: a language and compiler for algorithmic choice. In Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation, PLDI ’09, pages 38–49, New York, NY, USA, 2009. ACM.

[2] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick. The landscape of parallel computing research: A view from berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.

[3] Jee W. Choi, Amik Singh, and Richard W. Vuduc. Model-driven autotuning of sparse matrix-vector multiply on gpus. In Proceedings of the 15th ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP ’10, pages 115–126, New York, NY, USA, 2010. ACM.

[4] James Demmel, Jack Dongarra, Viktor Eijkhout, Erika Fuentes, Antoine Petitet, Richard Vuduc, R. Clint Whaley, and Katherine Yelick. Self-adapting linear algebra algorithms and software. Proc. IEEE, 93(2):293–312, February 2005.

[5] Nathan Thomas, Gabriel Tanase, Olga Tkachyshyn, Jack Perdue, Nancy M. Amato, and Lawrence Rauchwerger. A framework for adaptive algorithm selection in stapl. In Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP ’05, pages 277–288, New York, NY, USA, 2005. ACM.

[6] A. Tiwari, Chun Chen, J. Chame, M. Hall, and J.K. Hollingsworth. A scalable auto-tuning framework for compiler optimization. In Parallel Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, pages 1 –12, May 2009.

[7] Sundaresan Venkatasubramanian, Richard W. Vuduc. Tuned and wildly asynchronous stencil kernels for hybrid cpu/gpu systems. In Proceedings of the 23rd international conference on Supercomputing, ICS ’09, pages 244–255, New York, NY, USA, 2009. ACM.

[8] Michael J. Voss and Rudolf Eigemann. High-level adaptive program optimization with adapt. SIGPLAN Not., 36:93–102, June 2001.

[9] Clint Whaley, Antoine Petitet, and Jack J. Dongarra. Automated Empirical Optimization of Software and the ATLAS Project. Parallel Computing, 27, 2000.

[10] Qing Yi and R. Clint Whaley. Automated transformation for performance-critical kernels. In Proceedings of the 2007 Symposium on Library-Centric Software Design, LCSD ’07, pages 109–119, New York, NY, USA, 2007. ACM.