DistOS-2011W General Purpose Frameworks for Performance-Portable Code
Introduction
The computer industry has changed drastically over the last five to ten years. With the shift to multicore computing, we no longer see single chips released with ever-increasing clock speeds; instead, the number of cores on a given chip grows over time. Moore's law nevertheless persists, which leads us to expect machines with hundreds, if not thousands, of individual cores in the not-so-distant future. We can also expect the physical makeup of individual machines to vary significantly, possibly comprising heterogeneous smaller computational components. As described in the Berkeley View [2], this poses an enormous problem for the performance portability of code, since performance and portability are classically conflicting objectives in software development. This is a major concern in distributed computing as well, where efficient code is needed throughout the system regardless of the physical architecture it runs on. For this reason, code that can 'auto-tune' itself to achieve near-optimal performance on any given hardware architecture will be a key issue for the distributed computing community moving forward, and will be absolutely essential if the dream of a true distributed operating system running at internet scale is ever to be realized.
Many problems within the world of linear algebra have auto-tuned libraries that achieve near-optimal performance on a wide range of machines; ATLAS [9] is presented here as one example. The problem with these libraries is that they are not general enough to serve all of the high-performance computing needs of the scientific community, and the optimizations they employ are too problem-specific to be applied elsewhere. To address this, a new area of general-purpose frameworks for performance-portable code is emerging; the empirical search at the heart of the existing libraries is sketched below. The purpose of this literature review is to give the reader an overview of the most popular such frameworks in existence today, describe how they work, and summarize their overall performance.
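To make the auto-tuning idea concrete, the following minimal C sketch shows the kind of empirical search that ATLAS-style libraries perform at install time: several candidate blocking factors for a simple matrix multiply are timed on the host machine, and the fastest is selected. The kernel, the candidate block sizes, and the clock()-based timing harness are illustrative assumptions made for this review, not code taken from ATLAS itself.

/* Minimal sketch of ATLAS-style empirical auto-tuning: time each
 * candidate blocking factor for a blocked matrix multiply and keep
 * the fastest.  Kernel and search space are illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 512  /* matrix dimension used for the benchmark (assumption) */

/* Blocked C += A * B for N x N row-major matrices. */
static void gemm_blocked(const double *A, const double *B, double *C, int bs)
{
    for (int ii = 0; ii < N; ii += bs)
        for (int kk = 0; kk < N; kk += bs)
            for (int jj = 0; jj < N; jj += bs)
                for (int i = ii; i < ii + bs && i < N; i++)
                    for (int k = kk; k < kk + bs && k < N; k++)
                        for (int j = jj; j < jj + bs && j < N; j++)
                            C[i * N + j] += A[i * N + k] * B[k * N + j];
}

int main(void)
{
    int candidates[] = { 16, 32, 64, 128 };  /* tuning search space */
    double *A = calloc(N * N, sizeof *A);
    double *B = calloc(N * N, sizeof *B);
    double *C = calloc(N * N, sizeof *C);
    int best_bs = 0;
    double best_time = 1e30;

    /* Generate-and-measure loop: run each variant, keep the fastest. */
    for (size_t v = 0; v < sizeof candidates / sizeof *candidates; v++) {
        clock_t t0 = clock();
        gemm_blocked(A, B, C, candidates[v]);
        double elapsed = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("block size %3d: %.3f s\n", candidates[v], elapsed);
        if (elapsed < best_time) {
            best_time = elapsed;
            best_bs = candidates[v];
        }
    }
    printf("selected block size %d for this machine\n", best_bs);
    free(A); free(B); free(C);
    return 0;
}

Real autotuners such as ATLAS search a far larger space (unrolling factors, register blocking, instruction scheduling) and cache the winning configuration, but the generate-and-measure loop above is the essential mechanism.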
References
[1] Jason Ansel, Cy Chan, Yee Lok Wong, Marek Olszewski, Qin Zhao, Alan Edelman, and Saman Amarasinghe. PetaBricks: a language and compiler for algorithmic choice. In Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation, PLDI ’09, pages 38–49, New York, NY, USA, 2009. ACM.
[2] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.
[3] Jee W. Choi, Amik Singh, and Richard W. Vuduc. Model-driven autotuning of sparse matrix-vector multiply on GPUs. In Proceedings of the 15th ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP ’10, pages 115–126, New York, NY, USA, 2010. ACM.
[4] James Demmel, Jack Dongarra, Viktor Eijkhout, Erika Fuentes, Antoine Petitet, Richard Vuduc, R. Clint Whaley, and Katherine Yelick. Self-adapting linear algebra algorithms and software. Proc. IEEE, 93(2):293–312, February 2005.
[5] Nathan Thomas, Gabriel Tanase, Olga Tkachyshyn, Jack Perdue, Nancy M. Amato, and Lawrence Rauchwerger. A framework for adaptive algorithm selection in STAPL. In Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP ’05, pages 277–288, New York, NY, USA, 2005. ACM.
[6] A. Tiwari, Chun Chen, J. Chame, M. Hall, and J.K. Hollingsworth. A scalable auto-tuning framework for compiler optimization. In 2009 IEEE International Symposium on Parallel & Distributed Processing (IPDPS 2009), pages 1–12, May 2009.
[7] Sundaresan Venkatasubramanian and Richard W. Vuduc. Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems. In Proceedings of the 23rd international conference on Supercomputing, ICS ’09, pages 244–255, New York, NY, USA, 2009. ACM.
[8] Michael J. Voss and Rudolf Eigenmann. High-level adaptive program optimization with ADAPT. SIGPLAN Not., 36:93–102, June 2001.
[9] R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, 2001.
[10] Qing Yi and R. Clint Whaley. Automated transformation for performance-critical kernels. In Proceedings of the 2007 Symposium on Library-Centric Software Design, LCSD ’07, pages 109–119, New York, NY, USA, 2007. ACM.