Type: Research Highlight
Title: Scaling Machine Learning via Compressed Linear Algebra
Ahmed Elgohary, Matthias Boehm, Peter J. Haas, Frederick R. Reiss, Berthold Reinwald
Large-scale machine learning (ML) algorithms are often iterative, using repeated read-only data access and I/O-bound matrix-vector multiplications to converge to an optimal model. It is crucial for performance to fit the data into single-node or distributed main memory and enable very fast matrix-vector operations on in-memory data. General-purpose, heavy- and lightweight compression techniques struggle to achieve both good compression ratios and fast decompression speed to enable block-wise uncompressed operations. Compressed linear algebra (CLA) avoids these problems by applying lightweight lossless database compression techniques to matrices and then executing linear algebra operations such as matrix-vector multiplication directly on the compressed representations. The key ingredients are effective column compression schemes, cache-conscious operations, and an efficient sampling-based compression algorithm. Experiments on an initial implementation in SystemML show in-memory operations performance close to the uncompressed case and good compression ratios. We thereby obtain significant end-to-end performance improvements up to 26x or reduced memory requirements.
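To make the idea of operating directly on compressed data concrete, the following is a minimal sketch in Python. It is not the CLA implementation in SystemML: the encoding (each column stored as a map from distinct nonzero value to the list of rows holding that value) is a simplification assumed here for illustration, and the names `compress_columns`, `compressed_matvec`, and `dense_matvec` are hypothetical.

```python
from collections import defaultdict


def compress_columns(matrix):
    """Encode each column as {distinct nonzero value: [row indices]}.

    Illustrative encoding only; assumed for this sketch, not taken
    from the CLA paper or SystemML.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    columns = []
    for j in range(n_cols):
        offsets = defaultdict(list)
        for i in range(n_rows):
            value = matrix[i][j]
            if value != 0:  # zeros are implicit and never stored
                offsets[value].append(i)
        columns.append(dict(offsets))
    return n_rows, columns


def compressed_matvec(n_rows, columns, v):
    """Compute y = X v without decompressing the columns of X."""
    y = [0.0] * n_rows
    for j, offsets in enumerate(columns):
        for value, rows in offsets.items():
            scaled = value * v[j]  # one multiplication per distinct value
            for i in rows:
                y[i] += scaled  # one addition per stored row index
    return y


def dense_matvec(matrix, v):
    """Uncompressed reference implementation used for comparison."""
    return [sum(a * b for a, b in zip(row, v)) for row in matrix]


X = [
    [7, 0, 2],
    [7, 3, 2],
    [0, 3, 2],
    [7, 0, 0],
]
v = [1.0, 2.0, 0.5]

n_rows, columns = compress_columns(X)
y_compressed = compressed_matvec(n_rows, columns, v)
y_dense = dense_matvec(X, v)
```

In this sketch, a column with few distinct values needs only one multiplication per distinct value rather than one per row, which suggests why columns with low cardinality can compress well and still support fast multiplication.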
Ahmed Elgohary is a third-year PhD student in computer science at the University of Maryland, College Park. He is co-advised by Professor Douglas Oard and Professor Philip Resnik. His research interests in natural language processing include computational semantics, multilingual learning, and textual inference. Along with his co-authors, he won the best paper award at VLDB 2016 for his work on speeding up machine learning with compressed linear algebra. He is also a co-recipient of the best paper award of the first workshop on evaluating vector space representations for NLP 2016 and the best paper award runner-up of ICDM 2014. Ahmed has been named a 2017-2018 IBM PhD Fellow.
Matthias Boehm is a Research Staff Member at IBM Research – Almaden, where he has worked since 2012 on optimization and runtime techniques for declarative, large-scale machine learning in SystemML. He received his Ph.D. from Technische Universitaet Dresden in 2011 with a dissertation on cost-based optimization of integration flows under the supervision of Prof. Wolfgang Lehner. His previous research also includes systems support for time series forecasting as well as in-memory indexing and query processing. In 2016, he received the VLDB Best Paper Award.
Peter J. Haas has been a Research Staff Member at the IBM Almaden Research Center since 1987, where he has pursued research at the interface of Information Management, Applied Probability, Statistics, and Computer Simulation. He is also a Consulting Professor in the Department of Management Science and Engineering at Stanford University. He was designated an IBM Master Inventor in 2012, and his ideas have been incorporated into products including IBM’s DB2 database system. He is a Fellow of both ACM and INFORMS, and has received a number of awards from IBM and from both the Simulation and Computer Science communities, including several Best Paper awards and an ACM SIGMOD 10-year Best Paper award. Other work has included the Splash platform for collaborative modeling and simulation, techniques for massive-scale matrix completion, Monte Carlo methods for scalable querying and machine learning over massive uncertain data, automated relationship discovery in databases, query optimization methods, and autonomic computing. He serves on the editorial boards of ACM TODS, Operations Research and ACM TOMACS, and was an Associate Editor for the VLDB Journal from 2007 to 2013. He is the author of over 100 conference publications, journal articles, and books.
Frederick R. Reiss is the Chief Architect at the IBM Spark Technology Center in San Francisco and is one of the founding employees of the Center. Fred received his Ph.D. from UC Berkeley in 2006, then worked for IBM Research – Almaden for the next nine years. At Almaden, Fred worked on the SystemML and SystemT projects, as well as on the research prototype of DB2 with BLU Acceleration.
Berthold Reinwald is a Principal RSM at the IBM Almaden Research Center. His research areas include scalable analytics platforms and database technology. He is the technical lead for SystemML. He holds a Ph.D. in Computer Science from the University of Erlangen-Nuernberg, Germany.