A National Science Foundation (NSF) grant is funding the University at Buffalo and the Texas Advanced Computing Center (TACC) at The University of Texas at Austin to evaluate the effectiveness of high-performance computing (HPC) systems in the NSF Extreme Science and Engineering Discovery Environment (XSEDE) program and HPC systems in general.
Today's high-performance computing systems are a complex combination of software, processors, memory, networks, and storage systems characterized by frequent disruptive technological advances. In this environment, service providers, users, system managers, and funding agencies find it difficult to know whether systems are achieving their optimal performance, or whether all subcomponents are functioning properly.
Through the "Integrated HPC Systems Usage and Performance of Resources Monitoring and Modeling (SUPReMM)" grant, the University at Buffalo and TACC will develop new tools and a comprehensive knowledge base to improve the ability to monitor and understand performance for diverse applications on HPC systems.
The nearly $1 million grant will build on and combine work already in progress at the University at Buffalo under the Technology Audit Service (TAS) for XSEDE and at TACC as part of the Ranger Technology Insertion effort.
"Obtaining reliable data without efficient data management is impossible in today's complex HPC environment," said Barry Schneider, program director in the NSF's Office of Cyberinfrastructure. "This collaborative project will enable a much more complete understanding of the resources available through the XSEDE program and will increase the productivity of all of the stakeholders, service providers, users and sponsors in our computational ecosystem."
"Ultimately, it will advance our goals of providing open source tools for the entire science community to effectively utilize all HPC resources being deployed by the NSF for open science research in the academic community," Schneider said.
Working with the XSEDE TAS team at Buffalo, TACC staff members are running data gathering tools on the Ranger and Lonestar supercomputers to evaluate data that is relevant to application performance.
"We gather data on every system node at the beginning and end of every job, and every 10 minutes during the job—that's a billion records per system each month," said Bill Barth, director of high-performance computing at TACC. "It's going to end up being a Big Data problem in the end."
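The scale Barth describes can be checked with a rough back-of-the-envelope calculation. The sketch below is illustrative only: the 10-minute interval and the billion-record figure come from the article, while the node count, the 30-day month, the assumption that every node is busy all month, and all function names are hypothetical placeholders rather than details of the actual TACC tools.

```python
# Rough estimate of monitoring volume under the sampling scheme described above.
# Only the 10-minute interval and the "billion records" figure come from the
# article; every other number and name here is a hypothetical assumption.

SAMPLE_INTERVAL_MIN = 10  # from the article: one sample every 10 minutes


def periodic_samples_per_node(days=30, interval_min=SAMPLE_INTERVAL_MIN):
    """Periodic samples taken on one node that is busy for the whole period."""
    return days * 24 * 60 // interval_min


def node_samples_per_month(nodes, jobs_per_node=0, days=30):
    """Total samples across all nodes.

    Each job contributes two extra samples (job start and job end) on every
    node it runs on; jobs_per_node is the assumed number of jobs each node
    hosts during the period.
    """
    return nodes * (periodic_samples_per_node(days) + 2 * jobs_per_node)


def records_per_sample_implied(total_records, nodes, jobs_per_node=0, days=30):
    """Records each sample would need to hold to reach total_records."""
    return total_records / node_samples_per_month(nodes, jobs_per_node, days)


if __name__ == "__main__":
    hypothetical_nodes = 4000  # placeholder, not an official system size
    samples = node_samples_per_month(hypothetical_nodes)
    implied = records_per_sample_implied(1e9, hypothetical_nodes)
    print(f"node samples per month: {samples:,}")
    print(f"records per sample implied by 1e9 records: {implied:.1f}")
```

With these placeholder inputs, periodic sampling alone produces about 17 million node samples a month, so a billion records would correspond to several dozen records per sample. The article does not say how many records each sample contains, so that ratio is an inference from the assumed inputs, not a reported figure.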
The tools will present various views of existing XSEDE usage data from the central database, according to Barth. This data will show in detail how individual user jobs and codes are performing on a system. In the coming year, the research and development effort will gather data and evaluate performance on all XSEDE systems, including Stampede, which will launch in January 2013.
"HPC resources are always at a premium," said Abani Patra, principal investigator of the University at Buffalo project. "Even a 10 percent increase in operational efficiency will save millions of dollars. This is a logical extension of the larger XSEDE TAS effort."
TAS, through the XSEDE Metrics on Demand (XDMoD) portal, rapidly provides quantitative and qualitative performance metrics to all stakeholders, including NSF leadership, service providers, and the XSEDE user community.
Work on the grant began on July 1, 2012, and will continue for two years.