2014 — 2017
Papka, Michael; Vishwanath, Venkatram
CC*IIE Integration: Collaborative Research: EPSON: Embracing Parallel Networks and Storage for Predictable End-to-End Data Movement @ Northern Illinois University
Geographically distributed scientific communities require increasingly sophisticated data transfer mechanisms that can handle the challenges of sharing large datasets over heterogeneous networks. These challenges include optimization of networks with different configurations and protocols, I/O mechanisms to efficiently read and write to parallel storage, and the varying demands of widely different data transfer workloads.
To address these challenges, the EPSON project is developing, implementing, and evaluating application programming interfaces and tools that facilitate end-to-end parallel data transfers. EPSON researchers focus on three areas: (1) enabling parallel network data movement, by taking into account the diversity of parallel network characteristics of both shared networks and infrastructures with dedicated circuits and paths and effectively balancing the flows among paths for more predictable performance; (2) developing a GridFTP data storage interface, enabling scalable I/O to and from parallel filesystems -- critical for campus infrastructures to deal with large-scale datasets; and (3) devising mechanisms that overlap network transfers with storage I/O and incorporate data-staging heuristics, matching the impedance between storage and networking capabilities to improve end-to-end data transfers. The project involves close collaboration with application scientists, with the objective of providing advanced networking tools to support the requirements of the applications deployed at the University of Chicago and Northern Illinois University campuses.
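The third focus area, overlapping network transfers with storage I/O, can be illustrated with a minimal sketch. The function below is a hypothetical illustration, not EPSON's actual implementation: `read_chunk` and `send_chunk` stand in for parallel-filesystem reads and network sends, and a bounded staging queue lets the reader run ahead of the sender without unbounded memory growth, a simple form of impedance matching between storage and network rates.

```python
import queue
import threading

def staged_transfer(read_chunk, send_chunk, n_chunks, depth=2):
    """Overlap storage reads with network sends via a bounded staging queue.

    read_chunk(i) and send_chunk(data) are caller-supplied I/O callbacks
    (hypothetical names for illustration). A queue of `depth` slots caps
    how far the reader can run ahead of the sender.
    """
    staging = queue.Queue(maxsize=depth)

    def reader():
        for i in range(n_chunks):
            staging.put(read_chunk(i))   # blocks when the sender lags
        staging.put(None)                # sentinel: end of stream

    t = threading.Thread(target=reader)
    t.start()
    sent = []
    while (data := staging.get()) is not None:
        send_chunk(data)                 # overlaps with the next read
        sent.append(data)
    t.join()
    return sent
```

With a fast reader and slow sender the queue fills and reads stall; with the rates reversed the sender blocks on `get()`. Tuning `depth` (and chunk size) against measured storage and network bandwidth is the staging heuristic in miniature.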
2021 — 2026
Lu, Shan (co-PI); Gunawi, Haryadi; Vishwanath, Venkatram; Hoffmann, Henry; Ross, Robert
Collaborative Research: PPoSS: Large: ScaleStuds: Foundations for Correctness Checkability and Performance Predictability of Systems at Scale
In light of the limits of Moore's Law and Dennard scaling and the ever-increasing demand for computing, the last decade has seen unprecedented deployment scales: Google is known to run clusters with thousands of machines each, Apple deploys a total of 100,000 database machines, and Netflix runs tens of database clusters with 500 nodes each. This era of extreme-scale distributed systems has given birth to a new class of faults, "scalability faults" -- complex latent faults that are scale-dependent, whose symptoms surface in large-scale deployments but not necessarily in small- or medium-scale deployments. Many fundamental research questions are not answerable today. On correctness: How can program analysis detect bugs that manifest only at large scale? How can various dimensions of system scale be tested and reproduced efficiently on one machine? How can scalability-related faults be prevented and fixed? On performance: How can software performance be reasoned about across heterogeneous devices? How can the performance of fine-grained tasks be predicted accurately enough to reduce aggregate-level error and to project performance onto future architectures? Finally, in combination: How can all these questions be answered for the larger connected ecosystem -- not just the individual software and hardware components -- so as to eventually build future-generation systems that are reproducible and verifiable by construction with respect to correctness and performance at scale?
The ScaleStuds project brings together a team of ten researchers to develop the foundations of correctness checkability (CC) and performance predictability (PP) of systems at scale. The key principle of the project is to "check large with large" -- check large-scale systems with a large fleet of data, analyses, tests, learning, models, and proofs. The vision is an ecosystem of distributed "CC+PP-certified" software-software and software-hardware interactions. The project pursues this vision one "floor" at a time, creating composable building blocks ("the studs"). It first builds new mechanisms, such as a scale-testing platform and a unified database of software program properties and hardware performance profiles exposing clear APIs. These studs then enable multi-dimensional automated scalability tests, program analysis, and performance learning and prediction at various levels of the software/hardware stack. Ultimately, all of these experiences are intended to lead to correct and performant cross-layer and cross-service interactions and to future design principles, including reproducible- and verified-by-construction development methods.
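The idea of probing for scale-dependent faults by colocating many lightweight node instances on one machine can be sketched as follows. This is a toy illustration under assumed names (`make_node`, `check` are hypothetical callbacks; real scale-testing platforms also intercept network and timing behavior), but the shape -- sweep deployment scales, report the first one that violates an invariant -- is the same.

```python
def find_scalability_fault(make_node, check, scales):
    """Colocate many lightweight node stubs in one process and probe
    each deployment scale for an invariant violation.

    make_node(n) builds n in-process node stubs; check(nodes) returns
    True when the system invariant holds at that scale.
    """
    for n in scales:
        nodes = make_node(n)
        if not check(nodes):
            return n          # smallest tested scale triggering the fault
    return None               # no fault observed at the tested scales

# Toy "system" with a latent scalability fault: each node's membership
# view is capped at 8 peers, so the full-connectivity invariant holds
# at small scales and silently breaks once the cluster exceeds 8 nodes.
def make_cluster(n, cap=8):
    return [list(range(n))[:cap] for _ in range(n)]

def all_connected(nodes):
    n = len(nodes)
    return all(len(view) == n for view in nodes)
```

Running `find_scalability_fault(make_cluster, all_connected, [2, 4, 8, 9, 16])` pinpoints the first failing scale without ever deploying a real cluster, which is the efficiency argument behind one-machine scale testing.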
The project novelties include the advancement of debugging, testing, learning, and prediction methods to ensure correctness checkability and performance predictability of extreme-scale systems and applications both on classical hardware platforms and emerging ones; a unified data ecosystem of software/hardware properties and profiles that facilitates automated analyses via clear APIs; a multi-dimensional scale-testing framework that empowers the development of new large-scale unit-tests and program analysis; detailed device profiling and observation to enable large-scale performance learning/prediction and deliver lessons for learning/predicting the behavior of other devices and layers in an end-to-end hardware/software stack; and ultimately a clear definition of CC+PP-certifiability for today's systems and future verifiable/reproducible-by-construction development methods.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.