2017 — 2019 |
Qian, Xuehai |
N/A |
CRII: SHF: Improving Programmability of GPGPU/NVRAM Integrated Systems With Holistic Architectural Support @ University of Southern California
In the era of big data, industry faces growing demand for higher computing power and large-capacity, high-performance storage. GPGPU and NVRAM are two prominent technologies that will play a key role in the "Big Data revolution". This project, which holistically improves the programmability of GPGPU/NVRAM integrated systems, tackles the "programmability bottleneck" faced in GPGPU and NVRAM. It will make it easier to develop correct, high-performance applications for GPGPU and NVRAM. As a result, the project will encourage the adoption of GPGPUs and NVRAM in a wide range of HPC and big data applications, which could then gain speedups of hundreds of times while ensuring recoverability. Overall, the outcomes of this project will help sustain the performance needed to support supercomputing and big data processing in science and engineering (e.g., finance, medicine, biology, petroleum, aerospace, and geology). This project will also contribute to society by engaging high-school and undergraduate students from minority-serving institutions in research, attracting women and under-represented groups to graduate education, expanding the computer engineering curriculum with GPGPU/NVRAM architectures, disseminating research infrastructure for education and training, and collaborating with industry.
This research investigates synergistic approaches to holistically improve the programmability of GPGPU/NVRAM integrated systems with the following techniques: (1) Timestamp-Based GPU Coherence Protocol. It avoids storage overhead by not storing sharing states (e.g., Shared, Modified, Exclusive) or the list of sharers. It reduces traffic overhead by not sending explicit invalidation messages. (2) Integration of Persistency and Scoped Synchronization. This research aims to study the new notion of Persistent Scope (PS), which incorporates the necessary persistency semantics into the existing scoped synchronization in GPGPU programming models. An efficient architecture design that fully decouples consistency and persistency will be explored. (3) Data Sharing-Aware CTA Scheduler and Cache Management. This research plans to investigate a sharing-aware cooperative thread array (CTA) scheduler that attempts to assign CTAs that share data to the same streaming multiprocessor (SM) to improve temporal and spatial locality.
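The idea behind technique (1) can be illustrated with a minimal sketch. It is not the project's protocol: it assumes a generic lease-style scheme in which the shared cache keeps only a write timestamp and a read-lease timestamp per line, and a write is ordered logically after any outstanding lease. All names (`SharedL2`, `Core`, `LEASE`) and the lease length are hypothetical.

```python
from dataclasses import dataclass

LEASE = 10  # hypothetical lease length, in logical time units


@dataclass
class L2Line:
    value: int = 0
    wts: int = 0  # logical time of the most recent write
    rts: int = 0  # latest logical time through which a cached copy may be read


@dataclass
class L1Copy:
    value: int
    wts: int
    rts: int  # this copy expires on its own once the core's clock passes rts


class SharedL2:
    """Shared cache with two timestamps per line: no sharer list, no M/E/S state."""

    def __init__(self):
        self.lines = {}

    def read(self, addr, now):
        line = self.lines.setdefault(addr, L2Line())
        # Grant (or extend) a lease that starts no earlier than the last write.
        line.rts = max(line.rts, max(now, line.wts) + LEASE)
        return L1Copy(line.value, line.wts, line.rts)

    def write(self, addr, value, now):
        line = self.lines.setdefault(addr, L2Line())
        # Order the write after every outstanding lease instead of invalidating copies.
        line.wts = max(now, line.rts + 1)
        line.rts = line.wts
        line.value = value
        return line.wts


class Core:
    def __init__(self, l2):
        self.l2 = l2
        self.clock = 0  # per-core logical time
        self.l1 = {}

    def load(self, addr):
        copy = self.l1.get(addr)
        if copy is None or self.clock > copy.rts:  # missing, or lease expired
            copy = self.l2.read(addr, self.clock)
            self.l1[addr] = copy
            self.clock = max(self.clock, copy.wts)
        return copy.value

    def store(self, addr, value):
        self.clock = self.l2.write(addr, value, self.clock)
        self.l1[addr] = L1Copy(value, self.clock, self.clock)


l2 = SharedL2()
a, b = Core(l2), Core(l2)
first = a.load(0x10)   # a caches the line under a lease
b.store(0x10, 42)      # no message is sent to a
stale = a.load(0x10)   # still a hit: a is logically ordered before the write
a.clock = b.clock      # stand-in for a synchronization that advances a's clock
fresh = a.load(0x10)   # lease has expired, so a refetches the new value
```

In this sketch the old copy is never invalidated by a message; it simply stops being usable once the reader's logical clock moves past the lease, which is why neither sharer lists nor invalidation traffic are needed.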
|
0.951 |
2017 — 2020 |
Qian, Xuehai |
N/A |
CSR: Small: Collaborative Research: GAMBIT: Efficient Graph Processing On a Memristor-Based Embedded Computing Platform @ University of Southern California
Graph processing has recently received intense interest because of a wide range of needs to understand relationships. Graph analytics is widely used in key domains of our society, such as cyber security, social media, infrastructure monitoring (e.g., smart buildings), natural language processing, systems biology, and recommendation systems. These important applications all fall into fast-growing sectors of computer science and engineering research. At the same time, in many emerging applications graph analytics is ideally performed at the edge (e.g., on a mobile or embedded system) so that the relationships between events can be discovered in the field where they unfold. Unfortunately, existing embedded systems equipped with conventional computing units such as CPUs and GPUs cannot efficiently process large graphs in real time. Instead, large data centers are required to perform the graph processing, which either incurs extra latency and energy due to data communication or provides only forensic (offline) graph analysis. This research aims to enable effective graph analytics in embedded systems with a disruptive emerging technology.
To support graph analytics applications with the limited hardware resources in embedded systems, this project seeks to develop GAMBIT -- a memristor-based embedded computing framework for efficient graph processing. The research program aims to develop multi-layer techniques to enable highly efficient (e.g., 1000X) and scalable real-time graph analytics in embedded systems (i.e., at the network edge). It contains research efforts across circuit, architecture, system, and vertical integration. (1) At the circuit level, the project proposes a memristor-based graph computing core to enable efficient computations for graph processing. (2) At the architecture level, the project proposes a complete memristor-based graph processing architecture for partitioned graphs and various algorithms. (3) At the system level, the project develops a graph analytics framework for embedded systems and integrates it with a popular embedded OS. (4) For integration, the project proposes to develop an emulator of the proposed architecture and cross-layer HW/SW co-design techniques. This project contributes to society by engaging high-school and undergraduate students from minority-serving institutions in research, attracting women and under-represented groups to graduate education, expanding the computer engineering curriculum with graph processing and other emerging applications in embedded systems, disseminating research infrastructure for education and training, and collaborating with industry.
|
0.951 |
2017 — 2020 |
Qian, Xuehai |
N/A |
SHF: Small: Accelerating Graph Processing With Vertically Integrated Programming Model, Runtime and Architecture @ University of Southern California
Graph processing has recently received intense interest due to the increasing need to understand relationships. For example, in cyber security, graph analytics is needed to identify probes on the network. In social media, graph analytics is employed to figure out the relationships and influences between people. In infrastructure monitoring (e.g., smart buildings), graph analytics is crucial for spotting failures based on system dependencies before they become critical and cause cascading failures. At the same time, in-memory graph processing is becoming more appealing due to recent technology advances (e.g., NDP with 3D integration) that have improved the scalability of memory systems at lower cost. This research program is therefore a timely study of graph processing (which has broad applications) in light of the emerging trends in the memory system.
This project will investigate a vertically integrated approach involving programming model, runtime system, and architecture to holistically accelerate in-memory graph processing. It contains three research innovations and cross-stack integration: (1) Reducing data movement with a novel programming model. It will study a new graph processing programming model, "Two-phase Vertex Program", designed for PIM, that supports a novel "source-cut" data partition. (2) Batched regular inter-cube communication and intra-cube locality enhancement. It will examine how to reorganize the computation so that inter-cube communication happens in a controlled manner. This allows batched communication and the overlapping of computation and communication. To this end, it will study how to partition the cores in the same cube into two groups (Process and Apply units) to improve intra-cube memory access locality. (3) Co-designed locality-aware scheduler and prefetcher. This project will develop a novel architectural interface so that the software and architecture can interact. On one side, it gives the scheduler the capability to query the locality information of scheduling candidates to make better decisions. On the other side, the scheduler can convey its scheduling decisions to the architecture so that even a simple prefetcher can precisely fetch the data related to the active vertices that will be scheduled soon. The proposed research will also contribute to society by engaging high-school and undergraduate students from minority-serving institutions in research, attracting women and under-represented groups to graduate education, expanding the computer engineering curriculum with graph processing architectures and runtime systems, disseminating research infrastructure for education and training, and collaborating with industry.
|
0.951 |
2017 — 2018 |
Qian, Xuehai |
N/A |
Student Travel Support For the 2017 International Conference On Architecture Support For Programming Languages and Operating Systems (ASPLOS) @ University of Southern California
This proposal seeks support for student travel to the 22nd ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), to be held in Xi'an, China, April 8-12, 2017. ASPLOS is a premier forum for multidisciplinary systems research spanning computer architecture and hardware, programming languages and compilers, and operating systems. Papers in ASPLOS may target diverse goals such as performance, energy and thermal efficiency, resiliency, security, and sustainability. The proposal requests travel support funds to help students defray the costs of traveling to and attending ASPLOS-22. The funds requested from NSF will help support a large number of students who are expected to attend ASPLOS, including those who are not ACM members. Priority will be given to students who are US citizens and permanent residents, as well as to students belonging to under-represented groups.
The importance of forums such as ASPLOS continues to grow as we come to the end of Moore's Law and experience the explosion of big data, scales ranging from ultra-low-power wearable devices to exascale parallel and cloud computers, the need for sustainability, and increasingly human-centered applications. ASPLOS embraces systems research that directly targets these new problems in new ways. Historically, the conference has attracted top research papers from both academia and industry, and many innovations published in the proceedings have been influential in the history of the processor industry. The acceptance rate for submitted papers is typically between 15% and 20%.
|
0.951 |
2019 — 2022 |
Prasanna, Viktor (co-PI); Qian, Xuehai |
N/A |
SPX: Collaborative Research: FASTLEAP: FPGA Based Compact Deep Learning Platform @ University of Southern California
With the rise of artificial intelligence in recent years, Deep Neural Networks (DNNs) have been widely used because of their high accuracy, excellent scalability, and self-adaptiveness. Many applications employ DNNs as the core technology, such as face detection, speech recognition, and scene parsing. To meet the high accuracy requirements of various applications, DNN models are becoming deeper and larger, and are evolving at a fast pace. They are computation and memory intensive and pose severe challenges to the conventional von Neumann architecture used in computing. The key problem addressed by the project is how to accelerate deep learning: not only inference, but also training and model compression, which have not received enough attention in prior research. This endeavor has the potential to enable the design of fast and energy-efficient deep learning systems, applications of which are found in our daily lives -- ranging from autonomous driving, through mobile devices, to IoT systems, thus benefiting society at large.
The outcome of this project is FASTLEAP, a Field-Programmable Gate Array (FPGA)-based platform for accelerating deep learning. The platform takes a dataset as input and outputs a model that is trained, pruned, and mapped onto the FPGA, optimized for fast inference. The project will utilize emerging FPGA technologies that have access to High Bandwidth Memory (HBM) and include floating-point DSP units. From a vertical perspective, FASTLEAP integrates innovations from multiple levels of the whole system stack: algorithm, architecture, and down to efficient FPGA hardware implementation. From a horizontal perspective, it embraces systematic DNN model compression and associated FPGA-based training, as well as FPGA-based inference acceleration of compressed DNN models. The platform will be delivered as a complete solution, with both the software tool chain and the hardware implementation, to ensure ease of use. At the algorithm level of FASTLEAP, the proposed Alternating Direction Method of Multipliers for Neural Networks (ADMM-NN) framework will perform unified weight pruning and quantization, given training data, target accuracy, and target FPGA platform characteristics (performance models, inter-accelerator communication). The training procedure in ADMM-NN is performed on a platform with multiple FPGA accelerators, dictated by the architecture-level optimizations on communication and parallelism. Finally, the optimized FPGA inference design is generated from the trained and compressed DNN model, accounting for FPGA performance modeling. The project will address the following SPX research areas: 1) Algorithms: Bridging the gap between deep learning developments in theory and their system implementations, cognizant of the performance model of the platform. 2) Applications: Scaling of deep learning for domains such as image processing. 3) Architecture and Systems: Automatic generation of deep learning designs on FPGA, optimizing area, energy efficiency, latency, and throughput.
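To make the ADMM-based pruning step concrete, here is a minimal sketch of the generic ADMM recipe for a sparsity constraint. It is not the ADMM-NN framework itself: it assumes a toy quadratic loss 0.5*||w - w0||^2 in place of a real training loss (so the weight update has a closed form), it omits quantization, and the names `project_top_k` and `admm_prune` are hypothetical.

```python
def project_top_k(v, k):
    """Euclidean projection onto vectors with at most k nonzeros:
    keep the k largest-magnitude entries and zero the rest."""
    keep = set(sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(v)]


def admm_prune(w0, k, rho=1.0, iters=50):
    """ADMM for: minimize 0.5*||w - w0||^2 subject to w having at most k nonzeros.

    The problem is split as f(w) + g(z) with the constraint w = z, where g is
    the indicator function of the sparsity set and u is the scaled dual variable.
    """
    z = project_top_k(w0, k)
    u = [0.0] * len(w0)
    for _ in range(iters):
        # w-step: closed-form minimizer of the toy loss plus the proximal term
        # (a real DNN would run gradient-based training here instead).
        w = [(a + rho * (zi - ui)) / (1.0 + rho) for a, zi, ui in zip(w0, z, u)]
        # z-step: project onto the sparsity constraint.
        z = project_top_k([wi + ui for wi, ui in zip(w, u)], k)
        # Dual update: accumulate the remaining disagreement between w and z.
        u = [ui + wi - zi for ui, wi, zi in zip(u, w, z)]
    return z


weights = [0.9, -0.05, 0.4, 0.01, -0.7]
pruned = admm_prune(weights, k=3)
```

A quantization constraint would fit the same loop by swapping in a projection that rounds each kept weight to the nearest allowed level; that extension is left out of this sketch.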
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
|
0.951 |
2021 — 2022 |
Kuppannagari, Sanmukh Rao; Prasanna, Viktor; Qian, Xuehai |
N/A |
Collaborative Research: PPoSS: Planning: StreamWare - a Scalable Framework For Accelerating Streaming Data Science @ University of Southern California
In grand-challenge scientific applications, the enormous amount of data produced by the sensing and instrumentation infrastructure often loses its value after a small window of time. Thus, to obtain actionable intelligence from the data, streaming analytics, i.e., the ability to analyze in-motion data, is increasingly becoming critical. Moreover, modern computing systems are highly heterogeneous, consisting of processors, accelerators, and large high-bandwidth external memories. To develop scalable streaming analytics applications, challenges across the full system stack -- from application to target platform -- need to be addressed. In this regard, this planning project is identifying a comprehensive set of research challenges, goals, key innovations and timelines in algorithms and applications, systems software, hardware-software co-design, and computer architecture. This project is bringing together a community of application developers and users, computer scientists, and data scientists, whose interests lie in building streaming data science applications targeting a wide variety of scalable systems. This project is demonstrating preliminary results on how it will achieve significant cross-stack performance improvements using Privacy Preserving Streaming Graph Learning for Secure Smart Grids as the driving application.
Modern data-science applications are characterized as being highly decentralized and distributed, requiring composition and orchestration between localized analytics on thousands or millions of edge platforms and massive centralized analytics in cloud/data centers, as well as real-time analytics on streaming data. To enable scalable performance of grand-challenge streaming data-science applications, a framework that allows developers to seamlessly build these applications targeting a wide variety of scalable systems is needed. This planning project is conducting preliminary research towards a large proposal for developing an open-source framework, StreamWare, that will enable users to develop streaming data-science applications. This project is establishing a community of application developers and users, computer scientists, and data scientists who would serve as early adopters and developers of the StreamWare framework. In consultation with domain experts, a list of key data-science kernels for StreamWare is being generated, and their existing state-of-the-art algorithms and hardware IPs are being evaluated to identify performance limitations and opportunities for improvement. This project is also articulating the requirements of novel abstractions that can represent and operate on streaming data on heterogeneous platforms. This project uses Privacy Preserving Streaming Graph Learning for Secure Smart Grids as a motivating application to show preliminary evidence of end-to-end scalability using a novel notion of symbiotic scalability that captures the impact of StreamWare's cross-layer optimizations. The expected outcomes of this planning project include a proposal for the research activities to be carried out in the large grant, publications on the results of the survey activities and future research directions for enabling streaming data science, and curricula for future graduate and undergraduate courses.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
|
0.951 |
2021 — 2024 |
Qian, Xuehai |
N/A |
SHF: Small: High Performance Graph Pattern Mining System and Architecture @ University of Southern California
This research project aims to develop high-performance systems and architectures for graph pattern mining, which is a key component of various applications, including mining biochemical structures, finding biologically conserved subnetworks, finding functional modules, program control-flow analysis, intrusion network analysis, mining communication graphs, social-network analysis, anomaly detection, and mining XML structures. High-performance graph pattern mining enables advances in fundamental scientific research. The research is motivated by the need to scale to large graphs and patterns; the significant gap between the fastest algorithms and general graph pattern mining systems; and the inefficiency of current computer architectures when executing such workloads. The project vertically advances the field by seeking synergies between algorithm, system, architecture, and hardware implementations. The project provides research opportunities to female, minority, and undergraduate students to broaden participation in computer science education. In particular, the project involves non-CS major students, introducing them to graph-analytics techniques for solving problems in science and engineering.
This research takes a top-down approach, starting from algorithms and developing efficient graph pattern mining systems and architectures. Based on pattern-decomposition algorithms, it develops efficient and general system mechanisms and compiler optimizations with an accurate cost model. To support distributed graph pattern mining with partitioned graphs, it proposes breaking down pattern-enumeration algorithms into small tasks with a key abstraction, extendable embedding, and builds an efficient execution model to overlap communication and computation. At the architecture level, the research proposes novel instruction-set extensions and architectural components to support stream and intersection operations. The proposed techniques will be implemented in two hardware prototypes: (1) a RISC-V processor with an instruction-set extension for stream and intersection operations; and (2) a distributed FPGA accelerator for graph pattern mining with extendable embedding as the primitive. The research outcomes will be published in top system and architecture conferences. The project will deliver several open-source graph pattern mining systems, architecture simulators, and hardware prototypes.
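To show why intersection operations matter for this workload, the sketch below counts triangles, one of the simplest graph patterns, by intersecting sorted neighbor lists. It is a generic software illustration under the assumption that each adjacency list is sorted; it does not reproduce the project's systems or instruction-set extension, and the function names are hypothetical.

```python
def intersect_sorted(a, b):
    """Merge-style intersection of two sorted lists of vertex IDs."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def count_triangles(adj):
    """Count triangles in an undirected graph given sorted adjacency lists.

    Each triangle (u, v, w) is counted once by enforcing u < v < w.
    """
    total = 0
    for u, nbrs in adj.items():
        for v in nbrs:
            if u < v:
                common = intersect_sorted(adj[u], adj[v])
                total += sum(1 for w in common if w > v)
    return total


# Complete graph on four vertices: every choice of three vertices is a triangle.
k4 = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}
triangles = count_triangles(k4)
```

The inner loop of `intersect_sorted` is a data-dependent compare-and-advance over two streams, which is the kind of operation that general-purpose cores handle poorly and that dedicated stream and intersection support would target.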
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
|
0.951 |