1998 — 2003
Calder, Bradley
Career: Value and Memory Access Profiling For Compiler Optimization @ University of California-San Diego
Profile feedback optimizations have been shown to improve program performance in several compiler research areas. The main profiling information used to guide most of these optimizations is basic block and control flow edge frequencies. This is only a small fraction of the profile information that could be gathered and used for compiler optimizations. This proposal focuses on two relatively new areas of profile feedback optimization. The first area will investigate gathering value profiles, which keep track of the top values for instructions and variables found during execution. This information is then used to guide value-based code specialization, which can significantly reduce the number of executed instructions. The second profiling area deals with memory access profiles. Memory access profiles will be used to guide a new optimization called Cache-Conscious Data Placement, as well as to guide speculative instruction scheduling. Cache-conscious data placement uses a temporal-relationship memory profile to determine at compile time where to place stack, global, heap, and constant objects in the cache. This can potentially result in a significant reduction in the data cache miss rate. In addition to the performance benefits from these optimizations, this research will investigate fast and efficient approaches for gathering the profile information.
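The value-profiling idea above can be illustrated with a minimal sketch (not from the grant; class and threshold names are hypothetical): a profiler records the most frequent values observed at an instrumented site, and a site whose profile is dominated by one value becomes a candidate for value-based specialization.

```python
from collections import Counter

class ValueProfiler:
    """Hypothetical sketch: tracks the most frequent runtime values
    observed at one instrumented site (an instruction or variable)."""

    def __init__(self, top_n=4):
        self.counts = Counter()
        self.total = 0
        self.top_n = top_n

    def observe(self, value):
        # Called from instrumentation each time the site executes.
        self.counts[value] += 1
        self.total += 1

    def top_values(self):
        # The "top values" the abstract refers to.
        return self.counts.most_common(self.top_n)

    def is_semi_invariant(self, threshold=0.9):
        # A site is a specialization candidate when one value dominates.
        # The 0.9 threshold is an assumed illustrative value.
        if self.total == 0:
            return False
        (_, count), = self.counts.most_common(1)
        return count / self.total >= threshold

# Example: profiling the divisor of a division in a hot loop.
# The value 8 dominates, so the compiler could emit a specialized
# version guarded by a check (e.g. x / 8 becomes x >> 3).
prof = ValueProfiler()
for divisor in [8] * 95 + [3] * 5:
    prof.observe(divisor)
print(prof.top_values()[0])       # (8, 95)
print(prof.is_semi_invariant())   # True
```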
1998 — 2001
Tullsen, Dean; Calder, Bradley
Hardware Generation of Threads in a Multithreading Processor @ University of California-San Diego
Simultaneous Multithreaded (SMT) processors depend on thread-level parallelism (multiple jobs to run) to surpass the performance of single-threaded processors. However, single-thread performance is still important for mainstream processors. When there are only one or a few processes in the system, hardware can create threads that increase the ILP available to the SMT processor, using otherwise idle resources. This project is developing hardware thread-generation techniques to increase SMT performance in the absence of software-generated thread-level parallelism. The techniques being investigated include multiple-path execution, instruction recycling, and speculative loop execution. Threaded Multiple-path Execution (TME) takes advantage of idle hardware contexts to solve the branch problem, speculatively executing multiple paths through conditional branches in a single application. Instruction Recycling increases the efficiency of multiple-path execution, as it avoids refetching instructions that are not path-dependent on a branch, and avoids re-executing instructions that are not data-dependent on the branch. Speculative Loop Parallelism speculatively executes multiple future loop iterations in parallel, using compiler analysis and dynamic hardware detection of loops and induction statements.
2000 — 2003
Ferrante, Jeanne; Calder, Bradley
Predicate-Sensitive Software and Hardware Analysis to Enable Optimization and Speculation @ University of California-San Diego
Predicated execution is a feature used in the Explicitly Parallel Instruction Computing (EPIC) architecture for achieving the instruction-level parallelism (ILP) needed to keep increasing future processor performance. The IA-64 processor being developed by Intel with Hewlett-Packard is an example of an EPIC architecture. An advantage of predicated execution is the elimination of hard-to-predict branches by combining both paths of a branch into a single path, thereby obtaining additional opportunities for ILP. However, this merging of several paths into one has disadvantages, as it complicates optimizations and scheduling in both software and hardware.
This research develops a comprehensive framework for new compiler and hardware analysis whose projected impact is to realize the performance potential of predicated execution. Underlying our framework is the efficient maintenance and use of predicate relationships and precise information about predicated regions. This proposal builds on our prior work by (1) incorporating critical path and resource constraints into a compiler intermediate form for predicated compilation, (2) developing hardware structures to allow predicate speculation and out-of-order execution, (3) developing software and hardware dynamic predication, and (4) developing predicate-sensitive compiler optimizations, especially those based on value prediction or profiling.
2001 — 2004
Tullsen, Dean; Calder, Bradley
Critical Path Computing @ University of California-San Diego
Critical path prediction is a processor architecture technique that uses the past behavior of instructions in the instruction stream to predict which fetched instructions will be on the critical path; that is, which instructions will have a significant impact on processor performance, and which will not. This information can then be used to guide the selective application of a variety of processor optimizations.
Modern processors remove most artificial constraints on execution throughput. Therefore, the bottleneck for many workloads on current processors is the true dependences in the code. Chains of dependent instructions constrain the overall throughput of the machine, often leaving aggressive processor technology highly underutilized. These chains of dependent instructions constitute the critical performance path, or critical path (CP), through the code.
The performance of the processor is thus determined by the speed at which it executes the instructions along this critical path. In our efforts to get the maximum performance from the processor, it is no longer reasonable to treat all instructions the same. If we can know which instructions are critical to performance, we can accelerate their execution, possibly at the expense of instructions not on the critical path.
This research will attempt to identify these critical instructions dynamically in hardware. We call this critical path prediction. This prediction is based on the behavior of previous invocations of the instruction in the pipeline. This prediction will enable the processor to make better decisions about where to apply certain policies and optimizations. A variety of critical path predictors will be examined.
In many cases, critical path prediction will enable more effective application of other resources or optimizations. Possible applications of critical path prediction include guiding value prediction, instruction reuse, instruction issue priority, instruction scheduling on a clustered architecture, speculation control on a power-constrained processor, arbitration between instructions or threads on a multithreaded architecture, or to guide the spawning of speculative threads in a speculative multithreaded processor.
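One family of predictors described above can be sketched as a PC-indexed table of saturating counters (a minimal illustration, not the proposal's actual design; the training heuristic and thresholds are assumptions): an instruction observed behaving critically in the pipeline increments its counter, and instructions whose counters exceed a threshold are predicted critical on their next invocation.

```python
class CriticalPathPredictor:
    """Hypothetical sketch of a critical-path predictor: a table of
    per-PC saturating counters trained on past pipeline behavior."""

    def __init__(self, threshold=4, max_count=7):
        self.table = {}          # PC -> saturating counter
        self.threshold = threshold
        self.max_count = max_count

    def train(self, pc, was_critical):
        # Train on a retired instruction; 'was_critical' would come
        # from some pipeline heuristic (e.g. the instruction stalled
        # retirement) -- an assumption for this sketch.
        c = self.table.get(pc, 0)
        c = min(c + 1, self.max_count) if was_critical else max(c - 1, 0)
        self.table[pc] = c

    def predict_critical(self, pc):
        # Fetched instructions predicted critical could then get
        # priority for issue slots, value prediction, etc.
        return self.table.get(pc, 0) >= self.threshold

# Usage: a load at PC 0x400 repeatedly behaves critically.
cpp = CriticalPathPredictor()
for _ in range(5):
    cpp.train(0x400, was_critical=True)
cpp.train(0x408, was_critical=False)
print(cpp.predict_critical(0x400))  # True
print(cpp.predict_critical(0x408))  # False
```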
2003 — 2006
Calder, Bradley
Collaborative Research: Application Specific Architecture Customization and Co-Exploration @ University of California-San Diego
The embedded market demands processors that can achieve high levels of performance in the smallest possible area, while simultaneously minimizing power and energy dissipation. Customized processors offer a way of meeting these demands through targeted architectures specifically constructed to meet the performance, area, and power demands of a given application.
Our approach is different from traditional ASIC design, since our focus is to bring customization up to a higher level, where we generate programmable reconfigurable processors that allow both the algorithm and architecture to be co-configured (configured together). Architecture customization cannot realize significant gains until the algorithms, data structures, ISA, and certain architecture components can all be configured. This grant focuses on co-exploring the algorithm and architecture design space for speech, cryptographic, and network processors to discover common customizable components. The goal is to identify customizable components from these application domains, so that they can be incorporated into an overall infrastructure for co-exploration.
For broader impact, this proposal will provide infrastructure to aid in the automated co-exploration of architectures and algorithms for industry and academic researchers. We will make this infrastructure available to aid researchers in exploring these and other application areas.
2003 — 2006
Tullsen, Dean; Calder, Bradley
Compiler Optimizations to Exploit Simultaneous Multithreading @ University of California-San Diego
Recently introduced and forthcoming multithreading processor architectures represent new challenges and opportunities for the compilation system. This proposal will focus on three areas, exploiting features of a simultaneous multithreading (SMT) processor:
1. Generating Task Threads and Helper Threads - Task-based parallelism provides heterogeneous parallelism that is particularly effective for an SMT processor. Helper threads assist and accelerate the execution of other threads, without necessarily offloading any computation.
2. Simultaneous Compilation - Using spare contexts to do dynamic compilation, optimization, and profiling provides the opportunity to perform these functions concurrent with the running program, and continuously, without interrupting other threads.
3. Program Placement for SMT - Efficiently using the memory hierarchy is more difficult on an SMT processor because programs interact in non-deterministic ways. Novel code, data, and page placement compiler algorithms will reduce cache misses for multithreaded workloads.
This research will involve graduate and undergraduate students, including students from underrepresented groups, training them in research methodology and practice, and developing particular research expertise. This research will create a compiler and simulation infrastructure that will be made widely available to other academic institutions. This should most benefit institutions that lack the resources to develop such an infrastructure themselves.
2003 — 2006
Calder, Bradley
Phase-Directed Architecture Optimization and Simulation @ University of California-San Diego
A program's execution typically exhibits time-varying behavior, where architecture resources needed to obtain peak performance for the application can vary significantly over time. We have found that this time-varying behavior has a repetitive pattern for most applications, which allows a program's execution to be broken up into phases, and we can then tailor a program's execution to each phase.
This proposal concentrates on developing efficient run-time architectures to capture and predict phase information. We will examine creating phase tracking and prediction architectures to guide power and energy optimizations, as well as to perform phase-based hardware code specialization for trace cache optimizations. Energy consumption can potentially be saved by dynamically re-partitioning many components in a processor (e.g., caches, issue and decode width, branch predictors) on an application-specific basis, or by selectively using different implementations of these components based upon the phase behavior seen in an application. In addition, trace cache compiler optimizations will be performed for different phases to exploit phase-based behavior.
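The phase classification the abstract relies on can be sketched as follows (a simplified illustration, not the proposed hardware; the vector form and distance threshold are assumptions): each fixed-length execution interval is summarized as a normalized basic-block frequency vector, and intervals whose vectors are close are assigned to the same phase.

```python
def manhattan(a, b):
    """Manhattan distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

class PhaseClassifier:
    """Hypothetical sketch of online phase classification: intervals
    with similar basic-block usage map to the same phase id, so the
    hardware can re-apply per-phase configurations when a phase
    recurs. The 0.5 threshold is an assumed illustrative value."""

    def __init__(self, threshold=0.5):
        self.centroids = []       # one representative vector per phase
        self.threshold = threshold

    def classify(self, bb_counts):
        # Normalize the interval's basic-block execution counts so the
        # comparison is independent of interval length.
        total = sum(bb_counts) or 1
        v = [x / total for x in bb_counts]
        for pid, c in enumerate(self.centroids):
            if manhattan(v, c) < self.threshold:
                return pid        # matches an existing phase
        self.centroids.append(v)  # otherwise start a new phase
        return len(self.centroids) - 1

# Usage: two similar intervals share a phase; a different one does not.
pc = PhaseClassifier()
a = pc.classify([10, 0, 0])   # phase 0
b = pc.classify([9, 1, 0])    # close to the first -> phase 0
c = pc.classify([0, 0, 10])   # far away -> new phase 1
print(a, b, c)                # 0 0 1
```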
For broader impact, our phase-classification algorithms will allow researchers to efficiently identify, classify, and predict phase-based behavior in programs and use this information to guide their phase-based hardware, compiler, and operating system research.
2004 — 2007
Calder, Bradley
Using Phase Analysis to Perform Accurate and Efficient Simulation @ University of California-San Diego
Calder, Bradley CCF-0342522
A crucial problem for architecture and systems research involving program simulation is how to simulate the smallest amount of the program (to gather the results as fast as possible), and still achieve accurate results representative of the complete execution of the program. The goal of this proposal is to address this problem by fully developing techniques using phase analysis to guide efficient and accurate simulation and program analysis.
Our focus is to create new phase analysis techniques to provide very accurate and efficient simulation infrastructure for single program, and multi-threaded workloads. These new phase analysis techniques will also incorporate statistical sampling to provide a minimal set of simulation points with statistical guarantees.
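The idea of combining phase analysis with sampling to pick a minimal set of simulation points can be sketched as follows (a simplified illustration, not the proposal's method; the one-representative-per-phase rule and the example numbers are assumptions): given a phase label per execution interval, simulate one representative interval per phase and weight its result by that phase's share of execution.

```python
from collections import Counter

def pick_simulation_points(phase_ids):
    """Hypothetical sketch: choose one representative interval per
    phase, weighted by the phase's fraction of total execution."""
    counts = Counter(phase_ids)
    total = len(phase_ids)
    reps = {}
    for i, p in enumerate(phase_ids):
        reps.setdefault(p, i)   # first interval seen in each phase
    return [(reps[p], counts[p] / total) for p in sorted(reps)]

def estimate_metric(per_interval, points):
    """Weighted estimate over the chosen points stands in for
    simulating the full run."""
    return sum(per_interval[i] * w for i, w in points)

# Example with assumed numbers: 8 intervals in 3 phases, each phase
# with a characteristic CPI. Simulating 3 intervals instead of 8
# reproduces the full-run average because CPI is stable within a phase.
phase_ids = [0, 0, 1, 1, 1, 0, 2, 2]
cpi = [1.0, 1.0, 2.0, 2.0, 2.0, 1.0, 1.5, 1.5]
points = pick_simulation_points(phase_ids)
est = estimate_metric(cpi, points)
print(points)  # [(0, 0.375), (2, 0.375), (6, 0.25)]
print(est)     # 1.5, equal to the full-run mean CPI here
```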
2005 — 2009
Varghese, George; Calder, Bradley
Csr-Ehs - Building a High Throughput Programmable Network Processor Through Algorithm and Architecture Co-Exploration @ University of California-San Diego
This project focuses on creating a programmable router processor designed to efficiently execute a variety of network algorithms, with the goal of making it a feasible alternative to custom ASICs. To achieve the high throughput required by core routers, the research will focus on (1) fast data plane algorithms, (2) support for co-exploration of algorithms and architecture design, and (3) creating low-power routers that do not sacrifice worst-case throughput.
The project will spur the development of more revolutionary approaches to high-speed network processors, allowing programmable processors to penetrate the core router market. This in turn can help reduce the cost of building and maintaining routers. The results will also help reduce the power costs of running these high-speed network routers. In terms of academic impact, this research will foster collaboration between two hitherto separate academic communities (computer architecture and networking), sparking new ideas from this interdisciplinary interaction.
2006 — 2009
Tullsen, Dean (co-PI); Calder, Bradley
Concurrent Optimization For Multi-Core and Multithreaded Architectures @ University of California-San Diego
For future architectures, execution and compilation should no longer be thought of as separate activities. Instead, compilation should take place concurrently with execution. Concurrent optimization allows the processor to exploit available thread contexts for higher multithreaded performance in two ways -- it uses available thread contexts or cores to improve the quality of running code, and it allows the processor to take full advantage of runtime information, with minimal overhead, to dynamically adapt the running code to the parallel architecture. This research will create a concurrent optimization system that uses available cores/contexts to continuously optimize a program's execution. The optimization system sits as a thin virtual machine layer underneath the operating system, optimizing the native ISA application binaries, libraries, and even the operating system, tailoring each to the runtime behavior of the system. The optimization system will dynamically search the optimization space, applying different compiler optimizations and optimization levels to hot traces to find the most aggressive combination of optimizations and achieve the best possible performance.