15-618 Final Project
Milestone Report
Betweenness Centrality with CPU and GPU
Work Completed
Yi-Ning completed the CPU side of the project by implementing the serial Brandes baseline and the OpenMP version on a shared CSR graph representation. She also built out the project's benchmarking and comparison workflow, including the shared graph-loading logic and the benchmark harness we now use to run the CPU baselines consistently. This work established the correctness and performance baseline that the rest of the project depends on.
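To make the baseline concrete, the sketch below shows a serial Brandes pass over a CSR graph. This is an illustrative minimal version, not our actual code: the struct and function names (`CSRGraph`, `brandes_bc`) are hypothetical, and it assumes an unweighted graph with BFS shortest paths.

```cpp
#include <vector>
#include <queue>

// Minimal CSR graph: neighbors of v live at neighbors[offsets[v] .. offsets[v+1]).
struct CSRGraph {
    std::vector<int> offsets;
    std::vector<int> neighbors;
    int num_vertices() const { return (int)offsets.size() - 1; }
};

// Serial Brandes: one BFS per source to count shortest paths (sigma),
// then a backward sweep accumulating dependencies (delta) into bc.
std::vector<double> brandes_bc(const CSRGraph& g) {
    int n = g.num_vertices();
    std::vector<double> bc(n, 0.0);
    for (int s = 0; s < n; ++s) {
        std::vector<int> dist(n, -1), sigma(n, 0);
        std::vector<double> delta(n, 0.0);
        std::vector<int> order;  // vertices in BFS visit order
        dist[s] = 0; sigma[s] = 1;
        std::queue<int> q; q.push(s);
        while (!q.empty()) {
            int v = q.front(); q.pop();
            order.push_back(v);
            for (int e = g.offsets[v]; e < g.offsets[v + 1]; ++e) {
                int w = g.neighbors[e];
                if (dist[w] < 0) { dist[w] = dist[v] + 1; q.push(w); }
                if (dist[w] == dist[v] + 1) sigma[w] += sigma[v];
            }
        }
        // Dependency accumulation in reverse BFS order.
        for (int i = (int)order.size() - 1; i >= 0; --i) {
            int w = order[i];
            for (int e = g.offsets[w]; e < g.offsets[w + 1]; ++e) {
                int v = g.neighbors[e];
                if (dist[v] == dist[w] - 1)
                    delta[v] += ((double)sigma[v] / sigma[w]) * (1.0 + delta[w]);
            }
            if (w != s) bc[w] += delta[w];
        }
    }
    return bc;
}
```

The OpenMP version follows the same structure: source vertices are independent, so the outer loop over `s` parallelizes naturally with per-thread scratch arrays and a reduction into `bc`.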
Isaiah completed the CUDA side of the baseline by implementing the naive GPU betweenness-centrality version with an edge-parallel mapping and atomic updates for shared vertex state, then integrating that CUDA path into the shared codebase and getting the build system working on the GHC Andrew machines. He also connected the GPU path to the same CLI and benchmark flow as the CPU implementations, added CUDA progress logging for long runs, and updated the timing pipeline so graph-building time is reported separately from BC compute time.
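The edge-parallel mapping with atomic updates can be sketched as below. This is a hedged illustration rather than our kernel: it is written in plain C++ so it runs anywhere, with `std::atomic` standing in for CUDA's `atomicCAS`/`atomicAdd`, and the serial edge loop marking where the GPU assigns one thread per edge. The name `advance_level` and the edge-list layout are assumptions for the sketch.

```cpp
#include <atomic>
#include <vector>
#include <cstddef>

// One frontier-advance step of an edge-parallel BFS for shortest-path counting.
// In the CUDA kernel, iteration e corresponds to one GPU thread per edge, and
// the CAS / fetch_add calls correspond to atomicCAS / atomicAdd on shared state.
// Returns true if any vertex was claimed for the next frontier.
bool advance_level(const std::vector<int>& src, const std::vector<int>& dst,
                   std::vector<std::atomic<int>>& dist,   // -1 = unvisited
                   std::vector<std::atomic<int>>& sigma,  // shortest-path counts
                   int level) {
    std::atomic<bool> changed{false};
    for (std::size_t e = 0; e < src.size(); ++e) {  // parallel region on the GPU
        int u = src[e], w = dst[e];
        if (dist[u].load() != level) continue;  // edge not on the current frontier
        int unvisited = -1;
        // Claim w for the next level if it is still unvisited (atomicCAS).
        if (dist[w].compare_exchange_strong(unvisited, level + 1))
            changed.store(true);
        // If w is on the next level, add u's path count (atomicAdd).
        if (dist[w].load() == level + 1)
            sigma[w].fetch_add(sigma[u].load());
    }
    return changed.load();
}
```

Because every thread inspects every edge and contends on the same `dist`/`sigma` words, this mapping is simple to get correct but atomic-heavy, which is what the profiling and optimization work in the next phase targets.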
Deliverable Status
With respect to our goals, the CPU and multi-core implementations are on track. However, the CUDA implementation is behind schedule: the naive implementation only recently reached a benchmarkable state, and getting there took longer than originally anticipated. Because of that delay, the optimization phase is not yet complete.
The nice-to-haves are still on schedule. The main issue is that, on the CUDA side, we misjudged at the outset how time would split across environment setup, debugging, benchmarking, profiling, and optimization. As a result, more time than expected went into getting the naive implementation to a stable, benchmarkable state, which pushed back the optimized CUDA work. Please see the revised detailed schedule below for the updated half-week plan for the remainder of the project.
Revised Detailed Schedule
- April 14–16: Isaiah runs Nsight profiling. Yi-Ning collects serial and OpenMP benchmarks. Shared: finish milestone report.
- April 16–18: Isaiah writes the first conflict-aware CUDA optimization. Yi-Ning sets up the final SNAP and synthetic datasets.
- April 18–21: Isaiah debugs optimized CUDA and runs early benchmarks. Yi-Ning finishes CPU benchmark suite runs. Shared: evaluate naive versus optimized GPU scaling.
- April 21–24: Isaiah tries a second conflict-aware strategy if time permits. Yi-Ning runs dense/uniform graph experiments. Shared: start plotting the results.
- April 24–27: Isaiah investigates remaining GPU bottlenecks and drafts GPU report sections. Yi-Ning finalizes CPU charts. Shared: write the implementation narrative.
- April 27–30: Complete the final report, poster, and Autolab submission.
Preliminary Results
We have begun collecting preliminary benchmark data for the CPU and CUDA baselines, including the runs used in the milestone report.
Please refer to the full milestone PDF below for the benchmark tables, plots, and figure captions.
Updated Goals for Poster Session
For the poster session, we plan to present two concurrent stories that reflect the two main technical tracks of the project. The first story will focus on the CPU side, showing the serial baseline and the multicore OpenMP implementation, along with the benchmarking results that show how the CPU versions scale. The second story will focus on the GPU side, following the major stepping stones from the naive CUDA implementation toward our most optimized version and explaining how each change was motivated by benchmarking and profiling measurements.
The poster will therefore emphasize progression rather than just final numbers. Specifically, we plan to show:
- Performance Plots (CPU and GPU): Speedup graphs comparing Serial, OpenMP, Naive GPU, and Optimized GPU across various SNAP and synthetic datasets to explain how both tracks scale.
- GPU Bottleneck Analysis: Nsight profiling data showing why the naive edge-parallel CUDA version scaled poorly, along with evidence that our optimizations addressed those bottlenecks.
- Optimization Impact: Runtime comparisons between the naive baseline and our optimized strategies, together with the intermediate measurements that motivated each optimization decision.
Our goal is that someone reading the poster can understand both how the CPU and GPU versions compare and how the GPU implementation evolved from a poorly scaling baseline into a more optimized design.
Concerns and Risks
At the moment, we do not have any major technical concerns about completing the project. The main issue identified during the milestone period was improper time allocation, especially on the CUDA side where environment setup, debugging, and getting the naive implementation into a benchmarkable state took longer than originally expected.
We believe this is manageable with the revised schedule. At this point it is mainly a matter of allocating more time, which should be feasible as the workload from our other classes winds down over the coming weeks. With the Andrew-machine build path working, the benchmark pipeline in place, and the plan broken into half-week increments, we now have a much clearer path for the rest of the project.