15-618 Final Project
Project Proposal
Betweenness Centrality with CPU and GPU
Summary
Our project is about graph processing on CUDA. We are implementing Betweenness Centrality (BC) and benchmarking an optimized, conflict-aware edge-parallel GPU implementation against both serial and multicore CPU baselines. As part of our analysis, we will document the optimization journey from a naive GPU kernel to our final optimized version, highlighting how we reduce or avoid conflicting writes to shared vertex states.
Background
We are evaluating parallel graph processing using the Betweenness Centrality algorithm. BC measures how often a vertex lies on shortest paths between other pairs of vertices; the standard exact method, Brandes' algorithm, runs a Breadth-First Search (BFS) from every source vertex followed by a dependency-accumulation pass. This makes BC computationally heavy and highly irregular, and it exposes parallelism both across source vertices and within each BFS.
The core idea we want to explore is edge-parallel graph computation. In this style of execution, one thread is assigned to one edge, which makes the GPU mapping feel natural because large sparse graphs usually have many more edges than vertices. The catch is that edges are not independent. As the BFS frontier expands, multiple threads processing different edges will attempt to update the shortest-path count for the same destination node simultaneously. That is where contention shows up, and that is what makes the project interesting.
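For reference, the serial baseline we describe follows Brandes' algorithm: a BFS per source counts shortest paths (sigma), then a reverse sweep accumulates dependencies (delta) into the centrality scores. A minimal sketch follows; the function name and adjacency-list representation are our own illustration, not the project's actual code.

```cpp
#include <queue>
#include <stack>
#include <vector>

// Brandes' algorithm for an unweighted graph given as adjacency lists.
// For each source: a BFS counts shortest paths (sigma), then a reverse
// pass over vertices in non-increasing distance accumulates dependencies
// (delta) into the betweenness scores.
std::vector<double> betweenness(const std::vector<std::vector<int>>& adj) {
    int n = (int)adj.size();
    std::vector<double> bc(n, 0.0);
    for (int s = 0; s < n; ++s) {
        std::vector<int> dist(n, -1);
        std::vector<double> sigma(n, 0.0), delta(n, 0.0);
        std::vector<std::vector<int>> pred(n);
        std::stack<int> order;            // vertices in BFS discovery order
        std::queue<int> q;
        dist[s] = 0; sigma[s] = 1.0; q.push(s);
        while (!q.empty()) {              // forward BFS phase
            int v = q.front(); q.pop();
            order.push(v);
            for (int w : adj[v]) {
                if (dist[w] < 0) { dist[w] = dist[v] + 1; q.push(w); }
                if (dist[w] == dist[v] + 1) {
                    sigma[w] += sigma[v]; // w inherits v's path count
                    pred[w].push_back(v);
                }
            }
        }
        while (!order.empty()) {          // reverse dependency accumulation
            int w = order.top(); order.pop();
            for (int v : pred[w])
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w]);
            if (w != s) bc[w] += delta[w];
        }
    }
    return bc;
}
```

For undirected graphs the scores are typically halved afterwards, since each vertex pair is counted once per direction.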
The Challenge
Graph workloads are irregular in ways that GPUs do not always like. Different vertices have different degrees, which means some edges are associated with much more work or much more contention than others. Memory accesses are also less regular than in dense numerical code, so performance can become limited by data movement and locality instead of pure arithmetic throughput.
The specific challenge we want to study is write conflicts. In an edge-parallel kernel computing BC, many threads need to update the same vertex state. The simplest solution is to protect those updates with atomic operations; we will build this naive version first to establish a baseline for GPU bottlenecks. A more careful strategy is to be conflict-aware: instead of letting all edges run together, we organize or schedule them so that conflicting updates are reduced or separated. The main challenges we anticipate:
- Write Contention: Multiple threads updating the same vertex state simultaneously
- Memory Access Patterns: Irregular graph structure leads to non-coalesced memory access
- Load Imbalance: Vertices with varying degrees create uneven thread workload
- Synchronization Overhead: Cost of atomic operations and barriers
- Warp Divergence: Different execution paths based on graph structure
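The contention pattern at the heart of the naive approach can be sketched in portable C++, with `std::atomic` standing in for CUDA's `atomicAdd` and a plain loop standing in for one-thread-per-edge execution. The function name and simplified frontier logic are our own illustration, not the project's kernel.

```cpp
#include <atomic>
#include <utility>
#include <vector>

// One level of an edge-parallel BFS expansion: every edge (u, v) is an
// independent work item (on the GPU, one thread per edge). When u is on
// the current frontier, many edges may target the same newly reached v,
// so the shortest-path count update must be atomic -- this is the write
// contention the naive kernel pays for on every frontier expansion.
void expand_level(const std::vector<std::pair<int, int>>& edges,
                  std::vector<int>& dist,
                  std::vector<std::atomic<long long>>& sigma,
                  int level, bool& frontier_grew) {
    for (const auto& e : edges) {        // GPU analogue: one thread per edge
        int u = e.first, v = e.second;
        if (dist[u] != level) continue;  // u is not on the current frontier
        if (dist[v] == -1) {             // first discovery of v
            dist[v] = level + 1;         // benign race: all writers agree
            frontier_grew = true;
        }
        if (dist[v] == level + 1)
            sigma[v] += sigma[u];        // contended atomic add
    }
}
```

A conflict-aware design attacks exactly this line: reorganizing or scheduling edges so that fewer threads race on the same `sigma[v]` entry at the same time.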
Resources
- Hardware
- GHC machines with NVIDIA RTX 2080 GPUs for CUDA implementations. Same environment used for CPU baselines to ensure fair comparison.
- Dataset
- Public graph datasets from the Stanford Network Analysis Project (SNAP) providing standard, real-world sparse graphs with varying sizes and structures.
- Software
- C++ and CUDA for implementations, OpenMP for multicore CPU version, Nsight tools for GPU profiling.
- Code Base
- Kernels, benchmarking code, and evaluation pipeline implemented from scratch. May reuse course infrastructure for build convenience.
Goals and Deliverables
Plan to Achieve (Core Goals)
We plan to build three primary implementations of Betweenness Centrality:
- Serial CPU baseline
- Multicore CPU implementation (OpenMP)
- Optimized CUDA implementation with conflict-aware edge-parallel design
We will also build a naive CUDA implementation that uses atomic operations to serialize conflicting updates to shared vertex state. This version serves as a stepping stone: we will profile its bottlenecks to motivate our optimized GPU design.
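For the multicore CPU version, one natural decomposition (and the one we expect to start from) is coarse-grained parallelism over BFS sources: each OpenMP thread runs an independent per-source pass, and the only shared write is the final accumulation into the centrality array. A sketch of that pattern follows; `per_source_dependency` is a hypothetical placeholder for the full per-source Brandes pass, not real BC logic.

```cpp
#include <vector>

// Hypothetical placeholder for one full per-source Brandes pass
// (BFS + dependency accumulation). Here it returns a dummy uniform
// contribution so the parallel skeleton is self-contained; the real
// implementation would return each vertex's dependency for this source.
std::vector<double> per_source_dependency(int source, int n) {
    (void)source;                            // unused in the placeholder
    return std::vector<double>(n, 1.0 / n);  // dummy value, not real BC
}

std::vector<double> bc_multicore(int n) {
    std::vector<double> bc(n, 0.0);
    // Sources are independent, so the outer loop parallelizes cleanly;
    // only the accumulation into the shared bc array needs protection.
    #pragma omp parallel for
    for (int s = 0; s < n; ++s) {
        std::vector<double> delta = per_source_dependency(s, n);
        for (int v = 0; v < n; ++v) {
            #pragma omp atomic
            bc[v] += delta[v];
        }
    }
    return bc;
}
```

This source-level decomposition keeps contention low on the CPU (one atomic add per vertex per source), which is part of why the CPU baseline is a useful contrast with the edge-parallel GPU kernel, where contention occurs inside every BFS level.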
Hope to Achieve (Stretch Goals)
- Compare multiple conflict-aware strategies rather than just one
- Test on dense or uniform graphs to see if trends hold across different graph structures
- Implement a top-K option to report the K highest-betweenness vertices
Evaluation Deliverables
- Serial and multicore CPU baseline implementations for BC
- Documented optimization path from naive CUDA to conflict-aware CUDA version
- Runtime and speedup graphs comparing serial, multicore, and optimized GPU implementations
- Discussion of bottlenecks: contention, synchronization overhead, memory behavior, and workload imbalance
- Analysis of how graph structure affects results
Platform Choice
Why CUDA? CUDA is a good fit because the workload exposes a lot of potential parallelism across graph edges. If we assign one edge to one thread, the GPU can process a large number of edges at once. At the same time, this is not an easy GPU problem because shared vertex updates create conflicts, and sparse graph structure tends to lead to irregular memory access and uneven work distribution.
This is exactly why this is a worthwhile project. We are not choosing CUDA just because it is fast hardware. We are choosing it because it forces us to deal with the real problems that show up in irregular parallel workloads. The CPU baselines give us a grounded comparison and help us understand whether the GPU is winning because of real throughput advantages or losing because contention and irregularity dominate the computation.
Benchmarking Plan
Metrics
- Wall-clock runtime (seconds)
- Speedup relative to serial baseline
- Throughput (edges processed per second)
Baselines
- Serial CPU baseline (reference)
- Multicore CPU (OpenMP) baseline
- Naive GPU (atomic-based) intermediate benchmark
Input Datasets
Public SNAP graphs with varying characteristics:
- Small graphs (thousands of nodes)
- Medium graphs (millions of nodes)
- Large graphs (tens of millions of nodes)
- Graphs with different degree distributions
Profiling
Use NVIDIA Nsight to identify bottlenecks in atomic operations, memory stalls, and warp divergence in both naive and optimized GPU versions.
Schedule
Mar 30 – Apr 5: Baselines and Naive GPU
- Isaiah: Set up CUDA project skeleton and implement Naive GPU BC kernel using atomics
- Yi-Ning: Implement Serial BC baseline and Multicore CPU (OpenMP) BC implementation
- Deliverable: Serial, Multicore, and Naive GPU versions execute on a test graph and produce matching centrality scores
Apr 6 – Apr 12: Profiling and Milestone Prep
- Isaiah: Profile Naive GPU using Nsight to document bottlenecks, begin optimization towards conflict-aware version
- Yi-Ning: Build automated benchmarking harness and run initial scaling tests
- Deliverable: Conflict-Aware GPU kernel prototype and raw performance data for milestone
Apr 13 – Apr 19: Milestone and Optimization
- Isaiah: Debug and optimize GPU kernel (memory coalescing, warp divergence), add top-K parameter
- Yi-Ning: Run full comparative benchmarks across SNAP datasets
- Deliverable: Milestone submitted, full final benchmarking data collected
Apr 20 – Apr 26: Deep Analysis and Edge Cases
- Isaiah: Conduct ablation study mapping performance gains from naive to optimized GPU
- Yi-Ning: Test on dense/uniform graphs, generate final runtime and speedup charts
- Deliverable: Code finalized, visual charts and bottleneck analysis completed
Apr 27 – May 1: Final Report and Poster
- Shared: Write final report, design and print poster
- Deliverable: Final PDF and code submission, poster prepared
Team
- Isaiah Velez: CUDA implementation, GPU optimization, kernel profiling with Nsight, and ablation studies
- Yi-Ning Huang: CPU baselines (serial and multicore), benchmarking harness, experimental design, and data analysis