15-618 Final Project
Project Proposal
Betweenness Centrality with CPU and GPU
Summary
Our project is about graph processing on CUDA. We are implementing Betweenness Centrality (BC) and benchmarking an optimized, conflict-aware edge-parallel GPU implementation against both serial and multicore CPU baselines. As part of our analysis, we will document the optimization journey from a naive GPU kernel to our final optimized version, highlighting how we reduce or avoid conflicting writes to shared vertex states.
Background
We are evaluating parallel graph processing using the Betweenness Centrality algorithm. BC measures how often a vertex lies on shortest paths between other pairs of vertices; the standard exact method, Brandes' algorithm, runs a Breadth-First Search (BFS) from every source vertex followed by a dependency-accumulation pass. This makes BC computationally heavy and highly irregular, and it exposes parallelism both across source vertices and within each BFS.
The core idea we want to explore is edge-parallel graph computation. In this style of execution, one thread is assigned to one edge, which makes the GPU mapping feel natural because large sparse graphs usually have many more edges than vertices. The catch is that edges are not independent. As the BFS frontier expands, multiple threads processing different edges will attempt to update the shortest-path count for the same destination node simultaneously. That is where contention shows up, and that is what makes the project interesting.
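For reference, the serial baseline we describe follows Brandes' algorithm: a BFS per source counts shortest paths (sigma), then a reverse sweep accumulates dependencies (delta) into the centrality scores. A minimal sketch follows; the function name and adjacency-list representation are our own illustration, not the project's actual code.

```cpp
#include <queue>
#include <stack>
#include <vector>

// Brandes' algorithm for an unweighted graph given as adjacency lists.
// For each source: a BFS counts shortest paths (sigma), then a reverse
// pass over vertices in non-increasing distance accumulates dependencies
// (delta) into the betweenness scores.
std::vector<double> betweenness(const std::vector<std::vector<int>>& adj) {
    int n = (int)adj.size();
    std::vector<double> bc(n, 0.0);
    for (int s = 0; s < n; ++s) {
        std::vector<int> dist(n, -1);
        std::vector<double> sigma(n, 0.0), delta(n, 0.0);
        std::vector<std::vector<int>> pred(n);
        std::stack<int> order;            // vertices in BFS discovery order
        std::queue<int> q;
        dist[s] = 0; sigma[s] = 1.0; q.push(s);
        while (!q.empty()) {              // forward BFS phase
            int v = q.front(); q.pop();
            order.push(v);
            for (int w : adj[v]) {
                if (dist[w] < 0) { dist[w] = dist[v] + 1; q.push(w); }
                if (dist[w] == dist[v] + 1) {
                    sigma[w] += sigma[v]; // w inherits v's path count
                    pred[w].push_back(v);
                }
            }
        }
        while (!order.empty()) {          // reverse dependency accumulation
            int w = order.top(); order.pop();
            for (int v : pred[w])
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w]);
            if (w != s) bc[w] += delta[w];
        }
    }
    return bc;
}
```

For undirected graphs the scores are typically halved afterwards, since each vertex pair is counted once per direction.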
The Challenge
Graph workloads are irregular in ways that GPUs do not always like. Different vertices have different degrees, which means some edges are associated with much more work or much more contention than others. Memory accesses are also less regular than in dense numerical code, so performance can become limited by data movement and locality instead of pure arithmetic throughput.
The specific challenge we want to study is write conflicts. In an edge-parallel kernel computing BC, many threads need to update the same vertex state. The simplest solution is to protect those updates with atomic operations; we will build this naive version first to establish a baseline for GPU bottlenecks. A more careful strategy is to be conflict-aware: instead of letting all edges run together, we organize or schedule them so that conflicting updates are reduced or separated. The main challenges we anticipate:
- Write Contention: Multiple threads updating the same vertex state simultaneously
- Memory Access Patterns: Irregular graph structure leads to non-coalesced memory access
- Load Imbalance: Vertices with varying degrees create uneven thread workload
- Synchronization Overhead: Cost of atomic operations and barriers
- Warp Divergence: Different execution paths based on graph structure
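The contention pattern at the heart of the naive approach can be sketched in portable C++, with `std::atomic` standing in for CUDA's `atomicAdd` and a plain loop standing in for one-thread-per-edge execution. The function name and simplified frontier logic are our own illustration, not the project's kernel.

```cpp
#include <atomic>
#include <utility>
#include <vector>

// One level of an edge-parallel BFS expansion: every edge (u, v) is an
// independent work item (on the GPU, one thread per edge). When u is on
// the current frontier, many edges may target the same newly reached v,
// so the shortest-path count update must be atomic -- this is the write
// contention the naive kernel pays for on every frontier expansion.
void expand_level(const std::vector<std::pair<int, int>>& edges,
                  std::vector<int>& dist,
                  std::vector<std::atomic<long long>>& sigma,
                  int level, bool& frontier_grew) {
    for (const auto& e : edges) {        // GPU analogue: one thread per edge
        int u = e.first, v = e.second;
        if (dist[u] != level) continue;  // u is not on the current frontier
        if (dist[v] == -1) {             // first discovery of v
            dist[v] = level + 1;         // benign race: all writers agree
            frontier_grew = true;
        }
        if (dist[v] == level + 1)
            sigma[v] += sigma[u];        // contended atomic add
    }
}
```

A conflict-aware design attacks exactly this line: reorganizing or scheduling edges so that fewer threads race on the same `sigma[v]` entry at the same time.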
Resources
- Hardware
- GHC machines with NVIDIA RTX 2080 GPUs for CUDA implementations. Same environment used for CPU baselines to ensure fair comparison.
- Dataset
- Public graph datasets from the Stanford Network Analysis Project (SNAP) providing standard, real-world sparse graphs with varying sizes and structures.
- Software
- C++ and CUDA for implementations, OpenMP for multicore CPU version, Nsight tools for GPU profiling.
- Code Base
- Kernels, benchmarking code, and evaluation pipeline implemented from scratch. May reuse course infrastructure for build convenience.
Goals and Deliverables
Plan to Achieve (Core Goals)
We plan to build three primary implementations of Betweenness Centrality:
- Serial CPU baseline
- Multicore CPU implementation (OpenMP)
- Optimized CUDA implementation with conflict-aware edge-parallel design
We will also build a naive CUDA implementation that uses atomic operations to serialize conflicting updates to shared vertex state. This version serves as a stepping stone: we will profile its bottlenecks to motivate our optimized GPU design.
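For the multicore CPU version, one natural decomposition (and the one we expect to start from) is coarse-grained parallelism over BFS sources: each OpenMP thread runs an independent per-source pass, and the only shared write is the final accumulation into the centrality array. A sketch of that pattern follows; `per_source_dependency` is a hypothetical placeholder for the full per-source Brandes pass, not real BC logic.

```cpp
#include <vector>

// Hypothetical placeholder for one full per-source Brandes pass
// (BFS + dependency accumulation). Here it returns a dummy uniform
// contribution so the parallel skeleton is self-contained; the real
// implementation would return each vertex's dependency for this source.
std::vector<double> per_source_dependency(int source, int n) {
    (void)source;                            // unused in the placeholder
    return std::vector<double>(n, 1.0 / n);  // dummy value, not real BC
}

std::vector<double> bc_multicore(int n) {
    std::vector<double> bc(n, 0.0);
    // Sources are independent, so the outer loop parallelizes cleanly;
    // only the accumulation into the shared bc array needs protection.
    #pragma omp parallel for
    for (int s = 0; s < n; ++s) {
        std::vector<double> delta = per_source_dependency(s, n);
        for (int v = 0; v < n; ++v) {
            #pragma omp atomic
            bc[v] += delta[v];
        }
    }
    return bc;
}
```

This source-level decomposition keeps contention low on the CPU (one atomic add per vertex per source), which is part of why the CPU baseline is a useful contrast with the edge-parallel GPU kernel, where contention occurs inside every BFS level.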
Hope to Achieve (Stretch Goals)
- Compare multiple conflict-aware strategies rather than just one
- Test on dense or uniform graphs to see if trends hold across different graph structures
- Implement a top-K option to report the K highest-betweenness vertices
Evaluation Deliverables
- Serial and multicore CPU baseline implementations for BC
- Documented optimization path from naive CUDA to conflict-aware CUDA version
- Runtime and speedup graphs comparing serial, multicore, and optimized GPU implementations
- Discussion of bottlenecks: contention, synchronization overhead, memory behavior, and workload imbalance
- Analysis of how graph structure affects results
Platform Choice
Why CUDA? CUDA is a good fit because the workload exposes a lot of potential parallelism across graph edges. If we assign one edge to one thread, the GPU can process a large number of edges at once. At the same time, this is not an easy GPU problem because shared vertex updates create conflicts, and sparse graph structure tends to lead to irregular memory access and uneven work distribution.
This is exactly why this is a worthwhile project. We are not choosing CUDA just because it is fast hardware. We are choosing it because it forces us to deal with the real problems that show up in irregular parallel workloads. The CPU baselines give us a grounded comparison and help us understand whether the GPU is winning because of real throughput advantages or losing because contention and irregularity dominate the computation.
Benchmarking Plan
Metrics
- Wall-clock runtime (seconds)
- Speedup relative to serial baseline
- Throughput (edges processed per second)
Baselines
- Serial CPU baseline (reference)
- Multicore CPU (OpenMP) baseline
- Naive GPU (atomic-based) intermediate benchmark
Input Datasets
Public SNAP graphs with varying characteristics:
- Small graphs (thousands of nodes)
- Medium graphs (millions of nodes)
- Large graphs (tens of millions of nodes)
- Graphs with different degree distributions
Profiling
Use NVIDIA Nsight to identify bottlenecks in atomic operations, memory stalls, and warp divergence in both naive and optimized GPU versions.
Schedule
Mar 30 – Apr 5: Baselines and Naive GPU
- Isaiah: Set up CUDA project skeleton and implement Naive GPU BC kernel using atomics
- Yi-Ning: Implement Serial BC baseline and Multicore CPU (OpenMP) BC implementation
- Deliverable: Serial, Multicore, and Naive GPU versions execute on a test graph and produce matching centrality scores
Apr 6 – Apr 12: Profiling and Milestone Prep
- Isaiah: Profile Naive GPU using Nsight to document bottlenecks, begin optimization towards conflict-aware version
- Yi-Ning: Build automated benchmarking harness and run initial scaling tests
- Deliverable: Conflict-Aware GPU kernel prototype and raw performance data for milestone
Apr 13 – Apr 19: Milestone and Optimization
- Isaiah: Debug and optimize GPU kernel (memory coalescing, warp divergence), add top-K parameter
- Yi-Ning: Run full comparative benchmarks across SNAP datasets
- Deliverable: Milestone submitted, full final benchmarking data collected
Apr 20 – Apr 26: Deep Analysis and Edge Cases
- Isaiah: Conduct ablation study mapping performance gains from naive to optimized GPU
- Yi-Ning: Test on dense/uniform graphs, generate final runtime and speedup charts
- Deliverable: Code finalized, visual charts and bottleneck analysis completed
Apr 27 – May 1: Final Report and Poster
- Shared: Write final report, design and print poster
- Deliverable: Final PDF and code submission, poster prepared
Team
- Isaiah Velez: CUDA implementation, GPU optimization, kernel profiling with Nsight, and ablation studies
- Yi-Ning Huang: CPU baselines (serial and multicore), benchmarking harness, experimental design, and data analysis