Final Report

Parallelizing Pauli Paths (P3): GPU-Accelerated Quantum Circuit Simulation

Arul Rhik Mazumder (arulm), Daniel Ragazzo (dragazzo)

December 8, 2025

Abstract: We present a GPU-accelerated implementation of the Pauli path propagation algorithm for quantum circuit simulation. Our CUDA implementation achieves speedups of up to 626× over a single-threaded CPU baseline on circuits with 1000+ initial Pauli words, with an average speedup of 213× across our stress test suite. Key contributions include a multi-block dynamic work distribution algorithm, efficient inter-block coordination through word-count-only transfers, and parallel truncation using prefix sums.

Summary

We implemented a GPU-accelerated Pauli path propagation algorithm for quantum circuit simulation using CUDA on NVIDIA GPUs and compared it against a single-threaded CPU baseline in C++. Our CUDA implementation on the GHC cluster machines (NVIDIA GeForce RTX 2080 GPUs) achieves speedups of up to 626× on large-scale simulations, with an average speedup of 213× across our stress test suite containing circuits with 500–30,000 initial Pauli words and 50–500 layers.

Project deliverables include a single-threaded C++ CPU baseline (pauli.cpp), a CUDA GPU implementation, and the correctness and stress-test suites used for benchmarking.

All implementations run on GHC cluster machines with NVIDIA GeForce RTX 2080 GPUs.

Background

Motivation

In quantum computing, a central computational task is estimating expectation values of observables. Given a quantum circuit \( U \) that prepares a state \( \lvert \psi \rangle = U \lvert 0 \rangle \), we seek to compute \( \langle \hat{H} \rangle = \langle \psi \rvert \hat{H} \lvert \psi \rangle \) where \( \hat{H} \) is the observable of interest, typically decomposed as a sum of Pauli words. This expectation value is fundamental in quantum algorithms such as variational quantum eigensolvers for quantum chemistry and quantum approximate optimization algorithms.

Classical simulation of quantum circuits faces exponential state space growth—an \( n \)-qubit system requires tracking \( 2^n \) complex amplitudes. The Pauli path method offers an alternative approach that can be more tractable for certain circuit types by tracking the evolution of observables rather than the full quantum state.

Pauli Words and Observables

A Pauli word is a tensor product of single-qubit Pauli matrices {I, X, Y, Z} acting on each qubit. For an \( n \)-qubit system, a Pauli word has the form \( P = P_1 \otimes P_2 \otimes \cdots \otimes P_n \). The Pauli matrices represent: I (identity), X (bit-flip), Y (combined bit and phase flip), Z (phase-flip). Any observable \( \hat{H} \) can be expressed as a linear combination of Pauli words: \( \hat{H} = \sum_i c_i P_i \).
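For example, on three qubits the observable \( \hat{H} = 0.5\, Z \otimes Z \otimes I + 0.3\, X \otimes I \otimes X \) is a sum of two Pauli words with coefficients 0.5 and 0.3.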

Key Data Structures

PauliWord: Contains a vector of Pauli operators (one per qubit, encoded as enum: I=0, X=1, Y=2, Z=3) and a complex phase coefficient. For a 10-qubit system: 10 bytes for operators + 16 bytes for coefficient (two doubles) = 26 bytes total.

Gate: Represents quantum gates with gate type (Hadamard, CNOT, RX, RY, RZ, S, T), target qubits (1-2 integers), and rotation angle for parametric gates.
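A minimal C++ sketch of these structures, assuming the encodings described above (type and field names are illustrative rather than the exact ones in pauli.cpp):

    #include <complex>
    #include <cstdint>
    #include <vector>

    // Single-qubit Pauli operators, one byte each (I=0, X=1, Y=2, Z=3).
    enum Pauli : uint8_t { I = 0, X = 1, Y = 2, Z = 3 };

    // A Pauli word: one operator per qubit plus a complex coefficient.
    // For 10 qubits: 10 bytes of operators + 16 bytes of coefficient = 26 bytes.
    struct PauliWord {
        std::vector<Pauli> ops;       // length = number of qubits
        std::complex<double> coeff;   // two doubles
    };

    // A gate: type, one or two target qubits, and an angle for RX/RY/RZ.
    enum class GateType { H, CNOT, RX, RY, RZ, S, T };
    struct Gate {
        GateType type;
        int qubits[2];   // second entry used only by CNOT
        double angle;    // ignored for non-parametric gates
    };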

Key Operations

The core operation is conjugation of each Pauli word by each gate, applied in reverse circuit order: Clifford gates (H, CNOT, S) and the T gate map a word to a single output word, while rotation gates (RX, RY, RZ) split a word into two terms with modified coefficients. Weight-based truncation then discards high-weight words to bound memory growth.

Algorithm Input and Output

Inputs: the quantum circuit (an ordered list of gates), the observable \( \hat{H} \) given as a list of Pauli words with complex coefficients, the number of qubits, and the weight threshold used for truncation.

Output: Complex number representing \( \langle \hat{H} \rangle \)
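In the Heisenberg-picture formulation (which matches the reverse-order gate processing described later), the output is assembled as \( \langle \hat{H} \rangle = \sum_i c_i \langle 0 \rvert U^\dagger P_i U \lvert 0 \rangle \): after each \( P_i \) is propagated back through the circuit, a surviving word contributes its coefficient only if it contains no X or Y operators, since \( \langle 0 \rvert I \lvert 0 \rangle = \langle 0 \rvert Z \lvert 0 \rangle = 1 \) while \( \langle 0 \rvert X \lvert 0 \rangle = \langle 0 \rvert Y \lvert 0 \rangle = 0 \).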

Parallelism Analysis

The computational bottleneck is gate conjugation applied to large numbers of words—perfectly suited for GPU parallelization.

Figure 1: Pauli word count evolution through circuit execution. Clifford-only: constant word count. Mixed: moderate growth from occasional rotations. Rotation-heavy: exponential expansion requiring truncation.
Figure 2: Per-gate transformation rules. Clifford gates (H, CNOT, S) and T gate produce one output word each. Rotation gates (RX, RY, RZ) produce two terms.
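As a concrete example of these rules, with the convention \( R_X(\theta) = e^{-i\theta X/2} \), conjugating a single-qubit Z observable gives \( R_X(\theta)^\dagger Z\, R_X(\theta) = \cos\theta\, Z + \sin\theta\, Y \), so one word with coefficient \( c \) becomes two words with coefficients \( c\cos\theta \) and \( c\sin\theta \); a Clifford gate such as Hadamard instead maps it to a single word, \( H Z H = X \).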

Approach

High-Level Overview

Our final solution implemented the Pauli path simulation in CUDA to exploit the GPU's parallelism. At a high level, the algorithm maintains a collection of Pauli words and their corresponding coefficients, which must be transformed in memory for each gate in the circuit. We parallelized these transformations so that each thread was responsible for transforming one Pauli word.

Crucially, the number of terms can change during execution through two mechanisms:

  1. Rotation gate splitting: Non-Clifford gates (RX, RY, RZ) cause certain Pauli words to "split" and become two separate Pauli words with different coefficients.
  2. Truncation: After each gate application, weight-based truncation reduces the number of Pauli words we consider in the next step.

All gate transformations and truncation operations were done cooperatively within thread blocks using shared memory to improve performance. Implementing inter-block coordination efficiently became the main concern for achieving good performance.

Technologies Used

C++ for the single-threaded CPU baseline, CUDA C++ for the GPU implementation (targeting NVIDIA GeForce RTX 2080 GPUs on the GHC cluster machines), and NVIDIA Nsight Compute for profiling.

CPU Baseline Implementation

Our CPU implementation (pauli.cpp, ~436 lines) uses std::map<PauliWord, Complex> for automatic deduplication. It processes gates in reverse order, applying conjugation rules sequentially with weight-based truncation after each non-Clifford gate. Time complexity is \( O(N \cdot D \cdot (q + \log N)) \) where \( N \) is word count, \( D \) is circuit depth, and \( q \) is qubit count.

The main difference: the CPU version uses an ordered map (std::map) to keep track of all words (enabling automatic deduplication), while our GPU version uses raw arrays that can be operated on in parallel, with each thread handling a single Pauli word.
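A condensed sketch of this baseline loop is shown below; conjugate, is_non_clifford, truncate_by_weight, and evaluate_on_zero_state are hypothetical helpers standing in for the corresponding routines in pauli.cpp, and PauliWord is assumed to provide the ordering std::map needs:

    #include <complex>
    #include <map>
    #include <utility>
    #include <vector>

    using Complex = std::complex<double>;

    // Processes gates in reverse order; keying the map on the word merges duplicates.
    Complex simulate(const std::vector<Gate>& circuit,
                     std::map<PauliWord, Complex> words,   // initialized from the observable
                     int max_weight) {
        for (auto g = circuit.rbegin(); g != circuit.rend(); ++g) {
            std::map<PauliWord, Complex> next;
            for (const auto& [word, coeff] : words) {
                // One output word for Clifford/T gates, two for rotation gates.
                for (const auto& [out, factor] : conjugate(*g, word)) {
                    next[out] += coeff * factor;           // automatic deduplication
                }
            }
            if (is_non_clifford(*g)) truncate_by_weight(next, max_weight);
            words = std::move(next);
        }
        return evaluate_on_zero_state(words);              // sum contributions of I/Z-only words
    }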

GPU Memory Layout

We use the following constants: q = qubits, T = 256 threads/block, B = 400 max blocks.

Shared Memory Organization

Each thread block required shared-memory storage for its resident Pauli words (operator strings and complex coefficients, with headroom for rotation-induced splits) plus scratch space for the prefix sum used in truncation.

Total shared memory per block: (2q + 40) × T bytes. This was the main barrier to getting more work done per block.
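A sketch of how this budget enters the kernel launch, assuming the dynamic shared memory form of the launch (gate_kernel and the device buffers are illustrative names):

    // Per-block dynamic shared memory request from the (2q + 40) * T formula above.
    // (The exact breakdown of the 40-byte-per-thread term, coefficients plus
    //  prefix-sum scratch, is implementation-specific.)
    constexpr int THREADS_PER_BLOCK = 256;                        // T
    size_t smem_bytes = (2 * num_qubits + 40) * THREADS_PER_BLOCK;
    gate_kernel<<<num_blocks, THREADS_PER_BLOCK, smem_bytes>>>(d_words, d_counts, d_gates);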

Global Memory Organization

Global memory was laid out such that each thread block had its own region: a dedicated buffer for that block's Pauli words (operators and coefficients) and a per-block word count that the host reads back between kernel launches.

Key design principle: Each block owns non-overlapping memory regions, eliminating inter-block synchronization.

Mapping to GPU Hardware

We assign one CUDA thread per Pauli word, enabling parallel evolution of thousands of words simultaneously. Thread blocks of 256 threads process words in parallel, chosen to balance occupancy against shared memory availability.

For 10,000 words with 10 qubits: 10,000 threads total (~40 blocks), each block uses ~15KB shared memory, blocks execute independently across 46 SMs.
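As a quick check of these figures: \( \lceil 10{,}000 / 256 \rceil = 40 \) blocks, and the shared-memory formula gives \( (2 \cdot 10 + 40) \times 256 = 15{,}360 \) bytes \( \approx 15 \) KB per block.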

Algorithm and Operations

Initial Single-Block Approach (Abandoned)

This approach used a single thread block with all words held in shared memory. It was limited to ~256 words by the 48 KB shared memory limit and achieved only a 1.1–1.5× speedup, reflecting insufficient parallelism.

Multi-Block Dynamic Algorithm (Final Approach)

To overcome the single-block limitation, we implemented a multi-block dynamic algorithm:

  1. Each thread block dynamically determines which Pauli words to load from global memory
  2. Loads words into shared memory for fast access
  3. Applies gate transformations in parallel (each thread processes its assigned word)
  4. Performs parallel truncation using prefix sum
  5. Exits on rotation gates or when capacity is exceeded
  6. Writes results to dedicated global memory regions
  7. CPU reads only word counts (not full word data—this is the critical optimization)
  8. CPU launches next kernel with updated configuration and block count

Critical optimization: the CPU reads back only the number of words each thread block produced, not the words themselves. This drastically reduces memory traffic between the CPU and the GPU, transferring just B integers (~1.6 KB for B = 400) instead of potentially megabytes of Pauli word data.
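A simplified host-side sketch of this loop is below; propagate_segment, the device buffers, and the d_next_gate stop-index mechanism are illustrative stand-ins for the real driver, which tracks more state:

    #include <algorithm>
    #include <numeric>
    #include <vector>
    #include <cuda_runtime.h>

    // One launch per segment of gates. The kernel exits early at rotation gates or when a
    // block's capacity is exceeded, writing each block's surviving word count to d_counts
    // and the index of the next unprocessed gate to d_next_gate.
    std::vector<int> h_counts(MAX_BLOCKS);
    int gate_idx = num_gates - 1;                 // gates are processed in reverse order

    while (gate_idx >= 0) {
        propagate_segment<<<num_blocks, THREADS_PER_BLOCK, smem_bytes>>>(
            d_words, d_counts, d_gates, gate_idx, d_next_gate);

        // Critical optimization: copy back only B word counts (~1.6 KB for B = 400),
        // never the Pauli words themselves.
        cudaMemcpy(h_counts.data(), d_counts, num_blocks * sizeof(int),
                   cudaMemcpyDeviceToHost);
        cudaMemcpy(&gate_idx, d_next_gate, sizeof(int), cudaMemcpyDeviceToHost);

        // Rescale the next launch from the surviving word total.
        int total = std::accumulate(h_counts.begin(), h_counts.end(), 0);
        num_blocks = std::min(MAX_BLOCKS,
                              (total + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK);
    }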

Figure 3: Flowchart of the multi-block dynamic algorithm showing work distribution across thread blocks, with shared memory for fast gate operations and global memory for inter-block coordination.

Parallel Truncation with Prefix Sum

To ensure valid Pauli words remain adjacent in memory after truncation, we used a prefix sum:

  1. Each thread computes a binary flag: 1 if its word passes the weight threshold, 0 otherwise
  2. Parallel exclusive prefix sum computes destination indices in O(log T) time
  3. Threads write their valid words to compacted positions in parallel
  4. Final prefix sum value gives total valid count

We reused the prefix sum implementation from Assignment 2, adapted to use uint16_t for space efficiency. For T = 256 threads this takes \( \log_2 256 = 8 \) parallel steps, versus O(N) steps for a sequential scan.
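A device-side sketch of this compaction step, assuming a block-level exclusive_scan over uint16_t values like the adapted Assignment 2 routine (exclusive_scan, weight, and the buffer names are illustrative):

    // Inside the kernel, after the gate has been applied. Each thread holds one word.
    __shared__ uint16_t flags[THREADS_PER_BLOCK];
    __shared__ uint16_t dest[THREADS_PER_BLOCK];

    int tid = threadIdx.x;
    flags[tid] = (tid < word_count && weight(my_word) <= max_weight) ? 1 : 0;
    __syncthreads();

    exclusive_scan(flags, dest, THREADS_PER_BLOCK);    // dest[i] = flags[0] + ... + flags[i-1]
    __syncthreads();

    if (flags[tid]) {
        out_words[dest[tid]]  = my_word;               // valid words land contiguously
        out_coeffs[dest[tid]] = my_coeff;
    }
    // Surviving word count for this block (written back to global memory by thread 0).
    int new_count = dest[THREADS_PER_BLOCK - 1] + flags[THREADS_PER_BLOCK - 1];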

Optimization Journey

In brief, we moved from the abandoned single-block shared-memory prototype to the multi-block dynamic algorithm, then reduced host-device traffic to word counts only and added the parallel prefix-sum truncation described above.

Code Provenance

This implementation was developed from scratch: we began by translating key functionality from PauliPropagation.jl and Qiskit/pauli-prop into C++ for performance and flexibility, and all parallel algorithm design and optimization is original work.

Results

Performance Metrics

Experimental Setup

Hardware: GHC machines with Intel Xeon CPUs and NVIDIA GeForce RTX 2080 GPUs (8GB VRAM)

Baseline: Single-threaded CPU using std::map

Test Parameters: 500–30,000 initial Pauli words, 50–500 circuit layers, 7-qubit circuits for the stress tests, and varying gate compositions (Clifford-only, mixed, and rotation-heavy)

Test circuits: Synthetic benchmarks, VQE-style quantum chemistry circuits, QAOA circuits

Speedup Results

Gate composition impact: Clifford-only circuits achieve the highest speedups (approaching peak parallelism at ~90% efficiency), while rotation-heavy circuits lose ground to multi-kernel coordination overhead (~60% efficiency); see Performance Limiting Factors below.

Stress Test Results (7-qubit circuits with Clifford gates)

Configuration             CPU Time    GPU Time    Speedup
30K words, 500 layers     80.89 s     0.129 s     626×
8K words, 50 layers        3.34 s     0.009 s     377×
5K words, 120 layers       5.47 s     0.021 s     263×
5K words, 150 layers       6.77 s     0.026 s     261×
4K words, 100 layers       3.55 s     0.017 s     204×
3K words, 200 layers       5.53 s     0.034 s     161×
2K words, 250 layers       4.67 s     0.043 s     108×
1K words, 300 layers       2.78 s     0.052 s      54×
1K words, 400 layers       3.66 s     0.069 s      53×
500 words, 500 layers      2.26 s     0.086 s      26×
Average                                           213×
Figure 4: GPU speedup over single-threaded CPU baseline across different test configurations. Speedups range from 26× to 626× depending on workload size.

Scaling Behavior

Three distinct regimes were observed:

  1. Small workloads (≤1,000 initial words): kernel launch overhead dominates and speedups stay in the tens.
  2. Medium workloads (2,000–4,000 words): speedup grows roughly with problem size as launch overhead is amortized.
  3. Large workloads (5,000+ words): the GPU approaches peak utilization and speedups climb into the hundreds.

Empirical scaling formula: Speedup \( S \approx \min(600, 0.02 \times N \times D) \) where \( N \) is initial word count and \( D \) is circuit depth.

Figure 5: Performance scaling with problem size. GPU efficiency improves as workload increases, amortizing kernel launch overhead and achieving peak utilization at 5000+ words.

Parameter Sensitivity

Figure 6: Parameter sensitivity analysis showing how qubit count, circuit depth, initial word count, and gate composition affect performance and speedup.

Correctness Validation

A comprehensive test suite with 34+ test cases validates that the GPU implementation matches the CPU baseline across the supported gate set and across Clifford-only, mixed, and rotation-heavy circuits.

Memory Analysis

Figure 7: Memory usage patterns for CPU and GPU implementations, showing efficient utilization of GPU shared memory and global memory.

Performance Limiting Factors

Based on profiling analysis:

  1. Shared Memory Capacity: GPU shared memory limits words per block to ~256 maximum (MAX_PAULI_WORDS = 512 with 2× factor for rotation splits). This was the main barrier to processing more work per block.
  2. Kernel Launch Overhead: For small workloads (<100 words), the ~10 μs cost per kernel launch dominates. The multi-kernel approach for rotation gates adds overhead: 50 rotation gates mean 50 launches, roughly 500 μs of launch latency.
  3. Branch Divergence: Non-Clifford gates cause divergent execution. We observed a 20–30% warp efficiency loss on rotation-heavy circuits, measured with Nsight Compute.
  4. Memory Transfer: Our optimization of transferring only counts (not words) significantly reduced this from a dominant factor to a minor one.

Note: the dominant limiting factor varies by workload. Clifford-only circuits approach peak parallelism (~90% efficiency), while rotation-heavy circuits are bound by multi-kernel coordination overhead (~60% efficiency).

Platform Choice Analysis

GPU was appropriate because the workload is data-parallel: the same conjugation rules are applied independently to thousands of Pauli words, which maps naturally to one thread per word with block-local memory access and no inter-block dependencies.

CPU-only with OpenMP would struggle because core counts limit parallelism to tens of threads, and the map-based representation used for deduplication requires synchronization on concurrent insertion, limiting scaling.

However, for small circuits (<50 words), the CPU would be preferred: its per-gate overhead is lower than the GPU's kernel launch latency. The GPU advantage only manifests at scale.

References

  1. L. J. S. Ferreira, D. E. Welander, and M. P. da Silva, "Simulating quantum dynamics with Pauli paths," arXiv preprint, 2020.
  2. H. Gharibyan, S. Hariprakash, M. Z. Mullath, and V. P. Su, "A practical guide to using Pauli path simulators for utility-scale quantum experiments," arXiv preprint arXiv:2507.10771, 2025.
  3. M. S. Rudolph, "PauliPropagation.jl: Julia implementation of Pauli propagation," https://github.com/MSRudolph/PauliPropagation.jl, 2024.
  4. Qiskit Development Team, "Qiskit Pauli propagation library," https://github.com/Qiskit/pauli-prop, 2024.

Work Distribution

Arul Rhik Mazumder (arulm)

Daniel Ragazzo (dragazzo)

Credit Distribution

50% – 50%

Both partners contributed significantly to design discussions, debugging sessions, and project direction. The division of implementation work (CPU vs GPU) naturally split responsibilities while requiring close coordination on data structure compatibility and correctness validation.