Project Proposal - Parallelizing Pauli Paths

Summary

We will implement a parallelized version of the Pauli path algorithm for quantum circuit simulation, focusing on parallelizing the computation over the number of Pauli words. The goal is to accelerate quantum circuit simulation by performing independent Pauli word computations concurrently on NVIDIA GPUs. We will use CUDA programming to exploit GPU parallelism and optimize the simulation for larger quantum systems.

Background

Motivation

In quantum computing, a central task is to estimate expectation values for an observable \( \hat{H} \) given a quantum circuit that prepares a quantum state \( \lvert \psi \rangle \). Specifically, for a quantum system that evolves under a quantum circuit \( U \), the expectation value of an observable \( \hat{H} \) is given by:

\[ \langle \hat{H} \rangle = \langle \psi \lvert \hat{H} \lvert \psi \rangle = \langle \psi_0 \lvert U^\dagger \hat{H} U \lvert \psi_0 \rangle \]

This expectation value is fundamental in various quantum algorithms and simulations, such as in quantum chemistry for determining energy levels, or in general quantum information processing tasks.

However, simulating quantum systems and computing such expectation values classically is not easy. The quantum state of a system grows exponentially with the number of qubits, making direct simulation of large systems infeasible. Thus, methods that exploit the structure of quantum circuits and observables are essential for efficient simulation.

Naive Statevector Simulation

The most direct and intuitive classical simulation method for quantum circuits is statevector simulation. In this approach, the quantum state \( \lvert \psi \rangle \) is represented as a vector in a complex vector space, and quantum gates are applied directly to this state. The evolution of the system is tracked by applying the corresponding unitary matrices of the quantum gates to the statevector. This can be mathematically described as:

\[ \lvert \psi' \rangle = U \lvert \psi \rangle \]

where \( U \) is the unitary matrix corresponding to a gate in the circuit.

While statevector simulation works well for small systems, it faces exponential scaling with the number of qubits. For a system of \( n \) qubits, the quantum statevector has \( 2^n \) complex amplitudes. As the number of qubits grows, the computational resources required to store and manipulate the statevector grow exponentially, making it infeasible for larger systems.

Hamiltonian Decomposition into Pauli Words

In quantum simulations, a more scalable approach for simulating circuits and computing expectation values is to represent the observable \( \hat{H} \) as a sum of Pauli words. Pauli words are tensor products of Pauli matrices \( \{ I, X, Y, Z \} \), where each matrix acts on one or more qubits. As such, an observable can be written as:

\[ \hat{H} = \sum_i c_i P_i \]

Here, \( P_i \) represents a Pauli word (such as \( X \otimes Z \otimes I \)), and \( c_i \) is the corresponding coefficient. Pauli words are particularly useful because they provide a structured and efficient way to represent the observable in quantum simulations. In practice, most quantum systems are modeled using observables that are sums of Pauli words.

Schrödinger Picture vs Heisenberg Picture

In the Schrödinger picture, we track the evolution of the quantum state under the action of quantum gates. However, in the Heisenberg picture, the state remains fixed, and instead, we describe the system's evolution by the time evolution of observables.

In this framework, we focus on how the Pauli words, which represent the observable \( \hat{H} \), evolve as quantum gates are applied. The key difference between the two pictures is:

Schrödinger Picture: The quantum state \( \lvert \psi(t) \rangle \) evolves according to the unitary operator \( U(t) \):
\[ \lvert \psi(t) \rangle = U(t) \lvert \psi(0) \rangle \]

Expectation value is computed as:

\[ \langle \hat{H} \rangle = \langle \psi(t) \lvert \hat{H} \lvert \psi(t) \rangle \]
Heisenberg Picture: In this picture, the operator \( \hat{H}(t) \) evolves instead of the state. The evolution of the observable is given by:
\[ \hat{H}(t) = U^\dagger(t) \hat{H} U(t) \]

The expectation value is then calculated as:

\[ \langle \hat{H} \rangle = \langle \psi(0) \lvert \hat{H}(t) \lvert \psi(0) \rangle \]

In this approach, we track the evolution of observables—specifically Pauli words—instead of the quantum state. Pauli propagation focuses on how these Pauli words evolve under quantum gates. One challenge in simulating quantum circuits is the exponential growth of Pauli words as the number of qubits increases. Pauli Paths offer an efficient way to manage this complexity by tracking the evolution of Pauli operators, rather than the full state vector. In the Schrödinger picture, simulating state evolution can be computationally expensive. In contrast, the Heisenberg picture allows us to focus on the evolution of Pauli words, where each gate acts on them deterministically. This significantly reduces the computational cost and avoids the exponential blow-up that comes with tracking the full quantum state.

Truncation of Pauli Words for Scalability

A major obstacle in simulating quantum circuits is the rapid growth in the number of Pauli words as the number of qubits increases. If we were to track all possible Pauli words during the simulation, the computational cost would become prohibitive. To mitigate this, truncation techniques are often applied to limit the number of Pauli words tracked at any given time.

Truncation involves discarding Pauli words that exceed a certain weight threshold. The weight of a Pauli word refers to the number of qubits on which it has a non-trivial component (such as \( X \), \( Y \), or \( Z \), rather than the identity \( I \)). By setting a maximum weight, we can discard Pauli words that would otherwise require excessive computational resources.

Algorithm Overview: Pauli Propagation for Expectation Value Estimation

The following pseudocode outlines the key steps involved in estimating expectation values using Pauli propagation. This approach efficiently tracks the evolution of Pauli words in the Heisenberg picture without the need to simulate the full quantum state vector.

                # Pseudocode for Pauli Propagation Simulation

                # Step 1: Decompose observable into Pauli words
                observable = decompose_hamiltonian(H)
                # observable is a list of Pauli words {P1, P2, ..., Pn}
                # where each Pauli word Pi has a corresponding coefficient ci

                # Step 2: Initialize Pauli words and their coefficients
                Pauli_words = {P1: c1, P2: c2, ..., Pn: cn}

                # Step 3: Iterate over quantum gates applied to the circuit
                for each gate in quantum_circuit:
                    # Step 3a: Apply gate transformation to each Pauli word
                    for each Pauli_word in Pauli_words:
                        Pauli_words[Pauli_word] = apply_gate_to_pauli(gate, Pauli_word)
                    
                    # Step 3b: Truncate Pauli words based on weight threshold
                    Pauli_words = truncate_pauli_words(Pauli_words, max_weight)

                # Step 4: Compute expectation value by contracting evolved observable with initial state
                expectation_value = 0
                for each Pauli_word, coefficient in Pauli_words:
                    expectation_value += coefficient * compute_expectation(Pauli_word, initial_state)

                return expectation_value

Leveraging Parallelism for Efficient Simulation

Simulating quantum circuits involves managing large-scale computations, especially when tracking Pauli words, which grow exponentially with the number of qubits. The challenge of simulating these systems can be mitigated by parallelizing the computation across multiple processing units, such as GPUs. Certain aspects of the Pauli path propagation algorithm are well-suited to parallelism, which can significantly speed up the computation and make simulating larger quantum circuits feasible.

Parallelism in Pauli Word Computations

One of the key aspects of the Pauli path propagation algorithm is that each Pauli word evolves independently under the action of quantum gates. This independence makes Pauli word computations an ideal candidate for parallelization. By mapping each Pauli word to a separate processing thread, we can perform these operations concurrently, dramatically reducing the overall time required for the simulation.

How GPU Parallelism Works in Pauli Path Propagation

In our parallelized implementation, each Pauli word is treated as an independent task, and we assign each task to a separate thread on the GPU. The main steps that benefit from parallelism include:

Pauli Word Evolution: Each Pauli word undergoes a transformation when quantum gates are applied. Since these transformations are independent for each Pauli word, they can be executed concurrently across multiple GPU threads.
Pauli Word Truncation: After each gate transformation, Pauli words may need to be truncated based on a weight threshold. This process is also independent for each Pauli word, allowing for parallel computation of truncation steps.
Expectation Value Calculation: Finally, the expectation value is calculated by contracting the evolved Pauli words with the initial state. Since each Pauli word contributes independently, this computation can be distributed across GPU threads, speeding up the overall calculation.

The Challenge

Simulating quantum circuits, especially for large-scale quantum systems, presents significant challenges due to the exponential growth of the number of Pauli words as the number of qubits increases. The core difficulty lies in efficiently managing the evolution of a growing number of Pauli words, while also ensuring that the computation remains scalable and performant when parallelized.

Exponential Growth of Pauli Words

The most significant challenge in parallelizing this workload is the exponential growth of the number of Pauli words with respect to the number of qubits in the quantum system. As more qubits are added, the size of the state space increases exponentially, which results in a proportional increase in the number of Pauli words needed to represent observables. This means that mapping each Pauli word to an independent GPU thread becomes more difficult as the number of Pauli words increases, since we may quickly exceed the number of available threads or run into memory limitations.

For instance, with just 5 qubits, there are \( 4^5 = 1024 \) possible Pauli words, but with 10 qubits, this number grows to \( 4^{10} = 1,048,576 \). As the qubit count rises, it becomes increasingly difficult to allocate enough threads to manage all the Pauli words without significant overhead or memory constraints.

Workload Characteristics

The computational workload associated with evolving Pauli words is inherently parallel, as each Pauli word can evolve independently under the action of quantum gates. However, the following characteristics of the workload make mapping it to a parallel system challenging:

Memory Access Patterns: Each Pauli word contains a tensor product of Pauli matrices (e.g., \( X \otimes Z \otimes I \)), which requires memory access for storing and manipulating these matrices. Since the number of Pauli words increases with the number of qubits, there is a substantial increase in the memory required to store both the Pauli words and their corresponding coefficients. The memory access pattern can become highly scattered, which may result in inefficient memory usage, especially if data is not cached or stored contiguously in memory.
Communication-to-Computation Ratio: The communication-to-computation ratio can become problematic as the number of Pauli words increases. When using a parallelized approach, each thread must not only compute the transformation of its assigned Pauli word but also communicate its result with the global state (for expectation value computation). As the number of threads grows, the overhead associated with synchronizing these computations and collecting results may increase, leading to diminishing returns from parallelization.
Divergent Execution: Different threads need to perform different computations based on their assigned Pauli word. For example, some Pauli words might involve a complex set of gate transformations (Clifford vs non-Clifford Gates), while others may not require as much computation. This variation in workload per thread can result in inefficient GPU resource utilization, where some threads are idle while others are overloaded with computation. Managing this divergence efficiently is a key challenge in parallelizing Pauli word propagation.

System Constraints

Several system constraints make mapping the workload of Pauli word simulation to a parallel architecture challenging:

Thread Scaling: The number of threads available on a GPU is finite, and as the number of Pauli words grows exponentially, it becomes increasingly difficult to map each Pauli word to an independent thread. We may eventually run out of threads, especially for systems with a large number of qubits. When this happens, we must either increase the number of GPUs or implement techniques like work-stealing (dynamically assigning Pauli words to idle threads) or thread-blocking (grouping Pauli words into blocks of computations to be processed sequentially).
Memory Constraints: As the number of Pauli words grows, the memory required to store all Pauli words and their corresponding coefficients can quickly exceed the available GPU memory. This requires either more advanced memory management strategies (e.g., memory paging) or limiting the number of Pauli words that can be tracked at any given time, potentially leading to truncated or approximated results.
Synchronization Overhead: In a parallelized system, synchronization is necessary to ensure that all threads are working on the correct parts of the Pauli word set and that intermediate results are properly handled. As the number of threads increases, the synchronization cost may begin to dominate, causing delays in computation. Efficient load balancing becomes crucial to avoid threads idling while waiting for others to complete their tasks.

What We Hope to Learn

By undertaking this project, we aim to understand how to efficiently scale the simulation of quantum circuits by managing Pauli word evolution through parallelism. Key aspects of the problem we hope to address include:

How to efficiently map an exponentially growing number of Pauli words to a fixed number of GPU threads.
How to manage memory constraints while tracking a large number of Pauli words without sacrificing accuracy.
The trade-offs between computational overhead (e.g., synchronization) and the benefits of parallelization.
How to address divergent execution and achieve load balancing when the computation per Pauli word varies.
Strategies for scaling the simulation to large quantum systems while maintaining performance and accuracy.

Through experimentation and optimization, we hope to find solutions that balance these challenges, enabling efficient and scalable simulation of large quantum circuits.

Resources

The primary starting point for this project will be existing open-source codebases and simulators. These include:

We plan to start by implementing C++ versions of the key functionalities found in the PauliPropagation.jl and Qiskit/pauli-prop libraries. These repositories provide a foundation for Pauli path simulation, and translating the existing implementations into C++ will provide both performance improvements and more flexibility for adapting the algorithms to different use cases. C++ will be chosen for its high-performance capabilities, especially for numerical computations, which will be crucial for handling large-scale quantum systems efficiently.

Starting Point: Papers and References

The primary theoretical framework for this project comes from the paper:

“Simulating Quantum Dynamics with Pauli Paths” by L. J. S. Ferreira, D. E. Welander, and M. P. da Silva (2020).

Additional reference material and technical details will be obtained from the upcoming paper:

“A Practical Guide to using Pauli Path Simulators for Utility-Scale Quantum Experiments” by H. Gharibyan, S. Hariprakash, M. Z. Mullath, and V. P. Su, arXiv:2507.10771 (2025). Available: https://arxiv.org/abs/2507.10771

Additional Resources

While the above resources will provide most of the theoretical and practical tools needed for this project, there may be some additional computational resources required. Specifically, access to specialized quantum computing hardware or simulators (such as those available through platforms like IBM Qiskit, AWS Braket, or Microsoft Azure Quantum) will be crucial for benchmarking. By using these platforms, we can generate results to compare against our own simulations, helping us evaluate the performance and accuracy of our Pauli paths implementation. This comparison will be key for identifying areas where our simulator can be optimized or refined.

Goals and Deliverables

Plan to Achieve:

Implement a parallel GPU-based Pauli Path simulator.
Build a CUDA/C++ implementation of the Pauli Path algorithm that exploits embarrassingly parallel steps—particularly independent Pauli word evolution.
Compare GPU implementation to CPU-based simulators.
Most current Pauli Path simulators operate on CPUs. We will benchmark our GPU version against these implementations and analyze where performance differences arise.
Evaluate truncation strategies on GPUs.
Because the Pauli word set can grow large, we will investigate parallelizable truncation/pruning approaches and measure their effect on performance, memory usage, and simulation accuracy.

Hope to Achieve:

Match or exceed performance of public Pauli Path simulators.
Since GPU-optimized Pauli Path simulators are rare, we aim for substantial speedups compared to existing CPU implementations.
Implement CPU multithreaded parallelism.
If time permits, we will extend our simulator with OpenMP/Pthreads-based CPU parallelism to compare CPU- and GPU-parallelization strategies.

Poster Demo Deliverables:

Speedup plots comparing CPU vs GPU implementations
Visualization of Pauli word population growth and truncation over time
Expectation-value outputs matching published simulators to validate correctness

Schedule

Week 1 (11/17–11/23)

Develop a working minimal C++/CUDA Pauli Path simulator for tiny circuits
Establish baseline timings for later speedup comparisons

Week 2 (11/24–11/30)

Add and test truncation mechanisms
Begin major optimization work

Week 3 (12/1–1/8)

Finalize GPU optimizations
Run full benchmark suite and generate plots
Produce final project poster

Platform Choice

For hardware we will be using the GHC machines for both their multi-core CPUs and NVIDIA GPUs. We plan to write our entire project in C++ and CUDA