Speeding up BloqadeNoisy simulations

Hi!

I am a new user of both Julia and Bloqade, and I am trying to simulate a simple system under the Aquila noise model but have experienced some performance issues.

A minimal example of what I am trying to do is:

using Bloqade
using BloqadeNoisy

Δ = 0
Ω = 1*2π
ϕ = 0
N = 2
sep = 5 # Atom separation in um

atoms = generate_sites(ChainLattice(), N, scale=sep) # Construct chain
H = rydberg_h(atoms; Δ=Δ, Ω=Ω, ϕ=ϕ) # Hamiltonian
t = LinRange(0, 2, 1000) # in us

psi0 = zero_state(N) # Create initial state, all atoms in |g>
prob_noise = NoisySchrodingerProblem(psi0, t, H, Aquila()) # Use Aquila noise model
n1_op = mat(chain(N, put(1=>Op.n))) # Rydberg density on site 1
n2_op = mat(chain(N, put(2=>Op.n))) # Rydberg density on site 2
n_traj = 10
noisy_sim = emulate_noisy(prob_noise, n_traj, [n1_op, n2_op], 
   readout_error = true, ensemble_algo = EnsembleThreads());

So, I construct the Hamiltonian for some sensible parameters for a pair of atoms, and then evolve the initial state under that Hamiltonian and the Aquila noise model, measuring the expectation values of some observables. The problem is that, even for just a single Monte Carlo trajectory, this is pretty slow (~ a few minutes to run the code above). This means that running ~ 10/100/1000 trajectories, and for larger system sizes, will surely be too slow to be practical. I am using BloqadeNoisy v0.1.1, and running it with VS Code.

I tried a couple of things to speed this up:

  1. Using subspace methods as described here: Working with Subspace · Bloqade.jl . However, doing this in a naive way and passing the state created with the blockade subspace throws an error, as the NoisySchrodingerProblem function does not seem to support such objects.
  2. Using multithreading by passing ensemble_algo = EnsembleThreads() to emulate_noisy, but this does not obviously do anything. Trying to choose the backend as described here: CPU Acceleration · Bloqade.jl throws an error (the set_backend function is not recognised.

So to sum up I have 3 questions:

  1. Can subspace methods work with BloqadeNoisy? If so, can anyone provide a minimal example of this?
  2. How can I use multithreading with Bloqade to speed up embarrassingly parallel problems, such as simulating independent Monte Carlo trajectories?
  3. Is there anything else in the above code/way of running it that is obviously inefficient?