During my talk for the MathWorks Computational Finance Virtual Conference, I was able to significantly accelerate the performance of a Monte Carlo simulation by moving its random number generation (RNG) component onto a Graphics Processing Unit (GPU). GPUs excel at the sorts of independent, repeated, massively-parallelizable computations that are required for RNGs-- so much so that this functionality is built in to MATLAB's core GPU methods.

What we get with MATLAB is a uniformly-distributed, independent RNG (`gpuArray.rand`), a uniformly-distributed, integer RNG (`gpuArray.randi`), and a standard-normally-distributed, independent RNG (`gpuArray.randn`). As useful as these may be, they can't generate the *correlated* random numbers typically required for (for instance) a mean-variance financial framework. For the Virtual Conference, I built just such an RNG, and I had to think through a few non-trivial concerns to get there. This posting outlines some of the techniques I explored-- I hope you find it useful!

## 1. Preliminary concerns

Before I even get started, I want to review a few on my biggest concerns for while I'm writing this code. GPUs have several weaknesses that we are best off avoiding, in particular:

1. Relatively slow data transfer: we want to send as little data between the CPU and the GPU as possible.

2. High functional overhead: because it takes so long to get GPU threads rolling, we would much rather call a single command on a very large set of data than a sequence of the same command nested within a `for` loop.

3. Less memory: generally speaking, GPUs perform best when they have more threads executing. This is often limited by the GPU memory size, so we always want to watch for way to shrink the GPU memory footprint.

## 2. Generating correlated random numbers

Before I could create this functionality on a GPU, I first needed to ask how these calculations are done on a CPU. (A great place to start is the source code for MATLAB's Statistics Toolbox function `mvnrnd`.)

The expected covariance matrix `expCov` needs to be run through a `Cholesky decomposition`, a fantastic linear algebra technique that (among other things) allows us to convert uncorrelated random numbers into properly-correlated ones. It simultaneously scales a uniform normal into one with the appropriate variances. So, supposing we need `nDraws` from the correlated distribution and are given an `nAssets`-by-|nAssets| expected covariance matrix, we can just execute:

T = cholcov(expCov); % The Cholesky decomposition

uncorrelated = randn(nDraws, nAssets); % Uncorrelated, standard normals.

correlated = uncorrelated * T; % Correlated and properly-scaled normals.

## 3. Re-introducing the expected means

Having now accomodated the "variance" part of the mean-variance framework, we need to bring in the expected means, `expReturn`, as well. For the sake of simplicity, we'll assume that it's a 1-by-|nAssets| vector.

What we now need to do is add each element of this 1-by-|nAssets| vector to the appropriate columns of the `nDraws`-by-|nAssets| matrix `correlated`. This is best done by my favourite MATLAB function, the horribly-named-but-fantastically-useful `bsxfun` (which is short for Binary Singleton eXpansion if you're asked at your next trivia night). It is specifically designed for tasks like this and is much faster and more memory efficient than alternatives like `for` loops or `repmat`. This is especially true when we are using GPUs, so we use it here:

randNums = bsxfun(@plus, correlated, expReturn);

## 4. Re-shaping to our needs

This now gives us an `nDraws`-by-|nAssets| matrix of numbers, which might be the end of the story. In the webinar, though, I really needed a 3D array of random numbers, something like `nPeriods`-by-|nTrials|-by-|nAssets|. Because of the matrix multiplication times the Cholesky matrix above, I couldn't simply start with a 3D array. Furthermore, since GPUs prefer to do their work in a single shot, running this in a `for` loop (say, one period at a time) isn't effective, either. My best remaining option was to define `nDraws` to be `nPeriods` times `nTrials`, and I now reshape the matrix:

randNums = reshape(randNums, nTrials, nPeriods, nAssets);

The important thing here is that since `nAssets` is the final dimension of the 2D matrix, so too must it be the final dimension of the 3D array (this preserves the correlation characteristics of our data).

If the rest of our code base requires a different ordering to the rows, columns, and pages of the 3D array (for instance, assets are to be read across the rows instead of through the pages), then the `permute` command is useful here. That's probably easier than rewriting the rest of the code base to accomodate this particular ordering.

## 5. Moving to the GPU

We now have some fast and compact code to generate an arbitrary number of Monte Carlo simulations (`nTrials`) over an arbitrary number of periods (`nPeriods`) for our assets. The only question now is what (if any) of these operations should be moved to the GPU.

Regardless of how the expected covariances and expected returns are calculated, I'd recommend that they be estimated on the CPU. If nothing else, these calculations typically involve matrices of size `nEstimationPeriods`-by-|nAssets| (where the former is much larger than the latter), so we're better off calculating these moments on the CPU and transferring only the (much smaller) variables `expReturns` and `T` (the Cholesky matrix) instead.

The rest of the calculations might be faster on the GPU, but only a benchmarking exercise can confirm this. The following table shows my results (in seconds) for `nPeriods` set to 100 and `nAssets` set to 10:

nTrials: | 1e4 | 1e5 | 3e5 | 1e6 |

CPU only: | 0.164 | 1.61 | 4.90 | 16.3 |

GPU (no gather): | 0.000929 | 0.00163 | 0.694 | N/A |

GPU (incl. gather): | 0.0715 | 0.704 | 2.20 | N/A |

Note the following features:

1. We benchmark both the cases where we gather the random numbers back to the CPU (applicable when subsequent calculations are to be done on the CPU) and where we don't (applicable when further calculations are on the GPU).

2. The time to transfer data from the GPU far outweighs the time to generate the correlated numbers themselves!

3. The GPU runs out of memory at about 400,000 trials. If we needed more trials, we could consider running the code on the GPU in pieces (for instance, one period at a time). The "best" way to do this depends on how these simulated numbers are subsequently used.

I explored several variations of the code, adjusting which operations were done on the GPU and which came after the `gather` command. The final, fastest version of the code looks something like this:

T = cholcov(expCov);

randNums = gpuArray.randn(nTrials * nPeriods, nAssets);

randNums = uncorrelated * T;

randNums = bsxfun(@plus, randNums, expReturn);

randNums = reshape(randNums, nTrials, nPeriods, nAssets);

randNums = gather(randNums);

It certainly seems simple enough in the end, but we see that the choice and ordering of the operations can have a significant effect on execution speeds. These same lessons and techniques are applicable in many other GPU-based calculations.