Running Monte Carlo (MC) codes on GPUs is certainly a very interesting option, and I would not rule it out. However, there are inherently complex issues to overcome, and these are not obvious for general-purpose MC codes.
GPUs offer very high performance because they contain a large number of individual processors, each of which might not be very fast by itself (depending on the hardware, of course). It is the combined number that makes the difference. As those processors are optimized and organized to work on data that can be represented as matrices, they work well only with specific types of algorithms. Algorithms that branch frequently will kill performance and might even run slower than on standard CPUs. Well-established algorithms therefore have to be redesigned from scratch. Furthermore, memory models differ: depending on the hardware, some GPUs support only a limited amount of dedicated memory, which cannot be shared with the main memory of the machine, whereas others allow spill-over but at the cost of speed. There is also the subject of precision, which would go too far to discuss here.
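To make the branching point a bit more concrete, here is a toy cost model in Python. It is only a sketch under simplifying assumptions: lanes of a group (a "warp" of 32 threads in NVIDIA terminology) run in lockstep, and when lanes take different branches, the branches are executed one after the other. The function names, the branch labels and the costs are all made up for illustration; real hardware behaves in a more nuanced way.

```python
def warp_steps(lane_branches, branch_cost):
    """Steps needed by one lockstep group of lanes (toy model).

    Every distinct branch taken by at least one lane is executed once,
    one after the other, while the lanes not on that branch sit idle.
    """
    return sum(branch_cost[b] for b in set(lane_branches))


def warp_efficiency(lane_branches, branch_cost):
    """Fraction of lane-steps that do useful work (1.0 = no divergence)."""
    steps = warp_steps(lane_branches, branch_cost)
    useful = sum(branch_cost[b] for b in lane_branches)
    return useful / (len(lane_branches) * steps)


# Hypothetical branch labels and costs, chosen only for illustration.
COST = {"scatter": 10, "absorb": 10, "decay": 10, "escape": 10}
KINDS = list(COST)

uniform = ["scatter"] * 32                      # all lanes take the same branch
mixed = [KINDS[i % 4] for i in range(32)]       # lanes spread over four branches

if __name__ == "__main__":
    print(warp_steps(uniform, COST), warp_efficiency(uniform, COST))
    print(warp_steps(mixed, COST), warp_efficiency(mixed, COST))
```

In this model the group that spreads over four equally expensive branches needs four times as many steps as the uniform one, so three quarters of the lane-steps are wasted. A particle transport loop, where each history may pick a different interaction at every step, looks much more like the second case than the first.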
Last but not least, to get the best performance one needs to use vendor-specific languages, which means that, for example, code targeted at NVIDIA GPUs will not work on AMD GPUs, and vice versa.
In short, GPUs are very powerful, but the large speed gains you might hear about depend heavily on the application and on how it can be expressed in terms of mathematical algorithms, and they require hardware- and vendor-specific programming.
To complement Chris’s answer, note that FLUKA already relies on parallelism, but at the process level.
The present approach of running multiple single-threaded processes has several benefits:
(1) Crash resilience. In case of a crash, only one process is killed, not the entire simulation. The post-processing scripts allow the user to get results from all processes that completed successfully. This is particularly useful when the crash is unrelated to the simulation itself (process killed by the server, server failure, etc.).
(2) Debugging. A multithreaded application requires more effort to debug (and maintain).
(3) Scalability. While threads must always remain within the memory space of their process, one can launch multiple processes on different machines.
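As a rough illustration of the points above, here is a small Python sketch of the "multiple single-threaded processes" pattern. It is not FLUKA's actual run or post-processing machinery: the worker (a toy estimate of pi), the names `WORKER`, `run_cycles` and `merge`, and the simulated crash are all assumptions made for the example. Each process gets its own random seed, one of them is made to fail on purpose, and the results of the surviving processes are still merged.

```python
import statistics
import subprocess
import sys

# Toy worker executed in a separate Python process. It stands in for one
# independent single-threaded simulation run (hypothetical, not FLUKA).
WORKER = """
import random, sys
seed, n, fail = int(sys.argv[1]), int(sys.argv[2]), sys.argv[3] == "1"
if fail:
    sys.exit(1)  # simulated crash, e.g. process killed by the server
rng = random.Random(seed)
hits = sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0 for _ in range(n))
print(4.0 * hits / n)
"""


def run_cycles(seeds, n_samples, failing_seeds=()):
    """Launch one process per seed; return results of successful ones only."""
    procs = []
    for seed in seeds:
        fail_flag = "1" if seed in failing_seeds else "0"
        cmd = [sys.executable, "-c", WORKER, str(seed), str(n_samples), fail_flag]
        procs.append((seed, subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)))
    results = {}
    for seed, proc in procs:
        out, _ = proc.communicate()
        if proc.returncode == 0:  # a crashed process is simply skipped
            results[seed] = float(out)
    return results


def merge(results):
    """Mean over the runs and standard error of that mean."""
    values = list(results.values())
    if not values:
        raise ValueError("no process completed successfully")
    mean = statistics.mean(values)
    if len(values) > 1:
        err = statistics.stdev(values) / len(values) ** 0.5
    else:
        err = float("nan")
    return mean, err


if __name__ == "__main__":
    done = run_cycles([1, 2, 3, 4], 20000, failing_seeds={3})
    print(sorted(done), merge(done))
```

The run with seed 3 is lost, but the other three still give a usable estimate with an error bar. Since the processes share nothing, the same pattern works if they are started on different machines and only the result files are collected afterwards.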
This is not incompatible with also exposing parallelism within each process, though, either through multithreading or by offloading work to GPUs. It would be interesting to benchmark the possible gains against the present approach of multiple single-threaded processes. However, because of the drawbacks Chris mentioned (significant code redesign, vendor dependency, etc.), the performance gains would need to be major to justify that approach. There is no plan to support offloading work to GPUs in the near future. Note, however, that vendor lock-in can be overcome with performance portability libraries (although this is still an extra dependency).