Dear FLUKA experts
My name is David, and I have encountered a rare allocation error in the subroutine read_phase_space_file.
Background
I simulate energy deposition events from a phase space file using the “source_cosmic_new.f” user routine with the embedded subroutine “read_phase_space_file”. I run this on a computer cluster at our institute via the Slurm scheduler as ~2000 independent, parallel runs. The corresponding scripts have been used successfully for over a year without any problems.
Problem
Since this month, a small fraction of runs (~1 out of 1000) regularly fail with a fatal allocation error that results in a core dump and a crash of the affected run (2E5 primaries per run). All other runs are unaffected and complete without problems.
Problem Solving
- I traced the error back to the read_phase_space_file subroutine, line 1041 and/or 1067, with the log file reporting:
“malloc(): mismatching next->prev_size (unsorted)”
- The errors occurred on all nodes available on the cluster, so a node-specific hardware issue seems unlikely.
- I ensured that sufficient memory is allocated for each run, so a memory allocation overflow is also unlikely.
- I contacted the administrator for our cluster and he pointed to a potential problem with the subroutine read_phase_space_file.
- Increasing the number of phase space file entries increases the error rate (1E7 lines correspond to ~1 error per 100 runs for 2E5 primaries per run).
- Note that I utilize FLUKA’s new point-wise neutron transport treatment for the simulation. I tested the setup with both the JEFF-3.3 and ENDF-VIII.0 nuclear libraries and encountered allocation errors with both.
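For scale, the two error rates quoted above can be translated into an expected number of crashed runs per batch of ~2000 runs. The following is only a rough sketch in Python: it assumes that runs fail independently with a fixed per-run probability, which is an assumption and not something established here, and the function names are purely illustrative.

```python
# Rough estimate of crashed runs per batch.
# ASSUMPTION: each run fails independently with the same probability.

def expected_failures(n_runs: int, p_fail: float) -> float:
    """Expected number of failed runs in a batch of n_runs."""
    return n_runs * p_fail

def prob_at_least_one(n_runs: int, p_fail: float) -> float:
    """Probability that at least one run in the batch fails."""
    return 1.0 - (1.0 - p_fail) ** n_runs

if __name__ == "__main__":
    n_runs = 2000  # approximate batch size quoted above
    # Per-run failure rates quoted above: ~1/1000 and ~1/100 (1E7-line file)
    for label, p_fail in [("~1 in 1000", 1e-3), ("~1 in 100", 1e-2)]:
        print(
            f"{label}: expected failures = {expected_failures(n_runs, p_fail):.1f}, "
            f"P(at least one) = {prob_at_least_one(n_runs, p_fail):.4f}"
        )
```

Under that assumption, a batch of 2000 runs would be expected to contain about 2 crashed runs at the lower rate and about 20 at the higher rate, so a batch of this size should be enough to trigger the error.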
Questions
- Can you reproduce the error I encountered on your cluster system with the files provided below?
- Do you have a solution to prevent this rare error?
Files
Because of their size, I provide the files via this link:
The folder contains an MRE as well as two runs in which I encountered the described error, including all log and error files:
/InputFiles: template input (& flair) file, which is adapted for each run
/PhaseSpaceFile: contains the phase space file used for all runs
/UserRoutines: contains the custom user routines used for all runs
/run00322: first example run with the described error
/run01675: second example run with the described error
Additional Information
I performed all simulations with the FLUKA CERN version 4-4.1.
Thank you very much for your support
Cheers,
David