Rare allocation error with subroutine read_phase_space_file

Dear FLUKA experts

My name is David and I encountered a rare allocation error with the subroutine read_phase_space_file.

Background
I simulate energy deposition events from a phase-space file using the “source_cosmic_new.f” user routine with the embedded subroutine “read_phase_space_file”. I run this on a computer cluster at our institute via the Slurm scheduler, using ~2000 independent, parallel runs. The corresponding scripts have been used successfully for over a year without any problems.
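For context, the submission setup can be sketched as a Slurm array job. This is a minimal sketch only: the partition settings, the executable name `fluka_cosmic`, the input name `template`, and the directory layout are assumptions, not the actual scripts.

```shell
#!/bin/bash
#SBATCH --job-name=fluka-ps
#SBATCH --array=0-1999        # ~2000 independent runs
#SBATCH --mem=4G
#SBATCH --time=12:00:00

# Each array task works in its own directory, e.g. run00000 ... run01999
RUN=$(printf 'run%05d' "$SLURM_ARRAY_TASK_ID")
mkdir -p "$RUN" && cd "$RUN" || exit 1

# Run one FLUKA cycle with the executable linked against the user
# routines (source_cosmic_new.f); all names here are placeholders.
"$FLUPRO/bin/rfluka" -e ../fluka_cosmic -N0 -M1 ../template
```

Each array task is an independent process, which is why a crash in one run leaves the other ~1999 unaffected.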

Problem
Since this month, some runs (~1 in 1000) regularly fail with a fatal allocation error, resulting in a core dump and a crash of the affected run (2E5 primaries per run). All other runs are unaffected and complete without problems.

Problem Solving

  1. I traced the error back to the read_phase_space_file subroutine, lines 1041 and/or 1067, with the log-file error being:
    “malloc(): mismatching next->prev_size (unsorted)”
  2. The errors were encountered on all the nodes available on the cluster, so a node-specific hardware issue seems unlikely.
  3. I ensured that sufficient memory is allocated for each run, so simply running out of memory is also unlikely.
  4. I contacted the administrator for our cluster and he pointed to a potential problem with the subroutine read_phase_space_file.
  5. Increasing the number of phase space file entries increases the error rate (1E7 lines corresponds to ~1 error per 100 runs for 2E5 primaries per run).
  6. Note that I utilize FLUKA’s new point-wise neutron transport treatment for the simulation. I tested the setup with both the JEFF-3.3 and ENDF-VIII.0 nuclear libraries and encountered allocation errors with both.

Questions

  1. Can you reproduce the error I encountered on your cluster system with the files provided below?

  2. Do you have a solution to prevent this rare error?

Files
Because of their sizes, I provide the files via this link:

The folder contains an MRE (minimal reproducible example) as well as two runs for which I encountered the described error, including all log and error files:

/InputFiles: template input (& flair) file, which is adapted for each run
/PhaseSpaceFile: contains the phase-space file used for all runs
/UserRoutines: contains the custom user routines used for all runs
/run00322: first example run with the described error
/run01675: second example run with the described error

Additional Information
I performed all simulations with the FLUKA CERN version 4-4.1.

Thank you very much for your support
Cheers,
David

Dear David,

I ran your input on our cluster for 10’000 jobs, and none crashed due to memory issues. (Please note that our submission script launches the jobs staggered, so not all ran at the same time.)

So to try to find a workaround I have a few questions:

  1. Are the crashes reproducible? That is, does the same input file fail again if you resubmit it?
  2. Did you try to use different Gfortran versions?

Unfortunately, I can’t see how the code could cause the issue, since you are reading the same file and it works most of the time.
Lines 1041 and 1067 are where a line is read from the file, but I’m not aware that this could cause memory-allocation errors when using a static variable.

Some other things to try:

  1. Modify the code so it doesn’t use dynamic memory allocation, by hard-coding the array size.
  2. Copy the phase-space file to each FLUKA job, so that all 2000 processes don’t read from a single shared copy.
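The second suggestion could be scripted along these lines, staging a private copy of the phase-space file into each run directory before the jobs start. This is a minimal sketch; all file and directory names below are placeholders for illustration, not the actual setup.

```shell
#!/bin/sh
# Demo: stage a private copy of the phase-space file into each run
# directory so no two FLUKA processes read the same file at once.
# All names are placeholders for illustration.
mkdir -p PhaseSpaceFile
printf 'dummy phase-space data\n' > PhaseSpaceFile/phase_space.dat

for run in $(seq -f 'run%05g' 0 3); do   # would be 0..1999 in production
  mkdir -p "$run"
  cp PhaseSpaceFile/phase_space.dat "$run/phase_space.dat"
done

# Count the staged copies
set -- run*/phase_space.dat
echo "staged copies: $#"
```

Staging per-run copies trades disk space for isolation: each process then holds its own file handle on its own inode, ruling out any interaction through the shared file.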

Cheers,
David

Dear David,

Thank you for taking the time to run my input files and for providing such detailed feedback. I appreciate your patience, especially as it has taken me some time to reply due to the holiday season.

Answers to Your Questions

  1. Reproducibility: The crashes are not reproducible. Resubmitting the same input file generally results in successful runs.
  2. GFortran Versions: Unfortunately, testing with a different GFortran version would require involving the administrator, which I could not do over the holidays.

Update on the Issue

I tested your second suggestion of copying the phase-space file to each FLUKA job, ensuring that no two processes accessed the same file simultaneously. This workaround successfully resolved the issue. I haven’t encountered any crashes since implementing this change, even with the same input files that previously caused errors.

Closing Thoughts

While I can’t definitively pinpoint the root cause, the workaround has proven effective. I suspect the error was related to simultaneous access to the phase-space file by multiple processes, though it appeared sporadically and was hard to diagnose.

Thanks again for your guidance and suggestions.

Best regards and happy new year!
David