Crash with error message "SIGSEGV: Segmentation fault - invalid memory reference"

jbpruvost · 13 September 2023 14:52

Dear experts,
I’m trying to run a problem on a cluster (with Fluka4-3.3/Flair3.2-4.5), using a source routine and scoring mainly DOSE-EQ with usrbin.
The routine compiles and links to the fluka executable without apparent problems. The run starts normally, but before the end of the 1st cycle it stops with the message “executable returned RC=139” (printed at the prompt).
The .out and .err files are not helpful (as far as I can see, nothing is indicated there).
The .log file reports a SIGSEGV signal: segmentation fault - invalid memory reference.

I found a previous post on the forum that dealt with the RC=139 error. It might come from the memory needed for the usrbin.
It has to be said that I have already run this input without the source routine and with huge (almost identical) usrbin scoring without any problem. I therefore tried to run the same input with no usrbin at all. It crashes with identical messages!?!
Note also that the memory available on the cluster is quite large, so it should not be an issue.

So could the source routine be involved in this trouble? And why? Any help will be greatly appreciated.

Here are the fluka-flair files.
Fluka-forum-RC139.zip (274.6 KB)

Many thanks in advance.
Best Regards,
Jean-Baptiste

horvathd · 15 September 2023 10:53

Dear Jean-Baptiste,

I was able to run your input without any issues.

Did you try to run jobs in parallel? If yes, how many, and how much memory is available on your PC?

Cheers,
David

ceruttif · 15 September 2023 11:10

Dear Jean-Baptiste,
I confirm @horvathd's experience with your input also from my side.
Note that, in order to characterize the crash, you should focus not on RC=139 but on the content of the log file. As you reported, the problem here was ‘Segmentation fault - invalid memory reference’, with the backtrace below it pointing to a specific line of the FLUKA core code, which however appears to be pretty innocent. The error type seems to suggest a possibly temporary system issue (to be confirmed).
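
As a side note on why RC=139 carries little information by itself: on Linux shells, a return code above 128 conventionally means the process was terminated by signal number (code - 128), so 139 corresponds to signal 11, which is SIGSEGV. A minimal Python sketch of that decoding follows; the helper name `describe_return_code` is hypothetical and is not part of FLUKA or Flair.

```python
import signal


def describe_return_code(rc: int) -> str:
    """Interpret a shell-style return code (hypothetical helper, not a FLUKA tool)."""
    if rc > 128:
        # Shell convention: 128 + N means "terminated by signal N".
        signum = rc - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited with status {rc}"


# RC=139 as reported at the prompt: 139 - 128 = 11
print(describe_return_code(139))
```

The return code therefore only restates that a segmentation fault occurred; the backtrace in the log file is what locates it.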

jbpruvost · 15 September 2023 14:45

Dear David,

First, many thanks for your answer.

Second, I did in fact run 23 spawns with that input. But as I wrote, I did almost the same calculation with identical usrbin scoring, also running 23 spawns, without any problem. The difference, as far as I have checked, is the use of the source routine in the case that crashed.

I’ll check the memory available on the cluster and come back to you as soon as possible.

Best regards,

Jean-Baptiste PRUVOST

jbpruvost · 15 September 2023 14:53

Thank you very much Francesco for your answer.

This seems to suggest that, during the calculation, the temporary memory stack can grow very large (possibly made worse by the source routine?), to slightly beyond what the processor can handle?

Best regards,
Jean-Baptiste

jbpruvost · 15 September 2023 15:50

Dear David,

Here is some additional information:

I am using one node on a cluster. It has 128 GB of DDR4 RAM at 2.1GHz.
One node has 24 Xeon ES2680 cpus @2.5GHz (and I am using 23 of them for the spawns).
The scratch directory used during job execution has a capacity of several tens of TB.
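
For a rough sense of scale, the figures above can be turned into an average memory budget per spawn. This is only back-of-the-envelope arithmetic in Python: it assumes the whole 128 GB is shared evenly among the spawns and ignores the operating system and any other jobs on the node. The function name is hypothetical.

```python
def memory_per_spawn_gb(total_ram_gb: float, n_spawns: int) -> float:
    """Average RAM per spawn, assuming an even split (a simplification)."""
    if n_spawns <= 0:
        raise ValueError("n_spawns must be positive")
    return total_ram_gb / n_spawns


# Node figures quoted above: 128 GB of RAM, 23 spawns.
print(f"{memory_per_spawn_gb(128, 23):.2f} GB per spawn")
```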

I hope this helps.

Best regards,
J-B.

horvathd · 19 September 2023 08:55

Hi Jean-Baptiste,

The source routine doesn’t increase memory usage during the run, and I didn’t observe such a thing in my tests.

Is the crash reproducible on your machine? If not, then I agree with Francesco that this was probably a system hiccup.

Cheers,
David

jbpruvost · 21 September 2023 08:01

Hi David,

I took some time to run additional tests.
I tried running the same case on the cluster again, but with only 12 spawns.
The crash happened again during the first cycle, after several 10e6 primaries. It was run on a different node from the previous ones. The number of primaries per cycle was set to 2e7.
The error message in the .log file is again “Program received signal SIGSEGV: Segmentation fault - invalid memory reference”.
I also tried another source routine, somewhat more sophisticated in that it samples the position and energy of the primary particle, but without using whasou() from the input. This was again with 12 spawns on the cluster, on yet another node, and again with the same result in the 1st cycle. For each trial, I set a different random seed.

I also tried the same input and source routine on my laptop. Of course it runs more slowly (~1.7ms/pr instead of 0.84ms/pr), but it is still running today! I launched only 4 spawns (my laptop has only 8 cpus) with “only” 8e6 pr/cycle. The run is now in its 11th cycle and apparently still running without any trouble.

It seems that the way it runs on the cluster may be the issue. Do you agree?

On the other hand, some almost identical inputs that do not use a source routine run perfectly well on the same nodes, with a huge number of primaries and using almost all the cpus of the node (23/24).

I am far from sure that it makes a difference but, for completeness, the cluster runs CentOS and my laptop runs Ubuntu.
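
One thing that can differ between two Linux environments is the per-process resource limits (stack size, address space), which batch systems sometimes set differently from an interactive session. Nothing in this thread establishes that such a limit is the cause of the crash; the following is just a minimal Python sketch, using the standard `resource` module, for printing the limits on each machine so that they can be compared. The function names are hypothetical.

```python
import resource


def format_limit(value: int) -> str:
    """Render a limit in MiB, or 'unlimited' for RLIM_INFINITY."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value / 1024**2:.0f} MiB"


def report_limits() -> dict:
    """Return (soft, hard) limits of the current process for a few resources."""
    limits = {
        "stack": resource.RLIMIT_STACK,
        "address space": resource.RLIMIT_AS,
        "data segment": resource.RLIMIT_DATA,
    }
    report = {}
    for label, which in limits.items():
        soft, hard = resource.getrlimit(which)
        report[label] = (format_limit(soft), format_limit(hard))
    return report


for label, (soft, hard) in report_limits().items():
    print(f"{label}: soft={soft}, hard={hard}")
```

To be meaningful, this would need to be run inside the same kind of job in which the simulation runs on the cluster, not only on the login node, since the limits may differ between the two.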

Thanks again for your kind help.
Best regards,
Jean-Baptiste

sfaruk · 21 September 2023 13:00

Hi Jean-Baptiste,

You probably know this by now. When opening your flair file, I’ve seen these warning/error messages on my console. The simulation ran okay without any errors.

Best wishes,
Sanjeev