A few spawned runs crash

sunil · 10 May 2022 19:20

Dear FLUKA team

I run FLUKA on a cluster using a few hundred cores. In one particular run, I got a few crashes whose cause I could not understand. Looking at one temporary directory, I see the following files (I replaced the leading part of each filename with *).

drwxrwx---+  2 sunil cels   16384 May 10 13:32 .
drwxrwx---+ 34 sunil cels 1048576 May 10 13:05 ..
-rw-rw----+  1 sunil cels      46 May  9 21:39 *_171001.log
-rw-rw----+  1 sunil cels   15873 May  9 21:39 *_171.inp
-rw-rw----+  1 sunil cels      82 May  9 21:39 *_68001.err
-rw-rw----+  1 sunil cels     622 May  9 21:39 *_68001.log
-rw-rw----+  1 sunil cels   65534 May  9 21:39 *_68001.out
-rw-rw----+  1 sunil cels   15873 May  9 21:39 *_68.inp
-rw-rw----+  1 sunil cels       0 May  9 21:39 core
-rw-rw----+  1 sunil cels       0 May  9 21:39 fort.1
lrwxrwxrwx   1 sunil cels      40 May  9 21:39 fort.11 -> *_68001.out
lrwxrwxrwx   1 sunil cels      40 May  9 21:39 fort.15 -> *_68001.err
-rw-rw----+  1 sunil cels   42563 May  9 21:39 fort.16
lrwxrwxrwx   1 sunil cels      40 May  9 21:39 fort.2 -> *_171002
-rw-rw----+  1 sunil cels       0 May  9 21:39 .timer.out

I see two input files (!), for the 68th and 171st spawns, and two log files pertaining to them. They both show the same error message:

At line 31 of file main/flabrt.f (unit = 15)
Fortran runtime error: Cannot open file 'fort.15': Too many levels of symbolic links
Error termination. Backtrace:
#0  0x2b301beb23fa in data_transfer_init
        at /blues/gpfs/software/centos7/spack-latest/var/spack/stage/gcc-9.2.0-pkmzcztqna4f2m7hxvqjrrrzpqyclnt3/gcc-9.2.0/libgfortran/io/transfer.c:2869
#1  0x54231f in flabrt_
        at main/flabrt.f:31
#2  0x490276 in fl64rd_
        at rnd/flrm64.f:210
#3  0x593b51 in rnread_
        at rnd/rnread.f:26
#4  0x404711 in flukam_
        at main/flukam.f:1565
#5  0x402f50 in fluka
        at main/fluka.f:77
#6  0x402f50 in main
        at /shared/src/usflmd.inc:15

The out file has this message at the end.

***** Next control card *****   RANDOMIZ   1.000       171.0       0.000       0.000       0.000       0.000

  **** No Random file available !!!!!! ****
 Abort called from FLRM64 reason NO RANDOM FILE Run stopped!
 STOP NO RANDOM FILE

This happened in about 10 runs out of 180. I thought that, if a folder with the intended name already exists, FLUKA writes to a temporary folder with a _00 suffix in its name instead, to prevent two input files from ending up in the same temporary directory. Is that what was happening here, or is it something else?
Many thanks for all your help.
Sunil

vasilis · 12 May 2022 08:08

I see that you submitted a run using rfluka with a filename containing the star “*” character. This creates a link to a file pattern *_68001.err instead of to input_68001.err, which leads to a crash.

sunil · 12 May 2022 11:24

There is no * in the input file name. Sorry for the confusion: I introduced it as a wildcard standing in for the long filename, which I removed before posting here. Here is what is inside the temporary folder for another crash.

(base) [sunil@beboplogin2 fluka_31659]$ ls -al
total 741
drwxrwx---+ 2 sunil cels  16384 May 11 12:09 .
drwxrwx---+ 3 sunil cels 262144 May 11 19:35 ..
-rw-rw----+ 1 sunil cels    622 May 11 12:09 11V_Li_stand_plugged_176001.log
-rw-rw----+ 1 sunil cels  17688 May 11 12:09 11V_Li_stand_plugged_176.inp
-rw-rw----+ 1 sunil cels    622 May 11 12:09 11V_Li_stand_plugged_69001.log
-rw-rw----+ 1 sunil cels  74014 May 11 12:09 11V_Li_stand_plugged_69001.out
-rw-rw----+ 1 sunil cels  17688 May 11 12:09 11V_Li_stand_plugged_69.inp
-rw-rw----+ 1 sunil cels      0 May 11 12:09 core
lrwxrwxrwx  1 sunil cels     77 May 11 12:09 fort.1 -> /lcrc/project/RadPhyRP/ATLAS/nuCAR/11VnuCARIBU/ran11V_Li_stand_plugged_176001
lrwxrwxrwx  1 sunil cels     30 May 11 12:09 fort.11 -> 11V_Li_stand_plugged_69001.out
lrwxrwxrwx  1 sunil cels     30 May 11 12:09 fort.15 -> 11V_Li_stand_plugged_69001.err
-rw-rw----+ 1 sunil cels  48586 May 11 12:09 fort.16
lrwxrwxrwx  1 sunil cels     30 May 11 12:09 fort.2 -> ran11V_Li_stand_plugged_176002
-rw-rw----+ 1 sunil cels      0 May 11 12:09 .timer.out
(base) [sunil@beboplogin2 fluka_31659]$

vasilis · 12 May 2022 12:53

This is also strange: I see two .inp and two .log files inside, without any .err file. Normally rfluka should copy only the master input.
Could it be that, with so many runs, some of them end up running in the same temporary fluka_XXX folder?
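
One way to test this suspicion across all leftover temporary folders is sketched below in Python. It is only an illustration: the function name shared_temp_dirs and the fluka_* pattern are assumptions based on the folder names shown in this thread. It reports every temporary folder that holds more than one .inp file, which is the symptom visible in the listings above.

```python
import glob
import os


def shared_temp_dirs(root, pattern="fluka_*"):
    """Return {folder: sorted .inp names} for folders holding more than one .inp file.

    Hypothetical helper: `root` is the directory that contains the
    temporary fluka_XXX folders; `pattern` is assumed from the thread.
    """
    suspects = {}
    for folder in sorted(glob.glob(os.path.join(root, pattern))):
        if not os.path.isdir(folder):
            continue
        inputs = sorted(f for f in os.listdir(folder) if f.endswith(".inp"))
        if len(inputs) > 1:  # a single run should own the folder alone
            suspects[folder] = inputs
    return suspects


if __name__ == "__main__":
    for folder, inputs in shared_temp_dirs(".").items():
        print(folder, inputs)
```

An empty result would suggest that no folder was shared, at least among the folders that still exist.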

sunil · 12 May 2022 13:13

Vasilis,
Yes, I agree, it's strange, and that is what perplexes me.

“Could it be that, with so many runs, some of them end up running in the same temporary fluka_XXX folder?”

It appears that is what is happening here, unless the cluster/Slurm is doing something. But isn’t that scenario supposed to be impossible, because FLUKA gives the target folder a suffix such as “_00” when a folder with the intended number/name already exists? I have seen FLUKA do that as well.

vasilis · 12 May 2022 13:47

Normally it should check for _00, _01, … until it finds a non-existing directory, and it should abort if there are 100 directories with the same name.
The check is done in rfluka, lines 232-249.

However, there is no semaphore locking, so we might have a race condition: two jobs starting at the same moment could both find the same name free and then both use it.
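
The difference between checking for a free name and claiming it can be sketched in Python. This is only an illustration under stated assumptions, not the rfluka code: the function name claim_run_dir and the folder names are hypothetical, and the suffix scheme (base name first, then _00 to _99) is taken from the description in this thread. It relies on os.mkdir, which either creates the directory or raises FileExistsError, so the test and the creation happen in a single step.

```python
import os
import tempfile


def claim_run_dir(parent, base, max_suffixes=100):
    """Claim a working directory under `parent` without a separate existence check.

    Tries `base`, then `base_00` ... `base_99` (scheme assumed from the thread).
    """
    candidates = [base] + [f"{base}_{i:02d}" for i in range(max_suffixes)]
    for name in candidates:
        path = os.path.join(parent, name)
        try:
            os.mkdir(path)  # creates the directory or fails; no check-then-create gap
        except FileExistsError:
            continue  # another job already owns this name; try the next suffix
        return path
    raise RuntimeError(f"no free directory name for {base!r} in {parent!r}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as parent:
        first = claim_run_dir(parent, "fluka_31659")
        second = claim_run_dir(parent, "fluka_31659")
        print(os.path.basename(first), os.path.basename(second))
        # prints: fluka_31659 fluka_31659_00
```

By contrast, a script that first tests whether the directory exists and only afterwards creates it leaves a window in which a second job can pass the same test.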

vasilis · 13 May 2022 14:23

@sunil, if what I suspect is the cause of your problem, one easy solution on your side would be to submit each run in a separate folder.
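
A minimal sketch of that workaround, in Python, is given below. The helper name prepare_run_dirs and the one-folder-per-input layout are assumptions made for illustration, not part of FLUKA or rfluka. It only prepares the folders; each job would then change into its own folder before launching rfluka in the usual way.

```python
import os
import shutil


def prepare_run_dirs(input_files, work_root):
    """Create one folder per input file under `work_root` and copy the input into it.

    Hypothetical helper: each folder is named after the input file's stem,
    e.g. .../11V_Li_stand_plugged_69/ for 11V_Li_stand_plugged_69.inp.
    """
    run_dirs = []
    for inp in input_files:
        stem = os.path.splitext(os.path.basename(inp))[0]
        run_dir = os.path.join(work_root, stem)
        os.mkdir(run_dir)  # raises FileExistsError rather than reusing a folder
        shutil.copy(inp, run_dir)
        run_dirs.append(run_dir)
    return run_dirs
```

With this layout, temporary folders created by different runs live under different parents, so two runs that happen to pick the same temporary folder name no longer collide.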