Dear FLUKA experts,
I’m running some simulations over a fleet of computers, and while most of the jobs executed properly (apart from some occasional, harmless file-closing difficulties), a couple of them crashed (2-3 % of the total jobs) with the error:
Error: "XXX/flukadpm" executable returned RC=143
As suggested in other posts, I looked at the .err, .out and .log files of those jobs, but there is no real tell for the problem.
I wonder whether it is because I do not build an executable for each machine, but rather build one and use it on all machines (as they share storage). They all run the same OS and use a FLUKA installation at the same location, so I believe I do not need to build a dedicated executable for each machine.
Do you know whether this code corresponds to trying to open a file that is already open, or to a different type of problem?
As long as the executable location is accessible from each machine, this should not cause any problem.
RC=143 should mean that the operating system sent a SIGTERM signal to “gracefully terminate” the command. It is not possible to infer more.
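As a quick illustration (a minimal sketch, not taken from the original jobs): POSIX shells report a command killed by signal N with exit status 128 + N, so a SIGTERM (signal 15) shows up as 143:

```shell
# A child process that terminates itself with SIGTERM...
sh -c 'kill -TERM $$'
# ...is reported by the shell with exit status 128 + 15 = 143
echo "RC=$?"   # prints: RC=143
```

This is why RC=143 points at the operating system (or a batch system / watchdog) terminating the job, rather than at an error raised inside the executable itself.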
Could you please share the .err, .out and .log files?
Dear @amario ,
Thanks for the info. It might be that I was too taxing on the computers and had fully loaded them, so perhaps a flag was raised telling the system to shut down some of the jobs.
As requested, I attach the .err file of a crashed job as well as that of the previous run, which completed without problems (the .log file is always empty). The RC=143 appeared in the stdout, which I captured with the nohup command.
some_file-correct_14007.err (23.3 KB)
some_file-correct_14007.out (304.1 KB)
some_file-crashed_14008.err (9.1 KB)
some_file-crashed_14008.out (270.4 KB)
By decreasing the number of jobs, is it working now without crashes? This RC (return code) message comes from your system/cluster, not from FLUKA.
I invite you to look into your system’s log files for more details, and to find out how many jobs you can launch at the same time and what the wall times are.
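To make that correlation easier, a launcher script can record each job’s exit status with a timestamp, separating normal exits from signal-induced terminations. The sketch below uses illustrative names (the wrapper function and log messages are not part of your actual setup):

```shell
#!/bin/sh
# Sketch: log each job's outcome using the 128 + signal-number convention.
# run_logged and its messages are illustrative placeholders.
run_logged() {
    "$@"
    rc=$?
    if [ "$rc" -gt 128 ]; then
        echo "$(date): $1 killed by signal $((rc - 128)) (RC=$rc)"
    else
        echo "$(date): $1 exited normally (RC=$rc)"
    fi
    return "$rc"
}

# Example: a job that receives SIGTERM, as the crashed runs did
run_logged sh -c 'kill -TERM $$'
```

The timestamps of the "killed by signal" lines can then be matched against the system logs to see whether the kills coincide with high load, out-of-memory events, or scheduler limits.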
If it’s ok for you, I think we can close the thread here.