Time in START card with PBS

sunil · 7 June 2024 15:31

Dear Team,
We now have two different job scheduling systems on our HPC, Slurm and PBS. I usually set a time limit (WHAT(6) in START) to terminate a run within the allocation requested for a job. It's a lazy way to get a job terminated cleanly instead of calculating the time required from the number of histories in a trial run.
With Slurm, I get a clean exit with this technique, while with PBS the jobs get killed. It's not a small difference that I can take into account by adjusting the wall time. For example, a 3-hour job goes on 30 minutes past the time I set in the START card. In this case, I asked for more time to see whether it would terminate, say, 5 or 10 minutes past the time limit. I had to intervene (rfluka.stop technique) to stop the run. While I can think of a few ways to automate this technique if I cannot solve this problem, I am curious to know:

Does anyone know why this happens?
I understand FLUKA looks for some type of signal when the run time is close to the set time. What type of signals are these?

Any additional information will help me discuss this further with my system administrators.

Many thanks
Cheers, Sunil

vasilis · 7 June 2024 16:20

Hi Sunil,

START WHAT(6), the time available for the run, is compared on every primary against the CPU time consumed so far (not the real, wall-clock time).
The cycle stops once the CPU time exceeds the specified limit.
Could the gap between CPU time and real time mean that, on your PBS cluster, you are running more jobs than you have cores available?
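
To illustrate the distinction, here is a minimal Python sketch (not FLUKA code; the function name and its parameters are made up for this example). It loops until a CPU-time budget is used up and reports how much wall-clock time passed meanwhile. The sleep stands in for time a process spends waiting for a core, which does not count as CPU time.

```python
import time


def run_until_cpu_limit(cpu_limit_s, idle_per_step_s):
    """Loop until the consumed CPU time exceeds cpu_limit_s.

    Returns (cpu_seconds, wall_seconds). Hypothetical helper for illustration only.
    """
    cpu_start = time.process_time()    # CPU time used by this process
    wall_start = time.perf_counter()   # elapsed real (wall-clock) time
    while time.process_time() - cpu_start < cpu_limit_s:
        # Some busy work that does consume CPU time.
        sum(i * i for i in range(20000))
        # Idle period: the wall clock advances, the CPU clock does not.
        time.sleep(idle_per_step_s)
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used, wall_used


if __name__ == "__main__":
    cpu, wall = run_until_cpu_limit(cpu_limit_s=0.2, idle_per_step_s=0.01)
    print(f"CPU time: {cpu:.2f} s, wall time: {wall:.2f} s")
```

If a job shares its core with other processes, a limit expressed in CPU time is reached later in wall-clock terms, in the same way as in this toy loop.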

Another issue I can think of is that the subsequent cycles will still be launched by the rfluka script.

FYI, some years ago we implemented a smooth termination for PBS/TORQUE systems by intercepting the SIGTERM signal that is sent a few minutes before the job is killed. We will add it to the code.
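
As a rough illustration of the idea only (this is not the FLUKA implementation, and all names below are hypothetical), a process can catch SIGTERM, set a flag, and let its main loop finish the current history before exiting cleanly:

```python
import signal

# Flag set by the signal handler and polled by the main loop.
stop_requested = False


def request_stop(signum, frame):
    """Signal handler: only record the request, do no heavy work here."""
    global stop_requested
    stop_requested = True


# Install the handler for SIGTERM.
signal.signal(signal.SIGTERM, request_stop)


def run_histories(max_histories, on_history=None):
    """Process histories until done or until a stop has been requested.

    on_history is an optional hypothetical callback invoked once per history.
    Returns the number of histories completed.
    """
    done = 0
    while done < max_histories and not stop_requested:
        if on_history is not None:
            on_history(done)   # stand-in for simulating one primary
        done += 1
    # At this point the results could be written out before exiting.
    return done
```

The handler does nothing but set the flag, so the history in progress always completes and the output can be finalized before the scheduler's hard kill arrives.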

sunil · 7 June 2024 17:00

Hi Vasilis,
Thanks for your response.
I can see that all the submitted jobs begin at the same time. The script is set up to request a number of nodes and to use all the cores in each node. The number of jobs is equal to the number of cores requested.
Also, on the HPC, I run only one cycle per job. I wait to inspect the output files before deciding whether to run additional cycles.
The feature that you mention will undoubtedly help.
Sunil