Dear Team,
We now have two different job scheduling systems on our HPC cluster, Slurm and PBS. I usually set a time limit (WHAT(6) in START) to terminate a run within the allocation requested for a job. It's a lazy way to get a job terminated cleanly instead of calculating the time required from the number of histories in a trial run.
With Slurm, I get a clean exit with this technique, while with PBS the jobs get killed. It's not a small difference that I can take into account by adjusting the wall time. For example, a 3-hour job keeps running 30 minutes past the time I set in the START card. In this case, I had asked for more wall time to see whether the run would terminate, say, 5 or 10 minutes past the time limit. It did not, and I had to intervene (rfluka.stop technique) to stop the run. While I can think of a few ways to automate this technique if I cannot solve this problem, I am curious to know:
- Does anyone know why this happens?
- I understand FLUKA looks for some type of signal when the run time is close to the set time. What type of signals are these?
Any additional information will help me discuss this further with my system administrators.
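In case it helps the discussion, the automated fallback I mentioned could look roughly like the sketch below. It only mimics what I do by hand: wait until shortly before the wall time, then create an rfluka.stop file in the temporary run directory. The function names, the margin, and the assumed layout (temporary fluka_* directories inside the job's working directory) are my own assumptions and would need checking against our setup.

```python
import glob
import os
import time


def request_stop(run_dir, stop_name="rfluka.stop"):
    """Create an empty stop file in every fluka_* subdirectory of run_dir.

    Assumption: the running job keeps its temporary files in
    directories named fluka_<something> directly under run_dir.
    """
    created = []
    for tmp_dir in sorted(glob.glob(os.path.join(run_dir, "fluka_*"))):
        if os.path.isdir(tmp_dir):
            stop_path = os.path.join(tmp_dir, stop_name)
            # Opening in append mode creates the file without truncating it.
            open(stop_path, "a").close()
            created.append(stop_path)
    return created


def stop_after(run_dir, walltime_s, margin_s, sleep=time.sleep):
    """Sleep until margin_s seconds before the wall time, then request a stop.

    walltime_s and margin_s are hypothetical parameters chosen by the user;
    sleep is injectable so the logic can be checked without waiting.
    """
    sleep(max(0, walltime_s - margin_s))
    return request_stop(run_dir)


if __name__ == "__main__":
    # Example: 3-hour allocation, ask for a stop 15 minutes before the end.
    print(stop_after(os.getcwd(), walltime_s=3 * 3600, margin_s=15 * 60))
```

Something like this would be started in the background from the job script before launching the run. The margin would have to be long enough for the current cycle to finish writing its output.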
Many thanks
Cheers, Sunil