Running a script file on a Linux cluster via SSH

script
cluster

#1

I am currently trying to run a script file on a 24-node cluster, which I have accessed through PuTTY. The script will probably take around 48 hours and consists of a series of 400 individual simulations. However, if the SSH connection is interrupted for any reason (the computer goes into sleep mode, I have to undock my laptop so the internet connection switches from Ethernet to Wi-Fi, etc.), the simulation stops, citing a ‘Fatal I/O error’.

I have tried using the ‘screen’ command and detaching the screen, and I have also tried ‘nohup’. In both cases the simulation still terminates if the session is closed.
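For reference, these are roughly the invocations I tried (run_sims.sh is just a stand-in for my actual launch script):

    # screen: start a named session, launch the script inside it,
    # detach with Ctrl-A d, reattach later with 'screen -r sims'
    screen -S sims

    # nohup: detach the script from the terminal and log its output
    nohup ./run_sims.sh > sims.log 2>&1 &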

I think the problem is that even if I use the -nw option, a job manager window pops up with the usual dialogue every time a new simulation starts. I believe it is this window that Lumerical tries to deliver to my computer screen after the disconnection, which produces the error. Is there any way to get around this?

EDIT: It would be preferable not to have the job manager window pop up with each simulation run; is there any way to suppress this window from appearing? (I suppose I could append all the simulations to a job list and then run that.)


#2

I assume you are using FDTD Solutions. If you are running a parameter sweep that generates, runs and analyzes the series of 400 simulations, it might be better to split the script into two parts: first, generate all the simulation files and submit them to the cluster (the file generation requires the GUI, but you can run the jobs on the cluster without the GUI, as shown here: https://kb.lumerical.com/en/index.html?user_guide_run_linux_solver_command_line_mpi.html).
Then, once all jobs are done, you can load the results and process them if needed (that will require the GUI again).
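As a rough illustration of the command-line step, solving one of the generated project files might look like this (the install path, process count and engine binary name here are assumptions; the exact engine name depends on your MPI build, as the KB page above explains):

    # Solve one generated project file with 8 MPI processes (illustrative values)
    mpirun -np 8 /opt/lumerical/fdtd/bin/fdtd-engine-ompi-lcl sim_001.fsp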

Usually, clusters use a job scheduler that starts jobs when the required resources are available. This allows you to submit the job and then log out from your session on the cluster while the jobs are queued and later executed. Do you know if one is installed?
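For example, if a scheduler such as SLURM were installed, a submission script could look roughly like this (job name, resources and paths are made up for illustration); you would submit it with ‘sbatch job.sh’ and could then log out while it queues and runs:

    #!/bin/bash
    #SBATCH --job-name=fdtd_sweep
    #SBATCH --ntasks=8
    #SBATCH --time=48:00:00
    # The engine path and name are assumptions; adjust for your installation.
    mpirun /opt/lumerical/fdtd/bin/fdtd-engine-ompi-lcl sim_001.fsp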


#3

The solution I found to prevent the X application from interfering with a batch-run procedure is to use Xvfb. It behaves like an X server, except that its output is never shown by the graphics system. I use it to generate .fsp files from .lsf scripts without having to open the CAD editor on a physical display. Say, for example, that I want to prepare the layout of a simulation and I have all the steps coded in model.lsf. These are the steps to run the script without windows popping up:

  1. Xvfb :25 &

  2. /path_to_CAD/CAD -run model.lsf -display :25

One problem you might have is that, for some reason, Xvfb sometimes dies. If you are running many scripts in a row, it can happen that the display you pass on the command line no longer exists. I used to have a Python script to prevent this from happening: it basically starts an Xvfb, does the task, kills the Xvfb and removes the lock file.

import os
import subprocess as sub
import sys

try:
    # Launch a virtual framebuffer on display :25.
    os.system('Xvfb :25 &')
    # Run the CAD code headlessly against that display.
    retcode = sub.call('/usr/local/lumerical_7/bin/CAD -run tmp.lsf -display :25', shell=True)
    # Get the PID of the Xvfb process and kill it
    # (assumes this is the only Xvfb instance running).
    PID_l = os.popen('ps -e | grep Xvfb').readlines()
    if len(PID_l) > 0:
        PID = PID_l[0].split()[0]
        os.system('kill -9 ' + PID)
    # Remove the stale lock file so display :25 can be reused.
    os.system('rm -f /tmp/.X25-lock')

    # Report how the CAD process exited.
    if retcode < 0:
        print("Child was terminated by signal", -retcode, file=sys.stderr)
    else:
        print("Child returned", retcode, file=sys.stderr)
except OSError as e:
    print("Execution failed:", e, file=sys.stderr)
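As an aside, if the xvfb-run wrapper is available on your system, it automates the same launch/clean-up cycle; something like the following should be equivalent (the -a flag picks a free display number automatically, and the CAD path is the same assumption as above):

    xvfb-run -a /usr/local/lumerical_7/bin/CAD -run tmp.lsf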

I believe that this trick might well fix your problem.


#4

Yes, FDTD solutions.
So to clarify, you are suggesting I generate and save individual simulation files for all the different parameters, then iteratively run each simulation file via MPI using a bash script?

It seems that the problem here is that script files are intrinsically linked to the GUI, even though there is no real need for that to be the case.

Unfortunately this cluster does not have a job scheduler; it’s a fairly quiet cluster that’s only used by a few people in my department.


#5

This is the solution, and it is fairly simple. I simply replaced the ‘run;’ line in my script with a ‘save("filenamex");’ command to generate all the files, then wrote a bash script that first ran the .lsf before looping over all the files with the MPI solver. This has the added advantage of using MPI, which makes everything faster. Thank you.
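For anyone finding this later, the bash script looked roughly like the sketch below (the script name, install path, file pattern and engine binary name are placeholders specific to my setup; see the KB link in #2 for the engine name that matches your MPI build). The .lsf runs headlessly using the Xvfb trick from #3, so no window appears:

    #!/bin/bash
    # Step 1: run the setup script headlessly to generate the .fsp files
    # (the script calls save(...) instead of run;)
    xvfb-run -a /usr/local/lumerical_7/bin/CAD -run generate.lsf

    # Step 2: loop over the generated files and solve each one with the MPI engine
    for f in sim_*.fsp; do
        mpirun -np 24 /usr/local/lumerical_7/bin/fdtd-engine-ompi-lcl "$f"
    done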