Distributed computing in Linux


#1

How can we run simulations using distributed computing on Linux systems?

Details on distributed computing can be found here:
https://kb.lumerical.com/en/index.html?installation__distributing-a-job-across-several_computers.html

With the latest versions of FDTD and MODE Solutions, distributed simulations can only be run from the Terminal window. This is not possible from the CAD/GUI of the software.

We can use the following command to test the MPI functionality. Note that it uses the IP addresses of the computers instead of their hostnames.

/opt/lumerical/mode/mpich2/nemesis/bin/mpiexec -hosts <node1_IP>:4,<node2_IP>:4 /opt/lumerical/mode/mpitest/cpi-mpich2nem

Information on the MPI test program CPI can be found here:
https://kb.lumerical.com/en/index.html?installation_and_setup_linux_run-mpi-test-program-cpi.html

To run a simulation job distributed across 2 computers, we can use the command:

/opt/lumerical/fdtd/mpich2/nemesis/bin/mpiexec -hosts <node1_IP>:4,<node2_IP>:4 /opt/lumerical/fdtd/bin/fdtd-engine-mpich2nem -t 1 -logall /<file_location>/<filename.fsp>
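
If the node list changes often, the `-hosts` argument can be assembled by a small shell helper instead of being typed by hand. This is only a sketch: the function name `build_hosts_arg`, the example IP addresses, and the `.fsp` path are placeholders and not part of Lumerical's tooling. The install paths are the ones from the command above.

```shell
#!/bin/sh
# Sketch: build the "ip:n,ip:n" value for mpiexec -hosts from a list of node IPs.
# build_hosts_arg is a hypothetical helper, not a Lumerical command.

build_hosts_arg() (
    procs_per_node="$1"; shift
    arg=""
    for ip in "$@"; do
        # Append "ip:n", adding a comma only between entries.
        arg="${arg:+$arg,}${ip}:${procs_per_node}"
    done
    printf '%s\n' "$arg"
)

# Placeholder addresses and file path. "echo" prints the command
# instead of running it; remove it to launch the job.
hosts="$(build_hosts_arg 4 192.168.1.10 192.168.1.11)"
echo /opt/lumerical/fdtd/mpich2/nemesis/bin/mpiexec -hosts "$hosts" \
    /opt/lumerical/fdtd/bin/fdtd-engine-mpich2nem -t 1 -logall /path/to/simulation.fsp
```

The first argument is the number of processes per node and the remaining arguments are the node IP addresses.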

Additional information on running the solver with MPICH2 is found here:
https://kb.lumerical.com/en/index.html?user_guide_run_linux_solver_command_line_mpi.html


#2

Our computing system is a cluster with several computing nodes, and we use a job scheduling system
(such as PBS or Slurm) to run jobs from different user accounts.
Is it possible to run distributed computing through the PBS system, e.g. run a *.fsp file with very high memory requirements on several nodes at the same time?


#3

Please find below a link to the KB page on submitting jobs to a scheduler:
https://kb.lumerical.com/en/index.html?install_linux_cluster_setup.html

Update the script files accordingly, using a command similar to the one we provided:
/opt/lumerical/fdtd/mpich2/nemesis/bin/mpiexec -hosts <node1_IP>:4,<node2_IP>:4 /opt/lumerical/fdtd/bin/fdtd-engine-mpich2nem


#4

I note the option "mpiexec -hosts <node1_IP>:4,<node2_IP>:4" in the command. This means I have to set the hosts (IPs) that run the simulation myself.
The job would then not run under the control of the job manager system (PBS), which is not allowed by our cluster administrator.


#5

Hi @wenqiang.wang,

Unfortunately, in our tests we were unable to run the distributed command using hostnames on multiple Linux machines. You can still try using the hostname instead of the IP address, to test whether your system resolves the nodes' hostnames and runs the simulation job through the scheduler without any errors.


#6

I just want to run a single simulation on several machines through the job manager system.
I think that will be very difficult for me to achieve.


#7

Unfortunately, using IP addresses to define the nodes is the only workaround we have as of now.
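
One way this workaround might still be combined with a scheduler, as an untested sketch: let PBS allocate the nodes, then translate the hostnames it assigned into IP addresses inside the job script, so the job stays under the scheduler's control. This assumes a PBS/Torque-style scheduler that exports `$PBS_NODEFILE` (one hostname per allocated slot), that `getent hosts` can resolve those names on the compute nodes, and that the `#PBS -l` line is adapted to your site's resource syntax. The names `resolve_ip` and `nodefile_to_hosts_arg` and the `.fsp` path are hypothetical.

```shell
#!/bin/sh
#PBS -N fdtd_distributed
#PBS -l nodes=2:ppn=4
# The resource request above uses Torque-style syntax and is an assumption;
# adjust it to whatever your cluster expects.

# Resolve a hostname to the first address reported by the system resolver.
# Note: getent hosts may return an IPv6 address on some systems.
resolve_ip() {
    getent hosts "$1" | awk 'NR == 1 { print $1 }'
}

# Convert a node file (one hostname per slot) into "ip:count,ip:count".
# The optional second argument names an alternative resolver function.
nodefile_to_hosts_arg() (
    nodefile="$1"
    resolver="${2:-resolve_ip}"
    arg=""
    for host in $(sort -u "$nodefile"); do
        # Number of slots the scheduler assigned on this host.
        count="$(grep -c -x -F -- "$host" "$nodefile")"
        ip="$($resolver "$host")"
        [ -n "$ip" ] || { echo "cannot resolve $host" >&2; exit 1; }
        arg="${arg:+$arg,}${ip}:${count}"
    done
    printf '%s\n' "$arg"
)

# Launch only when running inside a PBS job.
if [ -n "${PBS_NODEFILE:-}" ]; then
    hosts="$(nodefile_to_hosts_arg "$PBS_NODEFILE")" || exit 1
    /opt/lumerical/fdtd/mpich2/nemesis/bin/mpiexec -hosts "$hosts" \
        /opt/lumerical/fdtd/bin/fdtd-engine-mpich2nem -t 1 -logall /path/to/simulation.fsp
fi
```

Submitted with `qsub`, the node selection would remain with PBS, and only the hostname-to-IP translation happens in the script. Whether the engine then runs correctly across the allocated nodes would still need to be verified on the cluster.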


#8

:joy:Thanks all the same.