First steps with COMSOL 4.2 and Cluster Computing, by Maxime Harazi
  - Geometry
  - Physics
  - Meshing
  - Simulation
  - Results
  - Cluster Computing

Geometry

Physics

Meshing

Simulation

Results

Cluster Computing

COMSOL simulations can be run on clusters. This section explains how to use COMSOL 4.2 on Colosse, one of the CLUMEQ clusters, and how to take advantage of the cluster.

1) Create your model on your desktop computer, as if you were going to run it there. Then add a “Cluster Computing” node to the study:

Then open the “Cluster Computing” node to edit its settings, and check the “Distribute parametric sweep” box:

By doing this, you tell COMSOL that you are going to use several nodes and that it can distribute the parameter values across them. For example, if you run a simulation on two nodes (16 cores, since each node of Colosse has 8 cores) with an excitation whose frequency varies between 1 Hz and 1000 Hz, COMSOL will assign the range from 1 Hz to 500 Hz to one node and the range from 500 Hz to 1000 Hz to the other.
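The split described above can be pictured with a small shell sketch. It is only an illustration: the function name split_range is made up, it assumes integer frequency steps of 1 Hz, and COMSOL decides the actual distribution itself.

```shell
#!/bin/bash
# Illustration only: split an integer frequency range into contiguous
# chunks, one per node. split_range is a hypothetical helper, not a COMSOL tool.
split_range() {
    local start=$1 end=$2 nodes=$3
    local total=$(( end - start + 1 ))
    local i lo hi
    for (( i = 0; i < nodes; i++ )); do
        lo=$(( start + i * total / nodes ))
        hi=$(( start + (i + 1) * total / nodes - 1 ))
        echo "node $(( i + 1 )): ${lo} Hz to ${hi} Hz"
    done
}

# Two nodes sharing a sweep from 1 Hz to 1000 Hz
split_range 1 1000 2
```

With more nodes, each chunk gets shorter, which is why a distributed sweep finishes sooner.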

Finally, save your file.

2) Copy the .mph file to Colosse, for example to /home/your_user_name[TO CHANGE]/model.mph

3) Create a new file on Colosse called, for example, example.sh, with the following content:
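One possible way to do the copy is with scp. The sketch below is an assumption, not the documented procedure: the login host is a placeholder (use the address given by CLUMEQ), and the user name must be replaced with yours. It prints the command first so that it can be checked before anything is transferred.

```shell
#!/bin/bash
# Assumed values: replace them with your own account and login host.
REMOTE_USER="your_user_name"
REMOTE_HOST="colosse.example.org"   # placeholder, not the real login address
LOCAL_FILE="model.mph"
REMOTE_DIR="/home/${REMOTE_USER}"

# Build the copy command as an array so that paths with spaces stay intact.
COPY_CMD=(scp "${LOCAL_FILE}" "${REMOTE_USER}@${REMOTE_HOST}:${REMOTE_DIR}/")

# Show the command for review.
echo "${COPY_CMD[@]}"

# Uncomment to actually transfer the file:
# "${COPY_CMD[@]}"
```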

#!/bin/bash

#####################################
# Mandatory options                 #
#####################################


# The name of the job.
#PBS -N give_a_name
# The project (RAPI number from CCDB) to which this job is assigned.
#PBS -A dke-481-aa
# Number of nodes and cores per node.
#PBS -l nodes=10:ppn=8
# All jobs must be submitted with an estimated run time (fifteen hours here).
#PBS -l walltime=15:00:00
# Tells Colosse not to run more than one job requiring the COMSOL license at a time (we only have one COMSOL license).
#PBS -l gres=comsol_scavone


#####################################
# Optional settings                 #
#####################################

# Email address(es) to which the server that executes the job sends mail
#PBS -M john@doe.ca

# Under which circumstances mail is to be sent?
# "b" = when job begins
# "e" = when job ends
# "a" = when job aborts

#PBS -m bea

# Execute the job from the current working directory.
cd "${PBS_O_WORKDIR}"

module load /rap/dke-481-aa/modulefiles/comsol

# Command to run

# Creating the hostfile
/clumeq/bin/moabhl2hl.py --format HP-MPI > hosts.txt
# Count the number of nodes available
NN=$(wc -l < hosts.txt)

# Initialize mpd
comsol -nn ${NN} mpd boot -f hosts.txt -mpirsh ssh

# Launch the COMSOL job
comsol -nn ${NN} batch -inputfile /home/harazi/model.mph -outputfile /home/harazi/results.mph -batchlog /home/harazi/model.log -tmpdir /scratch/dke-481-aa/

# Kill all instances of mpd once finished
comsol mpd allexit

# Delete the hostfile
rm hosts.txt

The last lines, which deal with hosts.txt and mpd, are simply the way to tell COMSOL the addresses of the nodes to use.
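To make the node count concrete, here is a minimal sketch of the counting step on its own. The hostfile content and node names are made up, and it assumes the hostfile holds one line per node, as the comment in the job script suggests.

```shell
#!/bin/bash
# Hypothetical hostfile with one line per node (the node names are made up).
hostfile=$(mktemp)
printf '%s\n' "node-a" "node-b" "node-c" > "${hostfile}"

# Same counting idiom as in the job script: one line per node.
NN=$(wc -l < "${hostfile}")
NN=$(( NN ))   # strip any padding that some wc implementations add
echo "COMSOL would be started with -nn ${NN}"

# Clean up the temporary hostfile.
rm -f "${hostfile}"
```

If the hostfile listed one line per core instead of one per node, this count would be too large, so it is worth looking at hosts.txt once before relying on it.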

Now open an SSH connection to Colosse, go to the directory containing your script (example.sh), and run the command

msub example.sh

Your simulation has not started yet. Colosse, like almost all public clusters, uses a “scheduler” (Colosse uses Moab), because many people use the cluster and Colosse cannot run everything at the same time. Your job is therefore placed in a queue, and Colosse will run it when enough resources are available. One important thing to know is that Colosse also uses fairshare scheduling: the more computing power (i.e. number of nodes) you ask for, the longer you are likely to wait for your job to start. So don't ask for 512 cores unless it's 3 AM and you really need them.

The waiting time depends strongly on the resources you ask for and on the jobs other people have submitted. To check that your job is in the queue, type

showq -w user=harazi

To see how many jobs Colosse is running and how many cores are in use, type

colosse-info

Finally, you know that your job is finished when it no longer appears in the output of showq -w user=harazi. You can then open the file /home/harazi/results.mph to see the results.
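Instead of retyping showq by hand, the check can be wrapped in a small polling loop. This is only a sketch: the function names and the QUEUE_CMD variable are hypothetical, and it assumes that the job ID appears as a separate word in the showq output.

```shell
#!/bin/bash
# Queue listing command; defaults to showq and can be overridden (e.g. for testing).
QUEUE_CMD=${QUEUE_CMD:-showq}

# Succeeds while the given job ID is still listed for the given user.
job_in_queue() {
    local user=$1 job_id=$2
    "${QUEUE_CMD}" -w "user=${user}" | grep -q -w -- "${job_id}"
}

# Polls the queue every <delay> seconds (default 60) until the job disappears.
wait_for_job() {
    local user=$1 job_id=$2 delay=${3:-60}
    while job_in_queue "${user}" "${job_id}"; do
        sleep "${delay}"
    done
    echo "job ${job_id} is no longer in the queue"
}
```

For example, wait_for_job harazi 1234567 300 would check every five minutes; the job ID here is made up, so use the one printed by msub. A job that leaves the queue may also have failed, so the batch log (/home/harazi/model.log in the script above) is still worth reading.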

The official documentation of Clumeq

To check that COMSOL was correctly using the cluster, some tests were conducted. The blue curve shows the simulation time versus the number of cores used, for a given simulation. The measured times roughly agree with the expected trend, shown in red: a time proportional to 1/(number of cores).

In conclusion, using Colosse with COMSOL is very useful, since it can easily reduce the simulation time by a factor of 10 (for 80 cores).
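The 1/(number of cores) trend can be written out as a short calculation. The sketch below uses made-up numbers (a reference run of 100 minutes on 8 cores) purely to show the shape of the ideal curve; it does not reproduce the measurements, and ideal_time is a hypothetical helper.

```shell
#!/bin/bash
# Ideal scaling: time on n cores = (time on the reference core count) * ref / n.
# The reference values passed below are made up for illustration.
ideal_time() {
    local ref_cores=$1 ref_minutes=$2 cores=$3
    LC_ALL=C awk -v r="${ref_cores}" -v t="${ref_minutes}" -v n="${cores}" \
        'BEGIN { printf "%.1f\n", t * r / n }'
}

# Expected run time for a few core counts, relative to 100 min on 8 cores.
for cores in 8 16 40 80; do
    echo "${cores} cores: $(ideal_time 8 100 "${cores}") min"
done
```

Real runs usually fall somewhat above this curve, because communication between nodes and non-parallel parts of the job do not shrink with the core count.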
