Accurate first-principles electronic structure and transport calculations of realistic systems are often enormously time-consuming. One important way to improve the efficiency is to employ parallel computing. This has been implemented in ATK by using the MPI specification (Message Passing Interface). In this document, we will describe how you can take advantage of the parallel power of ATK, by looking at system requirements, and showing how to launch a parallel calculation. The different license options for parallel computing using ATK will also be discussed briefly.
Moreover, we will describe which parts of the code have been parallelized, and discuss how this influences the parallel performance for various types of systems and simulations. As you will see, in the best cases, you can achieve almost linear scaling with the number of processors for heavy transport calculations.
Observe that running in parallel does not necessarily imply that you must have a large number of CPUs at your disposal. Even just a handful, or as few as two CPUs (e.g. dual-core CPUs), will provide a substantial performance boost for many systems.
In the current version of ATK (2008.10), the high-performance Intel Math Kernel Library (MKL) is used in many time-consuming parts of ATK. MKL supports multi-threading for multi-core CPUs. For a full description of threading using the MKL library, see Controlling threading in MKL.
In this section, we cover some of the basic system requirements for parallel setups, and discuss some related technical details. Generally, these considerations are a task for a system administrator rather than the end-user.
In order to perform a parallel computation, more than one CPU is obviously required. There are several conceivable hardware configurations, ranging from shared memory/multi-CPU supercomputers, or dual core computers, to clusters and grid environments.
ATK is parallelized using MPI, and the code is, in principle, compatible with any MPI-2 compliant implementation of MPI. In the future, ATK will be linked dynamically to the parallelization libraries, thereby making it possible for the users to use different MPI implementations.
At present, the Linux version of ATK contains parallel support by default. It is also possible to obtain a parallel version for Windows on specific request. Please contact the QuantumWise sales department for further details.
The presently shipped version of ATK (2008.10) is statically linked against MPICH2 version 1.0.5p4, and therefore requires the same MPI implementation to be installed on your network in order for ATK to function in parallel. MPICH2 1.0 can be downloaded from the MPICH2 home page. The library should be straightforward to install by following the accompanying installation instructions.
In addition, it is necessary to have NPTL (Native POSIX Thread Library) installed; since NPTL is a standard part of almost all modern Linux distributions, it should rarely be an issue.
Finally, it is of course also necessary to fulfill the standard system requirements for ATK (see Installing ATK). ATK, as well as the input and output files, must be placed in a location on the network that is accessible to all nodes under the same absolute path.
QuantumWise uses the FLEXlm license management system from Macrovision, and the licenses required to run ATK come in two different flavors: ATKmaster and ATKslave licenses. For each calculation, one ATKmaster license is required to start ATK. In addition, to run a calculation in parallel, it is necessary to supplement the master license with a number of slave licenses equal to the number of additional parallel nodes that you wish to employ for the calculation.
Note that slave licenses are only available as floating licenses, which require a license server. For more details, in particular on how to install licenses and license servers, please see the guide The FLEXlm license system. If you have further questions about parallel licenses (such as pricing, etc), please contact the QuantumWise sales department.
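The license arithmetic described above can be sketched as a small helper. The function name required_licenses is hypothetical and not part of ATK or FLEXlm; it only encodes the stated rule of one ATKmaster license per calculation plus one ATKslave license per additional parallel node.

```python
def required_licenses(num_nodes):
    """Return (master, slave) license counts for one ATK calculation.

    Hypothetical helper, not part of ATK: it only encodes the rule that each
    calculation needs one ATKmaster license plus one ATKslave license for
    every additional parallel node.
    """
    if num_nodes < 1:
        raise ValueError("a calculation needs at least one node")
    return 1, num_nodes - 1


if __name__ == "__main__":
    for n in (1, 2, 8):
        master, slave = required_licenses(n)
        print("%d node(s): %d ATKmaster + %d ATKslave" % (n, master, slave))
```

A serial run therefore needs no slave licenses, while a run on eight nodes needs seven.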
Below we give a brief description of the standard procedure for launching parallel calculations in MPICH2 1.0. Note that some details may differ on your network, so check with your system administrator. In particular, it is not unlikely that you need to use a queuing system (such as PBS, Portable Batch System) to submit your jobs.
First of all, test that MPICH2 1.0 is available by giving the following command on the master node:
mpiexec -h
This should display the mpiexec help.
The command:
mpich2version
gives information on which version of MPICH2 is installed.
It is most likely that your MPICH2 configuration is set up to use SSH to communicate between the nodes. In this case it is necessary to provide your network password each time a node is added to the calculation. This will quickly become tiresome, at least if you use more than a handful of nodes. A convenient work-around is to use an ssh-agent with a password-less RSA/DSA public key. Contact your system administrator for more information.
Next, verify that ATK is properly installed. In this guide, we will symbolically
denote the directory in which the atk binary is located by
$ATK_BIN_DIR, but in fact it is a good idea to define the
corresponding environment variable.
Test the ATK installation by giving the command
$ATK_BIN_DIR/atk --version
This should display the version number of ATK.
Next, launch a small parallel NanoLanguage script to make sure that ATK is properly
configured to run in parallel on your system. Use the following script
(test_mpi.py)
as input file:
from ATK.MPI import *

if processIsMaster():
    print '# Master node'
else:
    print '# Slave node'
The most important parameter to mpiexec is the option
“-n” which is used to specify the number of
processors to use for the calculation. In this test, we will use two processors
(feel free to use more, if your system supports it):
mpiexec -n 2 $ATK_BIN_DIR/atk /home/myusername/test_mpi.py
Please observe that it is always necessary to give the full path to both the ATK binary and the input file when using mpiexec; it is not enough to put the ATK binary directory in your PATH. This is where the additional environment variable mentioned above becomes useful.
If all works out, there will be two (or more, if you used more processors) lines
# Master node
# Slave node
written on the terminal indicating that the calculation was indeed run in parallel (the order of the lines may vary; this is not an error).
If any of these steps fails, please contact your system administrator, or consult the guide The FLEXlm license system if you suspect the problem is related to ATK installation. If all looks fine, you can start to launch your own parallel calculations!
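Since mpiexec needs full paths to both the ATK binary and the input file, it can be convenient to assemble the command programmatically and reject relative paths early. The following is a minimal sketch; the helper name build_mpiexec_command and the directory /opt/atk/bin are assumptions made for illustration, not part of ATK or MPICH2.

```python
import os


def build_mpiexec_command(atk_bin_dir, script_path, num_processes):
    """Return the mpiexec argument list for a parallel ATK run.

    Hypothetical helper: it only assembles the command shown in this guide
    and rejects relative paths, since mpiexec needs the full path to both
    the ATK binary and the input file.
    """
    atk_binary = os.path.join(atk_bin_dir, "atk")
    for path in (atk_binary, script_path):
        if not os.path.isabs(path):
            raise ValueError("mpiexec needs an absolute path, got: %s" % path)
    return ["mpiexec", "-n", str(num_processes), atk_binary, script_path]


if __name__ == "__main__":
    # Both paths are placeholders; substitute your own installation.
    command = build_mpiexec_command(
        "/opt/atk/bin", "/home/myusername/test_mpi.py", 2)
    print(" ".join(command))
```

The returned list can be passed directly to subprocess.call if you prefer to launch the job from Python rather than from the shell.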
Below we list some further useful MPICH2 options that may be of interest. Please refer to the MPICH2 manual for a complete description of these and other options. The program mpiexec is actually a script that supports a long list of arguments, and it may also have been modified locally on your platform to fit particular network needs and requirements.
To run a single job locally (on the current node, which may be the master or a slave):
mpiexec -n 1 $ATK_BIN_DIR/atk [args...]
To run atk on 4 processors:
mpiexec -n 4 $ATK_BIN_DIR/atk [args...]
If you want to run on a specific set of machines you can construct a machine file. To run 2 jobs on the specified machines:
mpiexec -n 2 -machinefile mymachinefile $ATK_BIN_DIR/atk [args...]
In order to run on the two machines slave1 and
slave2, the mymachinefile would look like:
slave1
slave2
In the above, args... refers to all additional parameters you would
provide to your ATK run on a single processor.
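For longer host lists, the machine file can be generated rather than typed by hand. The sketch below assumes the simple one-host-per-line layout shown above; the function name machinefile_text is hypothetical.

```python
import sys


def machinefile_text(hosts):
    """Return machine file contents with one host name per line."""
    cleaned = [host.strip() for host in hosts if host.strip()]
    if not cleaned:
        raise ValueError("at least one host is required")
    return "\n".join(cleaned) + "\n"


if __name__ == "__main__":
    # Prints the file contents; redirect the output to 'mymachinefile'
    # to use it with the -machinefile option.
    sys.stdout.write(machinefile_text(["slave1", "slave2"]))
```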
In ATK 2008.02 the version of MPI was upgraded from MPICH version 1.2 to MPICH2 version 1.0. Even though the new MPICH2 introduces a number of new features it is not completely backwards compatible. This section describes some of the usual problems that may be encountered.
MPICH2 introduces a process management environment called mpd (consult the MPICH2 user's guide for more information). This environment must be running when executing ATK with mpiexec. We recommend starting mpd whenever the machine boots.
In MPICH 1.2 mpirun was used to execute an application in parallel. In MPICH2 it is encouraged to use mpiexec instead. However, note that MPICH2 still supports the use of mpirun.
The most commonly used command line arguments for mpirun in MPICH 1.2 have remained unchanged. Note, however, that some of the more advanced arguments are now a part of the configuration of the process management environment mpd.
If you wish to print or store data when running ATK in parallel, some special
precautions are needed, since I/O in this case should be handled by the master
process only. This, however, can easily be controlled from within ATK by using
the Boolean function processIsMaster(), which is included in
the ATK module ATK.MPI.
The function processIsMaster() returns True
if called by the master process (it also always returns True when
ATK is run in serial) and False otherwise. A very simple
in-line fix to handle VNLFile I/O could
therefore be
# Open a VNL file (MPI-safe)
from ATK.MPI import *
...
if ( processIsMaster() ):
    vnl_file = VNLFile('h2.vnl')
    vnl_file.addToSample(h2, 'h2')
A Python print statement should be encapsulated like this
# MPI-safe print statement
from ATK.MPI import *
...
if ( processIsMaster() ):
    print 'Total molecule energy is...'
whereas general file I/O is handled as
# Open a VNL file (MPI-safe)
from ATK.MPI import *
...
if ( processIsMaster() ):
    vnl_file = VNLFile('h2.vnl')
    vnl_file.addToSample(h2, 'h2')
Always keep these considerations in mind when you do I/O within a parallel execution of an ATK script. To see suggestions for some more general workarounds, you may also consult the section Generating VNL files from parallel runs in the guide Tips.
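The same guard can be wrapped around ordinary Python file output. The sketch below is illustrative only: write_results is a hypothetical helper, the file name and contents are placeholders, and the fallback definition of processIsMaster() is an assumption that lets the example run outside ATK by mimicking serial behavior, where the function always returns True.

```python
import os
import tempfile

try:
    # Available inside ATK, as described above.
    from ATK.MPI import processIsMaster
except ImportError:
    # Assumption for running this sketch outside ATK: behave like a serial
    # run, where processIsMaster() always returns True.
    def processIsMaster():
        return True


def write_results(path, lines):
    """Write lines to path on the master process only.

    Returns True if this process wrote the file, False otherwise.
    """
    if not processIsMaster():
        return False
    with open(path, "w") as handle:
        for line in lines:
            handle.write(line + "\n")
    return True


if __name__ == "__main__":
    # The file name and contents are placeholders for illustration.
    target = os.path.join(tempfile.gettempdir(), "results.txt")
    if write_results(target, ["# placeholder header", "placeholder line"]):
        print("wrote " + target)
```

Keeping the check inside the helper means the calling script does not need to repeat the if-statement around every write.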
In the current version of ATK (2008.10), the Intel Math Kernel Library (MKL) is used in many time-consuming parts of ATK. MKL supports multi-threading for multi-core CPUs.
As described in this chapter, the primary parallelization in ATK is done using the MPI specification (Message Passing Interface). This implies that the most efficient parallelization will be to use all available CPUs for MPI parallelization. In a non-MPI environment, for example on a Windows platform, threading can nevertheless speed up calculations.
Warning: Using multi-threading and MPI parallelization simultaneously can have a negative impact on the parallel speedup. For a description of how to control multi-threading in the MKL library, see below.
This section describes how to enable and disable multi-threading in the MKL library.
The WWW page Using
Intel® MKL Parallelism provides an overview of using multi-threading in MKL.
The MKL library uses OpenMP* threading software, which is controlled by the
environment variables OMP_NUM_THREADS and OMP_DYNAMIC,
which determine the number of threads used. MKL also defines its own set of environment
variables, MKL_DYNAMIC and MKL_NUM_THREADS. The Intel
MKL variables are always inspected first. Below we give a short summary of
these environment variables:
MKL_DYNAMIC
If this variable is set to false or
FALSE, MKL will try to use threading with the number of threads specified
by MKL_NUM_THREADS. If the user does not
ask for a specific number of threads, MKL will use the maximum number of
threads appropriate for the hardware. If this variable is unset or set to
anything other than the above values, MKL will dynamically determine the number
of threads used.
MKL_NUM_THREADS
If this variable is not set, MKL will use a value for its internal maximum
number of threads deduced from the hardware. If the variable is set to a
positive number, MKL will use the value as the maximum number of threads to use
according to the value of MKL_DYNAMIC.
An example of controlling the number of threads on Linux is shown below, using the bash shell:
MKL_NUM_THREADS=2 MKL_DYNAMIC=FALSE $ATK_BIN_DIR/atk [args...]
This example will run ATK threaded using two CPUs.
For Windows, the above example becomes
set MKL_NUM_THREADS=2
set MKL_DYNAMIC=FALSE
%ATK_BIN_DIR%\atk [args...]
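If ATK is launched from a Python wrapper script instead of the shell, the same two variables can be set in the environment passed to the child process, which works identically on Linux and Windows. This is a minimal sketch; mkl_environment is a hypothetical helper and the paths in the commented launch line are placeholders.

```python
import os


def mkl_environment(num_threads, base_env=None):
    """Return a copy of the environment with fixed MKL threading settings."""
    if num_threads < 1:
        raise ValueError("num_threads must be a positive integer")
    env = dict(os.environ if base_env is None else base_env)
    env["MKL_NUM_THREADS"] = str(num_threads)
    # FALSE asks MKL to try the requested thread count instead of
    # choosing the number of threads dynamically.
    env["MKL_DYNAMIC"] = "FALSE"
    return env


if __name__ == "__main__":
    env = mkl_environment(2)
    print("MKL_NUM_THREADS=%s MKL_DYNAMIC=%s"
          % (env["MKL_NUM_THREADS"], env["MKL_DYNAMIC"]))
    # To launch ATK with these settings (paths are placeholders):
    # import subprocess
    # subprocess.call(["/path/to/atk", "/path/to/script.py"], env=env)
```

The function copies the environment rather than modifying os.environ, so the settings apply only to the process that receives the returned dictionary.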
The calculations that can be done with ATK are not all equally computationally demanding. Our efforts for improving the performance through parallelization have therefore naturally been focused on the parts that are most time-consuming. In particular, the following two main places in the formalism are parallelized: the calculation of matrix elements, and the evaluation of sums.
The parallelization of the matrix element calculation is simple and obvious; each matrix element in the Hamiltonian or overlap matrices is independent of the other elements, so with N processors we can evaluate N matrix elements in the same time as a single one on a serial computer. Thus one should see a linear speed-up on this part of the code, as each processor is given the task of evaluating a single matrix element. The accuracy of the matrix elements is mainly determined by the mesh cut-off, and this cannot really be influenced by parallelization.
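To illustrate why independent matrix elements parallelize so easily, the sketch below deals the element indices out to processes in round-robin order. The round-robin scheme and the function name assign_elements are assumptions chosen for illustration; this guide does not state how ATK actually distributes the elements.

```python
def assign_elements(num_rows, num_cols, num_processes):
    """Deal matrix-element indices (i, j) out to processes round-robin."""
    assignment = [[] for _ in range(num_processes)]
    for flat_index in range(num_rows * num_cols):
        i, j = divmod(flat_index, num_cols)
        assignment[flat_index % num_processes].append((i, j))
    return assignment


if __name__ == "__main__":
    work = assign_elements(4, 4, 3)
    for rank, elements in enumerate(work):
        print("process %d evaluates %d elements" % (rank, len(elements)))
```

Because no element depends on another, each process can work through its own list without waiting for the others, and the workload differs by at most one element between processes.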
The evaluation of sums may sound like a small task, but in fact it represents some of the most important bottle-necks in the code. The point is that the terms in the sum are often very expensive to evaluate, and at the same time there are many of them, in particular if one wants to obtain good accuracy in the results. Typically, each term involves an entire matrix diagonalization, which is a time-consuming task.
The following functions in ATK will benefit from being run in parallel:
ATK is specifically parallelized in terms of sums over
k-point sampling, both for the self-consistent calculation and the transmission spectrum, and
energy sampling in the transmission spectrum.
In all these cases there is no requirement for cross-communication between the processors. Also, the packages sent between the master and slave nodes are small, so there are no bandwidth bottle-necks or any complicated issues related to the way the parallel network is put together.
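The pattern described above, where each process evaluates its own terms and only a small result travels back to the master, can be emulated in plain Python. This is a sketch under stated assumptions: term() is a cheap stand-in for an expensive per-k-point evaluation, the strided split of the k-points is an illustrative choice, and no real MPI calls are made.

```python
import math


def term(k):
    """Stand-in for an expensive per-k-point term (e.g. a diagonalization)."""
    return math.cos(k) ** 2


def partial_sum(k_points, rank, num_processes):
    """Sum the terms owned by one process (every num_processes-th point)."""
    return sum(term(k) for k in k_points[rank::num_processes])


def parallel_sum(k_points, num_processes):
    """Emulate the master adding up one partial sum per process."""
    return sum(partial_sum(k_points, rank, num_processes)
               for rank in range(num_processes))


if __name__ == "__main__":
    k_points = [2.0 * math.pi * n / 64 for n in range(64)]
    serial = sum(term(k) for k in k_points)
    # The parallel decomposition reproduces the serial sum.
    print(abs(parallel_sum(k_points, 4) - serial) < 1e-9)
```

Each emulated process returns a single number, which mirrors the small message sizes mentioned above.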
Figure 67 shows some benchmark tests, which demonstrate that for certain systems we are in fact very close to ideal linear scaling with the number of CPUs. This occurs precisely in those systems where we expect to see improved performance due to the parallelization, namely two-probe and electrode calculations using a lot of k-points.
On the other hand, for molecules there is very little performance gain, again as expected, since no particular measures have been taken to improve the calculation speed through parallelization on such systems.
Although the electrode calculation is closest to linear scaling (which is to be expected, since the performance crucially depends on the k-point sampling), one should note that the two-probe calculation is the most time-consuming part of any study, and so this is where good parallel capabilities are most crucial. Fortunately, the scaling for the Fe-MgO two-probe systems (this is a very demanding calculation, with 40,000 k-points!) is also very close to linear.
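The contrast between k-point-heavy systems and molecules can be made plausible with Amdahl's law, a standard textbook model that is not taken from this guide: if only a fraction p of the run time is parallelized, the speed-up on N processors is at most 1 / ((1 - p) + p / N). The fractions used below are illustrative assumptions, not ATK measurements.

```python
def amdahl_speedup(parallel_fraction, num_processes):
    """Ideal speed-up when only parallel_fraction of the run time scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie between 0 and 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / num_processes)


if __name__ == "__main__":
    # The fractions below are illustrative assumptions, not ATK measurements.
    cases = (("k-point dominated", 0.99), ("mostly serial", 0.10))
    for label, fraction in cases:
        for n in (2, 8, 32):
            print("%-18s N=%2d speed-up %.2f"
                  % (label, n, amdahl_speedup(fraction, n)))
```

In this model a calculation dominated by parallelized sums keeps gaining from additional processors, whereas a mostly serial one saturates almost immediately.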
Using the parallel version of ATK should yield a significant performance improvement for bulk systems and two-probe systems, due to the parallelization of the matrix element calculation, the k-point sampling, and the energy sums.
Running ATK in parallel is not more complicated than running it in serial, provided the parallel infrastructure is set up properly, which includes having the proper licenses installed.
If you have any further questions, please do not hesitate to contact QuantumWise support.