Computing for Science

The Computing for Science (CS) group supports ILL scientists, students and visitors in a number of activities including data analysis, instrument simulation and sample simulation.

Cluster

Computational cluster

A cluster here is a set of computers of similar architecture which share common management and configuration and usually share the users' files transparently. The systems have good communications with one another, and a job can be launched either on one processor or as multiple copies on several processors, allowing part of the calculation to be performed in parallel.

Computationally intensive tasks interfere with general-purpose time-sharing activities, and one purpose of offering processors on a cluster is to dedicate them fully to the task in hand.

To use the parallel computing possibilities it is necessary either to use a language in which parallelism is a built-in feature, or to analyse the problem and use library functions to distribute data and program over selected processors. Groups who develop such programs typically optimise them for their own systems. These notes may help with installing such programs for use at the ILL.


SI and CS jointly operate a computational cluster of 45 nodes running SLES 10.1. This machine is available to everyone for computationally intensive work, i.e. jobs that run for hours, days or months.
If you would like to start using the cluster, please send Mark Johnson an e-mail or come and see him, as CS would like to know what the cluster is being used for. You should then also subscribe to the cluster_l mailing list.

An older cluster, 'brick' (8 nodes), is used mainly for Materials Studio.

SI staff involved in configuring the cluster: Remi Mudingay, Stephane Armanet, etc.

Good Practice Guidelines

Please remember the following when using the cluster:

  1. Check for free CPUs in the web display (Intranet access only):

    http://helpdesk.ill.fr/clusterscpu.html

  2. Once you have chosen a machine and logged on, check again with the "top" command that CPUs are free.
  3. Do not run more than one process per core/processor.
  4. Leave the big memory nodes for people who need them if other nodes are available.
  5. If you have trouble finding enough CPUs, please tell Mark Johnson.
  6. Address any other questions or queries to CS and/or SI.
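
Guidelines 2 and 3 can be approximated in a script. The following is a minimal sketch, assuming a Unix-like node with Python available. The function name free_cores is hypothetical, and the 1-minute load average is only a rough proxy for what "top" shows, so it does not replace a look at "top" itself.

```python
import math
import os

def free_cores(load_1min, n_cores):
    """Estimate idle cores: total cores minus the load average rounded up."""
    busy = math.ceil(load_1min)
    return max(n_cores - busy, 0)

if __name__ == "__main__":
    n_cores = os.cpu_count() or 1          # cores visible on this node
    load_1min = os.getloadavg()[0]         # 1-minute load average (Unix only)
    print(f"{free_cores(load_1min, n_cores)} of {n_cores} cores look free")
```

Starting no more processes than the number reported keeps a job within the one-process-per-core rule.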

Main cluster 'NodeX'

Nodes

The nodes have been renamed "nodeX", where X runs from 1 to 45, as follows:

1-10: IBM Opterons (2 procs/node, formerly the AMD cluster)
11-19: IBM Pentiums (2 procs/node) --- NOT YET AVAILABLE (4/4/09)
20-39: Fujitsu-Siemens Opterons (4 procs-cores/node)
40-45: Fujitsu-Siemens Opteron blades (16 procs-cores/node)

Memory

  • 1 GB RAM per processor or core: on most machines
  • 2 GB/core: on 4 of the machines in nodes 20-39, i.e. 8 GB per machine
  • 2 GB/core: nodes 43-45, i.e. 32 GB per machine
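
The per-machine figures follow from cores per node multiplied by memory per core. A short sketch of that arithmetic is given here. The table CORES_PER_NODE and both function names are hypothetical helpers; the node ranges and core counts are copied from the Nodes list above.

```python
# Node ranges and core counts copied from the "Nodes" list.
CORES_PER_NODE = [
    (range(1, 11), 2),    # nodes 1-10
    (range(11, 20), 2),   # nodes 11-19
    (range(20, 40), 4),   # nodes 20-39
    (range(40, 46), 16),  # nodes 40-45
]

def cores_on(node):
    """Return the number of processors/cores on nodeX for X = node."""
    for nodes, cores in CORES_PER_NODE:
        if node in nodes:
            return cores
    raise ValueError(f"no such node: {node}")

def memory_gb(node, gb_per_core):
    """Total RAM on a node, given its memory per core in GB."""
    return cores_on(node) * gb_per_core

print(memory_gb(25, 2))   # a 2 GB/core machine among nodes 20-39
print(memory_gb(44, 2))   # one of nodes 43-45
```

The two calls print 8 and 32, matching the per-machine figures in the list.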

Software

CS will install the Materials Studio server software on some nodes (information to follow).

Queuing

SI is currently looking into queuing systems that will make the cluster easier to use, since we now have a largish number of nodes. A note on using the cluster will be made available on the CS web pages shortly.


Other cluster 'Brick'

CS and SI also run a second, older cluster.

Nodes

1-8: Pentium Xeon 2 GHz (2 procs/node), 1-2 GB memory

Software

Materials Studio


Programming tips

High Performance Fortran, HPF

The aim is to produce a program that follows the SPMD (single program, multiple data) model. Each processor loads the same program but then operates on a distinct subset of the program's data, each using a local portion of distributed arrays. Special directives are added to this variant of the Fortran standard to control data distribution and parallel program flow. The Portland Group HPF is available on brick.
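
To make the SPMD idea concrete, here is a minimal sketch written in Python rather than HPF. It imitates a block-style distribution, in which each of nprocs processors owns one contiguous slice of an array. The names block_bounds and spmd_sum are hypothetical, and the loop over ranks merely simulates in a single process what the processors would do concurrently.

```python
def block_bounds(n, nprocs, rank):
    """Index range [start, stop) of an n-element array owned by `rank`."""
    block = -(-n // nprocs)                 # ceiling of n / nprocs
    start = min(rank * block, n)
    stop = min(start + block, n)
    return start, stop

def spmd_sum(data, nprocs):
    """Every 'processor' runs the same code on its own local portion."""
    partials = []
    for rank in range(nprocs):              # stands in for parallel execution
        start, stop = block_bounds(len(data), nprocs, rank)
        partials.append(sum(data[start:stop]))   # local work only
    return sum(partials)                    # combine the partial results

data = list(range(10))
print([block_bounds(len(data), 4, r) for r in range(4)])
print(spmd_sum(data, 4))
```

With 10 elements on 4 processors the last processor owns a shorter slice, which is the usual consequence of rounding the block size up.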

Message Passing Interface, MPI

This comprises utilities and a library of functions callable from C, Fortran, etc., which allow tasks to be copied to other nodes, portions of data to be transferred, and results to be reassembled after processing. This standard call interface allows the implementation of transfers and synchronisation to be left either to open-source developers or to proprietary manufacturers (who often use concepts from the first group but optimise performance for their own hardware). Because the calls are standardised, it is possible to build variants using any of the MPI implementations and select the best performer for routine use. In each case there are a number of options on buffer sizes and transfer methods to explore.
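
The pattern of distributing data, processing it and reassembling the results can be sketched as follows. This is an illustration only: it assumes mpi4py, a Python binding to MPI that these notes do not otherwise mention, and falls back to a single process if mpi4py cannot be imported. The function names work and run are hypothetical.

```python
# Assumption: mpi4py (a Python binding to MPI) may or may not be installed.
try:
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
except ImportError:
    comm, rank, size = None, 0, 1           # fall back to a single process

def work(chunk):
    """The per-process computation: here, a sum of squares."""
    return sum(x * x for x in chunk)

def run(data):
    """Scatter data from rank 0, compute locally, gather results on rank 0."""
    # Rank 0 deals the data out in `size` interleaved portions.
    chunks = [data[i::size] for i in range(size)] if rank == 0 else None
    if comm is None:
        partials = [work(c) for c in chunks]
    else:
        chunk = comm.scatter(chunks, root=0)          # portions of data transferred
        partials = comm.gather(work(chunk), root=0)   # results reassembled
    return sum(partials) if rank == 0 else None

if __name__ == "__main__":
    total = run(list(range(1000)))
    if rank == 0:
        print("sum of squares:", total)
```

Under an MPI installation such a script would typically be started through mpirun, for example mpirun -np 4 python script.py, where script.py is a hypothetical file name.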

MPI-CH

This is used where there are fast communications between cluster members, which must all share the same architecture.

LAM-MPI

This typically uses Ethernet for communications, and can even convert data formats on transfer in heterogeneous clusters. Clearly the task has to be rebuilt for each type of machine.

OpenMPI

A more recent implementation of the MPI-2 standard.

Building Programs with MPI

In addition to the library of organising routines, there are several tools designed to simplify development.
mpirun          launches an MPI task and distributes it amongst the available nodes (DEC/Compaq: dmpirun)
mpiclean        resets the nodes should mpirun fail and leave them in an indeterminate state
mpif77, mpicc   traditionally these expand to the commands necessary to build the task, identifying the libraries etc. This yields a standard method for building tasks and allows makefiles to be independent of the transport method.
A fuller description of these tools can be found in the man pages. Useful features include switches that perform a dry run, showing the nodes to be used and/or the files and libraries to be loaded without actually doing any processing. Variants of each of these utilities appropriate to the communication method are kept in, e.g.:
    LAM-MPI  /usr/lib/lam-6.5.4/bin
    MPI-CH  /usr/lib/mpich-1.2.2/bin

Note
To show the path of the default program installed, use the command:

$ which mpif77

On brick this is /usr/lib/lam-6.5.4/bin/mpif77, showing that LAM-MPI is the default MPI implementation. To change this to MPI-CH, modify your .cshrc file so that /usr/lib/mpich-1.2.2/bin appears in your PATH environment variable ahead of the LAM-MPI entry. The .cshrc file is read again on each node as it starts your task.
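
The effect of PATH ordering can be checked without touching a real installation. The sketch below is only an illustration, assuming a Unix-like system. It creates two temporary directories standing in for the LAM-MPI and MPI-CH bin directories (the directory names lam and mpich and the helper names are made up), puts a dummy mpif77 in each, and uses Python's shutil.which to show that the first matching directory wins.

```python
import os
import shutil
import stat
import tempfile

def make_fake_tool(directory, name="mpif77"):
    """Create a dummy executable file called `name` in `directory`."""
    path = os.path.join(directory, name)
    with open(path, "w") as handle:
        handle.write("#!/bin/sh\n")
    os.chmod(path, stat.S_IRWXU)            # owner may read, write, execute
    return path

def resolve(name, directories):
    """Look up `name` by searching the directories in the order given."""
    return shutil.which(name, path=os.pathsep.join(directories))

with tempfile.TemporaryDirectory() as root:
    lam = os.path.join(root, "lam")         # stand-in for the LAM-MPI bin dir
    mpich = os.path.join(root, "mpich")     # stand-in for the MPI-CH bin dir
    for directory in (lam, mpich):
        os.mkdir(directory)
        make_fake_tool(directory)
    lam_first = resolve("mpif77", [lam, mpich])
    mpich_first = resolve("mpif77", [mpich, lam])
    print(os.path.basename(os.path.dirname(lam_first)))
    print(os.path.basename(os.path.dirname(mpich_first)))
```

The first line printed names the lam directory and the second the mpich directory, mirroring what "which mpif77" reports before and after the PATH is reordered.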

In general, MPI libraries and utilities are built using the GNU configure script. If the default installation does not match the imported source code, it may be necessary to build a private library appropriate to that task, specifying installation in the builder's area using ./configure -prefix ... If the new mpif77 is then used, the correct libraries will be loaded. (It is probably simplest to modify the PATH for this.) Such programs should be linked statically (linker switch -Bstatic) so that copies of all libraries are included at link time. This avoids conflicts which might arise when default shareable libraries are sought on the target nodes as the program is run by mpirun or Alinka.

The Portland compiler package includes utilities for profiling programs to identify compute-intensive code, and also debuggers for analysing problems in the execution of parallel routines.