11.2. Introduction to High Performance Computing (HPC)

The advances in technology have resulted in fast, low-cost and highly efficient processors and networks, which have brought about a change in the cost/performance ratio in favour of using interconnected processing systems in a single high-speed processor. This type of architecture can be classified into two basic configurations:

The first example is known as a parallel processing system and the second as a distributed computing system. In the latter case, we can say that a distributed system is a set of processors interconnected on a network in which each processor has its own resources (memory and peripherals) and they communicate by exchanging messages on the network.

Computing systems are a relatively recent phenomenon (we could say that computing history started in the seventies). Initially, they consisted of large, heavy, expensive systems, which could only be used by a few experts and they were inaccessible and slow. In the seventies, advances in technology led to some substantial improvements carried out using interactive jobs, time sharing and terminals and the sizes of the computers were reduced considerably. The eighties were characterised by a significant improvement in the performance and efficiency (which has continued to today) and a dramatic reduction in the sizes, with the creation of microcomputers. Computing continued to develop through workstations and advances in networking (from 10 Mbits/s LANs and 56 Kbytes/s WANs in 1973 to today's 1Gbit/s LANs and WANs with asynchronous transfer mode (ATM) and 1.2 Gbits/s), which is a fundamental factor in current multimedia applications and those that will be developed in the near future. Distributed systems, for their part, originated in the seventies (systems with 4 or 8 computers), but really became widespread in the nineties.

Although administrating/installing/maintaining distributed systems is a complicated task, given that they continue to grow, the basic reasons for their popularity are the increase in performance and efficiency that they provide in inherently distributed applications (due to their nature), the information that can be shared by a group of users, the sharing of resources, the high fault tolerance and the possibility of ongoing expansion (the ability to add more nodes to gradually and continuously increase the performance and efficiency).

In the following sections we will look at some of the most common parallel/distributed processing systems, as well as the programming models used to generate code that can use these features.

11.2.1. Beowulf

Beowulf [Rad, Beo] is a multi-computer architecture that can be used for parallel/distributed applications (APD). The system basically consists of a server and one or more clients connected (generally) through Ethernet, without using any specific hardware. To explore this processing capacity, it is necessary for the programmers to have a distributed programming model that, whilst it is true that it is possible to do this through UNIX (socket, rpc), may require a very significant effort, given that the programming models are at the level of systems calls and C language, for example; but this working method can be considered as low-level.

The software layer provided by systems such as parallel virtual machine (PVM) and message passing interface (MPI) facilitates significantly the abstraction of the system and makes it possible to program parallel/distributed applications easily and simply. The basic working form is master-workers, in which there is a server that distributes the task that the workers perform. In large systems (systems with 1,024 nodes), there is more than one master and nodes dedicated to special tasks such as, for example, in/out or monitoring.

Example 11-2. Note

Various options:

  • Beowulf

  • OpenMosix

  • Grid (Globus)

One of the main differences between Beowulf and a cluster of workstations (COW) is that Beowulf is 'seen' as a single machine in which the nodes are accessed remotely, as they do not have a terminal (or a keyboard), whereas a COW is a group of computers that can be used by both the COW users and other users interactively through the screen and keyboard. We must remember that Beowulf is not software that transforms the user's code into distributed code or that affects the kernel of the operating system (like Mosix, for example). It is simply a way of creating a cluster of machines that execute GNU/Linux and act as a supercomputer. Obviously, there are many tools that make it possible to achieve an easier configuration, library or modification to the kernel for obtaining better performance levels, but it is also possible to build a Beowulf cluster from a GNU/Linux standard and conventional software. The construction of a Beowulf cluster with two nodes, for example, can be achieved simply with the two machines connected through Ethernet using a hub, a standard GNU/ Linux distribution (Debian) and the network file system (NFS) and after enabling the network services such as rsh or ssh. In such a situation, we might argue that we have a simple two node cluster. How do we configure the nodes?

First, we must modify (each node) /etc/hosts so that the localhost line only has and does not include any machine name, such as: localhost

And add the IPs of the nodes (and for all the nodes), for example: pirulo1

It is possible to create a user (nteum) in all the nodes, create a group and add this user to the group:

groupadd beowulf adduser nteum beowulf echo umask 007 >> /home/nteum/.bash_profile

In this way, any file created by the nteum user or any within the group can be modified by the Beowulf cluster.

We must create an NFS server (and the rest of the nodes will be clients of this NFS). On the server, we create a directory as follows:

mkdir /mnt/nteum chmod 770 /mnt/nteum chown -R nteum:beowulf /mnt/nteum

Now we can export this directory from the server.
cd /etc
cat >> exports
/mnt/wolf (rw) <control d>

We must remember that our network will be 192.168.0.xxx and it is a private network, in other words, the cluster will not be seen from the Internet and we must adjust the configurations so that all the nodes can see each other (from the firewalls).

We should verify that the services are working:

chkconfig -add sshd chkconfig -add nfs chkconfig -add rexec chkconfig -add rlogin chkconfig -level 3 rsh on chkconfig -level 3 nfs on chkconfig -level 3 rexec on chkconfig -level 3 rlogin on

To work securely, it is important to work with ssh instead of rsh, which means that we must generate the keys for interconnecting the machines-nteum user securely, without a password. To do this, we modify (we remove the comment #) the following lines in /etc/ssh/sshd_config:

RSAAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys

We reboot the machine and we connect as the nteum user, given that this user will operate the cluster. To generate keys:

ssh-keygen -b 1024 -f ~/.ssh/id_rsa -t rsa -N ""

The id_rsa and id_rsa.pub files will have been created in the /home/nteum/.ssh library directory and we must copy id_rsa.pub in a file called authorized_keys in the same directory. And we modify the permissions with chmod 644 ~/.ssh/aut* and chmod 755 ~/.ssh.

Given that only the main node will be connected to the others (and not the other way round) we only need to copy the public key (d_rsa.pub) to each node in the directory/file /home/nteum/.ssh/authorized_keys of each node. In addition, on each node, we will have to mount the NFS adding /etc/fstab the line pirulo1:/mnt/nteum /mnt/nteum nfs rw,hard,intr 0 0.

As of this point, we already have a Beowulf cluster for executing applications that could be PVM or MPI (we will see this in the following sections). Over FC, there is an application (system-config-cluster) that makes it possible to configure a cluster based on a graphic tool. For more information, please see: http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Cluster_Administration/index.html. Benefits of distributed computing

What are the benefits of parallel computing? We will see these with an example [Rad]. We have a program for adding numbers (for example, 4 + 5 + 6...) called sumdis.c and written in C:

#include <stdio.h> 

		int main (int argc, char** argv){ 

float initial, final, result, tmp; 

if (argc < 2) {
		printf ("Use: %s N. initial N. final\n",argv[0]); 
else { 
		initial = atol (argv[1]); 
		final = atol (argv[2]); 
		result = 0.0; 
		for (tmp = inicial; tmp <= final; tmp++){ 
		result + = tmp; }
printf("%f\n", result) 
return 0; 

We compile it with gcc -o sumdis sumdis.c and if we look at the execution of this program, with, for example:

time ./sumdis 1 1000000 (from 1 to 106)

we can see that the time in a Debian 2.4.18 machine with AMD Athlon 1.400 MHz 256 Mb RAM is (approximately) real = 0,013 and user = 0,010 in other words, 13 ms in total and 10 ms inuser zone. If, however, we enter:

time ./sum 1 16000000 (from 1 to 16 * 106)

the time will be real = 182, in other words, 14 times more, which means, if we consider 160.000.000 (160*106), the time will be approximately dozens of minutes.

The idea of distributed computing is: if we have a cluster of 4 machines (node1-node4) with a server, where the file is shared by NFS, it would be interesting to divide the execution through rsh (not advisable, but it is acceptable for this example), so that the first adds from 1 to 40.000.000, the second from 40.000.001 to 80.000.000, the third from 80.000.001 to 120.000.000 and the fourth from 120.000.001 to 160.000.000. The following commands show one possibility. We consider that the system has the directory /home shared by NFS and that the user (nteum) who will execute the script has configured .rhosts adequately so that it is possible to access their account without the password. In addition, if tcpd has been activated in /etc/inetd.conf in the rsh line, there must be the corresponding file in /etc/hosts.allow, which would allow us to access the four machines in the cluster:

The shell script distr.sh can be something like:

rsh node1 /home/nteum/sumdis 1 40000000 > /home/nteum/out < /dev/null & rsh node2 /home/nteum/sumdis 40000001 80000000 > /home/nteum/out < /dev/null & rsh node3 /home/nteum/sumdis 80000001 120000000 > /home/nteum/out < /dev/null & rsh node4 /home/nteum/sumdis 120000001 160000000 > /home/nteum/out < /dev/null &

We can observe that the time is significantly reduced (by a factor of approximately 4) and not exactly lineally, but almost. Obviously, this example is very simple and is only used for demonstrative purposes. The programmers use libraries that allow them to set the execution time, the creation and communication of processes in a distributed system (such as PVM and MPI).

11.2.2. How should we program to take advantage of concurrent computing?

There are various ways of expressing the concurrency in a program. The most common two are:

1) Using threads (or processes).

2) Using processes in different processors that communicate through messages (MPS, message passing system).

Both methods can be implemented on different hardware configurations (share memory or messages) but MPS systems can involve latency and speed problems with the messages on the network, which can be a negative factor. However, with the advances in network technology, these systems have grown in popularity (and in number). A message is extremely simple:

send(destination, msg)
recv(origin, msg)

The most common APIs today are PVM and MPI and, in addition, they do not limit the possibility of using threads (even if it is at a local level) or of having concurrent processing and in/out. On the other hand, on a machine with shared memory (SHM) it is only possible to use threads and there is the severe problem of scalability, given that all the processors use the same memory and the number of processors in the system is limited by the memory's bandwidth.

To summarise, we can conclude that:

1) Proliferation of multitask (multi-user) machines connected through a network with distributed services (NFS and NIS YP).

2) They are heterogeneous systems with networked operating systems (NOS) that offer a series of distributed and remote services.

3) Distributed applications can be programmed at different levels:

a) Using a client-server model and programming at low-level (sockets).

b) The same model but with a "high-level" API (PVM, MPI).

c) Using other programming models such as programming oriented to distributed objects (RMI, CORBA, Agents...). Parallel virtual machine (PVM)

PVM [Proe] is an API that makes it possible to generate, from the perspective of the application, a dynamic cluster of computers, which constitutes a virtual machine (VM). The tasks can be created dynamically (spawned) and/or eliminated (killed) and any PVM task can send a message to another. There is no limit to the size or number of messages (according to the specifications, although there may be hardware/operating system combinations that result in limitations on message size) and the model supports fault tolerance, resource control, processes control, heterogeneity in the networks and in the hosts.

The system (VM) has tools for controlling the resources (adding or deleting hosts from the virtual machine), processes control (dynamic creation/elimination of processes), different communication models (blocking send, blocking/nonblocking receive, multicast), dynamic task groups (a task can be attached or removed from a group dynamically) and fault tolerance (the VM detects the fault and it can be reconfigured).

The PVM structure is based, on the one hand, on the daemon (pvm3d) that resides in each machine and is interconnected using UDP, and, on the other hand, the PVM library (libpvm3.a), which contains all the routines for sending/receiving messages, creating/eliminating processes, groups, synchronisation etc. which will use the distributed application.

PVM has a console (pvm) that makes it possible to start up the daemon, create the VM, execute applications etc. It is advisable to install the software from the distribution, given that the compilation requires a certain amount of 'dedication'. To install PVM on Debian, for example, we must include two packages (minimum): pvm and pvm-dev (the pvm console and utilities are in the first and the libraries, header and the rest of the compiling tools are in the second). If we only need the library because we already have the application, we can install only the libpvm3 package).

To create a parallel/distributed application in PVM, we can start with the standard version or look at the physical structure of the problem and determine which parts can be concurrent (independent). The concurrent parts will be candidates for being rewritten as parallel code. In addition, we must consider whether it is possible to replace the algebraic functions with their paralleled versions (for example, ScaLapack, Scalable Linear Algebra Package, available in Debian as scalapack-pvm | mpich-test | dev, scalapack1-pvm | mpich depending on whether it is PVM or MPI). It is also convenient to find out whether there is any similar parallel application (http://www.epm.ornl.gov/pvm) that might guide us as to the construction method of the parallel application.

Parallelising a program is not an easy task, as we have to take into account Amdahl's law.


Amdahl's law states that speedup is limited by the fraction of code (f) that can be paralleled: speedup = 1/(1-f).

This law implies that a sequential application f = 0 and the speedup = 1, with all the parallel code f = 1 and speedup= infinite (!), with possible values, 90% of the parallel code means a speedup = 10 but with f = 0.99, speedup = 100. This limitation can be avoided with scalable algorithms and different application models:

Example 11-3. Note

Amdahl's law

speedup = 1/(1-f)

f is the fraction of parallel code

1) Master-worker: the master starts up all the workers and coordinates the work and in/out.

2) Single process multiple data (SPMD): the same program that executes with different sets of data.

3) Functional: various programs that perform a different function in the application.

With the pvm console and with the add command we can configure the VM whilst adding all the nodes. In each of these nodes, there must be the directory ~/pvm3/bin/LINUX, with the binaries of the application. The variables PVM_ROOT = Directory must be declared, where the lib/LINUX/libpvm3.a is and PVM_ARCH=LINUX, which can be placed, for example, in file /.cshrc. The default shell of the user (generally a NIS user or, if not, the same user must be in each machine with the same password) should be csh (if we use rsh as a means of remote execution) and the file /.rhosts must be configured to provide access to each node without the password. The PVM package incorporates an rsh-pvm that can be found in /usr/lib/pvm3/bin as an rsh specifically made for PVM (see the documentation), as there are some distributions that do not include it, for security reasons. It is advisable to configure, as we have shown, the ssh with the public keys of the server in .ssh/authorized_keys of the directory of each user.

As an example of PVM programming, we show a program of the server-client type, where the server creates the child nodes, sends the data, these nodes circulate the data a determined number of times between the child nodes (the first node on the left receives a piece of data, processes it and sends it to the one on the right), whilst the parent nodes waits for each child node to finish.

Example 11-4. Example of PVM: master.c

To compile in Debian:

gcc -O -I/usr/share/pvm3/include/ -L/usr/share/pvm3/lib/LINUX -o master master.c -lpvm3

The directories in -I and in -L must be where the includes pvm3.h and libpvm* are located, respectively.


1) execute the pvmd daemon with pvm

2) execute add to add the nodes (this command can be skipped if we only have one node

3) execute quit (we leave pvm but it continues to execute)

4) we execute master

Example 11-5. Note

Compiling PVM:

gcc -O -I/usr/include/ -o

output output.c -lpvm3

#include <stdio.h> 
#include "pvm3.h"
#define SLAVENAME "/home/nteum/pvm3/client"
main() {
 int mytid, tids[20], n, nproc, numt, i, who, msgtype, loops; 
 float data[10]; int n_times; 
 if( pvm_parent() ==PvmNoParent ){
      /*Return if this is the parent or child process */ 
      loops = 1; 
      printf("\n How many children (120)? ");
      scanf("%d", &nproc); 
      printf("\n How many child-child communication loops (1 - 5000)? ");
      scanf("%d", &loops); }

 /*Redirects the in/out of the children to the parent */ 		 

 /*Creates the children */ 
 numt = pvm_spawn(SLAVENAME, (char**)0, 0, "", nproc, tids);
 /*Starts up a new process, 1st: executable child, 2nd: argv, 3rd :options, 
 4th :where, 5th :N. copies, 6th :matrix of id*/
 printf("Result of Spawn: %d \n", numt);

 /*Has it managed?*/ 
 if( numt &lt; nproc ){
  Printf("Error creating the children. Error code:\n"); 
  for( i = numt ; i<nproc ; i++ ) {
    printf("Tid %d %d\n",i,tids[i]); } 
  for( i = 0 ; i<numt ; i++ ){ 
  pvm_kill( tids[i] ); } 	/*Kill the processes with id in tids*/ 
  exit(); /*Finish*/

 /*Start up parent program, initialising the data */ 
 n = 10; 
 for( i = 0 ; i<n ; i++ ){
      data[i] = 2.0;}
 /*Broadcast with initial data to slaves*/
 /*Delete the buffer and specify message encoding*/ 
 pvm_pkint(&loops, 1, 1);
 /*Package data in the buffer, 2nd N., 3*:stride*/ 
 pvm_pkint(&nproc, 1, 1); 
 pvm_pkint(tids, nproc, 1); 
 pvm_pkint(&n, 1, 1); 
 pvm_pkfloat(data, n, 1); 
 pvm_mcast(tids, nproc, 0);
 /*Multicast in the buffer to the tids and wait for the result from the children*/
 msgtype = 5; 
 for( i = 0 ; i < nproc ; i++ ){
      pvm_recv( -1, msgtype ); 
      /*Receive a message, -1 :of any, 2nd:tag of msg*/ 
      pvm_upkint( &who, 1, 1 ); 
      printf("Finished %d\n",who); 

Example 11-6. Example of PVM: client.c

To compile in Debian:

gcc -O -I/usr/share/pvm3/include/ -L/usr/share/pvm3/lib/LINUX -or client client.c -lpvm3

The directories in -I and in -L must be where the included pvm3.h and libpvm* are located, respectively.


This is not necessary as the master will start them up, but the client must be in /home/nteum/pvm3

#include <stdio.h> 
#include "pvm3.h"

main() { 
	 int mytid;	/*Mi task id*/ 
	 int tids[20];	/*Task ids*/ 
	 int n, me, i, nproc, master, msgtype, loops; float data[10]; 
	 long result[4]; float work();
	 mytid = pvm_mytid(); msgtype = 0; 
	 pvm_recv( -1, msgtype ); 
	 pvm_upkint(&loops, 1, 1); 
	 pvm_upkint(&nproc, 1, 1); 
	 pvm_upkint(tids, nproc, 1); 
	 pvm_upkint(&n, 1, 1); 
	 pvm_upkfloat(data, n, 1);
	 /*Determines which child it is (0 -- nproc-1) */ 
	 for( i = 0; i < nproc ; i++ )
	 if( mytid == tids[i] ){ me = i; break; } 
	 /*Processes and passes the data between neighbours*/
	 work (me, data, tids, nproc, loops);
	 /*Send the data to the master */ 
	 pvm_initsend( PvmDataDefault ); 
	 pvm_pkint( &me, 1, 1 ); 
	 msgtype = 5; 
	 master = pvm_parent(); 	/*Find out who created it */ 
	 pvm_send( master, msgtype); 

float work(me, data, tids, nproc, loops) 
 int me, *tids, nproc; float *data; { 
		int i,j, dest; float psum = 0.0, sum = 0.1; 
		for (j = 1; j <= loops; j++){
			pvm_initsend( PvmDataDefault ); 
			pvm_pkfloat( &sum, 1, 1 ); 
			dest = me + 1; 
			if( dest == nproc ) dest = 0; 
			pvm_send( tids[dest], 22 ); 
			i = me - 1; 
			if (me == 0 ) i = nproc-1; 
			pvm_recv( tids[i], 22 ); 
			pvm_upkfloat( &psum, 1, 1 );

The programmer is assisted by a graphic interface that is of great help (see following figure), which acts as a PVM console and monitor, called xpvm (in Debian XPVM, install package xpvm), which makes it possible to configure the VM, execute processes, visualise the interaction between tasks (communications), statuses, information etc. Message passing interface (MPI)

The definition of the API of MPI [Prob, Proc] has been the work resulting from MPI Forum (MPIF), which is a consortium of more than 40 organisations. MPI has influences from different architectures, languages and works in the world of parallelism such as: WRC (Ibm), Intel NX/2, Express, nCUBE, Vertex, p4, Parmac and contributions from ZipCode, Chimp, PVM, Chamaleon, PICL. The main objective of MPIF was to design an API, without any particular relation with any compiler or library, so that efficient memory-to-memory copy communication, computing and concurrent communication and communication downloads would be possible, provided there is a communications coprocessor. In addition, it supports development in heterogeneous environments, with interface C and F77 (including C++, F90), where communication will be reliable and the faults resolved by the system. The API also needed an interface for different environments (PVM, NX, Express, p4...) and an implementation that was adaptable to different platforms with insignificant changes that did not interfere with the operating system (thread-safety). This API was designed especially for programmers that used message passing paradigm (MPP) in C and F77 to take advantage of the most important characteristic: portability. The MPP can be executed on multiprocessor machines, WS networks and even on machines with shared memory. The MPI1 version (the most widespread version) does not support the dynamic creation (spawn) of tasks, but MPI2 (which is developing at a growing rate) does.

Many aspects have been designed to take advantage of the benefits of communications hardware on scalable parallel computers (SPC) and the standard has been mostly accepted by parallel and distributed hardware manufacturers (SGI, SUN, Cray, HPConvex, IBM, Parsystec...). There are freeware versions (mpich, for example) (which are completely compatible with the commercial implementations from the hardware manufacturers) and they include point-to-point communications, collective operations and process groups, communications and topology contexts, support for F77 and C and a control, administration and profiling environment. But there are also some unresolved points, such as: SHM operations, remote execution, program construction tools, debugging, control of threads, administration of tasks, concurrent in/out functions (most of these problems arising from a lack of tools are resolved in version 2 of API MPI2). The function in MPI1, as there is no dynamic process creation, is very simple, given that of so many processes as tasks that exist, autonomous and executing their own multiple instruction multiple data (MIMD) style code and communicating via MPI calls. The code may be sequential or multithread (concurrent) and MPI works in threadsafe mode, in other words, it is possible to use calls to MPI in concurrent threads, as the calls re-enter.

To install MPI, it is recommended that you use the distribution, given that compiling it is extremely complex (due to the dependencies that it needs from other packages). Debian includes Mpich version 1.2.7-2 (Etch) in the mpich-bin package (the mpich one is obsolete) and also mpich-mpd-bin (version of a multipurpose daemon that includes support for scalable processes, management and control). The mpich-bin implements the MPI 1.2 standard and some parts of MPI 2 (such as, for example, parallel in/out). In addition, this same distribution includes another implementation of MPI called LAM (lam* packages and documentation in /usr/doc/lam-runtime/release.html). The two implementations are equivalent, from the perspective of MPI, but they are managed differently. All the information on Mpich can be found (after installing the mpich* packages) in /usr/share/doc/mpi (or in /usr/doc/mpi). Mpich needs rsh to execute in other machines, which means that we have to insert the user directory in a ~/.rhosts file with lines in the following format: host username to allow the username to enter the host without the password (the same as PVM). It should be remembered that we have to install the rshserver package on all the machines and if there is tcpd in /etc/inetd.conf on rsh.d, we must enable the hosts in /etc/hosts.allow. In addition, we must have mounted the directory of the user by NFS in all the machines and the /etc/mpich/machines.LINUX file must contain the hostname of all the machines that comprise the cluster (one machine per line, by default, appears as localhost). In addition, the user must have the Csh as the shell by default.

On Debian, we can install the update-cluster package to help with the administration. The installation of Mpich on Debian uses ssh instead of rsh for security reasons, although there is a link of rsh =>ssh for compatibility. The only difference is that we must use the ssh authentication mechanisms for the connection without password through the corresponding files. Otherwise, for each process that executes, we will have to enter the password before execution. To allow the connection between machines, with ssh, without the password, we must follow the procedure mentioned in the preceding section. To check it, we can run ssh localhost and then we should be able to log in without the password. Bear in mind that if we install Mpich and LAM-MPI, the mpirun of Mpich will be called mpirun.mpich and the mpirun will be that of LAM-MPI. It is important to remember that mpirun of LAM will use the lamboot daemon to form the distributed topology of the VM.

The lamboot daemon has been designed so that users can execute distributed programs without having root permissions (it also makes it possible to execute programs in a VM without calls to MPI). For this reason, to execute mpirun, we will have to do it as a user other than the root and execute lamboot beforehand. lamboot uses a configuration file in /etc/lam for the default definition of the nodes (see bhost*); please consult the documentation for more information (http://www.lam-mpi.org/). [Lam]

To compile MMPI programs, we can use the mpicc command (for example, mpicc -o test test.c), which accepts all the options of gcc although it is advisable to use (with modifications) some of the makefiles that are located in the /usr/doc/mpich/examples file. It is also possible to use mpireconfig Makefile, that uses the Makefile.in file as an entry to generate the makefile and is much easier to modify. After, we can run:

mpirun -np 8 programme


mpirun.mpich -np 8 programme

where np is the number of processes or processors in which the program will execute (8, in this case). We can put in the number we like, as Mpich will try to distribute the processes in a balanced manner better between all the machines of /etc/mpich/machines.LINUX. If there are more processes than processors, Mpich will use the swap characteristics of GNU/Linux to simulate parallel execution. In Debian and in the directory /usr/doc/mpich-doc (a link to /usr/share/doc/mpich-doc), we can find all the documentation in different formats (commands, API of MPI etc.).

To compile MPI: mpicc -O -o output output.c

Execute Mpich: mpirun.mpich -np N_processes output

We will now see two examples (which are included in the distribution of Mpich 1.2.x in directory /usr/doc/mpich/examples). Srtest is a simple program for establishing communications between point-to-point processes and cpi calculates the value of Pi in distributed form (through integration).

Example 11-7. Point-to-point communications: srtest.c

For compiling: mpicc -O -o srtest srtest.c

Execution of Mpich: mpirun.mpich -np N._processes srtest (will ask for the password [N. processes - 1] times if we do not have direct access through ssh).

Execution of LAM: mpirun -np N._processes srtest (must be a user other than the root)

#include "mpi.h"
#include <stdio.h> 
#define BUFLEN 512 
int main(int argc, char *argv[]) {
    int myid, numprocs, next, namelen; 
    char buffer[BUFLEN], processor_name[MPI_MAX_PROCESSOR_NAME];  MPI_Status status;
        /* Must be placed before other MPI calls, always */ 		
        /*Integrates the process in a communications group*/ 		
        /*Obtains the name of the processor*/ 
    fprintf(stderr,"Process %d on %s\n", myid, processor_name);     
    strcpy(buffer,"Hello People"); 
    if (myid ==numprocs1) next = 0; 
    else next = myid+1; 
    if (myid ==0) {   /*If it is the initial, send string of buffer*/. 
        printf("%d Send '%s' \n",myid,buffer); 
        MPI_Send(buffer, strlen(buffer)+1, MPI_CHAR, next, 99, 
        /*Blocking Send, 1 or :buffer, 2 or :size, 3 or :type, 4 or :destination, 5
        or :tag, 6 or :context*/
        /*MPI_Send(buffer, strlen(buffer)+1, MPI_CHAR, 
        MPI_PROC_NULL, 299,MPI_COMM_WORLD);*/ 
        printf("%d receiving \n",myid);
        /* Blocking Recv, 1 o :buffer, 2 or :size, 3 or :type, 4 or :source, 5 
        or :tag, 6 or :context, 7 or :status*/
        MPI_Recv(buffer, BUFLEN, MPI_CHAR, MPI_ANY_SOURCE, 99, MPI_COMM_WORLD,&status);
        printf("%d received '%s' \n",myid,buffer) }
    else { 
        printf("%d receiving \n",myid); 
        MPI_Recv(buffer, BUFLEN, MPI_CHAR, MPI_ANY_SOURCE, 99, MPI_COMM_WORLD,status);
        /*MPI_Recv(buffer, BUFLEN, MPI_CHAR, MPI_PROC_NULL, 299,MPI_COMM_WORLD,&status);*/
        printf("%d received '%s' \n",myid,buffer); 
        MPI_Send(buffer, strlen(buffer)+1, MPI_CHAR, next, 99, 
        printf("%d sent '%s' \n",myid,buffer);} 
    MPI_Barrier(MPI_COMM_WORLD); /*Synchronises all the processes*/   MPI_Finalize();
    /*Frees up the resources and ends*/ return (0);

Example 11-8. Calculation of distributed PI: cpi.c

For compiling: mpicc O or cpi cpi.c.

Execution of Mpich: mpirun.mpich -np N. processes cpi (will ask for the password (N. processes - 1) times if we do not have direct access through ssh).

Execution of LAM: mpirun -np N. processes cpi (must be a user other than root).

#include "mpi.h"
#include <stdio.h> 
#include <math.h>
double f( double ); 
double f( double a) { return (4.0 / (1.0 + a*a)); } 
int main( int argc, char *argv[] ) {
    int done = 0, n, myid, numprocs, i; 
    double PI25DT = 3.141592653589793238462643; 
    double mypi, pi, h, sum, x; 
    double startwtime = 0.0, endwtime; 
    int namelen; 
    char processor_name[MPI_MAX_PROCESSOR_NAME]; 
          /*Indicates the number of processes in the group*/
          /*Id of the process*/   MPI_Get_processor_name(processor_name,&namelen); 
          /*Name of the process*/ 
    fprintf(stderr,"Process %d on %s\n", myid, processor_name);  
    n = 0; 
    while (!done) {
        if (myid ==0) { /*If it is the first...*/ 
          if (n ==0) n = 100; else n = 0; 
          startwtime = MPI_Wtime();} /* Time Clock */
        MPI_Bcast(&amp;n, 1, MPI_INT, 0, MPI_COMM_WORLD);    /*Broadcast to the rest*/ 
        /*Send from 4th arg. to all 
        the processes of the group. All others that are not 0
        will copy the buffer from 4 or arg -process 0-*/ /*1.:buffer, 
        2nd :size, 3rd :type, 5th :group */
        if (n == 0) done = 1; else {
            h = 1.0 / (double) n; 
            sum = 0.0; 
            for (i = myid + 1; i &lt;= n; i + = numprocs) {
              x = h * ((double)i - 0.5); sum + = f(x); }
            mypi = h * sum; 
        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, 
        /* Combines the elements of the Send Buffer of each process of the 
        group using the operation MPI_SUM and returns the result in 
        the Recv Buffer. It must be called by all the processes of the 
        group using the same arguments*/ /*1st :sendbuffer, 2nd 
        :recvbuffer, 3rd :size, 4th :typo, 5th :oper, 6th :root, 7th  
        if (myid == 0){ /*Only the P0 prints the result*/ 
        printf("Pi is approximately %.16f, the error is %.16f\n", pi, fabs(pi - PI25DT)); 
        endwtime = MPI_Wtime(); 
        printf("Execution time = %f\n", endwtime-startwtime); }
  MPI_Finalize(); /*Free up resources and finish*/ 
  return 0;

As XPVM exists in PVM, in MPI there is an analogous application (more sophisticated) called XMPI (xmpi in Debian). It is also possible to install a library, libxmpi3, which implements the XMPI protocol to graphically analyse MPI programs with more details than offered in xmpi. The following figure shows some of the possible graphics in xmpi.