
An Overview of Galileo

Design Goals

     Even though Galileo is cheap as supercomputers go, it still represents a large monetary investment for our department. Because of this, we've designed Galileo with the intent that almost everyone in the department will benefit from it in some way. Most supercomputing clusters are useful only to a few talented programmers who know how to write parallel code that takes full advantage of the cluster. These users are only a small fraction of our user base. In designing Galileo, we've also kept in mind the average grad student or undergrad (or faculty member) who doesn't want or need to spend time parallelizing code, but needs more computing power than that provided by our previous "compute server", an IBM PowerServer 370 RS6000. Our intent is that everyone using the RS6000, in whatever capacity, will realize an immediate benefit by migrating to Galileo.

Fast Serial Performance

     To satisfy the needs of these users, we've built Galileo from fast nodes and implemented a number of load-balancing schemes. Each node of Galileo is a PII-300 with 128 MB of RAM. Various benchmarks show that a single node is from 1.3 to 2 times as fast as our RS6000. Thus, even users who run on only a single node of the cluster will see improved performance.

Load Balancing

     Performance is further improved by spreading the user load around the cluster. Galileo's nodes communicate through an internal 100 Mbps ethernet network. One of the nodes has a second ethernet card, through which the cluster communicates with the outside world. This node acts as a firewall, mediating traffic into and out of the cluster. Incoming connections to selected services (currently telnet, ftp, http, ssh, rlogin, rsh and xdm) are automatically forwarded to the currently least-loaded node. For example, with twelve nodes in the cluster, each of the first twelve users who telnet into Galileo might find that he has an entire node all to himself.
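
     For example, two users who log in at about the same time may well find themselves on different nodes. (The host name below is just a placeholder for Galileo's actual address.)

	ssh galileo                # placeholder host name; use Galileo's real address
	hostname                   # shows which node your connection was forwarded to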

     Once a user has logged on to a cluster node, she is free to use other nodes as well. Security has been set up so that users can use other nodes transparently, without a password. For example, the user might start running the same application on two nodes by typing:

	ssh node1 "myprogram 1 2 3 > outfile1 &"
	ssh node2 "myprogram 4 5 6 > outfile2 &"
To help with load-balancing, we've written an application called "run", which will execute a command on the currently least-loaded node. For example, instead of invoking her program by typing:
	myprogram 1 2 3
a user could type:
	run myprogram 1 2 3
"Run" preserves the current working directory (all user directories are available across the cluster) and the user's current environment variables.

     Finally, the Mosix system provides load balancing for each process on each node, without user intervention. Mosix allows processes to migrate to other nodes of the cluster automatically: when the Mosix system determines that performance could be improved by moving a process to another node, it does so. As far as the user is concerned, the process still looks like it's executing locally, even though it may migrate around the cluster and run on several different nodes before it finishes. Mosix is installed on all of the Galileo nodes and runs automatically, without requiring any special commands from the user. Users who want to control Mosix's behavior manually should look at the man page for the mosrun command.
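
     For example, a user could start a program explicitly under Mosix control with something like the following (this is only a sketch; see the mosrun man page for the options supported on our installation):

	man mosrun                   # documentation for controlling Mosix by hand
	mosrun myprogram 1 2 3       # start myprogram as an ordinary, migratable Mosix process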

Fast Parallel Performance

     The features described above satisfy the needs of many of our users, but some users really do have large problems which require the full power of the cluster. To make this possible, we've built Galileo with fast network connections between nodes, and taken care that each node is well-designed for fast communication over that network. The computers which compose Galileo are connected in a star topology, centered on a 16-port 100 megabit per second ethernet switch. Since networking speed can be limited by memory bandwidth, each computer is built with SDRAM memory instead of the slower FPM or EDO memory.

     We've also installed several software packages which make the task of writing parallel programs easier. These include PVM ("Parallel Virtual Machine") and MPI ("Message Passing Interface"), two programming environments for parallel computing. A "High Performance Fortran" compiler (pghpf) is also available. High Performance Fortran is a dialect of Fortran with specialized features for use in parallel applications.
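
     As a rough illustration (the exact commands depend on which MPI implementation is installed, and the file names below are just placeholders), a build-and-run cycle for a parallel program might look like this:

	mpicc -o hello hello.c       # compile an MPI program with the MPI compiler wrapper
	mpirun -np 8 ./hello         # run it as 8 cooperating processes spread across the cluster
	pghpf -o sim sim.f90         # compile a High Performance Fortran program with pghpf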


 
 
For more information about Galileo, contact Bryan Wright.