	An Overview of Galileo
	
	
	Design Goals
	     Even though Galileo is 
	cheap as supercomputers go, it still represents
	a large monetary investment for our department.  Because of this,
	we've designed Galileo with the intent that almost everyone in the department
	will benefit from it in some way.  Most supercomputing clusters
	are useful to only a few talented programmers who know how to write
	parallel code that takes full advantage of the cluster.  These users 
	are only a small fraction of our user base.  In designing Galileo,
	we've also kept in mind the average grad student or undergrad 
	(or faculty member) who doesn't want or need to spend time parallelizing code, 
	but needs more computing power than that provided by our previous
	"compute server", an IBM PowerServer 370 RS6000.  Our intent is that
	everyone using the RS6000, in whatever capacity, will realize an
	immediate benefit by migrating to Galileo.
	
	Fast Serial Performance
	     To satisfy the 
	needs of these users, we've built Galileo from
	fast nodes and implemented a number of load-balancing schemes.
	Each node of Galileo is a 300 MHz Pentium II (PII-300) with 128 MB of RAM.  Various
	benchmarks show that a single node is from 1.3 to 2 times as fast
	as our RS6000.  Thus, even users who use only a single node of the
	cluster will see improved performance.
	 
	Load Balancing
	     Performance is further 
	improved by spreading the user load around 
	the cluster.  Galileo's nodes communicate through an internal 100 Mbps
	ethernet network.  One of the nodes has a second ethernet card, through
	which the cluster communicates with the outside world.  This node
	acts as a firewall, mediating traffic into and out of the cluster.
	Incoming connections to selected services (currently telnet, ftp,
	http, ssh, rlogin, rsh and xdm) are automatically forwarded to
	the currently least-loaded node.  For example, with twelve nodes in the 
	cluster, each of the first twelve users who telnet into Galileo might
	find that he has an entire node all to himself. 
	 
	     Once a user has logged 
	on to a cluster node, she is free to
	use other nodes as well.  Security has been set up so that users can  
	use other nodes transparently, without a password.  For example,
	the user might start running the same application on two nodes by
	typing:
 
	ssh node1 "myprogram 1 2 3 > outfile1 &"
	ssh node2 "myprogram 4 5 6 > outfile2 &"
 
	To help with load-balancing, we've written an application called "run",
	which will execute a command on the currently least-loaded node.  For
	example, instead of invoking her program by typing:
	myprogram 1 2 3
 
	a user could type:
	run myprogram 1 2 3
 
	"Run" preserves the current working directory (all user directories are
	available across the cluster) and the user's current environment variables.
	
	     Finally, the Mosix
	system provides load-balancing for each process on each node, without
	user intervention.  Mosix allows processes to move to other nodes of the
	cluster automatically.  When the Mosix system determines that performance
	could be improved by migrating a process to another node, it does so.
	As far as the user is concerned, the process still looks like it's
	executing locally.  The process may migrate around the cluster, running
	on several different nodes before it finishes.  Mosix is installed on
	all of the Galileo nodes, and runs automatically, without requiring any
	special commands from the user.  Users who want to manually control the
	action of Mosix should look at the man page for the mosrun command.
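
	     As an illustration (a sketch only: the exact mosrun options vary
	between Mosix versions, so check "man mosrun" on Galileo for the options
	it actually supports), a user's session might look like this:

	# Ordinary invocation; Mosix may migrate this process automatically.
	myprogram 1 2 3 > outfile

	# Assumed basic mosrun invocation for manual control; see "man mosrun"
	# for the options available on Galileo.
	mosrun myprogram 1 2 3 > outfile
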
	Fast Parallel Performance
	     The features described above 
	satisfy the needs of many of our users, but some users
	really do have large problems which require the full power of the 
	cluster.  To make this possible, we've built Galileo 
	with fast network connections between nodes, and taken
	care that each node is well-designed for fast communication
	over that network.  The computers which compose
	Galileo are connected in a star topology, centered on 
	a 16-port 100 megabit per second ethernet switch.  Since 
	networking speed can be limited by memory bandwidth, each
	computer is built with SDRAM memory instead of the slower 
	FPM or EDO memory.
	 
	     We've also installed several software
	packages which make the task of writing parallel programs
	easier.  These include PVM ("Parallel Virtual Machine") and MPI
	("Message Passing Interface"), two programming environments for
	parallel computing.  A "High Performance Fortran" compiler (pghpf)
	is also available.  High Performance Fortran is a dialect of Fortran
	with specialized features for use in parallel applications.
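
	     As a rough sketch of how these are used (this assumes an
	MPICH-style MPI installation; the exact commands and paths on Galileo
	may differ), a user might compile and run a small parallel job with:

	mpicc -o myprog myprog.c     # compile an MPI program written in C
	mpirun -np 4 myprog          # run it as four cooperating processes

	pghpf -o myhpfprog myhpfprog.f     # compile a High Performance Fortran program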
	 
	  
	  
	  
	 
	
	For more information about Galileo, contact Bryan Wright.
	
	
	  
	  
	  
	