Simple.: XGrid agent for Unix architectures

April 19, 2005

Recent changes...

Recent changes include fixes for two major bugs:
The libxml2 usage has changed: the agent now uses xmlStrncatNew().
The infamous "long message" problem, which occurred when trying to send a long message in multiple frames.

We have also imported code to enable Rendezvous (Zeroconf) through Howl; this is optional. To enable Rendezvous, either let the configure script locate libhowl's includes and libraries on its own, or pass the following options to configure: --with-howl-includedir=/path/to --with-howl-libdir=/path/to

Please see our new project page on SourceForge.

-Matthew W. Jones (matburt -at- oss-institute.org)

June 21, 2004

XGrid agent for Unix architectures

In January 2004, Apple released XGrid, a simple system for setting up and using a cluster of OS X machines. It is very simple to use compared to other grid or cluster systems and reduces the learning curve for performing cluster computation. What is missing to make it even more powerful is an agent for architectures other than Mac OS X (in Xgrid terminology, the agent is the computer performing the computation). This is necessary since the computer infrastructure available to scientists is not always based on Mac OS X, and universities have a significant investment in various Unix platforms that cannot be neglected when running computations in clusters. This article introduces the first working Xgrid agent for Linux and other Unix systems. The agent compiles and works on Linux (at least Debian and RedHat) and Darwin (all tested). You still need an OS X machine for the controller and XGrid.app.

This article is divided into the following sections:

Getting the source code and compiling it
Usage and examples
Modifying the source code
How it all fits together: Xgrid and its various layers
Conclusion

For comments, here is my contact info at the Ontario Cancer Institute (University of Toronto), Biophotonics group.

Bandwidth generously provided by egate and RegisterYour.CA

1) Getting the source code and compiling it

Necessary requirements to compile the agent:

libxml2
roadrunner (the BEEP library) and its required library glib-2.0 (and libxml2)
xgridagent.c xgridagent.h xgridagent-profile.c xgridagent-profile.h xgrid.config.xml (see below for download)

I will give instructions to compile everything in your home directory tree (that is: you don't need root privileges). I haven't encountered any problems myself but let me know if you do.

Getting and compiling libxml2

If libxml2 is not installed on your system (check with xml2-config --libs), then you can install and compile it the standard way:

    curl -O http://xmlsoft.org/sources/libxml2-2.6.9.tar.gz
    tar xzvf libxml2-2.6.9.tar.gz
    cd libxml2-2.6.9
    ./configure --prefix=$HOME
    make
    make install

or, with DarwinPorts: sudo port install libxml2

Getting and compiling glib-2.0

    curl -O ftp://ftp.gtk.org/pub/gtk/v2.4/glib-2.4.1.tar.gz
    tar xzvf glib-2.4.1.tar.gz
    cd glib-2.4.1
    ./configure --prefix=$HOME
    make
    make install

or, with DarwinPorts: sudo port install glib2

Getting and compiling Roadrunner

    curl -O http://ftp.codefactory.se/pub/RoadRunner/source/roadrunner/roadrunner-0.9.1.tar.gz
    tar xzvf roadrunner-0.9.1.tar.gz
    cd roadrunner-0.9.1
    ./configure --prefix=$HOME
    make
    make install

Roadrunner is not available through DarwinPorts.

Getting and compiling XgridAgent

    curl -O http://www.novajo.ca/xgridagent-1.0.tar.gz
    tar xzvf xgridagent-1.0.tar.gz
    cd xgridagent-1.0
    ./configure --prefix=$HOME --with-roadrunner-includedir=$HOME/include/roadrunner-1.0 --with-roadrunner-libdir=$HOME/lib
    make

You will get one warning: /home/dccote/xgridagent-rr/xgridagent.c:346: warning: the use of `tmpnam' is dangerous, better use `mkstemp'. Don't worry about it for now: that's the least of your problems. Don't run the agent as root; run it as a regular user, because there are a lot of vulnerabilities in the code. Try running the agent with:

./xgridagent xgridControllerIP

to connect to a controller (you can start a controller by hand in the terminal of an OS X machine with /usr/libexec/xgrid/GridServer). Passwords must be turned off, since the agent does not support them yet. You can then connect to the controller using the XGrid.app application, and start testing your cluster with Linux agents (see the limitations below).

Several notes on compilation:

If you use this for anything other than testing, you are insane.
The configure script isn't great: it does not check for all compatibility issues and might even fail to run properly without telling you. If you type pkg-config --list-all and you don't see glib-2.0, gthread-2.0, gobject-2.0 and libxml-2.0, you have a problem with the installation of some of the packages and must fix that first. One thing I have noticed: you might need to define PKG_CONFIG_PATH to point to where the various .pc files are (in this example, setenv PKG_CONFIG_PATH $HOME/lib/pkgconfig in csh, or export PKG_CONFIG_PATH=$HOME/lib/pkgconfig in sh).
Libraries are linked dynamically. Make sure that LD_LIBRARY_PATH is defined to include at least $HOME/lib (where the libraries are installed if you followed the instructions above).
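The two environment notes above can be combined into one small Bourne shell fragment. This is only a sketch, assuming everything was installed under $HOME as in the instructions; prepend_path is a hypothetical helper that is not part of the xgridagent distribution.

```shell
#!/bin/sh
# Sketch: make a $HOME-prefixed install visible to pkg-config and to the
# dynamic linker. prepend_path is a hypothetical helper, not part of xgridagent.

# Print "$1:$2", or just "$1" when the existing value $2 is empty.
prepend_path() {
    if [ -n "$2" ]; then
        printf '%s:%s\n' "$1" "$2"
    else
        printf '%s\n' "$1"
    fi
}

PKG_CONFIG_PATH=$(prepend_path "$HOME/lib/pkgconfig" "${PKG_CONFIG_PATH:-}")
LD_LIBRARY_PATH=$(prepend_path "$HOME/lib" "${LD_LIBRARY_PATH:-}")
export PKG_CONFIG_PATH LD_LIBRARY_PATH

# Report every module that pkg-config cannot find (skipped if pkg-config is absent).
if command -v pkg-config >/dev/null 2>&1; then
    for module in glib-2.0 gthread-2.0 gobject-2.0 libxml-2.0; do
        pkg-config --exists "$module" || echo "missing pkg-config module: $module" >&2
    done
fi
```

Source this fragment (with the . command) before running configure and before starting the agent, so that the exported variables stay set in your shell.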

2) Overview of usage

The agent loads most of its configuration parameters from xgrid.config.xml. You may modify it at will. The program writes a file called cookie so that it can reuse the same cookie between runs. The actual tasks run in "/tmp/filexxxx/".

When you open XGrid.app, you should obtain something along the lines of:

where the cluster in the picture makes use of three Linux machines, in addition to an OS X agent running as dccote. In your case, you will most likely have only one Linux agent.

The Shell Xgrid plug-in will simply call a shell command (regardless of where it is in the execution path on the agent). For instance, on a cluster with a single Linux agent, one obtains the following result with uname -a:

You may try the XFeed plug-in to send a range of arguments to a command, but because of the way Xgrid.app works, the command's path must be the same on the Linux agent and on the computer from which you run Xgrid.app (if it isn't, Xgrid.app will tell you that the command is invalid). Note the following major restriction: large outputs/files will not get sent back properly and the agent will hang (see bugs below).

Finally, you can make a Custom plug-in: that's where it becomes interesting. If you want to execute a Bourne shell script, then everything is fine (such scripts are portable across Unix platforms):

with Test.sh being:

    #!/bin/sh
    echo "Some text to stdout"
    ls -l
    echo "Some text into file1" > someFile1.txt
    exit 0

(make sure Test.sh is executable with chmod +x Test.sh). You will get someFile1.txt copied back in the destination directory, as well as "Some text to stdout" in the stdout file.

The custom plug-in can also send a binary executable to the agent and execute it, after which it sends the results back. Since you can't know ahead of time which node of your cluster will run what, you must provide a binary for each type of agent you have (or you must compile it each time). Assuming you know that you have both Darwin on PowerPC and Linux on i686 agents, you can do the following:

where cal and ncal are the binaries for each platform and the shell script chooseAndRun.sh is:

    #!/bin/sh
    # We simply obtain the operating system name and hardware type with
    # the uname command. We have to provide the various binaries ourselves.
    os=`uname`
    hw=`uname -m`
    echo "We are running on `uname -a`" > outfile.txt
    # We execute and pass the arguments directly ("$@" keeps arguments
    # that contain spaces intact, which $* does not).
    exec "./$os/$hw/cal" "$@"

The script will figure out what architecture it is running on and call the appropriate binary. This is the starting point for a multi-architecture calculation.
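If no binary was shipped for the platform the task lands on, the script above fails with a rather cryptic exec error. A slightly more defensive variant could look like the following sketch. It assumes the same ./os/hardware/cal layout as above; binary_for and run_for_this_host are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Sketch of a more defensive chooseAndRun.sh. The ./<os>/<hardware>/cal layout
# is the one used above; the function names are made up for this example.

# Print the path of the binary for operating system $1 and hardware type $2.
binary_for() {
    printf './%s/%s/cal\n' "$1" "$2"
}

# Run the binary matching this host, or report that none was provided.
run_for_this_host() {
    bin=$(binary_for "$(uname)" "$(uname -m)")
    if [ ! -x "$bin" ]; then
        echo "no executable for this platform at $bin" >&2
        return 1
    fi
    exec "$bin" "$@"
}
```

To use it as the task script, end the file with run_for_this_host "$@" so that the arguments given to the plug-in reach the binary unchanged.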

Notes on usage (also known as bugs):

Again, if you use this for anything other than testing, you are insane.
Running multiple tasks at once doesn't quite work. I am not sure why; the timing of the mutex/semaphore must be wrong. For now, 1 task per agent is recommended.
Very important bug: if the message sent by the agent is larger than 15k, the agent will hang. This is a problem due to my poor understanding of BEEP. See the problem description in the function xgridagent_SengMSG() in xgridagent-profile.c. This means that with the Custom plug-in, if you generate data in the working directory and it is larger than 15k (tarred and zipped), the agent will hang. Most useful cases fall in that category. Fix the code if you know how, because I don't.
There are more buffer overflow vulnerabilities than you can count. They will get fixed. In the meantime, don't run this as root.
If you close the server, the agent will most likely crash.
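With an agent version that still has the 15k limit, a task script can at least estimate how big its results will be before it exits. The sketch below measures the gzip-compressed tar archive of a directory. Two assumptions: "15k" is taken to mean 15360 bytes, and the agent's own archive is taken to be about the same size as the one built here, although the exact tar options the agent uses may differ. Both function names are hypothetical.

```shell
#!/bin/sh
# Sketch: estimate whether a working directory would fit in one agent message.
# LIMIT_BYTES is an assumption ("15k" read as 15 * 1024 bytes).
LIMIT_BYTES=15360

# Print the size in bytes of the gzip-compressed tar archive of directory $1.
archive_size() {
    (cd "$1" && tar cf - . | gzip -c | wc -c | tr -d ' ')
}

# Succeed when the compressed archive of directory $1 is within the limit.
fits_in_one_message() {
    [ "$(archive_size "$1")" -le "$LIMIT_BYTES" ]
}
```

For instance, a task script could finish with fits_in_one_message . || echo "results too large, the agent may hang" >&2 so that the problem at least shows up in the stderr output.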

3) Modifying the source code

If you want to modify the code, then here are a few general warnings and comments:

The code is released under the Apache licence.
Please send me your modifications at dccote@novajo.ca so I can include them in the main distribution. If you PGP sign your email, it will get through my spam filter for sure (my PGP key is here). Don't encrypt, just sign.
The code is full of threads. If you don't know threads, then read a tutorial (like http://www.yolinux.com/TUTORIALS/LinuxTutorialPosixThreads.html).
The code makes heavy use of XPath, which is a way to refer to any part of an XML document (it looks like a directory tree but it is more than that). You can simply modify the examples throughout the code, but you can also learn more about it.

Here is a graphical overview of the code:

Specific comments and pitfalls:

The semaphore function sem_init is not implemented on Darwin, whereas the functions sem_open/sem_close are not implemented on Linux. That's why there is some juggling with the initialization of the semaphores in the code.
Take a look at the to do list below.
The only entry point from the BEEP library is xgridagent_ProcessBeepMSG() (called when a message is received from the controller).
The only call to a BEEP library function from the code is xgridagent_SengMSG() (called when a message is sent by xgridagent).
Adjust the xgrid.config.xml file to your liking for the debug level: 4 will spit out quite a bit of stuff, 0 is pretty quiet. (See comments in file and code).
If for some reason you would prefer to use beepcore-c instead of roadrunner, it is actually quite simple to change and I have another version of the code that uses beepcore-c (I started with beepcore-c and switched in the middle of the development because I had too many problems with it). I don't recommend it.

To do:

Fix the hang problem when messages are larger than the advertised BEEP window. This should be simple, I just don't know enough about BEEP.
Improve the autoconf, automake scripts. BTW, autoconf 2.59 produces a bogus libtool. I use autoconf 2.57.
Much better error management needed: the use of a static buffer with LogMessage() is not even thread safe.
Stop using unsafe buffers: there are tons and tons of vulnerabilities (buffer overflows) in the code, because I use printf- and scanf-style functions with fixed-size buffers.
Add more flexibility in the way the tasks are started. For instance, allow the use of various site-specific commands (e.g. local job management systems) for starting tasks.
It is assumed everything is in UTF-8 characters. That could be wrong and could lead to incorrect replies, results, etc. I am extremely cavalier with my use of (the aptly named) BAD_CAST operator from Libxml2.
Check for the presence/absence/compatibility of the various Unix commands called (tar for instance does not always accept -z).
Better security: jailing the process in /tmp/ and running as nobody could be useful.
Clean up the code and use better terminology.
Use passwords and SASL profiles.
Add support for idle mode.
Add Rendezvous support.
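The item about tar not always accepting -z can be handled with a small runtime probe. The following sketch is written in Bourne shell rather than in the agent's C code, and it assumes that mktemp and gzip are available on the agent; tar_supports_z and make_tgz are hypothetical names.

```shell
#!/bin/sh
# Sketch: build a .tgz whether or not the local tar understands -z.
# Assumes mktemp and gzip exist; both function names are made up.

# Succeed when the local tar can create a gzip-compressed archive with -z.
tar_supports_z() {
    probe=$(mktemp -d) || return 1
    : > "$probe/f"
    if (cd "$probe" && tar czf probe.tgz f) >/dev/null 2>&1; then
        status=0
    else
        status=1
    fi
    rm -rf "$probe"
    return $status
}

# Archive the contents of directory $1 into the gzip-compressed file $2.
make_tgz() {
    if tar_supports_z; then
        (cd "$1" && tar czf - .) > "$2"
    else
        (cd "$1" && tar cf - . | gzip -c) > "$2"
    fi
}
```

The output file should live outside the directory being archived, otherwise tar would try to include the archive it is writing.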

4) How it all fits together: Xgrid and its various layers

There are various layers in the Xgrid agent:

4.1) The Xgrid layer

The XGrid protocol is actually quite simple to understand, since there are only three types of messages that can be passed: a request, a reply (the answer to a request), and a notification (to which there is no need to reply). Each message is identified by a CorrelationID, a name, a type (request/reply/notification) and a payload, whose content is specific to the message identified by the name. The XGrid protocol is also the application protocol (that's what the application understands) and has nothing to do with the actual communication protocol (TCP/IP, BEEP, etc.). Here is a graphical overview of the client registration process as well as the task submission process: View Registration image, View Task Submission image

4.2) The BEEP layer

Each XGrid message is sent as a BEEP MSG, and must be acknowledged with an empty RPY once it has been received completely. MSGs can be sent in smaller chunks (frames). The implementation of BEEP that is used in this xgridagent is Roadrunner, but there is also beepcore-c (which is not as flexible).
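To get a feel for what sending a MSG in frames means, here is a toy illustration with ordinary files. It is not real BEEP framing: actual frames carry headers, sequence numbers and a trailer, and the sender has to wait for the receiver to open the window again. The sketch only shows a payload cut into pieces no larger than a window and put back together. The 4096-byte value is BEEP's default initial window over TCP (RFC 3081); the window actually advertised by a controller may differ, and both function names are hypothetical.

```shell
#!/bin/sh
# Toy illustration of framing: chunk a payload, then reassemble it.
# This is NOT the BEEP wire format, only the chunking idea.
WINDOW_BYTES=4096   # default initial BEEP window over TCP (RFC 3081)

# Split file $1 into chunks of at most WINDOW_BYTES inside directory $2.
split_into_frames() {
    split -b "$WINDOW_BYTES" "$1" "$2/frame."
}

# Write the chunks found in directory $1 to stdout, in their original order.
reassemble_frames() {
    cat "$1"/frame.*
}
```

A 10000-byte payload ends up in three chunks (4096, 4096 and 1808 bytes), and concatenating them gives back the original payload.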

4.3) XML

It is convenient but not necessary that both XGrid and BEEP rely on XML. Some BEEP information (in the initiation of the connection for instance) is encoded in XML. XGrid uses XML extensively, which makes it trivial to analyze.

4.4) Lightweight threads

Because two computers are talking to each other over the network, it is convenient to use threads for the BEEP library. This means that there is no "single point" in the code where one can follow the execution: it looks like several parts are running in parallel. To make sure that the various threads can talk to each other, one uses a simple locking mechanism (mutex) or a signalling system (semaphores).

Conclusion

There remain a few important bugs in the agent code, but they should be worked out quickly if others look at the code. For now, it can be used for trivial examples involving agents of different architectures on the same cluster. Since the Xgrid application protocol is platform agnostic, this agent can be used to bring any Unix machine into an XGrid cluster. The official reference for this site is: "XGrid agent for Unix architectures" available at http://www.novajo.ca/xgridagent/. Any question or comment can be sent to Daniel Côté (OCI, U of T) or to dccote@novajo.ca.

This work was done with help and encouragement from the XGrid team and Ernest Prabhakar at Apple.

Posted by dccote at June 21, 2004 11:51 PM