Batch Cluster Information

About Batch Cluster Use

A cluster system controls a group of compute nodes which are assigned tasks by a batch scheduler. The scheduler selects work to be done by the compute nodes from a series of batch queues. These queues manage job requests (shell scripts generally referred to as jobs) submitted by users. In other words, to get your computations done by the cluster, you must submit a job request to a specific batch queue. The scheduler will assign your job to a compute node in the order determined by the policy on that queue and the availability of an idle compute node.

Why batch queues?

Batch processing from job queues may seem an unusual approach in an age of personal computers. The assumption is that the work to be done is computationally significant (i.e., it will require hours, days, or even several weeks of execution time). In that case, it may well be more efficient to "farm out" the work than to run it in the background on your personal computer. Almost inevitably, this implies sharing a resource with other users, and the immediate question is how.

The idea behind batch queues is that each queue is available only to a subset of users. The group of users assigned to a specific queue is usually determined by department affiliation or, perhaps, application type. This allows different groups to decide what constraints are reasonable for jobs submitted through the group's queue. Those constraints set the policy for each queue, and are arrived at through discussion with the various members of that group. Once policy is set, all jobs submitted to that queue are automatically subject to the agreed-upon constraints.

Job Queues

In order to submit a job to a queue for processing, you must first log in to the batch processor system. The only form of remote access supported by cluster controllers is secure shell (ssh, slogin, etc.). Consult our FAQs page for more specific information regarding use of job queues.
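For example, from any machine with an ssh client, the login step might look like the following, where <controller-hostname> is only a placeholder for the actual controller address (consult the FAQs page for the real name):

ssh your_username@<controller-hostname>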

Submitting work

You should now have a pretty good idea of which queues are intended for your work (if not, please ask!). In order to submit work to a queue, it must be encapsulated in a shell script. If you are familiar with UNIX, this should be pretty simple, but even if you are not, it is not a major obstacle. A shell script is a plain text file containing the commands you would normally enter from the keyboard in order to do your work. As a simple example, consider a file that contains the following commands:

date
hostname

Each of these is a UNIX command; they output the current date and time and the name of the host, respectively. If these two lines are saved in a file named qtest, you can make it runnable by setting its permissions to indicate it is an executable file (chmod 755 qtest). You can then test the script by entering its name from the keyboard (./qtest). The output will be a time stamp and the name of the host. Note that the qtest script executed in this fashion did not run on the cluster; it ran on the local host.
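Submitting that same script to a batch queue is a single command on the controller. The exact command depends on the batch scheduler in use, which is not named here; as a sketch, on a PBS/Torque-style system the submission and a status check might look like the following, where myqueue is a placeholder for the name of your group's queue:

qsub -q myqueue qtest
qstat

On a Slurm-based system the equivalent commands would be sbatch and squeue. In either case, the job's standard output is returned to you in a file rather than printed to your terminal.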

Consult our FAQs page for more specific information regarding submitting work.

Performance

The execution environment for a batch job will have your login directory available for reading input data and writing output data. This is made possible by having the compute node mount your controller login directory over the network. You should be aware that writes over the network are much slower than writes to local disk.
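If your job reads or writes large files heavily, one common pattern is to stage the data on the compute node's local disk and copy only the final results back to your network-mounted login directory. A minimal sketch of that pattern follows; the scratch location (/tmp) and the program name (my_program) are placeholders, so ask the administrators which local scratch path is preferred on your cluster:

# create a private scratch directory on the node's local disk
SCRATCH=/tmp/$USER.$$
mkdir -p $SCRATCH
# copy input from the network-mounted login directory to local disk
cp $HOME/input.dat $SCRATCH/
cd $SCRATCH
# run the (placeholder) program against the local copy of the data
$HOME/my_program input.dat > output.dat
# copy results back to the login directory and clean up
cp output.dat $HOME/
rm -rf $SCRATCH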

Clusters and Controllers

A cluster facility is intended to make generic compute nodes available for your use. The design is scalable (meaning it is relatively simple to add more compute nodes to the system) and inexpensive (meaning the per-node cost is fairly low). The batch server is intended to serve as a staging area and job scheduler. Please do not run compute-intensive jobs on the controller; submit them to a batch queue instead. Please do not use the controller as your personal workstation or development platform. Any jobs run directly on the controller instead of through the batch system are subject to immediate termination without prior notification.

Constraints on services

The disk space on a cluster controller is intended as a staging area for work to be done and a collection area for results. It is not intended to provide archive services. In other words, you will need to provide your own long-term storage facilities and move results off the controller to that storage on a regular basis. Please read the following carefully: there is absolutely NO backup of your disk space on the controller or on the cluster nodes. In case of a system error or, heaven forbid, a user error, there is no possibility of recovering lost data. Please do not leave your only copy of valuable data on the controller!
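As an illustration, either of the following commands, run from your own machine (the hostname and paths are placeholders), would copy a directory of results from the controller to local storage:

scp -r your_username@<controller-hostname>:results /local/archive/
rsync -av your_username@<controller-hostname>:results/ /local/archive/results/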

The reason for this constraint is twofold. First, this is a compute facility and would be organized differently for archive purposes. Second, it would be very expensive to back up the amount of data expected to pass through a controller. You must develop your own plans for storing the data generated on the cluster or accept the low but nonetheless real risk of losing data. You have been forewarned.

We would prefer to operate the controller without using disk quotas. The most efficient use of available space can be had if users on a particular disk (and therefore in the same group) negotiate usage with each other. This might, for instance, allow a large amount of storage to be available to one user for a limited period when others don't need the space, in exchange for the same privilege for other users at a later time. If the members of a particular group cannot or will not negotiate in good faith with others in the group, then a quota will be set for the entire group. With quotas there is no provision for extenuating circumstances, so it is to everyone's advantage to have user-managed disk allocation.

Questions and comments

It is our goal to run a useful computing facility for researchers at UNT. Contemporary research activity spans a spectrum of computer applications far in excess of the services supplied on our clusters. If you have a compute-intensive job (as opposed to, say, an I/O-intensive one like a database or network server), we would be happy to discuss with you the potential of running it on our clusters. If you're not sure whether some particular application would be a good fit for cluster computing, perhaps the following guidelines will help.

The classic application for cluster computing is a job that does a lot of computation on a little data, with typical runtimes in the hours-to-days range. If you have source code that will compile on Linux, that is best. If you have binaries built for Linux/Intel, that may well work. While we are happy to provide information to help you port an application to this environment, we do not have staff to provide you with much programming assistance.

Tasks which require user interaction at runtime will not work on a cluster. Tasks which provide services to other computers (such as web servers or file system sharing) will not be supported on our clusters. Activities which require creation or deletion of temporary accounts (e.g., class accounts) will not be managed on our clusters. Tasks which can easily be performed on individual workstations (such as email, web browsing, etc.) should not require a cluster.

The use of commercially licensed software on the clusters (the batch system and operating system software are free) may be viable in some cases but not in others. If the software has a site license and does not have constraints on the number of simultaneous users, then it might be possible to run it on the clusters. In any case, the purchase of the software and management of the licenses is not a service provided by UIT. Please discuss your project with us before purchasing software intended for use on this cluster, to avoid misunderstandings and wasted time and money.

Please send questions or comments to hpc-admin@unt.edu. We will be happy to meet with you to discuss your application, and how it might be run effectively on our facilities. Even if your application is not, for some reason, a good fit for our particular cluster implementations, it will be useful to know your requirements when planning for future projects.