.----------------------------------------------------------------------------.
|                      UTS - Unbalanced Tree Search                          |
|  ______________________________________________________________________    |
|                                                                            |
|     C.-W. Tseng                       - University of Maryland             |
|     J. Prins, S. Olivier, J. Liu      - UNC Chapel Hill                    |
|     J. Dinan, G. Sabin, P. Sadayappan - The Ohio State University          |
|     D. Pryor                          - Supercomputing Research Center     |
'----------------------------------------------------------------------------'

INTRODUCTION
===============

The UTS benchmark performs an exhaustive search on an unbalanced tree.  The
tree is generated on the fly using a splittable random number generator (RNG)
that allows the random stream to be split and processed in parallel while still
producing a deterministic tree.  The splittable RNG has been constructed using
the SHA1 secure hash algorithm.  Thus, generating a node's children requires
multiple applications of the SHA1 hash algorithm to generate splittable hashes
for each child.

For more information on UTS, see [REFERENCES] below.


CONFIGURING UTS
===============

Select a configuration file for your system from the 'config/' directory using
the 'configure.sh' script.  After running the configure script, edit
'config.in' to review the build parameters.

BUILDING UTS
==============

The Makefiles provided with UTS require GNU Make.  To build UTS:

  $ gmake [targets]

The following targets are available:
   uts-seq     - Sequential implementation
   uts-stat    - Sequential implementation for gathering tree statistics
		 (statistics output in file stat.txt)
   uts-upc     - UPC with workstealing distributed load balancing
   uts-upc-enhanced - UPC with improved performance on distributed memory systems
   uts-upc-dcq - UPC shared global work queue load balancer
   uts-omp     - OpenMP workstealing
   uts-shmem   - Shmem workstealing
   uts-pthread - Pthread workstealing 
   uts-mta     - MTA-2 futures-parallel recursive implementation
   uts-dfs     - Sequential recursive implementation (built from from uts-mta)
   uts-mpi-wm  - MPI with work manager centralized load balancing
   uts-mpi-ws  - MPI with workstealing distributed load balancing
   time_rng    - Microbenchmark to measure peak RNG spawn time


RUNNING UTS
==============

Sample workloads are provided in the sample_trees files.  These files are
scriptable and are meant to be used as in:

  $ source sample_trees.sh
  $ ./uts-seq $T1

Or if using csh:

  $ source sample_trees.csh
  $ ./uts-seq $T1

Depending on the number and speed of processors used, we suggest
measuring performance for the following trees:

  T3L   (~100 million nodes) for 1-8 processors 
  T3XXL (~3 billion nodes)   for 8-128 processors
  T2WL  (~300 billion nodes) for 128-1024+ processors

More and less verbose output can be gathered using the '-v #' flag.  '-v 0'
provides concise data suitable for graphing, '-v 1' is the normal output, level
2 provides additional load balancing statistics, and higher levels provide
greater detail.

In addition to this, extra parameters are available for different
implementations that can be used to fine-tune load balancing or get more
verbose output.  These parameters can be viewed by running one of the uts
targets with the '-h' flag.

The chunk size parameter '-c' is the main knob used to tune load balancing.  It
is available on most parallel builds of UTS and it controls the granularity
with which load balancing is performed.  Specifically, it controls how many
nodes are transferred from one process to another as a result of a load
balancing operation.  Smaller chunk sizes allow for more fine-grained load
balancing but cost more in overhead.  Larger chunk sizes have lower overhead
but can result in load imbalance.  In practice, the best chunk size will vary
for a given system and parallel implementation and will also vary slightly over
different workloads.  Finding a single chunk size that performs well on all
workloads can be achieved through experimentation.  By default, UTS uses a
chunk size that gives reasonable performance across a range of systems.

For example, to perform a parallel search on T3 using 4 Pthreads with a chunk
size of 20:

 $ ./uts-pthread -T 4 -c 20 $T3


THE OUTPUT
==============

The benchmark's result is to print out the total number of nodes explored 
in the tree.  An execution run is valid if the number of nodes explored
matches the number of nodes given in the sample_trees script.  The benchmark 
also outputs the performance that was achieved in nodes explored per second.  
The peak performance for a system can be calculated using the time_rng 
microbenchmark.  In practice, actual performance will be less than the 
value given by the microbenchmark because of additional search operations, 
load imbalance, and load balancing overhead.


UTS FILES
===============

Random Number Generators (RNG):
   rng/               - Location of random number generators
   alfg.c             - Additive lagged fibonacci generator
   alfg.h
   brg_sha1.c         - Dr. Brian R. Gladman's SHA-1 implementation
   brg_...

Configuration Files:
   configure.sh       - Configure UTS (generates config.in)
   config/            - Location where configuration files are stored
   config.in          - Symlink to your configuration file in config/
   Makefile           - GNU Makefile, reads config.in to build UTS

Core Files:
   dequeue.c          - Double-ended queue data structure
   dequeue.h
   dlist.c            - Doubly linked list data structure
   dlist.h
   shared_dequeue.c   - UPC shared double-ended queue
   shared_dequeue.h
   shared_dlist.c     - UPC shared doubly linked list
   shared_dlist.h

   uts.c,h            - UTS tree generation library 
   uts_dm.c,h         - Base for distributed memory/PGAS implementations 
			(MPI, UPC-DCQ)
   uts_shm.c          - Base for shared memory implementations 
			(UPC, OpenMP, Shmem, Serial, and Stats)
   uts_upc_enhanced.c - UPC implementation with improved performance on 
			distributed memory systems
   uts_dfs.c          - Base for MTA-2 and serial uts-dfs.  This code 
			performs a recursive search.
   mpi_worksharing.c  - Work-Manager implementation using MPI
   mpi_workstealing.c - WorkStealing implementation using MPI
   upc_worksharing.c  - Distributed global queue using UPC
   
   time_rng.c         - RNG timing program

Debugging Files:
   stats.c            - Tracing and additional statistics generation code
   check_ctrk.pl      - Chunk tracker debugging tool
   ctrk.h

Documentation:
   LICENSE            - The license that UTS is distributed under
   AUTHORS            - Contributors to UTS
   README             - This file
   sample_trees.sh    - Sample inputs to UTS (sh shell script)
   sample_trees.csh   - Sample inputs to UTS (csh shell script)


REFERENCES
===============

For more information, please refer to the following publications:

S. Olivier, J. Prins, "Scalable Dynamic Load Balancing Using UPC", Proc. of 
37th International Conference on Parallel Processing (ICPP-08), Portland, OR, 
September 2008.

J. Dinan, S. Olivier, J. Prins, G. Sabin, P. Sadayappan, C.-W. Tseng, "Dynamic
Load Balancing of Unbalanced Computations Using Message Passing", Proc. of 6th
Intl. Workshop on Performance Modeling, Evaluation, and Optimization of
Parallel and Distributed Systems (PMEO-PDS 2007), Long Beach, CA, 2007. 

S. Olivier, J. Huan, J. Liu, J. Prins, J. Dinan, P. Sadayappan, C.-W. Tseng,
"UTS: An Unbalanced Tree Search Benchmark", 19th International Workshop on
Languages and Compilers for Parallel Computing (LCPC 2006), 2006. 

J. Prins, J. Huan, W. Pugh, C.-W. Tseng, P. Sadayappan, "UPC Implementation of
an Unbalanced Tree Search Benchmark", UNC Dept. of Computer Science Technical
Report 03-034, Oct 2003