                    ARMCI on MPI-RMA Implementation Notes
                       James Dinan <dinan@mcs.anl.gov>

===============================================================================
Introduction
===============================================================================

This project provides a complete, high-performance, portable implementation of
the ARMCI runtime system using MPI's remote memory access (RMA) functionality.

===============================================================================
Installing ARMCI-MPI
===============================================================================

ARMCI-MPI uses autoconf and must be configured before compiling:

 $ ./configure

Many configure options are provided; run "./configure --help" for details.
After configuring the source tree, build and install the code by running:

 $ make && make install

ARMCI-MPI can be used with GA 5.0 and later by substituting this directory for
the "armci" directory in the GA distribution.  The quality of MPI-RMA
implementations varies.  As of May 2011, the following MPI implementations are
known to work correctly with ARMCI-MPI:

 + MVAPICH2 1.6
 + MPICH2
 + Cray MPI on Cray XE6
 + IBM MPI on BG/P (set ARMCI_IOV_METHOD=BATCHED)

The following MPI implementations are known to fail with ARMCI-MPI:

 - MVAPICH2 prior to 1.6
 - Open MPI

===============================================================================
ARMCI-MPI Errata
===============================================================================

Direct access to local buffers:

 * Because of MPI's semantics, you may not access shared memory directly; all
   access must go through put/get.  Alternatively, you can use the new
   ARMCI_Access_begin/end() functions.
   
Progress semantics:

 * On some MPI implementations and networks you may need to enable implicit
   progress.  In many cases this is done through an environment variable.  For
   example: for MPICH2, set MPICH_ASYNC_PROGRESS; for MVAPICH2, recompile with
   --enable-async-progress and set MPICH_ASYNC_PROGRESS; for MPICH2-BG, set
   DCMF_INTERRUPTS=1.
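
As a minimal sketch, the settings above could be exported in the shell before
launching the application.  The value 1 for MPICH_ASYNC_PROGRESS is an
assumption (the notes above only say that the variable must be set), and the
mpiexec line in the comment is a hypothetical placeholder for your own launcher
and executable.

```shell
# Enable asynchronous progress for MPICH2 / MVAPICH2.
# Assumption: "1" is an accepted value; check your MPI documentation.
export MPICH_ASYNC_PROGRESS=1

# On MPICH2-BG, enable interrupts instead (value taken from the notes above).
export DCMF_INTERRUPTS=1

# Hypothetical launch line; substitute your own launcher and executable:
#   mpiexec -n 4 ./my_armci_app
```

Only the variable that matches your MPI implementation is needed; setting both
is shown here purely for illustration.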

===============================================================================
Environment Variables
===============================================================================

 -------------------
: Debugging Options :
 -------------------

ARMCI_VERBOSE

  Enable extra status output from ARMCI-MPI.

ARMCI_DEBUG_ALLOC

  Turn on extra shared allocation debugging.

ARMCI_NO_FLUSH_BARRIERS

  Don't do extra communication flushing in ARMCI_Barrier.  The extra flushes
  are present to help make unsafe direct local access (DLA) safer.

 --------------------------
: Shared Buffer Protection :
 --------------------------

ARMCI_SHR_BUF_METHOD = { COPY (default), NOGUARD }

  ARMCI policy for managing shared origin buffers in communication operations:
  lock the buffer (unsafe, but fast), copy the buffer (safe), or don't guard
  the buffer, which assumes that the system is cache coherent and that MPI
  supports unlocked load/store.

 --------------------
: I/O Vector Options :
 --------------------

ARMCI_IOV_METHOD = { AUTO (default), CONSRV, BATCHED, DIRECT }

  Select the I/O vector communication strategy: automatic selection (AUTO); a
  "conservative" implementation that does lock/unlock around each operation
  (CONSRV); an implementation that issues batches of operations within a
  single lock/unlock epoch (BATCHED); or a direct implementation that
  generates datatypes for the origin and target and issues a single operation
  using them (DIRECT).

ARMCI_IOV_DISABLE_CHECKS

  Disable expensive IOV safety checks (recommended for performance).

ARMCI_IOV_BATCHED_LIMIT = { 0 (default), 1, ... }

  Set the maximum number of one-sided operations per epoch for the BATCHED IOV
  method.  Zero (default) is unlimited.
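
To illustrate how the limit interacts with the size of a vector operation, the
sketch below computes a lower bound on the number of lock/unlock epochs needed
for a given number of one-sided operations.  The function name and the sample
counts are hypothetical; this is plain arithmetic derived from the description
above, not code taken from ARMCI-MPI, and the library may use more epochs than
this bound.

```shell
# epochs_needed NOPS LIMIT
# Prints the smallest number of epochs that can hold NOPS operations when at
# most LIMIT operations are allowed per epoch.  LIMIT=0 means unlimited, so
# all operations fit in a single epoch.
epochs_needed() {
  nops=$1
  limit=$2
  if [ "$nops" -eq 0 ]; then
    echo 0
  elif [ "$limit" -eq 0 ]; then
    echo 1
  else
    echo $(( (nops + limit - 1) / limit ))   # ceiling division
  fi
}

export ARMCI_IOV_METHOD=BATCHED
export ARMCI_IOV_BATCHED_LIMIT=64

epochs_needed 1000 "$ARMCI_IOV_BATCHED_LIMIT"   # prints 16
epochs_needed 1000 0                            # prints 1
```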
  
ARMCI_IOV_NO_MPI_BOTTOM

  Disable the use of MPI_BOTTOM in datatypes.  This is a workaround for MPI
  implementations that don't correctly handle MPI datatypes that are relative
  to MPI_BOTTOM.

 -----------------
: Strided Options :
 -----------------

ARMCI_STRIDED_METHOD = { DIRECT, IOV (default) }

  Select the method for processing strided operations.
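
Because the method variables accept only the values listed in this section, a
launch script can check a setting before starting the job.  The sketch below is
one way to do that; the helper name check_choice is hypothetical, and how
ARMCI-MPI itself reacts to an unrecognized value is not covered here.

```shell
# check_choice NAME VALUE ALLOWED...
# Returns 0 if VALUE is one of the ALLOWED words; otherwise prints a message
# to stderr and returns 1.
check_choice() {
  name=$1
  value=$2
  shift 2
  for allowed in "$@"; do
    if [ "$value" = "$allowed" ]; then
      return 0
    fi
  done
  echo "$name: unsupported value '$value'" >&2
  return 1
}

export ARMCI_STRIDED_METHOD=DIRECT
export ARMCI_IOV_METHOD=BATCHED
export ARMCI_SHR_BUF_METHOD=COPY

check_choice ARMCI_STRIDED_METHOD "$ARMCI_STRIDED_METHOD" DIRECT IOV
check_choice ARMCI_IOV_METHOD "$ARMCI_IOV_METHOD" AUTO CONSRV BATCHED DIRECT
check_choice ARMCI_SHR_BUF_METHOD "$ARMCI_SHR_BUF_METHOD" COPY NOGUARD
```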
