#+TITLE: Frequently asked questions (FAQ) about Tpetra
#+AUTHOR: Mark Hoemmen
#+DATE: Time-stamp: "2019-04-16 09:58:15 mhoemme"

* If this is just a text file, why does it have strange mark-up?

This file /is/ indeed a plain text file.  It uses Emacs' org-mode for
mark-up.  Do a web search for "Emacs org-mode", or look at the
following website: http://orgmode.org

This file is best viewed and edited using Emacs (version >= 23), but
you may view and edit it using any text editor.  Editors, please defer
to the mark-up conventions already existing in this document.  For
readers who are not familiar with org-mode, please note that text
between tildes, like ~this~, denotes literal (verbatim) text.  We use
this with source code, error messages, and other such things that call
for literalness.

* Building and running Tpetra
** How do I build and run Tpetra to use OpenMP for threads?
*** Introduction

OpenMP is a portable standard for thread parallelism.  Tpetra uses
Kokkos (see github.com/kokkos) to run in a thread-parallel way using
any of various back-ends, including OpenMP.  You do not need to know
OpenMP in order to build or use Tpetra on GPUs.  However, you should
have some familiarity with thread parallelism and Kokkos ideas.

*** Assumptions

Tpetra relies on Kokkos for OpenMP support.  Thus, Tpetra inherits
Kokkos' assumptions about OpenMP.  Kokkos's OpenMP back-end assumes
only OpenMP 3.x; it does not currently require OpenMP 4.x.  (Kokkos
has a separate OpenMP 4.x back-end in development.)

In order for Tpetra to use OpenMP, you must build Trilinos with a
compiler that supports OpenMP 3.x.

*** How do I build Tpetra for OpenMP?

Set the following CMake variable:

#+BEGIN_EXAMPLE
-D Trilinos_ENABLE_OpenMP:BOOL=ON
#+END_EXAMPLE

Setting this option tells Trilinos' build system to deduce any
OpenMP-related compiler and linker flags.  Trilinos will then add
those flags automatically to the compiler's or linker's command line.
You generally do not need to add such flags to e.g.,
~CMAKE_CXX_FLAGS~.  We have success with this approach with the GCC
and Intel compilers; other compilers are supported too.  If Trilinos
cannot figure out the OpenMP compiler or linker flags, the
configuration process will fail with an informative error message.  In
that case, you must add the required flags yourself.

If you enable OpenMP using the above instructions, Kokkos will enable
its OpenMP support by default.  If Kokkos enables its OpenMP support,
Tpetra will enable its OpenMP support by default.  (See "How does
Tpetra pick its default Kokkos execution space?" below.)  Avoid
setting CMake options explicitly; this hinders portability and makes
your build configuration unnecessarily complicated.

*** How do I set run-time options for Intel KNL?

"KNL" refers to "Knights Landing," a code name for Intel's
second-generation Xeon Phi processor, with product numbers 72x0 or
72x0F (replace "x" with a single-digit number).  For best practice
when running on KNL, please refer to the following Trilinos GitHub
issue:

https://github.com/trilinos/Trilinos/issues/1727

** How do I build and run Tpetra for NVIDIA GPUs (CUDA)?
*** Introduction

Tpetra uses Kokkos (see github.com/kokkos) to run in a thread-parallel
way on NVIDIA Graphics Processing Units (GPUs).  Tpetra does so using
Kokkos' CUDA back-end.  CUDA is a programming language that extends a
subset of the C++ programming language.  You do not need to know CUDA
in order to build or use Tpetra on GPUs.  However, you should have
some familiarity with Kokkos ideas of multiple memory spaces, multiple
execution spaces, and asynchronous kernel launch.

*** Assumptions

Tpetra requires at least CUDA 7.5, and is regularly tested with CUDA
8.0.  Tpetra itself requires C++11 and depends on the optional "device
lambdas" feature in order to build and pass tests.  Trilinos currently
tests CUDA builds only with MPI enabled, and exercises the OpenMPI
implementation of MPI most heavily.  We recommend OpenMPI 1.8.x at a
minimum, preferably newer.  Other MPI implementations should work.

Tpetra currently assumes Unified Virtual Memory (UVM) with CUDA.  It
currently requires kernel launches to block.  In particular, it
requires the following environment variables to be set:

#+BEGIN_EXAMPLE
  export CUDA_LAUNCH_BLOCKING=1
  export CUDA_MANAGED_FORCE_DEVICE_ALLOC=1
#+END_EXAMPLE

We are currently working on relaxing these restrictions.  This is not
just a Tpetra issue, however; downstream solver libraries in Trilinos
currently depend on these settings.  Those libraries must be changed
in order to relax these restrictions for all Trilinos users.

*** How do I build Tpetra for GPUs?

First, set the following environment variables, in which
~TRILINOS_PATH~ points to your Trilinos source directory:

#+BEGIN_EXAMPLE
  export OMPI_CXX=${TRILINOS_PATH}/packages/kokkos/bin/nvcc_wrapper
  export CUDA_LAUNCH_BLOCKING=1
  export CUDA_MANAGED_FORCE_DEVICE_ALLOC=1
#+END_EXAMPLE

Then set the following CMake variables, in addition to the CMake
variables that you would normally set:

#+BEGIN_EXAMPLE
-D TPL_ENABLE_MPI:BOOL=ON
-D TPL_ENABLE_CUDA:BOOL=ON
-D Kokkos_ENABLE_Cuda_UVM:BOOL=ON
-D Kokkos_ENABLE_Cuda_Lambda:BOOL=ON
#+END_EXAMPLE

If you already define ~CMAKE_CXX_FLAGS~ yourself, just make sure that
you include the above options.  If the versions of ~nvcc~ and ~mpicxx~
that live in your path are correct, then this should work.

If you enable CUDA using the above instructions, Kokkos will enable
its CUDA support by default.  If Kokkos enables its CUDA support,
Tpetra will enable its CUDA support by default.  (See "How does Tpetra
pick its default Kokkos execution space?" below.)  Avoid setting CMake
options explicitly; this hinders portability and makes your build
configuration unnecessarily complicated.  However, if you wish to
enable Tpetra's CUDA support explicitly, you may add the following
CMake option:

#+BEGIN_EXAMPLE
-D Tpetra_INST_CUDA:BOOL=ON
#+END_EXAMPLE

*** Do I need a special MPI implementation to use CUDA?

Some MPI implementations are "CUDA aware."  This means that they know
how to send and receive from CUDA device buffers.  Tpetra attempts to
detect whether the MPI implementation is CUDA aware at configure time.
However, it currently (as of Aug / Sep 2017) only knows how to do so
for OpenMPI.  Furthermore, automatic detection only works if not cross
compiling, since it requires running an OpenMPI executable.  If
automatic detection does not work, you may set the CMake variable
~TPETRA_ASSUME_CUDA_AWARE_MPI~ to control the default assumption at
configure time.  For example:

#+BEGIN_EXAMPLE
-D TPETRA_ASSUME_CUDA_AWARE_MPI:BOOL=ON
#+END_EXAMPLE

You may also tell Tpetra, via the environment variable
~TPETRA_ASSUME_CUDA_AWARE_MPI~, whether to assume that the MPI
implementation that Trilinos uses is CUDA aware.  To make Tpetra
assume that MPI is CUDA aware, set this environment variable to some
true value, like "1" or "TRUE".  To make Tpetra assume that MPI is
/not/ CUDA aware, by setting this environment variable to some false
value, like "0" or "FALSE".  (Tpetra interprets an empty string "" as
true.)

** What about Pthreads?  Does Tpetra support that?

Tpetra does have an option to use Pthreads (POSIX Threads) for thread
parallelism.  It does so through Kokkos, using the ~Kokkos::Threads~
execution space.  However, Trilinos disables this option by default.
This is for the following reasons:

  1. Trilinos automatically enables Pthreads if it finds the library
     on your system.  This surprises users who did not intend to run
     with threads enabled.
  2. Kokkos' Pthreads back-end is slower than its OpenMP back-end.
  3. Use of OpenMP and Kokkos' Pthreads back-end in the same
     application can hurt performance of both.  This is because
     compilers' OpenMP implementations often set up their own threads
     that keep running continuously to reserve hardware resources.
     These threads fight the threads that Kokkos' Pthreads back-end
     launches.
  4. It is possible for multiple OpenMP libraries to work correctly
     and performantly, as long as they share a common OpenMP run-time
     environment.  It is /not/ possible for multiple libraries that
     each launch their own Pthreads to work correctly and
     performantly, for reasons analogous to those in (3) above.

If you really /really/ want Tpetra to use Pthreads, you will need to
do the following:

  1. Make sure that Trilinos enables the Pthreads TPL, by checking
     that the CMake variable ~TPL_ENABLE_Pthread~ is ~ON~.  If not,
     you may need to tell Trilinos where to find the Pthreads library
     and header file.
  2. Enable Kokkos' Pthreads support, by setting
     ~Kokkos_ENABLE_Pthread:BOOL=OFF~.
  3. Enable Tpetra's Pthreads support, by setting 
     ~Tpetra_INST_PTHREAD:BOOL=OFF~.
  4. Make Tpetra use Pthreads by default, by setting
     ~Tpetra_INST_SERIAL:BOOL=OFF~.

If you choose to do this, we /highly/ recommend turning off OpenMP.
Step 4 is optional, but if you do not do Step 3, Pthreads will not be
Tpetra's default execution space.  In that case, if you want to use
Pthreads in Tpetra, you must set the ~Node~ template parameter
explicitly to
~Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Threads>~.

** Weird C++ errors with Intel compiler

The Intel compiler uses GCC header files for the C++ Standard Library.
Versions of GCC < 4.7.x have poor or no support for C++11.  If the
Intel compiler uses C++11, but the version of the GCC headers that it
accesses is < 4.7.x, this will cause build errors.  The resulting
errors might look like this:

#+BEGIN_EXAMPLE
[  0%] Building CXX object
commonTools/gtest/CMakeFiles/gtest.dir/gtest/gtest-all.cc.o
cd /tmp/trilinos-build/commonTools/gtest &&
/opt/apps/intel16/cray_mpich/7.2.4/bin/mpicxx   -Dgtest_EXPORTS -mkl
-DMPICH_SKIP_MPICXX -std=c++11 -O3 -DNDEBUG -fPIC
-I/tmp/trilinos-build -I/admin/build/rpms/BUILD/trilinos-12.2.2/co\
mmonTools/gtest    -o CMakeFiles/gtest.dir/gtest/gtest-all.cc.o -c
/admin/build/rpms/BUILD/trilinos-12.2.2/commonTools/gtest/gtest/gtest-all.cc
/usr/include/c++/4.3/ext/new_allocator.h(114): error: a value of type
"long" cannot be used to initialize an entity of type "char *"
        { ::new((void *)__p) _Tp(std::forward<_Args>(__args)...); }
                                 ^
          detected during:
            instantiation of "void
            __gnu_cxx::new_allocator<_Tp>::construct(__gnu_cxx::new_allocator<_Tp>::pointer,
            _Args &&...) [with _Tp=char *, _Args=<long>]" at line 704
            of "/usr/include/c++/4.3/bits/stl_vector.h"
            instantiation of "void std::vector<_Tp,
            _Alloc>::push_back(_Args &&...) [with _Tp=char *,
            _Alloc=std::allocator<char *>, _Args=<long>]" at line
            7384 of
            "/admin/build/rpms/BUILD/trilinos-12.2.2/commonTools/gtest/gtest/\
gtest-all.cc"

compilation aborted for
/admin/build/rpms/BUILD/trilinos-12.2.2/commonTools/gtest/gtest/gtest-all.cc
(code 2)
#+END_EXAMPLE

You can tell the Intel compiler to use a specific set of GCC headers
by setting its "-gxx-name" flag.  Please refer to the Intel
compiler's documentation for details.

** NVCC link errors in debug builds

NVCC's linker may report errors of the following form:
#+BEGIN_EXAMPLE
nvcc error   : 'nvlink' died due to signal 11 (Invalid memory reference)
nvcc error   : 'nvlink' core dumped
#+END_EXAMPLE
See [[https://github.com/trilinos/Trilinos/issues/655][#655]] for an example.  This is not a Tpetra issue; it usually
manifests downstream.  The cause appears to be excessively large
executables.  It manifests even with dynamic shared libraries enabled.

One thing that can help is to set the CMake option
~CMAKE_CXX_FLAGS_DEBUG:STRING="-g -Os"~.  The -Os flag tells the
compiler to minimize code size.  You may do this in your debug build
by adding the following to your CMake invocation line:
#+BEGIN_EXAMPLE
-D CMAKE_CXX_FLAGS_DEBUG:STRING="-g -Os"
#+END_EXAMPLE

** "Error 127" link error in CUDA builds

In a CUDA build of kokkos-kernels (upstream of Tpetra) with complex
arithmetic and explicit template instantiation (ETI) enabled, one
might see a link error with the mysterious "Error 127" message.

Short answer: Try setting the following CMake option:
#+BEGIN_EXAMPLE
-D CMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS:BOOL=ON
#+END_EXAMPLE

For details, see the following GitHub issue:

https://github.com/trilinos/Trilinos/issues/2141

and the following instructions:

http://trilinos.org/docs/files/TrilinosBuildReference.html#enabling-the-usage-of-resource-files-to-reduce-length-of-build-lines

If you have trouble seeing the above trilinos.org link, due to web
page certificate issues, please refer to the following tribits.org
link instead:

https://tribits.org/doc/TribitsBuildReference.html#enabling-the-usage-of-resource-files-to-reduce-length-of-build-lines

* Thread safety
** Weak and strong thread safety

Tpetra uses the term /thread safe/ in two different ways, which I will
call "weak" and "strong."  A function or method has /weak thread
safety/ when calling it concurrently by different threads does not
corrupt state, as long as the concurrent updates do not write to the
same data.  /Strong thread safety/ takes away the latter restriction,
by using Kokkos' atomic updates to ensure safe concurrent writes.

* What is the "Node" template parameter?  How does it relate to Kokkos?

Most Tpetra objects take a "Node" template parameter.  This
corresponds exactly to a ~Kokkos::Device~ specialization.  It governs
the Kokkos execution space that Tpetra objects use for thread-parallel
operations, and the Kokkos memory space that Tpetra objects use for
memory allocations.

The only valid Node types as of Trilinos 12.x are specializations of
~Kokkos::Compat::KokkosDeviceWrapperNode~, which lives in
~teuchos/kokkoscompat/src/KokkosCompat_ClassicNodeAPI_Wrapper.hpp~.
You do /not/ need to include this header file or worry about the
details of Node.  Node is a remnant of the 2008-9 version of Kokkos by
Chris Baker.  It will go away at some point, to be replaced by direct
use of ~Kokkos::Device~.

* Tpetra's choices of default enabled template parameters

** How does Tpetra pick its default Kokkos execution space?

Tpetra enables exactly one Kokkos execution space by default.  This
keeps down build times and library sizes.  Tpetra uses the following
rules to decide which execution space to use by default:

  - If CUDA is enabled, use ~Kokkos::Cuda~.
  - Else, if OpenMP is enabled, use ~Kokkos::OpenMP~.
  - Else, if ~Kokkos::Serial~ is enabled, use ~Kokkos::Serial~.
  - Else, if ~Kokkos::Threads~ is enabled, use ~Kokkos::Threads~.
  - Otherwise, report an error (no Kokkos execution spaces are
    enabled).

** How do I enable other Kokkos execution spaces in Tpetra?

  - Set ~Tpetra_INST_CUDA=ON~ to enable ~Kokkos::Cuda~
  - Set ~Tpetra_INST_OPENMP=ON~ to enable ~Kokkos::OpenMP~
  - Set ~Tpetra_INST_PTHREAD=ON~ to enable ~Kokkos::Threads~
  - Set ~Tpetra_INST_SERIAL=ON~ to enable ~Kokkos::Serial~

** Why does Kokkos forbid enabling both OpenMP and Pthreads?

As of Dec 2017, Kokkos forbids, at configure time, enabling both
~Kokkos::OpenMP~ (OpenMP back-end) and ~Kokkos::Threads~ (Pthreads
back-end).  This would be a bad idea anyway, because using both
execution spaces at the same time would result in very poor
performance.  This is because ~Kokkos::Threads~ uses Pthreads; OpenMP
uses its own threads.  The two sets of worker threads fight over
hardware resources.  As soon as you start up either of these execution
spaces in the same executable, it dominates the hardware and makes
life hard for the other one.  Pick one of these two (we prefer OpenMP)
and stick to it.

** What GlobalOrdinal (GO) types does Tpetra allow?

Tpetra has support for the following GlobalOrdinal (GO) types:

  - ~long long~ (preferred)
  - ~long~
  - ~int~
  - ~unsigned long~ (NOT preferred)
  - ~unsigned~ (REALLY NOT preferred)

However, the type(s) you use /must/ have been enabled.  You may enable
GO types by setting the following CMake options at configure time:

  - Set ~Tpetra_INST_INT_LONG_LONG=ON~ to enable ~long long~
  - Set ~Tpetra_INST_INT_LONG=ON~ to enable ~long~
  - Set ~Tpetra_INST_INT_INT=ON~ to enable ~int~
  - Set ~Tpetra_INST_INT_UNSIGNED_LONG=ON~ to enable ~unsigned long~
  - Set ~Tpetra_INST_INT_UNSIGNED=ON~ to enable ~unsigned~

See the next frequently asked question to learn how Tpetra picks what
GO types it enables by default.

** How does Tpetra pick which GO types to enable by default?

By default, Tpetra currently enables ~GO = int~ and ~GO = long long~.
If users enable ~GO = long~ (by setting ~Tpetra_INST_INT_LONG:BOOL=ON~
explicitly), then Tpetra does not enable ~GO = long long~ by default.
Tpetra prefers ~GO = long long~, because the C++11 standard requires
that ~long long~ have at least 64 bits.  Depending on the platform,
~long~ could be either 32 or 64 bits.  Tpetra does not enable more
types by default, because that would increase build times and library
sizes.  Bug 6358 (in the Bugzilla bug tracking system) prevents Tpetra
from disabling ~GO = int~ by default.

** What Scalar types does Tpetra allow?

Tpetra has support for the following Scalar types:

  - ~double~ (preferred)
  - ~float~
  - ~std::complex<double>~
  - ~std::complex<float>~
  - ~__float128~ (GCC extension; NOT allowed with CUDA)

However, the type(s) you use /must/ have been enabled.  You may enable
GO types by setting the following CMake options at configure time:

  - Set ~Tpetra_INST_INT_DOUBLE=ON~ to enable ~double~
  - Set ~Tpetra_INST_INT_FLOAT=ON~ to enable ~float~
  - Set ~Tpetra_INST_INT_COMPLEX_DOUBLE=ON~ to enable
    ~std::complex<double>~
  - Set ~Tpetra_INST_INT_COMPLEX_FLOAT=ON~ to enable
    ~std::complex<float>~
  - Set ~Tpetra_INST_INT_FLOAT128=ON~ and enable the "libquadmath" TPL
    to enable ~__float128~

Some of these are enabled by default.  See the next frequently asked
question to learn how Tpetra picks what Scalar types it enables by
default.

** How does Tpetra pick which Scalar types to enable by default?

In a release build, Tpetra only enables ~Scalar = double~ by default.
In a debug build, Tpetra enables both ~Scalar = double~ and ~Scalar =
std::complex<double>~ by default.

* Explicit template instantiation (ETI)

** What is explicit template instantiation?

ETI stands for "explicit template instantiation."  The compiler
instantiates a templated class or function when it fills in all the
template parameters with actual types, and compiles the code.  The
compiler does parse templated code when encountered, but it cannot
actually compile templated code until all the template parameters are
filled in.  /Implicit/ instantiation is what happens normally in C++,
when you declare a templated class or function, and then use it,
filling in the template parameters with actual types.  For example:

#+BEGIN_SRC C++
// Foo.hpp:

template <class T>
class Foo {
public:
  T bar (const T& x) {
    return x * x; // must have generic implementation
  }
};
#+END_SRC

#+BEGIN_SRC C++
// main.cpp:

#include "Foo.hpp"
int main () {
  int x = 42;
  Foo<int> f;
  int y = f.bar (x);
  return y;
}
#+END_SRC

Line 2 of ~main()~ implicitly instantiates the ~Foo~ class for ~T =
int~.  ~Foo~ has no code until the compiler sees ~Foo<int>~; at that
point, it compiles the methods of ~Foo~ that actually get used.

/Explicit/ instantiation means to force the compiler to compile code,
by filling in template parameters explicitly.  For example:

#+BEGIN_SRC C++
// main.cpp:

#include "Foo.hpp"
template class Foo<double>; // explicitly instantiate for T = double
int main () {
  int x = 42;
  Foo<int> f; // implicitly instantiate for T = int
  int y = f.bar (x);
  return y;
}
#+END_SRC

Even though the code doesn't actually use ~Foo<double>~, the explicit
instantiation means that it compiles ~Foo<double>~.  The example also
shows that implicit and explicit instantiation can coexist for
different template parameter combinations.  Here, main.cpp implicitly
instantiates (and uses) ~Foo~ for ~T = int~, but explicitly
instantiates (and does /not/ use) ~Foo~ for ~T = double~.

The above example does ad hoc explicit instantiation.  Trilinos' ETI
system does explicit instantiation systematically.  The next section
will describe how Trilinos does this.

** How does Trilinos do ETI?

Trilinos has the option to use ETI to speed up builds.  In order to
enable ETI in Trilinos, set the CMake option
~Trilinos_ENABLE_EXPLICIT_INSTANTIATION:BOOL=ON~.

Trilinos' ETI system works like this:

  1. A package may opt into ETI.  Even if a package opts in, templated
     classes in that package may still choose whether or not to
     participate in ETI.  Not all templated classes must participate.
     A class that participates, does two things:

     - Divides its header file into two header files, one with
       declarations (~$NAME_decl.hpp~) and one with definitions
       (~$NAME_def.hpp~)
     - Explicitly instantiates itself, for a fixed set of template
       parameter combinations, in one or more .cpp files
       (~$NAME*.cpp~)

  2. Trilinos automatically generates ~$NAME.hpp~ header files from
     ~$NAME_decl.hpp~ and ~$NAME_def.hpp~ header files.

     - If ETI is OFF, ~$NAME.hpp~ includes both ~$NAME_decl.hpp~ and
       ~$NAME_def.hpp~.
     - If ETI is ON, ~$NAME.hpp~ includes ONLY ~$NAME_decl.hpp~, the
       declaration of the templated class (that participates in ETI).

  3. If ETI is ON, Trilinos builds the .cpp files that do explicit
     instantiation.  This "pre-builds" classes for a finite enumerated
     set of template parameter combinations.
  4. Code that uses a templated class that participates in Trilinos'
     ETI must use one of the enabled template parameter combinations.
     If it uses some other combination, the code will fail to link.

Trilinos' ETI system does /not/ depend on compiler-specific constructs
(like compiler flags), nor does it depend on a compiler-specific model
of template instantiation (see
https://gcc.gnu.org/onlinedocs/gcc/Template-Instantiation.html).  It
requires only standard C++.  The system uses CMake build logic that
generates preprocessor macros and some C++ code.

Trilinos uses CMake to do the following:

  - Define a set of template parameter combinations over which to do
    instantiations and tests (the latter is defined even if ETI is
    OFF).  For example, Tpetra does instantiations and tests over
    4-tuples of template parameters (Scalar, LocalOrdinal,
    GlobalOrdinal, Node).  Each of these template parameters has a set
    of values enabled by default, and users may set CMake variables to
    change or add to this set.
  - Generate macros that "iterate" over all template parameter
    combinations, in order to instantiate classes, functions, or
    tests.  Also, generate typedefs for template parameter values,
    that avoid issues with macro arguments not being allowed to have
    spaces.  The CMake generation code for (sub)package $PACKAGE lives
    in $PACKAGE/cmake/ExplicitInstantiationSupport.cmake.  It
    generates a header file with the "iteration" macros, typically
    called ~$PACKAGE_ETIHelperMacros.h~, that it writes to the
    package's build directory.
  - Automatically generate .cpp files that split up the explicit
    instantiations into a few or just one template parameter
    combination per file

** What are the advantages and disadvantages of ETI?

*** Much faster builds of tests, examples, and applications

When ETI is OFF, Trilinos must rebuild every templated class used in
every compilation unit (.cpp file that is compiled).  This makes
builds very slow, especially for deeply nested hierarchies of
templated classes (MueLu is a good example).  ETI lets Trilinos
"pre-build" those classes, so the compiler doesn't have to build them
again.  This makes building Trilinos' tests and examples, and
application code, a lot faster.  For example, a finite element code
that fills into a Tpetra sparse matrix, and creates and uses a MueLu
preconditioner with a Tpetra solver, may take 30 minutes to build with
ETI OFF.  Many developers find ~make -jN~ with ~N > 1~ unusable when
ETI is OFF.

*** Must build code, whether or not it gets used

The main disadvantage of ETI is that it actually requires building all
the classes that participate in the ETI system, for all enabled
template parameter combinations.  Those classes take up space in
Trilinos' libraries.  Furthermore, when using static libraries,
linking one method in a .o file (archived in the library) into an
executable pulls the whole .o file's contents into the executable.
Applications that use Trilinos only sparsely thus have to pull in more
code than they might need.  However, in practice, applications that
use MueLu use all of the Tpetra stack, because MueLu's factories can
create just about any Tpetra stack solver.  ShyLU behaves similarly.
The very "solver factories" demanded for the sake of usability require
including and building all solvers in a package, whether or not ETI is
ON.  Thus, sparse use of Trilinos' packages with Tpetra stack solvers
is rare in practice.

Trilinos packages like Tpetra and Ifpack2 make an effort to split up
many of the .cpp files that do explicit instantiations, in order to
minimize the amount of code to build per .cpp file.  (It generally
helps compilers optimize if .cpp files are shorter, so this is a good
idea anyway.)

*** Must know set of enabled types at Trilinos' configure time

One characteristic of ETI is that it forces Trilinos to enumerate the
set of enabled template parameter combinations at configure time,
before building Trilinos.  The advantage of this is that it makes
run-time registration of solvers (Stratimikos, LinearSolverFactory)
both correct and efficient.  (Run-time registration / dependency
inversion and injection requires filling in all the template
parameters; you can't register code that hasn't been built.)  The
disadvantage is that it forbids use of type combinations not in the
original set.

It's important to clarify the latter point.  If someone installs
Trilinos with ETI enabled, this means that the installer has defined
the set of type combinations that users may use with that
installation.  Users may not use any other type combinations.  This
may frustrate users who are not able or willing to build Trilinos
themselves, because they are stuck with the set of types that got
installed.  Most users are satisfied with the default enabled set,
though.  More importantly, the set of enabled types is exactly the set
of tested types, whether or not ETI is enabled.  Trilinos does not
promise correct behavior when using other types.

Tpetra is not like ~std::vector<T>~; you can't shove just any type
into it.  Adding new types requires implementing certain traits
classes for those types, and Kokkos imposes its own restrictions and
requirements.  The types might work with Tpetra, but not necessarily
with downstream packages; for example, arbitrary Scalar types in
Anasazi would require a fully templated LAPACK replacement, which we
do not have.  (~Teuchos::LAPACK~ only works with the four Scalar types
that the LAPACK library implements: S (float), D (double), C
(~std::complex<float>~), and Z (~std::complex<double>~).  Have fun
implementing a dense eigensolver for modular arithmetic or an
arbitrary-precision floating-point type!)

We have not yet documented a required interface or implemented tests
to verify the interface required by Tpetra's template parameters.
More importantly, funding for this work is very limited, for a use
case that most Tpetra and downstream package users never want to
exercise (besides perhaps the special cases of Sacado and Stokhos
Scalar types, which Trilinos handles for them).

** Is there an alternative to ETI that has similar benefits?

Yes.  We call it /full specializations/.  The full specializations
option is a compromise between the ETI OFF and ETI ON cases.  Code
still includes both declarations and definitions of templated classes.
However, the declaration header files also declare full
specializations of the classes.  These have all the template
parameters filled in.  Implementations of the full specializations
live in separate .cpp files, and are "pre-built," just as in the ETI
case.  As with ETI, the set of type combinations for which Trilinos
does full specializations must be defined at configure time.

The main advantage of this approach is that it does not constrain
users to the set of types enabled at configure time.  The disadvantage
is that declaring the full specializations adds code to the class
declarations.  When applications include the declaration, the compiler
must read all the declarations of full specializations.  We have not
yet explored the cost of this approach in comparison with Trilinos'
current ETI system.

Here is an example of the full specializations approach, for the above
~Foo~ class.

#+BEGIN_SRC C++
// Foo.hpp:

template <class T>
class Foo {
public:
  T bar (const T& x) {
    return x * x; // must have generic implementation
  }
};

// Declaration of full specialization for T = int.
template <>
class Foo<int> {
public:
  int bar (const int& x);
};

// Declaration of full specialization for T = double.
template <>
class Foo<double> {
public:
  double bar (const double& x);
};
#+END_SRC

#+BEGIN_SRC C++
// Foo.cpp:

#include "Foo.hpp"

// Definition of full specialization for T = int.
int Foo<int>::bar (const int& x) {
  // This happens to be the same implementation as the generic
  // version of bar(), but this doesn't have to be the case.
  return x * x;
}

// Definition of full specialization for T = double.
double Foo<double>::bar (const int& x) {
  // This happens to be the same implementation as the generic
  // version of bar(), but this doesn't have to be the case.
  return x * x;
}
#+END_SRC

#+BEGIN_SRC C++
// main.cpp:

#include "Foo.hpp"
#include <iostream>
int main () {
  int x_i = 3;
  Foo<int> foo_t; // use pre-built full specialization for T = int
  std::cout << foo_i.bar (x_i) << std::endl;

  float x_f = 3.14;
  Foo<float> foo_f; // implicitly instantiate Foo for T = float
  std::cout << foo_f.bar (x_f) << std::endl;

  return 0;
}
#+END_SRC

** Why does Tpetra restrict the set of enabled template parameter combinations?

Whether or not ETI is ON, restricting the set of template parameter
combinations that packages use has value in reducing library and
executable sizes.  With ETI ON, Trilinos pre-builds them; with ETI
OFF, applications still have to build them.

** Can I use non-enabled types in Tpetra when ETI is OFF?

Q: If ETI is OFF and Tpetra_INST_FLOAT = OFF, can someone still build
and run a test with ~Scalar = float~?  I thought that, if we built
with ETI=OFF, we could use whatever data types we wanted (as long as
we had time to wait for the compilation).

A. If ETI is OFF, Tpetra will work with whatever data types Tpetra
supports.  For example, you can use ~Scalar = float~.

Nevertheless, Tpetra still has a notion of the set of enabled type
combinations, whether or not ETI is enabled.  The enabled type
combinations go into those ~TPETRA_INSTANTIATE_*~ macros (that live in
the generated header file ~TpetraCore_ETIHelperMacros.h~).  Those
macros exist whether or not ETI is enabled.  Many Tpetra (and
downstream) tests use those macros to instantiate templated tests.
Thus, the set of /enabled/ types is the set of /tested/ types.

Note that Stratimikos, LinearSolverFactory, and anything else that
does automatic run-time registration, must have a finite enumerated
set of "enabled" template parameter combinations over which to do
registration.  This is independent of whether ETI is enabled.  For
example, if ~Scalar = float~ is disabled, LinearSolverFactory won't be
able to create solvers for ~Scalar = float~, unless users register
packages' factors with that type manually.  However, Tpetra will work
fine regardless.
* What is TSQR?

TSQR stands for "Tall Skinny QR" factorization.  Mark Hoemmen wrote an
implementation of TSQR in 2010.  It now lives in the TSQR subpackage
of Tpetra.  Both Anasazi and Belos have an option to use TSQR as a
block orthogonalization method.  Mark came up with TSQR in 2005
specifically for use as a block orthogonalization in Krylov subspace
methods.

** Build errors when I enable TSQR; why?

Packages that specialize ~Anasazi::MultiVecTraits~ or
~Belos::MultiVecTraits~ for MultiVector types, but omit a TSQR adapter
from their MultiVecTraits class, may cause build errors when TSQR is
enabled.  Since TSQR is disabled by default at the moment,
implementers tend to skip this crucial step.

Both Anasazi and ROL have examples of this error.  The work-around for
the former is to disable TSQR for Anasazi ONLY, via the CMake option
~Anasazi_ENABLE_TSQR:BOOL=OFF~, and leave it enabled for Belos.  ROL
specializes the Belos adapter, so the work-around for the latter is
either to disable TSQR for Belos (~Belos_ENABLE_TSQR:BOOL=OFF~) or to
disable TSQR altogether (the default behavior).

** Adding ETI for a new Tpetra class or function broke Stokhos' build; why?

Stokhos needs to instantiate Tpetra classes and functions for its own
Scalar types.  If Tpetra does ETI for a class or function that takes a
Scalar (or Packet; see Tpetra::DistObject) template parameter, Stokhos
must also instantiate that Tpetra class or function for its own Scalar
(or Packet) types.  Otherwise, if explicit template instantiation is
enabled in Trilinos, you may see build or link errors.

You may fix this by telling Stokhos about your new class or function.
You only need to do this if doing explicit instantiation for your new
class or function.

Stokhos uses a CMake-based system for generating explicit
instantiations automatically.  Its system is pickier than Tpetra's.
For example, if the class is called ~Tpetra::TheClass~, the
corresponding Tpetra header files must be ~Tpetra_TheClass_decl.hpp~
and ~Tpetra_TheClass_def.hpp~, and the corresponding explicit
instantiation macro must be called ~TPETRA_THECLASS_INSTANT~.  If the
function is ~Tpetra::Details::theFunction~, the corresponding Tpetra
header files must be ~Tpetra_Details_theFunction_decl.hpp~ and
~Tpetra_Details_theFunction_def.hpp~, and the corresponding explicit
instantiation macro must be called
~TPETRA_DETAILS_THEFUNCTION_INSTANT~.  (Tpetra has a more general
system that lets developers specify class, header, and macro names
separately.)

You must first follow these requirements for naming classes,
functions, header files, and explicit instantiation macros.  You must
then add your class or function to Stokhos' list of "Tpetra ETI
classes."  Do this by editing the CMake file
~Trilinos/packages/stokhos/src/CMakeLists.txt~, and adding the new
Tpetra class or function to the list of ~TPETRA_ETI_CLASSES~.  This
requires rerunning CMake before rebuilding (typing ~make~ will usually
do that for you automaticaly).
