NAME
    xl-numa-placement - Guest Automatic NUMA Placement in libxl and xl

DESCRIPTION
  Rationale
    NUMA (which stands for Non-Uniform Memory Access) means that the memory
    access times of a program running on a CPU depend on the relative
    distance between that CPU and that memory. In fact, most NUMA systems
    are built in such a way that each processor has its local memory, on
    which it can operate very fast. On the other hand, fetching and storing
    data from and to remote memory (that is, memory local to some other
    processor) is considerably more complex and slow. On these machines, a
    NUMA node is usually defined as a set of processor cores (typically a
    physical CPU package) and the memory directly attached to the set of
    cores.

    NUMA awareness becomes very important as soon as many domains start
    running memory-intensive workloads on a shared host. In fact, the cost
    of accessing non node-local memory locations is very high, and the
    performance degradation is likely to be noticeable.

    For more information, have a look at the Xen NUMA Introduction
    <http://wiki.xen.org/wiki/Xen_NUMA_Introduction> page on the Wiki.

  Xen and NUMA machines: the concept of *node-affinity*
    The Xen hypervisor deals with NUMA machines through the concept of
    *node-affinity*. The node-affinity of a domain is the set of NUMA nodes
    of the host where the memory for the domain is being allocated (mostly,
    at domain creation time). This is, at least in principle, different
    from and unrelated to the vCPU (hard and soft, see below) scheduling
    affinity, which instead is the set of pCPUs where the vCPU is allowed
    (or prefers) to run.

    Of course, despite the fact that they belong to and affect different
    subsystems, the domain node-affinity and the vCPUs' affinity are not
    completely independent. In fact, if the domain node-affinity is not
    explicitly specified by the user, via the proper libxl calls or xl
    config item, it will be computed based on the vCPUs' scheduling
    affinity.

    Notice that, even if the node-affinity of a domain may change on-line,
    it is very important to "place" the domain correctly when it is first
    created, as most of its memory is allocated at that time and cannot
    (for now) be moved easily.

  Placing via pinning and cpupools
    The simplest way of placing a domain on a NUMA node is setting the hard
    scheduling affinity of the domain's vCPUs to the pCPUs of the node. This
    also goes under the name of vCPU pinning, and can be done through the
    "cpus=" option in the config file (more about this below). Another
    option is to pool together the pCPUs spanning the node and put the
    domain in such a *cpupool* with the "pool=" config option (as documented
    in our Wiki <http://wiki.xen.org/wiki/Cpupools_Howto>).
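
    For instance, assuming that node 0 of the host comprises pCPUs 0-3,
    and that a cpupool spanning them has already been created (the CPU
    numbers and the pool name "pool-node0" are hypothetical), either of
    the following config file snippets confines a domain to that node:

        # Pin the domain's vCPUs to the pCPUs of node 0 ...
        cpus = "0-3"

        # ... or, alternatively, put the domain in a cpupool
        # that spans node 0
        pool = "pool-node0"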

    In both the above cases, the domain will not be able to execute outside
    the specified set of pCPUs for any reason, even if all of those pCPUs
    are busy doing something else while other pCPUs sit idle.

    So, when doing this, local memory accesses are 100% guaranteed, but
    that may come at the cost of some load imbalance.

  NUMA aware scheduling
    If using the credit1 scheduler, and starting from Xen 4.3, the
    scheduler itself always tries to run the domain's vCPUs on one of the
    nodes in its node-affinity. Only if that turns out to be impossible
    will it pick any free pCPU. Locality of access is less guaranteed than
    in the pinning case, but that comes along with better chances to
    exploit all the host resources (e.g., the pCPUs).

    Starting from Xen 4.5, credit1 supports two forms of affinity: hard
    and soft, both on a per-vCPU basis. This means each vCPU can have its
    own soft affinity, indicating where that vCPU prefers to execute. This
    is less strict than what (also starting from 4.5) is called hard
    affinity, as the vCPU can potentially run anywhere; it just prefers
    some pCPUs over others. In Xen 4.5, therefore, NUMA-aware scheduling
    is achieved by matching the soft affinity of the vCPUs of a domain
    with its node-affinity.

    In fact, as it was for 4.3, if all the pCPUs in a vCPU's soft affinity
    are busy, it is possible for the domain to run outside of them. The
    idea is that slower execution (due to remote memory accesses) is still
    better than no execution at all (as would happen with pinning). For
    this reason, NUMA-aware scheduling has the potential of bringing
    substantial performance benefits, although this will depend on the
    workload.

    Notice that, for each vCPU, the following three scenarios are possible:

    *   a vCPU *is pinned* to some pCPUs and *does not have* any soft
        affinity. In this case, the vCPU is always scheduled on one of the
        pCPUs to which it is pinned, without any specific preference among
        them;

    *   a vCPU *has* its own soft affinity and *is not* pinned to any
        particular pCPU. In this case, the vCPU can run on every pCPU.
        Nevertheless, the scheduler will try to have it running on one of
        the pCPUs in its soft affinity;

    *   a vCPU *has* its own soft affinity and *is also* pinned to some
        pCPUs. In this case, the vCPU is always scheduled on one of the
        pCPUs to which it is pinned, with, among them, a preference for
        the ones that also form its soft affinity. In case pinning and
        soft affinity form two disjoint sets of pCPUs, pinning "wins",
        and the soft affinity is just ignored.
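
    As an illustration of the last scenario, a command like the following
    (the domain name and CPU numbers are hypothetical) sets both kinds of
    affinity for a vCPU; with xl vcpu-pin, the hard affinity comes first
    and the soft affinity second:

        # vCPU 0 of domain "guest1" may only run on pCPUs 0-7 (hard
        # affinity), and prefers running on pCPUs 0-3 (soft affinity)
        xl vcpu-pin guest1 0 0-7 0-3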

  Guest placement in xl
    If using xl for creating and managing guests, it is very easy to ask
    for either manual or automatic placement of them across the host's
    NUMA nodes.

    Note that xm/xend does a very similar thing, the only differences being
    the details of the heuristics adopted for automatic placement (see
    below), and the lack of support (in both xm/xend and the Xen versions
    where that was the default toolstack) for NUMA aware scheduling.

  Placing the guest manually
    Thanks to the "cpus=" option, it is possible to specify where a domain
    should be created and scheduled, directly in its config file. This
    affects NUMA placement and memory accesses as, in this case, the
    hypervisor constructs the node-affinity of a VM based right on its
    vCPU pinning when it is created.

    This is very simple and effective, but requires the user/system
    administrator to explicitly specify the pinning for each and every
    domain, or Xen won't be able to guarantee the locality for their memory
    accesses.

    That, of course, also means the vCPUs of the domain will only be able
    to execute on those same pCPUs.

    It is also possible to use the "cpus_soft=" option in the xl config
    file, to specify the soft affinity for all the vCPUs of the domain.
    This affects the NUMA placement in the following way (an example is
    shown after the list):

    *   if only "cpus_soft=" is present, the VM's node-affinity will be
        equal to the nodes to which the pCPUs in the soft affinity mask
        belong;

    *   if both "cpus_soft=" and "cpus=" are present, the VM's node-affinity
        will be equal to the nodes to which the pCPUs present both in hard
        and soft affinity belong.
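
    For example, on a hypothetical host with two 4-pCPU nodes (pCPUs 0-3
    on node 0, pCPUs 4-7 on node 1), the following snippet results in a
    node-affinity of just node 0, as that is where the pCPUs present in
    both the hard and the soft affinity masks (i.e., pCPUs 2-3) belong:

        # Hard affinity: pCPUs 0-3 (node 0)
        cpus = "0-3"
        # Soft affinity: pCPUs 2-5; only 2-3 are also in the hard
        # affinity, so the node-affinity becomes node 0 alone
        cpus_soft = "2-5"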

  Placing the guest automatically
    If neither "cpus=" nor "cpus_soft=" are present in the config file,
    libxl tries to figure out on its own on which node(s) the domain
    could fit best. If it finds one (or some), the domain's node-affinity
    gets set accordingly, and both memory allocations and NUMA-aware
    scheduling (for the credit scheduler and starting from Xen 4.3) will
    comply with it. Starting from Xen 4.5, this also means that the mask
    resulting from this "fitting" procedure will become the soft affinity
    of all the vCPUs of the domain.

    It is worthwhile noting that optimally fitting a set of VMs on the
    NUMA nodes of a host is an incarnation of the Bin Packing Problem. In
    fact, the various VMs with different memory sizes are the items to be
    packed, and the host nodes are the bins. As this problem is known to
    be NP-hard, heuristics are used.

    The first thing to do is find the nodes or the sets of nodes (from now
    on referred to as 'candidates') that have enough free memory and
    enough physical CPUs for accommodating the new domain. The idea is to
    find a spot for the domain with at least as much free memory as it is
    configured to have, and as many pCPUs as it has vCPUs. After that, the
    actual decision on which candidate to pick happens according to the
    following heuristics:

    *   candidates involving fewer nodes are considered better. In case two
        (or more) candidates span the same number of nodes,

    *   candidates with a smaller number of vCPUs runnable on them (due to
        previous placement and/or plain vCPU pinning) are considered better.
        In case the same number of vCPUs can run on two (or more)
        candidates,

    *   the candidate with the greatest amount of free memory is
        considered to be the best one.

    Giving preference to candidates with fewer nodes ensures better
    performance for the guest, as it avoids spreading its memory among
    different nodes. Favoring candidates with fewer vCPUs already runnable
    there ensures a good balance of the overall host load. Finally, if
    more candidates fulfil these criteria, prioritizing the ones with the
    largest amounts of free memory helps keep memory fragmentation small,
    and maximizes the probability of being able to put more domains there.
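
    The following is a minimal C sketch of that comparison logic. The
    candidate structure and field names here are hypothetical, made up
    for illustration; the actual implementation lives in libxl (see
    libxl_internal.h, as mentioned below):

        #include <stdint.h>

        struct candidate {
            int nr_nodes;      /* number of NUMA nodes spanned */
            int nr_vcpus;      /* vCPUs already runnable on those nodes */
            uint64_t free_mem; /* free memory on those nodes, in bytes */
        };

        /* Returns < 0 if c1 is better than c2, > 0 if c2 is better */
        static int candidate_cmp(const struct candidate *c1,
                                 const struct candidate *c2)
        {
            /* 1st criterion: fewer nodes is better */
            if (c1->nr_nodes != c2->nr_nodes)
                return c1->nr_nodes - c2->nr_nodes;
            /* 2nd criterion: fewer runnable vCPUs is better */
            if (c1->nr_vcpus != c2->nr_vcpus)
                return c1->nr_vcpus - c2->nr_vcpus;
            /* 3rd criterion: more free memory is better */
            if (c1->free_mem != c2->free_mem)
                return c1->free_mem > c2->free_mem ? -1 : 1;
            return 0;
        }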

  Guest placement in libxl
    xl achieves automatic NUMA placement because that is what libxl does by
    default. No API is provided (yet) for modifying the behaviour of the
    placement algorithm. However, if your program is calling libxl, it is
    possible to set the "numa_placement" build info key to "false" (it is
    "true" by default) with something like the below, to prevent any
    placement from happening:

        libxl_defbool_set(&domain_build_info->numa_placement, false);

    Also, if "numa_placement" is set to "true", the domain's vCPUs must not
    be pinned (i.e., "domain_build_info->cpumap" must have all its bits set,
    as it is by default), or domain creation will fail with "ERROR_INVAL".
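
    Putting the above together, a minimal sketch of a caller disabling
    automatic placement could look like the following (error handling and
    the rest of the domain configuration are omitted):

        #include <libxl.h>

        libxl_domain_build_info b_info;

        libxl_domain_build_info_init(&b_info);
        /* Leave b_info.cpumap untouched: if numa_placement were left
         * "true", the map would need all its bits set (the default),
         * or domain creation would fail with ERROR_INVAL */
        libxl_defbool_set(&b_info.numa_placement, false);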

    Starting from Xen 4.3, in case automatic placement happens (and is
    successful), it will affect the domain's node-affinity and *not* its
    vCPU pinning. Namely, the domain's vCPUs will not be pinned to any pCPU
    on the host, but the memory from the domain will come from the selected
    node(s) and the NUMA aware scheduling (if the credit scheduler is in
    use) will try to keep the domain's vCPUs there as much as possible.

    Aside from that, for looking at and/or tweaking the placement
    algorithm, search for "Automatic NUMA placement" in libxl_internal.h.

    Note this may change in future versions of Xen/libxl.

  Xen < 4.5
    The concept of vCPU soft affinity was introduced for the first time
    in Xen 4.5. In 4.3, it is the domain's node-affinity that drives the
    NUMA-aware scheduler. The main difference is that soft affinity is
    per-vCPU, and so each vCPU can have its own mask of pCPUs, while
    node-affinity is per-domain, which is the equivalent of having all
    the vCPUs with the same soft affinity.

  Xen < 4.3
    As NUMA-aware scheduling is a new feature of Xen 4.3, things are a
    little bit different for earlier versions of Xen. If no "cpus="
    option is specified and Xen 4.2 is in use, the automatic placement
    algorithm still runs, but the result is used to *pin* the vCPUs of
    the domain to the output node(s). This is consistent with what was
    happening with xm/xend.

    On a version of Xen earlier than 4.2, there is no automatic placement
    at all in xl or libxl, and hence no node-affinity, vCPU affinity or
    pinning is introduced/modified.

  Limitations
    Analyzing various possible placement solutions is what makes the
    algorithm flexible and quite effective. However, that also means it
    won't scale well to systems with an arbitrary number of nodes. For
    this reason, automatic placement is disabled (with a warning) if it
    is requested on a host with more than 16 NUMA nodes.