<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" lang="en-us" xml:lang="en-us"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta> <meta http-equiv="X-UA-Compatible" content="IE=edge"></meta> <meta name="copyright" content="(C) Copyright 2005"></meta> <meta name="DC.rights.owner" content="(C) Copyright 2005"></meta> <meta name="DC.Type" content="concept"></meta> <meta name="DC.Title" content="Floating Point and IEEE 754 Compliance for NVIDIA GPUs"></meta> <meta name="abstract" content="White paper covering the most common issues related to NVIDIA GPUs."></meta> <meta name="description" content="White paper covering the most common issues related to NVIDIA GPUs."></meta> <meta name="DC.Coverage" content="White Papers"></meta> <meta name="DC.subject" content="CUDA Floating Point, CUDA Floating Point formats, CUDA Floating Point FMA, CUDA Floating Point accuracy, CUDA Floating Point rounding mode, CUDA Floating Point x86 differences, CUDA Floating Point compiler flags, CUDA Floating Point core counts, CUDA Floating Point x87, CUDA Floating Point recommendations"></meta> <meta name="keywords" content="CUDA Floating Point, CUDA Floating Point formats, CUDA Floating Point FMA, CUDA Floating Point accuracy, CUDA Floating Point rounding mode, CUDA Floating Point x86 differences, CUDA Floating Point compiler flags, CUDA Floating Point core counts, CUDA Floating Point x87, CUDA Floating Point recommendations"></meta> <meta name="DC.Format" content="XHTML"></meta> <meta name="DC.Identifier" content="abstract"></meta> <link rel="stylesheet" type="text/css" href="../common/formatting/commonltr.css"></link> <link rel="stylesheet" type="text/css" href="../common/formatting/site.css"></link> <title>Floating Point and IEEE 754 :: CUDA Toolkit Documentation</title> <!--[if lt IE 9]> <script 
src="../common/formatting/html5shiv-printshiv.min.js"></script> <![endif]--> <script type="text/javascript" charset="utf-8" src="../common/formatting/jquery.min.js"></script> <script type="text/javascript" charset="utf-8" src="../common/formatting/jquery.ba-hashchange.min.js"></script> <link rel="canonical" href="http://docs.nvidia.com/cuda/floating-point/index.html"></link> <link rel="stylesheet" type="text/css" href="../common/formatting/qwcode.highlight.css"></link> </head> <body> <article id="contents"> <div id="eqn-warning">This document includes math equations (highlighted in red) which are best viewed with <a target="_blank" href="https://www.mozilla.org/firefox">Firefox</a> version 4.0 or higher, or another <a target="_blank" href="http://www.w3.org/Math/Software/mathml_software_cat_browsers.html">MathML-aware browser</a>. There is also a <a href="../../pdf/Floating_Point_on_NVIDIA_GPU.pdf">PDF version of this document</a>. </div> <div id="eqn-warning-buf"></div> <div id="release-info">Floating Point and IEEE 754 (<a href="../../pdf/Floating_Point_on_NVIDIA_GPU.pdf">PDF</a>) - CUDA Toolkit v5.5 (<a href="https://developer.nvidia.com/cuda-toolkit-archive">older</a>) - Last updated July 19, 2013 - <a href="mailto:cudatools@nvidia.com?subject=CUDA Tools Documentation Feedback: floating-point">Send Feedback</a></div> <div class="topic nested0" id="abstract"><a name="abstract" shape="rect"> <!-- --></a><h2 class="title topictitle1"><a href="#abstract" name="abstract" shape="rect"><span class="ph">Floating Point and IEEE 754 Compliance for NVIDIA GPUs</span></a></h2> <div class="body conbody"> <p class="p">A number of issues related to floating point accuracy and compliance are a frequent source of confusion on both CPUs and GPUs. The purpose of this white paper is to discuss the most common issues related to NVIDIA GPUs and to supplement the documentation in the CUDA C Programming Guide. 
</p> </div> </div> <div class="topic concept nested0" id="introduction"><a name="introduction" shape="rect"> <!-- --></a><h2 class="title topictitle1"><a href="#introduction" name="introduction" shape="rect">1. Introduction</a></h2> <div class="body conbody"> <p class="p">Since the widespread adoption in 1985 of the IEEE Standard for <dfn class="term">Binary Floating-Point Arithmetic</dfn> (IEEE 754-1985 <a class="xref" href="index.html#references__1" shape="rect">[1]</a>) virtually all mainstream computing systems have implemented the standard, including NVIDIA with the CUDA architecture. IEEE 754 standardizes how arithmetic results should be <em class="ph i">approximated</em> in floating point. Whenever working with inexact results, programming decisions can affect accuracy. It is important to consider many aspects of floating point behavior in order to achieve the highest performance with the precision required for any specific application. This is especially true in a heterogeneous computing environment where operations will be performed on different types of hardware. </p> <p class="p">Understanding some of the intricacies of floating point and the specifics of how NVIDIA hardware handles floating point is obviously important to CUDA programmers striving to implement correct numerical algorithms. In addition, users of libraries such as <dfn class="term">CUBLAS</dfn> and <dfn class="term">CUFFT</dfn> will also find it informative to learn how NVIDIA handles floating point under the hood. </p> <p class="p">We review some of the basic properties of floating point calculations in <a class="xref" href="index.html#floating-point" shape="rect">Chapter 2</a>. We also discuss the fused multiply-add operator, which was added to the IEEE 754 standard in 2008 <a class="xref" href="index.html#references__2" shape="rect">[2]</a> and is built into the hardware of NVIDIA GPUs. 
In <a class="xref" href="index.html#dot-product-accuracy-example" shape="rect">Chapter 3</a> we work through an example of computing the dot product of two short vectors to illustrate how different choices of implementation affect the accuracy of the final result. In <a class="xref" href="index.html#cuda-and-floating-point" shape="rect">Chapter 4</a> we describe NVIDIA hardware versions and NVCC compiler options that affect floating point calculations. In <a class="xref" href="index.html#considerations-for-heterogeneous-world" shape="rect">Chapter 5</a> we consider some issues regarding the comparison of CPU and GPU results. Finally, in <a class="xref" href="index.html#concrete-recommendations" shape="rect">Chapter 6</a> we conclude with concrete recommendations to programmers that deal with numeric issues relating to floating point on the GPU. </p> </div> </div> <div class="topic concept nested0" id="floating-point"><a name="floating-point" shape="rect"> <!-- --></a><h2 class="title topictitle1"><a href="#floating-point" name="floating-point" shape="rect">2. Floating Point</a></h2> <div class="topic concept nested1" id="formats"><a name="formats" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#formats" name="formats" shape="rect">2.1. Formats</a></h3> <div class="body conbody"> <div class="section"> <p class="p">Floating point encodings and functionality are defined in the IEEE 754 Standard <a class="xref" href="index.html#references__2" shape="rect">[2]</a> last revised in 2008. Goldberg <a class="xref" href="index.html#references__5" shape="rect">[5]</a> gives a good introduction to floating point and many of the issues that arise. </p> <p class="p">The standard mandates binary floating point data be encoded on three fields: a one bit sign field, followed by exponent bits encoding the exponent offset by a numeric bias specific to each format, and bits encoding the significand (or fraction). 
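</p> <p class="p">As a rough illustration of the three fields, the following C sketch splits a <samp class="ph codeph">float</samp> into its sign, stored (biased) exponent, and fraction bits. It assumes that <samp class="ph codeph">float</samp> is the IEEE 754 32 bit basic format (1 sign bit, 8 exponent bits biased by 127, 23 fraction bits). The names <samp class="ph codeph">float_fields</samp> and <samp class="ph codeph">decode_float</samp> are hypothetical and exist only for this example.</p>
<pre xml:space="preserve"><![CDATA[
```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three fields of a 32 bit float. */
typedef struct {
    uint32_t sign;      /* 1 bit: 0 = positive, 1 = negative */
    uint32_t exponent;  /* 8 bits, stored with the bias of 127 */
    uint32_t fraction;  /* 23 bits, the implicit leading 1 is not stored */
} float_fields;

/* Split a float into its fields. Assumes float is IEEE 754 binary32. */
float_fields decode_float(float x)
{
    uint32_t bits;
    float_fields f;

    assert(sizeof x == sizeof bits);   /* sanity check on the assumption */
    memcpy(&bits, &x, sizeof bits);    /* copy the raw bit pattern */

    f.sign     = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.fraction = bits & 0x7FFFFFu;
    return f;
}
```
]]></pre>
<p class="p">Under these assumptions, <samp class="ph codeph">decode_float(-192.0f)</samp> should yield a sign of 1, a stored exponent of 134, and a fraction field of 0x400000.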
</p><br clear="none"></br><div class="imagecenter"><img class="image imagecenter" src="graphics/sign-exponent-fraction.png"></img></div><br clear="none"></br><p class="p">In order to ensure consistent computations across platforms and to exchange floating point data, IEEE 754 defines basic and interchange formats. The 32 and 64 bit basic binary floating point formats correspond to the C data types <samp class="ph codeph">float</samp> and <samp class="ph codeph">double</samp>. Their corresponding representations have the following bit lengths: </p><br clear="none"></br><div class="imagecenter"><img class="image imagecenter" src="graphics/float-double.png"></img></div><br clear="none"></br><p class="p">For numerical data representing finite values, the sign is either negative or positive, the exponent field encodes the exponent in base 2, and the fraction field encodes the significand without the most significant non-zero bit. For example, the value -192 equals (-1)<sup class="ph sup">1</sup> x 2<sup class="ph sup">7</sup> x 1.5, and can be represented as having a negative sign, an exponent of 7, and a fractional part .5. The exponents are biased by 127 and 1023, respectively, to allow exponents to extend from negative to positive. Hence the exponent 7 is represented by bit strings with values 134 for float and 1030 for double. The integral part of 1. is implicit in the fraction. </p><br clear="none"></br><div class="imagecenter"><img class="image imagecenter" src="graphics/float-1-double-1.png"></img></div><br clear="none"></br><p class="p">Also, encodings to represent infinity and not-a-number (NaN) data are reserved. The IEEE 754 Standard <a class="xref" href="index.html#references__2" shape="rect">[2]</a> describes floating point encodings in full. </p> <p class="p">Given that the fraction field uses a limited number of bits, not all real numbers can be represented exactly. 
For example the mathematical value of the fraction 2/3 represented in binary is 0.10101010... which has an infinite number of bits after the binary point. The value 2/3 must be rounded first in order to be represented as a floating point number with limited precision. The rules for rounding and the rounding modes are specified in IEEE 754. The most frequently used is the round-to-nearest-or-even mode (abbreviated as round-to-nearest). The value 2/3 rounded in this mode is represented in binary as: </p><br clear="none"></br><div class="imagecenter"><img class="image imagecenter" src="graphics/float-0-double-0.png"></img></div><br clear="none"></br><p class="p">The sign is positive and the stored exponent value represents an exponent of -1. </p> </div> </div> </div> <div class="topic concept nested1" id="operations-and-accuracy"><a name="operations-and-accuracy" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#operations-and-accuracy" name="operations-and-accuracy" shape="rect">2.2. Operations and Accuracy</a></h3> <div class="body conbody"> <div class="section"> <p class="p">The IEEE 754 standard requires support for a handful of operations. These include the arithmetic operations add, subtract, multiply, divide, square root, fused-multiply-add, remainder, conversion operations, scaling, sign operations, and comparisons. The results of these operations are guaranteed to be the same for all implementations of the standard, for a given format and rounding mode. </p> <p class="p">The rules and properties of mathematical arithmetic do not hold directly for floating point arithmetic because of floating point's limited precision. For example, the table below shows single precision values <em class="ph i">A</em>, <em class="ph i">B</em>, and <em class="ph i">C</em>, and the mathematical exact value of their sum computed using different associativity. 
</p> <p class="p d4p_eqn_block"> <math xmlns="http://www.w3.org/1998/Math/MathML"> <mtable columnalign="right left left" columnspacing="0.2em"> <mtr> <mtd> <mi>A</mi> </mtd> <mtd> <mo>=</mo> </mtd> <mtd> <msup> <mn>2</mn> <mn>1</mn> </msup> <mo>×</mo> <mn>1.00000000000000000000001</mn> </mtd> </mtr> <mtr> <mtd> <mi>B</mi> </mtd> <mtd> <mo>=</mo> </mtd> <mtd> <msup> <mn>2</mn> <mn>0</mn> </msup> <mo>×</mo> <mn>1.00000000000000000000001</mn> </mtd> </mtr> <mtr> <mtd> <mi>C</mi> </mtd> <mtd> <mo>=</mo> </mtd> <mtd> <msup> <mn>2</mn> <mn>3</mn> </msup> <mo>×</mo> <mn>1.00000000000000000000001</mn> </mtd> </mtr> <mtr> <mtd> <mrow> <mo>(</mo> <mi>A</mi> <mo>+</mo> <mi>B</mi> <mo>)</mo> <mo>+</mo> <mi>C</mi> </mrow> </mtd> <mtd> <mo>=</mo> </mtd> <mtd> <msup> <mn>2</mn> <mn>3</mn> </msup> <mo>×</mo> <mn>1.01100000000000000000001011</mn> </mtd> </mtr> <mtr> <mtd> <mrow> <mi>A</mi> <mo>+</mo> <mo>(</mo> <mi>B</mi> <mo>+</mo> <mi>C</mi> <mo>)</mo> </mrow> </mtd> <mtd> <mo>=</mo> </mtd> <mtd> <msup> <mn>2</mn> <mn>3</mn> </msup> <mo>×</mo> <mn>1.01100000000000000000001011</mn> </mtd> </mtr> </mtable> </math> </p> <p class="p">Mathematically, (<em class="ph i">A</em> + <em class="ph i">B</em>) + <em class="ph i">C</em> does equal <em class="ph i">A</em> + (<em class="ph i">B</em> + <em class="ph i">C</em>). </p> <p class="p">Let rn(<em class="ph i">x</em>) denote one rounding step on <em class="ph i">x</em>. 
Performing these same computations in single precision floating point arithmetic in round-to-nearest mode according to IEEE 754, we obtain: </p> <p class="p d4p_eqn_block"> <math xmlns="http://www.w3.org/1998/Math/MathML"> <mtable columnalign="right left left" columnspacing="0.2em"> <mtr> <mtd> <mi>A</mi> <mo>+</mo> <mi>B</mi> </mtd> <mtd> <mo>=</mo> </mtd> <mtd> <msup> <mn>2</mn> <mn>1</mn> </msup> <mo>×</mo> <mn>1.1000000000000000000000110000...</mn> </mtd> </mtr> <mtr> <mtd> <mtext>rn</mtext> <mo>(</mo> <mi>A</mi> <mo>+</mo> <mi>B</mi> <mo>)</mo> </mtd> <mtd> <mo>=</mo> </mtd> <mtd> <msup> <mn>2</mn> <mn>1</mn> </msup> <mo>×</mo> <mn>1.10000000000000000000010</mn> </mtd> </mtr> <mtr> <mtd> <mi>B</mi> <mo>+</mo> <mi>C</mi> </mtd> <mtd> <mo>=</mo> </mtd> <mtd> <msup> <mn>2</mn> <mn>3</mn> </msup> <mo>×</mo> <mn>1.0010000000000000000000100100...</mn> </mtd> </mtr> <mtr> <mtd> <mtext>rn</mtext> <mo>(</mo> <mi>B</mi> <mo>+</mo> <mi>C</mi> <mo>)</mo> </mtd> <mtd> <mo>=</mo> </mtd> <mtd> <msup> <mn>2</mn> <mn>3</mn> </msup> <mo>×</mo> <mn>1.00100000000000000000001</mn> </mtd> </mtr> <mtr> <mtd> <mi>A</mi> <mo>+</mo> <mi>B</mi> <mo>+</mo> <mi>C</mi> </mtd> <mtd> <mo>=</mo> </mtd> <mtd> <msup> <mn>2</mn> <mn>3</mn> </msup> <mo>×</mo> <mn>1.0110000000000000000000101100...</mn> </mtd> </mtr> <mtr> <mtd> <mtext>rn</mtext> <mo>(</mo> <mtext>rn</mtext> <mo>(</mo> <mi>A</mi> <mo>+</mo> <mi>B</mi> <mo>)</mo> <mo>+</mo> <mi>C</mi> <mo>)</mo> </mtd> <mtd> <mo>=</mo> </mtd> <mtd> <msup> <mn>2</mn> <mn>3</mn> </msup> <mo>×</mo> <mn>1.01100000000000000000010</mn> </mtd> </mtr> <mtr> <mtd> <mtext>rn</mtext> <mo>(</mo> <mi>A</mi> <mo>+</mo> <mtext>rn</mtext> <mo>(</mo> <mi>B</mi> <mo>+</mo> <mi>C</mi> <mo>)</mo> <mo>)</mo> </mtd> <mtd> <mo>=</mo> </mtd> <mtd> <msup> <mn>2</mn> <mn>3</mn> </msup> <mo>×</mo> <mn>1.01100000000000000000001</mn> </mtd> </mtr> </mtable> </math> </p> <p class="p">For reference, the exact, mathematical results are computed as well in the table above. 
Not only are the results computed according to IEEE 754 different from the exact mathematical results, but also the results corresponding to the sum rn(rn(A + B) + C) and the sum rn(A + rn(B + C)) are different from each other. In this case, rn(A + rn(B + C)) is closer to the correct mathematical result than rn(rn(A + B) + C). </p> <p class="p">This example highlights that seemingly identical computations can produce different results even if all basic operations are computed in compliance with IEEE 754. </p> <p class="p">Here, the order in which operations are executed affects the accuracy of the result. The results are independent of the host system. These same results would be obtained using any microprocessor, CPU or GPU, which supports single precision floating point. </p> </div> </div> </div> <div class="topic concept nested1" id="fused-multiply-add-fma"><a name="fused-multiply-add-fma" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#fused-multiply-add-fma" name="fused-multiply-add-fma" shape="rect">2.3. The Fused Multiply-Add (FMA)</a></h3> <div class="body conbody"> <p class="p">In 2008 the IEEE 754 standard was revised to include the fused multiply-add operation (<dfn class="term">FMA</dfn>). The FMA operation computes <math xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> <mtext>rn</mtext> <mo>(</mo> <mi>X</mi> <mo>×</mo> <mi>Y</mi> <mo>+</mo> <mi>Z</mi> <mo>)</mo> </mrow> </math> with only one rounding step. Without the FMA operation the result would have to be computed as <math xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> <mtext>rn</mtext> <mo>(</mo> <mtext>rn</mtext> <mo>(</mo> <mi>X</mi> <mo>×</mo> <mi>Y</mi> <mo>)</mo> <mo>+</mo> <mi>Z</mi> <mo>)</mo> </mrow> </math> with two rounding steps, one for multiply and one for add. Because the FMA uses only a single rounding step the result is computed more accurately. 
</p> <p class="p">Let's first illustrate how the FMA operation works using decimal arithmetic, for clarity. Let's compute <math xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> <msup> <mi>x</mi> <mn>2</mn> </msup> <mo>−</mo> <mn>1</mn> </mrow> </math> with four digits of precision after the decimal point, or a total of five digits of precision including the leading digit before the decimal point. </p> <p class="p">For <math xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> <mi>x</mi> <mo>=</mo> <mn>1.0008</mn> </mrow> </math>, the correct mathematical result is <math xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> <msup> <mi>x</mi> <mn>2</mn> </msup> <mo>−</mo> <mn>1</mn> <mo>=</mo> <mn>1.60064</mn> <mo>×</mo> <msup> <mrow> <mn>10</mn> </mrow> <mrow> <mo>−</mo> <mn>3</mn> </mrow> </msup> </mrow> </math>. The closest number using only four digits after the decimal point is <math xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> <mn>1.6006</mn> <mo>×</mo> <msup> <mn>10</mn> <mrow> <mo>−</mo> <mn>3</mn> </mrow> </msup> </mrow> </math>. In this case <math xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> <mtext>rn</mtext> <mo>(</mo> <msup> <mi>x</mi> <mn>2</mn> </msup> <mo>−</mo> <mn>1</mn> <mo>)</mo> <mo>=</mo> <mn>1.6006</mn> <mo>×</mo> <msup> <mn>10</mn> <mrow> <mo>−</mo> <mn>3</mn> </mrow> </msup> </mrow> </math> which corresponds to the fused multiply-add operation <math xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> <mtext>rn</mtext> <mo>(</mo> <mi>x</mi> <mo>×</mo> <mi>x</mi> <mo>+</mo> <mo>(</mo> <mo>−</mo> <mn>1</mn> <mo>)</mo> <mo>)</mo> </mrow> </math>. The alternative is to compute separate multiply and add steps. 
For the multiply, <math xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> <msup> <mi>x</mi> <mn>2</mn> </msup> <mo>=</mo> <mn>1.00160064</mn> </mrow> </math>, so <math xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> <mtext>rn</mtext> <mo>(</mo> <msup> <mi>x</mi> <mn>2</mn> </msup> <mo>)</mo> <mo>=</mo> <mn>1.0016</mn> </mrow> </math>. The final result is <math xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> <mtext>rn</mtext> <mo>(</mo> <mtext>rn</mtext> <mo>(</mo> <msup> <mi>x</mi> <mn>2</mn> </msup> <mo>)</mo> <mo>−</mo> <mn>1</mn> <mo>)</mo> <mo>=</mo> <mn>1.6000</mn> <mo>×</mo> <msup> <mn>10</mn> <mrow> <mo>−</mo> <mn>3</mn> </mrow> </msup> </mrow> </math>. </p> <p class="p">Rounding the multiply and add separately yields a result that is off by 0.00064 × 10<sup class="ph sup">−3</sup>. The corresponding FMA computation is off by only 0.00004 × 10<sup class="ph sup">−3</sup>, and its result is the representable value closest to the correct mathematical answer. The results are summarized below: </p> <p class="p d4p_eqn_block"> <math xmlns="http://www.w3.org/1998/Math/MathML"> <mtable columnalign="right left left left" columnspacing="0.2em"> <mtr> <mtd> <mi>x</mi> </mtd> <mtd> <mo>=</mo> </mtd> <mtd> <mn>1.0008</mn> </mtd> </mtr> <mtr> <mtd> <msup> <mi>x</mi> <mn>2</mn> </msup> </mtd> <mtd> <mo>=</mo> </mtd> <mtd> <mn>1.00160064</mn> </mtd> </mtr> <mtr> <mtd> <msup> <mi>x</mi> <mn>2</mn> </msup> <mo>−</mo> <mn>1</mn> </mtd> <mtd> <mo>=</mo> </mtd> <mtd> <mn>1.60064</mn> <mo>×</mo> <msup> <mn>10</mn> <mrow> <mo>−</mo> <mn>3</mn> </mrow> </msup> <mtext> </mtext> </mtd> <mtd> <mtext>true value</mtext> </mtd> </mtr> <mtr> <mtd> <mtext>rn</mtext> <mo>(</mo> <msup> <mi>x</mi> <mn>2</mn> </msup> <mo>−</mo> <mn>1</mn> <mo>)</mo> </mtd> <mtd> <mo>=</mo> </mtd> <mtd> <mn>1.6006</mn> <mo>×</mo> <msup> <mn>10</mn> <mrow> <mo>−</mo> <mn>3</mn> </mrow> </msup> </mtd> <mtd> <mtext>fused multiply-add</mtext> </mtd> </mtr> <mtr> <mtd> <mtext>rn</mtext> <mo>(</mo> <msup> <mi>x</mi> <mn>2</mn> </msup> <mo>)</mo> </mtd> <mtd> <mo>=</mo> </mtd> <mtd> 
<mn>1.0016</mn> </mtd> </mtr> <mtr> <mtd> <mtext>rn</mtext> <mo>(</mo> <mtext>rn</mtext> <mo>(</mo> <msup> <mi>x</mi> <mn>2</mn> </msup> <mo>)</mo> <mo>−</mo> <mn>1</mn> <mo>)</mo> </mtd> <mtd> <mo>=</mo> </mtd> <mtd> <mn>1.6000</mn> <mo>×</mo> <msup> <mn>10</mn> <mrow> <mo>−</mo> <mn>3</mn> </mrow> </msup> </mtd> <mtd> <mtext>multiply, then add</mtext> </mtd> </mtr> </mtable> </math> </p> <p class="p">Below is another example, using binary single precision values:</p> <p class="p d4p_eqn_block"> <math xmlns="http://www.w3.org/1998/Math/MathML"> <mtable columnalign="right left left left left" columnspacing="0.2em"> <mtr> <mtd> <mi>A</mi> </mtd> <mtd> <mo>=</mo> </mtd> <mtd></mtd> <mtd> <msup> <mn>2</mn> <mn>0</mn> </msup> </mtd> <mtd> <mo>×</mo> <mn>1.00000000000000000000001</mn> </mtd> </mtr> <mtr> <mtd> <mi>B</mi> </mtd> <mtd> <mo>=</mo> </mtd> <mtd> <mo>−</mo> </mtd> <mtd> <msup> <mn>2</mn> <mn>0</mn> </msup> </mtd> <mtd> <mo>×</mo> <mn>1.00000000000000000000010</mn> </mtd> </mtr> <mtr> <mtd> <mtext>rn</mtext> <mo>(</mo> <mi>A</mi> <mo>×</mo> <mi>A</mi> <mo>+</mo> <mi>B</mi> <mo>)</mo> </mtd> <mtd> <mo>=</mo> </mtd> <mtd></mtd> <mtd> <msup> <mn>2</mn> <mrow> <mo>−</mo> <mn>46</mn> </mrow> </msup> </mtd> <mtd> <mo>×</mo> <mn>1.00000000000000000000000</mn> </mtd> </mtr> <mtr> <mtd> <mtext>rn</mtext> <mo>(</mo> <mtext>rn</mtext> <mo>(</mo> <mi>A</mi> <mo>×</mo> <mi>A</mi> <mo>)</mo> <mo>+</mo> <mi>B</mi> <mo>)</mo> </mtd> <mtd> <mo>=</mo> </mtd> <mtd></mtd> <mtd> <mn>0</mn> </mtd> </mtr> </mtable> </math> </p> <p class="p">In this particular case, computing <math xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> <mtext>rn</mtext> <mo>(</mo> <mtext>rn</mtext> <mo>(</mo> <mi>A</mi> <mo>×</mo> <mi>A</mi> <mo>)</mo> <mo>+</mo> <mi>B</mi> <mo>)</mo> </mrow> </math> as an IEEE 754 multiply followed by an IEEE 754 add loses all bits of precision, and the computed result is 0. 
The alternative of computing the FMA <math xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> <mtext>rn</mtext> <mo>(</mo> <mi>A</mi> <mo>×</mo> <mi>A</mi> <mo>+</mo> <mi>B</mi> <mo>)</mo> </mrow> </math> provides a result equal to the mathematical value. In general, the fused-multiply-add operation generates more accurate results than computing one multiply followed by one add. The choice of whether or not to use the fused operation depends on whether the platform provides the operation and also on how the code is compiled. </p> <p class="p"><a class="xref" href="index.html#fused-multiply-add-fma__multiply-and-add-code-fragment-and-output-for-x86-and-nvidia-fermi-gpu" shape="rect">Figure 1</a> shows CUDA C code and output corresponding to inputs <em class="ph i">A</em> and <em class="ph i">B</em> and operations from the example above. The code is executed on two different hardware platforms: an x86-class CPU using <dfn class="term">SSE</dfn> in single precision, and an NVIDIA GPU with compute capability 2.0. At the time this paper is written (Spring 2011) there are no commercially available x86 CPUs which offer hardware FMA. Because of this, the computed result in single precision in SSE would be 0. NVIDIA GPUs with compute capability 2.0 do offer hardware FMAs, so the result of executing this code will be the more accurate one by default. However, both results are correct according to the IEEE 754 standard. The code fragment was compiled without any special intrinsics or compiler options for either platform. </p> <p class="p">The fused multiply-add helps avoid loss of precision during subtractive cancellation. Subtractive cancellation occurs during the addition of quantities of similar magnitude with opposite signs. In this case many of the leading bits cancel, leaving fewer meaningful bits of precision in the result. The fused multiply-add computes a double-width product during the multiplication. 
Thus even if subtractive cancellation occurs during the addition there are still enough valid bits remaining in the product to get a precise result with no loss of precision. </p> <div class="fig fignone" id="fused-multiply-add-fma__multiply-and-add-code-fragment-and-output-for-x86-and-nvidia-fermi-gpu"><a name="fused-multiply-add-fma__multiply-and-add-code-fragment-and-output-for-x86-and-nvidia-fermi-gpu" shape="rect"> <!-- --></a><span class="figcap">Figure 1. Multiply and Add Code Fragment and Output for x86 and NVIDIA Fermi GPU</span><pre xml:space="preserve"><span xmlns:xslthl="http://xslthl.sf.net" class="xslthl-keyword">union</span> {
    <span xmlns:xslthl="http://xslthl.sf.net" class="xslthl-keyword">float</span> f;
    <span xmlns:xslthl="http://xslthl.sf.net" class="xslthl-keyword">unsigned</span> <span xmlns:xslthl="http://xslthl.sf.net" class="xslthl-keyword">int</span> i;
} a, b;
<span xmlns:xslthl="http://xslthl.sf.net" class="xslthl-keyword">float</span> r;

a.i = 0x3F800001;
b.i = 0xBF800002;
r = a.f * a.f + b.f;

printf(<span xmlns:xslthl="http://xslthl.sf.net" class="xslthl-string">"a: %.8g\n"</span>, a.f);
printf(<span xmlns:xslthl="http://xslthl.sf.net" class="xslthl-string">"b: %.8g\n"</span>, b.f);
printf(<span xmlns:xslthl="http://xslthl.sf.net" class="xslthl-string">"r: %.8g\n"</span>, r);</pre><p class="p">x86-64 output:</p><pre class="pre screen" xml:space="preserve">a: 1.0000001
b: -1.0000002
<strong class="ph b">r: 0</strong></pre><p class="p">NVIDIA Fermi output:</p><pre class="pre screen" xml:space="preserve">a: 1.0000001
b: -1.0000002
<strong class="ph b">r: 1.4210855e-14</strong></pre></div> </div> </div> </div> <div class="topic concept nested0" id="dot-product-accuracy-example"><a name="dot-product-accuracy-example" shape="rect"> <!-- --></a><h2 class="title topictitle1"><a href="#dot-product-accuracy-example" name="dot-product-accuracy-example" shape="rect">3. 
Dot Product: An Accuracy Example</a></h2> <div class="body conbody"> <p class="p">Consider the problem of finding the dot product of two short vectors <math xmlns="http://www.w3.org/1998/Math/MathML"> <semantics definitionURL="" encoding=""> <mover accent="true"> <mi>a</mi> <mo>→</mo> </mover> </semantics> </math> and <math xmlns="http://www.w3.org/1998/Math/MathML"> <semantics definitionURL="" encoding=""> <mover accent="true"> <mi>b</mi> <mo>→</mo> </mover> </semantics> </math>, both with four elements. </p> <ul class="sl simple"> <li class="sli"> <math xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> <mover accent="true"> <mi>a</mi> <mo>⇀</mo> </mover> <mo>=</mo> <mrow> <mo>[</mo> <mrow> <mtable> <mtr> <mtd> <mrow> <msub> <mi>a</mi> <mn>1</mn> </msub> </mrow> </mtd> </mtr> <mtr> <mtd> <mrow> <msub> <mi>a</mi> <mn>2</mn> </msub> </mrow> </mtd> </mtr> <mtr> <mtd> <mrow> <msub> <mi>a</mi> <mn>3</mn> </msub> </mrow> </mtd> </mtr> <mtr> <mtd> <mrow> <msub> <mi>a</mi> <mn>4</mn> </msub> </mrow> </mtd> </mtr> </mtable> </mrow> <mo>]</mo> </mrow> <mtext> </mtext> <mtext> </mtext> <mover accent="true"> <mi>b</mi> <mo>⇀</mo> </mover> <mo>=</mo> <mrow> <mo>[</mo> <mrow> <mtable> <mtr> <mtd> <mrow> <msub> <mi>b</mi> <mn>1</mn> </msub> </mrow> </mtd> </mtr> <mtr> <mtd> <mrow> <msub> <mi>b</mi> <mn>2</mn> </msub> </mrow> </mtd> </mtr> <mtr> <mtd> <mrow> <msub> <mi>b</mi> <mn>3</mn> </msub> </mrow> </mtd> </mtr> <mtr> <mtd> <mrow> <msub> <mi>b</mi> <mn>4</mn> </msub> </mrow> </mtd> </mtr> </mtable> </mrow> <mo>]</mo> </mrow> <mtext> </mtext> <mover accent="true"> <mi>a</mi> <mo>⇀</mo> </mover> <mo>⋅</mo> <mover accent="true"> <mi>b</mi> <mo>⇀</mo> </mover> <mo>=</mo> <msub> <mi>a</mi> <mn>1</mn> </msub> <msub> <mi>b</mi> <mn>1</mn> </msub> <mo>+</mo> <msub> <mi>a</mi> <mn>2</mn> </msub> <msub> <mi>b</mi> <mn>2</mn> </msub> <mo>+</mo> <msub> <mi>a</mi> <mn>3</mn> </msub> <msub> <mi>b</mi> <mn>3</mn> </msub> <mo>+</mo> <msub> <mi>a</mi> <mn>4</mn> </msub> <msub> <mi>b</mi> 
<mn>4</mn> </msub> </mrow> </math> </li> </ul> <p class="p">This operation is easy to write mathematically, but its implementation in software involves several choices. All of the strategies we will discuss use purely IEEE 754 compliant operations. </p> </div> <div class="topic concept nested1" id="example-algorithms"><a name="example-algorithms" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#example-algorithms" name="example-algorithms" shape="rect">3.1. Example Algorithms</a></h3> <div class="body conbody"> <p class="p">We present three algorithms which differ in how the multiplications, additions, and possibly fused multiply-adds are organized. These algorithms are presented in <a class="xref" href="index.html#example-algorithms__serial-method-to-compute-vectors-dot-product" title="The serial method uses a simple loop with separate multiplies and adds to compute the dot product of the vectors. The final result can be represented as ((((a1 x b1) + (a2 x b2)) + (a3 x b3)) + (a4 x b4))." shape="rect">Figure 2</a>, <a class="xref" href="index.html#example-algorithms__fma-method-to-compute-vectors-dot-product" title="The FMA method uses a simple loop with fused multiply-adds to compute the dot product of the vectors. The final result can be represented as (a4 x b4 + (a3 x b3 + (a2 x b2 + (a1 x b1 + 0))))." shape="rect">Figure 3</a>, and <a class="xref" href="index.html#comparison__parallel-method-to-reduce-individual-elements-products-into-final-sum" title="The parallel method uses a tree to reduce all the products of individual elements into a final sum. The final result can be represented as ((a1 x b1) + (a2 x b2)) + ((a3 x b3) + (a4 x b4))." shape="rect">Figure 4</a>. Each of the three algorithms is represented graphically. Individual operations are shown as circles, with arrows pointing from arguments to operations. 
</p> <p class="p">The simplest way to compute the dot product is using a short loop as shown in <a class="xref" href="index.html#example-algorithms__serial-method-to-compute-vectors-dot-product" title="The serial method uses a simple loop with separate multiplies and adds to compute the dot product of the vectors. The final result can be represented as ((((a1 x b1) + (a2 x b2)) + (a3 x b3)) + (a4 x b4))." shape="rect">Figure 2</a>. The multiplications and additions are done separately. </p> <div class="fig fignone" id="example-algorithms__serial-method-to-compute-vectors-dot-product"><a name="example-algorithms__serial-method-to-compute-vectors-dot-product" shape="rect"> <!-- --></a><span class="figcap">Figure 2. Serial Method to Compute Vectors Dot Product</span>. <span class="desc figdesc">The serial method uses a simple loop with separate multiplies and adds to compute the dot product of the vectors. The final result can be represented as ((((a<sub class="ph sub">1</sub> x b<sub class="ph sub">1</sub>) + (a<sub class="ph sub">2</sub> x b<sub class="ph sub">2</sub>)) + (a<sub class="ph sub">3</sub> x b<sub class="ph sub">3</sub>)) + (a<sub class="ph sub">4</sub> x b<sub class="ph sub">4</sub>)).</span><br clear="none"></br><div class="imagecenter"><img class="image imagecenter" src="graphics/serial-method.png" alt="A figure of the serial method to compute the vector dot product using a simple loop with separate multiplies and adds."></img></div><br clear="none"></br></div> <div class="fig fignone" id="example-algorithms__fma-method-to-compute-vectors-dot-product"><a name="example-algorithms__fma-method-to-compute-vectors-dot-product" shape="rect"> <!-- --></a><span class="figcap">Figure 3. FMA Method to Compute Vector Dot Product</span>. <span class="desc figdesc">The FMA method uses a simple loop with fused multiply-adds to compute the dot product of the vectors. 
The final result can be represented as a<sub class="ph sub">4</sub> x b<sub class="ph sub">4</sub> + (a<sub class="ph sub">3</sub> x b<sub class="ph sub">3</sub> + (a<sub class="ph sub">2</sub> x b<sub class="ph sub">2</sub> + (a<sub class="ph sub">1</sub> x b<sub class="ph sub">1</sub> + 0))).</span><br clear="none"></br><div class="imagecenter"><img class="image imagecenter" src="graphics/fma-method.png" alt="A figure of the FMA method to compute the vector dot product using a simple loop with fused multiply-adds."></img></div><br clear="none"></br></div> <p class="p">A simple improvement to the algorithm is to use the fused multiply-add, which performs the multiply and the addition in one step with a single rounding and so improves accuracy. <a class="xref" href="index.html#example-algorithms__fma-method-to-compute-vectors-dot-product" title="The FMA method uses a simple loop with fused multiply-adds to compute the dot product of the vectors. The final result can be represented as a4 x b4 + (a3 x b3 + (a2 x b2 + (a1 x b1 + 0)))." shape="rect">Figure 3</a> shows this version. </p> <p class="p">Yet another way to compute the dot product is to use a divide-and-conquer strategy in which we first find the dot products of the first half and the second half of the vectors, then combine these results using addition. This is a recursive strategy; the base case is the dot product of vectors of length 1, which is a single multiply. <a class="xref" href="index.html#comparison__parallel-method-to-reduce-individual-elements-products-into-final-sum" title="The parallel method uses a tree to reduce all the products of individual elements into a final sum. The final result can be represented as ((a1 x b1) + (a2 x b2)) + ((a3 x b3) + (a4 x b4))." shape="rect">Figure 4</a> graphically illustrates this approach. We call this algorithm the parallel algorithm because the two sub-problems can be computed in parallel as they have no dependencies. 
The algorithm does not require a parallel implementation, however; it can still be implemented with a single thread. </p> </div> </div> <div class="topic concept nested1" id="comparison"><a name="comparison" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#comparison" name="comparison" shape="rect">3.2. Comparison</a></h3> <div class="body conbody"> <p class="p">All three algorithms for computing a dot product use IEEE 754 arithmetic and can be implemented on any system that supports the IEEE standard. In fact, an implementation of the serial algorithm on multiple systems will give exactly the same result. So will implementations of the FMA or parallel algorithms. However, results computed by an implementation of the serial algorithm may differ from those computed by an implementation of the other two algorithms. </p> <div class="fig fignone" id="comparison__parallel-method-to-reduce-individual-elements-products-into-final-sum"><a name="comparison__parallel-method-to-reduce-individual-elements-products-into-final-sum" shape="rect"> <!-- --></a><span class="figcap">Figure 4. The Parallel Method to Reduce Individual Elements Products into a Final Sum</span>. <span class="desc figdesc">The parallel method uses a tree to reduce all the products of individual elements into a final sum. 
The final result can be represented as ((a<sub class="ph sub">1</sub> x b<sub class="ph sub">1</sub>) + (a<sub class="ph sub">2</sub> x b<sub class="ph sub">2</sub>)) + ((a<sub class="ph sub">3</sub> x b<sub class="ph sub">3</sub>) + (a<sub class="ph sub">4</sub> x b<sub class="ph sub">4</sub>)).</span><br clear="none"></br><div class="imagecenter"><img class="image imagecenter" src="graphics/parallel-method.png" alt="A figure of the Parallel Method using a tree to reduce the products of individual elements into a final sum"></img></div><br clear="none"></br></div> <div class="fig fignone" id="comparison__algorithms-results-vs-correct-mathematical-dot-product"><a name="comparison__algorithms-results-vs-correct-mathematical-dot-product" shape="rect"> <!-- --></a><span class="figcap">Figure 5. Algorithms Results vs. the Correct Mathematical Dot Product</span>. <span class="desc figdesc">The three algorithms yield results slightly different from the correct mathematical dot product.</span><table cellpadding="4" cellspacing="0" summary="" border="1" class="simpletable"> <tr class="sthead"> <th valign="bottom" align="left" id="d54e2678" class="stentry" rowspan="1" colspan="1">method</th> <th valign="bottom" align="left" id="d54e2681" class="stentry" rowspan="1" colspan="1">result</th> <th valign="bottom" align="left" id="d54e2684" class="stentry" rowspan="1" colspan="1">float value</th> </tr> <tr class="strow"> <td valign="top" headers="d54e2678" class="stentry" rowspan="1" colspan="1">exact</td> <td valign="top" headers="d54e2681" class="stentry" rowspan="1" colspan="1">.0559587528435...</td> <td valign="top" headers="d54e2684" class="stentry" rowspan="1" colspan="1">0x3D65350158...</td> </tr> <tr class="strow"> <td valign="top" headers="d54e2678" class="stentry" rowspan="1" colspan="1">serial</td> <td valign="top" headers="d54e2681" class="stentry" rowspan="1" colspan="1">.0559588074</td> <td valign="top" headers="d54e2684" class="stentry" rowspan="1" 
colspan="1">0x3D653510</td> </tr> <tr class="strow"> <td valign="top" headers="d54e2678" class="stentry" rowspan="1" colspan="1">FMA</td> <td valign="top" headers="d54e2681" class="stentry" rowspan="1" colspan="1">.0559587515</td> <td valign="top" headers="d54e2684" class="stentry" rowspan="1" colspan="1">0x3D653501</td> </tr> <tr class="strow"> <td valign="top" headers="d54e2678" class="stentry" rowspan="1" colspan="1">parallel</td> <td valign="top" headers="d54e2681" class="stentry" rowspan="1" colspan="1">.0559587478</td> <td valign="top" headers="d54e2684" class="stentry" rowspan="1" colspan="1">0x3D653500</td> </tr> </table> </div> <p class="p">For example, consider the vectors:</p> <ul class="sl simple"> <li class="sli"> a = [1.907607, -.7862027, 1.148311, .9604002] </li> <li class="sli"> b = [-.9355000, -.6915108, 1.724470, -.7097529] </li> </ul> <p class="p">whose elements are randomly chosen values between -1 and 2. The accuracy of each algorithm corresponding to these inputs is shown in <a class="xref" href="index.html#comparison__algorithms-results-vs-correct-mathematical-dot-product" title="The three algorithms yield results slightly different from the correct mathematical dot product." shape="rect">Figure 5</a>. </p> <p class="p">The main points to notice from the table are that each algorithm yields a different result, and they are all slightly different from the correct mathematical dot product. In this example the FMA version is the most accurate, and the parallel algorithm is more accurate than the serial algorithm. In our experience these results are typical; fused multiply-add significantly increases the accuracy of results, and parallel tree reductions for summation are usually much more accurate than serial summation. 
</p> </div> </div> </div> <div class="topic concept nested0" id="cuda-and-floating-point"><a name="cuda-and-floating-point" shape="rect"> <!-- --></a><h2 class="title topictitle1"><a href="#cuda-and-floating-point" name="cuda-and-floating-point" shape="rect">4. CUDA and Floating Point</a></h2> <div class="body conbody"> <p class="p">NVIDIA has extended the capabilities of GPUs with each successive hardware generation. Current generations of the NVIDIA architecture such as <dfn class="term">Tesla C2xxx</dfn>, <dfn class="term">GTX 4xx</dfn>, and <dfn class="term">GTX 5xx</dfn>, support both single and double precision with <dfn class="term">IEEE 754</dfn> precision and include hardware support for fused multiply-add in both single and double precision. Older NVIDIA architectures support some of these features but not others. In CUDA, the features supported by the GPU are encoded in the <dfn class="term">compute capability</dfn> number. The runtime library supports a function call to determine the compute capability of a GPU at runtime; the <cite class="cite">CUDA C Programming Guide</cite> also includes a table of compute capabilities for many different devices <a class="xref" href="index.html#references__7" shape="rect">[7]</a>. </p> </div> <div class="topic concept nested1" id="compute-capability-1-2-and-below"><a name="compute-capability-1-2-and-below" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#compute-capability-1-2-and-below" name="compute-capability-1-2-and-below" shape="rect">4.1. Compute Capability 1.2 and Below</a></h3> <div class="body conbody"> <p class="p">Devices with compute capability <em class="ph i">1.2 and below</em> support single precision only. In addition, not all operations in single precision on these GPUs are <dfn class="term">IEEE 754</dfn> accurate. Denormal numbers (small numbers close to zero) are flushed to zero. 
Operations such as square root and division may not always result in the floating point value closest to the correct mathematical value. </p> </div> </div> <div class="topic concept nested1" id="compute-capability-1-3"><a name="compute-capability-1-3" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#compute-capability-1-3" name="compute-capability-1-3" shape="rect">4.2. Compute Capability 1.3</a></h3> <div class="body conbody"> <p class="p">Devices with compute capability <em class="ph i">1.3</em> support both single and double precision floating point computation. Double precision operations are always <dfn class="term">IEEE 754</dfn> accurate. Single precision in devices of compute capability 1.3 is unchanged from previous compute capabilities. </p> <p class="p">In addition, the double precision hardware offers fused multiply-add. As described in <a class="xref" href="index.html#fused-multiply-add-fma" shape="rect">Section 2.3</a>, the fused multiply-add operation is faster and more accurate than separate multiplies and additions. There is no single precision fused multiply-add operation in compute capability 1.3. </p> </div> </div> <div class="topic concept nested1" id="compute-capability-2-0-and-above"><a name="compute-capability-2-0-and-above" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#compute-capability-2-0-and-above" name="compute-capability-2-0-and-above" shape="rect">4.3. Compute Capability 2.0 and Above</a></h3> <div class="body conbody"> <p class="p">Devices with compute capability <em class="ph i">2.0 and above</em> support both single and double precision <dfn class="term">IEEE 754</dfn> including fused multiply-add in both single and double precision. Operations such as square root and division will result in the floating point value closest to the correct mathematical result in both single and double precision, by default. 
</p> </div> </div> <div class="topic concept nested1" id="rounding-modes"><a name="rounding-modes" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#rounding-modes" name="rounding-modes" shape="rect">4.4. Rounding Modes</a></h3> <div class="body conbody"> <p class="p">The <dfn class="term">IEEE 754</dfn> standard defines four rounding modes: round-to-nearest, round towards positive, round towards negative, and round towards zero. CUDA supports all four modes. By default, operations use round-to-nearest. Compiler intrinsics like the ones listed in the tables below can be used to select other rounding modes for individual operations. </p> <p class="p"></p> <p class="p"></p> <p class="p"></p> <p class="p"></p> <div class="tablenoborder"> <table cellpadding="4" cellspacing="0" summary="" class="table" frame="border" border="1" rules="all"> <thead class="thead" align="left"> <tr class="row"> <th class="entry" align="right" valign="top" width="16.666666666666664%" id="d54e2894" rowspan="1" colspan="1">mode</th> <th class="entry" valign="top" width="83.33333333333334%" id="d54e2897" rowspan="1" colspan="1">interpretation</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" align="right" valign="top" width="16.666666666666664%" headers="d54e2894" rowspan="1" colspan="1">rn</td> <td class="entry" valign="top" width="83.33333333333334%" headers="d54e2897" rowspan="1" colspan="1">round to nearest, ties to even</td> </tr> <tr class="row"> <td class="entry" align="right" valign="top" width="16.666666666666664%" headers="d54e2894" rowspan="1" colspan="1">rz</td> <td class="entry" valign="top" width="83.33333333333334%" headers="d54e2897" rowspan="1" colspan="1">round towards zero</td> </tr> <tr class="row"> <td class="entry" align="right" valign="top" width="16.666666666666664%" headers="d54e2894" rowspan="1" colspan="1">ru</td> <td class="entry" valign="top" width="83.33333333333334%" headers="d54e2897" rowspan="1" colspan="1">round 
towards <math xmlns="http://www.w3.org/1998/Math/MathML"> <mo>+</mo> <mtext mathvariant="normal" mathsize="big">∞</mtext> </math> </td> </tr> <tr class="row"> <td class="entry" align="right" valign="top" width="16.666666666666664%" headers="d54e2894" rowspan="1" colspan="1">rd</td> <td class="entry" valign="top" width="83.33333333333334%" headers="d54e2897" rowspan="1" colspan="1">round towards <math xmlns="http://www.w3.org/1998/Math/MathML"> <mo>−</mo> <mtext mathvariant="normal" mathsize="big">∞</mtext> </math> </td> </tr> </tbody> </table> </div> <p class="p"></p> <p class="p"></p> <p class="p"></p> <p class="p"></p> <div class="tablenoborder"> <table cellpadding="4" cellspacing="0" summary="" class="table" frame="border" border="1" rules="all"> <tbody class="tbody"> <tr class="row"> <td class="entry" align="right" valign="top" width="83.33333333333334%" rowspan="1" colspan="1"> <p class="p"><samp class="ph codeph">x + y</samp></p> <p class="p"><samp class="ph codeph">__fadd_[rn | rz | ru | rd] (x, y)</samp></p> </td> <td class="entry" valign="top" width="16.666666666666664%" rowspan="1" colspan="1"> <p class="p">addition</p> </td> </tr> <tr class="row"> <td class="entry" align="right" valign="top" width="83.33333333333334%" rowspan="1" colspan="1"> <p class="p"><samp class="ph codeph">x * y</samp></p> <p class="p"><samp class="ph codeph">__fmul_[rn | rz | ru | rd] (x, y)</samp></p> </td> <td class="entry" valign="top" width="16.666666666666664%" rowspan="1" colspan="1"> <p class="p">multiplication</p> </td> </tr> <tr class="row"> <td class="entry" align="right" valign="top" width="83.33333333333334%" rowspan="1" colspan="1"> <p class="p"><samp class="ph codeph">fmaf (x, y, z)</samp></p> <p class="p"><samp class="ph codeph">__fmaf_[rn | rz | ru | rd] (x, y, z)</samp></p> </td> <td class="entry" valign="top" width="16.666666666666664%" rowspan="1" colspan="1"> <p class="p">FMA</p> </td> </tr> <tr class="row"> <td class="entry" align="right" valign="top" 
width="83.33333333333334%" rowspan="1" colspan="1"> <p class="p"><samp class="ph codeph">1.0f / x</samp></p> <p class="p"><samp class="ph codeph">__frcp_[rn | rz | ru | rd] (x)</samp></p> </td> <td class="entry" valign="top" width="16.666666666666664%" rowspan="1" colspan="1"> <p class="p">reciprocal</p> </td> </tr> <tr class="row"> <td class="entry" align="right" valign="top" width="83.33333333333334%" rowspan="1" colspan="1"> <p class="p"><samp class="ph codeph">x / y</samp></p> <p class="p"><samp class="ph codeph">__fdiv_[rn | rz | ru | rd] (x, y)</samp></p> </td> <td class="entry" valign="top" width="16.666666666666664%" rowspan="1" colspan="1"> <p class="p">division</p> </td> </tr> <tr class="row"> <td class="entry" align="right" valign="top" width="83.33333333333334%" rowspan="1" colspan="1"> <p class="p"><samp class="ph codeph">sqrtf(x)</samp></p> <p class="p"><samp class="ph codeph">__fsqrt_[rn | rz | ru | rd] (x)</samp></p> </td> <td class="entry" valign="top" width="16.666666666666664%" rowspan="1" colspan="1"> <p class="p">square root</p> </td> </tr> </tbody> </table> </div> <p class="p"></p> <p class="p"></p> <p class="p"></p> <p class="p"></p> <div class="tablenoborder"> <table cellpadding="4" cellspacing="0" summary="" class="table" frame="border" border="1" rules="all"> <tbody class="tbody"> <tr class="row"> <td class="entry" align="right" valign="top" width="83.33333333333334%" rowspan="1" colspan="1"> <p class="p"><samp class="ph codeph">x + y</samp></p> <p class="p"><samp class="ph codeph">__dadd_[rn | rz | ru | rd] (x, y)</samp></p> </td> <td class="entry" valign="top" width="16.666666666666664%" rowspan="1" colspan="1"> <p class="p">addition</p> </td> </tr> <tr class="row"> <td class="entry" align="right" valign="top" width="83.33333333333334%" rowspan="1" colspan="1"> <p class="p"><samp class="ph codeph">x * y</samp></p> <p class="p"><samp class="ph codeph">__dmul_[rn | rz | ru | rd] (x, y)</samp></p> </td> <td class="entry" valign="top" 
width="16.666666666666664%" rowspan="1" colspan="1"> <p class="p">multiplication</p> </td> </tr> <tr class="row"> <td class="entry" align="right" valign="top" width="83.33333333333334%" rowspan="1" colspan="1"> <p class="p"><samp class="ph codeph">fma (x, y, z)</samp></p> <p class="p"><samp class="ph codeph">__fma_[rn | rz | ru | rd] (x, y, z)</samp></p> </td> <td class="entry" valign="top" width="16.666666666666664%" rowspan="1" colspan="1"> <p class="p">FMA</p> </td> </tr> <tr class="row"> <td class="entry" align="right" valign="top" width="83.33333333333334%" rowspan="1" colspan="1"> <p class="p"><samp class="ph codeph">1.0 / x</samp></p> <p class="p"><samp class="ph codeph">__drcp_[rn | rz | ru | rd] (x)</samp></p> </td> <td class="entry" valign="top" width="16.666666666666664%" rowspan="1" colspan="1"> <p class="p">reciprocal</p> </td> </tr> <tr class="row"> <td class="entry" align="right" valign="top" width="83.33333333333334%" rowspan="1" colspan="1"> <p class="p"><samp class="ph codeph">x / y</samp></p> <p class="p"><samp class="ph codeph">__ddiv_[rn | rz | ru | rd] (x, y)</samp></p> </td> <td class="entry" valign="top" width="16.666666666666664%" rowspan="1" colspan="1"> <p class="p">division</p> </td> </tr> <tr class="row"> <td class="entry" align="right" valign="top" width="83.33333333333334%" rowspan="1" colspan="1"> <p class="p"><samp class="ph codeph">sqrt(x)</samp></p> <p class="p"><samp class="ph codeph">__dsqrt_[rn | rz | ru | rd] (x)</samp></p> </td> <td class="entry" valign="top" width="16.666666666666664%" rowspan="1" colspan="1"> <p class="p">square root</p> </td> </tr> </tbody> </table> </div> </div> </div> <div class="topic concept nested1" id="controlling-fused-multiply-add"><a name="controlling-fused-multiply-add" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#controlling-fused-multiply-add" name="controlling-fused-multiply-add" shape="rect">4.5. 
Controlling Fused Multiply-add</a></h3> <div class="body conbody"> <p class="p">In general, the fused multiply-add operation is faster and more accurate than performing separate multiply and add operations. However, on occasion you may wish to <em class="ph i">disable</em> the merging of multiplies and adds into fused multiply-add instructions. To inhibit this optimization one can write the multiplies and additions using intrinsics with explicit rounding mode as shown in the previous tables. Operations written directly as intrinsics are guaranteed to remain independent and will not be merged into fused multiply-add instructions. With CUDA Fortran it is possible to disable FMA merging via a compiler flag. </p> </div> </div> <div class="topic concept nested1" id="compiler-flags"><a name="compiler-flags" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#compiler-flags" name="compiler-flags" shape="rect">4.6. Compiler Flags</a></h3> <div class="body conbody"> <p class="p">Compiler flags relevant to <dfn class="term">IEEE 754</dfn> operations are <samp class="ph codeph">-ftz={true|false}</samp>, <samp class="ph codeph">-prec-div={true|false}</samp>, and <samp class="ph codeph">-prec-sqrt={true|false}</samp>. These flags control single precision operations on devices of compute capability of 2.0 or later. 
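</p> <p class="p">As a host-side illustration of what flushing denormals to zero means for a single value (the behavior selected by <samp class="ph codeph">-ftz=true</samp>), the following C sketch replaces a denormal input with zero. The helper <samp class="ph codeph">flush_to_zero</samp> is hypothetical and is not a CUDA API. Keeping the sign of the zero is an assumption of this sketch, and the sketch does not model how the hardware applies flushing to the operands and results of each operation. </p>

```c
#include <math.h>

/* Hypothetical helper: emulate flush-to-zero on the host by replacing a
 * denormal (subnormal) float with a zero of the same sign. Every other
 * value, including NaN and infinity, is returned unchanged. */
float flush_to_zero(float x)
{
    if (fpclassify(x) == FP_SUBNORMAL)
        return copysignf(0.0f, x); /* assumption: the sign is preserved */
    return x;
}
```

<p class="p">For example, <samp class="ph codeph">flush_to_zero(ldexpf(1.0f, -130))</samp> returns <samp class="ph codeph">0.0f</samp>, since 2<sup class="ph sup">-130</sup> lies below the smallest normal single precision value 2<sup class="ph sup">-126</sup>, while <samp class="ph codeph">flush_to_zero(1.5f)</samp> returns <samp class="ph codeph">1.5f</samp> unchanged. 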
</p> <div class="tablenoborder"> <table cellpadding="4" cellspacing="0" summary="" class="table" frame="border" border="1" rules="all"> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="50%" id="d54e3275" rowspan="1" colspan="1">mode</th> <th class="entry" valign="top" width="50%" id="d54e3278" rowspan="1" colspan="1">flags</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e3275" rowspan="1" colspan="1"> <p class="p">IEEE 754 mode (default)</p> </td> <td class="entry" valign="top" width="50%" headers="d54e3278" rowspan="1" colspan="1"> <p class="p">-ftz=false</p> <p class="p">-prec-div=true</p> <p class="p">-prec-sqrt=true</p> </td> </tr> <tr class="row"> <td class="entry" valign="top" width="50%" headers="d54e3275" rowspan="1" colspan="1"> <p class="p">fast mode</p> </td> <td class="entry" valign="top" width="50%" headers="d54e3278" rowspan="1" colspan="1"> <p class="p">-ftz=true</p> <p class="p">-prec-div=false</p> <p class="p">-prec-sqrt=false</p> </td> </tr> </tbody> </table> </div> <p class="p">The default <dfn class="term">IEEE 754 mode</dfn> means that single precision operations are correctly rounded and support denormals, as per the IEEE 754 standard. In the <dfn class="term">fast mode</dfn> denormal numbers are flushed to zero, and the operations division and square root are not computed to the nearest floating point value. The flags have no effect on double precision or on devices of compute capability below 2.0. </p> </div> </div> <div class="topic concept nested1" id="differences-from-x86"><a name="differences-from-x86" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#differences-from-x86" name="differences-from-x86" shape="rect">4.7. 
Differences from x86</a></h3> <div class="body conbody"> <p class="p">NVIDIA GPUs differ from the x86 architecture in that rounding modes are encoded within each floating point instruction instead of dynamically using a floating point control word. Trap handlers for floating point exceptions are not supported. On the GPU there is no status flag to indicate when calculations have overflowed, underflowed, or have involved inexact arithmetic. Like <dfn class="term">SSE</dfn>, the precision of each GPU operation is encoded in the instruction (for x87 the precision is controlled dynamically by the floating point control word). </p> </div> </div> </div> <div class="topic concept nested0" id="considerations-for-heterogeneous-world"><a name="considerations-for-heterogeneous-world" shape="rect"> <!-- --></a><h2 class="title topictitle1"><a href="#considerations-for-heterogeneous-world" name="considerations-for-heterogeneous-world" shape="rect">5. Considerations for a Heterogeneous World</a></h2> <div class="topic concept nested1" id="mathematical-function-accuracy"><a name="mathematical-function-accuracy" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#mathematical-function-accuracy" name="mathematical-function-accuracy" shape="rect">5.1. Mathematical Function Accuracy</a></h3> <div class="body conbody"> <p class="p">So far we have only considered simple math operations such as addition, multiplication, division, and square root. These operations are simple enough that computing the best floating point result (e.g., the closest in round-to-nearest) is reasonable. For other mathematical operations computing the best floating point result is harder. </p> <p class="p">The problem is called the <dfn class="term">table maker's dilemma</dfn>. To guarantee the correctly rounded result, it is not generally enough to compute the function to a fixed high accuracy. 
There might still be rare cases where the error in the high accuracy result affects the rounding step at the lower accuracy. </p> <p class="p">It is possible to solve the dilemma for particular functions by doing mathematical analysis and formal proofs <a class="xref" href="index.html#references__4" shape="rect">[4]</a>, but most math libraries choose instead to give up the guarantee of correct rounding. Instead they provide implementations of math functions and document bounds on the relative error of the functions over the input range. For example, the double precision <samp class="ph codeph">sin</samp> function in CUDA is guaranteed to be accurate to within 2 units in the last place (ulp) of the correctly rounded result. In other words, the difference between the computed result and the mathematical result is at most ±2 with respect to the least significant bit position of the fraction part of the floating point result. </p> <p class="p">For most inputs the <samp class="ph codeph">sin</samp> function produces the correctly rounded result. Results can still differ, even on a single system, when mixing precisions, libraries and hardware. Take for example the C code sequence shown in <a class="xref" href="index.html#mathematical-function-accuracy__cosine-computation-using-glibc-math-library-when-compiled-with-m32-and-m64" title="The computation of cosine using the glibc Math Library yields different results when compiled with -m32 and -m64." shape="rect">Figure 6</a>. We compiled the code sequence on a 64-bit x86 platform using gcc version 4.4.3 (Ubuntu 4.3.3-4ubuntu5). </p> <p class="p">This shows that the result of computing cos(5992555.0) using a common library differs depending on whether the code is compiled in 32-bit mode or 64-bit mode. </p> <p class="p">The consequence is that different math libraries cannot be expected to compute exactly the same result for a given input. This applies to GPU programming as well. 
Functions compiled for the GPU will use the NVIDIA CUDA math library implementation while functions compiled for the CPU will use the host compiler math library implementation (e.g., <dfn class="term">glibc</dfn> on Linux). Because these implementations are independent and neither is guaranteed to be correctly rounded, the results will often differ slightly. </p> <div class="fig fignone" id="mathematical-function-accuracy__cosine-computation-using-glibc-math-library-when-compiled-with-m32-and-m64"><a name="mathematical-function-accuracy__cosine-computation-using-glibc-math-library-when-compiled-with-m32-and-m64" shape="rect"> <!-- --></a><span class="figcap">Figure 6. Cosine Computations using the <samp class="ph codeph">glibc</samp> Math Library</span>. <span class="desc figdesc">The computation of cosine using the <samp class="ph codeph">glibc</samp> Math Library yields different results when compiled with <samp class="ph codeph">-m32</samp> and <samp class="ph codeph">-m64</samp>.</span><pre class="pre screen" xml:space="preserve">volatile float x = 5992555.0;
printf("cos(%f): %.10g\n", x, cos(x));

gcc test.c -lm -m64
<strong class="ph b">cos(5992555.000000): 3.320904615e-07</strong>

gcc test.c -lm -m32
<strong class="ph b">cos(5992555.000000): 3.320904692e-07</strong></pre></div> </div> </div> <div class="topic concept nested1" id="x87-sse"><a name="x87-sse" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#x87-sse" name="x87-sse" shape="rect">5.2. x87 and SSE</a></h3> <div class="body conbody"> <p class="p">One of the unfortunate realities of C compilers is that they are often poor at preserving IEEE 754 semantics of floating point operations <a class="xref" href="index.html#references__6" shape="rect">[6]</a>. This can be particularly confusing on platforms that support x87 and SSE operations. 
Just like CUDA operations, SSE operations are performed on single or double precision values, while x87 operations often use an additional internal 80-bit precision format. Sometimes the results of a computation using x87 can depend on whether an intermediate result was allocated to a register or stored to memory. Values stored to memory are rounded to the declared precision (e.g., single precision for <samp class="ph codeph">float</samp> and double precision for <samp class="ph codeph">double</samp>). Values kept in registers can remain in extended precision. Also, x87 instructions will often be used by default for 32-bit compiles but SSE instructions will be used by default for 64-bit compiles. </p> <p class="p">Because of these issues, guaranteeing a specific precision level on the CPU can sometimes be tricky. When comparing CPU results to results computed on the GPU, it is generally best to compare using SSE instructions. SSE instructions follow IEEE 754 for single and double precision. </p> <p class="p">On 32-bit x86 targets without SSE it can be helpful to declare variables using <samp class="ph codeph">volatile</samp> and force floating point values to be stored to memory (<samp class="ph codeph">/Op</samp> in Visual Studio and <samp class="ph codeph">-ffloat-store</samp> in <samp class="ph codeph">gcc</samp>). This moves results from extended precision registers into memory, where values are held in exactly single or double precision. Alternatively, the x87 control word can be updated to set the precision to 24 or 53 bits using the assembly instruction <samp class="ph codeph">fldcw</samp> or a compiler option such as <samp class="ph codeph">-mpc32</samp> or <samp class="ph codeph">-mpc64</samp> in <samp class="ph codeph">gcc</samp>. </p> </div> </div> <div class="topic concept nested1" id="core-counts"><a name="core-counts" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#core-counts" name="core-counts" shape="rect">5.3. 
Core Counts</a></h3> <div class="body conbody"> <p class="p">As we have shown in <a class="xref" href="index.html#dot-product-accuracy-example" shape="rect">Chapter 3</a>, the final values computed using <dfn class="term">IEEE 754</dfn> arithmetic can depend on implementation choices such as whether to use fused multiply-add or whether additions are organized in series or parallel. These differences affect computation on the CPU and on the GPU. </p> <p class="p">One way such differences can arise is from differences between the number of concurrent threads involved in a computation. On the GPU, a common design pattern is to have all threads in a block coordinate to do a parallel reduction on data within the block, followed by a serial reduction of the results from each block. Changing the number of threads per block reorganizes the reduction; if the reduction is addition, then the change rearranges parentheses in the long string of additions. </p> <p class="p">Even if the same general strategy such as parallel reduction is used on the CPU and GPU, it is common to have widely different numbers of threads on the GPU compared to the CPU. For example, the GPU implementation might launch blocks with 128 threads per block, while the CPU implementation might use 4 threads in total. </p> </div> </div> <div class="topic concept nested1" id="verifying-gpu-results"><a name="verifying-gpu-results" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#verifying-gpu-results" name="verifying-gpu-results" shape="rect">5.4. Verifying GPU Results</a></h3> <div class="body conbody"> <p class="p">The same inputs will give the same results for individual <dfn class="term">IEEE 754</dfn> operations to a given precision on the CPU and GPU. As we have explained, there are many reasons why the same sequence of operations may not be performed on the CPU and GPU. The GPU has fused multiply-add while the CPU does not. 
Parallelizing algorithms may rearrange operations, yielding different numeric results. The CPU may be computing results in a precision higher than expected. Finally, many common mathematical functions are not required by the IEEE 754 standard to be correctly rounded so should not be expected to yield identical results between implementations. </p> <p class="p">When porting numeric code from the CPU to the GPU of course it makes sense to use the x86 CPU results as a reference. But differences between the CPU result and GPU result must be interpreted carefully. Differences are not automatically evidence that the result computed by the GPU is wrong or that there is a problem on the GPU. </p> <p class="p">Computing results in a high precision and then comparing to results computed in a lower precision can be helpful to see if the lower precision is adequate for a particular application. However, rounding high precision results to a lower precision is not equivalent to performing the entire computation in lower precision. This can sometimes be a problem when using x87 and comparing results against the GPU. The results of the CPU may be computed to an unexpectedly high extended precision for some or all of the operations. The GPU result will be computed using single or double precision only. </p> </div> </div> </div> <div class="topic concept nested0" id="concrete-recommendations"><a name="concrete-recommendations" shape="rect"> <!-- --></a><h2 class="title topictitle1"><a href="#concrete-recommendations" name="concrete-recommendations" shape="rect">6. Concrete Recommendations</a></h2> <div class="body conbody"> <p class="p">The key points we have covered are the following:</p> <dl class="dl"> <dt class="dt dlterm">Use the fused multiply-add operator.</dt> <dd class="dd">The fused multiply-add operator on the GPU has high performance and increases the accuracy of computations. No special flags or function calls are needed to gain this benefit in CUDA programs. 
Understand that a hardware fused multiply-add operation is not yet available on the CPU, which can cause differences in numerical results. </dd> <dt class="dt dlterm">Compare results carefully.</dt> <dd class="dd">Even in the strict world of <dfn class="term">IEEE 754</dfn> operations, minor details such as organization of parentheses or thread counts can affect the final result. Take this into account when doing comparisons between implementations. </dd> <dt class="dt dlterm">Know the capabilities of your GPU.</dt> <dd class="dd">The numerical capabilities are encoded in the compute capability number of your GPU. Devices of compute capability 2.0 and later are capable of single and double precision arithmetic following the IEEE 754 standard, and have hardware units for performing fused multiply-add in both single and double precision. </dd> <dt class="dt dlterm">Take advantage of the CUDA math library functions.</dt> <dd class="dd">These functions are documented in Appendix C of the <cite class="cite">CUDA C Programming Guide </cite><a class="xref" href="index.html#references__7" shape="rect">[7]</a>. The math library includes all the math functions listed in the C99 standard <a class="xref" href="index.html#references__3" shape="rect">[3]</a> plus some additional useful functions. These functions have been tuned for a reasonable compromise between performance and accuracy. </dd> <dd class="dd">We constantly strive to improve the quality of our math library functionality. Please let us know about any functions that you require that we do not provide, or if the accuracy or performance of any of our functions does not meet your needs. Leave comments in the <cite class="cite">NVIDIA CUDA forum</cite><a name="fnsrc_1" href="#fntarg_1" shape="rect"><sup>1</sup></a> or join the <cite class="cite">Registered Developer Program</cite><a name="fnsrc_2" href="#fntarg_2" shape="rect"><sup>2</sup></a> and file a bug with your feedback. 
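<p class="p">The effect of regrouping a sum, as described under "Compare results carefully" and in the discussion of core counts, can be reproduced on the host alone. The following is a minimal sketch in plain C rather than a CUDA kernel. The function name blocked_sum and the input values used below are illustrative assumptions, not part of the CUDA Toolkit. The function sums each chunk of block elements serially and then adds the per-chunk partial sums serially, so changing block only moves the parentheses in the sum. It assumes float arithmetic is evaluated in single precision (for example with SSE rather than x87).</p>

```c
/* Illustrative helper (not a CUDA API): sum x[0..n-1] in chunks of
   `block` elements. Each chunk is summed serially, then the chunk
   totals are added serially. Assumes block >= 1 and that float
   arithmetic is evaluated in single precision (e.g. SSE, not x87). */
float blocked_sum(const float *x, int n, int block)
{
    float total = 0.0f;
    for (int start = 0; n > start; start += block) {
        /* Clamp the last chunk to the end of the array. */
        int end = (start + block > n) ? n : start + block;
        float partial = 0.0f;              /* sum of one chunk */
        for (int i = start; i != end; ++i)
            partial += x[i];
        total += partial;                  /* combine chunk totals */
    }
    return total;
}
```

<p class="p">For the input {1.0e8f, 1.0f, -1.0e8f, 1.0f}, whose exact sum is 2, blocked_sum(x, 4, 4) evaluates ((1.0e8 + 1) - 1.0e8) + 1 and returns 1, while blocked_sum(x, 4, 2) evaluates (1.0e8 + 1) + (-1.0e8 + 1) and returns 0, because 1 is smaller than half the spacing between single precision values near 1.0e8. Neither result indicates a fault; the two groupings simply round differently.</p>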
</dd> </dl> </div> </div> <div class="topic reference nested0" id="acknowledgements"><a name="acknowledgements" shape="rect"> <!-- --></a><h2 class="title topictitle1"><a href="#acknowledgements" name="acknowledgements" shape="rect">A. Acknowledgements</a></h2> <div class="body refbody"> <div class="section"> <p class="p">This paper was authored by Nathan Whitehead and Alex Fit-Florea for NVIDIA Corporation. </p> <p class="p">Thanks to Ujval Kapasi, Kurt Wall, Paul Sidenblad, Massimiliano Fatica, Everett Phillips, Norbert Juffa, and Will Ramey for their helpful comments and suggestions. </p> <p class="p">Permission to make digital or hard copies of all or part of this work for any use is granted without fee provided that copies bear this notice and the full citation on the first page. </p> </div> </div> </div> <div class="topic reference nested0" id="references"><a name="references" shape="rect"> <!-- --></a><h2 class="title topictitle1"><a href="#references" name="references" shape="rect">B. References</a></h2> <div class="body refbody"> <div class="section" id="references__1"><a name="references__1" shape="rect"> <!-- --></a><p class="p">[1] <cite class="cite">ANSI/IEEE 754-1985. American National Standard - IEEE Standard for Binary Floating-Point Arithmetic. American National Standards Institute, Inc., New York, 1985.</cite></p> </div> <div class="section" id="references__2"><a name="references__2" shape="rect"> <!-- --></a><p class="p">[2] <cite class="cite">IEEE 754-2008. IEEE 754–2008 Standard for Floating-Point Arithmetic. August 2008.</cite></p> </div> <div class="section" id="references__3"><a name="references__3" shape="rect"> <!-- --></a><p class="p">[3] <cite class="cite">ISO/IEC 9899:1999(E). Programming languages - C. 
American National Standards Institute, Inc., New York, 1999.</cite></p> </div> <div class="section" id="references__4"><a name="references__4" shape="rect"> <!-- --></a><p class="p">[4] <cite class="cite">Catherine Daramy-Loirat, David Defour, Florent de Dinechin, Matthieu Gallet, Nicolas Gast, and Jean-Michel Muller. CR-LIBM: A library of correctly rounded elementary functions in double-precision, February 2005.</cite></p> </div> <div class="section" id="references__5"><a name="references__5" shape="rect"> <!-- --></a><p class="p">[5] <cite class="cite">David Goldberg. What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys, March 1991.</cite> Edited reprint available at: <a class="xref" href="http://download.oracle.com/docs/cd/E19957-01/806-3568/ncg_goldberg.html" target="_blank" shape="rect">http://download.oracle.com/docs/cd/E19957-01/806-3568/ncg_goldberg.html</a>. </p> </div> <div class="section" id="references__6"><a name="references__6" shape="rect"> <!-- --></a><p class="p">[6] <cite class="cite">David Monniaux. The pitfalls of verifying floating-point computations. ACM Transactions on Programming Languages and Systems, May 2008.</cite></p> </div> <div class="section" id="references__7"><a name="references__7" shape="rect"> <!-- --></a><p class="p">[7] <cite class="cite">NVIDIA. 
CUDA C Programming Guide Version 4.0, 2011.</cite></p> </div> </div> </div> <div class="topic concept nested0" id="notices-header"><a name="notices-header" shape="rect"> <!-- --></a><h2 class="title topictitle1"><a href="#notices-header" name="notices-header" shape="rect">Notices</a></h2> <div class="topic reference nested1" id="notice"><a name="notice" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#notice" name="notice" shape="rect"></a></h3> <div class="body refbody"> <div class="section"> <h3 class="title sectiontitle">Notice</h3> <p class="p">ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. </p> <p class="p">Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation. 
</p> </div> </div> </div> <div class="topic reference nested1" id="trademarks"><a name="trademarks" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#trademarks" name="trademarks" shape="rect"></a></h3> <div class="body refbody"> <div class="section"> <h3 class="title sectiontitle">Trademarks</h3> <p class="p">NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated. </p> </div> </div> </div> <div class="topic reference nested1" id="copyright-past-to-present"><a name="copyright-past-to-present" shape="rect"> <!-- --></a><h3 class="title topictitle2"><a href="#copyright-past-to-present" name="copyright-past-to-present" shape="rect"></a></h3> <div class="body refbody"> <div class="section"> <h3 class="title sectiontitle">Copyright</h3> <p class="p">© <span class="ph">2011</span>-<span class="ph">2013</span> NVIDIA Corporation. All rights reserved. 
</p> </div> </div> </div> </div> <div class="fn"><a name="fntarg_1" href="#fnsrc_1" shape="rect"><sup>1</sup></a><a class="xref" href="http://forums.nvidia.com/index.php?showforum=62" target="_blank" shape="rect">http://forums.nvidia.com/index.php?showforum=62</a></div> <div class="fn"><a name="fntarg_2" href="#fnsrc_2" shape="rect"><sup>2</sup></a><a class="xref" href="http://developer.nvidia.com/join-nvidia-registered-developer-program" target="_blank" shape="rect">http://developer.nvidia.com/</a><a class="xref" href="http://developer.nvidia.com/join-nvidia-registered-developer-program" target="_blank" shape="rect">join-nvidia-registered-developer-program</a></div> <hr id="contents-end"></hr> </article> <nav id="site-nav"> <div class="category"><span class="twiddle">▼</span><a href="index.html" title="Floating Point and IEEE 754">Floating Point and IEEE 754</a></div> <ul> <li><a href="#introduction">1. Introduction</a></li> <li><a href="#floating-point">2. Floating Point</a><ul> <li><a href="#formats">2.1. Formats</a></li> <li><a href="#operations-and-accuracy">2.2. Operations and Accuracy</a></li> <li><a href="#fused-multiply-add-fma">2.3. The Fused Multiply-Add (FMA)</a></li> </ul> </li> <li><a href="#dot-product-accuracy-example">3. Dot Product: An Accuracy Example</a><ul> <li><a href="#example-algorithms">3.1. Example Algorithms</a></li> <li><a href="#comparison">3.2. Comparison</a></li> </ul> </li> <li><a href="#cuda-and-floating-point">4. CUDA and Floating Point</a><ul> <li><a href="#compute-capability-1-2-and-below">4.1. Compute Capability 1.2 and Below</a></li> <li><a href="#compute-capability-1-3">4.2. Compute Capability 1.3</a></li> <li><a href="#compute-capability-2-0-and-above">4.3. Compute Capability 2.0 and Above</a></li> <li><a href="#rounding-modes">4.4. Rounding Modes</a></li> <li><a href="#controlling-fused-multiply-add">4.5. Controlling Fused Multiply-add</a></li> <li><a href="#compiler-flags">4.6. Compiler Flags</a></li> <li><a href="#differences-from-x86">4.7. Differences from x86</a></li> </ul> </li> <li><a href="#considerations-for-heterogeneous-world">5. Considerations for a Heterogeneous World</a><ul> <li><a href="#mathematical-function-accuracy">5.1. Mathematical Function Accuracy</a></li> <li><a href="#x87-sse">5.2. x87 and SSE</a></li> <li><a href="#core-counts">5.3. Core Counts</a></li> <li><a href="#verifying-gpu-results">5.4. Verifying GPU Results</a></li> </ul> </li> <li><a href="#concrete-recommendations">6. Concrete Recommendations</a></li> <li><a href="#acknowledgements">A. Acknowledgements</a></li> <li><a href="#references">B. References</a></li> </ul> </nav> </body> </html>