Numerical Precision in ONNX and AI Inference
Open Neural Network Exchange (ONNX) is an open standard format for representing machine learning models and neural network computations across different frameworks and hardware. As models are exported and deployed via ONNX, the numerical precision of computations becomes critical. Deep learning inference involves a variety of floating-point operations, and small numerical discrepancies can accumulate and affect model accuracy or reproducibility. This article provides an in-depth overview of floating-point precision issues in numerical computing, with a focus on how they manifest in AI inference and the ONNX ecosystem. We cover floating-point formats and rounding behavior, accumulation strategies for summations, the use of low-precision formats (FP16, BFloat16, FP8) in modern AI, and how ONNX defines (or leaves unspecified) the precision of accumulators in operations. Strategies for maintaining numerical stability in deep neural networks are discussed, along with the trade-offs between computational performance and arithmetic accuracy. The goal is to give engineers and researchers a comprehensive, encyclopedic reference on numerical precision in ONNX-based AI inference systems. [1]
Floating-Point Precision in Numerical Computing
2.1. IEEE 754 Floating-Point Formats
Modern computers represent real numbers using the IEEE 754 floating-point standard, which defines formats like 32-bit single precision (FP32) and 64-bit double precision (FP64). In IEEE 754 binary formats, a floating-point number is composed of three fields: 1 sign bit, several exponent bits (with a bias), and several fraction (mantissa) bits. For example, FP32 consists of 1 sign bit, 8 exponent bits, and 23 fraction bits (24 bits of significand precision including the implicit leading 1). This provides roughly 7 decimal digits of precision and an exponent range allowing values on the order of 10^±38. FP64 allocates 1 sign bit, 11 exponent bits, and 52 fraction bits (53-bit precision), yielding about 15–16 decimal digits of precision and a vastly larger range (~10^±308). These larger formats (especially FP64) are the "gold standard" for numerical accuracy in scientific computing, as they reduce round-off error in complex calculations. However, the increased precision comes at the cost of doubled storage and typically reduced computational throughput (e.g. many GPUs execute FP64 instructions at a much lower rate than FP32). [2]
In addition to single and double precision, the IEEE 754-2008 standard introduced a 16-bit half precision format (FP16, or binary16). FP16 has 1 sign bit, 5 exponent bits, and 10 fraction bits (11-bit significand including the implicit 1). This format covers a much smaller dynamic range (approximately 6.1 × 10^-5 to 6.5 × 10^4 for normalized values) and offers only around 3 decimal digits of precision. Half precision was originally used in graphics, but it has gained popularity in machine learning for its speed and memory advantages (discussed later in Section 4). Another format relevant to AI is BFloat16 (Brain Float 16), a 16-bit float with 8 exponent bits and 7 fraction bits. BFloat16 sacrifices precision (only ~7 fraction bits ≈ 2–3 decimal digits) in order to have the same exponent range as FP32. This means BFloat16 can represent very large or very small numbers (~10^±38, similar range as FP32) but with much less precision between representable values. BFloat16 was introduced for deep learning by Google (for TPUs) to allow reduced precision training without frequent overflow/underflow, relying on the wide range and accepting the coarse precision. [3]
Each floating-point format has a unit roundoff or machine precision, denoted ε. This is effectively the spacing between 1.0 and the next representable number. For FP32, ε ≈ 2^-23 ≈ 1.19 × 10^-7, corresponding to about 7–8 decimal digits of precision. In FP64, ε = 2^-52 ≈ 2.22 × 10^-16 (15–16 decimal digits). Half precision is much less precise: FP16 has ε = 2^-10 ≈ 9.76 × 10^-4, and BFloat16 has ε = 2^-7 = 7.8125 × 10^-3. These values quantify how finely the continuum of real numbers is quantized in each format. [3]
2.2. Machine Epsilon and Rounding Error
Machine epsilon (ε) is formally defined as the smallest positive number such that 1.0 + ε > 1.0 in the floating-point format. It effectively measures the precision of the representation. In FP32, ε = 2^-23 ≈ 1.19 × 10^-7, meaning any result is rounded to about 7 decimal digits. FP16’s ε = 2^-10 ≈ 9.77 × 10^-4, over 1000× larger (so only ~3 decimal digits of precision), and BFloat16’s ε = 2^-7 ≈ 7.8 × 10^-3, reflecting its mere ~8-bit precision. Rounding in IEEE 754 by default uses round-to-nearest (ties to even), which ensures the rounding error for any single operation is at most half a unit in the last place (0.5 ULP). We can model a floating-point addition as: [3]
fl(a + b) = (a + b)(1 + δ),
where |δ| ≤ ε for that format. Here ε/2 is the maximum relative error in one correctly rounded operation (also called the unit roundoff). For example, when adding two FP32 numbers, the result is accurate to within about 10^-7 of the true sum (relative), whereas in BF16, the result can deviate by up to ~7.8 × 10^-3 (nearly 0.78% relative error per operation). Thus, each arithmetic operation incurs a small rounding error (on the order of machine epsilon) that can propagate or accumulate through a sequence of calculations. [3]
Because floating-point representations have finite exponent range, we must also consider overflow and underflow. If a result’s magnitude exceeds the maximum representable value, it overflows to infinity; if it falls below the minimum normal (or minimum subnormal) value, it underflows to zero (or a denormalized tiny value). FP32 and BF16 share an exponent size (8 bits), so their range for normal numbers is roughly [10^-38, 10^38] (approx max). FP16’s 5-bit exponent allows max ~6.55 × 10^4 and min normal ~6.10 × 10^-5. In practice, underflow/overflow can occur in low-precision arithmetic if values are not appropriately scaled. For instance, summing a large number of moderately sized terms in FP16 could overflow where FP32 would not. As an example, summing on the order of 10^4 values of magnitude ~10^1 (say, adding 10,000 terms around 10 each) would produce a total ~10^5, which exceeds FP16’s max finite value ~6.5 × 10^4 and would overflow to +∞ in half precision. Underflow is usually a concern when subtracting nearly equal numbers (catastrophic cancellation yielding a tiny difference that might flush to zero in low precision). [3]
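These format parameters can be queried directly with NumPy, and the FP16 overflow scenario above is easy to reproduce (a minimal sketch; np.finfo reports the epsilon and largest finite value for each dtype):
import numpy as np

# Machine epsilon and largest finite value for each IEEE format.
for dt in (np.float64, np.float32, np.float16):
    info = np.finfo(dt)
    print(np.dtype(dt).name, "eps =", info.eps, "max =", info.max)

# 10,000 terms of magnitude ~10 have a true total of ~1e5, which is not
# representable in FP16 (largest finite value ~6.5e4), so the cast overflows.
values = np.full(10_000, 10.0, dtype=np.float16)
exact_total = values.astype(np.float64).sum()
print(exact_total)                    # 100000.0
print(np.float16(exact_total))        # inf

# Keeping the accumulator in FP32 avoids the problem entirely.
print(values.sum(dtype=np.float32))   # 100000.0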
2.3. Accumulation Error in Reductions
Floating-point addition is not associative: (a+b)+c can differ from a+(b+c) due to rounding, and thus the order in which we sum a list of numbers can change the result slightly. When summing a large sequence of N numbers x_1, x_2, ..., x_N, the accumulated rounding error can be significant. In the worst case (e.g., adding many positive and negative terms that cause maximal cancellation or always rounding one way), the error can grow on the order of O(N ε) relative to the exact sum. Simply summing N numbers in sequence (the naive linear summation) has a worst-case error bound proportional to N·ε. This worst case occurs for adversarial arrangements of values; in more typical scenarios with rounding errors of random signs, one can model the error as a random walk, yielding an expected root-mean-square (RMS) error growth on the order of √N·ε. Still, with very large N, even √N times ε can become large enough to matter. [4][5]
Cancellation is a particular problem: if we add two nearly equal numbers of opposite sign, most leading digits cancel and the result loses significant bits of precision. For example, in single precision, subtracting two numbers that agree in the first 6 decimal digits will yield a result with only ~1 digit of accuracy left. Catastrophic cancellation can dramatically increase relative error. Summation is especially sensitive to cancellation if large positive and negative terms are summed naively. The condition number of a summation (the ratio Σ_i |x_i| / |Σ_i x_i| of the sum of absolute values to the absolute value of the true sum) indicates how sensitive the result is to perturbations. If the numbers have mixed signs and nearly cancel out (making the condition number large), even an optimal summation order will have a large relative error. For instance, summing [1.0, 10^100, 1.0, -10^100] in double precision has an exact sum of 2.0, but naive summation may yield 0.0 due to cancellation; an improved summation algorithm yields the correct 2.0. [4][5]
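Both effects are easy to reproduce in plain Python, whose floats are IEEE 754 doubles; math.fsum is one such improved summation algorithm (see Section 3.3):
import math

data = [1.0, 1e100, 1.0, -1e100]       # exact sum is 2.0

# Naive left-to-right summation: both 1.0 terms are absorbed by 1e100
# (they fall below its rounding threshold), then 1e100 cancels exactly.
print(sum(data))         # 0.0

# Compensated (exact) summation recovers the true result.
print(math.fsum(data))   # 2.0

# Floating-point addition is not associative:
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # False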
In practical terms, when summing a long list of floating-point numbers (such as accumulating a dot product or reducing a tensor), the accumulation error can become noticeable if no precautions are taken. Techniques to mitigate this error include carefully choosing the summation order or using higher precision for intermediate sums. We will explore these strategies in the next section.
Accumulators and Reduction Strategies
When performing reductions (sums, dot products, etc.), the method of accumulation can greatly affect the accuracy of the result. Below, we discuss various strategies:
3.1. Naive Linear Accumulation
The simplest accumulation strategy is to use a single accumulator and add each element sequentially. In C-like pseudocode:
float sum = 0.0f;
for (int i = 0; i < N; ++i) {
sum += array[i];
}
This naive linear summation adds values one by one to the running total. While straightforward and fast, it is prone to the rounding error accumulation described earlier. The partial sum sum continually grows (in absolute value) as terms are added, so new small terms may fall outside its precision. For example, if sum is large and positive, adding a comparatively tiny positive number may produce no change because the small term's significant digits vanish in the presence of the large sum (the increment is below the rounding threshold of sum). Conversely, adding a small number of opposite sign can suffer from cancellation. The order of addition matters: summing from smallest magnitude to largest (ascending order) tends to yield a more accurate result than summing in descending order, because adding small terms first preserves their contributions before the sum becomes very large. The naive loop above implicitly sums in the given index order; if the data is unsorted, this order is essentially arbitrary.
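This absorption effect is easy to demonstrate with a short NumPy experiment; the values below are chosen so the outcome is deterministic (one large float32 term of 1e8 swallows 10,000 later additions of 1.0, while accumulating the small terms first preserves their subtotal):
import numpy as np

large = np.float32(1.0e8)                  # ulp(1e8) = 8 in float32
small = [np.float32(1.0)] * 10_000         # 10,000 small terms

def sequential_sum(terms):
    # Naive left-to-right accumulation in float32.
    acc = np.float32(0.0)
    for t in terms:
        acc = np.float32(acc + t)
    return acc

# Large term first: each 1.0 is below the rounding threshold of 1e8 and vanishes.
print(sequential_sum([large] + small))     # 100000000.0

# Small terms first: their subtotal (10000.0) survives the final addition.
print(sequential_sum(small + [large]))     # 100010000.0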
The performance of linear accumulation is excellent (minimal overhead), and for many applications the resulting error is within acceptable limits. However, for very large N or ill-conditioned sums, the inaccuracy can become a problem. In a reduction involving millions of elements, a single-precision sum could lose several digits of accuracy. In deep learning, this could translate to small discrepancies in layer outputs or aggregated gradients if done naively in low precision.
3.2. Pairwise and Tree-Based Reductions
A better approach to summation is pairwise summation, which sums numbers in a binary tree structure rather than linearly. The idea is to recursively split the array into halves, sum each half, then add the two partial sums. This can be implemented as:
/* Sum x[lo..hi) by recursively splitting the range in half (pairwise summation). */
float pairwise_sum(const float *x, int lo, int hi) {
    int n = hi - lo;
    if (n == 0) return 0.0f;
    if (n == 1) return x[lo];
    int mid = lo + n / 2;               /* split point */
    return pairwise_sum(x, lo, mid) + pairwise_sum(x, mid, hi);
}
By always adding numbers of similar magnitudes first (small subsets are reduced to subtotals, then those are added, etc.), pairwise summation avoids the situation of a very large running total absorbing small additions until the final steps. The error of pairwise summation grows logarithmically with N in the worst case (roughly O(ε log2 N)), instead of linearly. Intuitively, at each level of the tree, rounding error is constrained, and there are about log2 N levels. In fact, the worst-case error bound for pairwise summation (base case of 1 element) is: [5]
|E_N| ≤ [ε log2 N / (1 − ε log2 N)] · Σ_i |x_i|,
which for practical N simplifies to a relative error on the order of ε log2 N times the condition number of the sum. In typical random-error scenarios, the error behaves even better (growing on the order of ε √(log2 N)). The key is that pairwise (tree) reduction balances the accumulations, preventing one sum from overwhelming the others too early. [4][5]
From a performance standpoint, pairwise summation has the same O(N) total additions as linear summation, just a different order. It is also highly amenable to parallelization: independent partial sums can be computed in parallel at each level of the tree. Many numerical libraries and BLAS implementations use a form of blocked or pairwise summation in dot products for improved accuracy at negligible performance cost. For example, NumPy and Julia perform partial pairwise summation by default for better numerical behavior. [4]
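The accuracy difference is easy to measure by comparing a strictly sequential float32 loop against NumPy's built-in reduction (which uses a blocked pairwise scheme with a float32 accumulator), using a float64 sum as the reference; this is a minimal sketch, and the exact error values depend on the random seed and NumPy version:
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1_000_000, dtype=np.float32)    # one million values in [0, 1)

reference = x.astype(np.float64).sum()         # float64 reference sum

# Strictly sequential float32 accumulation: error grows roughly with N.
seq = np.float32(0.0)
for v in x:
    seq = np.float32(seq + v)

# NumPy's reduction: same number of additions, pairwise order, float32 accumulator.
pairwise = x.sum()

print("sequential relative error:", abs(float(seq) - reference) / reference)
print("pairwise   relative error:", abs(float(pairwise) - reference) / reference)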
3.3. Compensated Summation (Kahan, Neumaier)
Compensated summation is an alternative technique to improve summation accuracy by tracking the small “lost” quantities. The most famous algorithm is the Kahan summation algorithm (also called Kahan–Babuška). Kahan summation uses an extra variable to accumulate the round-off error. In pseudocode: [4]
double sum = 0.0;
double c = 0.0; // compensation for lost low-order bits
for (int i = 0; i < N; ++i) {
double y = x[i] - c; // recover low-order bits by subtracting compensation
double t = sum + y; // perform the addition
c = (t - sum) - y; // compute new compensation (the error in t)
sum = t;
}
return sum;
Here c accumulates the tiny leftovers that would otherwise be dropped. Each iteration essentially subtracts the last error from the new addend (y = x[i] - c) so that t = sum + y includes the small bits from the previous operation. The new error is then stored in c for the next iteration. The result is that the final sum is as if computed with higher precision than the machine’s: Kahan’s algorithm can greatly reduce the error in summing a sequence of floats. In fact, with exact arithmetic for the compensation, Kahan summation yields a result accurate to within 1 or 2 ULP of the true sum, regardless of N (assuming the sequence of partial sums doesn’t overflow). More precisely, Kahan summation achieves a worst-case error independent of N (error bound depends only on machine precision, not the number of terms). In practice, using a double accumulator for summing single-precision numbers via Kahan can give near double-precision accuracy. [4]
An important subtlety is that the compensation variable c should be kept in full precision. In the implementation above, we used double for both sum and c. If summing FP32 values, one might use float sum and float c (both 32-bit), which still improves accuracy, but using a higher precision for the accumulator and compensation (i.e. double) yields even better results. Many implementations of Kahan summation use the same type for sum and c, but it's assumed to be higher precision than the inputs when possible.
Kahan’s algorithm carries a performance cost: each iteration does four arithmetic operations instead of one and has a data dependency chain, which can hinder parallelism or pipelining. Thus, it runs slower than naive summation (roughly 2–3× slower or more, depending on CPU pipeline). In loops where summation is a bottleneck, this may be significant. Nonetheless, for modest N or where maximum accuracy is required, compensated summation is extremely useful. In fact, Python’s math.fsum function uses a more elaborate compensated scheme (tracking multiple partial sums rather than a single compensation term) to sum iterables with high precision. [4]
There are variations and improvements on basic Kahan summation. Neumaier’s algorithm (sometimes called Kahan-Babuška-Neumaier) improves Kahan’s method by also handling cases where the next input is larger in magnitude than the running sum. In such cases, the original Kahan formula could suffer error; Neumaier’s version effectively swaps roles when a huge value comes in. Its pseudocode is: [4]
function NeumaierSum(array):
    sum = 0.0
    compensation = 0.0
    for each x in array:
        t = sum + x
        if abs(sum) >= abs(x):
            compensation += (sum - t) + x    # low-order bits of x were lost
        else:
            compensation += (x - t) + sum    # low-order bits of sum were lost
        sum = t
    return sum + compensation
This algorithm ensures the compensation covers both scenarios: when the sum is the bigger addend or when the new term is bigger. In the end, it adds the final compensation once to the result. Neumaier’s method can produce the correct result in cases where basic Kahan fails. For instance, summing [1.0, +10^100, 1.0, -10^100] in double precision: regular Kahan summation might give 0.0 (losing the small terms), whereas Neumaier’s algorithm yields 2.0 exactly. There are also higher-order compensation methods (carrying multiple compensation terms, e.g., the Klein summation variant or using arbitrary precision for intermediate sums) that further reduce error at increased computational cost. [4]
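Both algorithms are short enough to verify directly in Python (whose floats are IEEE 754 doubles); this minimal sketch reproduces the case above, with the built-in sum and Kahan losing the two small terms while Neumaier recovers them:
def kahan_sum(values):
    total, c = 0.0, 0.0
    for x in values:
        y = x - c                   # re-inject the previously lost low-order bits
        t = total + y
        c = (t - total) - y         # rounding error of this addition
        total = t
    return total

def neumaier_sum(values):
    total, comp = 0.0, 0.0
    for x in values:
        t = total + x
        if abs(total) >= abs(x):
            comp += (total - t) + x     # low-order bits of x were lost
        else:
            comp += (x - t) + total     # low-order bits of total were lost
        total = t
    return total + comp                 # apply the compensation once at the end

data = [1.0, 1e100, 1.0, -1e100]        # exact sum is 2.0
print(sum(data), kahan_sum(data), neumaier_sum(data))   # 0.0 0.0 2.0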
In summary, compensated summation algorithms trade extra computation for greatly improved numerical accuracy. They are especially valuable when summing large arrays of numbers with varying magnitudes, where naive summation could lose most of the low-order bits. Many scientific computing environments provide compensated sum routines for this reason. In deep learning, however, a different approach to improved precision is often taken: using a wider accumulator type.
3.4. Mixed-Precision Accumulation
Mixed-precision accumulation refers to performing arithmetic in a higher precision than the input data. A common scenario is accumulating sums in 32-bit floats when the operands are 16-bit floats. Rather than implementing a compensation algorithm, one can simply use a wider accumulator so that intermediate rounding error is smaller. This is effectively what modern GPUs do for tensor operations: for example, NVIDIA Tensor Cores operate on FP16 input data with an FP32 accumulator (the 16-bit products are accumulated in 32-bit). In this way, summing 2048 products in a dot product has far less precision loss than it would with a 16-bit accumulator. Hardware that supports mixed precision can give the best of both worlds: fast low-precision multiplies and high-precision accumulation. [6]
Software can also leverage mixed precision. For instance, one can sum an array of float values into a double variable, accumulating the result in double precision and then converting back to float at the end:
double sum = 0.0;
for (int i=0; i<N; ++i) {
sum += (double) arr_float[i];
}
float result = (float) sum;
This approach often yields a more accurate result than summing in single precision directly, because the intermediate sum carries a 53-bit significand. The final rounding to float happens only once at the end (minimizing overall rounding error). Python’s math.fsum goes further still: rather than one wider accumulator, it tracks multiple double-precision partial sums to avoid losing precision, as noted in Section 3.3.
In machine learning training and inference, mixed precision is widely used. When using FP16 for speed, it's common to accumulate with FP32. For example, Google's TPUs use BFloat16 for multiplication but accumulate in FP32 for matrix ops. AWS’s Inferentia (Neuron) hardware similarly supports FP16/BF16 matrix multiplication with FP32 accumulation. The motivation is that summing hundreds or thousands of half-precision values would otherwise introduce intolerable error or risk overflow. With an FP32 accumulator, the reduction error is drastically reduced (by about 2^13 ≈ 8000 times, since FP32 has 13 more precision bits than FP16). Mixed-precision accumulation thus provides a simple and effective form of compensation: use a larger bucket to collect many small contributions. [6][7]
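In NumPy, the accumulator type of a reduction can be chosen independently of the storage type through the dtype argument, which makes the effect easy to observe (a minimal sketch; exact error values depend on the random seed):
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(50_000).astype(np.float16)     # FP16 storage, total ~2.5e4

reference = x.astype(np.float64).sum()        # FP64 reference

fp16_acc = x.sum(dtype=np.float16)            # accumulate in FP16
fp32_acc = x.sum(dtype=np.float32)            # accumulate in FP32 (mixed precision)

print("FP16 accumulator relative error:", abs(float(fp16_acc) - reference) / reference)
print("FP32 accumulator relative error:", abs(float(fp32_acc) - reference) / reference)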
There is a spectrum of options: summing int8 or int16 integers might use 32-bit integers as accumulators for the same reason (to avoid overflow). Some BLAS libraries even offer extended precision accumulators for single precision dot products (accumulating in double). The downside of mixed precision is the extra memory and maybe throughput cost of the wider type. But on many platforms, the add latency is dominated by hardware design rather than bit-width, so using a 32-bit adder vs 16-bit has negligible speed impact if supported natively. The key is whether data movement or register pressure increases.
In summary, using a higher precision accumulator is often the easiest way to improve numerical accuracy for reductions. It is a primary strategy in deep learning hardware to maintain stability while exploiting low precision for storage and initial computations.
Precision Formats in AI and Machine Learning
Modern AI systems use a variety of numeric formats to balance speed, memory, and accuracy. We discuss common precisions and their roles:
4.1. FP32 and FP64 in Training and Inference
FP32 (single precision) has been the workhorse of deep learning for years. Most training scripts in frameworks like TensorFlow and PyTorch used FP32 by default, as it offers sufficient precision to propagate gradients and accumulate weight updates over millions of iterations without excessive error. With ~7 decimal digits of precision and a wide exponent range, FP32 can handle the dynamic range of activations and gradients in typical neural networks. In inference, FP32 is often more precision than strictly necessary for final predictions, but it has remained common because of its ease and the fact that many GPU operations were optimized for FP32.
FP64 (double precision) is rarely used in deep learning model training or inference, except in specific scenarios requiring extreme precision. FP64 is common in scientific computing and certain types of numerical simulations. In neural networks, however, the extra precision is usually not needed for model accuracy—using doubles tends to yield negligible improvements in model metrics but at a large performance cost (since on GPUs FP64 throughput can be 1/2, 1/8, or even 1/32 of FP32 throughput depending on hardware). One exception might be when integrating neural nets into larger physics simulations or when solving ill-conditioned problems as part of the model. Some researchers use FP64 to diagnose numerical issues or ensure reproducibility. Generally though, deep learning has favored lower precision for efficiency, and FP64 is considered overkill for most networks. [2]
That said, certain classical machine learning algorithms benefit from double precision. For example, computing the inverse of a matrix (as in a Gaussian Process regression or Kalman filter embedded in a model) can incur large rounding errors in single precision, so using FP64 can significantly improve the result. The ONNX standard supports a float64 tensor type for such needs. In ONNX Runtime, however, emphasis has been on FP32 performance, and some pipelines default to float even if the original model was double. If a model’s prediction computation involves operations that are numerically sensitive (like matrix inversion, Cholesky decomposition, or summing extremely large counts), then FP64 may be required to avoid noticeable discrepancies. A good practice is to test a model in both float and double if you suspect precision issues; significant differences indicate an ill-conditioned calculation that might need higher precision. [1]
In summary, FP32 remains the default format for most training and inference due to its balance of accuracy and efficiency, while FP64 is reserved for niche cases or precision-critical subroutines. FP32 typically achieves the same accuracy as FP64 for end-to-end deep learning tasks, because network training can often compensate for small noise, and other sources of error (like generalization error) dominate. Therefore, the community trend has been moving to lower precisions, not higher, to gain speed — provided the models still converge and perform well. [2]
4.2. FP16 and BFloat16 in Deep Learning
To further speed up neural network training and inference, half-precision (FP16) was adopted, especially with the advent of GPUs supporting FP16 math at twice the rate of FP32. Using FP16 reduces memory storage and bandwidth by 50% compared to FP32, which is significant for large models. However, FP16’s small exponent range and precision posed challenges: training a large network purely in FP16 can fail due to gradient underflow/overflow or weight updates not registering (because changes are below the 0.001 quantization step). To address this, frameworks introduced mixed-precision training, where compute-intensive parts (matrix multiplies, convolutions) run in FP16, but certain accumulations and model parameters remain in FP32. For instance, NVIDIA’s Tensor Core units require FP16 (or BF16) inputs but accumulate results in FP32, effectively ensuring that reductions are done in higher precision. Research showed that with proper loss scaling and mixed precision, models can train to FP32-level accuracy using mostly FP16 arithmetic, achieving significant speedups. [6]
BFloat16 (BF16) emerged as an alternative 16-bit format primarily from Google’s TPU training systems. BF16 has a larger exponent (8 bits, same as FP32) but only 7 fraction bits. This means BF16 can handle the same dynamic range as FP32 (no overflows where FP32 wouldn’t) but at the cost of even lower precision than FP16 (BF16 has about 2–3 decimal digits of precision vs FP16’s ~3–4). Despite its coarse precision, BF16 turns out to be effective for training deep networks because the algorithms (stochastic gradient descent, etc.) are somewhat robust to that noise, and having the wide range prevents catastrophic loss of signal (no gradient will overflow to Inf unless it was going to in FP32 anyway). Many training frameworks (PyTorch, TensorFlow) support BF16 training on hardware like TPU, NVIDIA A100+, Intel CPUs with AVX-512 BF16, etc. Typically, when using BF16, accumulations are still done in FP32 (for example, a TPU v2/v3 multiplies BF16 matrices and accumulates into an FP32 result for the dot product). [3][6]
In inference, both FP16 and BF16 are used to deploy models for faster speed or lower memory. FP16 is popular on NVIDIA GPUs (which have specialized FP16 units). BF16 is advantageous on TPUs and some CPUs that support it, and increasingly on GPUs (Ampere and newer GPUs also support BF16 arithmetic). The choice between FP16 and BF16 can depend on hardware: BF16’s advantage is easier software conversion (you can often take an FP32-trained model and cast weights to BF16 with minimal loss, since the dynamic range is intact), whereas FP16 might require adjusting the model or ensuring values fit in range (sometimes needing techniques like maintaining a master copy of weights in FP32 during training, or handling out-of-range values).
For ONNX, both Float16 and BFloat16 types are part of the standard. ONNX models can be converted to FP16 (truncating weights and inserting cast nodes for computations) using tools like float16_converter in ONNX Runtime. ONNX Runtime on GPUs will typically execute those FP16 ops with mixed precision on tensor cores (FP16 input, FP32 accumulate), whereas on CPU it may emulate FP16 or upcast to FP32 (as many CPU implementations do not natively support half-precision math). In fact, as an example, running an ONNX model in FP16 on a CPU might insert hidden Cast operations to FP32 behind the scenes, because the CPU kernel for that op only exists in FP32. This ensures correctness (the computation is done in higher precision) but can reduce performance. We will revisit such backend-dependent behavior in Section 5.3. [8][9]
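As a concrete sketch of such a conversion, assuming the onnxconverter-common package that the ONNX Runtime float16 documentation points to (argument names may differ between versions):
import onnx
from onnxconverter_common import float16

model_fp32 = onnx.load("model.onnx")

# Convert initializers and intermediate tensors to float16; keep_io_types leaves
# the graph inputs/outputs in float32 and inserts Cast nodes at the boundaries.
model_fp16 = float16.convert_float_to_float16(model_fp32, keep_io_types=True)

onnx.save(model_fp16, "model_fp16.onnx")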
In summary, FP16 and BF16 are now established as important low-precision formats in deep learning. FP16 offers more precision bits, which can matter for e.g. small differences, but has a limited range (care needed to avoid overflow). BF16 offers safety in range but less precision (which can slightly affect convergence or final accuracy in some cases). Both typically rely on FP32 for accumulation to achieve acceptable accuracy. When deploying models, one should choose a format supported efficiently by the target hardware and verify that the accuracy is not significantly degraded. Often, CNNs and transformers can run in FP16/BF16 with virtually no drop in score, but certain operations (like softmax, see Section 6.3) might need special handling.
4.3. FP8 Formats (E4M3, E5M2) and Scaling
The push for ever lower precision has led to experimental 8-bit floating point (FP8) formats for deep learning. In 2022, researchers from NVIDIA, Intel, and Arm proposed an FP8 standard with two variants: E4M3 (4 exponent bits, 3 fraction bits) and E5M2 (5 exponent bits, 2 fraction bits). Both have 1 sign bit, for a total of 8 bits. E5M2 has a wider exponent range (5 bits gives bias 15) but one fewer mantissa bit than E4M3. The idea is to use E4M3 for numbers that need more precision (like weights or activations around 1.0) and E5M2 for numbers that need more dynamic range (like gradient updates which can be very small or large). In fact, the study found that using E4M3 for forward activations/weights and E5M2 for backward gradients can allow training of large networks with FP8 to accuracy on par with FP16 training. [10][11]
FP8 formats are not part of the IEEE 754 standard (the 2019 revision defines nothing smaller than binary16). Notably, the E4M3 format as used by NVIDIA does not allocate any bit patterns for +∞ or -∞ – it has only NaN for invalid results, using the freed encodings to extend the representable range by one exponent value. Essentially, by foregoing infinities, E4M3 can represent finite values up to 448 (1.75 × 2^8), one exponent step beyond what an IEEE-style layout reserving the top exponent for Inf/NaN would allow. Variations also exist regarding negative zero and how NaNs are handled. For instance, Graphcore’s IPU uses FP8 formats that disallow negative zero and infinities, to maximize usable range. [10]
Using 8-bit floats for neural networks requires additional techniques, chiefly scaling. Because 8-bit has very limited precision and range, one typically applies a power-of-two scale factor to groups of values to normalize them into an FP8-friendly range. This is akin to block floating-point or “microscaling” per layer or per tensor. For example, one might keep an FP32 scale for each channel or tensor, and represent values as (FP8 value) × 2^scale. The ONNX standard has introduced support for FP8 in version 1.15, including types for E4M3 and E5M2 (with FNUZ variations: finite-only and no negative zero), and even a type called E8M0 which is an 8-bit exponent-only number used purely as a scale factor. The E8M0 type has 8 exponent bits and no mantissa, effectively representing powers of two from 2^-127 to 2^128 for scaling purposes. This indicates an approach where an FP8 tensor would be accompanied by an E8M0 scale (or scales) to interpret its values correctly. [10]
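The following toy sketch illustrates the per-tensor power-of-two scaling idea; it assumes the ml_dtypes package (which registers FP8 dtypes with NumPy), and a production quantizer would instead choose scales per channel or per block and calibrate them:
import numpy as np
from ml_dtypes import float8_e4m3fn     # assumption: ml_dtypes provides this dtype

E4M3_MAX = 448.0                        # largest finite E4M3 value

def quantize_e4m3(x):
    # Toy per-tensor quantization: scale into E4M3's range, then cast to FP8.
    amax = float(np.max(np.abs(x))) or 1.0
    scale = 2.0 ** np.floor(np.log2(E4M3_MAX / amax))   # power-of-two scale factor
    x_fp8 = (x * scale).astype(float8_e4m3fn)           # rounding to FP8 happens here
    return x_fp8, scale                                 # keep the scale for dequantization

def dequantize(x_fp8, scale):
    return x_fp8.astype(np.float32) / scale

x = np.random.randn(4, 4).astype(np.float32) * 0.01
x_fp8, scale = quantize_e4m3(x)
print("max abs quantization error:", np.max(np.abs(x - dequantize(x_fp8, scale))))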
Rounding becomes critical when casting values to FP8. The quantization error is much larger relative to FP16 or FP32. Studies have shown that choosing the right rounding mode (e.g., round-to-nearest, stochastic rounding, etc.) and clamping behavior (saturating to max instead of inf) can affect model accuracy. ONNX added a round_mode attribute to its Cast operator to allow controlling this for FP8. In particular, one paper found that rounding up with saturation (round-to-nearest ties away from 0, and clamp overflows to max finite rather than inf) gave better accuracy in large language model training than other rounding modes. [10]
At present, FP8 is on the cutting edge. NVIDIA’s H100 GPU supports FP8 tensor cores (for both E4M3 and E5M2), and early results show minimal accuracy loss on some vision and NLP tasks when properly tuned. However, using FP8 requires fine-tuning hyperparameters like scaling factors for each layer and perhaps tweaking training procedures. ONNX’s inclusion of FP8 types means you can represent and transport models using FP8 weights or activations, but actual hardware support is limited and typically paired with calibration or dynamic scaling logic outside of ONNX graph (or using custom ONNX functions for scaling). It remains an open research area how to fully automate FP8 quantization for arbitrary models without accuracy loss.
In summary, FP8 formats promise further compression of neural network computation and data. They are an order of magnitude more error-prone than FP16, but when combined with careful scaling, mixed usage of E4M3/E5M2, and maybe modified training techniques, they have shown surprising success. ONNX is actively evolving to support these ultra-low precision types, signaling their potential importance in future AI inference deployments.
Accumulation Semantics in ONNX
One of the roles of the ONNX specification is to define the mathematical behavior of operators (Conv, MatMul, Sum, etc.) in a framework-agnostic way. However, ONNX mostly defines what an operator computes in exact arithmetic, not how the computation is carried out in finite precision. This leaves certain details – like accumulator precision or summation order – up to the implementation (runtime or hardware). We'll examine how ONNX treats accumulation and what guarantees (or lack thereof) are provided.
5.1. Operator Definitions and Numerical Guarantees
Each ONNX operator is described as performing an ideal mathematical function on its inputs. For example, MatMul is defined as the standard matrix product of two tensors, and ReduceSum computes a summation along given axes. The ONNX spec, as of current versions, does not mandate the precision of intermediate calculations beyond the types of the inputs and outputs. If an operator’s inputs and outputs are float16, the spec implies the computation is done in float16 in a sense, but it does not forbid an implementation from using higher precision internally – it simply doesn't mention it. There is no explicit guarantee of identical bit-for-bit results across different ONNX backends for floating-point ops. Instead, results are expected to be numerically close, within tolerances one would normally expect from floating-point differences (due to different hardware or algorithms). The ONNX Backend Test suite typically uses a relative or absolute error tolerance when validating outputs for this reason.
Because of this, important details like the use of extended precision or accumulation in higher precision are left ambiguous. This has led to discussions in the ONNX community about making such behavior more explicit. For example, a proposal was raised to allow specification of accumulation precision for certain ops like MatMul. The reasoning is that as lower-precision types (float16, bfloat16, float8) become more common, knowing or controlling whether an op internally accumulates in a higher precision is important for both accuracy and reproducibility. In June 2025, an issue on the ONNX GitHub suggested adding an attribute to MatMul (and similarly Gemm, Softmax, etc.) to specify the accumulator type. By default it would use input precision (status quo), but an attribute could request (for instance) 32-bit accumulation even if inputs are 16-bit. As of this writing, however, no such attribute exists in the official spec – it's an open issue. Thus, the numerical semantics are effectively: “compute the result as if in the same type as inputs, but actual implementations may use more precision internally as an optimization.” [12]
From a guarantee standpoint, ONNX operators promise the mathematically correct result assuming real-number arithmetic, but they do not guarantee bit-exact results due to floating-point effects. There is an understanding that differences can arise from different summation orders, fused operations (e.g., an implementation might use an FMA (fused multiply-add) which has one rounding instead of two), or use of extended precision registers. ONNX does not formally define rounding modes either – it relies on IEEE 754 defaults, but if a backend used something like deterministic summation or alternative rounding, ONNX has no mechanism to express that.
In summary, ONNX's current stance is largely implementation-dependent for numerical precision. It provides minimal explicit numerical guarantees beyond type constraints. This is why an ONNX model run on two different hardware or runtime libraries can have slightly different results, especially in reduced precision. The burden is on backends to produce results "close enough" to each other that they don't affect the model's decisions in any significant way.
5.2. MatMul, Conv, and Reduction Operators
Consider ONNX MatMul: it takes two tensors (often two matrices) and produces their matrix product. In pure math, C[i,j] = Σ_k A[i,k] · B[k,j], a sum of products over the shared dimension. If A and B are float16, should that sum be computed in float16 or float32? As discussed, ONNX doesn’t specify – it simply says the output is a float16 matrix containing the product. In practice, many backends will promote the computation: for example, NVIDIA GPUs will do the multiplication and addition in a mixed FP16/FP32 mode by default (because of tensor cores), then convert the final result back to float16. ONNX Runtime on CPU might not support float16 multiplication natively, so it will insert cast ops to float32, perform the MatMul in float32, then cast back to float16. From ONNX’s perspective, as long as the final result is within acceptable error of the “ideal” float16 computation, it’s fine. But note, an “ideal” float16 computation (where every multiply and add is rounded to FP16) might actually be less accurate than what the backend did with FP32. This means that a backend using higher precision can produce a result that is not bit-equal to a pure FP16 implementation – it's actually closer to the true real number result. This is usually considered a good thing (improved accuracy), but it does highlight a lack of consistency: another backend might do everything in pure FP16 and get a slightly different answer. [9]
The same situation occurs for convolution (Conv op) which is essentially a batched sum of products (like MatMul in sliding window), and for explicit reduction ops like ReduceSum, ReduceMean, etc. If you have an ONNX ReduceSum that sums 1000 float16 values, the spec doesn’t say how to sum them. A GPU kernel might sum in a tree using FP32 accumulators; a naive implementation might sum in order with FP16 and possibly overflow or underflow intermediate steps. Both are allowed as long as they output a float16 result that is “reasonable.” In practice, ONNX Runtime and others take care to avoid obviously bad behavior (like overflow) by using higher precision or by splitting the sum. But again, it’s not mandated by the ONNX standard explicitly.
Softmax is another instructive case: softmax(x)_i = exp(x_i) / Σ_j exp(x_j), typically applied along a specified axis. Directly computing this in float16 can be problematic for large vectors, as discussed in Section 6.3. Indeed, the issue of accumulation precision arises: should the sum be done in FP16 or can it be in FP32? Many implementations will do it in FP32 even if x is FP16, because the dynamic range of exp(x) is large. The ONNX Softmax spec doesn’t detail this, but it expects the result to be a proper probability distribution (summing to 1.0 within normal floating error). A pure FP16 softmax on, say, 1000 elements could sum to something like 0.996 or 1.004 due to rounding, whereas FP32 accumulation would be much closer to 1.000. ONNX likely allows both; the small difference is usually inconsequential, but if it were large (say the FP16 sum overflowed to inf), that would be a correctness issue. A robust backend will prevent that (by using the common max-subtraction trick and higher precision accumulation). The ONNX issue 7072 indeed calls out Softmax as another op where explicit control of accumulation precision would be useful. [12]
In summary, MatMul, Conv, and reduction ops in ONNX are defined mathematically but with implicit numerical behavior. Implementations commonly use higher precision for intermediate sums especially when input is low precision, but this is not standardized. Consequently, results can differ slightly between implementations. For critical applications, one must be aware of these differences. A model exported to ONNX with float16 weights might perform slightly differently on one runtime vs another if one does all float16 math and the other uses float32 accumulations. Both conform to ONNX. As a user, if you require a certain precision for accumulation, currently you have to ensure it by the way you export or run the model (for example, insert manual cast nodes in the ONNX graph to force certain ops to FP32). We may see future ONNX versions address this with attributes or metadata as mentioned, but for now these ops are effectively “compute with best effort accuracy”.
5.3. Backend and Hardware-Dependent Behavior
Because ONNX is implemented by many backends (ONNX Runtime, TensorRT, CoreML, PyTorch JIT, etc.), and these in turn run on varied hardware (CPU, GPU, accelerators), the numeric behavior can be backend-dependent. Some examples:
- CPU backends: Many CPUs do not have native float16 vector arithmetic (with a few exceptions in recent CPUs). ONNX Runtime’s default CPU execution provider will often upcast float16 to float32, compute, then downcast. This ensures accuracy but means that if you thought you were getting a full float16 pipeline, you’re not – the CPU is essentially ignoring the low precision in compute. On the other hand, x86 CPUs have extended 80-bit precision in the legacy x87 registers (though these are rarely used now) and SIMD registers that operate on 64-bit doubles; some compiler optimizations might carry intermediate results in higher precision. This could make CPU results slightly different from a GPU’s if, say, a sum is done in 80-bit on the CPU versus 32-bit on the GPU. Generally, modern compilers avoid such extended precision unless explicitly asked, to improve reproducibility. [9]
- GPU backends: GPUs often perform summations in parallel using tree reductions. The order of operations can change depending on how threads are scheduled. This means even on the same GPU, summing an array in one launch versus another might not give bit-identical results if the internal parallel reduction algorithm differs (though it should be deterministically the same given the same library version and input, but if you change GPU model or BLAS library, it might differ). Also, GPUs have fused multiply-add (FMA) instructions that do (a*b + c) in one step with one rounding, which can produce slightly different results than doing the multiply and add separately (which would round twice). If an ONNX backend uses a fused kernel (like a GEMM library) it might yield different low-order bits than a simple loop.
- Vendor libraries: ONNX Runtime can use vendor BLAS or DNN libraries (MKL, cuDNN, TensorRT, etc.). These libraries often implement their own strategies for numerical stability. For example, NVIDIA’s cuDNN might use FP32 accumulation for FP16 inputs by default. TensorRT (NVIDIA’s inference engine) will sometimes automatically promote precisions or insert scaling to maintain stability. On the other hand, if you explicitly request FP16 in TensorRT with no FP32 fallback, it might do the entire op in FP16 and you’d see more error. Some libraries offer knobs: TensorRT allows choosing FP32, FP16, or INT8 precision for each layer, and will use FP32 internally for safety in some cases unless forced otherwise. [13]
- Hardware-specific quirks: As another example, NVIDIA’s TensorFloat-32 (TF32) on Ampere GPUs is a precision mode where matrix multiplications use 10-bit mantissa (so about FP16 precision for the multiply) but keep 8-bit exponent (range like FP32). By default, CUDA libraries use TF32 for FP32 matrix multiplies on Ampere to speed them up, unless you disable it. This means if you run an ONNX model with FP32 on an A100 GPU, your matmuls might actually be lower precision than FP32 (though higher than FP16). It’s an example of hardware choosing an intermediate precision for performance. From ONNX’s perspective, the output is still FP32 and (hopefully) within error tolerances, but a user might be surprised that it’s not a “true” FP32 computation internally. If exactness is needed, one has to turn off TF32. [6]
- Non-determinism: Some backends (especially on GPUs) allow non-deterministic execution for performance. For instance, atomic adds in parallel reductions can produce results that vary run-to-run because the order of accumulation is race-condition-dependent. ONNX’s spec doesn’t cover this explicitly, but it’s something to consider. In ONNX Runtime, there are settings for deterministic computing, but by default it may use all available performance optimizations, even if results vary at the 1e-7 level.
In short, the numerical behavior under ONNX can depend on the combination of backend software and hardware capabilities. The lack of a strict spec for accumulation means implementations have freedom to use clever strategies (higher precision, reordering) to improve accuracy or speed. Usually this is positive (e.g. getting better accuracy than a naive approach). But it could also lead to slight inconsistencies. For example, you might find that exporting a PyTorch model to ONNX and running it with one engine yields outputs slightly different from PyTorch’s – not because ONNX is wrong, but because the summation or convolution was done differently (maybe PyTorch summed in FP32 as well, but maybe not in exactly the same way).
The key takeaway is that ONNX by design abstracts away these details, which is good for portability but means you need to trust the backend or verify its precision behavior for sensitive models. If necessary, you can enforce certain behavior by inserting ops (for example, if you want to ensure FP32 accumulation, you could insert a cast to FP32 for inputs of a MatMul and cast the result back to FP16 – this way the ONNX graph itself mandates FP32 compute in between). This of course sacrifices some performance or device-specific magic.
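As a sketch of that workaround using the standard onnx.helper API (the tensor names here are hypothetical; in a real graph they come from the model being edited):
from onnx import helper, TensorProto

# Hypothetical tensor names; wrap a float16 MatMul in explicit Cast nodes.
cast_a = helper.make_node("Cast", ["A_fp16"], ["A_fp32"], to=TensorProto.FLOAT)
cast_b = helper.make_node("Cast", ["B_fp16"], ["B_fp32"], to=TensorProto.FLOAT)

# The MatMul now has float32 inputs, so the graph itself requests FP32 compute
# (and hence FP32 accumulation) rather than leaving it to the backend's defaults.
matmul = helper.make_node("MatMul", ["A_fp32", "B_fp32"], ["C_fp32"])

# Cast the result back to float16 for the rest of the FP16 graph.
cast_c = helper.make_node("Cast", ["C_fp32"], ["C_fp16"], to=TensorProto.FLOAT16)

new_nodes = [cast_a, cast_b, matmul, cast_c]   # splice these into graph.node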
Looking forward, as mentioned, there are proposals to enrich ONNX’s specification regarding precision. This might include explicit flags for “exact summation” or “accumulate in int32” etc., which would make models more self-descriptive but also potentially less portable (if a backend doesn’t support that, what then?). For now, users should be aware of these backend differences. Most of the time, well-engineered ONNX runtimes yield results close enough that they don't affect the application (e.g., classification top-1 remains the same even if low-level sums differ by 1e-5). But for high-stakes numerical computations, testing and maybe adjusting the model or backend settings is prudent.
Numerical Stability in Deep Learning Models
Deep learning computations involve many operations that, while theoretically well-defined, can be sensitive to numerical issues when implemented in finite precision. We'll discuss a few common scenarios in neural networks where numerical stability is a concern, and how precision and accumulation play a role.
6.1. Dot Products and Large-Scale Reductions
Neural networks are full of dot products: every fully-connected layer computes y = Wx + b, which is a series of inner products between weight vectors and the input. Convolutional layers compute weighted sums of local regions. Attention mechanisms compute large matrix products of query/key/value matrices. All these boil down to summing a lot of terms: Σ_{k=1}^{K} a_k · b_k. When K is large (it could be in the thousands or more for a single neuron in a fully connected layer of a large model), the potential for floating-point error in that sum increases.
If using FP32, the error may still be negligible in many cases, but in lower precision like FP16, it can become significant. For example, consider a dot product of length 2048 in FP16. Each product is in FP16 (with ~3 decimal digits of precision), and if accumulated in FP16, the worst-case relative error bound is on the order of 2048 × ε_FP16 ≈ 2048 × 0.0009766 ≈ 2, i.e., in a pathological case essentially all significant digits of the result could be lost. Even with random error accumulation, you might see relative error on the order of √2048 × ε_FP16 ≈ 4.4% (though this is a rough estimate). That’s quite significant. Fortunately, as discussed, most hardware will accumulate such sums in a higher precision like FP32, dramatically reducing that error (FP32's ε is 2^13 ≈ 8000× smaller than FP16's). But if it didn’t, the network’s outputs could get noisy. A fully-connected layer with many inputs might output slightly different results depending on summation order if done in low precision.
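The accumulator effect can be observed directly on a length-2048 dot product (a minimal NumPy sketch; exact error values depend on the random seed):
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(2048).astype(np.float16)
b = rng.standard_normal(2048).astype(np.float16)

products = a * b                                   # elementwise products, rounded to FP16
reference = products.astype(np.float64).sum()      # FP64 reference for the same products

fp16_acc = np.float16(0.0)
for p in products:                                 # naive sequential FP16 accumulation
    fp16_acc = np.float16(fp16_acc + p)

fp32_acc = products.sum(dtype=np.float32)          # FP32 accumulation of FP16 products

print("FP16 accumulator error:", abs(float(fp16_acc) - reference))
print("FP32 accumulator error:", abs(float(fp32_acc) - reference))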
Large-scale reductions also occur in operations like global average pooling (summing all elements of a feature map) or computing the total loss by summing loss over batch samples. If one sums 100,000 values in FP32, the worst-case error could be on the order of 100,000 × 1e-7 ≈ 1e-2 (1% relative uncertainty), though the typical error would be much less. In FP16, summing 100k values is likely impossible without overflow: as noted, adding ~10^4 moderate numbers can overflow FP16. So large reductions must be handled carefully (either in chunks or in higher precision). Frameworks generally do: for example, when performing a sum reduction on GPU, libraries break it into parallel blocks that accumulate in registers (usually 32-bit registers, even if inputs are 16-bit). [3]
Another aspect is that many dot products in neural nets involve values of varying magnitude. If the distribution of terms is such that some are much larger than others, the smaller terms might not contribute much if summed last. Ideally, one might sort by magnitude or use pairwise summation to improve that. Most BLAS libraries effectively do a pairwise strategy due to their parallel structure. So, while worst-case analyses are grim, practical implementations mitigate a lot of it.
However, certain layers can suffer if not handled properly. One famous example is in RNNs or LSTMs: if one uses FP16 naive summation for the gating mechanism, errors could accumulate over many time steps and cause drift or instability. In modern practice, recurrent layers are often run in FP16 with FP32 accumulators to avoid this.
In summary, dot products and large reductions need sufficient precision or robust algorithms. If you are implementing a custom kernel (say a custom ONNX op) and you have to sum a huge array, you should consider using a technique from Section 3 (pairwise summation, Kahan, or at least double accumulator if possible). Many deep learning computations, by virtue of hardware and library design, already do something like this under the hood. But it's always good to be aware, especially if pushing into lower precision territories.
6.2. Cancellation, Dynamic Range, and Conditioning
We introduced cancellation earlier in the context of summation. In deep learning, cancellation can appear in a few places. One is in the computation of certain norms or variances. For instance, Batch Normalization computes the mean and variance of activations. The textbook variance formula Var[x] = E[x^2] − (E[x])^2 is subject to catastrophic cancellation if E[x^2] and (E[x])^2 are close in magnitude (which can happen if the variance is small compared to the mean). If implemented naively in single precision, you might get a negative variance due to numerical error (which would then be clamped to zero or sqrt would yield NaN). Robust implementations use a two-pass or Welford-style one-pass algorithm, or increase the precision. Indeed, in many frameworks, batchnorm accumulation is done in FP32 even if the activations are FP16, precisely to avoid numerical issues in computing mean and variance. ONNX itself doesn’t specify this, but an ONNX backend is likely to do batchnorm stats in higher precision for safety. The conditioning of the variance calculation is poor when the variance is tiny relative to the mean, so it's a spot where higher precision is needed.
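The effect is easy to trigger in float32 whenever the mean is large relative to the standard deviation (a minimal sketch; the exact value of the naive result depends on the random seed, but it is dominated by rounding error):
import numpy as np

rng = np.random.default_rng(0)
# Activations with a large mean (1000) and a small standard deviation (0.01).
x = (1000.0 + 0.01 * rng.standard_normal(10_000)).astype(np.float32)

# Naive E[x^2] - (E[x])^2 in float32: two numbers near 1e6 are subtracted
# and nearly all significant digits cancel; the result can even be negative.
naive_var = np.mean(x * x) - np.mean(x) ** 2
print("naive float32 variance:", float(naive_var))

# Reference computed in float64 (a Welford-style one-pass algorithm also works).
print("float64 variance:      ", float(np.var(x.astype(np.float64))))   # ~1e-4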
Another example is in certain loss functions or regularizations. If you subtract two large numbers to get a small difference, you lose precision. Say you have two nearly identical predictions and you subtract them as part of some operation – if done in float16, you might get zero whereas float32 would have a slight difference. This could affect, say, a gradient if that difference was supposed to propagate information. Fortunately, neural nets often avoid directly subtracting large nearly equal numbers; but it can happen. In attention mechanisms, the score normalization (softmax) subtracts a max to avoid huge exponentials, which is good for range but if multiple scores are close to the max, you subtract numbers that are close. However, those subtractions are not catastrophic in the same way; they're intentional to reduce range and are typically fine in FP32 or even FP16.
Dynamic range issues are everywhere in deep learning. Activation functions like ReLU and exponential can produce very large or very small values. If subsequent computations don’t account for that, you get infinities or zeros. For example, consider a network output that produces a logit of 100 in FP16. Computing e^100 in FP16 overflows to infinity (FP16's max is ~6.5e4, while e^100 ≈ 2.7e43). If you then do softmax, that one infinity will dominate the sum and you might get a probability of 1 for that class (or NaNs if not handled). The standard softmax implementation subtracts the max logit, so in this case it would subtract 100 from all logits, making the largest exponential e^0 = 1 and the others tiny; this avoids overflow. But what about underflow? If you had a very negative logit, its exponential would underflow to 0 in FP16. In softmax, that just contributes ~0 probability, which is fine (it effectively is zero anyway). Underflow tends to be less dangerous than overflow in inference: underflow usually just means “this probability/neuron is effectively 0,” whereas overflow can poison the results with NaNs or Infs.
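A numerically safe softmax for FP16 inputs combines the max-subtraction trick with FP32 intermediate computation; the following is a minimal NumPy sketch of the idea, not a production kernel:
import numpy as np

def stable_softmax_fp16(logits_fp16):
    # Softmax over the last axis for FP16 inputs, computed internally in FP32.
    x = logits_fp16.astype(np.float32)              # upcast once
    x = x - x.max(axis=-1, keepdims=True)           # max trick: largest exponential is e^0 = 1
    e = np.exp(x)                                   # no overflow possible now
    probs = e / e.sum(axis=-1, keepdims=True)       # FP32 accumulation of the denominator
    return probs.astype(np.float16)                 # cast back for the FP16 pipeline

logits = np.array([100.0, 3.0, -50.0], dtype=np.float16)
print(stable_softmax_fp16(logits))   # ~[1, 0, 0], no inf or NaN
print(np.exp(logits))                # naive FP16 exponentials: [inf, ~20, 0]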
Conditioning refers to how sensitive a function is to input perturbations. Neural networks themselves as functions can be poorly conditioned (e.g., tiny changes in weights cause big changes in output for some architectures), but typically they are designed to be reasonably well-behaved. Certain internal computations like matrix inversion (if one does a pseudoinverse or something for a layer) can be extremely ill-conditioned if the matrix is near singular. That again would need double precision perhaps. In ML, one common ill-conditioned operation is the softmax denominator: Σ_j e^(x_j) can be very large and also very sensitive to changes in the largest x_j. However, the softmax is more stable if computed carefully (max trick). Another is normalization: computing (x − μ)/σ for layer norm or batch norm; if σ is very small, then the division blows up noise. In those cases, a small ε is usually added to σ² under the square root to prevent instability (this is a model-level fix, not a numeric one per se, but it acknowledges numeric limits).
In training, optimizers can suffer from precision issues. For example, the Adam optimizer keeps a moving average of squared gradients; in half precision, these small updates might flush to zero or not change at some point. Some research found that certain adaptive optimizers break in FP16 because the small differences are lost ("Why FP16 training breaks NAdam" etc.). The solution is often to keep the optimizer states in FP32 while doing the bulk of compute in FP16. This again is a form of mixed precision.
From an inference perspective, once the model is trained, the main concerns are avoiding overflow in activations and ensuring summations (like in softmax or reductions) are done with enough precision. The model’s weights are fixed, so we don't have to worry about cumulative error over iterations as in training, just the forward pass.
6.3. Normalization Layers and Softmax
Normalization layers (BatchNorm, LayerNorm, etc.) and Softmax are specifically worth examining for numerical stability:
- Batch Normalization: During inference, batchnorm applies an affine transform using precomputed mean and variance (from training). It’s basically: y = γ · (x − μ) / √(σ² + ε) + β. This is a stable operation as long as ε is provided (to avoid division by zero). Since μ, σ², γ, and β are constants in inference, it reduces to linear ops which are fine. The main numerical work (computing mean/variance) happened during training or model calibration, and frameworks usually did that in FP32. So batchnorm at inference time is not a source of instability (it’s just a scale and shift of the data).
- Layer Normalization / Group Normalization: These compute mean/var on the fly for each sample (and possibly channel group). In inference, they still need to compute these statistics for each forward pass. If implemented in low precision, the variance calculation can suffer cancellation as mentioned. However, it's likely that inference engines compute the mean and variance in FP32 even if inputs are FP16 (especially since these are per-sample operations, not huge batch reductions – it's feasible to do in higher precision). If someone implemented layer norm purely in FP16, they might see issues when the variance is very small. Usually there’s an ε added which might dominate a tiny variance and thus mitigate catastrophic cancellation (the result might just output something mostly from ε). In summary, normalization layers are usually fine as long as the implementation uses enough precision for the summation of squares.
- Softmax: Softmax is notorious if done improperly. The safe algorithm: subtract the maximum input from all inputs (so the largest exponent becomes exp(0) = 1), exponentiate, sum the results, and divide each exponential by the sum. The max subtraction handles overflow in exp(xi). But what about the sum? As discussed, if there are N elements in the softmax, the sum is at most N (when all inputs equal the max). In classification, N is the number of classes: for ImageNet, N = 1000, so the worst-case sum is about 1000, comfortably below the FP16 maximum of 65504. For an extreme case like a language model, N could be 50,000 (the vocabulary size); that is still below 65504, so even then an FP16 sum would not overflow. Rounding is the larger concern for big N: the ULP of FP16 near 50,000 is 32, and in the worst case where all terms equal 1.0 a strictly sequential FP16 accumulation stalls at 2048, so a naively accumulated half-precision denominator over a large vocabulary can be noticeably off. In practice kernels use tree reductions and/or FP32 accumulators for the denominator, and for typical class counts the effect on the normalized probabilities is negligible – you might see 1.0 where FP32 would give 0.99999. Usually fine.
- The bigger issue is if one or a few terms dominate the softmax. If one term is much larger, the others underflow to 0 after the max subtraction. That’s fine: you get a one-hot distribution essentially. If the model expected a softer distribution, that could be a problem. But if the logits were really that far apart, even FP32 would give nearly one-hot. So not really a precision issue, more of a model property.
- However, stability during training is a bigger issue. The ColossalAI GitHub thread discussed earlier described training a 130B-parameter model where FP16 softmax in attention caused training to occasionally blow up[14]. They observed attention scores in some heads ranging up to 1e4 or as low as 1e-3[14]. An input of 1e4 to the exponential is definitely Inf in FP16 (exp(1e4) is astronomically huge; even exp(100) overflows half precision). The method CogView introduced (Precision Bottleneck Relaxation) involved subtracting the per-head maximum to reduce the range[14], which is essentially the softmax max trick applied at a finer granularity. In the end, they found the simplest fix was to compute the softmax in FP32 for those layers[14]. They reported negligible speed impact and significantly improved stability[14]. This highlights a general principle: certain operations like softmax (and sometimes the matrix multiplications in attention) benefit from higher precision, and using FP32 in a few critical spots can save a mixed-precision training run from crashing. In inference, using FP32 for softmax is likewise a safeguard if one is uncertain; it costs little because softmax is cheap compared to the matrix multiplies.
- ONNX does have a Softmax operator; if the input is float16, a backend may do exactly what that thread describes: internally cast to float32 for the softmax computation, then cast the output back to float16. Some inference engines (DeepSpeed's inference engine, among others) keep the attention softmax in FP32 even when the rest of the model runs in FP16, for robustness[14].
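For cases where you would rather encode that decision in the model itself than rely on backend behavior, the following sketch (using the standard onnx.helper API; the graph and file names are made up for illustration) builds an explicit FP32 softmax "island" inside an otherwise FP16 graph:

```python
import onnx
from onnx import TensorProto, helper

# FP16 logits come in; upcast, run Softmax in FP32, downcast the probabilities.
logits = helper.make_tensor_value_info("logits", TensorProto.FLOAT16, ["batch", "classes"])
probs = helper.make_tensor_value_info("probs", TensorProto.FLOAT16, ["batch", "classes"])

nodes = [
    helper.make_node("Cast", ["logits"], ["logits_fp32"], to=TensorProto.FLOAT),
    helper.make_node("Softmax", ["logits_fp32"], ["probs_fp32"], axis=-1),
    helper.make_node("Cast", ["probs_fp32"], ["probs"], to=TensorProto.FLOAT16),
]

graph = helper.make_graph(nodes, "fp32_softmax_island", [logits], [probs])
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])
onnx.checker.check_model(model)
onnx.save(model, "fp32_softmax_island.onnx")
```

The same Cast/Softmax/Cast pattern can be spliced into an existing exported model with graph-surgery tooling; the point is that the precision choice becomes visible and portable rather than implementation-defined.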
Apart from softmax, exponentials also occur in loss functions like cross-entropy (which uses log-softmax internally) and in activation functions like Sigmoid and Tanh. Sigmoid(x) = 1/(1 + exp(−x)). If x is moderately large and positive, the exp(−x) term stops mattering in FP16 well before it does in FP32: once exp(−x) falls below FP16's rounding granularity near 1 (about 5e-4, i.e. x beyond roughly 8), the sum 1 + exp(−x) rounds to 1 and the output is exactly 1.0; for x beyond about 17, exp(−x) underflows FP16 entirely. In FP32, exp(−11) is about 1.7e-5, still representable and still shifting the result, so the sigmoid is ≈ 0.99998 rather than exactly 1. So FP16 saturates some outputs to exactly 0 or 1 sooner than FP32 would. This can slightly change behaviors (a neuron reaches full saturation a bit early). Usually this is not a big deal, though in training it could make gradients go to zero earlier. In inference we don't care about gradients, just the output value, and getting exactly 1.0 versus 0.99998 has essentially no practical difference for a classification output. But be aware that activation functions can saturate faster in lower precision due to limited range or precision. ReLU has no issue (it is linear or 0), but exponential-based activations do. In practice, frameworks often simply compute these in the given precision and accept the slight difference; if it were a problem, one could compute them in FP32 as well.
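A quick NumPy check of where that saturation kicks in (the exact crossover depends on how the library evaluates exp and rounds, so treat the printed values as indicative):

```python
import numpy as np

x = np.arange(0.0, 20.0, 1.0)

sig32 = 1.0 / (1.0 + np.exp(-x.astype(np.float32)))
sig16 = np.float16(1.0) / (np.float16(1.0) + np.exp(-x.astype(np.float16)))

# First input value at which each precision returns exactly 1.0
print("FP16 saturates at x =", x[np.argmax(sig16 == np.float16(1.0))])   # around 8
print("FP32 saturates at x =", x[np.argmax(sig32 == np.float32(1.0))])   # around 17
```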
Summary: Numerical stability in inference is maintained by using robust computation methods: subtracting maxima in softmax/LogSumExp, adding epsilon in denominators for normalization, and using higher precision for summations or critical steps. Most of these are built into the algorithms. ONNX models assume these standard practices (they don't need to explicitly do them in the graph because the backend inherently will for ops like Softmax). It's wise when building or exporting models to ONNX to test them in the target precision and ensure there are no surprises (like an output becomes NaN or Inf due to precision). Usually if training was done in mixed precision, the model will have been vetted in that environment already.
Performance vs. Accuracy Trade-offs
There is an inherent trade-off between numerical precision and computational performance (speed, memory usage) in AI systems. Choosing lower precision can dramatically increase throughput and reduce memory, but it may introduce arithmetic errors. Engineers must balance these considerations:
7.1. Throughput, Memory Bandwidth, and Precision
Throughput (speed) is often higher for lower-precision operations. Modern hardware can execute more low-precision ops in parallel than high-precision ops. For example, an NVIDIA V100 GPU can perform 2× more half-precision FLOPs than single-precision per clock, and newer architectures with Tensor Cores can do 8× more (since a Tensor Core warp operation might perform 64 FP16 FMAs per cycle vs 16 FP32). Similarly, int8 operations can be even faster and more energy efficient. Lower precision numbers also use less memory bandwidth – fetching 16-bit data is twice as fast as 32-bit data from memory (assuming the same bus width), which can relieve memory bottlenecks. Many deep learning workloads are memory-bound, so halving the data size can nearly double throughput if compute is not the limiting factor. This is a huge incentive to use lower precision.
Memory and storage: Using FP16 or int8 weights reduces model size, which is important for deploying large models to memory-limited environments (like edge devices or GPUs with limited VRAM). It also means you can fit larger batch sizes in memory, or more models on a chip, etc. There is also a cascade effect: smaller data means better cache utilization, which further improves performance.
However, accuracy suffers if precision is too low. We have discussed how rounding error and limited range can cause deviations in outputs. The question is: do those deviations matter for the application? Often for neural networks, a small amount of noise or error in intermediate calculations does not change the final prediction. Neural nets are somewhat error-tolerant due to their redundant distributed representations. This is why quantization and FP16 techniques can work at all. The final model accuracy (e.g. top-1 accuracy on ImageNet) might drop only 0.5% or less when using FP16 instead of FP32, or int8 instead of FP16, if done properly. For many use-cases, this loss is acceptable given the speed gains.
If one goes too low (like naive FP8 or int4 without special care), accuracy might drop severely – perhaps making the model unusable. Thus the trade-off: as precision decreases, performance increases but at some point accuracy degrades beyond an acceptable threshold. Engineers aim to find the lowest precision that still preserves accuracy.
There is also the consideration of accumulation strategy affecting performance: for instance, Kahan summation (compensated) is more accurate but ~4× slower than naive summation. If you need that accuracy, you pay a performance cost. In many AI inference scenarios, the slight extra accuracy from something like Kahan isn't needed, so it’s not used. But in a scenario where you sum millions of elements for a final result that is thresholded, maybe you do need it.
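For reference, here is a plain-Python sketch of compensated (Kahan) summation; the helper name is arbitrary and the example values are chosen purely to make the effect visible:

```python
def kahan_sum(values):
    """Compensated summation: carry a running correction so small addends
    are not lost when added to a much larger partial sum."""
    total = 0.0
    compensation = 0.0                    # estimate of the low-order bits lost so far
    for v in values:
        y = v - compensation              # re-inject the previously lost error
        t = total + y                     # big + small: low-order bits of y may drop...
        compensation = (t - total) - y    # ...and this recovers exactly what dropped
        total = t
    return total

vals = [1.0] + [1e-16] * 1_000_000
print(sum(vals))        # naive double-precision sum: every 1e-16 is absorbed, prints 1.0
print(kahan_sum(vals))  # compensated: ~1.0000000001
```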
Another dimension is determinism vs speed. Enforcing a deterministic reduction order (for reproducibility) might require disabling some parallelism, which could slow things down. If exact repeatability is critical (like certain scientific or financial applications of neural nets), one might sacrifice some performance to ensure the summation happens the same way every time. ONNX allows both modes depending on the runtime.
From a memory perspective, if precision is higher than needed, you waste memory. For example, storing intermediate activations in FP32 when they could be FP16 uses 2× memory, which might mean lower batch size or more memory traffic. So there's strong motivation to keep things as low-precision as possible except where accuracy truly demands.
In summary, the trade-off often manifests as diminishing returns: FP64 vs FP32 – huge slowdown for usually negligible accuracy gain in ML; FP32 vs FP16 – moderate speedup for usually minor accuracy hit; FP16 vs int8 – another big speedup but maybe more accuracy loss unless calibration is done; int8 vs int4 – speed gain but often big accuracy drop that might not be worth it for current models.
Engineers use techniques like mixed precision to target high precision only where needed. For example, run most of the model in FP16, but keep a few sensitive layers in FP32. This yields most of the performance benefits while ensuring critical parts are accurate. We already saw softmax as one example. Another might be final output layer: sometimes using FP32 for the logits and softmax can be safer.
Hardware is also trending toward mixed modes – for example, FP8 Tensor Cores that accumulate in FP32 – so the hardware itself increasingly tries to provide extra accuracy cheaply.
7.2. Hardware Accelerators and Tensor Cores
Hardware accelerators (GPUs, TPUs, FPGAs, ASICs like GraphCore IPU, etc.) have specific support for certain precisions. Leveraging these can provide massive speedups:
- NVIDIA Tensor Cores: These started in Volta (V100) with support for 4×4 matrix multiply-accumulate on FP16 inputs with FP32 accumulation. Using Tensor Cores can give up to 8× the throughput of the regular FP32 cores, but the data must be FP16 (or, on newer GPUs, BF16 or FP8) and matrix dimensions must be aligned (multiples of 8 and so on, which frameworks handle by padding if needed). If an ONNX model is all FP32, it will not use Tensor Cores by default unless the runtime automatically converts some ops to mixed precision (which some do via a flag or environment variable enabling TF32 or FP16). TensorFloat-32 (TF32), introduced with Ampere GPUs, is essentially a compromise: FP32 inputs are rounded to a 19-bit format with an 8-bit exponent and a 10-bit mantissa so the multiplications can run on Tensor Cores at roughly 8× speed, while accumulation and outputs remain FP32. TF32 therefore has roughly the precision of FP16 but the range of FP32, and it was chosen because it typically does not hurt convnet accuracy. ONNX Runtime and other frameworks on Ampere will use TF32 by default for convolutions and matmuls unless it is disabled, giving a free speed boost at minimal accuracy cost; if exact FP32 is needed, users must turn it off (sacrificing speed). [6]
- TPU (Google): TPUs have native support for BF16 matrix ops with FP32 accumulators. They were designed specifically for the mixed precision approach in training. For inference, TPUs also support int8 and other quantized ops (e.g., via Edge TPU for lower-power). If exporting ONNX to run on a TPU (via some bridge), one might get automatic BF16 usage. BF16 has the advantage of no loss in range, so TPUs can more straightforwardly run models in BF16 without modifications.
- FPGAs and ASICs: Some custom accelerators might use fixed-point or block floating-point arithmetic for efficiency. They often treat the network in quantized terms. The ONNX quantization specification (QuantizeLinear/DequantizeLinear ops) exists for representing quantized models (INT8 etc.), which is another domain beyond floating-point, but it overlaps as a precision issue. In those systems, accumulators for INT8 × INT8 multiplies are usually 32-bit (to avoid overflow for dot products up to certain lengths). The trade-off there is between integer vs float representation, which can give big speedups on CPUs/DSAs via SIMD instructions (many CPUs have vector int8 multiply-add instructions now for inference).
- Memory bandwidth on hardware: Some accelerators include special high-bandwidth memory or cache specifically for lower precision. For example, an accelerator might fetch two FP16 values in one 32-bit memory transaction – effectively doubling memory bandwidth for half precision. On some hardware, using FP16 can even reduce memory latency if multiple values can be packed.
- Parallelism and Vectorization: Many hardware units process data in fixed vector widths (e.g., 256-bit registers). That could be 8 floats at once or 16 half-floats at once in the same 256 bits. If your code is in FP16, you might use the full vector width (16 values) whereas FP32 might only use half (8 values). This means FP16 can make better use of vector units. Similarly, matrix units often have fixed tile sizes, and smaller data types allow larger tiles or more tiles concurrently.
- Energy efficiency: Lower precision often means less energy per operation, which on large deployments (data centers or battery-powered devices) is a significant advantage. This isn't directly performance, but it's an important trade: you might choose a slightly lower precision to save power even if accuracy drops a tiny bit.
- Software overhead: Using mixed precision on hardware involves some overhead for casting and moving data between precisions. If you have to frequently cast FP16 to FP32 and back, that adds instructions and memory traffic. A well-designed accelerator (GPUs with fast hardware casting between FP16 and FP32, for example) minimizes this cost. Still, the runtime might insert extra operations: as we saw, ONNX Runtime on CPU inserted many Cast nodes converting FP16 to FP32 and back, which actually made that FP16 model about 4× slower than its FP32 counterpart because the CPU had no native FP16 arithmetic. So, ironically, using lower precision in ONNX can degrade performance on hardware that does not support it. Always confirm that the target hardware actually supports the low precision you plan to use; otherwise you will pay overhead for emulation. [9]
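In practice, the cheapest way to catch this is to benchmark both variants on the deployment hardware before committing. A minimal timing sketch with ONNX Runtime follows; the file names model_fp32.onnx / model_fp16.onnx and the input shape are placeholders for whatever model you are testing:

```python
import time
import numpy as np
import onnxruntime as ort

def mean_latency(path, dtype, shape=(1, 3, 224, 224), runs=50):
    """Average per-inference latency of a single-input model on the CPU provider."""
    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    name = sess.get_inputs()[0].name
    x = np.random.rand(*shape).astype(dtype)
    sess.run(None, {name: x})                      # warm-up run
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, {name: x})
    return (time.perf_counter() - start) / runs

print("FP32:", mean_latency("model_fp32.onnx", np.float32))
print("FP16:", mean_latency("model_fp16.onnx", np.float16))  # may well be slower on CPU
```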
To illustrate trade-off: Suppose you have a model that achieves 99% accuracy in FP32 and drops to 98.5% in FP16. If FP16 runs 2x faster, many would accept the 0.5% accuracy drop for double throughput. If int8 runs 4x faster but accuracy drops to 95%, that might be too much drop for some applications, but acceptable for others that value speed. It's application-dependent.
In critical applications, one might maintain higher-precision accumulators or run certain validation steps in high precision to ensure nothing has drifted. In research, a model is sometimes trained in mixed precision but validated in double precision to confirm that small numeric differences do not influence the reported result (rare, but it can matter when neural networks feed into scientific computing pipelines).
In the ONNX world, performance-vs-accuracy is often managed via Execution Providers (EPs) that target specific hardware. For example, the TensorRT EP in ONNX Runtime can take an ONNX model and run it using NVIDIA's high-performance inference library, which will use FP16 or INT8 if allowed, to maximize speed. The user can configure the EP with precision flags. If the accuracy drop is too high, one might restrict it to FP32. If it's fine, allow FP16. Tools exist to measure the accuracy difference (like calibrating int8 quantization).
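As one concrete example of such a configuration (the option names below are those documented for the ONNX Runtime TensorRT execution provider and may differ between releases, so treat this as a sketch rather than a reference):

```python
import onnxruntime as ort

# Let TensorRT use FP16 kernels where it considers them safe and profitable;
# keep INT8 disabled until calibration data and an accuracy check are in place.
providers = [
    ("TensorrtExecutionProvider", {"trt_fp16_enable": True, "trt_int8_enable": False}),
    "CUDAExecutionProvider",   # fallback for subgraphs TensorRT does not take
    "CPUExecutionProvider",
]

sess = ort.InferenceSession("model.onnx", providers=providers)
```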
To conclude, the trade-offs revolve around picking the lowest precision that yields acceptable accuracy while utilizing the hardware's features fully. ONNX, being a high-level specification, abstracts this; it is ultimately up to the deployment to decide, and sometimes up to the exporter to incorporate quantization. ONNX's role is to be able to represent these choices (the standard has added quantized types, FP8 types, and so on, so that graphs can express them when needed).
Best Practices and Design Guidelines
Given the above insights, here are some best practices and guidelines for dealing with numerical precision and accumulators in AI inference (and training) design:
8.1. Choosing an Accumulator Precision
- Match precision to data length and range: If an operation accumulates a large number of terms, consider using a higher-precision accumulator than the input data. For example, when summing more than ~1000 FP16 values, an FP32 accumulator is recommended to avoid large rounding error or outright overflow. When summing millions of FP32 values, consider FP64 accumulation if high accuracy is needed (rare in deep learning). Rule of thumb: the worst-case relative error of a naive sum of N terms grows like N·ε, so if N·ε is no longer tiny relative to your accuracy target (say, larger than about 1e-3), you may benefit from higher precision. For FP16 (ε ≈ 1e-3) that threshold is reached almost immediately, and a few hundred terms can already cost a couple of significant digits; for FP32 (ε ≈ 1e-7), N has to reach the tens of thousands before the bound hits 1e-3 and the millions before the loss becomes comparable to FP16's. (See the sketch after this list for a concrete FP16 example.)
- Use higher precision in critical layers: Identify if certain layers of the model are more sensitive to precision. For instance, the output layer or a layer that feeds into a very sharp decision boundary might require more precision. If a small change there could flip the prediction, you don't want excessive numerical error. Using FP32 for logits and softmax is a common practice to maintain classification confidence stability.
- Follow hardware defaults, but override if needed: Many hardware accelerators have sensible defaults (like accumulate in FP32 for tensor ops). It's usually best to use those defaults as they were chosen to balance accuracy and speed. Only override if you have evidence of a problem. For example, if using ONNX Runtime on CPU, you discovered it upcasts to float for you – great. But if using some custom accelerator that does pure FP16, you might manually insert casts in the ONNX graph around certain ops to force higher precision on those ops.
- Maintain precision for gradient accumulation (training): This is more training-oriented, but worth noting: when summing gradients over batches or updating weights, use higher precision (FP32 or even 64) for those accumulations. This ensures training doesn't diverge due to numeric error. Many frameworks keep a "master copy" of weights in FP32 when training in FP16. For inference, this translates to: any running statistics or counters in the model (like if you had a running average inside the model) should probably be in higher precision to avoid drift. [2]
- Be mindful of intermediate cast costs: If you do decide to cast to higher precision for an operation in ONNX, remember it incurs memory copies and compute overhead. Make sure it's worth it. For example, if you cast a huge tensor to double, that could slow things drastically. Usually FP16->FP32 is fine (hardware handles it well), but FP32->FP64 might be expensive on GPUs.
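The sketch promised above: a NumPy toy example showing an FP16 running sum stalling once the partial sum's spacing exceeds the addend, versus the same data accumulated in FP32:

```python
import numpy as np

values = np.full(100_000, 1.0, dtype=np.float16)

acc16 = np.float16(0.0)
for v in values:              # naive sequential sum with an FP16 accumulator
    acc16 = acc16 + v

acc32 = values.astype(np.float32).sum()   # same data, FP32 accumulator

print(acc16)   # 2048.0: once the FP16 spacing at the partial sum reaches 2, adding 1.0 is lost
print(acc32)   # 100000.0
```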
8.2. Safe Reduction Lengths and Heuristics
- Chunk long reductions: If you must do a very long sum in low precision (due to memory constraints or hardware), chunk the sum into parts, sum each chunk into a higher-precision local accumulator, then sum the partial sums; this is essentially pairwise summation. Many libraries use a fixed chunk-size heuristic (e.g., sum in blocks of 256 in registers, then add the block totals), which bounds the error growth. If implementing this manually, choose a chunk size such that chunk_length × ε stays small; 256 × 1e-3 ≈ 0.26 for FP16 is already borderline, so a smaller chunk of around 100 may be preferable. (A sketch follows this list.)
- Sort by magnitude if possible: In some cases (not typical in DL, but possible in post-processing), if the numbers vary by orders of magnitude, summing from smallest to largest improves accuracy because the small values are not swallowed by the big ones. This matters mostly when you output a sum of mixed-scale values (an algorithm mixing large and small contributions). Sorting is not always feasible in real-time inference, but keep it in mind.
- Use compensation for critical sums: If you find a particular sum or difference in your inference computation is crucial (like computing a small residual from two close numbers), consider using Kahan summation or another compensated scheme just for that part. For example, if you had an ONNX custom function that needs to sum a lot of almost cancelling terms, you might implement it with compensation internally.
- Heuristics for FP8/low-bit: When using extremely low precision like FP8, adopt the recommended practices: use scaling (don't feed raw data into FP8; normalize it first), and perhaps apply FP8 only to portions of the network that are empirically robust. The industry is still establishing heuristics, but a reasonable starting point is: use FP8 for the matrix multiplications but accumulate in FP16 or FP32, keep normalization layers in FP16, and perhaps keep the first and last layers in FP16 as well. This is evolving, so keep an eye on the literature.
- Keep an eye on sum length in softmax and normalization: If you have extremely large class counts or group sizes (e.g., a softmax over 1e6 items, or layer norm over tens of thousands of features), realize this is a giant reduction. You may want to ensure that reduction is done in higher precision. ONNX Softmax by default will likely not handle a million-class case in FP16 well (the sum would overflow since 1e6 > 65504). So either do that on CPU in FP32 or break it down (could do hierarchical softmax, etc.). This is a corner case, but relevant for huge vocab language models. In practice, vocab ~50k is fine as we said.
- Test numeric differences: A practical tip: after quantizing or reducing precision, run some test inputs through both the high-precision model and low-precision model. Compare outputs (and maybe some intermediate signals if possible). If differences are consistently small relative to the application tolerance, you're good. If you see occasional large differences or instabilities (like outputs become NaN or Inf in low precision), that's a red flag. Use that testing to guide where you might need to increase precision or change algorithms.
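The chunked-reduction sketch referenced in the first bullet above, again in NumPy and purely illustrative (the chunk size and helper name are arbitrary):

```python
import numpy as np

def chunked_sum(x_fp16, chunk=256):
    """Sum an FP16 array block by block, accumulating each block in FP32.

    The per-block partial sums are then combined in a small, cheap outer sum,
    which bounds error growth much like pairwise summation."""
    partials = [x_fp16[i:i + chunk].astype(np.float32).sum()
                for i in range(0, len(x_fp16), chunk)]
    return float(np.sum(partials, dtype=np.float64))

x = np.random.rand(1_000_000).astype(np.float16)
print(chunked_sum(x))
print(float(x.astype(np.float64).sum()))   # high-precision reference for comparison
```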
8.3. When Higher Precision Is Required
- Ill-conditioned calculations: If your model or pipeline performs any operation known to be ill-conditioned (matrix inversion, subtraction of nearly equal quantities, computing a tiny difference of large numbers), you likely need higher precision for that part. For example, if your ONNX model includes a custom op to solve a linear system, use double precision in that op if possible (or at least float with iterative refinement). In ML, this might occur in something like a physics-informed neural network or a differentiable simulation within the model.
- Strict accuracy requirements: Some applications (medical, aerospace, finance) might have strict tolerances. If an inference model’s output needs to be correct to 5 decimal places, you probably cannot use FP16, which has only 3-4 decimal digits precision. FP32 might be barely enough (~7 digits). FP64 might be warranted to guarantee that level of accuracy in worst case. Most deep learning applications (like image classification) don't demand that level of numeric precision in outputs – as long as the top class is correct, it doesn't matter if its probability was 0.901 vs 0.899. But if you're doing something like summing probabilities to make a risk estimate or doing a cumulative sum over many predictions, be careful to accumulate precisely.
- Large sums or products in post-processing: Sometimes the neural net is fine, but the post-processing of results might need precision. E.g., summing up probabilities of many events, or computing confidence intervals, etc. Do those calculations in double if needed outside the model.
- Preventing drift in long sequences: For sequential models (RNNs processing thousands of steps), rounding error can accumulate over steps. If each step has a small error, after 10000 steps it could add up. In such cases, using a higher precision for the state might reduce drift. Some practitioners use FP64 for the hidden state accumulation in very long sequence processing to maintain numerical stability (though this is not common due to speed loss). Alternatively, periodically renormalize or reset small errors.
- Simulations or differential equations: If your ONNX model involves simulating something (like a differential equation solver inside a model, or an ODE/RNN hybrid), those often need double precision to be stable for long durations. Recognize those situations and allocate precision accordingly.
- Verification and debugging: Use higher precision as a tool to verify correctness. For example, run your model in FP64 (if possible) and compare to FP32 to see if there are differences. If not, FP32 is fine. If yes, you might find a particular calculation losing too much info in FP32, implying maybe you should keep it in FP64 in production as well. This is rare in inference but can be part of rigorous validation.
In ONNX, using higher precision may involve casting to tensor(double) for certain ops. Not all runtimes support double on GPU, though (some do only via emulation or a slower code path). ONNX Runtime's CPU execution provider does support double for most general ops. So be mindful of the target.
Finally, document your precision choices. If you deliver an ONNX model that uses mixed types (float16 inputs, float32 accumulations, int8 weights, etc.), ensure that is clear to whoever maintains or uses it. Precision issues can be sneaky, so clarity helps.
Limitations and Open Problems
Despite advances in handling numerical precision, there remain limitations and open challenges in this domain:
- Lack of Precision Specification in ONNX: As noted, ONNX currently has no way to formally specify that an op should use a higher precision internally. This can lead to differences across platforms. While proposals exist to add attributes for this, it is not yet standard. Thus an ONNX model might behave slightly differently on two compliant backends, and there is no built-in mechanism to ensure identical results. This is a limitation for applications requiring cross-platform consistency. Solving it may involve extending ONNX (e.g., an attribute like accumulatePrecision="FLOAT32" for MatMul) or tightening the spec, but those options come with backward-compatibility and implementation-burden concerns. [12]
- Reproducibility and Determinism: Even with the same precision, parallel computation can introduce nondeterminism (due to different summation orders). Ensuring bit-identical results across runs (important for some sensitive domains) is non-trivial. Frameworks often provide "deterministic mode" which might use fixed ordering at cost of speed. In ONNX, there's no global flag for this; it's up to the runtime. More research could be done on algorithms that produce bit-wise identical results independent of parallel execution order (perhaps using techniques like pairwise summation or consistent partitioning). Stochastic rounding is another area: sometimes using probabilistic rounding can reduce bias in accumulated error, but it introduces randomness – reconciling that with reproducibility is tricky.
- Ultra-low Precision (FP8 and below): While FP8 is being explored, going to 4-bit floats or mixed schemes is an open challenge. The research and open-source communities are investigating 4-bit quantization for LLMs with some success, but a true 4-bit float (like E3M1) would be extremely limited. Hybrid schemes (block floating point, log scales, and so on) may emerge. ONNX already has some support (4-bit types are mentioned in documentation placeholders), but it is nascent. Research into training techniques, calibration, and new number formats (such as posits or tapered-precision floats) is ongoing. If a new format proves useful, ONNX will need to adapt to represent it. [10]
- Quantization Noise and Model Robustness: Reducing precision can sometimes expose models to issues like adversarial vulnerability or fragility – e.g., a slight perturbation might flip a decision if the margin was small and precision loss nudged it. Ensuring model robustness under quantization is an open problem. It might require re-training the model with quantization in the loop (quantization-aware training) or designing models that have large "margin" between classes to tolerate quantization noise.
- Dynamic range during training vs inference: In training, activations can have wider distributions (especially early in training). Models often learn to constrain their ranges (via activation functions saturating or normalization), which makes inference more stable in lower precision. But if one tries to train in FP8 from scratch, it's very hard. This gap between what's possible in a post-trained inference model vs training is an open area – how to train directly in low precision without significant accuracy loss. Solving that could simplify workflows (no need for high-precision training then quantize; just train low-prec end to end). Techniques like loss scaling, gradient clipping, etc., are partial solutions. The development of optimizers that are resilient to low precision is ongoing.
- Hardware variability: As new hardware emerges (with new numeric capabilities or quirks), ensuring that ONNX models run consistently is an ongoing task. For instance, one device might implement a fused op that reduces rounding error while another performs the steps separately. Even an operation like convolution might use a different algorithm (FFT-based versus direct) with different numeric properties. An open question is whether ONNX should allow specifying numeric tolerances or preferred algorithms per op; that may be overkill for most users, but it could be desirable for some HPC-style ML use cases.
- Testing and validation of numerics: There isn't a standardized set of "numeric compliance" tests for ONNX runtimes beyond simple correctness. Perhaps in the future, test suites might include challenging numeric scenarios (large sums, nearly singular matrices) to ensure a runtime handles them gracefully (maybe by switching precision). Right now, it's mostly on the user to test their model under expected conditions.
- Combining quantization with floating point: Many production models use a mix (e.g., int8 for some layers, FP16 for others). ONNX can represent this (with QuantizeLinear ops or MixedPrecision through casting), but tooling to automatically find the optimal mix is still developing. It's somewhat trial-and-error or using heuristics. An open question is how to automate precision assignment: given a model and a target hardware, how to automatically choose which layers can be int8, which should be FP16, etc., to meet an accuracy target. This crosses into NAS (neural architecture search) territory or automated quantization.
- Numerical debugging tools: When a model goes wrong due to precision, it's often hard to pinpoint where. Better tooling to analyze numerical error layer by layer could be useful. For instance, one could run a model in double and float and diff the intermediate results to see where the largest divergence happens. Doing this systematically for big nets is an open tooling challenge. Some frameworks have hooks for this, but not standardized.
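Even without per-layer hooks, a crude end-to-end check is easy to script. The sketch below compares the final outputs of a reference-precision model and a reduced-precision variant with ONNX Runtime; the file names, input shape, and the assumption of a single FP32/FP16 input are all placeholders:

```python
import numpy as np
import onnxruntime as ort

def worst_output_diff(ref_path, test_path, shape=(1, 3, 224, 224), trials=20):
    """Run both models on the same random inputs and report the largest
    absolute difference seen in the first output tensor."""
    ref = ort.InferenceSession(ref_path, providers=["CPUExecutionProvider"])
    test = ort.InferenceSession(test_path, providers=["CPUExecutionProvider"])
    worst = 0.0
    for _ in range(trials):
        x = np.random.rand(*shape).astype(np.float32)
        y_ref = ref.run(None, {ref.get_inputs()[0].name: x})[0].astype(np.float64)
        y_test = test.run(None, {test.get_inputs()[0].name: x.astype(np.float16)})[0].astype(np.float64)
        worst = max(worst, float(np.abs(y_ref - y_test).max()))
    return worst

print(worst_output_diff("model_fp32.onnx", "model_fp16.onnx"))
```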
In conclusion, while we've come far with using lower precision in AI without sacrificing much accuracy, the landscape is always evolving. The key open problem is how to push precision lower (for efficiency gains) without crossing the threshold where accuracy falls off a cliff. Each new generation of hardware and techniques (like FP8, novel quantization schemes) extends that boundary, but also brings new questions on how to manage the numerical errors. ONNX as an interchange will need to keep pace, possibly by adding more expressive power for these numeric considerations.
References
- [1] Accelerate and simplify Scikit-learn model inference with ONNX Runtime. Microsoft Open Source Blog. https://opensource.microsoft.com/blog/2020/12/17/accelerate-simplify-scikit-learn-model-inference-onnx-runtime
- [2] Floating Point Precision: Understanding FP64, FP32, and FP16 in Large Language Models. DEV Community (dev.to). https://dev.to/lukehinds/floating-point-precision-understanding-fp64-fp32-and-fp16-in-large-language-models-3gk6
- [3] Floating Point Precision and Its Limitations. Umair Akbar, Medium. https://akbu.medium.com/floating-point-precision-and-its-limitations-cfb7247d7789
- [4] Kahan summation algorithm. Wikipedia. https://en.wikipedia.org/wiki/Kahan_summation_algorithm
- [5] Pairwise summation. Wikipedia. https://en.wikipedia.org/wiki/Pairwise_summation
- [6] Numerical behavior of NVIDIA tensor cores. PMC. https://pmc.ncbi.nlm.nih.gov/articles/PMC7959640/
- [7] Mixed precision and performance-accuracy tuning (neuron-cc). AWS Neuron Documentation. https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/appnotes/neuron-cc/mixed-precision.html
- [8] Create Float16 and Mixed Precision Models. ONNX Runtime documentation. https://onnxruntime.ai/docs/performance/model-optimizations/float16.html
- [9] [Performance] ONNX FP16 model is having performance bottle neck when compared to FP32 variant. Issue #25824, microsoft/onnxruntime, GitHub. https://github.com/microsoft/onnxruntime/issues/25824
- [10] Float stored in 8 bits. ONNX 1.21.0 documentation. https://onnx.ai/onnx/technical/float8.html
- [11] FP8 Formats for Deep Learning. arXiv:2209.05433. https://arxiv.org/abs/2209.05433
- [12] Should MatMul op allow specification of accumulation precision? Issue #7072, onnx/onnx, GitHub. https://github.com/onnx/onnx/issues/7072
- [13] Quantize ONNX models. ONNX Runtime documentation. https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html
- [14] Keep Attention Softmax FP32 during FP16/ZeRO Training. Issue #1485, hpcaitech/ColossalAI, GitHub. https://github.com/hpcaitech/ColossalAI/issues/1485