[While this is a self-answered question, I will happily upvote and accept any alternative answer that either provides superior accuracy for the same amount of computational work or reduces the amount of computational work while maintaining the same level of accuracy.]
I have previously demonstrated how to compute the complementary error function erfcf() with a maximum error of less than three ulps. This can serve as a building block for additional functions, such as the CDF of the standard normal distribution Φ(x) = ½ erfc (-√½ x) or the Gaussian Q-function, Q(x) = 1 - Φ(x) = ½ erfc (√½ x). However, for some use cases, computation fully accurate to single precision is not needed, while the contribution of erfc() evaluations to total run-time is not negligible.
The literature provides various low-accuracy approximations to the complementary error function, but they are either limited to a subset of the full input domain, or optimized for absolute error, or computationally too complex, for example by requiring multiple invocations of transcendental functions. How could one go about implementing erfcf() with high performance and with a relative accuracy of about 5 decimal digits across the entire input domain?
This answer, like OP's answer, achieves about 5 decimal digits of relative accuracy.*1 Achieving at least 5 digits everywhere would likely require another term in the rational polynomial - which I doubt would slow performance much.
Special values
In addition to performance and accuracy, the result of my_erfc(x) for select values should be considered.
For x near 0.0, the result should be 1.0 exactly.  This follows the principle of least surprise.
Example: a user may readily rely on my_cos(0.0) == 1.0 or my_exp(0.0) == 1.0, and so should be able to rely on my_erfc(0.0) == 1.0.
Fortunately, this is not hard to accomplish when a rational polynomial is used whose constant coefficients are 1.0. See below.
Take advantage of the linear region
About 45% of all float values are such that erfc() reduces to a simple linear expression with high accuracy.
#define TWO_OVER_ROOT_PI 1.1283791670955125738961589031215f
#define ERFC_SMALL 0.0053854f
float fast_erfcf_alt(float x) {
  if (fabsf(x) < ERFC_SMALL) {
    return 1.0f - TWO_OVER_ROOT_PI * x;
  }
  ...
}
Define performance rating
One could rate code by its worst case: what is the slowest evaluation over all float x?
I suspect OP is more interested in an assessment over a linear progression from x = 0.0 up to where erfc(x) becomes 0.0 (x ≈ 10).
Sample performance results and test code follow. Of course, it is hard to beat the C library's carefully crafted code.
0 time: 0.594  erfcf (Library function)
1 time: 0.703  fast_erfcf (OP's answer)
2 time: 0.688  fast_erfcf_alt (This answer)
#include <stdio.h>
#include <time.h>
#include <math.h>

float fast_erfcf(float x);
float fast_erfcf_alt(float x);

volatile float sink; // prevent the calls from being optimized away

int main(void) {
  float (*f[3])(float) = {erfcf, fast_erfcf, fast_erfcf_alt };
  for (int i = 0; i < 3; i++) {
    clock_t c0, c1;
    c0 = clock();
    float dx = 10.0f - nextafterf(10, 0); // one ulp at 10.0
    for (float x = 10.0f; x >= 0.0f; x -= dx) {
      sink = f[i](x);
    }
    c1 = clock();
    printf("%d time: %g\n", i, (double) (c1 - c0) / CLOCKS_PER_SEC);
  }
}
The following code is modeled after OP's self-answer and incorporates the ideas mentioned above.
#define TWO_OVER_ROOT_PI 1.1283791670955125738961589031215f
#define ERFC_SMALL 0.0053854f
float fast_erfcf_alt(float x) {
  if (fabsf(x) < ERFC_SMALL) {
    return 1.0f - TWO_OVER_ROOT_PI * x;
  }
  float a, c, e, p, q, r, s;
  a = fabsf(x);
  c = fminf(a, 10.5f);
  s = -c * c;
#if USE_BUILTIN_EXP
  e = expf(s);
#else // USE_BUILTIN_EXP
    e = my_expf (s);
#endif // USE_BUILTIN_EXP
  q = 0.374177223624056f;
  p = -5.00032254520701E-05f;
  q = q * c + 1.29051354328887f;
  p = p * c + 0.212358010453875f;
  q = q * c + 1.84437448399707f;
  p = p * c + 0.715675302663111f;
  q = q * c + 1.0f;
  p = p * c + 1.0f;
  r = e / q;
  r = r * p;
  if (x < 0.0f)
    r = 2.0f - r;
  if (isnan(x))
    r = x + x;
  return r;
}
Note: I'd consider eliminating c = fminf(a, 10.5f);.
This code's error results were obtained with OP's answer's test harness
(#define USE_FMA (0) and #define USE_BUILTIN_EXP (1)).
Its relative error (1.20e-05) versus OP's (1.06e-05) is primarily due to using exactly 1.0 as the constant term in both polynomials p and q.
Other code could use OP's answer's p and q together with this answer's if (fabsf(x) < ERFC_SMALL) test for the best of both.  Additional work is needed to see how well the linear and rational-polynomial sections meet.
 * This answer
 * max rel err = 1.203212e-05 @  9.19353199e+00
 * max abs err = 9.690295e-06 @ -7.19872937e-02
 * max ulp err = 1.877379e+02 @  8.03564072e+00
 *
 * OP's answer
 * max rel err = 1.061202e-05 @  4.35268253e-01
 * max abs err = 9.398669e-06 @ -9.19871107e-02
 * max ulp err = 1.755724e+02 @  2.46583080e+00
As an extra, below is a graphical view of OP's error over the [0 ... ~10] range. I found it illuminating to see where the errors develop.

*1 High relative accuracy is not possible when the results fall in the sub-normal range.
The following assumes a platform that complies with the IEEE-754 (2008) floating-point standard, on which float is mapped to the IEEE-754 binary32 format, and that uses the same endianness between 32-bit integers and float. It is further assumed that a C toolchain is available which (by setting appropriate command-line switches, if need be) preserves IEEE-754 semantics. I used the Intel C/C++ compiler with the switches -march=skylake-avx512 -O3 -fp-model=precise.
As the complementary error function is symmetric about (0, 1), one can focus on function inputs in the positive half-plane. Here the function decays roughly like exp(-x²), and with float computation it underflows to zero for arguments x > 10.5. If one plots erfc(x) / exp(-x²) on [0, 10.5], the shape suggests that it is somewhat difficult to approximate with a polynomial, but should be easily approximated by a rational function, that is, the ratio of two polynomials. Some initial experiments showed that two polynomials, each of degree 3, should be sufficient to achieve five-digit accuracy.
While there are many tools available that can generate polynomial approximations, this is unfortunately not the case for rational approximations. I used a modification of the Remez algorithm to generate an initial minimax approximation R(x) = P(x)/Q(x) to erfc(x) / exp(-x²), but had to follow up with a fairly extensive heuristic search to arrive at an approximation that provides close to equi-oscillation of the error peaks and almost achieves a relative error of 10⁻⁵, with the remaining differences being negligible for my needs.
By computing erfc(x) = exp(-x²) R(x), the accuracy achieved is obviously dependent on the accuracy of a platform's expf() implementation. In my experience, faithfully-rounded implementations of this function (with maximum error <= 1 ulp) are common. While the Intel compiler comes with a highly accurate math library that provides a nearly correctly-rounded implementation (maximum error very close to 0.5 ulps), I also tried my own faithfully-rounded alternative my_expf() with larger error and observed only a very minor impact on the accuracy of fast_erfcf().
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
#define USE_FMA          (1)
#define USE_BUILTIN_EXP  (0)
#if !USE_BUILTIN_EXP
float my_expf (float a);
#endif // USE_BUILTIN_EXP
/* Fast computation of the complementary error function. For argument x > 0
   erfc(x) = exp(-x*x) * P(x) / Q(x), where P(x) and Q(x) are polynomials. 
   If expf() is faithfully rounded, the following error bounds should hold:
   Maximum relative error < 1.065e-5, maximum absolute error < 9.50e-6, and 
   maximum ulp error < 176.5
*/
float fast_erfcf (float x)
{
    float a, c, e, p, q, r, s;
    a = fabsf (x);
    c = fminf (a, 10.5f);
    s = -c * c;
#if USE_BUILTIN_EXP
    e = expf (s);
#else // USE_BUILTIN_EXP
    e = my_expf (s);
#endif // USE_BUILTIN_EXP
#if USE_FMA
    q =             3.82346243e-1f;  //  0x1.8785c6p-2
    p =            -4.38094139e-5f;  // -0x1.6f8000p-15
    q = fmaf (q, c, 1.30382288e+0f); //  0x1.4dc756p+0
    p = fmaf (p, c, 2.16852024e-1f); //  0x1.bc1ceap-3
    q = fmaf (q, c, 1.85278833e+0f); //  0x1.da5056p+0
    p = fmaf (p, c, 7.23953605e-1f); //  0x1.72aa0cp-1
    q = fmaf (q, c, 9.99991655e-1f); //  0x1.fffee8p-1
    p = fmaf (p, c, 1.00000000e+0f); //  0x1.000000p+0
#else // USE_FMA
    q =         3.82346272e-1f; //  0x1.8785c8p-2f
    p =        -4.38243151e-5f; // -0x1.6fa000p-15
    q = q * c + 1.30382371e+0f; //  0x1.4dc764p+0
    p = p * c + 2.16852218e-1f; //  0x1.bc1d04p-3
    q = q * c + 1.85278797e+0f; //  0x1.da5050p+0
    p = p * c + 7.23953605e-1f; //  0x1.72aa0cp-1
    q = q * c + 9.99991596e-1f; //  0x1.fffee6p-1
    p = p * c + 1.00000000e+0f; //  0x1.000000p+0
#endif // USE_FMA
    r = e / q;
    r = r * p;
    if (x < 0.0f) r = 2.0f - r;
    if (isnan(x)) r = x + x;
    return r;
}
float uint32_as_float (uint32_t a)
{
    float r;
    memcpy (&r, &a, sizeof r);
    return r;
}
/* Exponential function. Maximum error 0.86565 ulps */
float my_expf (float a)
{
    float f, r, j, s, t;
    int i;
    unsigned int ia;
    // exp(a) = 2**i * exp(f); i = rintf (a / log(2))
    j = fmaf (1.442695f, a, 12582912.f); // 0x1.715476p0 // log2(e)
    j = j - 12582912.f; // 0x1.8p23 // 2**23+2**22
    f = fmaf (j, -6.93145752e-1f, a); // -0x1.62e400p-1  // log_2_hi 
    f = fmaf (j, -1.42860677e-6f, f); // -0x1.7f7d1cp-20 // log_2_lo 
    i = (int)j;
    // approximate r = exp(f) on interval [-log(2)/2, +log(2)/2]
    r =             1.37805939e-3f;  // 0x1.694000p-10
    r = fmaf (r, f, 8.37312452e-3f); // 0x1.125edcp-7
    r = fmaf (r, f, 4.16695364e-2f); // 0x1.555b5ap-5
    r = fmaf (r, f, 1.66664720e-1f); // 0x1.555450p-3
    r = fmaf (r, f, 4.99999851e-1f); // 0x1.fffff6p-2
    r = fmaf (r, f, 1.00000000e+0f); // 0x1.000000p+0
    r = fmaf (r, f, 1.00000000e+0f); // 0x1.000000p+0
    // exp(a) = 2**i * r
    ia = (i > 0) ? 0 : 0x83000000;
    s = uint32_as_float (0x7f000000 + ia);
    t = uint32_as_float ((i << 23) - ia);
    r = r * s;
    r = r * t;
    // handle special cases: severe overflow / underflow
    if (fabsf (a) >= 104.0f) r = (a < 0) ? 0.0f : INFINITY;
    return r;
}
uint32_t float_as_uint32 (float a)
{
    uint32_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}
uint64_t double_as_uint64 (double a)
{
    uint64_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}
double floatUlpErr (float res, double ref)
{
    uint64_t refi, i, j, err;
    int expoRef;
    
    /* ulp error cannot be computed if either operand is NaN, infinity, zero */
    if (isnan (res) || isnan (ref) || isinf(res) || isinf (ref) ||
        (res == 0.0f) || (ref == 0.0)) {
        return 0.0;
    }
    i = ((int64_t)float_as_uint32 (res)) << 32;
    expoRef = (int)(((double_as_uint64 (ref) >> 52) & 0x7ff) - 1023);
    refi = double_as_uint64 (ref);
    if (expoRef >= 129) {
        j = (refi & 0x8000000000000000ULL) | 0x7fffffffffffffffULL;
    } else if (expoRef < -126) {
        j = ((refi << 11) | 0x8000000000000000ULL) >> 8;
        j = j >> (-(expoRef + 126));
        j = j | (refi & 0x8000000000000000ULL);
    } else {
        j = ((refi << 11) & 0x7fffffffffffffffULL) >> 8;
        j = j | ((uint64_t)(expoRef + 127) << 55);
        j = j | (refi & 0x8000000000000000ULL);
    }
    err = (i < j) ? (j - i) : (i - j);
    return err / 4294967296.0;
}
int main (void)
{
    uint32_t argi, resi, refi, diff;
    float arg, res, reff, abserrloc = NAN, relerrloc = NAN, ulperrloc = NAN;
    double ref, relerr, abserr, ulperr;
    double maxabserr = 0, maxrelerr = 0, maxulperr = 0;
    argi = 0;
    do {
        arg = uint32_as_float (argi);
        ref = erfc ((double)arg);
        res = fast_erfcf (arg);
        reff = (float)ref;
        resi = float_as_uint32 (res);
        refi = float_as_uint32 (reff);
        ulperr = floatUlpErr (res, ref);
        if (ulperr > maxulperr) {
            ulperrloc = arg;
            maxulperr = ulperr;
        }
        abserr = fabs (res - ref);
        if (abserr > maxabserr) {
            abserrloc = arg;
            maxabserr = abserr;
        }
        if (fabs (ref) >= 0x1.0p-126) {
            relerr = fabs ((res - ref) / ref);
            if (relerr > maxrelerr) {
                relerrloc = arg;
                maxrelerr = relerr;
            }
        }
        diff = (resi > refi) ? (resi - refi) : (refi - resi);
        if (diff > 200) {
            printf ("diff=%u @ %15.8e : res=% 15.8e  ref=% 15.8e\n", 
                    diff, arg, res, ref);
            return EXIT_FAILURE;
        }
        argi++;
    } while (argi);
    printf ("max rel err = %.6e @ % 15.8e\n"
            "max abs err = %.6e @ % 15.8e\n"
            "max ulp err = %.6e @ % 15.8e\n",
            maxrelerr, relerrloc, 
            maxabserr, abserrloc,
            maxulperr, ulperrloc);
    return EXIT_SUCCESS;
}