Use four FMA chains to sum on 128-bit SIMD targets with FP64 support. On x86 this showed about a 1.4x improvement. For PPC, do a full multiply (32x32 -> 64-bit), convert to double precision, then accumulate. This may be slightly less precise for some inputs, but it is about 1.5x faster than the FMA-chain approach above, which is itself about 1.5x faster than the previous code, for a ~2.5x overall speedup.
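The two strategies described in the commit message can be illustrated with a minimal scalar sketch. It assumes the operation is a dot product of two 32-bit integer arrays accumulated in double precision; the function names `dot_fma4` and `dot_widen` are hypothetical and are not part of OpenCV, and the real change uses SIMD intrinsics rather than scalar code.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not OpenCV API): four independent FMA chains.
// Independent accumulators let the CPU overlap the FMA latencies
// instead of waiting on a single dependency chain.
double dot_fma4(const std::int32_t* a, const std::int32_t* b, std::size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        s0 = std::fma(double(a[i]),     double(b[i]),     s0);
        s1 = std::fma(double(a[i + 1]), double(b[i + 1]), s1);
        s2 = std::fma(double(a[i + 2]), double(b[i + 2]), s2);
        s3 = std::fma(double(a[i + 3]), double(b[i + 3]), s3);
    }
    for (; i < n; ++i)  // leftover elements
        s0 = std::fma(double(a[i]), double(b[i]), s0);
    return (s0 + s1) + (s2 + s3);
}

// Hypothetical helper (not OpenCV API): widening 32x32 -> 64-bit multiply,
// then convert the product to double and accumulate.
// Products larger than 2^53 are rounded during the conversion, before the
// add, which is one way this variant can be slightly less precise.
double dot_widen(const std::int32_t* a, const std::int32_t* b, std::size_t n)
{
    double s0 = 0.0, s1 = 0.0;
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2)
    {
        s0 += double(std::int64_t(a[i])     * b[i]);
        s1 += double(std::int64_t(a[i + 1]) * b[i + 1]);
    }
    for (; i < n; ++i)  // leftover element
        s0 += double(std::int64_t(a[i]) * b[i]);
    return s0 + s1;
}
```

For small inputs both variants return the same value; they can differ only in rounding when individual products or partial sums exceed what a double represents exactly.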
| Name | | |
|---|---|---|
| 3rdparty/SoftFloat | ||
| doc | ||
| include/opencv2 | ||
| misc | ||
| perf | ||
| src | ||
| test | ||
| CMakeLists.txt | ||