
Vectorization Latency & Bandwidth for AVX-512 for Intel Knights Landing
The Intel Xeon Phi Knights Landing (KNL) processor supports the 512-bit Advanced Vector Extensions (AVX-512) instruction set, along with several AVX-512 extensions. Only AVX-512F (Foundation) is a base requirement of the AVX-512 ISA; the other extensions listed here are additions that KNL implements.
Intel KNL Supported AVX-512 Extensions
Extension | Description |
---|---|
AVX-512F | Foundation |
AVX-512CD | Conflict Detection |
AVX-512PF | Prefetch |
AVX-512ER | Exponents and Reciprocals |
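The four extensions in the table can be checked at run time through CPUID leaf 7 (sub-leaf 0), where each one has a feature bit in EBX. The following is a minimal sketch, not a complete detection routine: the helper names `has_knl_avx512` and `read_cpuid7_ebx` are hypothetical, and the code assumes a GCC or Clang toolchain recent enough to provide `__get_cpuid_count` in `<cpuid.h>`.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */
#endif

/* Feature bits in CPUID leaf 7, sub-leaf 0, register EBX. */
#define AVX512F_BIT  (UINT32_C(1) << 16)  /* Foundation */
#define AVX512PF_BIT (UINT32_C(1) << 26)  /* Prefetch */
#define AVX512ER_BIT (UINT32_C(1) << 27)  /* Exponents and Reciprocals */
#define AVX512CD_BIT (UINT32_C(1) << 28)  /* Conflict Detection */

/* Hypothetical helper: returns 1 only if all four KNL extension bits
   are set in the given EBX value. */
int has_knl_avx512(uint32_t ebx)
{
    const uint32_t need = AVX512F_BIT | AVX512PF_BIT |
                          AVX512ER_BIT | AVX512CD_BIT;
    return (ebx & need) == need;
}

/* Hypothetical helper: reads leaf 7 EBX, or returns 0 when the leaf
   is unavailable or the target is not x86. */
uint32_t read_cpuid7_ebx(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return (uint32_t)ebx;
#endif
    return 0;
}
```

Calling `has_knl_avx512(read_cpuid7_ebx())` reports whether the CPU advertises all four extensions. This sketch only inspects the CPUID bits; production code would also confirm that the operating system has enabled the 512-bit register state before executing AVX-512 instructions.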
Writing high-performance code for the Intel Xeon Phi Knights Landing processor requires a detailed understanding of the latency and bandwidth of the instructions and operations involved.
Instruction Latency Tables for Vector Vs. Scalar Instructions
Vector Instructions
Instruction | Latency (cycles) | Bandwidth (instructions per cycle) |
---|---|---|
Simple Int | 2 | 2 |
FMA Vectorizations | 6 | 2 |
Mask Ops | 2 | 2 |
X87 / MMX | 6 | 1 |
EMU (AVX-512ER) | 7 | 0.5 |
Shuffle / Permutes (1 src) | 2 | 1 |
Shuffle / Permutes (2 src) | 3 | 0.5 |
Convert - Same Width | 2 | 1 |
Convert - Different Width | 6 | 0.2 |
Vector Loads | 5 | 2 |
Store to load forwarding | 2 | 2 |
Gather (8 elems) | 15 | 0.2 |
Gather (16 elems) | 19 | 0.1 |
Float to Int move | 2 | 1 |
Int to Float Move | 4 | 1 |
DIVSS or SQRTSS | 25 | 0.05 |
DIVSD or SQRTSD | 40 | 0.03 |
Packed DIV or SQRT | 38 | 0.1 |
Scalar Instructions
Instruction | Latency (cycles) | Bandwidth (instructions per cycle) |
---|---|---|
Math | 1 | 2 |
Int Multiply | 3 or 5 | 1 |
Store to load forwarding | 2 | 1 |
Integer Loads | 4 | 1 |
Integer Division | Varies | 0.05 |
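To see how the two columns interact, it helps to estimate the cycle count of a short instruction sequence. The sketch below assumes that Latency is measured in cycles and that Bandwidth is the number of instructions that can start per cycle; both function names are hypothetical, and the model ignores everything else in the pipeline.

```c
#include <assert.h>

/* Cycles for n instructions where each one consumes the result of the
   previous one, so no overlap is possible. */
double dependent_chain_cycles(double latency, int n)
{
    assert(n >= 0);
    return latency * n;
}

/* Cycles for n independent instructions: the first result arrives after
   `latency` cycles, then one more result every 1/per_cycle cycles. */
double independent_cycles(double latency, double per_cycle, int n)
{
    assert(n >= 0 && per_cycle > 0.0);
    if (n == 0)
        return 0.0;
    return latency + (n - 1) / per_cycle;
}
```

Using the FMA row (latency 6, bandwidth 2), sixteen dependent FMAs cost about 96 cycles under this model, while sixteen independent FMAs cost about 13.5. The gap is why breaking dependency chains matters as much as the raw instruction choice.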
Scalar Versus Vector Code Performance - Kernel Sizes
Vectorized code is not always faster than scalar code: vector instructions generally have higher latency and often lower bandwidth, so the benefit depends on the size of the kernel being vectorized. As a general guideline, kernels or loops with more than 16 iterations run faster as vector code than as scalar code.
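One way to act on this guideline is to dispatch on the trip count at run time. The following is a minimal sketch under stated assumptions: the threshold of 16 comes from the guideline, the lane count of 8 assumes 64-bit elements in a 512-bit register, and the function names are hypothetical. Whether the compiler actually emits AVX-512 for the lane loop depends on the compiler and its target options.

```c
#include <stddef.h>
#include <stdint.h>

#define VECTOR_MIN_TRIPS 16  /* threshold taken from the guideline */
#define LANES 8              /* assumed: 8 x 64-bit lanes per 512-bit register */

/* Plain scalar reduction: one add per element, no setup cost. */
static int64_t sum_scalar(const int64_t *a, size_t n)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Vector-friendly reduction: LANES independent partial sums, followed
   by a horizontal reduction and a scalar loop for leftover elements. */
static int64_t sum_lanes(const int64_t *a, size_t n)
{
    int64_t part[LANES] = {0};
    size_t i = 0;
    for (; i + LANES <= n; i += LANES)
        for (size_t l = 0; l < LANES; l++)
            part[l] += a[i + l];

    int64_t sum = 0;
    for (size_t l = 0; l < LANES; l++)   /* horizontal reduction */
        sum += part[l];
    for (; i < n; i++)                   /* leftover elements */
        sum += a[i];
    return sum;
}

/* Hypothetical dispatcher: short loops stay on the scalar path. */
int64_t sum_dispatch(const int64_t *a, size_t n)
{
    return (n > VECTOR_MIN_TRIPS) ? sum_lanes(a, n) : sum_scalar(a, n);
}
```

Both paths return the same result; only the amount of fixed overhead differs. The horizontal reduction and remainder loop are costs the vector path pays regardless of N, which is why short loops tend to favor the scalar path.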
Example Operations - Operation Costs and Comparisons
for (i=0; i<N; i++) { sum += a[ind[i]]*K + b[i]; } |
Instructions Present: Gather, Horizontal Reduction Operations -- 2x Load, Gather/Load, Horizontal Reduction & Sum
Vector Cost: 19*ceiling(N/8)+30
Scalar Cost: 5*N
Analysis: Scalar is better for small kernels, up to about N = 13
for (i=0; i<N; i++) { c[ind[i]] = a[i] / b[i]; } |
Instructions Present: Scatter and Division Operations -- 3x Load, Scatter/Store, Division
Vector Cost: 38*ceiling(N/8)
Scalar Cost: 19*N
Analysis: Scalar is better only for N = 1; the costs tie at N = 2 and vector is better beyond that
for (i=0; i<N; i++) { b[ind[i]] = a[ind[i]]; } |
Instructions Present: Gather and Scatter Operations -- 1x Load, Scatter/Store, Gather/Load
Vector Cost: 36*ceiling(N/8)
Scalar Cost: 3*N
Analysis: Scalar code is always faster, regardless of the iteration count N
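The cost expressions for the three examples can be evaluated directly to see where scalar code wins. This is a sketch of the arithmetic only, with hypothetical function names. For the first example it assumes that 19*ceiling(N/8)+30 is the vector cost, since it contains the 16-element gather latency and the per-8-element ceiling, and that 5*N is the scalar cost.

```c
#include <assert.h>

/* ceiling(n / 8): number of 8-element vector iterations needed. */
static int vec_iters(int n)
{
    assert(n >= 0);
    return (n + 7) / 8;
}

/* Example 1: gather plus horizontal reduction (labels assumed, see text). */
int ex1_scalar(int n) { return 5 * n; }
int ex1_vector(int n) { return 19 * vec_iters(n) + 30; }

/* Example 2: scatter plus division. */
int ex2_scalar(int n) { return 19 * n; }
int ex2_vector(int n) { return 38 * vec_iters(n); }

/* Example 3: gather plus scatter. */
int ex3_scalar(int n) { return 3 * n; }
int ex3_vector(int n) { return 36 * vec_iters(n); }

typedef int (*cost_fn)(int);

/* Counts the trip counts in 1..max_n for which the scalar cost is
   strictly lower than the vector cost. */
int scalar_wins(cost_fn scalar, cost_fn vector, int max_n)
{
    int wins = 0;
    for (int n = 1; n <= max_n; n++)
        if (scalar(n) < vector(n))
            wins++;
    return wins;
}
```

Under these formulas, scalar code is cheaper in the first example for N up to 13 (and once more at N = 17, just after a new vector iteration starts), in the second example only at N = 1, and in the third example for every N. The step-like behaviour comes from the ceiling term: each additional vector iteration adds its full cost even when only one element is left over.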