
Vectorization Latency & Bandwidth for AVX-512 on Intel Knights Landing

April 3, 2017

The Intel Xeon Phi Knights Landing (KNL) processor supports the 512-bit Advanced Vector Extensions (AVX-512) instruction set, along with several of its extensions. Only AVX-512F (Foundation) is a base requirement of the AVX-512 ISA.

Intel KNL Supported AVX-512 Extensions

Extension     Description
AVX-512F      Foundation
AVX-512CD     Conflict Detection
AVX-512PF     Prefetch
AVX-512ER     Exponential and Reciprocal

A detailed understanding of the latency and bandwidth of different instructions and operations is crucial to writing performant code for the Intel Xeon Phi Knights Landing processor.

Instruction Latency and Bandwidth Tables: Vector vs. Scalar Instructions

Vector Instructions

Instruction                    Latency (cycles)    Bandwidth (per cycle)
Simple Int                     2                   2
FMA                            6                   2
Mask Ops                       2                   2
X87 / MMX                      6                   1
EMU (AVX-512ER)                7                   0.5
Shuffle / Permute (1 src)      2                   1
Shuffle / Permute (2 src)      3                   0.5
Convert (same width)           2                   1
Convert (different width)      6                   0.2
Vector Loads                   5                   2
Store-to-load forwarding       2                   2
Gather (8 elements)            15                  0.2
Gather (16 elements)           19                  0.1
Float-to-Int Move              2                   1
Int-to-Float Move              4                   1
DIVSS / SQRTSS                 25                  0.05
DIVSD / SQRTSD                 40                  0.03
Packed DIV / SQRT              38                  0.1

Scalar Instructions

Instruction                    Latency (cycles)    Bandwidth (per cycle)
Math                           1                   2
Int Multiply                   3 or 5              1
Store-to-load forwarding       2                   1
Integer Loads                  4                   1
Integer Division               varies              0.05

Scalar Versus Vector Code Performance - Kernel Sizes

Note that vectorized code is not always faster than scalar code: given the latency and bandwidth figures above, the winner depends on the size of the kernel being vectorized. As a general guideline, loops with more than 16 iterations tend to run faster as vector code than as scalar code.

Example Operations - Operation Costs and Comparisons

for (i = 0; i < N; i++) { sum += a[ind[i]*K + b[i]]; }

Instructions Present: Gather and Horizontal Reduction Operations -- 2x Load, Gather/Load, Horizontal Reduction & Sum

Vector Cost: 19*ceiling(N/8) + 30

Scalar Cost: 5*N

Analysis: Scalar is better for N ≤ 13; vector wins from N = 14 (19*2 + 30 = 68 versus 5*14 = 70)

for (i = 0; i < N; i++) { c[ind[i]] = a[i] / b[i]; }

Instructions Present: Scatter and Division Operations -- 3x Load, Scatter/Store, Division

Vector Cost: 38*ceiling(N/8)

Scalar Cost: 19*N

Analysis: Scalar is better only for N = 1 (19 versus 38); the two tie at N = 2, and vector wins for N ≥ 3

for (i = 0; i < N; i++) { b[ind[i]] = a[ind[i]]; }

Instructions Present: Gather and Scatter Operations -- 1x Load, Gather/Load, Scatter/Store

Vector Cost: 36*ceiling(N/8)

Scalar Cost: 3*N

Analysis: Scalar code is always cheaper regardless of the iteration count N, since 36*ceiling(N/8) is at least 4.5*N, which always exceeds the scalar cost of 3*N
