
Vectorization Latency & Bandwidth for AVX-512 on Intel Knights Landing

April 3, 2017

The Intel Xeon Phi Knights Landing (KNL) processor supports the 512-bit Advanced Vector Extensions (AVX-512) instruction set, along with several of its extensions. Only AVX-512F (Foundation) is a base requirement of the AVX-512 ISA.

Intel KNL Supported AVX-512 Extensions

Extension     Description
AVX-512F      Foundation
AVX-512CD     Conflict Detection
AVX-512PF     Prefetch
AVX-512ER     Exponential and Reciprocal

A detailed understanding of the latency and bandwidth of different instructions and operations is crucial to writing performant code for the Intel Xeon Phi Knights Landing processor.

Instruction Latency and Bandwidth Tables: Vector vs. Scalar Instructions

Vector Instructions

Instruction                    Latency (cycles)    Bandwidth (per cycle)
Simple Int                     2                   2
FMA                            6                   2
Mask Ops                       2                   2
X87 / MMX                      6                   1
EMU (AVX-512ER)                7                   0.5
Shuffle / Permute (1 src)      2                   1
Shuffle / Permute (2 src)      3                   0.5
Convert (same width)           2                   1
Convert (different width)      6                   0.2
Vector Loads                   5                   2
Store-to-load forwarding       2                   2
Gather (8 elements)            15                  0.2
Gather (16 elements)           19                  0.1
Float-to-Int Move              2                   1
Int-to-Float Move              4                   1
DIVSS / SQRTSS                 25                  0.05
DIVSD / SQRTSD                 40                  0.03
Packed DIV / SQRT              38                  0.1

Scalar Instructions

Instruction                    Latency (cycles)    Bandwidth (per cycle)
Math                           1                   2
Int Multiply                   3 or 5              1
Store-to-load forwarding       2                   1
Integer Loads                  4                   1
Integer Division               varies              0.05

Scalar Versus Vector Code Performance - Kernel Sizes

Note that vectorized code is not always faster than scalar code: given the latency and bandwidth figures above, the winner depends on the size of the kernel being vectorized. As a general guideline, loops with more than 16 iterations tend to run faster as vector code than as scalar code.

Example Operations - Operation Costs and Comparisons

for (i = 0; i < N; i++) { sum += a[ind[i]*K + b[i]]; }

Instructions Present: Gather and Horizontal Reduction Operations -- 2x Load, Gather/Load, Horizontal Reduction & Sum

Vector Cost: 19*ceiling(N/8) + 30

Scalar Cost: 5*N

Analysis: Scalar is better for N ≤ 13; vector wins from N = 14 (19*2 + 30 = 68 versus 5*14 = 70)

for (i = 0; i < N; i++) { c[ind[i]] = a[i] / b[i]; }

Instructions Present: Scatter and Division Operations -- 3x Load, Scatter/Store, Division

Vector Cost: 38*ceiling(N/8)

Scalar Cost: 19*N

Analysis: Scalar is better only for N = 1 (19 versus 38); the two tie at N = 2, and vector wins for N ≥ 3

for (i = 0; i < N; i++) { b[ind[i]] = a[ind[i]]; }

Instructions Present: Gather and Scatter Operations -- 1x Load, Gather/Load, Scatter/Store

Vector Cost: 36*ceiling(N/8)

Scalar Cost: 3*N

Analysis: Scalar code is always cheaper regardless of the iteration count N, since 36*ceiling(N/8) is at least 4.5*N, which always exceeds the scalar cost of 3*N
