Two weeks ago I wrote about Parallel Similarity Transformation, a useful iterative method for finding the maximum eigen value of a positive square matrix, along with the respective eigen vector, such that the following equality is satisfied. I used SYCL DPC++ for implementing the kernels, but over the last few weeks I discovered ways to improve upon my previous work. So today I decided to write about how I optimized the similarity transformation kernels, and to present updated benchmark results.

Av = λv
A = positive square matrix
λ = (max) eigen value
v = (respective) eigen vector

I'll pick each kernel required for computing the maximum eigen value of a positive square matrix and figure out, step by step, how it can be improved. The following five kernels are required for performing parallel similarity transformation.

- kernelSumAcrossRows
- kernelMaxInVector
- kernelComputeEigenVector
- kernelSimilarityTransform
- kernelStopCriteria
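Before diving into each kernel, here's how the five steps compose over one iteration of the method. This is a minimal sequential sketch in plain C++, not the SYCL implementation; the function name, tolerance and round limit are illustrative assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Sequential sketch of one full run of the method. The five commented
// stages correspond to the five kernels listed above; eps and
// max_rounds are illustrative choices.
double similarity_transform(std::vector<std::vector<double>> A,
                            std::vector<double> &eigen_vec,
                            double eps = 1e-5, int max_rounds = 100) {
  const size_t N = A.size();
  eigen_vec.assign(N, 1.0);
  std::vector<double> row_sum(N);
  for (int round = 0; round < max_rounds; round++) {
    // kernelSumAcrossRows: row_sum[i] = sum of row i
    for (size_t i = 0; i < N; i++) {
      row_sum[i] = 0.0;
      for (size_t j = 0; j < N; j++) row_sum[i] += A[i][j];
    }
    // kernelMaxInVector: max of row_sum ( entries are positive )
    double max_val = 0.0;
    for (size_t i = 0; i < N; i++) max_val = std::max(max_val, row_sum[i]);
    // kernelComputeEigenVector: accumulate per-iteration scaling
    for (size_t i = 0; i < N; i++) eigen_vec[i] *= row_sum[i] / max_val;
    // kernelStopCriteria: consecutive row sums within eps of each other ?
    bool stop = true;
    for (size_t i = 0; i < N; i++)
      stop = stop && std::fabs(row_sum[i] - row_sum[(i + 1) % N]) < eps;
    if (stop) break;
    // kernelSimilarityTransform: A <- D^-1 * A * D, with D = diag(row_sum)
    for (size_t i = 0; i < N; i++)
      for (size_t j = 0; j < N; j++) A[i][j] *= row_sum[j] / row_sum[i];
  }
  // on convergence every row sum approximates the max eigen value
  return row_sum[0];
}
```

On convergence all row sums agree to within *eps*, so any cell of the row-sum vector can be returned as the eigen value approximation.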

I'll start with the first one. All this kernel should do is compute the sum of the values stored along each row of the matrix. A_{N x N} being a square matrix, the kernel outputs an N-element vector whose i^{th} cell holds the sum of the i^{th} row of the matrix. I want to perform N-many parallel summations --- as simple as that. With this requirement in mind, I'll write the following kernel.

// @note this is pseudocode !
// kernelSumAcrossRows_v0
//
// check https://github.com/itzmeanjan/eigen_value/blob/e584beda3b05759673bce5af7ebed95fd262dff2/benchmarks/benchmark_similarity_transform.cpp#L39-L53
float *mat = ...; // matrix I'm operating on, of dimension N x N
float *vec = ...; // row sum result vector of length N
N = 1 << 10;
B = 1 << 5;
q.parallel_for(nd_range<2>{{N, N}, {1, B}}, [=](nd_item<2> it) {
  const size_t r = it.get_global_id(0);
  const size_t c = it.get_global_id(1);
  float cell = *(mat + r * N + c); // value stored in this cell
  atomic_add(*(vec + r), cell);
});

A high-level look at this kernel reveals it's not a good idea to atomically compute the sum of all elements along each row of A_{N x N}. The N memory locations of the result vector are operated on atomically to avoid any chance of a data race, and each cell of the result vector is accessed N times by its N designated work-items. This results in N^{2} global memory accesses, which is expensive. I'd like to reduce global memory access as much as possible. One simple way to do this, in the current context, is to use the reduction function of SYCL DPC++, which computes the sum of the values held by all work-items of a subgroup. This subgroup method maps well to the underlying SIMD instructions, which reduces communication cost. Finally, only the subgroup leader atomically updates the sum in the designated row of the result vector, living in global memory, reducing global memory accesses to N^{2} / B, where B = subgroup width. I'll express this with the following kernel.

// @note this is pseudocode
// kernelSumAcrossRows_v1
//
// check https://github.com/itzmeanjan/eigen_value/blob/e584beda3b05759673bce5af7ebed95fd262dff2/benchmarks/benchmark_similarity_transform.cpp#L86-L106
q.parallel_for(nd_range<2>{{N, N}, {1, B}}, [=](nd_item<2> it) {
  const size_t r = it.get_global_id(0);
  const size_t c = it.get_global_id(1);
  float cell = *(mat + r * N + c);
  // reduction method called over subgroup work-items
  //
  // should result in better SIMD usage
  float subgroup_sum = reduce_sum(cell);
  // only the subgroup leader accesses global memory, one time !
  if (leader()) {
    atomic_add(*(vec + r), subgroup_sum);
  }
});

This kernel is indeed better than the previous row sum kernel, where N^{2} global memory accesses were performed. But I see one more way to attempt further optimization. I want to make use of faster local memory, so that each subgroup leader atomically updates its work-group's partial sum, stored in designated local memory. It's possible to have multiple subgroups in a work-group, so this reduces global memory access by another factor: subgroup leaders access only local memory, updating the work-group's row sum with their own subgroup-level partial sums. Finally, after all work-items of the work-group are done, only the work-group leader accesses the designated cell of global memory to atomically update the respective row sum. With this update the kernel accesses global memory only N^{2} / B times, where B = work-group size, which should preferably be greater than the subgroup size. My final update to the matrix row sum kernel looks like below.

// @note this is pseudocode
// kernelSumAcrossRows_v2
//
// check https://github.com/itzmeanjan/eigen_value/blob/d27f64c27a7a38c7b1038a49563ec4ce47e9ac19/similarity_transform.cpp#L90-L135
float *local_memory = ...; // local-data-share of size 1
q.parallel_for(nd_range<2>{{N, N}, {1, B}}, [=](nd_item<2> it) {
  if (leader()) {
    // initialised by work-group leader
    //
    // if not done, it might prove to be costly
    *local_memory = 0.f;
  }
  // only after all work-items of the work-group
  // reach this point, move forward !
  barrier();
  const size_t r = it.get_global_id(0);
  const size_t c = it.get_global_id(1);
  float cell = *(mat + r * N + c);
  // reduction method called over subgroup work-items
  //
  // should result in better SIMD usage
  float subgroup_sum = reduce_sum(cell);
  // only the subgroup leader updates the
  // subgroup level row sum in work-group
  // local memory
  if (leader()) {
    atomic_add(*local_memory, subgroup_sum);
  }
  // only after all work-items of the work-group
  // reach this point, move forward !
  barrier();
  // only work-group leader accesses global memory, one time !
  if (leader()) {
    atomic_add(*(vec + r), *local_memory);
  }
});

Now it's a good time to run these three kernels on both CPU and GPU, with various matrix dimensions, to see how they perform. The following section shows the result of running the three variants of the row sum kernel on CPU.

running on Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
[kernel] Sum Across Rows of Matrix (v0)
128 x 128 2.688 ms
256 x 256 0.602 ms
512 x 512 1.294 ms
1024 x 1024 5.474 ms
2048 x 2048 20.816 ms
4096 x 4096 79.719 ms
8192 x 8192 315.3 ms
[kernel] Sum Across Rows of Matrix (v1)
128 x 128 1.235 ms
256 x 256 0.219 ms
512 x 512 0.363 ms
1024 x 1024 1.04 ms
2048 x 2048 3.746 ms
4096 x 4096 14.896 ms
8192 x 8192 59.13 ms
[kernel] Sum Across Rows of Matrix (v2)
128 x 128 0.23 ms
256 x 256 0.431 ms
512 x 512 0.683 ms
1024 x 1024 2.12 ms
2048 x 2048 8.167 ms
4096 x 4096 32.721 ms
8192 x 8192 126.747 ms

I see that for a random square matrix of dimension N x N, the second variant of the row summation kernel performs best. I expected the third variant to perform at least as well as the second one, if not better, though it seems the second one is ~2x faster than the last kernel. I compiled these kernels ahead of time, so that runtime kernel compilation time is saved and launch on the accelerator is faster. As I have access to a GPU, I've also run them there, and following are the benchmark results I obtained.

running on Intel(R) Iris(R) Xe MAX Graphics [0x4905]
[kernel] Sum Across Rows of Matrix (v0)
128 x 128 7.844 ms
256 x 256 47.783 ms
512 x 512 298.784 ms
1024 x 1024 1652.19 ms
2048 x 2048 7182.5 ms
4096 x 4096 35201.8 ms
8192 x 8192 191747 ms
[kernel] Sum Across Rows of Matrix (v1)
128 x 128 0.536 ms
256 x 256 0.531 ms
512 x 512 1.395 ms
1024 x 1024 5.726 ms
2048 x 2048 22.359 ms
4096 x 4096 101.141 ms
8192 x 8192 502.316 ms
[kernel] Sum Across Rows of Matrix (v2)
128 x 128 0.617 ms
256 x 256 0.624 ms
512 x 512 1.378 ms
1024 x 1024 3.934 ms
2048 x 2048 14.186 ms
4096 x 4096 54.654 ms
8192 x 8192 220.202 ms

The naive implementation, i.e. the first variant, performs much worse on GPU, but that's what I expected: with all work-items attempting to update global memory atomically, contention should be high. I see that this time the third variant shines ! As I understand it, the GPU has dedicated local memory, which provides faster access than global memory. That's why the third variant, where I make use of local memory to reduce global memory access and only work-group leaders access global memory one time to atomically update the total row sum, performs better than the others --- almost 2x faster than the second variant, its closest competitor.

Now I'd like to spend some time explaining how I'd write the second kernel, i.e. kernelMaxInVector. The business logic of this kernel is simply to find the maximum value in a given vector of size N. A naive way that comes to mind first is using an atomic maximum finder, where each work-item ( representing each cell of the vector ) atomically compares its value with some known global memory location, where the maximum value is finally supposed to be stored. As my interest is in finding the maximum value of the vector, and all values are positive, I can initialize the result memory location with 0. I'll write the following kernel, depicting my naive attempt.

// kernelMaxInVector_v0
//
// check https://github.com/itzmeanjan/eigen_value/blob/cbbe178cc36bfe3b3e4a7e9d38d311145a9ff742/benchmarks/benchmark_similarity_transform.cpp#L164-L175
float *vec = ...; // source vector of length N
float *max = ...; // result memory location
*max = 0.f; // zero initialise
N = 1 << 10;
B = 1 << 5;
q.parallel_for(nd_range<1>{N, B}, [=](nd_item<1> it) {
  const size_t r = it.get_global_id(0);
  float cell = *(vec + r);
  // each work-item attempts to atomically compare
  // its value with the designated result memory location
  //
  // only if this work-item's value is larger than
  // what's currently stored in the result, will the result
  // be updated with this work-item's value
  atomic_max(*max, cell);
});

The next step that comes to mind for improving the max-in-vector kernel is using the SYCL DPC++ reduction function for finding the maximum value among all work-items of a subgroup. This results in better mapping to the underlying SIMD instructions. Then only the subgroup leader attempts to compare the subgroup's maximum value with the result in the global memory location, which naturally reduces global memory access. In the previous kernel, all work-items had to compare their values with the target in global memory, resulting in N global memory accesses. With this improvement, global memory accesses are reduced to N / B, where N = vector size, B = subgroup size.

Following is my attempt in writing this improved variant of *kernelMaxInVector*.

// kernelMaxInVector_v1
//
// check https://github.com/itzmeanjan/eigen_value/blob/cbbe178cc36bfe3b3e4a7e9d38d311145a9ff742/benchmarks/benchmark_similarity_transform.cpp#L208-L226
q.parallel_for(nd_range<1>{N, B}, [=](nd_item<1> it) {
  const size_t r = it.get_global_id(0);
  float cell = *(vec + r);
  // subgroup level maximum value
  // computed using SYCL DPC++ reduction function
  float sub_max = reduce_max(cell);
  // only the subgroup leader attempts to
  // compare subgroup's maximum value with global memory
  //
  // updates same only if atomic comparison says target
  // global memory has smaller value compared to
  // subgroup's maximum value
  if (leader()) {
    atomic_max(*max, sub_max);
  }
});

As I understand it, the second variant of the max-in-vector kernel should perform better than the first, but I've another idea which I'd like to implement and benchmark. In the quest to reduce global memory access as much as possible, I bring in the idea of using local memory, where all subgroups of a work-group help in computing the maximum value of the work-group itself. And finally, when all work-items of that work-group are done ( read: have reached a synchronization point ), only the work-group leader accesses global memory and performs an atomic comparison to decide whether the result should be updated or not. This kernel reduces global memory accesses to N / B, where N = vector size, B = work-group size. Note, this technique will only prove fruitful if the work-group size is larger than the subgroup size; specifically, I'll set the work-group size such that it's evenly divisible by the subgroup size, which I'll specify using an attribute understood by the dpcpp compiler.

Following is the attribute required for setting subgroup size for some kernel.

// just an example of hardcoding subgroup size in kernel definition
//
// check https://github.com/itzmeanjan/eigen_value/blob/cbbe178cc36bfe3b3e4a7e9d38d311145a9ff742/similarity_transform.cpp#L161
q.parallel_for(nd_range<1>{N, B}, [=](nd_item<1> it) [[intel::reqd_sub_group_size(32)]] {
  // kernel body
});

My final attempt in improving vector max value finder kernel is as below.

// kernelMaxInVector_v2
//
// check https://github.com/itzmeanjan/eigen_value/blob/cbbe178cc36bfe3b3e4a7e9d38d311145a9ff742/similarity_transform.cpp#L159-L203
float *local_memory = ...; // local-data-share of size 1
q.parallel_for(nd_range<1>{N, B}, [=](nd_item<1> it) {
  // only work-group leader executes this block
  if (leader()) {
    // explicit initialisation of local memory
    //
    // otherwise result might be wrong
    // if we rely on default initialised value
    //
    // notice I can safely initialise with 0.f
    // because I'm working with a positive matrix
    *local_memory = 0.f;
  }
  // only after all work-items of the work-group
  // reach this point, move forward !
  barrier();
  const size_t r = it.get_global_id(0);
  float cell = *(vec + r);
  float sub_max = reduce_max(cell);
  // only the subgroup leader updates the
  // subgroup level maximum in work-group
  // local memory
  if (leader()) {
    atomic_max(*local_memory, sub_max);
  }
  // only after all work-items of the work-group
  // reach this point, move forward !
  barrier();
  // only work-group leader accesses global memory, one time !
  if (leader()) {
    atomic_max(*max, *local_memory);
  }
});

It's time to benchmark the three implementations for finding the maximum value in a vector. AOT compiled kernels on CPU perform as below.

running on Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
[kernel] Max Value in Vector (v0)
65536 0.153 ms
131072 0.207 ms
262144 0.319 ms
524288 0.54 ms
1048576 0.981 ms
2097152 1.853 ms
4194304 3.622 ms
8388608 7.379 ms
16777216 14.867 ms
33554432 28.477 ms
[kernel] Max Value in Vector (v1)
65536 0.149 ms
131072 0.197 ms
262144 0.303 ms
524288 0.509 ms
1048576 0.928 ms
2097152 1.756 ms
4194304 3.421 ms
8388608 7.159 ms
16777216 14.027 ms
33554432 27.881 ms
[kernel] Max Value in Vector (v2)
65536 0.217 ms
131072 0.328 ms
262144 0.506 ms
524288 0.873 ms
1048576 1.439 ms
2097152 2.78 ms
4194304 5.298 ms
8388608 11.136 ms
16777216 22.701 ms
33554432 44.979 ms

I notice the first variant, where all work-items atomically compare their own value with the global memory target, doesn't perform that badly on CPU. That may be because the CPU has better atomic operation support. The final variant, which I expected to do better, is probably suffering from the fact that on CPU there's no explicit local memory. Making subgroup leaders write to local memory first, to find the work-group's local maximum, and then having only the work-group leader contribute to the vector's maximum by accessing/ comparing global memory --- incurs more cost than benefit, on CPU.

In the table below, I present the benchmark results I obtained after running the same kernels on GPU. As the GPU is less capable of executing atomic operations than the CPU, I see the first variant performing much worse on GPU. In the second variant, a subgroup's work-items find the maximum value using the SYCL DPC++ reduction function, which performs a lot better, as I expected. In this case contention should also be lower, as only subgroup leaders atomically access global memory. The final variant, where I introduced local memory to reduce global memory access by another factor, letting only work-group leaders atomically access global memory, performs almost the same as the second variant. Maybe the third technique doesn't bring much to the table, but it's another sound & efficient way to write this kernel.

running on Intel(R) Iris(R) Xe MAX Graphics [0x4905]
[kernel] Max Value in Vector (v0)
65536 0.588 ms
131072 0.758 ms
262144 1.096 ms
524288 1.832 ms
1048576 3.315 ms
2097152 6.373 ms
4194304 12.108 ms
8388608 23.492 ms
16777216 46.511 ms
33554432 91.242 ms
[kernel] Max Value in Vector (v1)
65536 0.459 ms
131072 0.609 ms
262144 0.815 ms
524288 1.2 ms
1048576 2.101 ms
2097152 3.853 ms
4194304 7.278 ms
8388608 13.92 ms
16777216 26.941 ms
33554432 53.695 ms
[kernel] Max Value in Vector (v2)
65536 0.475 ms
131072 0.613 ms
262144 0.834 ms
524288 1.222 ms
1048576 2.155 ms
2097152 3.893 ms
4194304 7.303 ms
8388608 13.961 ms
16777216 27.154 ms
33554432 53.736 ms

The next kernel in my list, kernelComputeEigenVector, is very simple to implement: each element of the result vector is computed using the following formula.

// sequential way of computing eigen vector
max_val = max(src_vec);
for i in 0..N {
  res_vec[i] *= src_vec[i] / max_val;
}

This problem is embarrassingly parallel, with no data dependency, so I'll write the following kernel for computing the result eigen vector, in parallel, from the row sum vector and the maximum value of the row sum vector.

// kernelComputeEigenVector_v0
//
// check https://github.com/itzmeanjan/eigen_value/blob/cbbe178cc36bfe3b3e4a7e9d38d311145a9ff742/benchmarks/benchmark_similarity_transform.cpp#L289-L294
float *row_sum_vec = ...; // row summations of matrix of dimension N x N, so length is N
float *max_val = ...; // max value in above row sum vector
float *res_vec = ...; // result vector of size N
N = 1 << 10;
B = 1 << 5;
q.parallel_for(nd_range<1>{N, B}, [=](nd_item<1> it) {
  const size_t r = it.get_global_id(0);
  *(res_vec + r) *= *(row_sum_vec + r) / *max_val;
});

Notice, the value stored at the *max_val* memory location lives in global memory, which is accessed N times during execution of this kernel. But this value is constant in this context, so I'd like to use some technique which helps me read it from global memory only once; after that, the read value is distributed among a set of work-items, not requiring each of them to read from global memory. This reduces global memory access, which contributes to making the whole compute pipeline faster. The set of work-items which reads from global memory only once is the subgroup. As subgroups map well to the underlying SIMD intrinsics, the resulting code running on the accelerator should perform better. My take at modifying this kernel is as simple as adding the *group_broadcast* function when reading from the memory location behind *max_val*.

// kernelComputeEigenVector_v1
//
// check https://github.com/itzmeanjan/eigen_value/blob/1d189ac2973d69ac330785bc74335c8101ff0bee/similarity_transform.cpp#L222-L236
q.parallel_for(nd_range<1>{N, B}, [=](nd_item<1> it) {
  const size_t r = it.get_global_id(0);
  // this should result in only one work-item of the subgroup
  // reading from the global memory location behind `max_val`
  *(res_vec + r) *= *(row_sum_vec + r) / group_broadcast(*max_val);
});

Benchmarking the last two kernels, used for computing the eigen vector, on CPU gives me the following result. It looks like both kernel variants perform almost the same on CPU, and my speculation that the second variant would result in better SIMD usage doesn't seem to hold.

running on Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
[kernel] Eigen Vector Computation (v0)
65536 1.057 ms
131072 0.122 ms
262144 0.169 ms
524288 0.236 ms
1048576 0.397 ms
2097152 0.71 ms
4194304 1.682 ms
8388608 4.186 ms
16777216 8.177 ms
33554432 16.149 ms
[kernel] Eigen Vector Computation (v1)
65536 0.124 ms
131072 0.134 ms
262144 0.167 ms
524288 0.262 ms
1048576 0.382 ms
2097152 0.828 ms
4194304 1.878 ms
8388608 4.164 ms
16777216 8.156 ms
33554432 16.293 ms

Benchmarking on GPU, I see a similar trend: both variants perform almost the same. I believe this is because either I'm using *group_broadcast* improperly or it's not working as expected.

running on Intel(R) Iris(R) Xe MAX Graphics [0x4905]
[kernel] Eigen Vector Computation (v0)
65536 0.096 ms
131072 0.107 ms
262144 0.109 ms
524288 0.153 ms
1048576 0.295 ms
2097152 0.437 ms
4194304 0.902 ms
8388608 1.754 ms
16777216 3.447 ms
33554432 6.827 ms
[kernel] Eigen Vector Computation (v1)
65536 0.1 ms
131072 0.11 ms
262144 0.112 ms
524288 0.18 ms
1048576 0.301 ms
2097152 0.496 ms
4194304 0.885 ms
8388608 1.733 ms
16777216 3.497 ms
33554432 6.904 ms

The next kernel I'll write is kernelSimilarityTransform, which is an important part of the compute pipeline of the whole similarity transformation method. This kernel computes the next iteration's matrix using the following technique. I've already found the row sum vector of the matrix, which I'll use in this kernel. A vector of length N forms a diagonal matrix, and its respective inverse matrix is trivial to compute.

// here N = 4
vec = [a, b, c, d]
D = diag([a, b, c, d]) // diagonal matrix
D^-1 = diag([1/a, 1/b, 1/c, 1/d]) // inverse of diagonal matrix

Next iteration's matrix is computed as below.

// M = current matrix of size N x N
// D = diagonal matrix formed with the row-sum vector
// D^-1 = inverse of above diagonal matrix
M' = D^-1 x M x D // matrix to be used in next iteration

This is a matrix multiplication problem, but with two of the three operands being diagonal matrices that are inverses of one another, the full product reduces to a per-cell scaling: cell (i, j) of D^-1 x M x D is simply M[i][j] * vec[j] / vec[i]. Writing a parallel implementation of this is pretty easy.

// kernelSimilarityTransform, or call it kernelComputeNextMatrix
//
// check https://github.com/itzmeanjan/eigen_value/blob/1d189ac2973d69ac330785bc74335c8101ff0bee/similarity_transform.cpp#L271-L293
float *mat = ...; // matrix of dimension N x N
float *vec = ...; // row sum vector of size N
float *loc_row = ...; // vector of length 1, in local memory
float *loc_col = ...; // vector of length B, in local memory
N = 1 << 10;
B = 1 << 5; // = work-group size
q.parallel_for(nd_range<2>{{N, N}, {1, B}}, [=](nd_item<2> it) {
  const size_t r = it.get_global_id(0);
  const size_t c = it.get_global_id(1);
  // 1D unique local identifier of work-item in work-group
  const size_t ll_id = it.get_local_linear_id(); // ∈ [0, B)
  // --- caching row-sum vector into local memory ---
  // only work-group leader does this
  if (leader()) {
    loc_row[0] = *(vec + r);
  }
  loc_col[ll_id] = *(vec + c);
  // wait until all work-items of work-group reach here
  barrier();
  // --- caching ends ---
  *(mat + r * N + c) *= (1.f / loc_row[0]) * loc_col[ll_id];
});
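The per-cell reduction this kernel relies on can be sanity checked against the full matrix product. The following plain C++ sketch ( a hypothetical helper ) forms D = diag(v) explicitly, computes D^-1 x M x D by ordinary matrix multiplication, and compares every cell with the shortcut M[i][j] * v[j] / v[i]:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Verifies (D^-1 * M * D)[i][j] == M[i][j] * v[j] / v[i], with
// D = diag(v), by doing the full multiplication the slow way.
bool check_similarity_identity(const std::vector<std::vector<double>> &M,
                               const std::vector<double> &v) {
  const size_t N = M.size();
  // T = M * D
  std::vector<std::vector<double>> T(N, std::vector<double>(N, 0.0));
  for (size_t i = 0; i < N; i++)
    for (size_t j = 0; j < N; j++)
      for (size_t k = 0; k < N; k++)
        T[i][j] += M[i][k] * ((k == j) ? v[k] : 0.0);
  // R = D^-1 * T
  std::vector<std::vector<double>> R(N, std::vector<double>(N, 0.0));
  for (size_t i = 0; i < N; i++)
    for (size_t j = 0; j < N; j++)
      for (size_t k = 0; k < N; k++)
        R[i][j] += ((k == i) ? 1.0 / v[k] : 0.0) * T[k][j];
  // compare with the per-cell shortcut used by the kernel
  for (size_t i = 0; i < N; i++)
    for (size_t j = 0; j < N; j++)
      if (std::fabs(R[i][j] - M[i][j] * v[j] / v[i]) > 1e-12) return false;
  return true;
}
```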

This kernel makes use of local memory to cache the diagonal matrix's ( read: row-sum vector's ) cells, reducing global memory access. Benchmarking this kernel on CPU and GPU compares as below. As GPUs possess explicit local memory, the benefit of using local memory is clearly visible in the GPU column of the following table.

[kernel] Next Matrix Computation
dim GPU CPU
128 x 128 0.059 ms 0.152 ms
256 x 256 0.064 ms 0.23 ms
512 x 512 0.092 ms 0.504 ms
1024 x 1024 0.192 ms 1.699 ms
2048 x 2048 0.594 ms 6.497 ms
4096 x 4096 2.412 ms 25.806 ms
8192 x 8192 9.962 ms 103.048 ms

The final kernel in the pipeline, i.e. kernelStopCriteria, examines the row-sum vector and decides whether we should go to the next iteration or not. When the iterative method converges to the result, all values in the row-sum vector should be very close, and the absolute difference between each pair of consecutive values should be less than some predefined *epsilon* value --- the error in floating point computation that I'm okay to live with. Let us take an example vector to demonstrate how I can write a parallel kernel for the same.

EPS = 1e-3
vec = [a, b, c, d]
// stop if following evaluates to True
abs(vec[0] - vec[1]) < EPS && abs(vec[1] - vec[2]) < EPS && abs(vec[2] - vec[3]) < EPS && abs(vec[3] - vec[0]) < EPS
// --- 👇 is a generic rewrite of above condition check ---
N = len(vec);
stop = True;
for i in 0..N {
  stop &= abs(vec[i] - vec[(i + 1) % N]) < EPS;
}
if (stop) // stop iteration
else      // compute next iteration's matrix
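The generic rewrite above, as a runnable plain C++ reference ( hypothetical helper name ):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sequential reference for kernelStopCriteria: converged once every
// pair of consecutive row sums ( wrapping around at the end of the
// vector ) differs by less than eps.
bool should_stop(const std::vector<float> &row_sum, float eps) {
  const size_t N = row_sum.size();
  bool stop = true;
  for (size_t i = 0; i < N; i++)
    stop = stop && (std::fabs(row_sum[i] - row_sum[(i + 1) % N]) < eps);
  return stop;
}
```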

For a vector of length N, I'll kick off N work-items, where the i^{th} work-item accesses the row-sum vector's i^{th} and (i+1)^{th} indices, for finding the absolute difference between a pair of consecutive cells. Notice, for the work-item with index *N-1*, attempting to access global memory at index *(i+1) = N* would read out of bounds. So each work-item should access indices *i* and *(i+1) % N* of the row-sum vector. Implementing such a kernel is as easy as below.

// kernelStopCriteria_v0
float EPS = 1e-3;
float *row_sum = ...; // row-sum vector of length N, in global memory
uint *stop = ...; // status in global memory, whether converged or not
*stop = 1U; // initialization is required
q.parallel_for(nd_range<1>{N, B}, [=](nd_item<1> it) {
  const size_t cur = it.get_global_id(0);
  const size_t nxt = (cur + 1) % N;
  float cur_v = *(row_sum + cur);
  float nxt_v = *(row_sum + nxt);
  float diff = abs(cur_v - nxt_v);
  bool flag = all_of(diff < EPS);
  // only work group leader
  if (leader()) {
    atomic_min(*stop, flag ? 1 : 0);
  }
});

This kernel is pretty simple, except for the *all_of* function, a SYCL group algorithm which evaluates a boolean predicate across all work-items of a work(/ sub)-group. The main problem I see with the above kernel is duplicate access of global memory ( i.e. the row-sum vector ): each cell of the row-sum vector is accessed twice. For reducing global memory access, I resort to a subgroup shuffling algorithm, so that within a subgroup all work-items read from global memory just once, but then communicate among themselves, sharing their register-cached values with other work-items in the same subgroup. I use one specific subgroup function --- *shuffle_down*, which helps me obtain the *((i + 1) % B)*^{th} work-item's value inside the *i*^{th} work-item, without explicitly accessing global memory. For the sake of better demonstration, let me take an example.

Say I've the following vector, which I want to shuffle down; then the assertion in the following pseudocode block must hold.

vec = [0, 1, 2, 3]
// this is executed in data parallel context
float received = shuffle_down(vec[i], 1)
// then following must hold !
[[parallel]] for i in 0..4 {
  assert(received == vec[(i+1) % 4])
}

With a subgroup size of 4, each of the 4 work-items starts with its initial value, and after a shuffle_down by 1 place, each work-item's local register holds the next work-item's value, with the last work-item wrapping around to the first. This is somewhat similar to what I want to attain. I say somewhat because this model doesn't fully fit my requirement when the vector size *N* itself is greater than the subgroup size *B*. Consider a scenario where the subgroup size is 4, but the vector being shuffled down is of length 8: *shuffle_down* wraps around at each subgroup boundary, while my expected result is one where index wrapping happens at the boundary of the whole vector under operation.
As subgroup shuffling by itself is not enough to satisfy the requirement of checking the stopping criteria in parallel, I fall back to a conditional read from global memory at subgroup boundaries, though this results in duplicate reads from global memory there. The rest of the flow stays the same.

// check https://github.com/itzmeanjan/eigen_value/blob/1d189ac2973d69ac330785bc74335c8101ff0bee/similarity_transform.cpp#L324-L412
// for full implementation of stopping criteria checker kernel

I can introduce local memory here, following the same rationale as in some of the previous kernels. Instead of letting all subgroup leaders atomically access global memory to update the status of whether the convergence criteria has been satisfied in their subgroup, I let only work-group leaders do that part. In this updated setup, subgroup leaders update a flag in local memory, which records whether the convergence criteria has been reached in the work-group. And when all work-items of the work-group reach a barrier ( synchronization point ), the work-group leader updates the work-group's result in global memory. The benchmark of this final variant of the convergence criteria checker kernel shows much better performance on GPU.

[kernel] Stop ( Convergence ) Criteria Checker
dim CPU GPU
65536 0.707 ms 0.676 ms
131072 0.85 ms 0.634 ms
262144 1.435 ms 0.84 ms
524288 2.615 ms 1.248 ms
1048576 5.072 ms 2.246 ms
2097152 9.812 ms 3.964 ms
4194304 19.469 ms 7.339 ms
8388608 38.686 ms 13.655 ms
16777216 77.394 ms 26.739 ms
33554432 154.021 ms 54.011 ms

Empowered with all these (optimized) kernels, it's time to present the whole compute pipeline with its inter-kernel dependencies. I use SYCL buffers for representing/ passing around data, which helps the SYCL runtime figure out the computational dependency graph, as a DAG.

Now I'd like to present the final benchmark of the whole compute pipeline of Parallel Similarity Transformation, depicting performance on both CPU and GPU. I run the benchmark suite with an N x N Hilbert matrix, finding the maximum eigen value and the respective eigen vector. The last column in the following table shows how many iterations were required before converging to the maximum eigen value. For matrices of large dimension, performance on GPU is better, because all component kernels perform better on GPU, making the whole compute pipeline faster.

running on Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Parallel Similarity Transform for finding max eigen value (with vector)
128 x 128 11 ms 9 round(s)
256 x 256 6 ms 10 round(s)
512 x 512 17 ms 12 round(s)
1024 x 1024 54 ms 13 round(s)
2048 x 2048 213 ms 14 round(s)
4096 x 4096 911 ms 15 round(s)
8192 x 8192 4052 ms 17 round(s)
// --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
running on Intel(R) Iris(R) Xe MAX Graphics [0x4905]
Parallel Similarity Transform for finding max eigen value (with vector)
128 x 128 13 ms 9 round(s)
256 x 256 9 ms 10 round(s)
512 x 512 14 ms 12 round(s)
1024 x 1024 35 ms 13 round(s)
2048 x 2048 130 ms 14 round(s)
4096 x 4096 535 ms 15 round(s)
8192 x 8192 2373 ms 17 round(s)

I keep the parallel similarity transform method's full implementation here for future reference. In this post, I attempted to show how I'd go about writing parallel kernels and eventually optimizing them. These are the major points I focus on during the optimization phase.

- Can global memory access be reduced ?
- Is it possible to use subgroup functions ?
- Is it possible to write local memory cached/ blocked variant ?

If you find some mistake/ a possible (yet undiscovered) way to improve some kernel, consider dropping me a line.

Have a great time !