Last week I started implementing the Similarity Transform Method, an iterative method for quickly finding the maximum eigenvalue, along with the corresponding eigenvector, of a positive square matrix. Today I plan to go through this method step-by-step, first developing a sequential implementation and finally moving to a parallel implementation, written targeting heterogeneous platforms i.e. CPU, GPU, FPGA etc. using SYCL DPC++. Finally, I'd like to do a performance comparison between the sequential and parallel implementations, both written by me.

As the Similarity Transform Method works with positive square matrices, I'd like to fix one matrix to work on throughout. Let me call it A_{N x N}, where N = 3.
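Concretely, here's the matrix ( the same one the final verification step asserts against ):

```python
import numpy as np

# the 3 x 3 positive square matrix used throughout this demonstration
A = np.array([[1, 2, 3],
              [2, 3, 4],
              [3, 4, 5]])
```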

I start by finding the sum of elements across each row of A_{N x N}. I can write the 👇 code snippet for computing that.

```python
def sum_across_rows(mat):
    n = mat.shape[0]  # number of rows
    # i-th entry holds the sum of all elements in the i-th row
    v = np.array([np.sum(mat[i]) for i in range(n)])
    return v
```
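For A, this produces the following row sum vector:

```python
>>> sum_across_rows(A)
array([ 6,  9, 12])
```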

Notice, I mark the maximum row sum, as we have to use it for ( iteratively ) computing the eigenvector in the next step. I initialise the iteratively computed eigenvector with [1]_{N} i.e. an N-element vector of all ones, and during each iteration the following function is used for updating the eigenvector, moving it closer to the desired values.

```python
eigen_vec = np.ones(3)           # == [1., 1., 1.]
row_sums = np.array([6, 9, 12])  # sum_across_rows(A), from the first step

def compute_eigen_vector(eigen_vec, sum_across_rows):
    max_row = np.max(sum_across_rows)  # == 12 in the first iteration
    # scale each entry by its row sum relative to the maximum row sum
    return np.array([j * (sum_across_rows[i] / max_row) for i, j in enumerate(eigen_vec)])
```
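Starting from the all-ones vector, the very first update works out as:

```python
>>> compute_eigen_vector(eigen_vec, row_sums)
array([0.5 , 0.75, 1.  ])
```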

Now I'd like to present the next step, where I check whether we've converged to the maximum eigenvalue or not. It's as simple as checking whether all consecutive absolute differences between cells of the *sum_across_rows* vector are below a preset epsilon value, which I consider to be the error induced by floating point arithmetic. If all consecutive differences are below, say, **EPS = 1e-5**, it's a good time to stop this iterative method, as we've converged to the maximum eigenvalue of the given matrix. I can then take the very first element of the *sum_across_rows* vector ( in the final round ) and consider it to be the maximum eigenvalue we're looking for.

I write the convergence criteria checker function as below. In the very first iteration we're far from convergence, as our row sum vector looks like [6, 9, 12] --- so the answer is negative and we continue to the next iteration.

```python
EPS = 1e-5

def converged(sum_vec) -> bool:
    # all consecutive absolute differences must fall below EPS
    return all(map(lambda e: e < EPS, [abs(sum_vec[i] - sum_vec[i - 1]) for i in range(1, len(sum_vec))]))
```
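On the first round's row sums, the answer is indeed negative:

```python
>>> converged(row_sums)  # consecutive absolute differences are 3 and 3
False
```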

Before I prepare A_{N x N} for the next iteration, the current eigenvector stands at [0.5, 0.75, 1.0] i.e. [6/12, 9/12, 12/12], as computed above.

In the final step, I take the *sum_across_rows* vector and prepare the diagonal matrix D_{N x N} along with its inverse D_{N x N}^{-1}, using which I compute A^{'}_{N x N} = D^{-1}_{N x N} x A_{N x N} x D_{N x N} for the next iteration. I programmatically express that logic 👇.

```python
def compute_next(mat, sum_vec):
    sigma = np.diag(sum_vec)          # D
    sigma_inv = np.diag(1 / sum_vec)  # D^-1 ( sum_vec must be a numpy array )
    return np.matmul(np.matmul(sigma_inv, mat), sigma)  # D^-1 x A x D
```

With this step complete, I have the 👇 matrix, which is to be worked on during the next iteration.
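```python
>>> compute_next(A, sum_across_rows(A))
array([[1.        , 3.        , 6.        ],
       [1.33333333, 3.        , 5.33333333],
       [1.5       , 3.        , 5.        ]])
```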

In the second iteration, I already find the *sum_across_rows* vector --- now [10, 9.67, 9.5] ( rounded ) --- having smaller consecutive absolute differences, i.e. tending to converge.

After going through each of these steps repeatedly, the convergence criteria is fully satisfied during the 5^{th} iteration, resulting in the eigenvalue and corresponding eigenvector shown in the verification step below.
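Putting the pieces together, 👇 is a minimal sketch of the full iteration loop; `max_eigen` and `max_rounds` are names I'm choosing here for illustration, not taken from the reference implementation.

```python
def max_eigen(mat, max_rounds=100):
    eigen_vec = np.ones(mat.shape[0])  # initialise with N-many ones
    for r in range(1, max_rounds + 1):
        sum_vec = sum_across_rows(mat)                        # step 1: row sums
        eigen_vec = compute_eigen_vector(eigen_vec, sum_vec)  # step 2: update eigenvector
        if converged(sum_vec):                                # step 3: stop criteria
            break
        mat = compute_next(mat, sum_vec)                      # step 4: similarity transform
    # first element of the final round's row sums ~= maximum eigenvalue
    return sum_vec[0], eigen_vec, r
```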

Just to check that the results are correct, I assert the following, while accounting for the floating point arithmetic error that will probably be present.

```python
# Av = λv --- must be satisfied !
A = np.array([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
v = np.array([0.49835378, 0.72407576, 0.94979774])
λ = 9.623376623376622
# set `atol` to EPS; during this demonstration EPS = 1e-3
>>> assert np.allclose(np.dot(A, v), np.dot(λ, v), atol=1e-3)
>>> print(f'Av = {np.dot(A, v)}\nλv = {np.dot(λ, v)}')
Av = [4.79589852 6.9681258  9.14035308]
λv = [4.79584612 6.96805374 9.14026137]
```

I keep the full sequential implementation of the Similarity Transform Method here for future reference.

Now it's a good time to talk about the parallel implementation of this method. The first step i.e. summing N-many elements across N-many rows is easily parallelizable. I currently make use of atomic instructions, spawning N x N work-items, each of which reads the value at cell A_{i, j} and atomically adds it to its row's sum. In the second step, I find the maximum value of the just computed N-element vector i.e. *sum_across_rows*. Here too I make use of atomic instructions for computing the maximum element of the vector in parallel. The third step consists of updating the eigenvector with this iteration's *sum_across_rows*, which is also a good candidate for parallel computing --- each of N-many work-items updates a single cell of the eigenvector. Finally, the last step requires me to perform two matrix multiplications, but given that two of the three operands are diagonal matrices and one is the inverse of the other, I can compute the matrix for the next round using the 👇 trick.

A^{'}_{i, j} = A_{i, j} x ( sum_across_rows_{j} / sum_across_rows_{i} )

This trick lends itself to heavy parallelism, as each cell of the A^{'}_{N x N} matrix can be computed independently using N x N work-items.
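For reference, the same trick expressed in NumPy ( a sketch of the arithmetic only, not the actual SYCL DPC++ kernel; `compute_next_fast` is a name of my choosing ):

```python
def compute_next_fast(mat, sum_vec):
    s = np.asarray(sum_vec, dtype=float)
    # A'[i][j] = A[i][j] * ( sum_vec[j] / sum_vec[i] ) --- no matrix multiplication required
    return mat * (s[np.newaxis, :] / s[:, np.newaxis])
```

Each output cell depends on exactly one input cell and two vector entries, which is why work-items need no coordination among themselves.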

I also have a parallel implementation, again using atomic instructions, for checking whether the current round has satisfied the convergence criteria or not. Empowered with these, I'm ready to run both the sequential and the data parallel implementations for a performance comparison. I start with the sequential implementation I wrote in Python --- a single threaded implementation using *numpy* for the matrix algebra. Note the performance when N = 1024 ( read the last row ).

```bash
# check https://github.com/itzmeanjan/eigen_value/tree/964695e#sequential-reference-implementation
$ python3 main.py
Sequential Similarity Transform, for finding maximum eigen value ( with vector )
32   x 32          1.27 ms          5 round(s)
64   x 64          2.44 ms          5 round(s)
128  x 128         9.67 ms          5 round(s)
256  x 256        20.58 ms          4 round(s)
512  x 512        74.24 ms          4 round(s)
1024 x 1024      335.77 ms          4 round(s)
```

The data parallel implementation, powered by **U**nified **S**hared **M**emory and atomics, doesn't perform very well compared to the previous one. Note, for both the data parallel and the sequential implementations, I'm using randomly generated positive square matrices with each cell taking a value from [0, 1).

```bash
# check https://github.com/itzmeanjan/eigen_value/tree/964695e#benchmark
$ make && ./run
running on Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Parallel Algorithm using Similarity Transform for finding max eigen value (with vector)
32   x 32            4 ms
64   x 64          200 ms
128  x 128         203 ms
256  x 256         442 ms
512  x 512         755 ms
1024 x 1024       1860 ms
--- --- --- --- --- --- --- --- --- --- --- ---
$ make && ./run
running on Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz
Parallel Algorithm using Similarity Transform for finding max eigen value (with vector)
32   x 32          150 ms
64   x 64          117 ms
128  x 128         159 ms
256  x 256         182 ms
512  x 512         270 ms
1024 x 1024        440 ms
```

I'd like to note a few points regarding why the data parallel implementation doesn't do better than the sequential one.

- I'm making heavy use of atomic instructions, which limits scalability to the underlying device's ability to execute atomics
- I'll consider blocked reading from global memory using vectors of width 2/ 4/ 8, which will promote coalesced memory access
- While computing the matrix for the next iteration, I access the same cells of the *sum_across_rows* vector multiple times, which results in duplicate, improperly strided global memory reads
- Speaking of the previous note, I'll consider making better use of cheaper & faster local scratch pad memory, so that I don't end up generating so many duplicate global memory accesses

Keeping the aforementioned points in mind, I end today's discussion. I keep the full implementation of both the sequential and the parallel Similarity Transform Method in this repository. In the coming days, I'll try to improve the data parallel implementation, while also sharing my findings.

Have a great time !