Last week I worked on implementing LU decomposition on a GPGPU using SYCL DPC++. As it's a pretty useful method for
solving systems of linear equations and finding the determinant & inverse of a matrix, I thought it was worth
taking a look at this problem, where I attempt to implement its SIMD form, capable of running on high-throughput machines --- GPGPUs.
My objective is to split a given matrix A_{N x N} into lower & upper triangular matrices L_{N x N} & U_{N x N} respectively, such that

A_{N x N} = L_{N x N} x U_{N x N}

I'll start with an example, going through the steps of the algorithm while talking about its respective SIMD implementation. I choose to work with the following matrix.

As I know I have to split A into L & U, I start by taking two square matrices, initially setting L = I_{N x N} & U = A.
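In numpy terms, the initialization is just this (a small sketch; the A here is the same 4 x 4 example matrix verified at the end of this post):

```python
import numpy as np

# the 4 x 4 example matrix used throughout this post
A = np.array([[ 2,  4,  3,  5],
              [-4, -7, -5,  8],
              [ 6,  8,  2,  9],
              [ 4,  9, -2, 14]], dtype=float)

N = A.shape[0]
L = np.identity(N)  # L starts as I_{N x N}
U = A.copy()        # U starts as a copy of A
```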

As LU decomposition is just the forward elimination step of the Gauss-Jordan method, used for solving systems of linear equations, while
recording elementary row operations in L_{N x N} --- I start working on column_{0}, hoping to zero out
all cells below pivot U_{0, 0}. First I record these elementary row operations in L_{N x N}, which has all pivots set to 1.

// record elementary row operations in L
col = 0
N = 4
for row in range(col+1, N):
    L[row][col] = A[row][col] / A[col][col]

The resulting L_{N x N} looks like below.

Now I zero out all elements below U_{0, 0}, applying the row operations across whole rows of the submatrix of U_{N x N}.
I can express it in the following snippet of pseudocode.

// perform actual row operations & update U
col = 0
N = 4
for row in range(col+1, N):
    factor = A[row][col] / A[col][col]
    for c in range(col, N):
        U[row][c] = A[row][c] - (A[col][c] * factor)

The resulting U_{N x N} looks like below, where the marked submatrix was operated on.
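The two pseudocode snippets above can be fused into one runnable round of elimination for column 0 (a sketch in numpy; the factor is computed once and used both to fill L and to update U):

```python
import numpy as np

A = np.array([[ 2,  4,  3,  5],
              [-4, -7, -5,  8],
              [ 6,  8,  2,  9],
              [ 4,  9, -2, 14]], dtype=float)
N = 4
L = np.identity(N)
U = A.copy()

col = 0  # current pivot is U[0][0]
for row in range(col + 1, N):
    factor = A[row][col] / A[col][col]
    L[row][col] = factor                                 # record row operation in L
    U[row, col:] = A[row, col:] - factor * A[col, col:]  # eliminate in U
```

After this round the first column of U below the pivot is zeroed out, and the first column of L holds the recorded factors.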

From the first iteration of the elimination step, where I worked to zero out elements below U_{0, 0}, it's clear that
the updates of L & U can go side by side. After each round of elimination where U_{i, i} is chosen as the
pivot of interest, the original matrix A can be updated with the content of the updated U, as my interest during the next iteration ( when U_{i+1, i+1} becomes the pivot )
will be to clean up the cells below that pivot, while recording elementary row operations in L_{N x N}.

The current pivot is U_{1, 1} & I record the elimination steps in L_{N x N}, resulting in the following L.

In parallel I also update the marked submatrix of U_{N x N}, resulting in the following.

After copying U_{N x N} into A_{N x N}, I select U_{2, 2} as the pivot in the final elimination step.
Updating both L & U in parallel results in the following matrices.

There's one more pivot in U, i.e. U_{3, 3}, which doesn't need to be processed, as there's no forward
elimination step that can be performed with it. Also, both L & U are now in the desired triangular form. Just to assert that
the LU decomposition worked, I compute L x U, which results in A.

>>> import numpy as np
>>> l = np.array([[ 1,  0,  0,  0],
...               [-2,  1,  0,  0],
...               [ 3, -4,  1,  0],
...               [ 2,  1,  3,  1]])
>>> u = np.array([[ 2,  4,  3,    5],
...               [ 0,  1,  1,   18],
...               [ 0,  0, -3,   66],
...               [ 0,  0,  0, -212]])
>>> np.matmul(l, u)
array([[ 2,  4,  3,  5],
       [-4, -7, -5,  8],
       [ 6,  8,  2,  9],
       [ 4,  9, -2, 14]]) # asserted !
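For reference, the whole procedure fits in a few lines of sequential numpy (a sketch mirroring what the GPU kernels compute; note that by eliminating in-place on U and reading the factors back from U, it sidesteps the per-iteration copy into A):

```python
import numpy as np

def lu_decompose(A):
    """Doolittle-style LU decomposition without pivoting,
    following the elimination scheme described above."""
    N = A.shape[0]
    L = np.identity(N)
    U = A.astype(float).copy()
    for col in range(N - 1):                    # pivot U[col][col]
        for row in range(col + 1, N):
            factor = U[row][col] / U[col][col]
            L[row][col] = factor                    # record row operation
            U[row, col:] -= factor * U[col, col:]   # forward elimination
    return L, U
```

Running it on the example A reproduces exactly the l & u matrices from the interpreter session above.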

As LU decomposition has data dependencies on previous iterations, I have to offload the computation
to the GPGPU in multiple iterations. Selecting each pivot is done on the host machine, while processing
the selected pivot, i.e. forward elimination in U & recording of the respective elementary row operations in L,
is performed on the GPGPU. Another thing to notice: in each iteration, during elimination of cells below the selected pivot,
the dimension of the submatrix under operation keeps decreasing.

I can express the scope of parallelism in LU decomposition using the following diagram.

The SIMD LU decomposition's performance turns out to be unsatisfactory --- I was expecting better with a 1024 x 1024 random dense matrix.
One definite reason behind the not-so-good performance is that after each iteration I'm performing a large copy from U_{N x N} to A_{N x N}.
Also, as the dimension of the operable submatrix keeps decreasing after each iteration, lots of work items are just doing nothing useful --- **huge loss** !
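To put a number on that loss --- assuming a full N x N grid of work items is launched for every elimination round (an assumption made for this estimate), only about a third of all launched work items ever touch a cell of the operable submatrix:

```python
# rough estimate of wasted work, assuming a full N x N grid of work
# items is launched for each of the N - 1 elimination rounds
N = 1024
launched = useful = 0
for col in range(N - 1):
    launched += N * N
    # cells actually operated on: rows col+1..N-1, columns col..N-1
    useful += (N - 1 - col) * (N - col)

print(f"useful fraction: {useful / launched:.3f}")  # tends to 1/3 for large N
```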

$ dpcpp -fsycl lu_decomposition.cpp && ./a.out
running on Intel Xeon Processor (Skylake, IBRS)
LU decomposition: 3840 ms 😕
max deviation: 0.0765064

My interest in the coming days will be to figure out a way to reduce lost computation cycles, where lots of work items do nothing
because they point at cells which don't belong to the current operable submatrix.
I'd also like to take a look at how I can reduce the amount of back-and-forth **memcpy**(s), which are expensive. I keep the current SIMD implementation of LU factorization
here for future reference.

Have a great time !