Last week I was exploring Vulkan API for harnessing power of my laptop's integrated graphics card, though mainly interested in
Vulkan Compute API so that I can put parallelizable computations on GPU. For easily parallelizable tasks such as matrix multiplication,
matrix transposition, matrix-vector multiplication, matrix LU factorization, fast fourier transform, random linear network coding etc. where matrix algebra is heavily involved,
GPGPU shines. This is due to the fact how GPU computation is divided & conquered. For understanding it better, I planned to deep dive by implementing matrix
transposition algorithm in both CPU & GPU executable forms. I chose matrix transposition because it's easy to understand & can easily be parallelized,
due to absence of dependencies among smaller computation steps. Also computing on GPU requires copying data between CPU & GPU accessible memories, same applies
for matrix transposition where I copy large 2-dimensional arrays between device local *( read GPU memory )* & host *( read CPU memory )* memories.

First I'll take a look at how matrix transposition is computed. Also for sake of simplicity, I use one N x N matrix during practical demonstration. I don't choose M x N matrix where M != N, because it won't let me perform transposition in-place, calling for further allocating one N x M matrix, which stores transposed form. Notice, elements positioned on matrix diagonal are always unchanged after transposition. Only lower & upper triangular elements i.e. below & above of matrix diagonal respectively switch their places.

begin:
if i == j:
return
tmp := A[i][j]
A[i][j] = A[j][i]
A[j][i] = tmp
end

If I expand aforementioned algorithm & step through its execution flow for a 2 x 3 matrix, I see diagonal elements are always untouched --- they can be simply copied from original to transposed matrix, only remainings are switching their places.

// Generic M x N Matrix Transposition Algorithm
A = [[a, b, c], [d, e, f]] // 2 x 3 matrix original matrix
row := len(A) // read 2
col := len(A[0]) // read 3
B = [[0; row]; col] // 3 x 2 transposed matrix, now only holding zeros
// non-diagonal elements switching places
for i in 0..max(row, col) {
for j in 0..i {
if i < row && j < col:
B[j][i] = A[i][j]
if i < col && j < row:
B[i][j] = A[j][i]
}
}
// diagonal elements simply being copied
for i in 0..col {
for j in 0..row {
if i == j:
B[i][j] = A[i][j]
break
}
}

In a square matrix, it becomes easier because transposition doesn't change matrix dimension, so it's in-place. I can demonstrate that, only switching lower diagonal places of each row with respective upper diagonal portions of column, transposes a N x N matrix.

I can easily parallelize square matrix transposition on CPU, where each worker thread independently performs row/ column switching at cost of employing N-worker threads ( in total ). In that case transposition runs with execution cost of O(N), because each thread needs to switch ~N many elements. Notice, originally matrix transposition is O(N**2) algorithm. It's possible for large matrices that N-many threads can't be employed at same time, so work may be equally distributed among available P-many worker threads in round-robin fashion. In such case execution cost is ~ O(c*N).

My interest is in performing similar parallel transposition on GPGPU using Vulkan API, which requires me to write compute shader ( read code to be run on GPU ). This compute shader is parallelly invoked by Vulkan for each row & within each invocation elements across a row-column pair are switched. If GPU is able to invoke & execute N-many shaders at same time, transposition costs ~O(N). As GPUs are able to do so for quite large N, compared to CPUs, the speedup obtained is noticeably large.

I notice for each parallel invocation of shader code, if row under operation has index **I**, work group needs to
perform **I**-many switching. Those **I**-many switching are performed element-wise between first **I**-many
elements of **row A[I]** *( left to right )* & **column A[I]** *( top to bottom )*.
I can express this logic in compute shader code.

const uint idx = row_index; // index of row being operated on
// perform I-many switching between row A[I] & column A[I]
for(uint j = 0; j < idx; j++) {
const int tmp = matrix[idx][j];
matrix[idx][j] = matrix[j][idx];
matrix[j][idx] = tmp;
}

All I need to figure out now is, how to obtain index of row being operated on, which can be easily fetched by reading
gl_GlobalInvocationID.x in OpenGL Shading Language ( read GLSL ), which I make use of
for writing compute shader. I keep full matrix transposition compute shader
here.

With this compute shader, I can ask Vulkan API to dispatch N-many work groups, each of size 32. It denotes graphics card will invoke compute shader 32
times, in parallel, for 32 row indices of matrix under operation. After one work group completes, another starts, again spawning 32 execution threads.
This way N-many work groups are dispatched in total, resulting into invocation of compute shader N x 32 times. I solve for N & dispatch 32 work groups
when matrix has dimension of 1024 x 1024. Notice, none of these invocations are repeating any row of matrix. Also matrix transposition's smaller
computation steps ( read row/ column switching ) are independent of each other, resulting into no need for use of synchronization techniques to prevent data race.

Vulkan API allows me to dispatch shader execution not only in single dimension, but also along 2 or 3 dimensions. 2 dimensional dispatching can be used
for image processing, where some filter ( read kernel of size K x K ) is applied on each pixel of original image. In that case current pixel
under operation can be figured out inside compute shader by checking gl_GlobalInvocationID.xy, which is a vector with 2 elements,
storing row & column indices of image respectively. May be some other day, I'll talk more about usecases of multi-dimensional shader dispatching in compute pipeline.

There may be situations when matrix dimension is not properly divisible by 32, in that case some invocation of shader by Vulkan API
may have gl_GlobalInvocationID.x value >= matrix dimension, which are to be ignored, because no row with certain
index exists in matrix under question & accessing that memory from shader should be illegal.

N x 32 = number-of-rows // N = #-of work groups dispatched

Remaining part is making matrix data available to GPU accessible memory, so that it can be accessed by compute shader, which requires
me to bind GPU allocated buffers with descriptors defined in shader. And after computation is done, transposed matrix needs to be read
back by CPU executable code, which requires mapping GPU accessible storage buffer to CPU accessible memory space.
I think setting up Vulkan before computation can be enqueued, is quite verbose, requiring
me to specify every single detail of memory allocation, association of descriptor set & pipeline layout with compiled shader module etc..
Though for sake of simplicity I used vulkano, which is higher level abstraction
over Vulkan API, hiding some complexities, letting me write portions of GPGPU code in Rust. I keep whole implementation
here for future reference.

Finally it's time to transpose randomly generated 1024 x 1024 matrix, where I compute transposition on both GPU, CPU ( sequential ) & assert
result for correctness.

$ cargo build --release
$ ./target/release/matrix_transpose
Device: Intel(R) HD Graphics 5500 (BDW GT2)
Vulkan API: 1.2.0
Queue Count: 1 Compute: true Graphics: true
GPU matrix transpose: 3.994984ms
CPU matrix transpose: 22.005047ms
Transpose asserted: 1.413437ms

In coming days, I plan to explore Vulkan more deeply, while sharing my understanding of how GPGPU can be used for parallelizing computations.

Have a great time !