Comparing Basic Linear Algebra Subprograms
- The time to run a program can be modeled as: time to run code = clock cycles running code + clock cycles waiting for memory.
- The improvement rate in processor speed (Moore’s Law, ~60%/yr) exceeds that of DRAM memory (~7%/yr), so the gap between state-of-the-art processor speed and memory access speed grows by roughly 53%/yr. The bottleneck for computing is therefore memory access, not the number of arithmetic operations.
- Computing the matrix product of two $n\times n$ matrices has a computational complexity of $O(n^3)$ and is slow. Different algorithms have been devised for computing the matrix product of large matrices.
Introduction
Matrix Multiplication or Matrix Product is an operation that produces a matrix from two matrices.
If $A\in \mathbb{R}^{m\times p}$ is represented by
\[\begin{gather} A = \begin{bmatrix} a_{11} & a_{12} & a_{13} & \dots & a_{1p} \\ a_{21} & a_{22} & a_{23} & \dots & a_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & a_{m3} & \dots & a_{mp} \end{bmatrix} \end{gather}\]and $B\in \mathbb{R}^{p\times n}$ is represented by
\[\begin{gather} B = \begin{bmatrix} b_{11} & b_{12} & b_{13} & \dots & b_{1n} \\ b_{21} & b_{22} & b_{23} & \dots & b_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ b_{p1} & b_{p2} & b_{p3} & \dots & b_{pn} \end{bmatrix} \end{gather}\]then the matrix product can be defined as $C = AB$ where \(c_{ij} = \sum^p_{k=1}a_{ik}b_{kj} \quad 1\leq i \leq m\ , 1\leq j \leq n\)
\[\begin{gather} C = AB = \begin{bmatrix} c_{11} & c_{12} & c_{13} & \dots & c_{1n} \\ c_{21} & c_{22} & c_{23} & \dots & c_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ c_{m1} & c_{m2} & c_{m3} & \dots & c_{mn} \end{bmatrix} \end{gather}\]Computing the matrix product of two $n\times n$ matrices has a computational complexity of $O(n^3)$ and is hence slow. Different algorithms have been devised for computing the matrix product of large matrices. In this project, a survey of methods that take computer architecture into account will be explored.
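As a quick check of the definition above, a $2\times 2$ worked example:
\[\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}\begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 1\cdot 5 + 2\cdot 7 & 1\cdot 6 + 2\cdot 8 \\ 3\cdot 5 + 4\cdot 7 & 3\cdot 6 + 4\cdot 8 \end{bmatrix} = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}\]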
Computing Bottleneck
The time to run a program can be modeled by:
\[\text{time to run code} = \text{clock cycles running code} + \text{clock cycles waiting for memory}\]The improvement rate in processor speed (Moore’s Law, ~60%/yr) exceeds that of DRAM memory (~7%/yr), so the gap between state-of-the-art processor speed and memory access speed grows by roughly 53%/yr. The bottleneck for computing is therefore memory access, not the number of arithmetic operations.
To model this, we can define a simple memory model with two levels: fast and slow memory. All data initially resides in slow memory, and data can only be processed quickly once it is in fast memory, so data must be transferred from slow to fast memory before each subroutine. The performance metric of interest is the number of floating point operations (flops) performed per slow memory access.
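A minimal sketch of this two-level model in Python is shown below. The constants t_flop and t_mem are assumed, purely illustrative values (seconds per flop and per slow memory access), not measurements, and the function names are hypothetical.
def estimated_runtime(flops, mem_refs, t_flop=1e-9, t_mem=1e-7):
    # time to run code = time spent computing + time spent waiting for slow memory
    # t_flop and t_mem are assumed illustrative constants, not measured values
    return flops*t_flop + mem_refs*t_mem

def flops_per_access(flops, mem_refs):
    # higher ratios mean more useful work done per slow memory reference
    return flops/mem_refs

# example: a matrix-vector style operation with ~2*n^2 flops and ~n^2 slow accesses
n_demo = 1000
print(estimated_runtime(2*n_demo**2, n_demo**2), flops_per_access(2*n_demo**2, n_demo**2))
The model makes the point directly: for a fixed amount of arithmetic, the estimated runtime is dominated by the slow-memory term unless the flops-per-access ratio is high.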
Basic Linear Algebra Subprograms (BLAS)
BLAS are low-level kernels that provide routines for performing linear algebra operations such as matrix addition, matrix multiplication and dot products. These routines tend to be optimized for particular machines and are the standard building blocks of modern linear algebra software.
BLAS separates its functionality into three levels: BLAS-1, BLAS-2 and BLAS-3. BLAS-1 performs scalar, vector and vector-vector operations. BLAS-2 performs matrix-vector operations. BLAS-3 performs matrix-matrix operations. A summary table of performance metrics is shown below:
\[\begin{array}{|c|c|c|c|} \hline \text{BLAS} & \text{Memory References} & \text{FLOPS} & \text{FLOPS/Memory References} \\ \hline 1 & 3n & 2n & 2/3 \\ \hline 2 & n^2 & 2n^2 & 2 \\ \hline 3 & 4n^2 & 2n^3 & n/2 \\ \hline \end{array}\]
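The three levels can be illustrated with plain NumPy updates. This is only a rough sketch: the variable names are illustrative, and the last line typically dispatches to an optimized BLAS-3 gemm when NumPy is linked against a BLAS library.
import numpy as np

n_demo = 200
a = 2.0
x, y = np.random.rand(n_demo), np.random.rand(n_demo)
A, B, C = np.random.rand(n_demo, n_demo), np.random.rand(n_demo, n_demo), np.random.rand(n_demo, n_demo)

y = a*x + y      # BLAS-1: vector update (axpy), O(n) flops on O(n) data
y = A @ x + y    # BLAS-2: matrix-vector update (gemv-like), O(n^2) flops on O(n^2) data
C = A @ B + C    # BLAS-3: matrix-matrix update (gemm-like), O(n^3) flops on O(n^2) data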
Experiment Methods
Five different algorithms for matrix multiplication will be tested and compared by FLOP rate for square matrices. The matrix size $n$ will be varied and plotted against MFLOPS/s. In this case, 10 values of $n$ from 10 to 500 were selected using the NumPy linspace function.
The MFLOPS/s will be computed using the equation $\mathrm{MFLOPS/s} = \frac{2n^3}{10^6 \times \text{runtime}}$, where the runtime is measured in seconds.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
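For reference, the equation above can be wrapped in a small helper. The function name mflops_per_s is illustrative only and is not used by the timing routines below.
def mflops_per_s(n, runtime):
    # MFLOPS/s = 2*n^3 / (1e6 * runtime), with runtime in seconds for an n-by-n product
    return 2*n**3/(1e6*runtime)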
1. Triple Loop
The triple loop is the naïve element-wise implementation, computing each entry directly from the definition given in the introduction: \(c_{ij} = \sum^p_{k=1}a_{ik}b_{kj} \quad 1\leq i \leq m\ , 1\leq j \leq n\)
def matmul_triple(A, B):
    # naive triple loop: accumulate each entry c_ij one multiply-add at a time
    [m, p, n] = mat2_shape(A, B)
    C = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            for k in range(p):
                C[i, j] = C[i, j] + A[i, k]*B[k, j]
    return C
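Counting the operations in the triple loop also explains the $2n^3$ used in the MFLOPS/s formula above: each of the $mn$ entries of $C$ requires $p$ multiplications and $p$ additions,
\[\text{flops} = \sum^m_{i=1}\sum^n_{j=1}\sum^p_{k=1} 2 = 2mnp = 2n^3 \quad \text{for } m = n = p.\]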
2. Dot Product (BLAS-1)
The dot product of two vectors $a = [a_1, a_2, \dots, a_n]$ and $b = [b_1, b_2, \dots, b_n]$ is defined as \(a\cdot b = \sum^n_{i=1}a_i b_i\)
Matrix multiplication can be implemented with dot products as \(c_{ij} = a_{i}\cdot b_{j} \quad 1\leq i \leq m\ , 1\leq j \leq n\), where $a_i$ is the $i$-th row of $A$ and $b_j$ is the $j$-th column of $B$.
def matmul_dot(A, B):
    # each entry c_ij is the dot product of row i of A with column j of B (BLAS-1)
    [m, p, n] = mat2_shape(A, B)
    C = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            C[i, j] = C[i, j] + np.dot(A[i, :], B[:, j])
    return C
3. Saxpy (BLAS-1)
Saxpy (Single-Precision A X Plus Y) is the operation $y:=ax + y$, where $x,y$ are vectors and $a$ is a scalar. Matrix multiplication can be implemented as saxpy updates on the columns of $C$: \(c_i = c_i + a_j b_{ji} \quad 1\leq i \leq n\ , 1\leq j \leq p\), where $c_i$ is the $i$-th column of $C$ and $a_j$ is the $j$-th column of $A$.
def matmul_saxpy(A, B):
    # build each column of C from saxpy updates: C[:, i] += B[j, i] * A[:, j] (BLAS-1)
    [m, p, n] = mat2_shape(A, B)
    C = np.zeros((m, n))
    for i in range(n):
        for j in range(p):
            C[:, i] = C[:, i] + A[:, j]*B[j, i]
    return C
4. Matrix-Vector (BLAS-2)
Matrix-vector multiplication is defined as $y:=y+Ax$, where $A$ is a matrix and $x$ is a vector. Matrix multiplication can then be built one column at a time as \(c_k = c_k + Ab_k \quad 1\leq k \leq n\), where $c_k$ and $b_k$ are the $k$-th columns of $C$ and $B$.
def matmul_matvec(A, B):
    # each column of C is a matrix-vector product A @ B[:, k] (BLAS-2)
    [m, p, n] = mat2_shape(A, B)
    C = np.zeros((m, n))
    for k in range(n):
        C[:, k] = C[:, k] + np.matmul(A, B[:, k])
    return C
5. Outer product (BLAS-2)
The outer product is defined as $a\otimes b = ab^T$. If $a$ and $b$ have dimensions $m$ and $n$, then their outer product is an $m \times n$ matrix. Matrix multiplication can be implemented as a sum of outer products, \(C = C + a_i b^T_i \quad 1\leq i \leq p\), where $a_i$ is the $i$-th column of $A$ and $b^T_i$ is the $i$-th row of $B$.
def matmul_outer(A, B):
    # accumulate C as a sum of rank-1 outer products of columns of A with rows of B (BLAS-2)
    [m, p, n] = mat2_shape(A, B)
    C = np.zeros((m, n))
    for i in range(p):
        C[:, :] = C[:, :] + np.outer(A[:, i], B[i, :])
    return C
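Each term np.outer(A[:, i], B[i, :]) is a rank-1 (ger-type) update. As an illustrative check, separate from the timed kernel, the same update can be written with broadcasting:
# broadcasting a column (m x 1) against a row (1 x n) reproduces the outer product
A_demo = np.random.rand(4, 3)
B_demo = np.random.rand(3, 5)
i = 1
assert np.allclose(np.outer(A_demo[:, i], B_demo[i, :]), A_demo[:, i:i+1] * B_demo[i:i+1, :])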
Misc Function(s)
def mat2_shape(A, B):
    # return (m, p, n) for A of shape (m, p) and B of shape (p, n)
    [m, p] = A.shape
    [p, n] = B.shape
    return m, p, n
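Before timing, a quick sanity check (not part of the timed runs) can confirm that all five kernels agree with NumPy's built-in matmul on a small random case:
A_test = np.random.rand(8, 8)
B_test = np.random.rand(8, 8)
reference = np.matmul(A_test, B_test)
for kernel in [matmul_triple, matmul_dot, matmul_saxpy, matmul_matvec, matmul_outer]:
    # all kernels should match the reference result to floating point tolerance
    assert np.allclose(kernel(A_test, B_test), reference), kernel.__name__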
Get running time routine
funcs = ['matmul_triple', 'matmul_dot', 'matmul_saxpy',
'matmul_matvec', 'matmul_outer']
func_names = ['triple loop', 'dot product (BLAS-1)', 'saxpy (BLAS-1)',
'matrix-vector (BLAS-2)', 'outer product (BLAS-2)']
n = np.linspace(10, 500, 10)
t = np.zeros((len(n), len(funcs)))
for i in range(len(n)):
    size = int(n[i])
    A = np.random.rand(size, size)
    B = np.random.rand(size, size)
    for j in range(len(funcs)):
        start_time = time.time()
        C = eval(funcs[j] + '(A, B)')
        t[i, j] = time.time() - start_time
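A variation on the loop above (a sketch only; it is not the source of the timings reported below): dispatching through a list of function objects avoids eval, and time.perf_counter is a monotonic clock intended for measuring durations.
kernels = [matmul_triple, matmul_dot, matmul_saxpy, matmul_matvec, matmul_outer]
t_alt = np.zeros((len(n), len(kernels)))
for i in range(len(n)):
    size = int(n[i])
    A = np.random.rand(size, size)
    B = np.random.rand(size, size)
    for j, kernel in enumerate(kernels):
        start = time.perf_counter()          # monotonic timer for durations
        kernel(A, B)
        t_alt[i, j] = time.perf_counter() - start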
Get MFLOPS/s routine
flops = np.zeros((t.shape[0], t.shape[1]))
for i in range(t.shape[0]):
    flops[i, :] = 2*(n[i]**3)/t[i, :]/(1e6)
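Equivalently, the loop can be written in one vectorized step with broadcasting (an optional sketch, producing the same array):
flops_vec = 2*(n[:, None]**3)/t/(1e6)   # (10, 1) sizes broadcast against the (10, 5) timing array
assert np.allclose(flops_vec, flops)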
Experiment Results
Run Time Comparison
dispt = np.zeros((t.shape[0], t.shape[1] + 1))
dispt[:, 0] = np.round(n).astype('int')
dispt[:, 1:] = t
df_t = pd.DataFrame(dispt, columns = np.append(['n'], func_names))
df_t['n'] = df_t['n'].astype(int)
df_t
|   | n | triple loop | dot product (BLAS-1) | saxpy (BLAS-1) | matrix-vector (BLAS-2) | outer product (BLAS-2) |
|---|---|---|---|---|---|---|
| 0 | 10 | 0.001958 | 0.000494 | 0.001043 | 0.000294 | 0.000350 |
| 1 | 64 | 0.324073 | 0.007775 | 0.028444 | 0.000872 | 0.002727 |
| 2 | 119 | 2.483323 | 0.056787 | 0.104679 | 0.000967 | 0.008148 |
| 3 | 173 | 7.373535 | 0.110846 | 0.270866 | 0.002401 | 0.019772 |
| 4 | 228 | 16.292575 | 0.119387 | 0.522254 | 0.004367 | 0.040190 |
| 5 | 282 | 28.353514 | 0.195107 | 0.848704 | 0.005697 | 0.071406 |
| 6 | 337 | 49.096077 | 0.392282 | 1.279814 | 0.007889 | 0.109456 |
| 7 | 391 | 76.775584 | 0.513610 | 1.959762 | 0.012160 | 0.248771 |
| 8 | 446 | 113.867998 | 0.725625 | 2.673684 | 0.018931 | 0.330257 |
| 9 | 500 | 161.462556 | 0.949153 | 3.341192 | 0.023851 | 0.473899 |
plt.figure(figsize = (15, 10))
plt.xlabel('n')
plt.ylabel('Run time (s)')
plt.title('Run Time Comparison between different BLAS-type Operation Kernels')
for i in range(t.shape[1]):
    plt.plot(n, t[:, i], label = func_names[i])
plt.legend()
plt.yscale('log')
plt.show()
MFLOPS/s Comparison
dispf = np.zeros((flops.shape[0], flops.shape[1] + 1))
dispf[:, 0] = np.round(n).astype('int')
dispf[:, 1:] = flops
df_f = pd.DataFrame(dispf, columns = np.append(['n'], func_names))
df_f['n'] = df_f['n'].astype(int)
df_f
|   | n | triple loop | dot product (BLAS-1) | saxpy (BLAS-1) | matrix-vector (BLAS-2) | outer product (BLAS-2) |
|---|---|---|---|---|---|---|
| 0 | 10 | 1.021506 | 4.048556 | 1.917396 | 6.803413 | 5.714311 |
| 1 | 64 | 1.651746 | 68.844448 | 18.818775 | 613.933586 | 196.289135 |
| 2 | 119 | 1.353383 | 59.184310 | 32.106561 | 3476.344810 | 412.482171 |
| 3 | 173 | 1.412539 | 93.962826 | 38.452263 | 4337.740537 | 526.774207 |
| 4 | 228 | 1.450689 | 197.973229 | 45.256638 | 5412.442808 | 588.093318 |
| 5 | 282 | 1.585611 | 230.425641 | 52.972115 | 7891.444145 | 629.604966 |
| 6 | 337 | 1.554474 | 194.550326 | 59.632566 | 9673.719498 | 697.253226 |
| 7 | 391 | 1.558502 | 232.968327 | 61.055832 | 9839.989620 | 480.983716 |
| 8 | 446 | 1.553581 | 243.794175 | 66.164569 | 9344.554011 | 535.652727 |
| 9 | 500 | 1.548347 | 263.392745 | 74.823590 | 10481.672148 | 527.538693 |
plt.figure(figsize = (15, 10))
plt.xlabel('n')
plt.ylabel('MFLOPS/s')
plt.title('MFLOPS/s Comparison between different BLAS-type Operation Kernels')
for i in range(flops.shape[1]):
    plt.plot(n, flops[:, i], label = func_names[i])
plt.legend()
plt.yscale('log')
plt.show()
Conclusion
From the collected data, the MFLOPS/s ranking was matrix-vector, outer product, dot product, saxpy and triple loop. This ranking was expected, because the BLAS-2 kernels were theorized to be faster than the BLAS-1 kernels, and both were theorized to be faster than the triple loop, due to their higher ratios of flops to memory references. The curves were also consistent with the expected shape: as the matrix size grows, the MFLOPS/s eventually plateaus, which would become more evident at even larger sizes. The maximum MFLOPS/s for each of the methods is shown below.
df_f_max = df_f.max()[1::].to_frame()
df_f_max.columns = ['Maximum MFLOPS/s']
df_f_max
|   | Maximum MFLOPS/s |
|---|---|
| triple loop | 1.651746 |
| dot product (BLAS-1) | 263.392745 |
| saxpy (BLAS-1) | 74.823590 |
| matrix-vector (BLAS-2) | 10481.672148 |
| outer product (BLAS-2) | 697.253226 |
Discussion
Although it is evident that BLAS-2 is faster than BLAS-1, there is still some question about the ratios of MFLOPS/s between the different methods. From the table in the introduction, BLAS-2 should be approximately $\frac{2}{2/3} = 3$ times as fast as BLAS-1. This ratio roughly holds when comparing the dot product (BLAS-1) with the outer product (BLAS-2), but not for the other combinations. One possible reason is that the matrices were not large enough for the MFLOPS/s to plateau. Due to CPU speed and time limitations, more data was not taken; a follow-up experiment collecting more data points for larger matrices should be run. The observed ratios and the percent change between successive matrix sizes are computed below.
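As a quick check on the collected data, the observed ratios between kernel pairs can be read directly from df_f (the pair labels below are illustrative names, not part of the original notebook):
ratios = pd.DataFrame({
    'n': df_f['n'],
    'matrix-vector / dot product': df_f['matrix-vector (BLAS-2)'] / df_f['dot product (BLAS-1)'],
    'outer product / dot product': df_f['outer product (BLAS-2)'] / df_f['dot product (BLAS-1)'],
    'outer product / saxpy': df_f['outer product (BLAS-2)'] / df_f['saxpy (BLAS-1)'],
})
ratios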
df_f_pct_change = df_f.groupby('n').mean().pct_change()
df_f_pct_change
| n | triple loop | dot product (BLAS-1) | saxpy (BLAS-1) | matrix-vector (BLAS-2) | outer product (BLAS-2) |
|---|---|---|---|---|---|
| 10 | NaN | NaN | NaN | NaN | NaN |
| 64 | 0.616971 | 16.004692 | 8.814756 | 89.239061 | 33.350449 |
| 119 | -0.180635 | -0.140318 | 0.706092 | 4.662412 | 1.101401 |
| 173 | 0.043710 | 0.587631 | 0.197645 | 0.247788 | 0.277084 |
| 228 | 0.027008 | 1.106931 | 0.176956 | 0.247756 | 0.116405 |
| 282 | 0.093006 | 0.163923 | 0.170483 | 0.458019 | 0.070587 |
| 337 | -0.019637 | -0.155692 | 0.125735 | 0.225849 | 0.107446 |
| 391 | 0.002591 | 0.197471 | 0.023867 | 0.017188 | -0.310174 |
| 446 | -0.003157 | 0.046469 | 0.083673 | -0.050349 | 0.113661 |
| 500 | -0.003369 | 0.080390 | 0.130871 | 0.121688 | -0.015148 |
df_f_pct_change_ri = df_f_pct_change.reset_index()
plt.figure(figsize = (15, 10))
for i in df_f_pct_change_ri.columns:
    if i != 'n':
        plt.plot(df_f_pct_change_ri['n'], df_f_pct_change_ri[i], label = i)
plt.title('Percent Change in MFLOPS/s for different BLAS-type Operation Kernels')
plt.ylabel('fractional change (0.1 = 10%)')
plt.xlabel('n')
plt.yscale('log')
plt.legend()
plt.show()
From this, there is room for improvement: the saxpy and matrix-vector kernels still change by more than 10% between the last two sizes (n = 446 and n = 500), so the plateau has not yet been reached.
Another observation is that MATLAB behaves differently when the same kernels are implemented. It is unknown whether this is due to an implementation error or to MATLAB’s built-in matrix multiplication routines. The MATLAB implementation is shown here https://pastebin.com/TL93T9mU and the results are shown below.
from IPython.display import Image
Image(filename='blas_comparison_matlab.png')
In the MATLAB run time comparison, the expected BLAS run time hierarchy does not appear. This may be due to an error in the implementation, but the reason is still unknown.