Deep learning notes

Some notes about deep-learning-related knowledge.

There is a strong feeling that if you do not actively learn and use these new things, you will be left behind quickly, fail in funding applications, and find it hard to get a good job.

Instead of considering why I need to learn this and how it can be used in my own work, a practical attitude might be to simply learn it first.

The main index is this series of YouTube videos.

Linear transformation and non-linear transformation

In linear algebra we use matrix operations a lot; they can be viewed as scaling, rotating, and shearing of the basis vectors (linear transformations). In deep learning, however, there are also many non-linear transformations, so there is a wider variety of operations and associated functions, such as ReLU. This helps with recognizing more complex patterns.

Typical NN

This blog provides a good illustration with different colors (https://www.jeremyjordan.me/intro-to-neural-networks/).

We do not go through the most basic concepts of neural networks; we just put a simple mathematical expression here. Assuming there are 4 inputs and 3 outputs, it is essentially a matrix operation. The weight matrix should be 3 by 4: the row index indicates which output a weight belongs to, and the column index matches the input value. It describes how the input values contribute to each output.

$$
\begin{pmatrix}
w_{11} & w_{12} & w_{13} & w_{14} \\w_{21} & w_{22} & w_{23} & w_{24} \\w_{31} & w_{32} & w_{33} & w_{34}
\end{pmatrix}
\begin{pmatrix}
x_{1} \\x_{2} \\x_{3} \\x_{4}
\end{pmatrix}
+
\begin{pmatrix}
b_{1} \\b_{2} \\b_{3}
\end{pmatrix}
=
\begin{pmatrix}
z_{1} \\z_{2} \\z_{3}
\end{pmatrix}
,
\begin{pmatrix}
w_{11} & w_{12} & w_{13} & w_{14} \\w_{21} & w_{22} & w_{23} & w_{24} \\w_{31} & w_{32} & w_{33} & w_{34}
\end{pmatrix}
=
\begin{pmatrix}
w_{1}^T \\w_{2}^T \\w_{3}^T
\end{pmatrix}
$$

The vector b represents the bias, and the values in the weight matrix are the weights. This is the most fundamental operation. If there are multiple hidden layers, we just use the z vector as the input, execute a similar matrix multiplication, and add a new bias vector.
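As a quick illustration, the 4-input, 3-output layer above can be written as a few lines of NumPy (the weight and bias values here are random placeholders):

```python
import numpy as np

W = np.random.randn(3, 4)   # one row of weights per output, one column per input
b = np.random.randn(3)      # one bias value per output
x = np.random.randn(4)      # the input vector

z = W @ x + b               # shape (3,), the z vector in the equation above
print(z)
```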

Another important operation is the activation function, which is applied to each value in the vector. For example, after we go through the first layer above, we can apply the activation function to each element of the vector; a typical choice is the ReLU function, so each value in the z vector becomes ReLU(z) after this layer. ReLU is one kind of activation function: ReLU(x) = max(0, x). It is a non-linear transformation; when the input is larger than zero it returns the input, otherwise it returns zero.

The output layer usually generates one decision. ArgMax or SoftMax is applied to a vector to interpret the results; both take a vector as input and produce a vector as output. This is a good video visualizing the SoftMax results (https://www.youtube.com/watch?v=ytbYRIN0N4g) (this video is really cool and has a lot of visualizations for the math); the idea is to turn the outputs into probability values.

When looking at the typical NN graph, it is helpful to understand it from a mathematical perspective. It is important to consider what the input and output of each operator are.

Gradient descent approach

The core motivation of gradient descent is to turn the learning problem into an optimization problem. There are all kinds of online tutorials on gradient descent; the idea is that we find the minimum of a function iteratively. Figuring out the details of gradient descent is helpful for understanding backward propagation.

(1) List the function we need to optimize, and what its input and output are.
(2) Compute the gradient with respect to the input, which can be a vector with multiple values.
(3) Set the starting point of the input, set the learning rate, and compute the next input value.
(4) Use the new input to compute the output value. If the output starts to converge (the difference is less than a threshold) or we reach the maximum iteration number, stop the iteration; otherwise, continue iterating. (A minimal sketch of these steps is shown below.)
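A minimal sketch of these four steps in Python, minimizing a made-up toy function f(w) = (w - 3)^2 whose minimum we know is at w = 3:

```python
def f(w):
    return (w - 3.0) ** 2        # (1) the function we need to optimize

def grad_f(w):
    return 2.0 * (w - 3.0)       # (2) its gradient with respect to the input

w = 0.0                          # (3) starting point
lr = 0.1                         #     learning rate
prev = f(w)
for step in range(1000):         # (4) iterate until convergence or max iterations
    w = w - lr * grad_f(w)
    cur = f(w)
    if abs(prev - cur) < 1e-8:   # the output has converged
        break
    prev = cur

print(w)                         # ends up close to 3.0
```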

Backward propagation

It is straightforward to understand how we map the input to the output when the weight matrix of each layer is known: we just use the input to compute the output (forward propagation). In real examples, we need to compute or train the weight matrices knowing the input and output, and this requires a loss function and backward propagation. Simply speaking, the loss function is just the error between the predicted results and the actual results.

A loss function defines the difference between the accurate value and the predicted value. A typical loss is MSE, the mean squared error. Cross entropy is another common loss function, used for classification tasks. No matter which form of loss function we use, it measures the difference between the predicted value and the accurate value, and the goal is to minimize it.
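Both losses are available directly in PyTorch; a small sketch (the tensor values here are made-up examples):

```python
import torch
import torch.nn as nn

# Mean squared error for a regression-style output
pred = torch.tensor([2.5, 0.0, 2.1])
target = torch.tensor([3.0, -0.5, 2.0])
print(nn.MSELoss()(pred, target))         # mean of the squared differences

# Cross entropy for a classification task: raw scores (logits) for 3 classes
logits = torch.tensor([[1.2, 0.3, -0.5]])
label = torch.tensor([0])                 # index of the true class
print(nn.CrossEntropyLoss()(logits, label))
```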

Furthermore, we want to minimize this error, i.e. the loss function. The general way to do this minimization is the gradient descent approach described above.

This video is inspiring and provides a really good explanation of backward propagation, using a one-input, one-weight case. Assume the NN is simple: there is one layer and one input x (known), the output is $\hat{y}$, and the accurate value is y (known). The loss function is simple:
$$
L = (\hat{y}-y)^2 = (w \cdot x-y)^2
$$
We want to find the gradient of the loss function with respect to w, which is the unknown parameter. Based on the chain rule and the gradient descent approach listed above, we first write out the gradient equation:

$$
\nabla L(w) = \frac {\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w} = 2(\hat{y}-y) \cdot x = 2x(w \cdot x - y)
$$
Then, assuming the initial value of w is $w^0$, we get the w value for the next iteration based on the learning rate r:

$$
w^1 = w^0 - r \cdot \nabla L(w^0) = w^0 - r \cdot 2x\cdot(w^0 \cdot x - y)
$$
Then we can get a new predicted value $\hat{y}^1 = w^1 \cdot x$ and a new loss value $L^1 = (\hat{y}^1-y)^2$. When the difference between $L^i$ and $L^{(i-1)}$ is less than a threshold, or we reach the maximum iteration number, we can say that we have found a good w value.
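As a sanity check, the iteration above can be written out in a few lines of Python (the values of x, y, the initial w, and the learning rate are made up for illustration):

```python
x, y = 2.0, 10.0      # known input and known accurate value
w = 0.5               # initial guess w^0
r = 0.05              # learning rate

for i in range(100):
    grad = 2 * x * (w * x - y)   # dL/dw from the chain rule above
    w = w - r * grad             # gradient descent update

print(w, w * x)                  # w approaches y / x = 5, the prediction approaches 10
```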

This is probably the simplest case of backward propagation. We can extend the number of layers or the number of inputs/outputs at each layer to make the equations more complicated, but the ideas are the same.

Gradient descent is only the most basic optimizer; there are all kinds of optimizer options for deep learning problems.

This video provides a good from-scratch example of backward propagation.

The definition of epoch and batch

Be careful about the concept of an epoch. Assume we have 2 data samples. We use sample 1 to do the forward propagation, get the predicted results, and use gradient descent and the chain rule to do the backward propagation and update the weights. Then we send sample 2's data through the updated weights and do the forward and backward propagation again. We finish one epoch after all of these operations (we have executed forward and backward propagation for all data samples). We need many epochs to train a good neural network. According to this video, in summary, one forward pass and one backward pass over all training samples is called one epoch.

Epoch and batch are two common parameters. Within one epoch, if the number of samples is too large to fit on the device at once, we can divide them into small partitions; each partition is called one batch, and its size is the batch size. If the batch size is 64, it means we use 64 samples at a time for one forward and backward pass.

The total number of iterations in network training is the number of epochs times the number of batches per epoch. If we have one epoch and 1000 samples, and the batch size is 500, we need two iterations to do the forward and backward passes. The reason we set the batch size larger than 1 is that the computer can process multiple data samples in parallel during the forward and backward passes. A larger batch size can improve training speed, but a batch size that is too large is not good either, so there is a tradeoff.
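A small sketch of how epoch, batch, and iteration show up in a PyTorch data loader (the toy data and the batch size of 500 come from the example numbers above):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

X = torch.randn(1000, 4)                 # 1000 samples, 4 input values each
Y = torch.randn(1000, 1)
loader = DataLoader(TensorDataset(X, Y), batch_size=500, shuffle=True)

for epoch in range(1):                   # one epoch = one pass over all samples
    for batch_x, batch_y in loader:      # 1000 samples / batch size 500 -> 2 iterations
        pass                             # forward pass, loss, backward pass, weight update go here
```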

Using dynamic programming to find the gradient

There is an m-to-n relationship between the input and output in a real NN, and there are usually multiple layers. When we compute the gradient, we may need to compute the Jacobian matrix; look at these slides for more details. The general idea is very similar to dynamic programming: to compute $\frac{\partial L_i}{\partial w_j}$, we know which nodes in the previous layer contribute to this value, we can build a graph to show this, and we can compute the associated values at the first layer easily, reusing intermediate results instead of recomputing them.
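PyTorch can compute such a Jacobian directly from a forward function; a small sketch for a single 3-input, 2-output linear layer (the matrix values are arbitrary):

```python
import torch

def layer(x):
    W = torch.tensor([[1.0, 2.0, 0.0],
                      [0.0, 1.0, 3.0]])
    return W @ x                                     # 3 inputs -> 2 outputs

x = torch.tensor([1.0, 2.0, 3.0])
J = torch.autograd.functional.jacobian(layer, x)
print(J)                                             # a 2 x 3 Jacobian; here it is simply W
```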

Activation function

A typical activation function is ReLU, which introduces a non-linear property to the network and allows it to handle more complicated problems. The non-linearity of the activation function does not mean it has no derivative; we can still compute its derivative (almost everywhere) during backpropagation. Check this question.
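A small sketch of ReLU and the derivative used during backpropagation (by convention the derivative at exactly 0 is set to 0 here):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # 0 for x < 0, 1 for x > 0; ReLU is not differentiable at the single point x = 0,
    # so a convention (0 here) is used in practice
    return (x > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(z))        # [0.  0.  0.  1.5 3. ]
print(relu_grad(z))   # [0. 0. 0. 1. 1.]
```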

A simple PyTorch example

It is easier to look at the associated simple Python code after figuring out these details.

Let’s look at the example shown in this video

Before using PyTorch to program it, it is good to express it clearly mathematically. We use $W1$ and $W2$ to denote the weights of the two layers. For the first layer:
$$
\begin{pmatrix}
W1_{00} \\W1_{10}
\end{pmatrix}
\cdot I
+
\begin{pmatrix}
b1_{0} \\b1_{1}
\end{pmatrix}
=
\begin{pmatrix}
Z1_0 \\Z1_1
\end{pmatrix}
$$
The first subscript is the index of the output; there is one input value, so the matrix has one column. Then a ReLU function is applied to the output, and the second layer can be described as
$$
(W2_{00},W2_{01}) \cdot
\begin{pmatrix}
ReLu(Z1_{0}) \\ ReLu(Z1_{1})
\end{pmatrix} = Z2
$$
which gives the final output.

After figuring out these details, it is easy to understand the associated PyTorch example.

The associated code example can be found here. What we need to do is provide a forward function, and the PyTorch library will execute the backward pass and the training automatically. Be careful about properties such as requires_grad=True (marking the parameters that need to be optimized).

Several standard components of a deep learning program in PyTorch: 1 data loader, 2 description of the network, 3 loss function, 4 optimizer, 5 training loop (describing the high level of the whole process).
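A hedged skeleton of these five components, using a tiny 1-input, 2-hidden-unit, 1-output network like the one in the equations above (the toy data and hyperparameters are made up for illustration):

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

# 1. Data loader (toy regression data)
X = torch.linspace(0.0, 1.0, steps=100).unsqueeze(1)
Y = torch.sin(3.0 * X)
loader = DataLoader(TensorDataset(X, Y), batch_size=20, shuffle=True)

# 2. Description of the network: 1 input -> 2 hidden units with ReLU -> 1 output
class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(1, 2)   # W1 and b1 from the equations above
        self.layer2 = nn.Linear(2, 1)   # W2 (plus a bias)

    def forward(self, x):
        return self.layer2(torch.relu(self.layer1(x)))

model = SmallNet()

# 3. Loss function
loss_fn = nn.MSELoss()

# 4. Optimizer (plain stochastic gradient descent)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# 5. Training loop: forward pass, loss, backward pass, weight update
for epoch in range(50):
    for batch_x, batch_y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(batch_x), batch_y)
        loss.backward()
        optimizer.step()
```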

ArgMax and SoftMax

These functions are used to process the output of the NN. ArgMax sets the largest output value to 1 and all other values to 0, so only the winner is kept. SoftMax uses a specific function (the exponential) to map the outputs into values between 0 and 1 that sum to 1, so there is a probability for each output and all of the smaller values are also preserved.
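A small sketch of the difference (the score values are made up):

```python
import torch

scores = torch.tensor([1.5, -0.3, 0.8])    # raw outputs of the last layer

probs = torch.softmax(scores, dim=0)        # values in (0, 1) that sum to 1
winner = torch.argmax(scores)               # index of the largest score

print(probs)    # roughly [0.60, 0.10, 0.30] -- the smaller values are preserved
print(winner)   # tensor(0) -- everything except the winner is discarded
```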

Gradient descent

CNN

Convolutional neural networks are used a lot for image-related tasks, such as image classification.

Why is the convolutional operation helpful?

  • It reduces the number of input nodes (input variables for the NN). Applying the convolutional kernel can be viewed as a kind of downsampling process.
  • It tolerates small shifts of the image.
  • It utilizes the correlation between neighboring pixels.

In a CNN, a filter (kernel) is a small square, such as a 3 by 3 patch of pixels. Before training a CNN, we start with random values in the kernel; after training with backward propagation, we end up with something more meaningful.

Mathematically, we mainly care about the discrete convolution operation in CNNs; here are some good animations about it. This is the equation for the 1d and 2d convolution operation. Be careful about the offset when multiplying the two functions to compute the convolution. Some online tutorials just say that we multiply the values in the kernel with the corresponding values in the original matrix. Actually, according to this video, in a true convolution the kernel is flipped (in both the row and column directions) before it is applied; sliding the kernel without flipping is cross-correlation. This is a good explanation of convolution in 2d and 1d.

1d convolution operation vs 2d convolution operation
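A small NumPy sketch of the difference between a true 1d convolution (kernel flipped) and cross-correlation (kernel not flipped), which is what most deep learning "convolution" layers actually compute:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([1.0, 0.0, -1.0])

# True discrete convolution: the kernel is flipped before sliding it over the input
conv = np.convolve(x, k, mode="valid")

# Cross-correlation: the kernel is slid over the input without flipping
corr = np.array([np.dot(x[i:i + 3], k) for i in range(len(x) - 2)])

print(conv)   # [2. 2.]
print(corr)   # [-2. -2.] -- differs from the convolution unless the kernel is symmetric
```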

Coming back to the CNN, there are multiple ways of applying the kernel to the original input data.
There are some related terms: if we apply the kernel starting from each position of the original matrix, so the regions covered by neighboring kernel placements overlap, the output is called a feature map. Pooling then uses non-overlapping windows on the feature map: if the window just keeps the element with the largest value, the operation is called max pooling; if the window just averages the values in the selected area of the feature map, the operation is called average pooling. (A small example is sketched below.)
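A small sketch of max pooling and average pooling on a made-up 4 by 4 feature map, using non-overlapping 2 by 2 windows:

```python
import torch
import torch.nn as nn

feature_map = torch.arange(16.0).reshape(1, 1, 4, 4)   # (batch, channel, height, width)

max_pool = nn.MaxPool2d(kernel_size=2)   # keep the largest value in each 2x2 window
avg_pool = nn.AvgPool2d(kernel_size=2)   # keep the average of each 2x2 window

print(max_pool(feature_map))   # [[ 5.,  7.], [13., 15.]]
print(avg_pool(feature_map))   # [[ 2.5,  4.5], [10.5, 12.5]]
```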

This video provides an example of a CNN that is easy to understand.

Transfer learning

How to load an existing model and then adjust its parameters on a small scale to make it fit your own, different task.
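A hedged sketch of this idea in PyTorch/torchvision: load a pretrained ResNet-18, freeze its weights, and replace only the final layer for a new task (the number of classes here is hypothetical):

```python
import torch.nn as nn
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1")   # pretrained weights (argument name in recent torchvision)
for p in model.parameters():
    p.requires_grad = False                        # keep the pretrained weights fixed

num_classes = 5                                    # hypothetical number of classes in the new task
model.fc = nn.Linear(model.fc.in_features, num_classes)   # only this new layer will be trained
```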

RNN and LSTM

Others that need to be studied

ResNet,
Seq2Seq,

Transformer

Vision Transformer vs CNN

Is it more efficient to use a Transformer compared with a CNN?

References

Good online courses on bilibili

https://www.bilibili.com/video/BV1zF411V7xu/?spm_id_from=333.337.search-card.all.click

Recommended articles