Some notes about deep-learning-related knowledge. Do not list details about the background information.

TODO

Continue learning from the Medium articles; there is also a good introduction series on Bilibili that is worth following.

### Math basics

Benefits of using softmax to get the probability instead of standard normalization: it makes things easier to distinguish, it makes a specific choice more significant or dominant, and it does not care whether the values are positive or negative.

https://stackoverflow.com/questions/17187507/why-use-softmax-as-opposed-to-standard-normalization

also this

https://www.youtube.com/watch?v=KpKog-L9veg
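The contrast can be seen in a small sketch (plain Python; the scores and function names are my own):

```python
import math

def softmax(xs):
    # subtract the max for numerical stability before exponentiating
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def standard_normalize(xs):
    # divide each value by the sum; breaks for negative values or zero sums
    total = sum(xs)
    return [x / total for x in xs]

scores = [1.0, 2.0, 4.0]
# softmax makes the largest score clearly dominant
print([round(p, 3) for p in softmax(scores)])
# standard normalization keeps the ratios linear
print([round(p, 3) for p in standard_normalize(scores)])
# softmax still yields a valid distribution for negative inputs,
# where standard normalization would produce negative "probabilities"
print([round(p, 3) for p in softmax([-2.0, 0.0, 2.0])])
```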

### value based vs policy based

Categories of RL approaches.

A good source about the different types of approaches:

https://towardsdatascience.com/policy-gradient-methods-104c783251e0

### REINFORCE

A good reference with code:

https://medium.com/p/104c783251e0

### Bellman equation

https://www.youtube.com/watch?v=9JZID-h6ZJ0

This is a good tutorial

Core idea for understanding sum notation: the subscript under the sum symbol indicates which variable is being summed over; the sum can be expanded as x1 + x2 + x3 + …

The difference between the expectation and the mean, and how to understand it from two views.
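One way to see the two views in code (a sketch with made-up numbers): the expectation weights each value by its probability, while the sample mean averages many draws and converges to the expectation.

```python
import random

values = [1, 2, 3]
probs = [0.2, 0.3, 0.5]

# view 1: expectation computed directly from the distribution
expectation = sum(v * p for v, p in zip(values, probs))
print(expectation)  # 0.2*1 + 0.3*2 + 0.5*3 = 2.3

# view 2: the empirical mean of many samples approaches the expectation
random.seed(0)
samples = random.choices(values, weights=probs, k=100_000)
print(sum(samples) / len(samples))
```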

### What is the content of the policy

The general form of a policy can be a table format, a symbolic format, or a parameterized format. The policy network is a special case of the parameterized format. From a research perspective, the parameterized format (the policy network) is what we want to investigate.

The table format and symbolic format are used to solve small-scale problems, for example: in a given state, take a given action. This is a typical form of policy.

Below is an example of the symbolic format:

Suppose there is a maze environment in which the agent can move up, down, left, or right, but cannot pass through walls. The agent's goal is to find the exit of the maze. We can express the agent's policy as a set of rules:

If the right side is open, move right.

If the right side is a wall but the front is open, move forward.

If both the right side and the front are walls but the left side is open, move left.

If the right, front, and left are all walls, move backward.

These rules form a simple symbolic policy that guides the agent's choice of actions in the maze.
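The wall-following rules above could be encoded directly as a function (a hypothetical sketch; the flag names are mine):

```python
def maze_policy(right_open, front_open, left_open):
    # the four rules, checked in priority order: right, forward, left, back
    if right_open:
        return "right"
    if front_open:
        return "forward"
    if left_open:
        return "left"
    return "back"

# right side is a wall, front is open -> move forward
print(maze_policy(right_open=False, front_open=True, left_open=True))  # forward
```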

### An example of Policy network

General framework

Main components:

env

NN

Loss function based on RL policy

Optimizer

The skeleton of the task for the REINFORCE approach:

One game consists of a series of action-reward loops. We get one reward for each action, so there is a series of rewards; when the game is over, we use this series of rewards to update the policy (i.e., update the policy network's weights).

During the training process, there are multiple rounds of games (episodes).
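The per-episode reward bookkeeping described above can be sketched as a discounted-return computation (plain Python; gamma and the reward list are illustrative):

```python
def discounted_returns(rewards, gamma=0.99):
    # walk the episode's rewards backwards: G_t = r_t + gamma * G_{t+1}
    g = 0.0
    returns = []
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

# three steps, one reward per action, collected until the game is over
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

These per-step returns are what weight the log-probabilities of the chosen actions when the policy network is updated at the end of the episode.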

This is a simple one:

https://github.com/wangzhezhe/5MCST/blob/master/RL/REINFORCE_Carpole1.py

This one is also worth looking at; its code is organized in a more formal way:

https://gymnasium.farama.org/tutorials/training_agents/reinforce_invpend_gym_v26/

Some tips about mapping math notation into programming, and some typical math notation in RL:

When you see the nabla operator in a formula, in PyTorch it usually just means the backward operation: declare an optimizer and call backward in this style:

self.optimizer.zero_grad()

The graph is constructed dynamically through the tensor operations, so when we call backward, the compute graph is traversed automatically.

This video provides a really good explanation of how PyTorch's autograd works. Each tensor is not simply a plain data structure; it also carries associated fields for storing gradient values.
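As a concrete sketch of that zero_grad/backward/step pattern (assuming PyTorch is installed; the tiny linear model and loss are illustrative, not from the REINFORCE code above):

```python
import torch

# a throwaway model and optimizer just to show the update pattern
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(1, 4)
loss = model(x).sum()  # building the loss also builds the compute graph

optimizer.zero_grad()  # clear gradients left over from the previous step
loss.backward()        # autograd traverses the dynamically built graph
optimizer.step()       # apply the gradient update to the weights

# each parameter tensor now carries its gradient in the .grad field
print(model.weight.grad.shape)  # torch.Size([2, 4])
```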