Some notes about deep learning related knowledge. Do not list details about the background information.
TODO
Continue learning from the Medium article; there is also a good introduction series on bilibili, try to work through that series as well.
Math basics
Benefits of using softmax to get the probability instead of standard normalization: it is easier to distinguish options, it makes a specific choice more significant or dominant, and it does not care whether values are positive or negative (see the short numeric sketch after the links below).
https://stackoverflow.com/questions/17187507/why-use-softmax-as-opposed-to-standard-normalization
also this
https://www.youtube.com/watch?v=KpKog-L9veg
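A minimal numeric sketch of the difference (the scores array is made up for illustration; assumes NumPy):
import numpy as np

scores = np.array([1.0, 2.0, 8.0])               # made-up raw scores
standard = scores / scores.sum()                 # standard normalization -> [0.09, 0.18, 0.73]
softmax = np.exp(scores) / np.exp(scores).sum()  # softmax -> [0.001, 0.002, 0.997]
# softmax makes the largest score clearly dominant, and it still works when
# scores contain negative values (e.g. [-1.0, 0.0, 2.0]), where dividing by
# the raw sum can give meaningless or negative "probabilities".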
value based vs policy based
Category of RL approach
A good source about the different types of approaches:
https://towardsdatascience.com/policy-gradient-methods-104c783251e0
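A compact way to state the distinction (standard textbook notation, not taken from the article above):
value-based: learn a value function $Q(s, a)$, then act by $\pi(s) = \arg\max_a Q(s, a)$
policy-based: learn a parameterized policy $\pi_\theta(a \mid s)$ directly, by maximizing the expected return $J(\theta) = \mathbb{E}_{\pi_\theta}[G]$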
REINFORCE
A good reference with code:
https://medium.com/p/104c783251e0
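For reference, the gradient estimator that REINFORCE is built on (one standard form, where $G_t$ is the return collected from time step $t$ onward):
$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_t G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big]$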
Bellman equation
https://www.youtube.com/watch?v=9JZID-h6ZJ0
This is a good tutorial
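For reference, the Bellman expectation equation for the state-value function under a policy $\pi$ (standard form, with discount factor $\gamma$):
$V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,\big[\,R(s, a, s') + \gamma V^\pi(s')\,\big]$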
Core idea for understanding sum notation: the index written under the sum symbol tells which variable is being summed over, and the sum can be expanded into x1, x2, x3, …
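A tiny worked example of that notation:
$\sum_{i=1}^{3} x_i = x_1 + x_2 + x_3$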
The difference between the expectation and the mean, and how to understand it from two views.
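One common reading of the two views (standard definitions, assuming this is what the note refers to): the expectation weights every possible outcome by its probability, while the sample mean averages the values actually observed; by the law of large numbers the sample mean approaches the expectation as the number of samples grows.
$\mathbb{E}[X] = \sum_x x\,p(x) \qquad \text{vs.} \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$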
What is the content of the policy
The general form of the policy can be a table format, a symbol (rule) format, or a parameterized format. A policy network is a special case of the parameterized format. From a research perspective, the parameterized format, i.e. the policy network, is what we want to investigate.
The table format and the symbol format are used to solve small-scale problems, for example a direct lookup of which action to take in which state; this is a typical form of a policy.
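A minimal sketch of a table-format policy (the states and actions here are made up for illustration):
# a table-format policy is just a lookup table from state to action
table_policy = {
    "see_wall": "turn_left",
    "see_open_path": "move_forward",
    "see_exit": "move_to_exit",
}

def act(state):
    return table_policy[state]

print(act("see_open_path"))  # move_forward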
The following is an example of the symbol format:
Suppose there is a maze environment in which the agent can move up, down, left, or right, but cannot pass through walls. The agent's goal is to find the exit of the maze. We can represent the agent's policy with a set of rules:
If the right side is open, move right.
If the right side is a wall but the front is open, move forward.
If the right side and the front are both walls but the left side is open, move left.
If the right side, the front, and the left side are all walls, the only option is to move backward.
These rules form a simple symbol-format policy that guides the agent's action selection in the maze.
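A minimal sketch of those rules as code (the boolean observation format is an assumption made for illustration):
def maze_policy(right_open, front_open, left_open):
    # symbol (rule) format policy: a fixed set of if/else rules
    if right_open:
        return "right"
    if front_open:
        return "forward"
    if left_open:
        return "left"
    return "backward"

print(maze_policy(right_open=False, front_open=True, left_open=False))  # forward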
An example of a policy network
General framework
Main components (a minimal PyTorch sketch follows this list):
env
NN
Loss function based on RL policy
Optimizer
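A minimal sketch of these four components, assuming PyTorch and the gymnasium CartPole-v1 environment (the layer sizes and learning rate are arbitrary choices for illustration):
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")                                   # env

policy_net = nn.Sequential(                                     # NN: state -> action probabilities
    nn.Linear(env.observation_space.shape[0], 64),
    nn.ReLU(),
    nn.Linear(64, env.action_space.n),
    nn.Softmax(dim=-1),
)

optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)  # optimizer

# loss function based on the RL policy (REINFORCE style):
# loss = -sum_t log pi(a_t | s_t) * G_t, computed after each episode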
The skeleton of the task for the REINFORCE approach:
One game (episode) contains a series of action-reward loops. We get one reward for each action, so there is a series of rewards; when one game is over, we use this series of rewards to estimate the policy gradient and update the policy network (i.e. update the policy network's weights).
During the training process, there are multiple rounds of games (episodes).
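A minimal sketch of that skeleton (it reuses env, policy_net, and optimizer from the sketch in the General framework part above, and is a simplified illustration rather than the exact code in the linked scripts):
import torch

gamma = 0.99  # discount factor, an arbitrary choice for illustration

for episode in range(500):                              # multiple rounds of games (episodes)
    state, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:                                      # one game: a series of action-reward loops
        probs = policy_net(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # when one game is over, turn the series of rewards into discounted returns G_t
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # policy-gradient loss; update the policy network's weights once per game
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()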
A simple implementation:
https://github.com/wangzhezhe/5MCST/blob/master/RL/REINFORCE_Carpole1.py
Also look at this one; the code is organized in a more formal way:
https://gymnasium.farama.org/tutorials/training_agents/reinforce_invpend_gym_v26/
Some tips about mapping the math notation to programming code, and some typical math notation in RL.
When you see the nabla operator (∇), in PyTorch it simply corresponds to the backward pass: declare an optimizer and call backward in this style:
self.optimizer.zero_grad()
The graph is constructed dynamically through the tensor operations, so when we call backward, the compute graph is traversed automatically to compute the gradients.
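Putting the pieces together, a minimal sketch of one full update step (assuming a scalar loss tensor named loss and an optimizer stored as self.optimizer, as in the line above):
self.optimizer.zero_grad()   # clear gradients left over from the previous step
loss.backward()              # walk the dynamically built graph and fill in the .grad fields
self.optimizer.step()        # apply the gradients to update the network weights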
This video provides a really good explanation of how PyTorch autograd works. Each tensor is not simply a plain data structure; it also carries associated fields for storing gradient values.
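A tiny example of that idea (standard PyTorch behavior):
import torch

x = torch.tensor(2.0, requires_grad=True)   # this tensor also tracks gradient information
y = x ** 2                                   # builds the compute graph dynamically
y.backward()                                 # executes the graph backward
print(x.grad)                                # tensor(4.) -- dy/dx = 2x evaluated at x = 2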