# RL tips

Some notes on deep-learning-related RL knowledge. Background details are intentionally left out.

TODO

Continue working through the Medium article; there is also a good introduction series on Bilibili worth following.

### Math basics

Benefits of using softmax to get probabilities instead of standard normalization: it makes the largest score more dominant, so the preferred choice is easier to distinguish, and it works regardless of whether the raw values are positive or negative.

https://stackoverflow.com/questions/17187507/why-use-softmax-as-opposed-to-standard-normalization
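To see those properties concretely, here is a minimal stdlib-only comparison (the scores are made up for illustration):

```python
import math

def standard_normalize(scores):
    """Divide each score by the total; breaks down for negative values."""
    total = sum(scores)
    return [s / total for s in scores]

def softmax(scores):
    """Exponentiate (shifted by the max for numerical stability), then normalize."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 2.0, 3.0]
print(standard_normalize(scores))  # [0.167, 0.333, 0.5] -- gentle spread
print(softmax(scores))             # ~[0.09, 0.24, 0.67] -- top choice dominates

# softmax still yields a valid distribution when scores are negative,
# while standard normalization would divide by zero for [-1, 1]:
print(softmax([-1.0, 1.0]))
```

Note how softmax makes the highest score far more dominant than plain normalization does, which matches the point above.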


### value based vs policy based

Categories of RL approaches. Value-based methods (e.g., Q-learning, DQN) learn a value function and derive the policy from it; policy-based methods (e.g., REINFORCE) optimize the policy directly.

There is a good source about the different types of approaches.

### REINFORCE

A good reference with code:

https://medium.com/p/104c783251e0

### Bellman equation

This is a good tutorial

Core idea for understanding sum notation: the subscript under the sum symbol names the index variable being summed over, so the sum can be expanded term by term, e.g. the sum over i of x_i is x1 + x2 + x3 + …
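For reference, the standard Bellman expectation equation written with explicit sums, so the index under each sum symbol is visible (this is the textbook form, not taken from the tutorial above):

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)
             \left[ R(s, a, s') + \gamma \, V^{\pi}(s') \right]
```

The `a` under the first sum says we sum over all actions, and the `s'` under the second says we sum over all next states, which is exactly the expand-into-terms idea above.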

The difference between expectation and the mean, understood from two views: the expectation is a probability-weighted sum over possible outcomes, while the mean is the average of observed samples.
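A small stdlib-only sketch of the two views, using a fair six-sided die as a made-up example; the sample mean approaches the expectation as the number of samples grows:

```python
import random

# View 1: expectation = sum over outcomes of (value * probability)
outcomes = [1, 2, 3, 4, 5, 6]
expectation = sum(x * (1 / 6) for x in outcomes)   # 3.5

# View 2: mean = average of actual samples drawn from the distribution
rng = random.Random(0)
samples = [rng.randint(1, 6) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)          # close to 3.5

print(expectation, sample_mean)
```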

### What is the content of the policy

A policy can be represented in table format, symbol format, or parameterized format. A policy network is a special case of the parameterized format. From a research perspective, the parameterized format (the policy network) is what we want to investigate.

Table format and symbol format are used for small-scale problems: for each state, the table directly records which action to take, which is a typical form of policy.
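A table-format policy is literally a lookup table from state to action. A tiny sketch with a made-up 4-state problem:

```python
# Table-format policy: in each state, the table directly says what action to take.
policy_table = {
    "start":    "move_right",
    "corridor": "move_right",
    "door":     "open",
    "goal":     "stay",
}

def act(state):
    """Look up the action for the given state."""
    return policy_table[state]

print(act("door"))  # open
```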

### An example of Policy network

General framework, main components:

- environment (env)
- policy network (NN)
- loss function based on the RL policy
- optimizer

The skeleton of the task for the REINFORCE approach:

One game is a series of action-reward loops: each action yields one reward, so an episode produces a series of rewards. When the game is over, we use that series of rewards to estimate returns and update the policy (i.e., update the policy network's weights).

During training, we play multiple rounds of games (episodes).
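The loop above can be sketched end to end without any framework. This is a toy sketch under made-up assumptions: a stateless 5-step environment that gives reward +1 for action 0 and -1 for action 1, and a state-independent softmax policy with hand-derived gradients instead of a neural network (the implementations linked below use PyTorch for the real thing):

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over the episode."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

def run_episode(theta, rng, steps=5):
    """Toy environment: action 0 gets reward +1, action 1 gets reward -1."""
    actions, rewards = [], []
    for _ in range(steps):
        probs = softmax(theta)
        a = 0 if rng.random() < probs[0] else 1
        actions.append(a)
        rewards.append(1.0 if a == 0 else -1.0)
    return actions, rewards

def train(episodes=500, lr=0.05, gamma=0.9, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # policy parameters: one logit per action
    for _ in range(episodes):
        actions, rewards = run_episode(theta, rng)
        returns = discounted_returns(rewards, gamma)
        # REINFORCE update, applied once the episode is over:
        # grad of log softmax(theta)[a] w.r.t. theta[k] is 1[k == a] - pi(k)
        probs = softmax(theta)
        grad = [0.0, 0.0]
        for a, g in zip(actions, returns):
            for k in range(2):
                grad[k] += g * ((1.0 if k == a else 0.0) - probs[k])
        theta = [theta[k] + lr * grad[k] for k in range(2)]
    return theta

theta = train()
print(softmax(theta))  # probability of the rewarded action 0 approaches 1
```

The key structural points match the skeleton: rewards are collected step by step, turned into returns only when the episode ends, and the policy parameters are updated from that whole series.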

This is a simple one:

https://github.com/wangzhezhe/5MCST/blob/master/RL/REINFORCE_Carpole1.py

This one is structured more formally:

https://gymnasium.farama.org/tutorials/training_agents/reinforce_invpend_gym_v26/

Some tips on mapping math notation to code, plus some typical math notation in RL:

When you see the nabla (∇) operator, in PyTorch it simply corresponds to the backward pass: declare an optimizer and call `backward()` on the loss.
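A minimal sketch of that style, with a made-up scalar loss just to show the update sequence (assuming PyTorch is installed):

```python
import torch

# made-up parameter vector and loss, only to demonstrate the update style
theta = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([theta], lr=0.1)

loss = (theta ** 2).sum()   # L(theta) = theta_0^2 + theta_1^2
optimizer.zero_grad()       # clear gradients left over from any previous step
loss.backward()             # the "nabla": computes dL/dtheta = 2 * theta
optimizer.step()            # theta <- theta - lr * grad

print(theta)  # tensor([0.8000, 1.6000], requires_grad=True)
```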

The graph is constructed dynamically through the tensor operations, so when we call `backward()`, the compute graph is traversed and the gradients are computed automatically.

This video provides a really good explanation of how PyTorch's autograd works. Each tensor is not just a plain data structure; it also carries associated fields (e.g., `grad` and `grad_fn`) for storing gradient information.
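A quick sketch of those per-tensor gradient fields (assuming PyTorch; the function y is made up):

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x * x + 2 * x        # the graph is recorded as these ops run

y.backward()             # walk the recorded graph backwards

print(x.grad)            # dy/dx = 2x + 2 = 8 at x = 3 -> tensor(8.)
print(y.grad_fn)         # the graph node that produced y
```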