1. Introduction

ushin20

2023. 5. 1. 16:00

1. Types of Machine Learning(ML)

2. Sequential decision making

Each time step t:

Agent takes an action $a_{t}$
Environment updates to a new state and emits an observation $o_{t}$ and a reward $r_{t}$
Agent receives $o_{t}$ and $r_{t}$
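
The three steps above can be sketched as a simple loop. This is only an illustration under assumed names: `CoinFlipEnv`, `run_episode`, and `repeat_last_observation` are hypothetical and do not come from any library.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the agent guesses a coin flip."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def step(self, action):
        # The environment updates and emits an observation o_t and a reward r_t.
        coin = self.rng.choice([0, 1])
        reward = 1.0 if action == coin else 0.0
        return coin, reward


def run_episode(env, policy, num_steps):
    history = []  # list of (a_t, o_t, r_t) triples
    for _ in range(num_steps):
        action = policy(history)                # agent takes an action a_t
        observation, reward = env.step(action)  # environment emits o_t and r_t
        history.append((action, observation, reward))  # agent receives them
    return history


def repeat_last_observation(history):
    # Hypothetical policy: guess whatever was observed last (0 on the first step).
    return history[-1][1] if history else 0


history = run_episode(CoinFlipEnv(seed=0), repeat_last_observation, num_steps=5)
```

The policy here reads the whole list of past triples, which is the most general case; a real agent would usually compress it into a smaller state.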

3. History

$h_{t} = (a_{1}, o_{1}, r_{1}, \ldots, a_{t}, o_{t}, r_{t})$

4. World state

Distinct from the agent state: this is the true state of the real world.

5. Agent state

$s_{t} = f(h_{t})$, where $h_{t} = (a_{1}, o_{1}, r_{1}, \ldots, a_{t}, o_{t}, r_{t})$

The current state is determined from the history.

6. Markov Assumption

$p(s_{t+1}|s_{t}, a_{t}) = p(s_{t+1}|h_{t}, a_{t})$

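This simplifies the history: by treating $s_{t}$ as a stand-in for $h_{t}$, the next state is determined by looking only at the immediately preceding state.

Depending on how many previous states are referenced, this becomes an $n^{\text{th}}$-order Markov assumption.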

7. Markov Process(Chain)

The goal is to find a steady-state distribution.
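
As a rough sketch of what finding a steady-state distribution can look like, the code below repeatedly applies a transition matrix to a distribution until it stops changing. The function names and the two-state matrix `T` are made up for illustration, and the approach assumes the chain actually converges to a unique steady state (for example, it is irreducible and aperiodic).

```python
def step_distribution(dist, T):
    """One step of the chain: new_dist[j] = sum_i dist[i] * T[i][j]."""
    n = len(T)
    return [sum(dist[i] * T[i][j] for i in range(n)) for j in range(n)]


def steady_state(T, tol=1e-12, max_iters=10_000):
    """Iterate from the uniform distribution until successive steps agree."""
    n = len(T)
    dist = [1.0 / n] * n
    for _ in range(max_iters):
        new_dist = step_distribution(dist, T)
        if max(abs(a - b) for a, b in zip(dist, new_dist)) < tol:
            return new_dist
        dist = new_dist
    return dist


# Hypothetical two-state chain; row i holds the probabilities of leaving state i.
T = [[0.9, 0.1],
     [0.5, 0.5]]
pi = steady_state(T)  # approximately [5/6, 1/6]
```

The result satisfies $\pi = \pi T$: applying the chain one more time leaves the distribution unchanged.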

8. Markov Reward Process, MRP

A Markov chain with rewards added.

It is described by the tuple (S, T, R, $\gamma$): states, transition model, reward function, and discount factor, with $0 < \gamma < 1$.

9. Markov Decision Process, MDP

An MRP with actions added.

It is described by the tuple (S, A, T, R, $\gamma$).
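
One possible way to hold the tuple (S, A, T, R, $\gamma$) in code is sketched below. The `MDP` class, its field layout, and the battery-style example are all assumptions made for illustration, not a standard API.

```python
import random
from dataclasses import dataclass


@dataclass
class MDP:
    states: list       # S
    actions: list      # A
    transitions: dict  # T[(s, a)] -> {next_state: probability}
    rewards: dict      # R[(s, a)] -> immediate reward
    gamma: float       # discount factor

    def sample_next_state(self, state, action, rng):
        # Draw s_{t+1} from p(s_{t+1} | s_t, a_t).
        row = self.transitions[(state, action)]
        return rng.choices(list(row.keys()), weights=list(row.values()), k=1)[0]


# Hypothetical example with two states and two actions.
mdp = MDP(
    states=["low", "high"],
    actions=["wait", "charge"],
    transitions={
        ("low", "wait"): {"low": 1.0},
        ("low", "charge"): {"high": 0.8, "low": 0.2},
        ("high", "wait"): {"high": 0.7, "low": 0.3},
        ("high", "charge"): {"high": 1.0},
    },
    rewards={
        ("low", "wait"): 0.0,
        ("low", "charge"): -1.0,
        ("high", "wait"): 2.0,
        ("high", "charge"): 0.0,
    },
    gamma=0.9,
)

next_state = mdp.sample_next_state("low", "charge", random.Random(0))
```

Dropping `actions` and indexing `transitions` and `rewards` by state alone would give the corresponding MRP.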

10. Full vs Partial Observability MDP

Fully observable MDP	$o_{t} = s_{t}$
Partially observable MDP (POMDP)	World state != Agent state

11. Elements of RL algorithm

Model	How the world changes in response to the agent's actions
- Transition model: predicts the agent's next state, $p(s_{t+1} \mid s_{t}, a_{t})$
- Reward model: predicts the immediate reward, $r(s_{t}, a_{t}, s_{t+1})$
Policy	The action the agent takes in a given state
- Deterministic policy: $\pi(s) = a$
- Stochastic policy: $\pi(a \mid s) = p(a_{t} = a \mid s_{t} = s)$
Value function	The expected discounted sum of future rewards given the agent's state and action
- Used to judge how good a state or action is
- A policy is judged through the sum of its values V
$V^{\pi}(s_{t}) = E_{\pi}[r_{t}+\gamma r_{t+1}+\gamma^{2} r_{t+2}+\ldots \mid s_{t}]$

12. Return

$G = R(s_{0}, s_{1}, \ldots) = r(s_{0})+\gamma r(s_{1})+\gamma^{2}r(s_{2})+\ldots$

If the rewards are bounded and $\gamma < 1$, the return is bounded as well.

As $\gamma \to 1$, the return can become infinite.
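
To make the boundedness claim concrete: if every reward satisfies $|r| \le r_{\max}$ and $\gamma < 1$, the geometric series gives $|G| \le r_{\max} / (1 - \gamma)$, and this bound grows without limit as $\gamma \to 1$. The sketch below computes the return for a finite reward sequence; `discounted_return` is a hypothetical helper name.

```python
def discounted_return(rewards, gamma):
    """Compute r_0 + gamma * r_1 + gamma^2 * r_2 + ... for a finite sequence."""
    g = 0.0
    # Work backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]
g = discounted_return(rewards, gamma=0.5)  # 1 + 0.5 + 0.25 = 1.75
bound = max(abs(r) for r in rewards) / (1 - 0.5)  # r_max / (1 - gamma) = 2.0
```

However many further rewards of 1.0 are appended, the return with $\gamma = 0.5$ stays below the bound of 2.0.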

13. Types of RL Agents

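Model-based	Model-free
A model is given. The optimal policy is found using the model. There may be no explicit policy or value function.	There is no model. At least one of a policy or a value function is required. The optimal policy can be found without learning a model. Mainly used when the model is hard to represent.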

14. Planning vs RL

Planning	RL
The rules or the model are known. The optimal action can be selected with dynamic programming, tree search, and similar methods.	The rules or the model are not fully known. The policy is improved through interaction with the world.

15. Exploration vs Exploitation

Exploration	Exploitation
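The agent tries new actions so that it can make better decisions later.	The agent draws on past experience and selects actions that gave good rewards.

The two are in a trade-off. For example, the more the agent explores, the greater its potential for improvement, but the less stable its behavior becomes.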
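
One common way to balance the two is an epsilon-greedy rule, which the original text does not name; it is shown here only as an example. With probability `epsilon` the agent explores with a random action, and otherwise it exploits the action with the highest current estimate. The value estimates in `q_values` are made-up numbers.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Random action with probability epsilon, otherwise the best-known action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


rng = random.Random(0)
q_values = [0.1, 0.5, 0.2]  # hypothetical value estimates for three actions
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)  # always action 1
```

Setting `epsilon` to 0 gives pure exploitation and setting it to 1 gives pure exploration; values in between trade stability for the chance to discover better actions.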

16. Evaluation vs Control

Evaluation	Control
Estimating the expected reward of a given policy	Finding the optimal policy
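
The difference can be illustrated on a tiny known model. In the sketch below, `evaluate` estimates the value of a fixed policy, while `control` searches for the best policy using value iteration, which is one possible method and not one the original names. The two-state MDP, the function names, and the sweep count are all assumptions chosen for illustration.

```python
# Hypothetical two-state MDP: T[s][a] is a list of (probability, next_state, reward).
T = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.5


def q_value(V, s, a):
    """Expected immediate reward plus discounted value of the next state."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in T[s][a])


def evaluate(policy, sweeps=200):
    """Evaluation: estimate V^pi for a fixed deterministic policy."""
    V = {s: 0.0 for s in T}
    for _ in range(sweeps):
        V = {s: q_value(V, s, policy[s]) for s in T}
    return V


def control(sweeps=200):
    """Control: value iteration, then read off the greedy policy."""
    V = {s: 0.0 for s in T}
    for _ in range(sweeps):
        V = {s: max(q_value(V, s, a) for a in T[s]) for s in T}
    policy = {s: max(T[s], key=lambda a: q_value(V, s, a)) for s in T}
    return policy, V


V_stay = evaluate({0: 0, 1: 0})  # value of always choosing action 0: all zeros
best_policy, V_best = control()  # picks action 1 in both states
```

For this model the policy that always picks action 0 earns nothing, while the policy found by `control` has values of about 3 in state 0 and 4 in state 1.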
