[project] QMIX review
1. QMIX original paper https://arxiv.org/abs/1803.11485 QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning In many real-world settings, a team of agents must coordinate their behaviour while acting in a decentralised way. At the same time, it is often possible to train the agents in a centralised fashion in a simulated or laboratory setting, where global state arxiv.org..
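The core of the paper is the monotonic mixing network: per-agent utilities $Q_a$ are combined into $Q_{tot}$ by a network whose weights are generated from the global state by hypernetworks and forced non-negative, so that $\partial Q_{tot}/\partial Q_{a} \geq 0$. Below is a minimal PyTorch sketch of that idea; the layer sizes and the `embed_dim` default are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QMixer(nn.Module):
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents = n_agents
        self.embed_dim = embed_dim
        # Hypernetworks produce the mixing weights from the global state s.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1)
        )

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        bs = agent_qs.size(0)
        # abs() keeps the mixing weights non-negative, which is how QMIX
        # enforces dQ_tot / dQ_a >= 0 (the monotonicity constraint).
        w1 = torch.abs(self.hyper_w1(state)).view(bs, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(bs, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(agent_qs.view(bs, 1, self.n_agents), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(bs, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(bs, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(bs, 1)  # Q_tot
```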
2023.05.01
4. Monte Carlo Methods and Temporal Difference Learning in Policy Evaluation
RECALL * Policy Evaluation: Given an MDP and a policy $\pi$, Policy Evaluation can be performed using the Bellman equation. * DP for Policy Evaluation: We initialized the state-value function to 0 and updated it until convergence. $\gamma
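A minimal sketch of the DP evaluation loop recalled above, on a tabular MDP; the arrays `P`, `R`, and `pi` are hypothetical inputs, not from the post.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """P[s, a, s']: transition probs, R[s, a]: rewards, pi[s, a]: policy."""
    n_states = P.shape[0]
    V = np.zeros(n_states)              # initialize V to 0, as in the post
    while True:
        # Bellman expectation backup:
        # V(s) = sum_a pi(a|s) [ R(s,a) + gamma * sum_s' p(s'|s,a) V(s') ]
        Q = R + gamma * (P @ V)         # Q[s, a]
        V_new = (pi * Q).sum(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```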
2023.05.01
3. Policy Improvement by Iterative Method
1. Policy Iteration A method that starts from an initial policy and finds the optimal policy by alternating Policy Evaluation and Policy Improvement. 1.1. Policy Evaluation $Q^{\pi}(s,a) = R(s,a) + \gamma \sum_{s'}p(s'|s,a)V^{\pi}(s')$ $V^{\pi}(s) = \sum_{a} \pi(a|s)Q^{\pi}(s,a)$ You can either keep evaluating until the state-value function converges to $V^{\pi}$, or evaluate just once and move on to Policy Improvement. 1.2. Policy Improvement (control) $Q^{\pi}(s,a) = R(s,a) + \gamma \sum_..
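A sketch of the full loop from the excerpt's equations; for brevity the evaluation step here solves the Bellman linear system exactly rather than sweeping to convergence, and `P`, `R` are hypothetical tabular inputs.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """P[s, a, s']: transition probs, R[s, a]: expected rewards."""
    n_states, n_actions = R.shape
    pi = np.ones((n_states, n_actions)) / n_actions    # start uniform
    while True:
        # 1.1 Policy Evaluation: solve V^pi = R_pi + gamma * P_pi V^pi
        P_pi = np.einsum('sa,sat->st', pi, P)          # P_pi[s, s']
        R_pi = (pi * R).sum(axis=1)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # 1.2 Policy Improvement: act greedily w.r.t. Q^pi
        Q = R + gamma * (P @ V)
        pi_new = np.eye(n_actions)[Q.argmax(axis=1)]   # one-hot greedy policy
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new
```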
2023.05.01
2. Bellman equation and MDP
1. Bellman equation for MRP $V(s) = E[G_{t}|s_{t}=s]$ $= E[r_{t}+\gamma r_{t+1}+\gamma^{2}r_{t+2}+...|s_{t}=s]$ $= E[r_{t}+\gamma G_{t+1}|s_{t}=s]$ $= E[r_{t}|s_{t}=s] + \gamma E[E[G_{t+1}|s_{t+1}]|s_{t}=s]$ $= R(s) + \gamma E[V(s_{t+1})|s_{t}=s]$ $V(s) = R(s) + \gamma E[V(s_{t+1})|s_{t}=s]$ $V = R + \gamma T V$ $(I-\gamma T)V = R$ $V = (I-\gamma T)^{-1}R$ The state-value function can be obtained from the equations above. ..
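The closed-form solution is easy to check numerically; the 3-state chain below is a made-up example, and `np.linalg.solve` is used instead of forming the inverse $(I-\gamma T)^{-1}$ explicitly.

```python
import numpy as np

gamma = 0.9
# T[s, s']: transition matrix of a hypothetical 3-state chain
T = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([1.0, 0.0, 0.0])   # R[s]: expected immediate reward

# Solve (I - gamma T) V = R directly; numerically preferable to
# computing the matrix inverse.
V = np.linalg.solve(np.eye(3) - gamma * T, R)
print(V)
```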
2023.05.01
1. Introduction
1. Types of Machine Learning (ML) 2. Sequential decision making Each time step t: Agent takes an action $a_{t}$ Environment updates with new state and emits observation $o_{t}$ and reward $r_{t}$ Agent receives $o_{t}$ and $r_{t}$ 3. History $h_{t} = (a_{1}, o_{1}, r_{1}, \ldots, a_{t}, o_{t}, r_{t})$ 4. World state Different from the agent state; the actual world. 5. Agent state $s_{t} = f(h_{t}) = (a_{1}, o_{1}, r_{1}, ..
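A toy version of this interaction loop, with a hypothetical stand-in environment, just to make the notation concrete:

```python
import random

def env_step(action):
    """Hypothetical environment: returns (observation, reward)."""
    return random.random(), float(random.choice([0, 1]))

history = []                      # h_t = (a_1, o_1, r_1, ..., a_t, o_t, r_t)
for t in range(5):
    a_t = random.choice([0, 1])   # agent takes an action a_t
    o_t, r_t = env_step(a_t)      # environment emits o_t and r_t
    history.extend([a_t, o_t, r_t])

s_t = tuple(history)              # agent state s_t = f(h_t); f = identity here
```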
2023.05.01