[project] QMIX review
1. QMIX 원본 https://arxiv.org/abs/1803.11485 QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning In many real-world settings, a team of agents must coordinate their behaviour while acting in a decentralised way. At the same time, it is often possible to train the agents in a centralised fashion in a simulated or laboratory setting, where global state arxiv.org..
2023.05.01
4. Monte Carlo Methods and Temporal Difference Learning in Policy Evaluation
RECALL * Policy Evaluation: given an MDP and a policy $\pi$, Policy Evaluation can be performed using the Bellman equation. * DP for Policy Evaluation: the state-value function is initialized to 0 and updated until it converges. $\gamma
2023.05.01
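The DP policy evaluation described in this post (initialize the state-value function to 0, then sweep until convergence) can be sketched as follows; the transition matrix and rewards are made-up illustrative numbers, not from the post.

```python
import numpy as np

# Iterative policy evaluation sketch for a 2-state MRP induced by a fixed policy.
# P[s, s'] and R[s] are illustrative values, not taken from the post.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])   # transition probabilities under the evaluated policy
R = np.array([1.0, 0.0])     # expected immediate reward per state
gamma = 0.9

V = np.zeros(2)              # initialize state values to 0
for _ in range(10_000):      # sweep until (approximate) convergence
    V_new = R + gamma * P @ V
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new
```

At convergence `V` satisfies the Bellman equation $V = R + \gamma P V$ for this policy.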
3. Policy Improvement by Iterative Method
1. Policy Iteration: starting from an initial policy, alternate Policy Evaluation and Policy Improvement until the optimal policy is found. 1.1. Policy Evaluation $Q^{\pi}(s,a)=R(s,a)+\gamma\sum_{s'}p(s'|s,a)V^{\pi}(s')$, $V^{\pi}(s)=\sum_{a}\pi(a|s)Q^{\pi}(s,a)$. The state-value function can either be evaluated until it converges to $V^{\pi}$, or evaluated just once before moving on to Policy Improvement. 1.2. Policy Improvement (control) $Q^{\pi}(s,a) = R(s,a) + \gamma \sum_..
2023.05.01
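The evaluate-then-improve loop summarized above can be sketched in a few lines; the 2-state, 2-action MDP here is a toy with made-up numbers, and exact evaluation via a linear solve stands in for the iterative variant.

```python
import numpy as np

# Toy policy iteration sketch. P[a][s][s'] and R[s][a] follow the post's
# notation, but all numbers are illustrative assumptions.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])  # P[a, s, s'] = p(s'|s,a)
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                # R[s, a]
gamma = 0.9
policy = np.zeros(2, dtype=int)           # start from an arbitrary policy

for _ in range(100):
    # 1.1 Policy Evaluation: solve V = R_pi + gamma * P_pi @ V exactly
    P_pi = np.array([P[policy[s], s] for s in range(2)])
    R_pi = R[np.arange(2), policy]
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
    # 1.2 Policy Improvement: act greedily with respect to Q(s, a)
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break                             # policy stable -> optimal
    policy = new_policy
```

For a finite MDP this loop terminates, since each improvement step yields a strictly better policy until the greedy policy is already stable.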
2. Bellman equation and MDP
1. Bellman equation for MRP $V(s)=\mathbb{E}[G_t|s_t=s]=\mathbb{E}[r_t+\gamma r_{t+1}+\gamma^2 r_{t+2}+\dots|s_t=s]=\mathbb{E}[r_t+\gamma G_{t+1}|s_t=s]=\mathbb{E}[r_t|s_t=s]+\gamma\mathbb{E}[\mathbb{E}[G_{t+1}|s_{t+1}]|s_t=s]=R(s)+\gamma\mathbb{E}[V(s_{t+1})|s_t=s]$. In matrix form, $V=R+\gamma TV$, so $(I-\gamma T)V=R$ and $V=(I-\gamma T)^{-1}R$. The state-value function can be computed from the equation above. ..
2023.05.01
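The closed-form solution $V=(I-\gamma T)^{-1}R$ from this post can be computed directly; the transition matrix and rewards below are illustrative assumptions.

```python
import numpy as np

# Closed-form MRP value: V = (I - gamma*T)^{-1} R.
# T and R are made-up illustrative numbers.
T = np.array([[0.7, 0.3],
              [0.4, 0.6]])   # T[s, s'] transition probabilities
R = np.array([1.0, 2.0])     # R[s] expected immediate reward
gamma = 0.95

# solve (I - gamma*T) V = R instead of forming the inverse explicitly
V = np.linalg.solve(np.eye(2) - gamma * T, R)
```

Using `np.linalg.solve` rather than `np.linalg.inv` is the numerically preferable way to apply $(I-\gamma T)^{-1}$; the result satisfies the Bellman equation $V=R+\gamma TV$.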
1.Introduction
1. Types of Machine Learning (ML) 2. Sequential decision making: at each time step t, the agent takes an action $a_t$; the environment updates to a new state and emits an observation $o_t$ and a reward $r_t$; the agent receives $o_t$ and $r_t$. 3. History $h_t = (a_1, o_1, r_1, \dots, a_t, o_t, r_t)$ 4. World state: distinct from the agent state; the actual world. 5. Agent state $s_{t} = f(h_{t}) = (a_{1}, o_{1}, r_{1}, ,..
2023.05.01
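The interaction loop and history defined in this post can be sketched as a minimal agent-environment loop; the environment stub and its dynamics are made-up assumptions for illustration.

```python
import random

# Minimal sequential decision-making loop matching the notation above:
# at each step t the agent picks a_t, the environment emits (o_t, r_t),
# and the history h_t accumulates (a_1, o_1, r_1, ..., a_t, o_t, r_t).
# The environment here is a toy stub, not any real benchmark.

def environment(action):
    observation = action + random.random()  # toy observation o_t
    reward = 1.0 if action == 1 else 0.0    # toy reward r_t
    return observation, reward

history = []                                # h_t
for t in range(5):
    action = random.choice([0, 1])          # a_t from a random policy
    obs, reward = environment(action)       # environment emits o_t, r_t
    history.extend([action, obs, reward])   # h_t grows by (a_t, o_t, r_t)

# the agent state is some function of the history: s_t = f(h_t)
state = tuple(history)
```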