A simple PyTorch reinforcement learning model in which an agent learns to move towards food while avoiding an enemy on a 2D grid, using the log-probabilities of sampled actions and policy-gradient updates computed via backpropagation.
Stochastic policy: \(\pi(a|s,\theta)\) where \(\pi\) represents the probability of taking action \(a\) given state \(s\) and parameters \(\theta\).
The objective is to change parameters \(\theta\) to maximize expected rewards.
Unlike TD learning, there is no need to estimate Q-values or V-values.
Better suited for continuous action spaces.
The goal of PG methods is to maximize the expected return:
\(J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[\sum_{t=0}^T R(s_t, a_t)]\)
where \(\tau\) represents a trajectory, and \(R(s_t, a_t)\) is the reward at time \(t\).
The policy gradient theorem gives us: \(\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) R_t]\)
where \(R_t = \sum_{t'=t}^T R(s_{t'}, a_{t'})\) is the return from time \(t\).
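Where does this expression come from? A quick sketch, using the log-derivative trick \(\nabla_\theta p = p \, \nabla_\theta \log p\) applied to the trajectory distribution \(P(\tau|\theta)\):

\[
\nabla_\theta J(\theta)
= \nabla_\theta \sum_{\tau} P(\tau|\theta)\, R(\tau)
= \sum_{\tau} P(\tau|\theta)\, \nabla_\theta \log P(\tau|\theta)\, R(\tau)
= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\nabla_\theta \log P(\tau|\theta)\, R(\tau)\right]
\]

Because the transition dynamics do not depend on \(\theta\), \(\nabla_\theta \log P(\tau|\theta) = \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t)\). Finally, since an action cannot affect rewards received before it was taken, the total return \(R(\tau)\) in each term can be replaced by the reward-to-go \(R_t\), which gives the expression above.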
The policy gradient theorem is a fundamental result in RL that provides a way to compute the gradient of the expected return with respect to the policy parameters. This is really nice because it allows us to optimize the policy directly, without needing to estimate action-values.
The intuition behind this is that we need to increase the probability of actions that lead to good outcomes and decrease the probability of actions that lead to bad outcomes.
Let \(\pi_\theta(a|s)\) be a stochastic policy parametrized by \(\theta\), and let \(J(\theta)\) be the expected return when following this policy. The policy gradient theorem states that:
\(\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) R_t]\)
where:

- \(\tau\) is a trajectory \((s_0, a_0, r_0, s_1, a_1, r_1, ...)\)
- \(R_t = \sum_{t'=t}^T r_{t'}\) is the return from time \(t\)
The theorem tells us that to increase the expected return, we should move the policy parameters in a direction that:

1. Increases the probability of actions that led to high returns.
2. Decreases the probability of actions that led to low returns.
The magnitude of the update is proportional to the return, so actions that led to much better (or worse) outcomes have a larger impact on the policy update.
The policy gradient theorem works because:

1. It directly optimizes the quantity we care about (the expected return).
2. It doesn't require estimating action-values, which can be difficult in large state spaces.
3. It naturally handles continuous action spaces.
While powerful, the policy gradient approach has some limitations:

1. High variance: the returns can vary a lot, leading to noisy gradient estimates.
2. Sample inefficiency: it often requires many samples to estimate the gradient accurately.
These limitations have led to various improvements and extensions.
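For instance, one standard variance-reduction extension (not implemented in the code below, mentioned only as an example) subtracts a state-dependent baseline \(b(s_t)\) from the return, which lowers variance without biasing the gradient:

\[
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t)\,\bigl(R_t - b(s_t)\bigr)\right]
\]

The return normalization used later in this post is a simple, state-independent version of the same idea.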
The environment in this problem is a simple 2D grid-based world where there are 3 blobs (player, food, and enemy). The objective of the RL agent is to move towards the food blob while avoiding the enemy blob by choosing appropriate actions based on its current state.
The environment is represented as a numpy array of size SIZE x SIZE where each cell can contain one of 3 blobs. The input for the policy network consists of the relative positions of these 3 blobs in the current state.
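To make the input concrete, here is one plausible way such a state could be constructed: the (x, y) offset from the player to the food and from the player to the enemy, scaled by the grid size. The names and the scaling below are illustrative assumptions, not taken from the original environment code.

    import numpy as np

    SIZE = 10  # assumed grid size

    def get_state(player, food, enemy):
        # Relative positions of the food and the enemy with respect to the player,
        # scaled to roughly [-1, 1] so the network sees well-conditioned inputs.
        dx_food, dy_food = food[0] - player[0], food[1] - player[1]
        dx_enemy, dy_enemy = enemy[0] - player[0], enemy[1] - player[1]
        return np.array([dx_food, dy_food, dx_enemy, dy_enemy], dtype=np.float32) / SIZE

    # Example: player at (2, 3), food at (5, 5), enemy at (0, 1)
    state = get_state((2, 3), (5, 5), (0, 1))  # -> [0.3, 0.2, -0.2, -0.2]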
In this code, we use PyTorch to build a simple policy gradient RL model with a neural network that represents our policy function.
This network takes our state representations as input and outputs a probability distribution over possible actions. Mathematically, for a state \(s\):
\(\pi_\theta(a|s) = \text{softmax}(f_\theta(s))\)
where \(f_\theta\) is our neural network.
The agent then selects an action based on these probabilities using the Categorical distribution’s sample method.
In policy gradient methods, our objective is to maximize the expected return. In practice, however, we minimize a cost function whose minimization is equivalent to maximizing the expected return. The cost function for policy gradient methods is defined as:
\(J(\theta) = -\mathbb{E}_{\tau \sim \pi_\theta}[\sum_{t=0}^T R(s_t, a_t)]\)
Here, \(\theta\) represents the parameters of our policy, \(\tau\) is a trajectory, and \(R(s_t, a_t)\) is the reward at time \(t\).
The negative sign is crucial because we typically perform gradient descent to minimize this function, which effectively maximizes our expected return.
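Written out, with \(J(\theta)\) defined as the negative expected return above, a gradient-descent step on \(J\) is exactly a gradient-ascent step on the expected return:

\[
\theta \leftarrow \theta - \alpha \nabla_\theta J(\theta) = \theta + \alpha \nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^T R(s_t, a_t)\right]
\]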
1. Initialize the Policy Network: We start by initializing our policy network \(\pi_\theta(a|s)\) with random weights \(\theta\).
2. Collect Trajectories: For each episode, run the current policy in the environment, sampling actions from \(\pi_\theta(a|s)\) and recording the states, actions, rewards, and log-probabilities of the chosen actions.
3. Compute Returns: For each trajectory, compute the discounted return \(G_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k}\) for every timestep \(t\).
4. Normalize Returns (Optional): To reduce variance, we often normalize the returns: \(G_t' = \frac{G_t - \mu(G)}{\sigma(G)}\) where \(\mu(G)\) and \(\sigma(G)\) are the mean and standard deviation of all returns.
5. Compute Policy Gradient: The policy gradient is computed as: \(\nabla_\theta J(\theta) = -\mathbb{E}_{\tau \sim \pi_\theta}[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) G_t']\) In practice, we approximate this expectation with our sampled trajectories: \(\nabla_\theta J(\theta) \approx -\frac{1}{N} \sum_{i=1}^N \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t^i|s_t^i) G_t'^i\) where \(N\) is the number of trajectories (a minimal PyTorch sketch of this estimator follows right after these steps).
6. Update Policy Parameters: We update our policy parameters using gradient descent: \(\theta \leftarrow \theta - \alpha \nabla_\theta J(\theta)\) where \(\alpha\) is the learning rate.
7. Repeat: Go back to step 2 and repeat for many episodes until convergence or a maximum number of episodes is reached.
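As referenced in step 5, here is a minimal sketch of that sample-based estimator, assuming the per-step log-probabilities and (normalized) returns of \(N\) trajectories have already been collected as 1-D tensors. The implementation below updates after every single episode, which corresponds to \(N = 1\).

    import torch

    def policy_gradient_loss(batch_log_probs, batch_returns):
        # batch_log_probs / batch_returns: lists of 1-D tensors, one pair per trajectory.
        # Implements the estimate -1/N * sum_i sum_t log pi_theta(a_t^i | s_t^i) * G_t'^i
        per_trajectory = [-(log_probs * returns).sum()
                          for log_probs, returns in zip(batch_log_probs, batch_returns)]
        return torch.stack(per_trajectory).mean()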
import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNetwork(nn.Module):
    def __init__(self, input_size, output_size):
        super(PolicyNetwork, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(input_size, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, output_size)
        )

    def forward(self, x):
        # Softmax turns the raw scores into a probability distribution over actions
        return torch.softmax(self.fc(x), dim=-1)
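The functions that follow refer to a global policy_net and optimizer, which need to be created once. A minimal sketch; the input size of 4 (x/y offsets to the food and to the enemy), the 4 movement actions, and the Adam learning rate are assumptions for illustration, not values taken from the original script.

    import torch.optim as optim

    # Assumed sizes: 4 inputs (offsets to food and enemy) and 4 movement actions;
    # the learning rate is likewise an assumption for illustration.
    policy_net = PolicyNetwork(input_size=4, output_size=4)
    optimizer = optim.Adam(policy_net.parameters(), lr=1e-3)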
def select_action(state):
    # Convert the state to a batched float tensor for the network
    state = torch.FloatTensor(state).unsqueeze(0)
    probs = policy_net(state)
    # Sample an action from the categorical distribution defined by the policy
    m = Categorical(probs)
    action = m.sample()
    return action.item(), m.log_prob(action)
def update_policy(log_probs, rewards):
    returns = []
    G = 0
    for r in reversed(rewards):
        G = r + 0.99 * G  # 0.99 is the discount factor
        returns.insert(0, G)
    returns = torch.tensor(returns, dtype=torch.float32)

    if len(returns) > 1:
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # Normalization

    policy_loss = []
    for log_prob, R in zip(log_probs, returns):
        policy_loss.append(-log_prob * R)  # Negative because we're doing gradient ascent

    optimizer.zero_grad()
    policy_loss = torch.stack(policy_loss).sum()
    policy_loss.backward()
    optimizer.step()
Action selection is handled by the select_action function, and learning happens in the update_policy function, where we compute the policy loss and perform backpropagation. The key part is in the update_policy function:
policy_loss = []
for log_prob, R in zip(log_probs, returns):
    policy_loss.append(-log_prob * R)
This loop implements the policy gradient terms. The negative sign is present because we're doing gradient descent to minimize the negative expected return (which is equivalent to maximizing the expected return).
The backward() call computes the gradients, and optimizer.step() updates the parameters, implementing step 6 of our algorithm.
By iterating this process over many episodes, the policy gradually improves, learning to maximize the expected return in the given environment.
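For completeness, here is a sketch of the episode loop that ties select_action and update_policy together. The environment interface (a BlobEnv with reset() and step() returning the next state, the reward, and a done flag) and the episode limits are assumed names for illustration, not part of the code above.

    # Hypothetical training loop; BlobEnv, NUM_EPISODES, and MAX_STEPS are assumed names.
    env = BlobEnv()
    NUM_EPISODES = 5000
    MAX_STEPS = 200

    for episode in range(NUM_EPISODES):
        state = env.reset()
        log_probs, rewards = [], []

        for _ in range(MAX_STEPS):
            action, log_prob = select_action(state)   # sample a ~ pi_theta(.|s)
            state, reward, done = env.step(action)    # move the player blob
            log_probs.append(log_prob)
            rewards.append(reward)
            if done:                                   # reached the food or hit the enemy
                break

        update_policy(log_probs, rewards)              # one REINFORCE update per episode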
We calculate the returns \(G_t\) for each timestep:
\(G_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k}\)
where \(\gamma\) is the discount factor (0.99 in this code).
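For example, with \(\gamma = 0.99\) and a three-step episode with rewards \(r_0 = 1\), \(r_1 = 0\), \(r_2 = 2\):

\[
G_2 = 2,\qquad G_1 = 0 + 0.99 \cdot 2 = 1.98,\qquad G_0 = 1 + 0.99 \cdot 1.98 \approx 2.96
\]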
We normalize the returns to reduce variance:
\(G_t' = \frac{G_t - \mu(G)}{\sigma(G)}\)
The policy loss for each step is calculated as:
\(L_t = -\log \pi_\theta(a_t|s_t) G_t'\)
The total loss is the sum of these individual losses:
\(L = \sum_t L_t\)
We then use backpropagation to compute \(\nabla_\theta L\) and update our policy parameters:
\(\theta \leftarrow \theta - \alpha \nabla_\theta L\)
where \(\alpha\) is the learning rate.
These updates follow the core principles of the REINFORCE algorithm.