Vanilla Policy Gradient (VPG) Implementation

TLDR

A simple PyTorch reinforcement learning model in which an agent learns to move towards food while avoiding an enemy on a 2D grid, trained with policy gradient updates computed from the log-probabilities of its sampled actions via backpropagation.

Short Overview of VPG

The goal of PG methods is to maximize the expected return:

\(J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[\sum_{t=0}^T R(s_t, a_t)]\)

where \(\tau\) represents a trajectory, and \(R(s_t, a_t)\) is the reward at time \(t\).

The policy gradient theorem gives us: \(\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) R_t]\)

where \(R_t = \sum_{t'=t}^T R(s_{t'}, a_{t'})\) is the return from time \(t\).

The policy gradient theorem is a fundamental result in RL: it provides a way to compute the gradient of the expected return with respect to the policy parameters. This is very useful because it allows us to optimize the policy directly, without needing to estimate action-values.

The Formal Statement

The intuition behind this is that we need to increase the probability of actions that lead to good outcomes and decrease the probability of actions that lead to bad outcomes.

Let \(\pi_\theta(a|s)\) be a stochastic policy parametrized by \(\theta\), and let \(J(\theta)\) be the expected return when following this policy. The policy gradient theorem states that:

\(\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) R_t]\)

where:

  • \(\tau\) is a trajectory \((s_0, a_0, r_0, s_1, a_1, r_1, ...)\)
  • \(R_t = \sum_{t'=t}^T r_{t'}\) is the return from time \(t\)

The gradient estimator combines three ingredients (a short derivation sketch follows this list):

  1. Log-probability gradient: \(\nabla_\theta \log \pi_\theta(a_t|s_t)\)
  2. Return: \(R_t\)
  3. Expectation: \(\mathbb{E}_{\tau \sim \pi_\theta}\)
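To see where the log-probability term comes from, here is a brief derivation sketch using the standard log-derivative trick. For the sketch only, write \(p_\theta(\tau)\) for the probability of a trajectory under \(\pi_\theta\), \(\rho(s_0)\) for the initial-state distribution, \(P(s_{t+1}|s_t, a_t)\) for the transition dynamics, and \(R(\tau) = \sum_{t=0}^T R(s_t, a_t)\) for the total reward of a trajectory:

\(\nabla_\theta J(\theta) = \nabla_\theta \sum_\tau p_\theta(\tau) R(\tau) = \sum_\tau p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) R(\tau) = \mathbb{E}_{\tau \sim \pi_\theta}[\nabla_\theta \log p_\theta(\tau) R(\tau)]\)

using \(\nabla_\theta p_\theta(\tau) = p_\theta(\tau) \nabla_\theta \log p_\theta(\tau)\). Since \(\log p_\theta(\tau) = \log \rho(s_0) + \sum_{t=0}^T [\log \pi_\theta(a_t|s_t) + \log P(s_{t+1}|s_t, a_t)]\) and only the policy terms depend on \(\theta\), we get

\(\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau)]\)

Replacing the full trajectory reward \(R(\tau)\) with the reward-to-go \(R_t\) (an action cannot influence rewards that were received before it) gives the form stated above.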

Interpretation

The theorem tells us that to increase the expected return, we should move the policy parameters in a direction that:

  1. Increases the probability of actions that led to high returns.
  2. Decreases the probability of actions that led to low returns.

The magnitude of the update is proportional to the return, so actions that led to much better (or worse) outcomes have a larger impact on the policy update.

The policy gradient theorem is appealing because:

  1. It directly optimizes the quantity we care about (expected return).
  2. It doesn’t require estimating action-values, which can be difficult in large state spaces.
  3. It naturally handles continuous action spaces.

While powerful, the policy gradient theorem has some limitations:

  1. High variance: the returns can vary a lot, leading to noisy gradient estimates.
  2. Sample inefficiency: it often requires many samples to estimate the gradient accurately.

These limitations have led to various improvements and extensions, such as subtracting a baseline, actor-critic methods, and trust-region approaches like TRPO and PPO.

Environment

The environment in this problem is a simple 2D grid-based world where there are 3 blobs (player, food, and enemy). The objective of the RL agent is to move towards the food blob while avoiding the enemy blob by choosing appropriate actions based on its current state.

The environment is represented as a NumPy array of size SIZE x SIZE, where each cell can contain one of the 3 blobs. The input to the policy network consists of the relative positions of these blobs in the current state.
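To make this concrete, here is a minimal sketch of how such a relative-position state could be built. The Blob class, the get_state helper, the grid size, and the 4-number encoding (player-to-food and player-to-enemy offsets) are assumptions made for this sketch, not necessarily the exact environment code:

import numpy as np

SIZE = 10  # assumed grid size for this sketch

class Blob:
    """Hypothetical blob at a random position on the SIZE x SIZE grid."""
    def __init__(self):
        self.x = np.random.randint(0, SIZE)
        self.y = np.random.randint(0, SIZE)

    def __sub__(self, other):
        # Relative position of this blob with respect to another blob
        return (self.x - other.x, self.y - other.y)

def get_state(player, food, enemy):
    # State = (player - food, player - enemy), flattened into 4 numbers
    return np.array([*(player - food), *(player - enemy)], dtype=np.float32)

state = get_state(Blob(), Blob(), Blob())  # e.g. array([ 3., -2., -5.,  1.], dtype=float32)

Under this encoding, the policy network's input_size would be 4 and its output_size would equal the number of discrete moves.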

Reinforcement Learning Approach

In this code, we use PyTorch to build a simple policy gradient RL model with a neural network that represents our policy function.

This network takes our state representations as input and outputs a probability distribution over possible actions. Mathematically, for a state \(s\):

\(\pi_\theta(a|s) = \text{softmax}(f_\theta(s))\)

where \(f_\theta\) is our neural network.

The agent then selects an action based on these probabilities using the Categorical distribution’s sample method.

Cost Function in Policy Gradient Methods

In policy gradient methods, our objective is to maximize the expected return. In practice, however, we minimize a loss function whose minimization is equivalent to maximizing the expected return. The loss for policy gradient methods is defined as:

\(L(\theta) = -J(\theta) = -\mathbb{E}_{\tau \sim \pi_\theta}[\sum_{t=0}^T R(s_t, a_t)]\)

where \(\theta\) represents the parameters of our policy, \(\tau\) is a trajectory, and \(R(s_t, a_t)\) is the reward at time \(t\).

The negative sign is crucial: we perform gradient descent to minimize this loss, which effectively maximizes our expected return.
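As a tiny, self-contained PyTorch sketch of this sign convention (the log-probabilities and returns below are made-up numbers, not taken from the environment), minimizing the negated, return-weighted log-probabilities pushes probability mass towards actions with high returns:

import torch

# Made-up log-probabilities of three sampled actions (a leaf tensor so we can inspect its gradient)
log_probs = torch.tensor([-1.6, -0.7, -0.36], requires_grad=True)
returns = torch.tensor([1.5, 0.3, -0.8])  # made-up (normalized) returns

# Minimizing this loss is gradient ascent on the return-weighted log-probabilities
loss = -(log_probs * returns).sum()
loss.backward()
print(log_probs.grad)  # tensor([-1.5000, -0.3000,  0.8000]), i.e. -returns

A gradient descent step therefore increases the log-probabilities paired with positive returns and decreases those paired with negative returns.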

Detailed Algorithm Steps

  1. Initialize the Policy Network: We start by initializing our policy network \(\pi_\theta(a|s)\) with random weights \(\theta\).

  2. Collect Trajectories: For each episode:

    1. Initialize the environment and get the initial state \(s_0\).
    2. For each time step \(t\) in the episode:
      • Sample action \(a_t \sim \pi_\theta(a|s_t)\)
      • Execute \(a_t\) and observe next state \(s_{t+1}\) and reward \(r_t\)
      • Store \((s_t, a_t, r_t)\) in the current trajectory
    3. End the episode when a terminal state is reached
  3. Compute Returns: For each trajectory:

    1. Initialize \(G_{T+1} = 0\)
    2. For \(t = T\) down to 0:
      • Compute \(G_t = r_t + \gamma G_{t+1}\). This gives us the discounted return for each time step.
  4. Normalize Returns (Optional): To reduce variance, we often normalize the returns: \(G_t' = \frac{G_t - \mu(G)}{\sigma(G)}\) where \(\mu(G)\) and \(\sigma(G)\) are the mean and standard deviation of all returns.

  5. Compute Policy Gradient: The gradient of the loss is computed as: \(\nabla_\theta L(\theta) = -\mathbb{E}_{\tau \sim \pi_\theta}[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) G_t']\)

    In practice, we approximate this expectation with our sampled trajectories: \(\nabla_\theta L(\theta) \approx -\frac{1}{N} \sum_{i=1}^N \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t^i|s_t^i) G_t'^i\)

    where \(N\) is the number of trajectories.

  6. Update Policy Parameters: We update our policy parameters using gradient descent on the loss: \(\theta \leftarrow \theta - \alpha \nabla_\theta L(\theta)\)

    where \(\alpha\) is the learning rate.

  7. Repeat: Go back to step 2 and repeat for many episodes until convergence or a maximum number of episodes is reached.

Implementation

import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNetwork(nn.Module):
    def __init__(self, input_size, output_size):
        super(PolicyNetwork, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(input_size, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, output_size)
        )

    def forward(self, x):
        return torch.softmax(self.fc(x), dim=-1)

def select_action(state):
    # Sample an action from the categorical distribution produced by the
    # (globally defined) policy_net and return it with its log-probability.
    state = torch.FloatTensor(state).unsqueeze(0)
    probs = policy_net(state)
    m = Categorical(probs)
    action = m.sample()
    return action.item(), m.log_prob(action)

def update_policy(log_probs, rewards):
    # Compute discounted returns, normalize them, and take one gradient step
    # on the REINFORCE loss using the (globally defined) optimizer.
    returns = []
    G = 0
    for r in reversed(rewards):
        G = r + 0.99 * G  # 0.99 is the discount factor
        returns.insert(0, G)
    returns = torch.tensor(returns, dtype=torch.float32)

    if len(returns) > 1:
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # Normalization

    policy_loss = []
    for log_prob, R in zip(log_probs, returns):
        policy_loss.append(-log_prob * R)  # Negative sign: minimizing this loss maximizes the return-weighted log-probabilities

    optimizer.zero_grad()
    policy_loss = torch.stack(policy_loss).sum()
    policy_loss.backward()
    optimizer.step()

Implementation Details

The key part is in the update_policy function:

policy_loss = []
for log_prob, R in zip(log_probs, returns):
    policy_loss.append(-log_prob * R)

This loop implements the policy gradient loss. The negative sign is present because we’re doing gradient descent to minimize the negative expected return (which is equivalent to maximizing the expected return).

The backward() call computes the gradients, and optimizer.step() updates the parameters, implementing step 6 of our algorithm.

By iterating this process over many episodes, the policy gradually improves, learning to maximize the expected return in the given environment.
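For completeness, here is a sketch of how these pieces might be wired together into a training loop. The environment interface (env.reset() returning a state and env.step(action) returning (next_state, reward, done)), the network sizes, and the hyperparameters are assumptions for illustration and may differ from the original code:

import torch.optim as optim

# Assumed sizes: 4 state features (two relative positions) and 4 discrete moves
policy_net = PolicyNetwork(input_size=4, output_size=4)
optimizer = optim.Adam(policy_net.parameters(), lr=1e-3)

NUM_EPISODES = 5000
MAX_STEPS = 200

for episode in range(NUM_EPISODES):
    state = env.reset()  # hypothetical environment API
    log_probs, rewards = [], []

    for t in range(MAX_STEPS):
        action, log_prob = select_action(state)
        state, reward, done = env.step(action)  # hypothetical signature
        log_probs.append(log_prob)
        rewards.append(reward)
        if done:
            break

    update_policy(log_probs, rewards)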

Mathematical Explanation of the Update Step

  1. We calculate the returns \(G_t\) for each timestep:

    \(G_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k}\)

    where \(\gamma\) is the discount factor (0.99 in this code).

  2. We normalize the returns to reduce variance (a small worked example follows this list):

    \(G_t' = \frac{G_t - \mu(G)}{\sigma(G)}\)

  3. The policy loss for each step is calculated as:

    \(L_t = -\log \pi_\theta(a_t|s_t) G_t'\)

  4. The total loss is the sum of these individual losses:

    \(L = \sum_t L_t\)

  5. We then use backpropagation to compute \(\nabla_\theta L\) and update our policy parameters:

    \(\theta \leftarrow \theta - \alpha \nabla_\theta L\)

    where \(\alpha\) is the learning rate.
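To make steps 1 and 2 concrete, here is a small worked example with made-up rewards \(r = (1, 0, 2)\) and \(\gamma = 0.99\):

\(G_2 = 2\), \(G_1 = 0 + 0.99 \cdot 2 = 1.98\), \(G_0 = 1 + 0.99 \cdot 1.98 = 2.9602\)

The mean of \((2.9602, 1.98, 2)\) is \(\mu \approx 2.313\) and the sample standard deviation (which is what returns.std() computes in the code) is \(\sigma \approx 0.560\), so the normalized returns are approximately \((1.15, -0.60, -0.56)\). The log-probability of the first action is pushed up strongly, while those of the later actions are pushed down slightly.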

This update rule follows the core principles of the REINFORCE algorithm.
