Importance Sampling

Revised December 2, 2024

The Off-Policy Control with Function Approximation note is optional but recommended background reading.

For a random variable \(X\) with probability mass function \(\pi\), the expected value is given by: \[ \begin{equation}\label{eq:general_IS} \mathbb{E}_{X \sim \pi}[f(X)] = \sum_x f(x) \pi(x) \;, \end{equation} \] where \(f\) is a measurable function that assigns a real-valued output to each element of the sample space. In RL, a sample \(x\) is often a state-action trajectory \((a_t, s_{t+1}, a_{t+1}, \dots, s_T)\), \(f(x)\) is the total return \(G_t\) associated with that trajectory, and \(\pi\) is a policy. The resulting expectation \(\mathbb{E}_{X \sim \pi}[f(X)]\) then represents the expected return under the distribution \(\pi\) over trajectories. This formulation aligns naturally with on-policy learning, where the policy used to generate data matches the target policy \(\pi\).

In off-policy learning, however, the data are generated by a behavior policy \(\pi_\beta\) that differs from the target policy \(\pi\), which we aim to evaluate and/or improve. Formally, this means that \(X \not\sim \pi\), but rather \(X \sim \pi_\beta\). To account for the discrepancy between the behavior and target policies, we use importance sampling, a collection of methods for approximating a mathematical expectation under one distribution (in this case, \(\pi\)) by reweighting samples drawn from another distribution (in this case, \(\pi_\beta\)).

Importance sampling in RL vs. traditional Monte Carlo approximation

In this note, we refer to the RL meaning of “importance sampling”, which differs from its usage in the traditional Monte Carlo (MC) approximation literature. In RL, we estimate the value function for a target policy \(\pi\) using data generated by a behavior policy \(\pi_\beta\), often resulting in an importance sampling estimator with higher variance than on-policy MC methods. In contrast, traditional MC importance sampling focuses on designing a proposal distribution \(q\) to reduce the variance of estimates.

More specifically, in traditional MC methods, the goal is to estimate an expectation with respect to a target distribution \(p(x)\): \[ \begin{equation*} \mathbb{E}_{p}[f(x)] = \int f(x) p(x) \, dx \;. \end{equation*} \] Importance sampling is used to improve the efficiency of the estimator by sampling from a carefully chosen proposal distribution \(q(x)\) instead of \(p(x)\). The estimator becomes: \[ \begin{equation*} \mathbb{E}_{p}[f(x)] = \int f(x) \frac{p(x)}{q(x)} q(x) \, dx = \mathbb{E}_{q}\left[ f(x) \frac{p(x)}{q(x)} \right] \;. \end{equation*} \] The proposal distribution \(q(x)\) is designed to reduce the variance of the estimator by ensuring that \(q(x)\) is large where \(|f(x) p(x)|\) is large and by matching \(q(x)\) closely to the integrand's shape, particularly in regions that contribute most to the integral. By appropriately choosing \(q(x)\), importance sampling in traditional MC can lead to estimators with lower variance than those obtained by sampling directly from \(p(x)\).

Using importance sampling, we can estimate the value of Equation \(\eqref{eq:general_IS}\) for our target policy \(\pi\) as: \[ \begin{align} \mathbb{E}_{X \sim \pi}[f(X)] &= \sum_x \pi(x) f(x) \nonumber \\ &= \sum_x \frac{\pi_\beta(x)}{\pi_\beta(x)} \pi(x) f(x) \nonumber \\ &= \sum_x \pi_\beta(x) \frac{\pi(x)}{\pi_\beta(x)} f(x) \nonumber \\ &= \mathbb{E}_{X \sim \pi_\beta} \left[\frac{\pi(x)}{\pi_\beta(x)} f(x) \right] \label{eq:transformed-expectation} \;. \end{align} \]
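As a concrete illustration of Equation \(\eqref{eq:transformed-expectation}\), the minimal sketch below estimates \(\mathbb{E}_{X \sim \pi}[f(X)]\) using only samples drawn from \(\pi_\beta\). The three-element sample space, the two distributions, and \(f\) are hypothetical, chosen so that \(\pi_\beta\) covers \(\pi\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete sample space {0, 1, 2} with f(x) = x**2.
f = np.array([0.0, 1.0, 4.0])
pi = np.array([0.1, 0.2, 0.7])        # target distribution
pi_beta = np.array([0.4, 0.4, 0.2])   # behavior distribution (covers pi)

exact = np.sum(pi * f)                # E_{X ~ pi}[f(X)] computed directly

# Draw samples from the behavior distribution only.
n = 100_000
x = rng.choice(3, size=n, p=pi_beta)

# Reweight each sample by the importance sampling ratio pi(x) / pi_beta(x).
ratios = pi[x] / pi_beta[x]
is_estimate = np.mean(ratios * f[x])

print(f"exact: {exact:.4f}, importance sampling estimate: {is_estimate:.4f}")
```

With enough samples the reweighted average approaches the exact expectation, even though no sample was ever drawn from \(\pi\).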
For the continuous case, with \(\pi\) representing a probability density, this becomes: \[ \begin{align*} \mathbb{E}_{X \sim \pi}[f(X)] &= \int \pi(x) f(x) \, dx \\ &= \int \pi_\beta(x) \frac{\pi(x)}{\pi_\beta(x)} f(x) \, dx \\ &= \mathbb{E}_{X \sim \pi_\beta} \left[\frac{\pi(x)}{\pi_\beta(x)} f(x) \right] \\ &\approx \frac{1}{n} \sum_{j=1}^{n} \frac{\pi(x_j)}{\pi_\beta(x_j)} f(x_j) \;, \end{align*} \] where \(\{x_j\}_{j=1}^n\) are samples drawn from the behavior policy \(\pi_\beta\).

The term \(\frac{\pi(x)}{\pi_\beta(x)}\) is known as the importance sampling ratio. Intuitively, this ratio represents the relative likelihood of observing a sample \(x\) under the target policy \(\pi\) compared to the behavior policy \(\pi_\beta\). It acts as a correction factor that accounts for the mismatch between the two distributions.

To estimate values for the target policy \(\pi\) using episodes generated under the behavior policy \(\pi_\beta\), it is necessary that every action taken by \(\pi\) is also executed, at least occasionally, under \(\pi_\beta\). Formally, if \(\pi(a \mid s) > 0\), then it must also hold that \(\pi_\beta(a \mid s) > 0\). This condition, known as the assumption of coverage, ensures that the importance sampling ratio remains well-defined. Without it, the denominator in the ratio can be zero, making it impossible to reweight samples for accurate value estimation under the target policy.

The assumption of coverage implies that \(\pi_\beta\) must be stochastic in all states in which it differs from \(\pi\). The target policy \(\pi\), by contrast, can be deterministic. Deterministic target policies are commonly used in real-world control applications, where \(\pi\) is often the greedy policy with respect to the current estimate of the value function. Over time, this greedy policy may converge to a deterministic optimal policy, while the behavior policy \(\pi_\beta\) remains stochastic and exploratory, for example an \(\epsilon\)-greedy policy. In this note, however, we focus on the prediction problem, where the target policy \(\pi\) is fixed and given.

In Monte Carlo prediction, the value function is estimated using an episode's full trajectory. For a starting state \(s_t\) and a policy \(\pi\), the probability of observing a subsequent state–action trajectory \(a_t, s_{t+1}, a_{t+1}, \dots, s_T\) is: \[ \begin{align*} \mathbb{P}[a_t, s_{t+1}, a_{t+1}, \dots, s_T \mid s_t, A_{t:T-1} \sim \pi] &= \pi(a_t \mid s_t) P(s_t, a_t, s_{t+1}) \pi(a_{t+1} \mid s_{t+1}) \dots P(s_{T-1}, a_{T-1}, s_T) \\ &= \prod_{k=t}^{T-1} \pi(a_k \mid s_k) P(s_k, a_k, s_{k+1}) \;, \end{align*} \] where \(P\) is the transition probability function. The importance sampling ratio is then expressed as: \[ \begin{equation}\label{eq:IS_ratio} \rho_{t:T-1} = \frac{\prod_{k=t}^{T-1} \pi(a_k \mid s_k) \color{red}{P(s_k, a_k, s_{k+1})}} {\prod_{k=t}^{T-1} \pi_\beta(a_k \mid s_k) \color{red}{P(s_k, a_k, s_{k+1})}} = \prod_{k=t}^{T-1} \frac{\pi(a_k \mid s_k)}{\pi_\beta(a_k \mid s_k)} \;. \end{equation} \] While the trajectory probabilities depend on the MDP's transition dynamics \(P\) (which are generally unknown), these terms (highlighted in red in Equation \(\eqref{eq:IS_ratio}\)) appear in both the numerator and denominator and thus cancel out. Consequently, the importance sampling ratio depends only on the two policies \(\pi\) and \(\pi_\beta\) and the observed sequence of actions and states, not on the specifics of the MDP.
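Because the dynamics cancel, \(\rho_{t:T-1}\) can be computed from logged states and actions alone. The sketch below assumes the policies are given as callables returning \(\pi(a \mid s)\) and \(\pi_\beta(a \mid s)\); the tabular greedy target and \(\epsilon\)-greedy behavior policies are hypothetical examples, not part of the note.

```python
def importance_sampling_ratio(states, actions, pi, pi_beta, t=0):
    """Compute rho_{t:T-1} = prod_k pi(a_k | s_k) / pi_beta(a_k | s_k).

    `states` and `actions` are the logged s_t, ..., s_{T-1} and
    a_t, ..., a_{T-1}; the transition dynamics never enter the computation.
    """
    rho = 1.0
    for s, a in zip(states[t:], actions[t:]):
        rho *= pi(a, s) / pi_beta(a, s)
    return rho


# Hypothetical tabular policies over two actions {0, 1}.
greedy_action = {0: 1, 1: 0, 2: 1}          # deterministic target policy
eps = 0.1

def pi(a, s):
    return 1.0 if a == greedy_action[s] else 0.0

def pi_beta(a, s):                          # epsilon-greedy behavior policy
    return 1.0 - eps + eps / 2 if a == greedy_action[s] else eps / 2

states, actions = [0, 1, 2], [1, 0, 1]      # a trajectory where pi agrees everywhere
print(importance_sampling_ratio(states, actions, pi, pi_beta))  # (1 / 0.95)**3
```

If the behavior policy had taken even one action that the deterministic target policy never takes, the ratio for that trajectory would be zero.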
From Equation \(\eqref{eq:IS_ratio}\), we can define the importance-sampled return for the target policy \(\pi\) by reweighting the return \(G_t\) obtained under the behavior policy \(\pi_\beta\): \[ \begin{align*} G_t^{\pi / \pi_\beta} &= \rho_{t:T-1} G_t \\ &= \prod_{k=t}^{T-1} \frac{\pi(a_k \mid s_k)}{\pi_\beta(a_k \mid s_k)} G_t && \text{cf. Equation } \eqref{eq:transformed-expectation} \;. \end{align*} \] This corrected return serves as an estimator of the expected return under the target policy. The update rule is then given by: \[ \begin{equation*} V(s_t) \gets V(s_t) + \alpha \left( G_t^{\pi / \pi_\beta} - V(s_t) \right) \;. \end{equation*} \]

In Equation \(\eqref{eq:IS_ratio}\), the importance sampling ratio is applied across the entire trajectory. Over long trajectories, the product of importance sampling ratios tends to either shrink or grow exponentially, leading to large imbalances in the weights. This imbalance causes most sampled trajectories to contribute very little to the estimate (due to small weights), while a few trajectories dominate (due to large weights). Consequently, the effective sample size (ESS), which intuitively represents the number of “useful” samples contributing meaningfully to the estimate, decreases. A small ESS leads to unreliable estimates, as the contribution from most samples becomes negligible and the estimate relies heavily on a few trajectories with large weights. This phenomenon, along with the increased variance inherent in importance sampling, significantly limits the practicality of Monte Carlo methods for off-policy learning.

Temporal Difference (TD) learning, by contrast, uses bootstrapping, so importance sampling is applied over windows of at most \(n\) steps rather than across the entire trajectory: \[ \begin{align*} V(s_t) \gets V(s_t) + \alpha \, \rho_{t:t+n-1} \left( G_{t:t+n} - V(s_t) \right) && 0 \leq t < T \;, \end{align*} \] where \(G_{t:t+n}\) is the \(n\)-step return and \(\rho_{t:t+n-1}\) is the importance sampling ratio over \(n\) steps, given by: \[ \begin{align*} \rho_{t:h} = \prod_{k=t}^{\min(h,T-1)} \frac{\pi(a_k \mid s_k)}{\pi_\beta(a_k \mid s_k)} \;. \end{align*} \] For instance, with the one-step return, the update becomes: \[ \begin{align*} V(s_t) \gets V(s_t) + \alpha \, \frac{\pi(a_t \mid s_t)}{\pi_\beta(a_t \mid s_t)} \left( r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right) \;. \end{align*} \] Because importance sampling is applied over shorter horizons, the compounding effects of importance sampling ratios are limited, resulting in smaller variance in TD learning compared to MC methods. This reduced sensitivity to discrepancies between the target policy and the behavior policy makes TD methods generally more stable for off-policy learning.
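To tie the two update rules together, here is a minimal tabular sketch of the importance-sampled MC update and the one-step off-policy TD(0) update. It assumes \(V\) is a dict or array indexed by state, episodes are logged as \((s_t, a_t, r_{t+1})\) triples, and the policies are callables as in the sketch above; the function names, signatures, and constant step size are illustrative rather than prescribed by the note.

```python
def off_policy_mc_update(V, episode, pi, pi_beta, alpha=0.1, gamma=0.99):
    """Importance-sampled Monte Carlo update for every visited state.

    `episode` is a list of (s, a, r) triples, where r is the reward received
    after taking action a in state s.  V is updated in place.
    """
    G, rho = 0.0, 1.0
    # Iterate backwards so G and rho accumulate the tail of the episode:
    # after processing step t, G == G_t and rho == rho_{t:T-1}.
    for s, a, r in reversed(episode):
        G = r + gamma * G
        rho *= pi(a, s) / pi_beta(a, s)
        V[s] += alpha * (rho * G - V[s])


def off_policy_td0_update(V, s, a, r, s_next, pi, pi_beta, alpha=0.1, gamma=0.99):
    """One-step importance-sampled TD(0) update for a single (non-terminal) transition."""
    rho = pi(a, s) / pi_beta(a, s)
    V[s] += alpha * rho * (r + gamma * V[s_next] - V[s])
```

Note that the MC update reweights the entire tail of the episode, whereas the TD(0) update multiplies only a single per-step ratio into the TD error, which is exactly where its lower variance comes from.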