Off-Policy Evaluation

Revised July 23, 2025

The Value Functions and Policies note is optional but recommended background reading.

Off-policy evaluation (OPE) estimates the expected return of a target policy \(\pi_e\) using data generated by a different policy, the behavior policy \(\pi_b\). This permits evaluation without executing \(\pi_e\), which may be infeasible when online experimentation is expensive, unsafe, or constrained. OPE is purely evaluative: the goal is to estimate the value of \(\pi_e\) from data distributed according to \(\pi_b\). This contrasts with offline RL, which learns a policy from logged data \(\mathcal{D} = \{(s_t, a_t, r_{t+1}, s_{t+1})^i\}_{i=1}^T\) collected under \(\pi_b\). The structure of OPE parallels that of causal inference, where the effect of a specified intervention is estimated from observational data. In precision medicine, for example, treatments are assigned based on individual patient features, but decisions rely on historical data in which different patients received different treatments under different conditions.

OPE in Contextual Bandits

In the \(k\)-armed bandit problem, an agent repeatedly selects from \(k\) actions with each action \(a \in \mathcal{A}\) yielding a reward drawn from a stationary distribution conditioned on \(a\). The objective is to maximize expected cumulative reward over \(T\) time steps.

Each action has an associated value, which the agent estimates from data:

\[ \begin{equation*} Q_*(a) = \mathbb{E}[r_t \mid a_t = a] \;, \end{equation*} \]

where \(r_t\) is the reward observed after taking action \(a_t\) at time \(t\).

The contextual bandit problem generalizes the \(k\)-armed bandit setting by conditioning action selection on a state \(s \in \mathcal{S}\). Actions are selected according to a policy \(\pi(a \mid s)\), with rewards determined by both states and actions \(R(s,a)\). Repeating this process for \(T\) time steps and recording each transition yields a dataset:

\[ \begin{equation*} \tau_\pi = \{(s_i, a_i, r_i)\}_{i=1}^T \quad s \sim d,\; a \sim \pi,\; r \sim R(s, a) \; , \end{equation*} \]

where \(d\) is the state distribution.

As in the \(k\)-armed case, each action influences only the immediate reward, not future states. Thus, assuming the state distribution \(d\) is stationary and the reward distribution \(R(s,a)\) is time-invariant, the samples are independent and identically distributed (i.i.d.). This distinguishes contextual bandits from full RL problems, where actions affect long-term dynamics and rewards.

OPE estimates the expected reward of a fixed evaluation policy \(\pi_e\). Like all policies, \(\pi_e\) maps states in \(\mathcal{S}\) to probability distributions over actions in \(\mathcal{A}\). If \(\mathcal{A}\) is discrete, actions are selected with probability \(\pi_e(a \mid s)\). If \(\mathcal{A}\) is continuous, assume for simplicity that \(\pi_e(\cdot \mid s)\) defines a non-degenerate density over \(\mathcal{A}\) for every \(s \in \mathcal{S}\); that is, \(\pi_e\) is strictly stochastic.

The expected reward under \(\pi_e\) is given by:

\[ \begin{align} J &= \mathbb E[r_{\pi_e}] \nonumber \\ &= \sum_{s\in\mathcal S}\sum_{a\in\mathcal A}\sum_{r\in\mathcal R} p_{\pi_e}(s,a,r) \; r \label{eq:evaluation-policy-reward-2}\\ &= \sum_{s\in\mathcal S} p(s) \sum_{a\in\mathcal A} \pi_e(a\mid s) \sum_{r\in\mathcal R} p(r\mid s,a) \; r \;, \label{eq:evaluation-policy-reward-3} \end{align} \]

where \(p_{\pi_e}(s,a,r)\) is the joint distribution over state-action-reward triples induced by executing \(\pi_e\):

\[ \begin{equation}\label{eq:distribution-factorization} p_{\pi_e}(s,a,r) = p(s)\, \pi_e(a \mid s)\, p(r \mid s,a) \;. \end{equation} \]
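To make the later estimators concrete, the sketch below sets up a small synthetic contextual bandit (the state and action counts, Gaussian reward noise, uniform behavior policy, and greedy evaluation policy are all hypothetical choices, not part of the formal setup), logs a dataset \(\tau_{\pi_b}\), and evaluates Equation \(\eqref{eq:evaluation-policy-reward-3}\) exactly, which is possible only because the simulator's reward means are known. Subsequent code sketches reuse these names.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, T = 5, 4, 10_000

# Hypothetical mean reward for each (state, action) pair; sampled rewards add Gaussian noise.
reward_mean = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

pi_b = np.full((n_states, n_actions), 1.0 / n_actions)      # uniform behavior policy
pi_e = np.zeros((n_states, n_actions))                      # deterministic evaluation policy
pi_e[np.arange(n_states), reward_mean.argmax(axis=1)] = 1.0

# Log T (s, a, r) triples under pi_b: s ~ d (uniform), a ~ pi_b(.|s), r ~ R(s, a).
S = rng.integers(n_states, size=T)
A = np.array([rng.choice(n_actions, p=pi_b[s]) for s in S])
R = rng.normal(reward_mean[S, A], 0.1)

# Equation (3) evaluated exactly: J = sum_s p(s) sum_a pi_e(a|s) E[r|s,a].
J_true = np.mean(np.sum(pi_e * reward_mean, axis=1))        # p(s) is uniform over states
```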

Assumptions: Consistency, No Unmeasured Confounders, and Weak Positivity

The expected reward under \(\pi_e\) (Equation \(\eqref{eq:evaluation-policy-reward-3}\)) is identifiable (i.e., the population quantity of interest \(J\) can be expressed as a functional of the distribution that generates the observed data \(\tau_{\pi_b}\)) if \(\tau_{\pi_b}\) satisfies three standard assumptions from causal inference: consistency, no unmeasured confounding, and weak positivity.

1) Consistency:

The observed reward equals the potential outcome associated with the current state and action: \(r_t = R(s_t, a_t)\). For example, suppose \(X\) is a binary treatment variable (\(0\): untreated, \(1\): treated), and \(Y\) is a binary outcome (\(0\): survival, \(1\): death). Let \(Y^{x=0}\) and \(Y^{x=1}\) denote the potential outcomes under treatments \(x=0\) and \(x=1\), respectively. A causal effect exists for individual \(i\) if \(Y_i^{x=0} \neq Y_i^{x=1}\). Although both potential outcomes are defined, only one is observed. If individual \(i\) receives \(X=1\), then the factual outcome is \(Y_i = Y_i^{x=1}\). Consistency asserts that the observed outcome equals the potential outcome under the received treatment: \(X_i = x \Rightarrow Y_i = Y_i^x\).

2) No unmeasured confounders:

For all \(a \in \mathcal{A}\), the action \(A\) is conditionally independent of the potential outcome \(R(s,a)\) given the state \(s\): \(A \perp R(s,a) \mid s\). All common causes of action and outcome are assumed to be captured by \(s\). For example, in estimating the effect of income on voting, age may be a confounder, influencing both income and voting. If age is omitted from \(s\), conditional independence fails and the estimate is biased.

3) Weak positivity:

For all \(s \in \mathcal{S}\), the support of \(\pi_e(\cdot \mid s)\) is contained in the support of \(\pi_b(\cdot \mid s)\). That is, if \(\pi_e(a \mid s) > 0\), then \(\pi_b(a \mid s) > 0\).

The expectation in Equation \(\eqref{eq:evaluation-policy-reward-2}\) cannot be evaluated directly, as the full data-generating distribution \(p_{\pi_e}(s,a,r)\) is not observable. Specifically, although the evaluation policy \(\pi_e\) is known, the environment's dynamics (i.e., state distribution and reward function) are not. If samples from \(p_{\pi_e}\) were available, the expectation could be approximated via Monte Carlo estimation:

\[ \begin{equation} J = \mathbb{E}_{(s,a,r) \sim p_{\pi_e}}[r] \approx \frac{1}{T} \sum_{i=1}^T r_i \;. \label{eq:expectation-function-of-triple} \end{equation} \]

The data, however, are collected under a different policy \(\pi_b\). Thus, the expectation under \(p_{\pi_e}\) must be expressed in terms of the observed distribution \(p_{\pi_b}\):

\[ \begin{equation*} \mathbb{E}_{(s,a,r) \sim p_{\pi_e}}[r] = \mathbb{E}_{(s,a,r) \sim p_{\pi_b}}[\eta(s,a)\, r] \;, \end{equation*} \]

where \(\eta(s,a)\) is a reweighting factor, with multiple valid choices.

Importance Sampling

Importance sampling reweights samples drawn from \(p_{\pi_b}\) to estimate expectations under \(p_{\pi_e}\) as:

\[ \begin{align*} J &= \mathbb{E}_{(s,a,r) \sim p_{\pi_e}}[r] \\ &= \mathbb{E}_{(s,a,r) \sim p_{\pi_b}} \left[ \frac{p_{\pi_e}(s,a,r)}{p_{\pi_b}(s,a,r)}\, r \right] \\ &= \mathbb{E}_{(s,a,r) \sim p_{\pi_b}} \left[ \frac{p(s)\, \pi_e(a \mid s)\, p(r \mid s,a)}{p(s)\, \pi_b(a \mid s)\, p(r \mid s,a)} \, r \right] && \text{by equation } \eqref{eq:distribution-factorization} \\ &= \mathbb{E}_{(s,a,r) \sim p_{\pi_b}} \left[ \frac{\pi_e(a \mid s)}{\pi_b(a \mid s)}\, r \right] \\ &\approx \frac{1}{T} \sum_{i=1}^T \frac{\pi_e(a_i \mid s_i)}{\pi_b(a_i \mid s_i)}\, r_i \; , \quad (s_i,a_i,r_i) \sim p_{\pi_b} \;, \end{align*} \]

This derivation establishes that importance sampling yields an unbiased estimator of \(J\), denoted \(\hat{J}_{IS}\), provided weak positivity (also called the coverage assumption) holds. In practice, \(\pi_b\) is often unknown and must be approximated by a model \(\hat{\pi}_b\). If \(\hat{\pi}_b(a \mid s) \approx \pi_b(a \mid s)\), then \(\hat{J}_{IS}\) remains approximately unbiased.
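Continuing the synthetic bandit sketch above, a minimal implementation of \(\hat{J}_{IS}\) is a reweighted sample average:

```python
# Per-sample importance weights and the IS estimate of J.
rho = pi_e[S, A] / pi_b[S, A]
J_is = np.mean(rho * R)
print(J_is, J_true)      # the two should agree up to Monte Carlo error
```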

Although importance sampling is unbiased, it can have high variance when \(\pi_e\) deviates significantly from \(\pi_b\). In such cases, a few samples with large importance weights dominate the estimate, so small changes in the data induce large fluctuations in the output. Consider a contextual bandit with constant deterministic reward \(r\), a uniform behavior policy \(\pi_b(\cdot \mid s) = U(\mathcal{A})\) for all \(s \in \mathcal{S}\) (i.e., \(\pi_b(a \mid s) = 1/K\) for \(K = |\mathcal{A}|\)), and a deterministic evaluation policy \(\pi_e(a \mid s) = \mathbb{I}_{a = a^*(s)}\), where \(a^*(s)\) denotes the action selected by \(\pi_e\). Let \(\rho = \frac{\pi_e(a \mid s)}{\pi_b(a \mid s)}\). Then:

\[ \begin{align} \mathbb{V}[\rho r \mid a \sim U] &= r^2 \mathbb{V}[\rho \mid a \sim U] && \text{since \(r\) is constant} \nonumber \\ &= r^2 \left( \mathbb{E}[\rho^2 \mid a \sim U] - \mathbb{E}[\rho \mid a \sim U]^2 \right) && \text{definition of variance} \nonumber \\ &= r^2 \left( \mathbb{E}[\rho^2 \mid a \sim U] - 1 \right) && \text{since \(\mathbb{E}[\rho] = 1\)} \nonumber \\ &= r^2 \left( \mathbb{E} \left[ \left( \frac{\pi_e(a \mid s)}{\pi_b(a \mid s)} \right)^2 \;\middle|\; a \sim U \right] - 1 \right) && \text{expand \(\rho^2\)} \nonumber \\ &= r^2 \left( \mathbb{E} \left[ \left( \frac{\mathbb{I}_{a = a^*(s)}}{1/K} \right)^2 \;\middle|\; a \sim U \right] - 1 \right) && \text{substitute \(\pi_e(a \mid s) = \mathbb{I}_{a = a^*(s)}\), \(\pi_b(a \mid s) = 1/K\)} \nonumber \\ &= r^2 \left( \mathbb{E} \left[ \mathbb{I}_{a = a^*(s)} \cdot K^2 \;\middle|\; a \sim U \right] - 1 \right) && \text{simplify fraction} \nonumber \\ &= r^2 \left( K^2 \cdot \mathbb{P}(a = a^*(s)) - 1 \right) && \text{linearity of expectation} \nonumber \\ &= r^2 \left( K^2 \cdot \frac{1}{K} - 1 \right) && \text{since \(a \sim U\), so \(\mathbb{P}(a = a^*(s)) = \frac{1}{K}\)} \nonumber \\ &= r^2 (K - 1) \;. \label{eq:is-variance} \end{align} \]

Hence, the variance of a single importance-weighted sample grows linearly with \(K\). Since \(\hat{J}_{IS}\) averages \(T\) such samples, its variance is equal to \(r^2(K - 1)/T\).
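A quick Monte Carlo check of Equation \(\eqref{eq:is-variance}\), under the same constant-reward, uniform-\(\pi_b\), deterministic-\(\pi_e\) setup (the constants \(r = 2\) and \(K = 4\) are arbitrary):

```python
# Monte Carlo check: one importance-weighted sample should have variance r^2 (K - 1).
r_const, K, n_mc = 2.0, 4, 200_000
a_star = 0                                   # the single action pi_e selects
a_mc = rng.integers(K, size=n_mc)            # a ~ U(A)
rho_mc = (a_mc == a_star) * K                # weight equals K on a match, 0 otherwise
print(np.var(rho_mc * r_const), r_const**2 * (K - 1))    # both approximately 12.0
```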

Weighted Importance Sampling

High variance in importance sampling arises from stochasticity in the weighting and normalization of samples, not from variability in the rewards themselves. When \(r\) is constant across all states, actions, and time steps, \(\pi_b\) is uniform, and \(\pi_e\) is deterministic, the importance weight equals \(K\) if \(a_i = a^*(s_i)\) and zero otherwise. Samples with \(a_i \ne a^*(s_i)\) contribute nothing, and only a subset of the data is effectively retained. Since the corresponding rewards are constant and identical, variance must result from normalization.

More specifically, although approximately \(T/K\) samples receive nonzero weight, the estimator averages over \(T\) total samples. This introduces sensitivity to fluctuations in the frequency of the condition \(a_i = a^*(s_i)\). Let:

\[ \begin{equation*} \tau_{\pi_b}^\prime = \{(s_i, a_i, r_i) \in \tau_{\pi_b} : a_i = a^*(s_i)\} \end{equation*} \]

denote the retained subset. The estimator becomes:

\[ \begin{equation*} \frac{1}{T} \sum_{i=1}^T \frac{\mathbb{I}_{a_i = a^*(s_i)}}{1/K} r_i = \sum_{i=1}^T \frac{\mathbb{I}_{a_i = a^*(s_i)}}{T/K} r_i = \frac{1}{T/K} \sum_{i \in \tau_{\pi_b}^\prime} r_i \; . \end{equation*} \]

The denominator is the expected number of samples with nonzero weight, \(T/K\), while the number of terms in the summation is the realized count \(|\tau_{\pi_b}^\prime|\). Variability in the number of matching samples thus induces estimator variance even when the reward is constant.

Weighted importance sampling (WIS) modifies standard importance sampling by normalizing with the realized sum of weights rather than the dataset size. Specifically:

\[ \begin{equation}\label{eq:wis} \hat{J}_{WIS} = \frac{1}{\sum_{i=1}^T \rho_i} \sum_{i=1}^T \rho_i r_i \; . \end{equation} \]

Intuitively, the resulting estimator is a weighted average of rewards from the subset of data where the behavior and evaluation policies have overlapping support (\(\rho_i > 0\)). Unlike standard importance sampling, however, WIS is generally biased. Nonetheless, it is consistent: as the number of samples increases, \(\hat{J}_{WIS}\) converges in probability to the true value of \(J\). The normalization factor satisfies:

\[ \begin{equation*} \sum_{i=1}^T \rho_i \approx T \quad \text{as } T \to \infty \;, \end{equation*} \]

since \(\mathbb{E}[\rho_i] = 1\) and, by the law of large numbers, the average weight \(\frac{1}{T}\sum_{i=1}^T \rho_i\) converges to its expectation. In practice, WIS is often preferred due to its favorable bias–variance tradeoff.
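On the same logged bandit data, a sketch of Equation \(\eqref{eq:wis}\), which replaces the \(1/T\) normalization with the realized sum of weights:

```python
# Weighted importance sampling: normalize by the realized sum of weights.
J_wis = np.sum(rho * R) / np.sum(rho)
print(J_is, J_wis, J_true)
```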

Bias vs. Consistency

Bias and consistency are distinct statistical properties. Bias is a finite-sample property. An estimator is unbiased if its expected value equals the true value of the parameter at each fixed sample size \(n\), i.e., \(\mathbb{E}[\hat{\theta}_n] = \theta\). This is usually required to hold for every \(n\).

Consistency, by contrast, is asymptotic. It describes whether the estimator converges in probability to the true value as the sample size grows: \(\hat{\theta}_n \xrightarrow{p} \theta\) as \(n \to \infty\), meaning \(P(|\hat{\theta}_n - \theta| > \varepsilon) \to 0\) for all \(\varepsilon > 0\). Intuitively, the estimator’s distribution concentrates around the true value as the sample size grows.

For example, suppose \(X_1,\dots,X_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\mu,\sigma^2)\). The “single–observation” estimator \(T_n = X_n\) is unbiased since \(\mathbb{E}[T_n] = \mu\). It is not, however, consistent: \(P(|T_n - \mu| > \varepsilon) = 2\Phi(-\varepsilon/\sigma) > 0\) for all \(n\), so there is no convergence to \(0\).

The sample mean:

\[ \begin{equation*} \bar X_n = \frac{1}{n} \sum_{i=1}^n X_i \;, \end{equation*} \]

is also unbiased, with \(\mathbb{E}[\bar X_n] = \mu\) and \(\operatorname{Var}(\bar X_n) = \sigma^2 / n \to 0\), which implies \(\bar X_n \xrightarrow{p} \mu\). Hence it is consistent.

Unbiasedness and consistency are logically independent. An estimator can be unbiased but inconsistent, as with \(T_n = X_n\) above, or biased but consistent if the bias vanishes as \(n\) grows.
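A short simulation (with arbitrary \(\mu\), \(\sigma\), and replication count) illustrates the distinction: the spread of the single-observation estimator does not shrink as \(n\) grows, while that of the sample mean does.

```python
# Spread of the single-observation estimator vs. the sample mean as n grows.
mu, sigma, reps = 1.0, 2.0, 2_000
for n in (10, 100, 1_000):
    X = rng.normal(mu, sigma, size=(reps, n))        # `reps` replications of size n
    print(n, X[:, -1].std(), X.mean(axis=1).std())   # first stays near sigma, second shrinks
```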

Direct Method

While importance sampling reweights data to account for the mismatch between \(\pi_b\) and \(\pi_e\), the direct method (DM) estimates the value of the evaluation policy by modeling the reward function from data collected under \(\pi_b\).

In practice, the dataset \(\tau_{\pi_b}\) is partitioned into two disjoint subsets to prevent overfitting. The first is used to fit a model of the reward function. In the contextual bandit setting, this model estimates the expected reward for each state-action pair. That is, the model learns a function \(\hat{Q}\) to approximate \(Q(s,a) = \mathbb{E}[r \mid s,a]\). The second subset is used to evaluate \(\pi_e\) via the estimated reward function:

\[ \begin{equation*} \hat{J}_{DM} = \frac{1}{n} \sum_{i=1}^n \sum_{a \in \mathcal{A}} \pi_e(a \mid s_i)\, \hat{Q}(s_i,a) \;, \end{equation*} \]

where \(n\) is the number of samples in the evaluation subset.
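A minimal sketch of the direct method on the synthetic bandit data, with a tabular reward model (per-cell empirical means) standing in for the generic regression step described above:

```python
# Tabular Q-hat: empirical mean reward per (s, a) cell on the first half of the data.
half = T // 2
sums = np.zeros((n_states, n_actions))
counts = np.zeros((n_states, n_actions))
np.add.at(sums, (S[:half], A[:half]), R[:half])
np.add.at(counts, (S[:half], A[:half]), 1)
Q_hat = sums / np.maximum(counts, 1)

# Direct-method estimate on the held-out half: average of sum_a pi_e(a|s) Q-hat(s, a).
S_eval = S[half:]
J_dm = np.mean(np.sum(pi_e[S_eval] * Q_hat[S_eval], axis=1))
```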

The direct method typically has low variance, as it avoids reweighting and discards stochasticity in the observed rewards. In theory, if \(\hat{Q}\) were unbiased for \(Q\), then \(\hat{J}_{DM}\) would also be unbiased. But in practice, \(\hat{Q}\) is almost always biased due to (i) the expressive capacity of the model class (e.g., a linear model cannot approximate a highly nonlinear reward function, regardless of data quantity) and (ii) the sample size available for training.

Moreover, the bias is generally difficult to characterize because approximation error \(\|Q - \hat{Q} \|\) is hard to quantify for arbitrary function classes. In some cases, the error can be bounded under capacity constraints or smoothness assumptions. Finally, bias may also be exacerbated by misaligned loss functions. If the loss used for model learning does not account for \(\pi_e\), the resulting estimator may allocate capacity to regions of the state space that \(\pi_e\) never visits, undermining the relevance of the model to the evaluation task.

Doubly Robust Estimator

Recall from Equation \(\eqref{eq:is-variance}\) that when the reward is a deterministic constant \(r\), the variance of a single importance-weighted sample is \(r^2(K - 1)\), where \(K\) is the number of actions. This expression vanishes if \(r^2 = 0\) or \(K = 1\). In particular, the variance can be reduced by subtracting a constant \(c\) from \(r\), yielding the shifted reward \((r - c)\). The resulting variance is \((r - c)^2(K - 1)\), which becomes zero when \(c = r\).

This idea extends to stochastic rewards via doubly robust estimation (DR), which combines importance sampling with the direct method. DR inherits the low variance of the direct method and the consistency of importance sampling:

\[ \begin{equation*} \hat{J}_{DR} = \frac{1}{n} \sum_{i=1}^n \left[ \rho_i \left( r_i - \hat{Q}(s_i, a_i) \right) + \hat{Q}(s_i, \pi_e) \right] \;, \end{equation*} \]

where \(\rho_i = \frac{\pi_e(a_i \mid s_i)}{\pi_b(a_i \mid s_i)}\) and

\[ \begin{equation*} \hat{Q}(s_i, \pi_e) = \sum_{a \in \mathcal{A}} \pi_e(a \mid s_i)\, \hat{Q}(s_i, a) \;. \end{equation*} \]

DR reduces to importance sampling when \(\hat{Q} = 0\).

The DR estimator uses \(\hat{Q}\) as a baseline and applies a correction when a matching sample is available. The correction term \(\rho_i (r_i - \hat{Q}(s_i, a_i))\) offsets bias in the model. When \(\hat{Q}(s_i, a_i)\) closely approximates \(r_i\), this correction is small and the overall variance is reduced. Conversely, when the model estimate is poor, the correction is large, recovering the true reward via reweighting.
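A sketch of the DR estimate on the held-out half of the synthetic bandit data, reusing the tabular \(\hat{Q}\) from the direct-method sketch (here the logging policy \(\pi_b\) is known exactly, so no model \(\hat{\pi}_b\) is needed):

```python
# Doubly robust estimate on the held-out half: model baseline plus weighted correction.
S2, A2, R2 = S[half:], A[half:], R[half:]
rho2 = pi_e[S2, A2] / pi_b[S2, A2]
baseline = np.sum(pi_e[S2] * Q_hat[S2], axis=1)          # Q-hat(s_i, pi_e)
J_dr = np.mean(rho2 * (R2 - Q_hat[S2, A2]) + baseline)
print(J_dm, J_dr, J_true)
```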

As with importance sampling, DR estimators often rely on an estimate \(\hat{\pi}_b\) of the behavior policy. Crucially, DR is consistent if either \(\hat{Q}\) or \(\hat{\pi}_b\) is correctly specified — meaning equal to the true \(Q\) or \(\pi_b\) on the relevant domain. This dual robustness underlies the name doubly robust — consistency is guaranteed if either the reward model or behavior model is correct, but not necessarily both.

OPE in RL

Extending OPE from contextual bandits to Markov Decision Processes (MDPs) introduces a fundamental complication — the data are no longer i.i.d. In an MDP, each action not only determines the immediate reward but also affects the next state, creating sequentially correlated trajectories. This invalidates bandit-style OPE methods for two reasons. First, distributional shift arises from differences in the state visitation patterns induced by the evaluation policy \(\pi_e\) and the behavior policy \(\pi_b\). OPE must therefore correct for mismatches in both action selection and state occupancy. Second, the curse of horizon emerges when applying importance sampling to long trajectories. Specifically, estimating the value of an \(H\)-step trajectory requires multiplying \(H\) importance weights, \(\prod_{t=1}^H \rho_t\), each contributing variance. Consequently, the overall variance increases exponentially with \(H\), making the estimator unstable for nontrivial horizons. Addressing these issues requires estimators designed specifically for sequential settings. These methods often depend on some combination of two auxiliary quantities, referred to as nuisance functions, which must be estimated from offline data.

Nuisance Functions

A nuisance function is any quantity that, while not the target of inference, must be estimated to construct a statistically valid or efficient estimator. In the MDP setting, two nuisance functions commonly arise. The first is the Q-function \(Q(s,a)\), which represents the expected cumulative reward obtained by taking action \(a\) in state \(s\) and following the evaluation policy \(\pi_e\) thereafter:

\[ \begin{equation*} Q^{\pi_e}(s, a) = \mathbb{E}_{\pi_e} \left[ \sum_{t=0}^\infty \gamma^t r_{t+1} \,\middle|\, s_0 = s, a_0 = a \right] \; , \end{equation*} \]

where \(\gamma \in [0, 1)\) is the discount factor. For a detailed treatment of the Q-function, see the Model Free Control note.

The second nuisance function is the marginal density ratio \(\mu^*(s,a)\), which corrects for distributional shift by reweighting state-action pairs according to their relative discounted visitation frequencies under \(\pi_e\) and \(\pi_b\). This quantity is central to addressing the curse of horizon. Rather than correcting per-step action probabilities, \(\mu^*\) adjusts for the long-run, discounted visitation frequency of state-action pairs.

Formally, the marginal density ratio is defined as:

\[ \begin{equation*} \mu^*(s, a) = \frac{d^{\pi_e}(s, a)}{d^{\pi_b}(s, a)} \;, \end{equation*} \]

where:

\[ \begin{equation*} d^\pi(s,a) = (1 - \gamma) \sum_{t=0}^\infty \gamma^t \mathbb{P}^\pi(s_t = s, a_t = a) \end{equation*} \]

is the discounted state-action visitation distribution under \(\pi\), with trajectories initialized from a fixed distribution \(d_0\).

The ratio \(\mu^*(s,a)\) quantifies how much more (or less) frequently the pair \((s,a)\) is encountered under \(\pi_e\) than under \(\pi_b\), measured with respect to the same starting distribution. Using this single, trajectory-level ratio avoids the variance amplification associated with products of stepwise importance weights.
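Because the bandit simulator above has no transitions, the sketch below builds a separate, randomly generated tabular MDP (the sizes, kernel, start distribution, and discount are arbitrary) and computes \(d^\pi\), and hence \(\mu^*\), exactly by solving the occupancy recursion \(d^\pi = (1-\gamma)\,\nu_0 + \gamma\, P_\pi^\top d^\pi\), where \(\nu_0(s,a) = d_0(s)\,\pi(a \mid s)\) and \(P_\pi\) is the state-action transition matrix induced by \(\pi\). In practice \(\mu^*\) must be estimated from data, as discussed below.

```python
import numpy as np

rng_mdp = np.random.default_rng(1)
nS, nA, gamma = 4, 3, 0.9

# Hypothetical tabular MDP: random transition kernel P[s, a, s'] and uniform start distribution d0.
P = rng_mdp.dirichlet(np.ones(nS), size=(nS, nA))
d0 = np.full(nS, 1.0 / nS)

def occupancy(pi):
    """Discounted visitation d^pi, solving d = (1 - gamma) nu0 + gamma P_pi^T d exactly."""
    # P_pi[(s,a),(s',a')] = P(s'|s,a) pi(a'|s');  nu0(s,a) = d0(s) pi(a|s).
    P_pi = np.einsum('ijk,kl->ijkl', P, pi).reshape(nS * nA, nS * nA)
    nu0 = (d0[:, None] * pi).reshape(nS * nA)
    d = np.linalg.solve(np.eye(nS * nA) - gamma * P_pi.T, (1 - gamma) * nu0)
    return d.reshape(nS, nA)

pi_b_mdp = np.full((nS, nA), 1.0 / nA)                   # uniform behavior policy
pi_e_mdp = np.zeros((nS, nA))                            # an arbitrary deterministic evaluation policy
pi_e_mdp[np.arange(nS), np.arange(nS) % nA] = 1.0

mu_star = occupancy(pi_e_mdp) / occupancy(pi_b_mdp)      # marginal density ratio mu*(s, a)
```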

Although neither \(Q\) nor \(\mu^*\) is the primary object of interest, the quality of the policy value estimate \(\hat{J}(\pi_e)\) depends on the accuracy with which these auxiliary functions are learned.

Estimating Q-functions

The Q-function \(Q(s,a)\) satisfies a fixed-point condition defined by the Bellman equation. Estimation methods exploit this structure by minimizing Bellman error on the offline data.

Fitted Q-Iteration (FQE)

Fitted Q-Iteration (FQE) estimates the Q-function by solving a sequence of supervised regression problems that enforce the Bellman equation. At each iteration \(k\), the current estimate \(\hat{Q}_k\) is used to generate Bellman targets, which define the regression labels for the next update.

Given data tuples \((s, a, r, s')\) from the offline dataset, the update solves:

\[ \begin{equation*} \hat{Q}_{k+1} \leftarrow \arg\min_{\tilde{Q}} \mathbb{E}_T \left[ \left( \tilde{Q}(s,a) - \left( r + \gamma \hat{Q}_k(s', \pi_e) \right) \right)^2 \right] \;, \end{equation*} \]

where \(\mathbb{E}_T\) denotes the empirical average over the \(T\) transitions in the offline dataset \(\mathcal{D}\), and \(\hat{Q}_k(s', \pi_e)\) is the expected value under the evaluation policy:

\[ \begin{equation}\label{eq:q-hat-discrete} \hat{Q}_k(s', \pi_e) = \sum_{a'} \pi_e(a' \mid s')\, \hat{Q}_k(s', a') \;. \end{equation} \]

This process is repeated until convergence. FQE is compatible with any function approximator and has been widely used with both linear models and deep neural networks.
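A tabular FQE sketch on the toy MDP above. The reward table, the i.i.d. sampling of logged transitions from \(d^{\pi_b}\) (computable exactly in this toy model), and the iteration count are hypothetical simplifications:

```python
# Hypothetical reward table for the toy MDP; sampled rewards add Gaussian noise.
reward_mdp = rng_mdp.uniform(0.0, 1.0, size=(nS, nA))

# Log transitions with (s, a) ~ d^{pi_b} (exact here), r ~ R(s, a), s' ~ P(.|s, a).
T_steps = 50_000
d_b = occupancy(pi_b_mdp).reshape(-1)
idx = rng_mdp.choice(nS * nA, size=T_steps, p=d_b / d_b.sum())
S_log, A_log = idx // nA, idx % nA
R_log = rng_mdp.normal(reward_mdp[S_log, A_log], 0.1)
S_next = np.array([rng_mdp.choice(nS, p=P[s, a]) for s, a in zip(S_log, A_log)])

# Tabular FQE: each iteration regresses Bellman targets onto (s, a) cells.
Q_fqe = np.zeros((nS, nA))
for _ in range(200):
    targets = R_log + gamma * np.sum(pi_e_mdp[S_next] * Q_fqe[S_next], axis=1)
    sums, counts = np.zeros((nS, nA)), np.zeros((nS, nA))
    np.add.at(sums, (S_log, A_log), targets)
    np.add.at(counts, (S_log, A_log), 1)
    Q_fqe = sums / np.maximum(counts, 1)
```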

Minimax Q-learning

Unlike FQE, Minimax Q-learning avoids repeated fitted-Q iterations and instead enforces the Bellman equation in expectation. The Bellman condition requires that the error term \(r + \gamma Q(s', \pi_e) - Q(s,a)\) have zero expectation under the data distribution. To enforce this constraint, the method introduces a test function \(f \in \mathcal{F}\) that identifies systematic patterns in the Bellman error; if the expected product \(f(s,a) [r + \gamma Q(s', \pi_e) - Q(s,a)]\) is nonzero, the Bellman condition is violated. The Q-function is optimized to minimize the worst-case violation detected by \(f\). The resulting minimax objective is:

\[ \begin{equation*} \arg\min_{\tilde{Q} \in \mathcal{Q}} \max_{f \in \mathcal{F}} \left( \mathbb{E}_T \left[ f(s,a)\left\{r + \gamma \tilde{Q}(s',\pi_e) - \tilde{Q}(s,a)\right\} \right] - \lambda \mathbb{E}_T[f(s,a)^2] \right) \;, \end{equation*} \]

where \(\mathcal{Q}\) and \(\mathcal{F}\) are the function classes for the Q-function and discriminator, respectively, \(\lambda\) is a Lagrange multiplier, and the inner expectation \(\mathbb{E}_{a'\sim\pi_e(\cdot|s')}[\tilde{Q}(s',a')]\) is computed as in Equation \(\eqref{eq:q-hat-discrete}\).

While computationally more demanding than FQE, this approach can yield stronger statistical guarantees. When both \(\mathcal{Q}\) and \(\mathcal{F}\) are linear, the objective reduces to Least-Squares Temporal-Difference for Q-learning (LSTDQ), which admits a closed-form solution.
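With one-hot (tabular) features \(\phi(s,a)\), LSTDQ amounts to solving the linear system \(A w = b\), where \(A = \mathbb{E}_T\!\left[\phi(s,a)\,(\phi(s,a) - \gamma\,\phi(s',\pi_e))^\top\right]\) and \(b = \mathbb{E}_T[\phi(s,a)\, r]\). A sketch reusing the transitions logged for the FQE example:

```python
# One-hot features phi(s, a) make the Q-function class linear (and tabular).
def phi(s, a):
    f = np.zeros(nS * nA)
    f[s * nA + a] = 1.0
    return f

A_mat = np.zeros((nS * nA, nS * nA))
b_vec = np.zeros(nS * nA)
for s, a, r, s_nxt in zip(S_log, A_log, R_log, S_next):
    phi_sa = phi(s, a)
    phi_next = sum(pi_e_mdp[s_nxt, a2] * phi(s_nxt, a2) for a2 in range(nA))
    A_mat += np.outer(phi_sa, phi_sa - gamma * phi_next)
    b_vec += r * phi_sa
w = np.linalg.lstsq(A_mat, b_vec, rcond=None)[0]
Q_lstdq = w.reshape(nS, nA)          # Q-hat(s, a) = phi(s, a) . w
```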

Estimating Marginal Ratios

Estimating the marginal density ratio \(\mu^*\) follows a minimax framework similar to that of Q-function estimation. Like \(Q\), the function \(\mu^*\) satisfies a Bellman-style identity that enables its estimation from data. This identity expresses a fixed-point relationship over trajectories under the behavior policy. Specifically, \(\mu^*\) satisfies a recursion that relates its value at \((s,a)\) to its expected value at subsequent state-action pairs drawn from \(\pi_e\). This structure forms the basis for a minimax estimation problem, whose solution \(\hat{\mu}^*\) is given by:

\[ \begin{equation*} \arg\min_{\tilde{\mu}^* \in \mathcal{M}} \max_{f \in \mathcal{F}} \left( \mathbb{E}_T \left[ \gamma \tilde{\mu}^*(s,a)\, f(s',\pi_e) - \tilde{\mu}^*(s,a)\, f(s,a) \right] + (1-\gamma) \mathbb{E}_{p_e^{(1)}}[f(s, \pi_e)] - \lambda \mathbb{E}_T[f(s,a)^2] \right) \;, \end{equation*} \]

where \(p_e^{(1)}\) is the initial state distribution, \(\mathcal{M}\) and \(\mathcal{F}\) denote the function classes for the density ratio and test functions, respectively, \(\lambda\) is a Lagrange multiplier, and \(f(s, \pi_e)\) is the expected value under the evaluation policy.

This formulation directly parallels minimax Q-learning, substituting the Bellman error with the marginal ratio identity.

MDP Estimators

With the nuisance functions defined, the MDP counterparts of the previously introduced bandit estimators follow directly.

The Marginalized Importance Sampling (MIS) Estimator

The marginalized importance sampling (MIS) estimator extends the bandit importance sampling method to MDPs by replacing per-step importance ratios with the marginal density ratio \(\mu^*(s,a)\). It relies solely on \(\mu^*\) to reweight observed rewards and makes no use of a Q-function. The estimator for the total discounted reward \(J\) is:

\[ \begin{equation*} \hat{J}_{\text{MIS}} = \frac{1}{1-\gamma} \frac{1}{T}\sum_{i=1}^{T} \hat{\mu}^*(s_i,a_i)\,r_i \;. \end{equation*} \]

By using a single aggregated correction term, MIS avoids the exponential variance growth of stepwise importance sampling. Its accuracy, however, depends critically on the quality of the density ratio estimate \(\hat{\mu}^*\).
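Continuing the toy MDP, a sketch of \(\hat{J}_{\text{MIS}}\); the exactly computed \(\mu^*\) stands in for an estimate \(\hat{\mu}^*\), and the reference value is available only because the simulator is fully known:

```python
# MIS estimate of J, and the true value computed from the known simulator.
J_mis = np.mean(mu_star[S_log, A_log] * R_log) / (1 - gamma)
J_true_mdp = np.sum(occupancy(pi_e_mdp) * reward_mdp) / (1 - gamma)
print(J_mis, J_true_mdp)
```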

Direct Method (DM)

The direct method (DM) extends its bandit analogue to the MDP setting by computing the expected return under \(\pi_e\) using a learned Q-function \(\hat{Q}\), without reference to the behavior policy or observed rewards:

\[ \begin{equation*} \hat{J}_{DM} = \frac{1}{n} \sum_{i=1}^n \hat{Q}(s_{i,0}, \pi_e) = \frac{1}{n} \sum_{i=1}^n \sum_a \pi_e(a \mid s_{i,0})\, \hat{Q}(s_{i,0}, a) \;, \end{equation*} \]

where \(\{s_{i,0}\}_{i=1}^n\) are the initial states of \(n\) observed trajectories.

This estimator avoids variance from reward stochasticity but is prone to bias from model misspecification. Systematic error in \(\hat{Q}\) directly propagates to the final estimate.
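A sketch of the MDP direct method, plugging initial states drawn from \(d_0\) (the number of trajectories is arbitrary) into the FQE-fitted Q-function:

```python
# MDP direct method: average Q-hat(s_0, pi_e) over initial states drawn from d0.
s0 = rng_mdp.choice(nS, size=1_000, p=d0)
J_dm_mdp = np.mean(np.sum(pi_e_mdp[s0] * Q_fqe[s0], axis=1))
```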

Double Reinforcement Learning (DRL) Estimator

The double reinforcement learning (DRL) estimator combines the direct method with marginalized importance sampling to yield a statistically efficient estimator for MDPs. It is the basis for many modern OPE methods in RL:

\[ \begin{equation*} \hat{J}_{\text{DRL}} = \frac{1}{n} \sum_{i=1}^n \hat{Q}(s_{i,0}, \pi_e) + \frac{1}{T} \sum_{j=1}^T \frac{\hat{\mu}^*(s_j, a_j)}{1 - \gamma} \left( r_j + \gamma \hat{Q}(s_j', \pi_e) - \hat{Q}(s_j, a_j) \right) \;. \end{equation*} \]

The first term estimates \(J(\pi_e)\) from the learned Q-function. The second term corrects model error using observed rewards and transitions, reweighted by the estimated marginal density ratio \(\hat{\mu}^*\). If \(\hat{Q}\) is accurate, the Bellman error vanishes and the correction term has no effect. If \(\hat{Q}\) is misspecified, the correction offsets its bias.

The DRL estimator is consistent if either \(\hat{Q}\) or \(\hat{\mu}^*\) is correctly specified. This double robustness significantly improves reliability in practice. Furthermore, among a broad class of regular estimators, DRL attains the lowest possible asymptotic mean squared error, even when \(\hat{Q}\) and \(\hat{\mu}^*\) are learned via flexible, nonparametric models satisfying mild convergence conditions.
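Combining the direct-method term with the \(\mu^*\)-weighted correction gives a sketch of \(\hat{J}_{\text{DRL}}\) on the toy MDP:

```python
# DRL: direct-method term plus a mu*-weighted Bellman-error correction.
bellman_err = (R_log
               + gamma * np.sum(pi_e_mdp[S_next] * Q_fqe[S_next], axis=1)
               - Q_fqe[S_log, A_log])
J_drl = J_dm_mdp + np.mean(mu_star[S_log, A_log] * bellman_err) / (1 - gamma)
print(J_dm_mdp, J_drl, J_true_mdp)
```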
