Kullback–Leibler (KL) Divergence

Revised September 2, 2025

The Importance Sampling note is optional but recommended background reading.

Importance sampling expresses the return of a target policy \(\pi\) using trajectories generated by a behavior policy \(\mu\):

\[ \begin{equation*} \mathbb{E}_{\tau \sim p_\pi}[G(\tau)] = \mathbb{E}_{\tau \sim p_\mu}\!\left[\frac{p_\pi(\tau)}{p_\mu(\tau)}\,G(\tau)\right], \end{equation*} \]

where \(\tau=(s_0,a_0,\dots,s_{T-1},a_{T-1},s_T)\) and \(p_\pi(\tau)=\rho_0(s_0)\prod_{t=0}^{T-1}\pi(a_t\mid s_t)\,P(s_{t+1}\mid s_t,a_t)\), with the same initial-state distribution \(\rho_0\) and transition kernel \(P\) for \(\pi\) and \(\mu\).
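
The identity is easy to verify numerically in a one-step setting. The sketch below is a toy check, not part of the derivation: the action probabilities and rewards are made-up, and the expected reward under \(\pi\) is estimated from samples drawn under \(\mu\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-step problem: two actions with fixed rewards (made-up numbers).
rewards = np.array([1.0, 0.0])
pi = np.array([0.9, 0.1])   # target policy pi(a)
mu = np.array([0.5, 0.5])   # behavior policy mu(a)

# Sample actions from the behavior policy.
actions = rng.choice(2, size=100_000, p=mu)

# Importance-sampling estimate of E_{a ~ pi}[r(a)] using behavior samples.
weights = pi[actions] / mu[actions]
print(np.mean(weights * rewards[actions]))   # approximately 0.9
print(pi @ rewards)                          # exact value under pi: 0.9
```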

Because the initial-state distribution \(\rho_0\) and the transition probabilities \(P\) are shared by both policies, they cancel, and the importance weight for a trajectory factorizes into a product of per-step action-probability ratios:

\[ \begin{equation*} \frac{p_\pi(\tau)}{p_\mu(\tau)} = \prod_{t=0}^{T-1} \frac{\pi(a_t\mid s_t)}{\mu(a_t\mid s_t)}. \end{equation*} \]

This product has high variance when \(\pi\) and \(\mu\) differ substantially. Moreover, its multiplicative structure is numerically unstable and obscures the contribution of individual decisions. Applying a logarithm converts products into sums, yielding a per-step additive form:

\[ \begin{equation*} \log \frac{p_\pi(\tau)}{p_\mu(\tau)} = \sum_{t=0}^{T-1} \log \frac{\pi(a_t\mid s_t)}{\mu(a_t\mid s_t)} \;. \end{equation*} \]

This decomposition is more stable and isolates the influence of each action on the total ratio. Averaging this quantity under \(p_\pi\) produces:

\[ \begin{equation*} \mathbb{E}_{\tau \sim p_\pi}\!\left[ \log \frac{p_\pi(\tau)}{p_\mu(\tau)} \right], \end{equation*} \]

which defines the Kullback–Leibler (KL) divergence:

\[ \begin{equation}\label{eq:kl-divergence} D_{KL}(p_\pi \,\Vert\, p_\mu) = \mathbb{E}_{\tau \sim p_\pi}\!\left[ \log \frac{p_\pi(\tau)}{p_\mu(\tau)} \right] \;. \end{equation} \]
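
A minimal Monte Carlo sketch of this definition, assuming a toy MDP with shared deterministic dynamics and made-up categorical policies: the per-step log-ratios are accumulated as in the additive form above, and their average over trajectories sampled under \(\pi\) estimates \(D_{KL}(p_\pi \Vert p_\mu)\).

```python
import numpy as np

rng = np.random.default_rng(0)
T, N_STATES, N_ACTIONS = 50, 4, 3

# Made-up state-conditional policies (rows: states, columns: action probabilities).
pi = rng.dirichlet(np.full(N_ACTIONS, 5.0), size=N_STATES)
mu = rng.dirichlet(np.full(N_ACTIONS, 5.0), size=N_STATES)

def sample_log_ratios(n_traj=2_000):
    """Sample trajectories under pi; dynamics are deterministic and shared by
    both policies, so only the per-step action ratios enter the weight."""
    totals = np.zeros(n_traj)
    for i in range(n_traj):
        s, total = 0, 0.0
        for _ in range(T):
            a = rng.choice(N_ACTIONS, p=pi[s])
            total += np.log(pi[s, a]) - np.log(mu[s, a])  # per-step log-ratio
            s = (s + a) % N_STATES                        # toy dynamics
        totals[i] = total
    return totals

log_ratios = sample_log_ratios()
# Averaging the summed log-ratios under p_pi estimates D_KL(p_pi || p_mu).
print("KL estimate:", log_ratios.mean())
# The corresponding trajectory weights: a long product of per-step ratios can
# over- or underflow, whereas the accumulated log-sum stays well scaled.
print("weight range:", np.exp(log_ratios).min(), np.exp(log_ratios).max())
```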

Importance sampling fails when \(\pi(a\mid s)>0\) but \(\mu(a\mid s)=0\), since the ratio \(\tfrac{\pi(a\mid s)}{\mu(a\mid s)}\) is undefined. More generally, stable estimates require the per-step ratios \(\tfrac{\pi(\cdot\mid s)}{\mu(\cdot\mid s)}\) to remain near one on the support of \(d_\pi\), the discounted state visitation distribution under \(\pi\). This condition corresponds to a small KL divergence between \(p_\pi\) and \(p_\mu\). TRPO constrains the expected per-state KL between the new and old policy under the old state distribution; PPO achieves a similar effect via clipping.
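
The sketch below illustrates, with made-up categorical action distributions, the two quantities referred to here: a mean per-state KL of the kind TRPO keeps below a threshold, and the PPO clipped surrogate, which bounds the per-step ratio around one rather than constraining the KL directly. The advantage estimates are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 6, 4

# Made-up categorical policies evaluated at states visited under the old policy.
old_probs = rng.dirichlet(np.ones(n_actions), size=n_states)
new_probs = rng.dirichlet(np.ones(n_actions), size=n_states)

# Mean per-state KL(old || new): the quantity TRPO keeps below a threshold.
per_state_kl = np.sum(old_probs * (np.log(old_probs) - np.log(new_probs)), axis=1)
print("mean per-state KL:", per_state_kl.mean())

# PPO instead clips the per-step ratio pi_new / pi_old around one.
actions = np.array([rng.choice(n_actions, p=p) for p in old_probs])
advantages = rng.normal(size=n_states)                     # placeholder estimates
idx = np.arange(n_states)
ratios = new_probs[idx, actions] / old_probs[idx, actions]
eps = 0.2
clipped = np.minimum(ratios * advantages,
                     np.clip(ratios, 1 - eps, 1 + eps) * advantages)
print("PPO clipped surrogate:", clipped.mean())
```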

In maximum-entropy reinforcement learning such as SAC, KL divergence acts as a projection onto \(P_{\text{opt}}\), the distribution over optimal trajectories. Minimizing \(D_{KL}(p_\pi \,\Vert\, P_{\text{opt}})\) shifts probability mass toward high-return trajectories:

\[ \begin{align} D_{KL}(p_\pi \Vert P_{\text{opt}}) &= \mathbb{E}_{\tau \sim p_\pi}\!\left[\log p_\pi(\tau) - \log P_{\text{opt}}(\tau)\right] && \text{by Equation } \eqref{eq:kl-divergence} \label{eq:entropy-kl} \\ &= \mathbb{E}_{\tau \sim p_\pi}[\log p_\pi(\tau)] - \mathbb{E}_{\tau \sim p_\pi} \left[ \tfrac{1}{\alpha}\sum_t \gamma^t r(s_t,a_t) - \log Z \right] && \text{with } P_{\text{opt}}(\tau)=\tfrac{1}{Z}\exp\!\Big(\tfrac{1}{\alpha}\sum_t \gamma^t r(s_t,a_t)\Big) \nonumber \\ &= -H(p_\pi) - \tfrac{1}{\alpha}\mathbb{E}_{\tau \sim p_\pi}\!\Big[\sum_t \gamma^t r(s_t,a_t)\Big] + \log Z && \text{where } H(p_\pi) = -\,\mathbb{E}_{\tau \sim p_\pi}[\log p_\pi(\tau)] \nonumber \end{align} \]

Since \(\log Z\) and \(\alpha\) are constant with respect to \(\pi\), minimizing Equation \(\eqref{eq:entropy-kl}\) is equivalent to maximizing:

\[ \begin{equation*} \mathbb{E}_{\tau \sim p_\pi}\!\Big[\sum_{t=0}^{T-1} \gamma^t r(s_t,a_t)\Big] + \alpha H(p_\pi) \;, \end{equation*} \]

which coincides with the maximum-entropy reinforcement learning objective.
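
The equivalence can be checked numerically on a finite set of trajectories. The returns and the trajectory distribution \(p_\pi\) below are arbitrary made-up values; the check verifies that \(D_{KL}(p_\pi \Vert P_{\text{opt}}) = -\tfrac{1}{\alpha}\big(\mathbb{E}_{\tau \sim p_\pi}[\text{return}] + \alpha H(p_\pi)\big) + \log Z\), so the two objectives differ only by constants.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.5

# Made-up discounted returns for a finite set of candidate trajectories.
returns = rng.normal(size=8)

# P_opt(tau) proportional to exp(return(tau) / alpha), with log-partition log Z.
logits = returns / alpha
log_Z = np.log(np.sum(np.exp(logits)))
log_p_opt = logits - log_Z

# An arbitrary trajectory distribution p_pi.
p_pi = rng.dirichlet(np.ones(8))

kl = np.sum(p_pi * (np.log(p_pi) - log_p_opt))
entropy = -np.sum(p_pi * np.log(p_pi))
expected_return = np.sum(p_pi * returns)

# D_KL(p_pi || P_opt) = -(1/alpha) * (E[return] + alpha * H(p_pi)) + log Z
print(kl, -(expected_return + alpha * entropy) / alpha + log_Z)
```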

Equation \(\eqref{eq:kl-divergence}\), the reverse KL divergence, tends to be mode-seeking. Because the expectation is taken under \(p_\pi\), regions where \(p_\pi(\tau)=0\) contribute nothing even if \(p_\mu(\tau)\) is large, while placing mass where \(p_\mu(\tau)\approx 0\) incurs an infinite (or very large) penalty. This asymmetry encourages \(p_\pi\) to concentrate on one (or a few) high-probability modes of \(p_\mu\) rather than spreading mass across all modes. This behavior is consistent with RL in fully observed MDPs with an expected-return objective, where there exists an optimal deterministic policy that concentrates on \(\arg\max_a Q^*(s,a)\) (though stochastic optima can arise with ties, partial observability, or explicit regularization).

[Figure: the reverse formulation of KL divergence. \(p_\mu\) is a bimodal distribution and \(p_\pi\) is a Gaussian approximation; reverse KL places mass on a single mode of \(p_\mu\).]

Forward KL Divergence

Reverse KL divergence is natural in policy optimization, since expectations are taken under the policy \(p_\pi\), which can be sampled directly. In supervised learning, by contrast, samples from the data distribution \(p_\mu\) are available, but its density is unknown. A parametric model \(p_\pi\) is therefore introduced to approximate \(p_\mu\), leading to the forward KL divergence, where the expectation in Equation \(\eqref{eq:kl-divergence}\) is taken with respect to \(p_\mu\):

\[ \begin{equation}\label{eq:forward-kl} D_{KL}(p_\mu \,\Vert\, p_\pi) = \mathbb{E}_{\color{red}{\tau \sim p_\mu}} \left[ \log \frac{p_\mu(\tau)}{p_\pi(\tau)} \right] \;. \end{equation} \]

Terms with \(p_\mu(\tau)=0\) vanish, so only regions where \(p_\mu\) has support contribute. Forward KL divergence thus penalizes \(p_\pi\) for assigning low probability where \(p_\mu\) is positive, but not for placing mass where \(p_\mu\) is zero. Minimization therefore requires \(p_\pi\) to cover the full support of \(p_\mu\).

[Figure: the forward formulation of KL divergence. \(p_\mu\) is a bimodal distribution and \(p_\pi\) is a Gaussian approximation; because \(p_\pi\) must spread mass across the support of \(p_\mu\), forward KL is characterized as mean-seeking.]
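
The contrast between the two figures can be reproduced on a grid. The bimodal target and the two Gaussian candidates below are made-up: a narrow Gaussian sitting on one mode and a wide Gaussian covering both. Reverse KL favors the narrow candidate, forward KL the wide one.

```python
import numpy as np

def log_normal_pdf(x, mean, std):
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std * np.sqrt(2 * np.pi))

x = np.linspace(-8, 8, 4001)
dx = x[1] - x[0]

# Bimodal target p_mu: an equal mixture of two well-separated Gaussians.
p_mu = 0.5 * np.exp(log_normal_pdf(x, -3, 0.5)) + 0.5 * np.exp(log_normal_pdf(x, 3, 0.5))

# Two Gaussian candidates for p_pi: one on a single mode, one covering both.
candidates = {
    "narrow (one mode)": np.exp(log_normal_pdf(x, 3, 0.5)),
    "wide (both modes)": np.exp(log_normal_pdf(x, 0, 3.0)),
}

def kl(p, q):
    """Discretized KL(p || q); cells where p is (numerically) zero contribute nothing."""
    mask = p > 0
    return np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))) * dx

for name, p_pi in candidates.items():
    print(name,
          " reverse KL(p_pi || p_mu):", round(kl(p_pi, p_mu), 2),
          " forward KL(p_mu || p_pi):", round(kl(p_mu, p_pi), 2))
# The narrow candidate wins under reverse KL (mode-seeking); the wide one wins
# under forward KL (mean-seeking), since it covers the full support of p_mu.
```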

Forward KL coincides with maximum likelihood estimation, which provides the statistical foundation for standard supervised learning objectives. Given samples from the data distribution \(p_\mu\), a parametric model \(p_\pi\) is chosen to minimize \(D_{KL}(p_\mu \Vert p_\pi)\):

\[ \begin{align*} D_{KL}(p_\mu \Vert p_\pi) &= \mathbb{E}_{x \sim p_\mu}\!\left[\log p_\mu(x) - \log p_\pi(x)\right] && \text{by Equation } \eqref{eq:forward-kl} \\ &= \mathbb{E}_{x \sim p_\mu}\!\left[\log p_\mu(x)\right] - \mathbb{E}_{x \sim p_\mu}\!\left[\log p_\pi(x)\right] && \text{linearity of expectation} \\ &= -\mathbb{E}_{x \sim p_\mu}[\log p_\pi(x)] - H(p_\mu) && \text{definition of Shannon entropy.} \end{align*} \]

The entropy term \(H(p_\mu)\) is independent of \(\pi\), so minimizing forward KL divergence is equivalent to maximizing the log-likelihood of the data under \(p_\pi\).

For classification, let \(p_\pi(y\mid x)\) be the categorical distribution defined by \(f_\pi(x)\). The cross-entropy loss:

\[ \begin{equation*} \mathcal{L}_{\text{CE}} = \mathbb{E}_{(x,y)\sim p_\mu}\!\big[-\log p_\pi(y\mid x)\big] \end{equation*} \]

differs from \(D_{KL}(p_\mu\Vert p_\pi)\) only by the constant \(H(p_\mu)\). Minimizing cross-entropy is therefore equivalent to minimizing forward KL.
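
A direct check of this relationship, using soft labels so that \(H(p_\mu)\) is nonzero; the two distributions below are made-up.

```python
import numpy as np

# Made-up target label distribution and model prediction for a single input x.
p_mu = np.array([0.7, 0.2, 0.1])    # data distribution p_mu(y | x)
p_pi = np.array([0.5, 0.3, 0.2])    # model distribution p_pi(y | x)

cross_entropy = -np.sum(p_mu * np.log(p_pi))
entropy = -np.sum(p_mu * np.log(p_mu))
forward_kl = np.sum(p_mu * (np.log(p_mu) - np.log(p_pi)))

# Cross-entropy = forward KL + H(p_mu); the entropy term does not depend on p_pi.
print(cross_entropy, forward_kl + entropy)
```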

For regression, if \(p_\pi(y\mid x)\) is Gaussian with mean \(f_\pi(x)\) and fixed variance, then:

\[ \begin{equation*} -\log p_\pi(y\mid x)=\tfrac{1}{2\sigma^2}\,\lVert y-f_\pi(x)\rVert_2^2 + C \end{equation*} \]

where \(C\) collects terms that do not depend on \(f_\pi(x)\). Thus, minimizing mean squared error coincides with minimizing forward KL under the Gaussian model. Both cross-entropy (classification) and squared error (regression) are therefore special cases of forward KL minimization.
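
A numeric check of the Gaussian case, with arbitrary made-up targets and predictions: the negative log-density equals the scaled squared error plus a constant \(C\) independent of \(f_\pi(x)\).

```python
import numpy as np

sigma = 0.5                       # fixed, known standard deviation
y = np.array([1.0, -2.0, 0.5])    # target
f_x = np.array([0.8, -1.5, 0.0])  # model prediction f_pi(x)

# Negative log-density of an isotropic Gaussian with mean f_x and variance sigma^2.
d = y.size
nll = 0.5 * np.sum((y - f_x) ** 2) / sigma**2 + 0.5 * d * np.log(2 * np.pi * sigma**2)

# The same quantity written as scaled squared error plus a constant C.
mse_term = np.sum((y - f_x) ** 2) / (2 * sigma**2)
C = 0.5 * d * np.log(2 * np.pi * sigma**2)
print(nll, mse_term + C)
```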
