```python
n, m = 10, 1
alpha = 0.5
s2 = 1.0
tau2 = s2 / (1 - alpha**2)
mu = jnp.zeros((n + 1, 1))
consecutive_covs = tau2 * jnp.broadcast_to(
    jnp.array([[1.0, alpha], [alpha, 1.0]]), (n, 2, 2)
)
proposal = proposal_from_moments(mu, consecutive_covs)

fct.test_eq(proposal.R.shape, (n + 1, 1, 1))
fct.test_eq(proposal.J_tt.shape, (n, 1, 1))
fct.test_eq(proposal.J_tp1t.shape, (n, 1, 1))
fct.test_close(proposal.mean, mu)
fct.test_close(proposal.R[0], jnp.sqrt(tau2) * jnp.eye(1))
fct.test_close(proposal.R[1:], jnp.sqrt(s2) * jnp.eye(1))
fct.test_close(proposal.J_tp1t / proposal.J_tt, jnp.full((n, 1, 1), alpha))
```
Cross-Entropy method
> [!CAUTION]
> This module is still under construction.
The Cross-Entropy method (Rubinstein 1997; Rubinstein and Kroese 2004) is a method for determining good importance sampling proposals. Given a parametric family of proposals \(g_\theta(x)\) and a target \(p(x)\), the Cross-Entropy method chooses \(\theta\) such that the cross-entropy \[ \mathcal H_{\text{CE}} \left( p \middle|\middle| g_{\theta} \right) = \int p(x) \log g_{\theta}(x) \, \mathrm d x \] is maximized. This is equivalent to minimizing the Kullback-Leibler divergence between \(p\) and \(g_\theta\). As \(\mathcal H_{\text{CE}}\) is not analytically available, it is itself approximated by importance sampling, usually with a suitable proposal \(g_{\hat\theta_0}\). The resulting approximate optimization problem is then solved, yielding \(\hat \theta_1\). These steps are iterated, using common random numbers across iterations to ensure convergence.
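As a minimal illustration of this iteration, the sketch below fits a one-dimensional Gaussian proposal to an unnormalized target by repeated weighted moment matching, which is exactly the CE update for a Gaussian family. The target `log_p` and all names are chosen for illustration only and are unrelated to the state space models treated below.

```python
import jax.numpy as jnp
import jax.random as jrn
from jax.scipy.special import logsumexp
from jax.scipy.stats import norm


def log_p(x):
    # unnormalized target (a Laplace density centred at 3), for illustration only
    return -jnp.abs(x - 3.0)


def ce_step(theta, eps):
    mu, sigma = theta
    x = mu + sigma * eps                                 # reuse eps: common random numbers
    log_w = log_p(x) - norm.logpdf(x, mu, sigma)         # importance sampling log-weights
    w = jnp.exp(log_w - logsumexp(log_w))                # self-normalized weights
    new_mu = jnp.sum(w * x)                              # weighted moment matching ...
    new_sigma = jnp.sqrt(jnp.sum(w * (x - new_mu) ** 2)) # ... is the CE update for Gaussians
    return new_mu, new_sigma


eps = jrn.normal(jrn.PRNGKey(0), (1000,))
theta = (0.0, 1.0)
for _ in range(10):
    theta = ce_step(theta, eps)  # approaches the target's mean 3 and standard deviation sqrt(2)
```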
Considering the Cross-Entropy method with a Gaussian proposal \(g_\theta\), we see that the optimal \(\theta\) depends only on the first and second order moments of \(p\): indeed, the optimal Gaussian is the one that matches these moments. Unfortunately, this approach is not feasible for the models we consider in this package, as the dimensionality (\(n \cdot m\)) is likely too high to act on the joint distribution directly: matching means is feasible, but simulating from the distribution and evaluating its likelihood is not. However, we can exploit the Markov structure of our models:
For the class of state space models treated in this package, it can be shown that the smoothing distribution \(p(x|y)\), the target of our inference, is again a Markov process, see Chapter 5 in (Chopin and Papaspiliopoulos 2020), so it makes sense to approximate this distribution by a Gaussian Markov process. Thus we only need to find the closest (in terms of KL-divergence) Gaussian Markov process, which is feasible: it is obtained by choosing the approximation to match the mean and the consecutive covariances, i.e. to match \[ \operatorname{Cov} \left( (X_{t}, X_{t + 1}) \right) \in \mathbf R^{2m \times 2m} \] for all \(t = 0, \dots, n - 1\). These are only \(\mathcal O(nm^2)\) parameters, instead of the \(\mathcal O(n^2m^2)\) parameters required to match the whole covariance matrix.
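To make these orders concrete, here is a quick count of stored entries for an arbitrarily chosen \(n = 100\), \(m = 2\) (ignoring the symmetry of the covariance blocks, which only changes constants):

```python
n, m = 100, 2
markov_params = (n + 1) * m + n * (2 * m) ** 2   # means plus n consecutive (2m x 2m) blocks
full_params = (n + 1) * m + ((n + 1) * m) ** 2   # means plus the full joint covariance matrix
markov_params, full_params                       # (1802, 41006)
```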
proposal_from_moments
proposal_from_moments (mean:jaxtyping.Float[Array,'n+1 m'], consecutive_covs:jaxtyping.Float[Array,'n 2*m 2*m'])
Find the unique Gaussian Markov proposal that matches means and consecutive covariances.
|  | Type | Details |
|---|---|---|
| mean | Float[Array, 'n+1 m'] | mean \(v\) |
| consecutive_covs | Float[Array, 'n 2m 2m'] |  |
| Returns | MarkovProposal | corresponding proposal |
To verify that this produces the correct proposal, let us check it for a stationary AR(1) process with mean \(0\) and recurrence \[ \begin{align*} X_{t + 1} = \alpha X_{t} + \varepsilon_{t} && \varepsilon_{t} \sim \mathcal N(0, \sigma^{2}) \end{align*} \]
The initial variance is \(R_{0}^{2} = \tau^{2} = \operatorname{Var} (X_0) = \frac{\sigma^{2}}{1 - \alpha^{2}}\), the joint covariances are \[ \operatorname{Cov}(X_{t}, X_{t + 1}) = \tau^{2}\begin{pmatrix} 1 & \alpha \\ \alpha & 1 \end{pmatrix}, \] and the innovation covariances are \(R_{t}^2 = \sigma^{2}\) for \(t \geq 1\).
Simulation
Given a Markov proposal, we can simulate from it by repeatedly applying its defining recurrence.
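The following is a minimal sketch of that recurrence for plain arrays, assuming the proposal is given by its mean \(v\), transition matrices \(C_t\) and innovation Cholesky roots \(R_t\); the array-based interface and all names are illustrative, while the packaged `simulate_cem` below operates on a `MarkovProposal` and draws `N` paths at once.

```python
import jax
import jax.numpy as jnp
import jax.random as jrn


def simulate_markov(v, C, R, key):
    """Draw one path of the Gaussian Markov process
    X_0 = v_0 + R_0 eps_0,  X_{t+1} = v_{t+1} + C_t (X_t - v_t) + R_{t+1} eps_{t+1}.

    v: (n+1, m) means, C: (n, m, m) transitions, R: (n+1, m, m) innovation roots."""
    eps = jrn.normal(key, v.shape)           # standard normal innovations
    x0 = v[0] + R[0] @ eps[0]

    def step(x_prev, inputs):
        v_t, v_tp1, C_t, R_tp1, eps_tp1 = inputs
        x = v_tp1 + C_t @ (x_prev - v_t) + R_tp1 @ eps_tp1
        return x, x

    _, xs = jax.lax.scan(step, x0, (v[:-1], v[1:], C, R[1:], eps[1:]))
    return jnp.concatenate([x0[None], xs], axis=0)   # shape (n+1, m)
```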
simulate_cem
simulate_cem (proposal:isssm.typing.MarkovProposal, N:int, key:Union[jaxtyping.Key[Array,''],jaxtyping.UInt32[Array,'2']])
|  | Type | Details |
|---|---|---|
| proposal | MarkovProposal | proposal |
| N | int | number of samples |
| key | Union | random number seed |
| Returns | Float[Array, 'N n+1 m'] |  |
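Assuming the AR(1) proposal constructed above, a call might look as follows (the sample count and key are arbitrary):

```python
import jax.random as jrn

samples = simulate_cem(proposal, 1000, jrn.PRNGKey(0))  # shape (N, n + 1, m) = (1000, 11, 1)
```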
probability density function
For importance sampling we need to evaluate the pdf of the proposal. We do this by first subtracting the mean \(v\), i.e. setting \(U = X - v\), and then going back to the innovations \[ \varepsilon_{t} = U_{t} - C_{t - 1}U_{t - 1} \sim \mathcal N(0, R_{t}R_{t}^T), \] where \(\varepsilon_0 = U_0\). These are jointly independent, so their joint pdf is easy to compute.
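A minimal sketch of this computation, again for plain arrays \(v\), \(C_t\), \(R_t\) as above; the packaged `log_pdf` documented below takes a `MarkovProposal` instead, and all names here are illustrative.

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import multivariate_normal as mvn


def log_pdf_markov(x, v, C, R):
    """Log-density of a Gaussian Markov path x via its innovations.

    x, v: (n+1, m); C: (n, m, m) transitions; R: (n+1, m, m) innovation roots."""
    u = x - v                                                   # U = X - v
    eps = jnp.concatenate(
        [u[:1], u[1:] - jnp.einsum("tij,tj->ti", C, u[:-1])], axis=0
    )                                                           # eps_0 = U_0, eps_t = U_t - C_{t-1} U_{t-1}
    covs = jnp.einsum("tij,tkj->tik", R, R)                     # R_t R_t^T
    means = jnp.zeros_like(u)
    return jnp.sum(jax.vmap(mvn.logpdf)(eps, means, covs))     # innovations are independent
```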
For the log-weights, note that we cannot use the same weights as for MEIS, since \(p(x) \neq g(x)\). Instead we have to calculate \[ \log w(x) = \log p(x,y) - \log g(x). \]
log_weight_cem
log_weight_cem (x:jaxtyping.Float[Array,'n+1 m'], y:jaxtyping.Float[Array,'n+1 p'], model:isssm.typing.PGSSM, proposal:isssm.typing.MarkovProposal)
|  | Type | Details |
|---|---|---|
| x | Float[Array, 'n+1 m'] | point at which to evaluate the weights |
| y | Float[Array, 'n+1 p'] | observations |
| model | PGSSM | model |
| proposal | MarkovProposal | proposal |
| Returns | Float | log weights |
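In practice the weights are needed for all \(N\) samples at once; with the signature documented above this can be done with `jax.vmap`, e.g.:

```python
import jax

log_w = jax.vmap(log_weight_cem, in_axes=(0, None, None, None))(samples, y, model, proposal)
```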
log_pdf
log_pdf (x:jaxtyping.Float[Array,'n+1 m'], proposal:isssm.typing.MarkovProposal)
SSM to Markov Model
To initialize the Cross-Entropy method, we use the Laplace approximation, see [30_laplace_approximation.ipynb], which approximates the true posterior by the posterior of a Gaussian state space model. To initiate the Cross-Entropy procedure, we determine the Cholesky root of this Gaussian posterior and use it as an initial value. To determine the diagonal and off-diagonal components of the Cholesky root, we calculate the joint covariance matrices \(\text{Cov} \left( X_t, X_{t + 1} | Y_1, \dots, Y_n \right)\) using the Kalman smoother and the FFBS, which results in \[ \text{Cov} \left( X_t, X_{t + 1} | Y_1, \dots, Y_n \right) = \begin{pmatrix} \Xi_{t|n} & \Xi_{t|t} A_t^T \Xi_{t + 1|t}^{-1} \Xi_{t + 1|n} \\ \left(\Xi_{t|t} A_t^T \Xi_{t + 1|t}^{-1} \Xi_{t + 1|n} \right)^T & \Xi_{t + 1 | n} \end{pmatrix}. \]
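A sketch of the off-diagonal block computation, assuming the smoother provides filtered covariances \(\Xi_{t|t}\), predicted covariances \(\Xi_{t|t-1}\), smoothed covariances \(\Xi_{t|n}\) and transition matrices \(A_t\) as arrays; all names are illustrative and not the actual Kalman smoother interface of this package.

```python
import jax
import jax.numpy as jnp


def cross_covs(Xi_filt, Xi_pred, Xi_smooth, A):
    """Off-diagonal blocks Xi_{t|t} A_t^T Xi_{t+1|t}^{-1} Xi_{t+1|n} of Cov(X_t, X_{t+1} | Y).

    Xi_filt, Xi_pred, Xi_smooth: (n+1, m, m) with Xi_pred[t] = Xi_{t|t-1}; A: (n, m, m)."""

    def single(Xi_tt, A_t, Xi_tp1t, Xi_tp1n):
        # solve Xi_{t+1|t} Z = Xi_{t+1|n} instead of forming the inverse explicitly
        return Xi_tt @ A_t.T @ jnp.linalg.solve(Xi_tp1t, Xi_tp1n)

    return jax.vmap(single)(Xi_filt[:-1], A, Xi_pred[1:], Xi_smooth[1:])
```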
posterior_markov_proposal
posterior_markov_proposal (y:jaxtyping.Float[Array,'n+1 p'], model:isssm.typing.GLSSM)
calculate the Markov proposal of the smoothing distribution using the Kalman smoother
|  | Type | Details |
|---|---|---|
| y | Float[Array, 'n+1 p'] | observations |
| model | GLSSM | model |
| Returns | MarkovProposal | Markov proposal of posterior X\|Y |
The Cross-Entropy Method
Finally, we have all the ingredients to apply the CE-method to perform importance sampling in a PGSSM with observations \(y\). We start by calculating the LA, convert its posterior distribution to a Markov proposal, and then repeatedly sample and update the proposal.
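Schematically, a single CE update built from the functions documented on this page could look as follows; the weighted moment matching is a simplified sketch and may differ from what `cross_entropy_method` does internally (for example, it has no numerical safeguards).

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp


def cem_iteration(proposal, model, y, N, key):
    """One CE step: sample from the current proposal, weight, re-fit by moment matching."""
    samples = simulate_cem(proposal, N, key)                                  # (N, n+1, m)
    log_w = jax.vmap(log_weight_cem, (0, None, None, None))(samples, y, model, proposal)
    w = jnp.exp(log_w - logsumexp(log_w))                                     # self-normalized weights

    mean = jnp.einsum("i,itm->tm", w, samples)                                # weighted means, (n+1, m)
    pairs = jnp.concatenate([samples[:, :-1], samples[:, 1:]], axis=-1)       # (N, n, 2m)
    centred = pairs - jnp.concatenate([mean[:-1], mean[1:]], axis=-1)
    consecutive_covs = jnp.einsum("i,itj,itk->tjk", w, centred, centred)      # (n, 2m, 2m)

    return proposal_from_moments(mean, consecutive_covs)
```

Reusing the same `key` in every such iteration corresponds to the common random numbers mentioned above.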
cross_entropy_method
cross_entropy_method (model:isssm.typing.PGSSM, y:jaxtyping.Float[Array,'n+1 p'], N:int, key:Union[jaxtyping.Key[Array,''],jaxtyping.UInt32[Array,'2']], n_iter:int)
iteratively perform the CEM to find an optimal proposal
|  | Type | Details |
|---|---|---|
| model | PGSSM | model |
| y | Float[Array, 'n+1 p'] | observations |
| N | int | number of samples to use in the CEM |
| key | Union | random number seed |
| n_iter | int | number of iterations |
| Returns | MarkovProposal | the CEM proposal |
Let us perform the CE-method on our example model. We’ll set the number of observations somewhat lower than for EIS, as the CE-method is less performant in this setting.
```python
from isssm.importance_sampling import pgssm_importance_sampling

# set up the running example model and simulate observations
s_order = 5
model = nb_pgssm_running_example(
    n=100,
    s_order=s_order,
    Sigma0_seasonal=jnp.eye(s_order - 1),
    x0_seasonal=jnp.zeros(s_order - 1),
)
key = jrn.PRNGKey(511)
key, subkey = jrn.split(key)
_, (y,) = simulate_pgssm(model, 1, subkey)

# baseline: importance sampling with the LA proposal
proposal_la, info_la = laplace_approximation(y, model, 10)
key, subkey = jrn.split(key)
samples_la, log_w_la = pgssm_importance_sampling(
    y, model, proposal_la.z, proposal_la.Omega, 1000, subkey
)

# run the CE-method and compare effective sample sizes
key, subkey = jrn.split(subkey)
proposal, log_w = cross_entropy_method(model, y, 10000, subkey, 10)
ess_pct(log_w), ess_pct(log_w_la)
```
(Array(nan, dtype=float64), Array(10.93038869, dtype=float64))
The CEM can improve on the LA, but requires more samples than MEIS to do so.