SRN – Markovian Score Climbing by Naesseth et al. from NeurIPS 2020

This Sunday Reading Notes post is about a recent article on variational inference (VI). 

In variational inference, we approximate a posterior distribution {p(z \mid x)} by finding a distribution {q(z ; \lambda^{\star})} that is the `closest' to {p(z \mid x)} within a family of distributions {Q = \{q(z;\lambda)\}}. Once a divergence between {p} and {q} has been chosen, we can rely on optimization algorithms such as stochastic gradient descent to find {\lambda^{\star}.}

The `exclusive' Kullback-Leibler (KL) divergence has been popular in VI, due to the ease of working with an expectation with respect to the approximating distribution {q}. This article, however, considers the `inclusive' KL

\displaystyle \mathrm{KL}(p \| q) = \mathbb{E}_{p(z \mid x)} \left[ \log \frac{p(z \mid x)}{q(z ; \lambda)} \right].

Minimizing {\mathrm{KL}(p\| q)} is equivalent to minimizing the cross entropy {L_{\mathrm{KL}} = \mathbb{E}_{p(z \mid x)}[ - \log q(z ; \lambda)],} whose gradient is
\displaystyle g_{\mathrm{KL}}(\lambda) := \nabla_{\lambda} L_{\mathrm{KL}}(\lambda) = \mathbb{E}_{p(z \mid x)}\left[- \nabla_{\lambda} \log q(z; \lambda)\right].

If we can find unbiased estimates of {g_{\mathrm{KL}}(\lambda)}, then with a Robbins-Monro schedule {\{\varepsilon_k\}_{k=1}^{\infty}}, we can use stochastic gradient descent to approximate {\lambda^{\star}.}

This article proposes Markovian Score Climbing (MSC) as another way to approximate {\lambda^{\star}}. Given a Markov kernel {M(z' \mid z;\lambda)} that leaves the posterior distribution {p(z \mid x)} invariant, one step of the MSC iteration operates as follows.

(*) Sample {z_k \sim M( \cdot \mid z_{k-1}; \lambda_{k-1})}.
(*) Compute the gradient {\nabla \log q(z_k; \lambda_{k-1}).}
(*) Set {\lambda_{k} = \lambda_{k-1} + \varepsilon_k \nabla \log q(z_k; \lambda_{k-1}).}
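To make the recursion concrete, here is a minimal sketch in R. The specific choices are my own for illustration: the target is a one-dimensional skew normal (the paper's illustrating example), {q} is Gaussian with {\lambda = (\mu, \log \sigma)}, and {M} is a random walk Metropolis kernel that leaves the target invariant.

log_p <- function(z) dnorm(z, log = TRUE) + pnorm(3 * z, log.p = TRUE) # skew normal target, up to a constant
grad_log_q <- function(z, lambda) {
  mu <- lambda[1]; sigma <- exp(lambda[2])
  c((z - mu) / sigma^2,            # d log q / d mu
    (z - mu)^2 / sigma^2 - 1)      # d log q / d log(sigma)
}
mh_kernel <- function(z, step = 1) { # leaves p(z | x) invariant
  z_new <- z + step * rnorm(1)
  if (log(runif(1)) < log_p(z_new) - log_p(z)) z_new else z
}
set.seed(1)
lambda <- c(0, 0); z <- 0
for (k in 1:10000) {
  z <- mh_kernel(z)                # sample z_k ~ M(. | z_{k-1}; lambda_{k-1})
  g <- grad_log_q(z, lambda)       # score of q at z_k
  lambda <- lambda + g / (100 + k) # Robbins-Monro step sizes
}
lambda                             # approximates lambda_star

For a Gaussian family, minimizing the inclusive KL amounts to matching the mean and variance of the target, which is what {\lambda_k} approaches here.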

The authors prove that {\lambda_k \to \lambda^{\star}} almost surely and illustrate the method on the skew normal distribution. One advantage of MSC is that only one sample is required per {\lambda} update. Also, the Markov kernel {M(z' \mid z;\lambda)} provides a systematic way of incorporating information from the current sample {z_k} and the current parameter {\lambda_k}. As the authors point out, one example of such a kernel is a conditional SMC update [Section 2.4.3 of Andrieu et al., 2010].

While this article definitely provides a general purpose VI method, I am more intrigued by the MCMC samples {z_k}. What can we say about the samples {\{z_k\}}? Can we make use of them?

References:

Naesseth, C. A., Lindsten, F., & Blei, D. (2020). Markovian score climbing: Variational inference with KL(p || q). Advances in Neural Information Processing Systems, 33.

Andrieu, C., Doucet, A., & Holenstein, R. (2010). Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3), 269-342.

Online workshop: Measuring the quality of MCMC output

“Quality control” for MCMC methods is such an important but overlooked topic. Looking forward to this workshop!

Statisfaction

Hi all,

Together with Leah South from QUT, we are organizing an online workshop on the topic of “Measuring the quality of MCMC output”. The event website, with more info, is here:

https://bayescomp-isba.github.io/measuringquality.html

This is part of the ISBA BayesComp section’s efforts to organize activities while waiting for the next “big” in-person meeting, hopefully in 2023. The event benefits from the generous support of the QUT Centre for Data Science. The event’s website will be regularly updated between now and the event in October 2021, which will feature three live sessions:

  • 11am-2pm UTC on Wednesday 6th October,
  • 1pm-4pm UTC on Thursday 14th October,
  • 3pm-6pm UTC on Friday 22nd October.

Registration is free but compulsory (form here), as we want to make sure the live sessions remain convivial and focused; hence the rather specific theme, but it’s an exciting topic with lots of very open questions…


SRN – evidence and cross-validation

A paper that has intrigued me recently is “On the marginal likelihood and cross-validation” by E. Fong and C. C. Holmes, published in Biometrika.

Model evidence appears as marginal likelihood or prior predictive in this paper:

\displaystyle p_{\mathcal{M}}(y_{1:n}) = \int f_{\theta}(y_{1:n}) d\pi(\theta).

As the paper seeks to establish the connection between model evidence and cross-validation (CV), which is based on predictive performance on held-out test sets, it notes that the log evidence decomposes into a sum of log posterior predictive probabilities:

\displaystyle \log p_{\mathcal{M}}(y_{1:n}) = \sum_{i=1}^{n} \log p_{\mathcal{M}}(y_i \mid y_{1:i-1}) = \sum_{i=1}^{n} \log \int f_{\theta}(y_i) d\pi(\theta \mid y_{1:i-1}).

So log evidence can be interpreted as a predictive sequential scoring rule with score function
\displaystyle s(y_i \mid y_{1:i-1}) = \log p_{\mathcal{M}}(y_i \mid y_{1:i-1}).

The paper begins by arguing that model evidence is the unique scoring rule that guarantees coherency (data ordering should not change the result of inference when data are exchangeable).

Then the authors move on to show the equivalence between evidence and cumulative cross-validation scores. For leave-{p}-out CV, there are {\binom{n}{p}} held-out test sets, and each yields a predictive score when the rest of the data is used for training. The leave-{p}-out CV score, denoted by {S_{CV}(y_{1:n};p)}, is the average of these predictive scores. When the log posterior predictive probability is used as the scoring rule, we have

\displaystyle \log p_{\mathcal{M}}(y_{1:n}) = \sum_{p=1}^n S_{CV}(y_{1:n};p).

So the log evidence is also the sum of leave-{p}-out CV scores, for all values of {p = 1,\ldots,n.}
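The sequential decomposition is easy to verify numerically in a conjugate model. Below is a small check of my own in R for a Beta(1,1)-Bernoulli model, comparing the closed-form log evidence with the sum of one-step-ahead log predictive probabilities:

y <- c(1, 0, 1, 1, 0)
a <- 1; b <- 1                     # Beta(1, 1) prior on the success probability
n <- length(y); s <- sum(y)
log_evidence <- lbeta(a + s, b + n - s) - lbeta(a, b) # closed-form marginal likelihood
log_seq <- 0
for (i in seq_along(y)) {
  s_prev <- sum(y[seq_len(i - 1)])
  p1 <- (a + s_prev) / (a + b + i - 1)                # P(y_i = 1 | y_{1:i-1})
  log_seq <- log_seq + log(ifelse(y[i] == 1, p1, 1 - p1))
}
all.equal(log_evidence, log_seq)                      # TRUE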

SRN – Bayesian regret of Thompson Sampling

Hello 2021! This is my second Sunday Reading Notes post of the year. This week I had the chance to read about Thompson Sampling, a topic I have been interested in since 2018. The paper “On Thompson Sampling for Smoother-than-Lipschitz Bandits” by Grant and Leslie, from Lancaster University, derives bounds on the Bayesian regret of Thompson Sampling for continuum-armed bandits. The regret bound is stated in terms of the eluder dimension and the ball-width function of the class \mathcal{F} containing the reward function f.

Both the eluder dimension and the ball-width function were new concepts to me. I was able to read more about them in Russo and Van Roy (2014) and was intrigued by the concept of the \epsilon-eluder dimension, which is a measure of the complexity of a set of functions. An action a is \epsilon-independent of actions \{a_1,\ldots,a_n\} with respect to \mathcal{F} if two functions in \mathcal{F} can have similar values at \{a_1,\ldots,a_n\} but very different values at a. Based on this notion of \epsilon-independence, the \epsilon-eluder dimension is defined in a similar fashion to how dimension is defined through linear independence in linear algebra.

A key assumption of this framework is that the model is well-specified: the true reward function f is not only included in the function class \mathcal{F}, but is also a draw from the prior distribution p_0. Being a Bayesian statistician, I am naturally curious about the misspecified case. Would the regret bound decompose in terms of the distance from f to \mathcal{F} and the complexity of \mathcal{F}? Is the analysis feasible, and what techniques should we use?

Reference:

Grant, J., & Leslie, D. (2020). On Thompson Sampling for Smoother-than-Lipschitz Bandits. Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research, 108:2612-2622.

Russo, D., & Van Roy, B. (2014). Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4), 1221-1243.

SRN – the magical 0.574

Hello 2021! Here is the first post in the Sunday Reading Notes series of 2021. This week, I keep reading about the optimal scaling of Markov chain Monte Carlo algorithms and move on to a new class of algorithms: MALA. This time, the magical number is 0.574.

Roberts and Rosenthal (1998, JRSSB) concerns the optimal scaling of the Metropolis-adjusted Langevin algorithm (MALA) for target distributions of the form {\pi^d(x) = \prod_{i=1}^d f(x_i)}. Unlike the random walk Metropolis algorithm, MALA's proposal distribution uses the gradient of {\pi^d} at {x_t}:

\displaystyle x^{\star} = x_t + \sigma_d Z_{t+1} + \frac{\sigma_d^2}{2} \nabla \log (\pi^d(x_t)).

Here {\sigma_d^2} is the step variance and {Z_{t+1} \sim \mathcal{N}(0,I_d)} is a d-dimensional standard normal. Following this proposal, we perform an accept/reject step, so that the resulting Markov chain has the desired stationary distribution:
\displaystyle \alpha_d(x_t,x^{\star}) = \min \left\{ 1, \frac{\pi^d(x^{\star})}{\pi^d(x_t)} \frac{q_d(x^{\star},x_t)}{q_d(x_t,x^{\star})}\right\}.
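As a concrete illustration (a toy example of my own, not from the paper), here is a minimal MALA implementation in R for a product of standard normals, using the step variance {\sigma_d^2 = l^2 d^{-1/3}} suggested by the theory below:

set.seed(1)
d <- 100
log_pi <- function(x) -sum(x^2) / 2  # product of d standard normals
grad_log_pi <- function(x) -x
l <- 1.65                            # roughly the optimal constant for this target
sigma2 <- l^2 * d^(-1/3)
x <- rnorm(d); acc <- 0; n_iter <- 10000
for (t in 1:n_iter) {
  mu_x <- x + (sigma2 / 2) * grad_log_pi(x)
  y <- mu_x + sqrt(sigma2) * rnorm(d)           # Langevin proposal
  mu_y <- y + (sigma2 / 2) * grad_log_pi(y)
  log_q_xy <- -sum((y - mu_x)^2) / (2 * sigma2) # log q(x -> y), up to a constant
  log_q_yx <- -sum((x - mu_y)^2) / (2 * sigma2) # log q(y -> x), up to a constant
  if (log(runif(1)) < log_pi(y) - log_pi(x) + log_q_yx - log_q_xy) {
    x <- y; acc <- acc + 1
  }
}
acc / n_iter  # empirical acceptance rate, near 0.574 when l is well tuned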

There are some smoothness assumptions on the log-density {g = \log f}: the absolute values of the first eight derivatives of {g} are bounded, {g'} is Lipschitz continuous, and all moments of {f} exist.

Theorem 3 (diffusion limit of first component)

Consider the chain {\{X^d\}} starting from stationarity and following the MALA algorithm with step variance {\sigma_d^2 = l^2 d^{-1/3}} for some {l > 0}. Let {\{J_t\}} be a Poisson process with rate {d^{1/3}}, let {\Gamma^d_t = X^d_{J_t}}, and let {U^d} be the first component of {\Gamma^d}, i.e. {U^d_t = X^d_{J_t,1}}. Then, as {d \to \infty}, the process {U^d \to U}, where {U} is a Langevin diffusion defined by
\displaystyle dU_t = \sqrt{h(l)} dB_t + \frac{1}{2} h(l) \frac{d}{dx} \log (\pi_1(U_t))dt.

Here {h(l) = 2 l^2 \Phi(-Kl^3/2)} is the speed measure and {K = \sqrt{\mathbb{E}_f \left[\frac{5g'''(X)^2 - 3g''(X)^3}{48}\right]} > 0}.
The skeleton of the proof is similar to that of the diffusion limit for random walk Metropolis: we find sets {F_d^{\star} \subseteq \mathbb{R}^d} such that {\mathbb{P}\left(\Gamma_s^d \in F_d^{\star} \ \forall \ 0 \le s \le t\right) \to 1} and such that the generators converge uniformly on these sets.

Theorem 4 (convergence of acceptance rate)
\displaystyle \lim_{d \to \infty} a_d(l) = a(l), \quad \text{where } a(l) = 2 \Phi(-Kl^3/2).
As a result, when {h(l)} is maximized, the acceptance rate is {a(l) \approx 0.574}. Also, the step variance should be tuned to {\sigma_d^2 = \hat{l}^2 d^{-1/3}.}

MALA enjoys a faster convergence rate, requiring {O(d^{1/3})} iterations compared to {O(d)} for RWM. But let's also keep in mind that each MALA iteration requires computing the gradient {\nabla \log \pi^d(x)}, whose {i}-th coordinate is {g'(x_i)}; this is itself an {O(d)} computation.

Reference:

Roberts, G. O., & Rosenthal, J. S. (1998). Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology)60(1), 255-268.

SRN – about the magical 0.234 acceptance rate

The Sunday Reading Notes series is back: let's understand the magical rule of 'tuning your MH algorithm so that the acceptance rate is roughly 25%' together!

‘Tune your MH algorithm so that the acceptance rate is roughly 25%’ is general advice given to students in Bayesian statistics classes. It has been almost 4 years since I first read about it in the book Bayesian Data Analysis, but I had never read the original paper where this result first appeared. This Christmas, I decided to read the paper ‘Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms’ by Roberts, Gelman and Gilks, and to resume my Sunday Reading Notes series with a short exposition of this paper.

In Roberts, Gelman and Gilks (1997), the authors obtain a weak convergence result for the sequence of random walk Metropolis algorithms targeting the product distributions {\pi_d(x^d) = \prod_{i=1}^{d} f(x_i^d)}: the suitably sped-up first component converges to a Langevin diffusion. The asymptotic optimal scaling problem then becomes a matter of optimizing the speed of the Langevin diffusion, which is related to the asymptotic acceptance rate of proposed moves.

A one-sentence summary of the paper would be

if you have a d-dimensional target that is independent in each coordinate, then choose the step size of the random walk kernel to be 2.38 / sqrt(d) or tune your acceptance rate to be around 1/4.

Unfortunately, in practice the 'if' condition is often overlooked, and people tune the acceptance rate to be 0.25 whenever the proposal is a random walk, no matter what the target distribution is. It has been more than 20 years since the publication of the 0.234 result, and we are witnessing the use of MCMC algorithms on more complicated target distributions, for example parameter inference for state-space models. I feel that this is a good time to revisit and appreciate the classical results while re-educating ourselves on their limitations.

Reference:

Roberts, G. O., Gelman, A., & Gilks, W. R. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. The annals of applied probability7(1), 110-120.

——–TECHNICAL EXPOSITION——-

Assumption 1 The marginal density of each component f is such that {f'/f} is Lipschitz continuous and

\displaystyle \mathbb{E}_f\left[\left(\frac{f'(X)}{f(X)}\right)^8\right] = M < \infty, \ \ \ \ \ (1)

\displaystyle \mathbb{E}_f\left[\left(\frac{f''(X)}{f(X)}\right)^4\right] < \infty. \ \ \ \ \ (2)

Roberts et al. (1997) consider the random walk proposal {y^d - x^d \sim \mathcal{N}(0,\sigma_d^2 I_d)} where {\sigma_d^2 = l^2 / (d-1).} We use {X^d = (X_0^d,X_1^d,\ldots)} to denote the Markov chain and define another Markov process {(Z^d)} with {Z_t^d = X_{[dt]}^d}, the sped-up version of {X^d}, where {[a]} denotes the floor of {a \in \mathbb{R}}. Define {U^d_t= X^d_{[dt],1}}, the first component of {X_{[dt]}^d = Z^d_t}.

Theorem 1 (diffusion limit of first component) Suppose {f} is positive and in {C^2} and that (1)-(2) hold. Let {X_0^{\infty} = (X^1_{0,1},X^{2}_{0,2},\ldots)} be such that all components are distributed according to {f}, and assume {X^{i}_{0,j} = X^{j}_{0,j}} for all {j \le i}. Then as {d \to \infty},
\displaystyle U^d \to U.

Here {U_0 \sim f} and {U} satisfies the Langevin SDE
\displaystyle dU_t = (h(l))^{1/2}dB_t + h(l)\frac{f'(U_t)}{2f(U_t)}dt \ \ \ \ \ (3)

and
\displaystyle h(l) = 2 l^2 \Phi(-l\sqrt{I}/2)

with {\Phi} being the standard normal cdf and
\displaystyle I = \mathbb{E}_f\left[\left(f'(X)/ f(X)\right)^2\right].

Here {h(l)} is the speed measure of the diffusion process and the `most efficient’ asymptotic diffusion has the largest speed measure. {I} measures the `roughness’ of {f}.

Example 1 If {f} is normal, then {f(x) = (2\pi\sigma^2_f)^{-1/2}\exp(-x^2/(2\sigma_f^2))} and
\displaystyle I = \mathbb{E}_f\left[\left(f'(X) / f(X) \right)^2\right] = \sigma_f^{-4}\,\mathbb{E}_f\left[X^2\right] = 1/\sigma^2_f.

So when the target density {f} is normal, the optimal value of {l} is proportional to {1 / \sqrt{I}}, which coincides with the standard deviation of {f}.

Proof: (of Theorem 1) This is a proof sketch. The strategy is to prove that the generator of {Z^d}, defined by

\displaystyle G_d V(x^d) = d\, \mathbb{E}\left[\left(V(Y^d) - V(x^d)\right) \left( 1 \wedge \frac{\pi_d(Y^d)}{\pi_d(x^d)}\right)\right],

converges to the generator of the limiting Langevin diffusion, defined by
\displaystyle GV(x) = h(l) \left[\frac{1}{2} V''(x) + \frac{1}{2} \frac{d}{dx}(\log f)(x) V'(x)\right].

Here the function {V} is a function of the first component only.
First define a set

\displaystyle F_d = \{|R_d(x_2,\ldots,x_d) - I| < d^{-1/8}\} \cap \{|S_d(x_2,\ldots,x_d) - I| < d^{-1/8}\},

where
\displaystyle R_d(x_2,\ldots,x_d) = (d-1)^{-1} \sum_{i=2}^d \left[(\log f(x_i))'\right]^2

and
\displaystyle S_d(x_2,\ldots,x_d) = - (d-1)^{-1} \sum_{i=2}^d \left[(\log f(x_i))''\right].

For fixed {t}, one can show that {\mathbb{P}\left(Z^d_s \in F_d , 0 \le s \le t\right)} goes to 1 as {d \to \infty}. On these sets {\{F_d\}}, we have
\displaystyle \sup_{x^d \in F_d} |G_d V(x^d) - G V(x_1)| \to 0 \quad \text{as } d \to \infty ,

which essentially says {G_d \to G}, because we have uniform convergence for vectors contained in a set of limiting probability 1.
\Box

Corollary 2 (heuristics for RWMH) Let
\displaystyle a_d(l) = \int \int \pi_d(x^d)\alpha(x^d,y^d)q_d(x^d,y^d)dx^d dy^d

be the average acceptance rate of the random walk MH in {d} dimensions.
We must have {\lim_{d\to\infty} a_d(l) = a(l)} where {a(l) = 2 \Phi(-l\sqrt{I}/2)}.
{h(l)} is maximized by {l = \hat{l} = 2.38 / \sqrt{I}}, with {a(\hat{l}) \approx 0.234} and {h(\hat{l}) = 1.3 / I.}
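As a quick empirical sanity check (a toy example of my own), the following R snippet runs random walk Metropolis on a product of standard normals, for which {I = 1}, with the asymptotically optimal proposal scale; the empirical acceptance rate should land near 0.234:

set.seed(1)
d <- 200
log_pi <- function(x) -sum(x^2) / 2  # product of d standard normals
sigma <- 2.38 / sqrt(d)              # hat-l / sqrt(d - 1), approximately, with I = 1
x <- rnorm(d); acc <- 0; n_iter <- 10000
for (t in 1:n_iter) {
  y <- x + sigma * rnorm(d)
  if (log(runif(1)) < log_pi(y) - log_pi(x)) { x <- y; acc <- acc + 1 }
}
acc / n_iter  # typically close to 0.23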
The authors consider two extensions of the target density {\pi_d} for which the convergence and optimal scaling properties still hold. The first extension concerns the case where the {f_i}'s are different but obey a law of large numbers. Another extension concerns the case {\pi_d(x^d) = f_1(x_1) \prod_{i=2}^{d} P(x_{i-1}, x_{i})}, with some conditions on {P}.

Troubleshooting: building packages in RStudio after a macOS update

I have never considered myself a good developer. Coding activities such as writing functions in Rcpp for an R package and building my personal website (nianqiaoju.github.io) already make me proud.

Last night, I had some trouble building a package in RStudio. At first, I thought it was because I had accidentally included a test file in the ProjectDirectory/R folder, but that was not the case. So I had no choice but to read the error messages carefully.

I did not keep a record of this, but the important bits of the messages looked like:

ld: unknown option: -platform_version
clang: error: linker command failed with exit code 1 (use -v to see invocation)

That’s when I started to realize that this had to do with a recent macOS update and that my package was probably fine. Interestingly, I did not recall authorizing a system update and learned about it only by running sessionInfo() in the console. These are some troubleshooting steps I tried in the RStudio console:

  • check if devtools works: devtools::load_all() –>> leads to error messages
  • check if Rcpp works: Rcpp::evalCpp("2+2") –>> leads to error messages.

It seemed like the operating system update had ‘disabled’ my R compiler tools for Rcpp. Unfortunately, despite the huge amount of information on Stack Overflow and GitHub issues, I did not see any error messages matching exactly what I saw, and there was some conflicting information. Moreover, I did not want to install the whole Xcode app or cause any irreversible damage. After an hour of trial and error, below is what I think solved my crisis.

  • run xcode-select --install in the terminal; it turned out my command line tools were already installed, as indicated by the message xcode-select: error: command line tools are already installed, use "Software Update" to install updates.
  • run gcc --version in the terminal to confirm that the command gcc is recognized.
  • set the file ~/.R/Makevars to point to the system headers. I did it with vim: add the following lines to the file and comment out what's already there.
# clang: start
CFLAGS=-isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk
CCFLAGS=-isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk
CXXFLAGS=-isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk
# clang: end
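With the new flags in place, the earlier checks can be rerun (after restarting R) to confirm the fix:

Rcpp::evalCpp("2+2")  # returns 4 once the toolchain works again
devtools::load_all()  # now builds the package without linker errors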

I believe that the last step did the magic for me and I was able to build packages from source without any trouble.

In this process this stack-overflow discussion and this blogpost were the most helpful.

Getting started with RStudio on AWS in 4 steps

This blog post is a quick guide to setting up an RStudio server on Amazon Web Services. Credit goes to Louis Aslett for setting up the Amazon Machine Images and making life easier for all of us.

Step 1: creating an AWS account

A free account gives access to the t2.micro instance type, which you can run for 750 hours a month.

Step 2: choosing an Amazon Machine Image (AMI) prepared by Louis Aslett

It can be painful to boot your own virtual machine and install R and RStudio yourself. Louis Aslett has already done the job for us, and this is a huge service to the community!

He has prepared virtual machines with R, RStudio, LaTeX, etc. pre-installed. You can choose one of the AMIs based on your location, and the link will direct you to choose an instance type (the default is the free-tier t2.micro). We can rely on the default settings until “Step 6: Configure Security Group”. Click on “Add Rule” and select “HTTP” to allow internet traffic to this instance, and change “Source” to specify one IP address or an IP address range. Setting “Source” to “Anywhere” allows all IP addresses to access the instance; you can also set it to “My IP”. You can also create a new security group on this page.

While this is not secure, we can “proceed without a key pair” to launch the instance.

Step 3: changing password for the RStudio Server

It can take a few minutes to launch the instance. Once the instance is “running”, you can copy and paste the “Public IPv4 address” from the instance summary into your browser, and you will see an RStudio Server login page. The initial username is “rstudio” and the password is “<Your instance ID>”.

This AMI comes with the package “RStudioAMI” pre-installed and loaded. We can use the function passwd() to change the password for the RStudio server.

Step 4: linking Dropbox or a GitHub repository

If you use Dropbox to sync files, you can get started by running the function linkDropbox(). Of course, we do not want to sync our entire Dropbox account here. We can sync just a project folder by running excludeSyncDropbox("*") followed by includeSyncDropbox("projectFolder"). It seems that we can only sync directories directly under Dropbox, not subdirectories like “projectFolder/subDir”.
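Putting steps 3 and 4 together, a typical first session on the server looks roughly like this (the functions come from the pre-installed RStudioAMI package):

library(RStudioAMI)
passwd()                            # change the default password first
linkDropbox()                       # follow the printed link to authorize Dropbox
excludeSyncDropbox("*")             # stop syncing everything...
includeSyncDropbox("projectFolder") # ...then sync only the project folder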