Rejection Sampling and Importance Sampling

STA695 Spring 2002

Integration, in general, is most useful in statistics for computing probabilities and expectations. Of course, a probability is an expectation of an indicator variable. If a random variable X has density f(x) (could be multivariate), then

displaymath172

including the special case where tex2html_wrap_inline188 for a set A,

displaymath173

Of course, when you are asked to compute an expectation, the first thing to try is to compute the integral analytically. When this fails, numerical integration is another tool you can try. However, in multivariate settings numerical integration becomes quite time consuming to apply (the number of function evaluations increases exponentially in the dimension of the problem). Thus, it is useful to consider other methods of evaluating the integral.

Fortunately, there are methods of evaluating expectation that do not suffer this ``curse of dimensionality''. In particular, recall the strong law of large numbers, which states if tex2html_wrap_inline192 is a sequence of independent, identically distributed random variables, then the quantity

displaymath174

converges almost surely to E[h(X)]. Thus, if we are asked to compute an expectation (or probability), and if we can generate an iid sequence, then we may use this result to estimate the expectation. Of course, this Monte Carlo Estimate of the expectation has error, but typically we generate a sufficiently large number of random variables that error is small.

For n large, sample means are also normally distributed, so we can compute a tex2html_wrap_inline198 confidence interval for tex2html_wrap_inline200 to be

displaymath175

where tex2html_wrap_inline202 is the sample variance of the tex2html_wrap_inline204 values,

displaymath176

Thus, if we can generate a sequence of observations from tex2html_wrap_inline206 , then in theory we can compute as many observations as necessary to make our confidence interval as small as we want. If one wants a confidence interval to have width less than w, generate tex2html_wrap_inline210 until

displaymath177

This may require generating several sets of observations. For example, one might generate 100000 observations to start, compute the confidence interval, and decide it is too large. One can then generate another 100000 observations, add those to the original sequence, and compute another confidence interval, and so on until the confidence interval is sufficiently narrow.

Of course, if we are having trouble performing the integration analytically, we are typically not dealing with a ``familiar'' distribution, and thus generating random values may be difficult. This handout discusses two methods, rejection sampling and importance sampling, which allow a Monte Carlo estimate of an expectation.

For both methods, assume we have a density f(x) known only up to a normalizing constant. Thus, tex2html_wrap_inline214 for a density tex2html_wrap_inline216 and a potentially unknown constant tex2html_wrap_inline218 .

1 Rejection Sampling

Suppose we can find a known tex2html_wrap_inline234 and density tex2html_wrap_inline236 such that tex2html_wrap_inline238 for all x, where tex2html_wrap_inline242 , and we know how to sample from the density tex2html_wrap_inline236 . Suppose we perform the following steps

  1. Generate tex2html_wrap_inline246
  2. Generate tex2html_wrap_inline248
  3. Accept X if U<f(X)/g(X), otherwise reject X.

The point of this procedure is that the density of X, given acceptance, is tex2html_wrap_inline216 . Thus, repeated uses of this procedure, retaining only accepted values of X, can yield the random sequence tex2html_wrap_inline192 required to compute Monte Carlo estimates of expectations.

To verify this claim, note that the joint density of X and U is tex2html_wrap_inline268 .

displaymath220

The event A is defined to be the event U<f(X)/g(X), Thus

displaymath221

The integral for tex2html_wrap_inline274 identical except the range of integration corresponds to X<a instead of all possible values of X. Thus

displaymath222

Note that this step is the one that requires tex2html_wrap_inline238 for all x. If f(x)>g(x) for some values of x, we would have to consider the integral with respect to u by cases, depending on whether or not f(x)/g(x) was greater than 1. Replacing tex2html_wrap_inline214 and tex2html_wrap_inline242 and canceling constants from the numerator and denominator, we find

displaymath223

which is the probability of observing X<a under the distribution tex2html_wrap_inline206 .

Recall we stated the value of tex2html_wrap_inline218 does not need to be known. This is because it cancels out of the calculation, and thus is not required to use rejection sampling. Typically both tex2html_wrap_inline234 and tex2html_wrap_inline236 must be known to compute g(x).

The efficiency of rejection sampling depends strongly on how well the function g(x) approximates f(x). We derived above that

displaymath224

If f(x)/g(x) tends to be small, indicating that g(x) is a poor approximation for f(x), then the probability of acceptance will be small, and the algorithm will spend most of its time rejecting X values. In contrast, good approximations will tend to have most of the candidate X values accepted, and will be reasonably equivalent to random sampling directly from tex2html_wrap_inline206 .

The central question when using rejection sampling is what to choose for the function g, typically called the envelope function. If tex2html_wrap_inline206 has bounded support, then a very crude method of choosing g(x) is to set g(x)=a, where a is a constant greater than or equal to tex2html_wrap_inline334 , a quantity which can be found numerically or crudely estimated from a plot. Of course, this constant function g may not be a good approximation for f overall, but it satisfies the definition. In general, it is difficult to find an appropriate envelope function, and different methods are used in different situations.

Example

Let tex2html_wrap_inline340 over the interval tex2html_wrap_inline342 . If we use the ``crude envelope'' method, we first find a maximum of f(x) by inspection. Plotting the function in R, we find that the maximum of f(x) is below 3.5, so we will use g(x)=3.5 over the range tex2html_wrap_inline342 .

The next step is to determine tex2html_wrap_inline236 and tex2html_wrap_inline234 from g(x). Since a density must integrate to 1, we know

displaymath225

and thus

displaymath226

and tex2html_wrap_inline236 is a uniform density on tex2html_wrap_inline342 . Thus, to generate a large random sample from tex2html_wrap_inline206 (the density proportional to f), we generate a large sample of X values from tex2html_wrap_inline236 (don't confuse that fact the tex2html_wrap_inline236 happens to be uniform in this example to the random variable U being uniform), generate an equivalent number of tex2html_wrap_inline376 random variables, and accept those tex2html_wrap_inline210 values for which tex2html_wrap_inline380 .

2 Importance Sampling

Importance sampling is related to rejection sampling, and may also be used to determine expectations. While we still need a function tex2html_wrap_inline242 for which tex2html_wrap_inline396 is a density we can generate from as before, the advantage to importance sampling is that we do not need f(x)<g(x) for all x.

For any function h(x) and density tex2html_wrap_inline396 ,

displaymath382

Thus, if tex2html_wrap_inline192 is a sequence of independent random variables with density tex2html_wrap_inline236 , then

displaymath383

will converge to tex2html_wrap_inline410 This method requires knowledge of the normalizing constant to determine tex2html_wrap_inline216 . Suppose tex2html_wrap_inline218 is unknown, so we only have f(x) available to us. A similar calculation is still possible. Since tex2html_wrap_inline216 is assumed to be a density, we may still formally write

displaymath384

Thus

  equation85

Multiplying the integrands in the numerator and denominator by tex2html_wrap_inline420 , we find

displaymath385

Thus, we may again use the law of large numbers argument. If tex2html_wrap_inline422 , then we may estimate tex2html_wrap_inline410 by

displaymath386

Of course, typically the 1/n terms are omitted in the numerator and denominator. I left them in this derivation because they make it clear that the numerator and denominator both converge to the appropriate values (and the denominator is nonzero), and thus the ratio converges as well.

Letting tex2html_wrap_inline428 (known as the "importance weight"), the above expression may be written

  equation116

Thus, tex2html_wrap_inline430 is a weighted average of the tex2html_wrap_inline204 values. As stated in Tanner, under general conditions tex2html_wrap_inline430 converges almost surely to tex2html_wrap_inline200 . However, in practice one acheives the best estimates when tex2html_wrap_inline236 closely approximates the shape of f.

You should also note the denominator in equation (1), tex2html_wrap_inline442 , was eventually estimated by tex2html_wrap_inline444 in equation (2). Thus, one may use tex2html_wrap_inline444 as an estimate for tex2html_wrap_inline218 , a ``side benefit'' of the entire procedure.

The variance of tex2html_wrap_inline430 may be estimated by

displaymath387

Unlike rejection sampling, importance sampling uses all the generated tex2html_wrap_inline210 values. However, the importance sampling estimate is a weighted average, and thus the distribution of the weights plays a large role in determining the efficiency of the estimate. If a few of the tex2html_wrap_inline454 are much larger than the rest, then those tex2html_wrap_inline456 values will dominate the weighted average, and thus the resulting estimate will behave as if you generated a much smaller sample size. To counteract this, typically the function g(x) is chosen to have larger tails than f(x), resulting in f(x)/g(x) to be bounded above by 1 in the tails. In the center of the distribution, it is possible for g(x) to be greater than f(x). In any case, the close g(x) approximates f(x), the better the importance sampling estimator.

About this document ...

Rejection Sampling and Importance Sampling

This document was generated using the LaTeX2HTML translator Version 96.1 (Feb 5, 1996) Copyright © 1993, 1994, 1995, 1996, Nikos Drakos, Computer Based Learning Unit, University of Leeds.

The command line arguments were:
latex2html -split 0 -show_section_numbers rejimp.tex.

The translation was initiated by Kert Viele on Mon Jan 14 12:22:41 EST 2002


Kert Viele
Mon Jan 14 12:22:41 EST 2002