Calculating Probability

  1. A function P that assigns a real number P(A) to every event A ⊆ Ω is called a probability law
    over the sample space Ω if it satisfies the following three probability axioms.
    • (Nonnegativity) P(A) ≥ 0, for every event A.
    • (Countable additivity) If A and B are two disjoint events, then the probability of their
    union satisfies
    P(A ∪ B) = P(A) + P(B).
    More generally, for a countable collection of disjoint events A1, A2, … we have
    P(⋃_{i=1}^{∞} Ai) = Σ_{i=1}^{∞} P(Ai).
    • (Normalization) The probability of the entire sample space is 1, that is, P(Ω) = 1.
    (a) (5 pts) Prove, using only the axioms of probability given, that P(A) = 1 − P(A^c) for
    any event A and probability law P, where A^c denotes the complement of A.
    (b) (5 pts) Let E1, E2, …, En be disjoint sets such that ⋃_{i=1}^{n} Ei = Ω and let P be a probability
    law over the sample space Ω. Show that, for any event A, we have
    P(A) = Σ_{i=1}^{n} P(A ∩ Ei).
    (c) (5 pts) Prove that for any two events A, B we have
    P(A ∩ B) ≥ P(A) + P(B) − 1.
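    Before writing the proofs for (a)–(c), it can help to sanity-check the three identities numerically
    on a small finite sample space. The sketch below (plain Python; the die example and the particular
    events A, B, and partition E are arbitrary illustrative choices) does exactly that.

# Sanity check of (a)-(c) with a fair six-sided die and the uniform probability law.
omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    """P(A) = |A| / |Omega| for the uniform law on a finite sample space."""
    return len(event) / len(omega)

A = {2, 4, 6}                  # example event: "even"
B = {1, 2, 3, 4}               # example event: "at most 4"
E = [{1, 2}, {3, 4}, {5, 6}]   # disjoint events whose union is Omega

assert abs(prob(A) - (1 - prob(omega - A))) < 1e-12          # (a) complement rule
assert abs(prob(A) - sum(prob(A & Ei) for Ei in E)) < 1e-12  # (b) total probability
assert prob(A & B) >= prob(A) + prob(B) - 1                  # (c) the lower bound
print("all three identities hold for this example")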
  2. (10 pts) Two fair dice are thrown. Let
    X = 1 if the sum of the numbers is ≤ 5, and X = 0 otherwise;
    Y = 1 if the product of the numbers is odd, and Y = 0 otherwise.
    What is Cov(X, Y)? Show your steps clearly.
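    A minimal enumeration sketch (plain Python; all names are illustrative) that computes Cov(X, Y)
    exactly by averaging over the 36 equally likely outcomes, useful as a check on the hand calculation:

# Enumerate the 36 equally likely outcomes of two fair dice.
outcomes = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]

X = [1 if d1 + d2 <= 5 else 0 for d1, d2 in outcomes]        # indicator: sum <= 5
Y = [1 if (d1 * d2) % 2 == 1 else 0 for d1, d2 in outcomes]  # indicator: product odd

n = len(outcomes)
EX = sum(X) / n
EY = sum(Y) / n
EXY = sum(x * y for x, y in zip(X, Y)) / n

print("Cov(X, Y) =", EXY - EX * EY)   # Cov(X, Y) = E[XY] - E[X]E[Y]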
  3. (10 pts) Derive the mean of the Poisson distribution.
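    The derivation itself is analytic, but a quick NumPy check (a sketch; the rate λ = 3.5 and the
    sample size are arbitrary choices) can confirm that the sample mean of Poisson draws approaches
    the parameter λ:

import numpy as np

rng = np.random.default_rng(0)
lam = 3.5                                   # arbitrary example rate
samples = rng.poisson(lam, size=1_000_000)  # large sample for a stable estimate
print("sample mean:", samples.mean(), "vs lambda:", lam)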
  4. In this problem, we will explore certain properties of probability distributions and introduce
    some important new concepts.
(a) (5 pts) Recall Pascal’s Identity for combinations:
C(N, m) + C(N, m − 1) = C(N + 1, m),
where C(N, m) denotes the binomial coefficient “N choose m”. Use the identity to show the following:
(1 + x)^N = Σ_{m=0}^{N} C(N, m) · x^m,
which is called the binomial theorem. Hint: You can use induction.
Finally, show that the binomial distribution with parameter p is normalized, that is,
Σ_{m=0}^{N} C(N, m) · p^m · (1 − p)^(N−m) = 1.
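As a numerical companion to the normalization claim, a short sketch using Python's standard library (the values of N and p below are arbitrary examples) checks that the binomial probabilities sum to 1:

from math import comb

for N in (5, 10, 50):
    for p in (0.2, 0.33, 0.5):
        total = sum(comb(N, m) * p**m * (1 - p)**(N - m) for m in range(N + 1))
        print(f"N={N}, p={p}: sum of binomial probabilities = {total:.12f}")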
(b) (5 pts) Suppose you wish to transmit the value of a random variable to a receiver. In
Information Theory, the average amount of information you will transmit in the process
(in units of “nat”) is obtained by taking the negative of the expectation of ln p(x) with
respect to the distribution p(x) of your random variable and is given by
H(x) = −∫ p(x) · ln p(x) dx.
This quantity is the entropy of your random variable. Calculate and compare the entropies of a uniform random variable x ∼ U(0, 1) and a Gaussian random variable
z ∼ N (0, 1).
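A Monte Carlo sketch (NumPy; the sample size and seed are arbitrary choices) that estimates both entropies as the negative average of ln p(x) over samples, which you can compare against your analytic answers:

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Uniform U(0, 1): p(x) = 1 on (0, 1), so ln p(x) = 0 and the estimate is exactly 0.
x = rng.uniform(0.0, 1.0, n)
H_uniform = -np.mean(np.log(np.ones_like(x)))

# Standard Gaussian N(0, 1): ln p(z) = -z^2/2 - ln(2*pi)/2.
z = rng.standard_normal(n)
H_gauss = -np.mean(-0.5 * z**2 - 0.5 * np.log(2 * np.pi))

print("H(uniform)  ~", H_uniform)
print("H(gaussian) ~", H_gauss, "(analytic: 0.5*ln(2*pi*e) =", 0.5 * np.log(2 * np.pi * np.e), ")")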
(c) In many applications, e.g. in Machine Learning, we wish to approximate some probability distribution using function approximators we have available, for example deep
neural networks. This creates the need for a way to measure the similarity or the distance between two distributions. One such proposed measure is the relative entropy,
or the Kullback-Leibler divergence. Given two probability distributions p and q, the
KL-divergence between them is given by
KL(p||q) = ∫_{−∞}^{∞} p(x) · ln( p(x) / q(x) ) dx.
i. (2 pts) Show that the KL-divergence between equal distributions is zero.
ii. (2 pts) Show that the KL-divergence is not symmetric, that is, KL(p||q) ≠ KL(q||p)
in general. You can do this by providing an example.
iii. (16 pts) Calculate the KL divergence between p(x) ∼ N(µ1, σ1²) and q(x) ∼ N(µ2, σ2²)
for µ1 = 2, µ2 = 1.8, σ1² = 1.5, σ2² = 0.2. First, derive a closed-form
solution depending on µ1, µ2, σ1, σ2. Then, calculate its value. (A numerical
answer alone, without clearly shown steps, will not be graded.)
Remark: We call this measure a divergence rather than a distance since a proper distance function must be
symmetric.
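A sketch (NumPy) that evaluates the standard closed form for the KL divergence between two univariate Gaussians, which is the expression part iii asks you to derive, at the given parameters, and cross-checks it with a Monte Carlo estimate; the seed and sample size are arbitrary choices.

import numpy as np

mu1, var1 = 2.0, 1.5   # p(x) = N(mu1, var1)
mu2, var2 = 1.8, 0.2   # q(x) = N(mu2, var2)

# Standard closed form: KL(p||q) = ln(sigma2/sigma1) + (sigma1^2 + (mu1 - mu2)^2) / (2*sigma2^2) - 1/2
kl_closed = 0.5 * np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / (2 * var2) - 0.5

# Monte Carlo cross-check: average ln p(x) - ln q(x) over samples drawn from p.
rng = np.random.default_rng(0)
x = rng.normal(mu1, np.sqrt(var1), size=1_000_000)
log_p = -0.5 * (x - mu1) ** 2 / var1 - 0.5 * np.log(2 * np.pi * var1)
log_q = -0.5 * (x - mu2) ** 2 / var2 - 0.5 * np.log(2 * np.pi * var2)

print("closed form :", kl_closed)
print("monte carlo :", np.mean(log_p - log_q))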

  1. In this problem, we will explore some properties of random variables and in particular that
    of the Gaussian random variable.
    (a) (7 pts) The convolution of two functions f and g is defined as
    (f ∗ g)(t) = ∫_{−∞}^{∞} f(τ) g(t − τ) dτ.
    One can calculate the probability density function of the random variable Z = X + Y, where X and Y are
    independent continuous random variables, using the convolution operation.
    In fact,
    f_Z(z) = ∫_{−∞}^{∞} f_X(τ) f_Y(z − τ) dτ.
    Using this fact, find the probability density function of Z = X + Y , where X and Y
    are independent standard Gaussian random variables. Find µZ, σZ. Which distribution
    does Z belong to? (Hint: use √π = ∫_{−∞}^{∞} e^(−x²) dx.)
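    A numerical-convolution sketch (NumPy; the grid limits and resolution are arbitrary choices) that
    approximates f_Z on a grid directly from the convolution formula and compares it with a Gaussian
    density, which can help confirm the analytic result:

import numpy as np

# Discretize the standard Gaussian density on a symmetric grid.
t = np.linspace(-10, 10, 4001)
dt = t[1] - t[0]
phi = np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)   # standard normal pdf

# The full discrete convolution, scaled by dt, approximates (f_X * f_Y)(z)
# at the grid points z = 2*t[0] + k*dt.
fz = np.convolve(phi, phi) * dt
z = np.linspace(2 * t[0], 2 * t[-1], 2 * len(t) - 1)

# Compare against the N(0, 2) density, which is what the analytic convolution yields.
target = np.exp(-z**2 / 4) / np.sqrt(4 * np.pi)
print("max absolute difference:", np.max(np.abs(fz - target)))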
    (b) (5 pts) Let X be a standard Gaussian random variable and Y be a discrete
    random variable taking values {−1, 1} with equal probabilities. Is the random variable
    Z = XY independent of Y? Give a formal argument (proof or counterexample) justifying
    your answer.
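    A simulation sketch (NumPy; the sample size and seed are arbitrary) that compares the empirical
    distribution of Z on the events {Y = 1} and {Y = −1}; it does not replace the formal argument the
    question asks for, but it can guide intuition:

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.standard_normal(n)
y = rng.choice([-1, 1], size=n)   # P(Y = 1) = P(Y = -1) = 1/2
z = x * y

# If Z depended on Y, the conditional distributions of Z given Y = +1 and Y = -1
# would differ; compare a few empirical quantiles on each event.
qs = [0.1, 0.25, 0.5, 0.75, 0.9]
print("quantiles of Z given Y = +1:", np.quantile(z[y == 1], qs))
print("quantiles of Z given Y = -1:", np.quantile(z[y == -1], qs))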
    (c) (8 pts) Let X be a non-negative random variable. Let k be a positive real number.
    Define the binary random variable Y = 0 for X < k and Y = k for X ≥ k. Using the
    relation between X and Y, prove that P(X ≥ k) ≤ E[X]/k. (Hint: start by finding E[Y].)
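    A quick simulation sketch (NumPy; the exponential distribution and the values of k are arbitrary
    illustrative choices, since any non-negative random variable works) comparing P(X ≥ k) with E[X]/k:

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)   # an example non-negative random variable

for k in (0.5, 1.0, 2.0, 4.0):
    lhs = np.mean(x >= k)   # empirical P(X >= k)
    rhs = x.mean() / k      # empirical E[X] / k
    print(f"k = {k}: P(X >= k) ~ {lhs:.4f} <= E[X]/k ~ {rhs:.4f}")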
  2. In this problem, we will empirically observe some of the results we obtained above and also
    the convergence properties of certain distributions. You may use the Python libraries NumPy
    and Matplotlib.
    (a) (5 pts) In 3.a you have found the distribution of Z = X + Y. Let X and Y be Gaussian
    random variables with µX = −1, µY = 3, σX² = 1, and σY² = 4. Sample 100000
    pairs of X and Y and plot their sum Z = X + Y as a histogram. Are the shape of Z and its
    apparent mean consistent with what you have learned in the lectures?
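    A minimal NumPy/Matplotlib sketch for this part (variance 4 corresponds to standard deviation 2;
    the seed and bin count are arbitrary choices):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(-1.0, 1.0, n)   # mu_X = -1, sigma_X^2 = 1
y = rng.normal(3.0, 2.0, n)    # mu_Y = 3,  sigma_Y^2 = 4 (standard deviation 2)
z = x + y

plt.hist(z, bins=100, density=True)
plt.title(f"Z = X + Y  (sample mean = {z.mean():.3f})")
plt.xlabel("z")
plt.ylabel("estimated density")
plt.show()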
    (b) (5 pts) Let X ∼ B(n, p) be a binomially distributed random variable. One can use the normal distribution as an approximation to the binomial distribution when n is large and/or
    p is close to 0.5. In this case, X ≈ N(np, np(1 − p)). Show how this approximation behaves by drawing 10000 samples from the binomial distribution with n = 5, 10, 20, 30, 40, 100
    and p = 0.2, 0.33, 0.50 and plotting the distributions of samples for each case as a
    histogram. Report for which values of n and p the distribution resembles that of a
    Gaussian.
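    One possible layout for this part (NumPy/Matplotlib sketch; the 6-by-3 subplot grid is just one
    way to arrange the 18 histograms):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
ns = [5, 10, 20, 30, 40, 100]
ps = [0.2, 0.33, 0.5]

fig, axes = plt.subplots(len(ns), len(ps), figsize=(10, 14))
for i, n in enumerate(ns):
    for j, p in enumerate(ps):
        samples = rng.binomial(n, p, size=10_000)
        axes[i, j].hist(samples, bins=np.arange(n + 2) - 0.5, density=True)
        axes[i, j].set_title(f"n = {n}, p = {p}", fontsize=8)
fig.tight_layout()
plt.show()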
    (c) (5 pts) You were introduced to the concept of KL-divergence analytically. Now, you will
    estimate the divergence KL(p||q), where p(x) = N(0, 1) and q(x) = N(0, 4). Draw
    1000 samples from a Gaussian with mean 0 and variance 1. Call them x1, x2, …, x1000.
    Estimate the KL divergence as
    (1/1000) Σ_{i=1}^{1000} ln( p(xi) / q(xi) ),
    where p(x) = N (0, 1) and q(x) = N (0, 4). Calculate the divergence analytically for
    KL(p||q). Is the result consistent with your estimate?
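    A sketch for this part (NumPy; note that N(0, 4) denotes variance 4, i.e. standard deviation 2),
    estimating the divergence from 1000 samples and comparing it with the closed form from the earlier
    KL problem:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1000)   # 1000 samples from p = N(0, 1)

log_p = -0.5 * x**2 - 0.5 * np.log(2 * np.pi)           # ln p(x) for p = N(0, 1)
log_q = -0.5 * x**2 / 4 - 0.5 * np.log(2 * np.pi * 4)   # ln q(x) for q = N(0, 4)
kl_estimate = np.mean(log_p - log_q)

# Closed form for KL(N(0, 1) || N(0, 4)) from the earlier derivation:
kl_analytic = np.log(2.0) + 1.0 / 8.0 - 0.5

print("Monte Carlo estimate:", kl_estimate)
print("analytic value      :", kl_analytic)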

Sample Solution