- A function P : A ⊂ Ω → R is called a probability law over the sample space Ω if it satisfies
the following three probability axioms.
• (Nonnegativity) P(A) ≥ 0, for every event A.
• (Countable additivity) If A and B are two disjoint events, then the probability of their
union satisfies
P(A ∪ B) = P(A) + P(B).
More generally, for a countable collection of disjoint events A1, A2, … we have
P(∪_{i=1}^{∞} A_i) = ∑_{i=1}^{∞} P(A_i).
• (Normalization) The probability of the entire sample space is 1, that is, P(Ω) = 1.
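As a minimal concrete illustration of the definition, the sketch below checks the three axioms for the uniform law on a fair six-sided die; the example and all names are illustrative, not part of the problem.

```python
# Uniform probability law on a fair die: P(A) = |A| / |Omega| for events A ⊆ Omega.
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def P(event):
    # Probability of an event as the fraction of outcomes it contains.
    return Fraction(len(event & omega), len(omega))

A, B = {1, 2}, {5, 6}               # two disjoint events
assert P(A) >= 0                    # nonnegativity
assert P(A | B) == P(A) + P(B)      # additivity for disjoint events
assert P(omega) == 1                # normalization
```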
(a) (5 pts) Prove, using only the axioms of probability given, that P(A) = 1 − P(A^c) for
any event A and probability law P, where A^c denotes the complement of A.
(b) (5 pts) Let E_1, E_2, …, E_n be disjoint sets such that ∪_{i=1}^{n} E_i = Ω and let P be a probability
law over the sample space Ω. Show that, for any event A, we have
P(A) = ∑_{i=1}^{n} P(A ∩ E_i).
(c) (5 pts) Prove that for any two events A, B we have
P(A ∩ B) ≥ P(A) + P(B) − 1.
- (10 pts) Two fair dice are thrown. Let
X = 1 if the sum of the numbers is at most 5, and X = 0 otherwise,
Y = 1 if the product of the numbers is odd, and Y = 0 otherwise.
What is Cov(X, Y)? Show your steps clearly.
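A NumPy sketch of a Monte Carlo sanity check for the covariance; it does not replace the analytical steps, and the seed and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
d1 = rng.integers(1, 7, size=n)              # first die, values 1..6
d2 = rng.integers(1, 7, size=n)              # second die

X = (d1 + d2 <= 5).astype(float)             # indicator: sum at most 5
Y = ((d1 * d2) % 2 == 1).astype(float)       # indicator: product is odd

# Sample covariance should approximate the analytical Cov(X, Y).
cov_est = np.mean(X * Y) - np.mean(X) * np.mean(Y)
print(cov_est)
```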
- (10 pts) Derive the mean of the Poisson distribution.
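An empirical check one can run alongside the derivation: the sample mean of Poisson draws should approach the distribution's mean. The rate λ = 3 is chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 3.0                                  # illustrative rate parameter
samples = rng.poisson(lam, size=1_000_000)
print(samples.mean())                      # compare with the mean you derive
```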
- In this problem, we will explore certain properties of probability distributions and introduce
new important concepts.
(a) (5 pts) Recall Pascal’s Identity for combinations: C(N, m) + C(N, m−1) = C(N+1, m),
where C(N, m) denotes the number of ways of choosing m items out of N.
Use the identity to show the following:
(1 + x)^N = ∑_{m=0}^{N} C(N, m) · x^m,
which is called the binomial theorem. Hint: You can use induction.
Finally, show that the binomial distribution with parameter p is normalized, that is
∑_{m=0}^{N} C(N, m) · p^m · (1 − p)^{N−m} = 1.
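A short numeric check of this normalization identity for one illustrative choice of N and p:

```python
from math import comb

N, p = 10, 0.3                 # illustrative parameters
total = sum(comb(N, m) * p**m * (1 - p)**(N - m) for m in range(N + 1))
print(total)                   # should be 1.0 up to floating-point error
```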
(b) (5 pts) Suppose you wish to transmit the value of a random variable to a receiver. In
Information Theory, the average amount of information you will transmit in the process
(in units of “nat”) is obtained by taking the negative expectation of ln p(x) with respect to the
distribution p(x) of your random variable and is given by
H(x) = − ∫ p(x) · ln p(x) dx.
This quantity is the entropy of your random variable. Calculate and compare the entropies of a uniform random variable x ∼ U(0, 1) and a Gaussian random variable
z ∼ N (0, 1).
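A NumPy sketch of a Monte Carlo estimate of both entropies as −E[ln p(x)], which can be checked against the closed-form answers; the seed and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# x ~ U(0, 1): ln p(x) = ln 1 = 0 on the support, so the estimate is exactly 0.
x = rng.uniform(0.0, 1.0, size=n)
H_uniform = -np.mean(np.zeros_like(x))

# z ~ N(0, 1): ln p(z) = -z^2/2 - ln(sqrt(2*pi))
z = rng.standard_normal(size=n)
H_gauss = -np.mean(-0.5 * z**2 - 0.5 * np.log(2.0 * np.pi))

print(H_uniform, H_gauss)      # the Gaussian entropy should be noticeably larger
```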
(c) In many applications, e.g. in Machine Learning, we wish to approximate some probability distribution using function approximators we have available, for example deep
neural networks. This creates the need for a way to measure the similarity or the distance between two distributions. One such proposed measure is the relative entropy,
or the Kullback-Leibler (KL) divergence. Given two probability distributions p and q, the
KL-divergence between them is given by
KL(p||q) = ∫_{−∞}^{∞} p(x) · ln(p(x) / q(x)) dx.
i. (2 pts) Show that the KL-divergence between equal distributions is zero.
ii. (2 pts) Show that the KL-divergence is not symmetric, that is KL(p||q) ≠ KL(q||p)
in general. You can do this by providing an example.
iii. (16 pts) Calculate the KL divergence between p(x) ∼ N(µ1, σ1²) and q(x) ∼ N(µ2, σ2²)
for µ1 = 2, µ2 = 1.8, σ1² = 1.5, σ2² = 0.2. First, derive a closed-form
solution depending on µ1, µ2, σ1, σ2. Then, calculate its value. (A numerical
answer without clearly shown steps will not be graded.) A Monte Carlo sanity check
is sketched after the remark below.
Remark: We call this measure a divergence since a proper distance function must be
symmetric.
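A NumPy sketch of a Monte Carlo sanity check for parts ii and iii: it estimates KL(p||q) and KL(q||p) for the Gaussians of part iii by averaging the log-density ratio over samples drawn from the first argument. It does not replace the closed-form derivation, and the seed and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

mu1, var1 = 2.0, 1.5
mu2, var2 = 1.8, 0.2

def log_normal_pdf(x, mu, var):
    # Log-density of N(mu, var) evaluated at x.
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mu) ** 2 / (2.0 * var)

xs = rng.normal(mu1, np.sqrt(var1), size=n)   # samples from p
ys = rng.normal(mu2, np.sqrt(var2), size=n)   # samples from q

kl_pq = np.mean(log_normal_pdf(xs, mu1, var1) - log_normal_pdf(xs, mu2, var2))
kl_qp = np.mean(log_normal_pdf(ys, mu2, var2) - log_normal_pdf(ys, mu1, var1))
print(kl_pq, kl_qp)   # the two estimates differ, illustrating the asymmetry
```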
- In this problem, we will explore some properties of random variables and in particular that
of the Gaussian random variable.
(a) (7 pts) The convolution of two functions f and g is defined as
(f ∗ g)(t) = ∫_{−∞}^{∞} f(τ) g(t − τ) dτ.
One can calculate the probability density function of the random variable Z = X + Y using the
convolution operation when X and Y are independent continuous random variables.
In fact,
f_Z(z) = ∫_{−∞}^{∞} f_X(τ) f_Y(z − τ) dτ.
Using this fact, find the probability density function of Z = X + Y , where X and Y
are independent standard Gaussian random variables. Find µZ, σZ. Which distribution
does Z belong to? (Hint: use √π = ∫_{−∞}^{∞} e^{−x²} dx.)
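A numerical sketch of the convolution formula on a grid, which can be used to cross-check the density obtained analytically; the grid range and resolution are arbitrary choices.

```python
import numpy as np

# Evaluate the standard Gaussian density on a grid and convolve it with itself
# numerically; this approximates f_Z(z) = (f_X * f_Y)(z).
t = np.linspace(-10.0, 10.0, 2001)
dx = t[1] - t[0]
f = np.exp(-t**2 / 2.0) / np.sqrt(2.0 * np.pi)

f_Z = np.convolve(f, f, mode="same") * dx   # numerical convolution on the grid

# Read off the mean and variance of the resulting density numerically and
# compare with the analytical answer for Z = X + Y.
mean_Z = np.sum(t * f_Z) * dx
var_Z = np.sum((t - mean_Z) ** 2 * f_Z) * dx
print(mean_Z, var_Z)
```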
(b) (5 pts) Let X be a standard Gaussian random variable and Y be a discrete
random variable taking the values {−1, 1} with equal probabilities. Is the random variable
Z = XY independent of Y? Give a formal argument (proof or counterexample) justifying
your answer.
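A short simulation cannot replace the formal argument asked for, but it can suggest what to expect by comparing empirical conditional frequencies; the event threshold, seed, and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

X = rng.standard_normal(n)
Y = rng.choice([-1.0, 1.0], size=n)
Z = X * Y

# Compare P(Z > 1) with P(Z > 1 | Y = 1) and P(Z > 1 | Y = -1).
p_all = np.mean(Z > 1)
p_y_pos = np.mean(Z[Y == 1] > 1)
p_y_neg = np.mean(Z[Y == -1] > 1)
print(p_all, p_y_pos, p_y_neg)
```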
(c) (8 pts) Let X be a non-negative random variable. Let k be a positive real number.
Define the binary random variable Y = 0 for X < k and Y = k for X ≥ k. Using the
relation between X and Y, prove that P(X ≥ k) ≤ E[X]/k. (Hint: start with finding E[Y].)
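An empirical sketch comparing both sides of the inequality for one choice of non-negative random variable and threshold k; both choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
k = 2.0                                   # illustrative threshold

X = rng.exponential(scale=1.0, size=n)    # a non-negative random variable
lhs = np.mean(X >= k)                     # empirical P(X >= k)
rhs = X.mean() / k                        # empirical E[X] / k
print(lhs, rhs)                           # lhs should not exceed rhs (up to noise)
```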
- In this problem, we will empirically observe some of the results we obtained above and also
the convergence properties of certain distributions. You may use the Python libraries NumPy
and Matplotlib.
(a) (5 pts) In 3.a you have found the distribution of Z = X + Y. Let X and Y be Gaussian
random variables with µX = −1, µY = 3, σX² = 1, and σY² = 4. Sample 100000
pairs of X and Y and plot their sum Z = X + Y as a histogram. Are the shape of Z and its
apparent mean consistent with what you have learned in the lectures?
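One possible NumPy/Matplotlib sketch for this part; the seed, bin count, and figure details are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 100_000

X = rng.normal(-1.0, 1.0, size=n)   # mu_X = -1, sigma_X^2 = 1 (std = 1)
Y = rng.normal(3.0, 2.0, size=n)    # mu_Y = 3,  sigma_Y^2 = 4 (std = 2)
Z = X + Y

plt.hist(Z, bins=100, density=True)
plt.title("Histogram of Z = X + Y")
plt.xlabel("z")
plt.ylabel("density")
plt.show()
print(Z.mean(), Z.var())            # compare with the expected mean and variance
```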
(b) (5 pts) Let X ∼ B(n, p) be a binomially distributed random variable. One can use the normal distribution as an approximation to the binomial distribution when n is large and/or
p is close to 0.5. In this case, X ≈ N(np, np(1 − p)). Show how such an approximation behaves by drawing 10000 samples from the binomial distribution with n = 5, 10, 20, 30, 40, 100
and p = 0.2, 0.33, 0.50, and plotting the distributions of samples for each case as a
histogram. Report for which values of n and p the distribution resembles that of a
Gaussian.
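A possible sketch that loops over the requested (n, p) pairs and plots one histogram per case; the figure layout is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
ns = [5, 10, 20, 30, 40, 100]
ps = [0.2, 0.33, 0.5]
n_samples = 10_000

fig, axes = plt.subplots(len(ns), len(ps), figsize=(12, 18))
for i, n in enumerate(ns):
    for j, p in enumerate(ps):
        samples = rng.binomial(n, p, size=n_samples)
        # Integer-centered bins covering the support 0..n.
        axes[i, j].hist(samples, bins=np.arange(n + 2) - 0.5, density=True)
        axes[i, j].set_title(f"n={n}, p={p}")
plt.tight_layout()
plt.show()
```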
(c) (5 pts) You were introduced to the concept of KL-divergence analytically. Now, you will
estimate the divergence KL(p||q), where p(x) = N(0, 1) and q(x) = N(0, 4). Draw 1000
samples from a Gaussian with mean 0 and variance 1. Call them x1, x2, …, x1000.
Estimate the KL divergence as
(1/1000) · ∑_{i=1}^{1000} ln(p(xi) / q(xi)),
where p(x) = N(0, 1) and q(x) = N(0, 4). Calculate the divergence analytically for
KL(p||q). Is the result consistent with your estimate?
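One possible NumPy implementation of the estimator above; the seed is arbitrary, and the analytical value still has to be derived separately for the comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1000)          # samples from p = N(0, 1)

def log_normal_pdf(x, mu, var):
    # Log-density of N(mu, var) evaluated at x.
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mu) ** 2 / (2.0 * var)

# Average log-density ratio ln(p(x_i) / q(x_i)) over the samples from p.
kl_estimate = np.mean(log_normal_pdf(x, 0.0, 1.0) - log_normal_pdf(x, 0.0, 4.0))
print(kl_estimate)   # compare with the analytically computed KL(p || q)
```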
Sample Solution