- A function P : A ⊂ Ω → R is called a probability law over the sample space Ω if it satisfies the following three probability axioms.

• (Nonnegativity) P(A) ≥ 0, for every event A.

• (Countable additivity) If A and B are two disjoint events, then the probability of their union satisfies

P(A ∪ B) = P(A) + P(B).

More generally, for a countable collection of disjoint events A1, A2, … we have

P(⋃_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai).

• (Normalization) The probability of the entire sample space is 1, that is, P(Ω) = 1.

(a) (5 pts) Prove, using only the axioms of probability given, that P(A) = 1 − P(Aᶜ) for any event A and probability law P, where Aᶜ denotes the complement of A.

(b) (5 pts) Let E1, E2, …, En be disjoint sets such that ⋃_{i=1}^n Ei = Ω and let P be a probability law over the sample space Ω. Show that, for any event A, we have

P(A) = Σ_{i=1}^n P(A ∩ Ei).

(c) (5 pts) Prove that for any two events A, B we have

P(A ∩ B) ≥ P(A) + P(B) − 1.
- (10 pts) Two fair dice are thrown. Let

X = 1 if the sum of the numbers is ≤ 5, and X = 0 otherwise;

Y = 1 if the product of the numbers is odd, and Y = 0 otherwise.

What is Cov(X, Y)? Show your steps clearly.
- (10 pts) Derive the mean of the Poisson distribution.
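For the dice covariance problem above, the hand computation can be sanity-checked by enumerating all 36 equally likely outcomes (a sketch for self-checking only; the problem still asks for the steps to be shown analytically):

```python
from itertools import product

# Enumerate all 36 equally likely outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))
n = len(outcomes)

# Indicator variables as defined in the problem.
X = [1 if a + b <= 5 else 0 for a, b in outcomes]
Y = [1 if (a * b) % 2 == 1 else 0 for a, b in outcomes]

EX = sum(X) / n
EY = sum(Y) / n
EXY = sum(x * y for x, y in zip(X, Y)) / n

cov = EXY - EX * EY  # Cov(X, Y) = E[XY] - E[X]E[Y]
print(cov)
```

The printed value should match whatever Cov(X, Y) you obtain by hand.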
- In this problem, we will explore certain properties of probability distributions and introduce

new important concepts.


(a) (5 pts) Recall Pascal’s Identity for combinations: (N choose m) + (N choose m−1) = (N+1 choose m).

Use the identity to show the following:

(1 + x)^N = Σ_{m=0}^N (N choose m) · x^m,

which is called the binomial theorem. Hint: You can use induction.

Finally, show that the binomial distribution with parameter p is normalized, that is

Σ_{m=0}^N (N choose m) · p^m · (1 − p)^{N−m} = 1.

(b) (5 pts) Suppose you wish to transmit the value of a random variable to a receiver. In Information Theory, the average amount of information you will transmit in the process (in units of “nat”) is obtained by taking the expectation of −ln p(x) with respect to the distribution p(x) of your random variable and is given by

H(x) = −∫ p(x) · ln p(x) · dx

This quantity is the entropy of your random variable. Calculate and compare the entropies of a uniform random variable x ∼ U(0, 1) and a Gaussian random variable z ∼ N(0, 1).
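A quick Monte Carlo cross-check for this part, assuming NumPy is available: since H = E[−ln p(x)], the entropy can be estimated by averaging −ln p(xᵢ) over samples drawn from p. This is a sketch for checking your analytical answers, not the requested calculation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Uniform U(0, 1): the density is 1 on (0, 1), so -ln p(x) = 0 everywhere.
u = rng.uniform(0.0, 1.0, n)
h_uniform = np.mean(-np.log(np.ones_like(u)))  # trivially 0

# Standard Gaussian N(0, 1): -ln p(x) = 0.5*ln(2*pi) + x**2 / 2.
z = rng.standard_normal(n)
h_gauss = np.mean(0.5 * np.log(2 * np.pi) + z**2 / 2)

print(h_uniform, h_gauss)
```

The two printed numbers should agree (up to sampling noise) with the closed-form entropies you derive by hand.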

(c) In many applications, e.g. in Machine Learning, we wish to approximate some probability distribution using function approximators we have available, for example deep

neural networks. This creates the need for a way to measure the similarity or the distance between two distributions. One proposed such measure is the relative entropy

or the Kullback-Leibler divergence. Given two probability distributions p and q the

KL-divergence between them is given by

KL(p||q) = ∫_{−∞}^{∞} p(x) · ln (p(x) / q(x)) · dx

i. (2 pts) Show that the KL-divergence between equal distributions is zero.

ii. (2 pts) Show that the KL-divergence is not symmetric, that is, KL(p||q) ≠ KL(q||p)

in general. You can do this by providing an example.

iii. (16 pts) Calculate the KL divergence between p(x) ∼ N(µ1, σ1²) and q(x) ∼ N(µ2, σ2²) for µ1 = 2, µ2 = 1.8, σ1² = 1.5, σ2² = 0.2. First, derive a closed-form solution depending on µ1, µ2, σ1, σ2. Then, calculate its value. (A numerical answer given without clearly showing your steps will not be graded.)

Remark: We call this measure a divergence since a proper distance function must be

symmetric.
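For part iii, a Monte Carlo estimate can cross-check whatever closed form you derive. The sketch below compares such an estimate against the standard closed form for the KL divergence between two univariate Gaussians, stated here without the derivation the problem asks you to supply:

```python
import numpy as np

mu1, var1 = 2.0, 1.5  # parameters of p
mu2, var2 = 1.8, 0.2  # parameters of q

# Monte Carlo: KL(p||q) = E_p[ln p(x) - ln q(x)], estimated from samples of p.
rng = np.random.default_rng(0)
x = rng.normal(mu1, np.sqrt(var1), 500_000)
log_p = -0.5 * np.log(2 * np.pi * var1) - (x - mu1) ** 2 / (2 * var1)
log_q = -0.5 * np.log(2 * np.pi * var2) - (x - mu2) ** 2 / (2 * var2)
kl_mc = np.mean(log_p - log_q)

# Standard closed form for KL(N(mu1, var1) || N(mu2, var2)).
kl_closed = 0.5 * np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / (2 * var2) - 0.5

print(kl_mc, kl_closed)
```

The two values should agree up to sampling noise; deriving the closed form is still the graded part of the exercise.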

- In this problem, we will explore some properties of random variables and in particular that

of the Gaussian random variable.

(a) (7 pts) The convolution of two functions f and g is defined as

(f ∗ g)(t) = ∫_{−∞}^{∞} f(τ) g(t − τ) dτ

One can calculate the probability density function of the random variable Z = X + Y using the convolution operation, with X and Y independent and continuous random variables. In fact,

fZ(z) = ∫_{−∞}^{∞} fX(τ) fY(z − τ) dτ

Using this fact, find the probability density function of Z = X + Y, where X and Y are independent standard Gaussian random variables. Find µZ, σZ. Which distribution does Z belong to? (Hint: use √π = ∫_{−∞}^{∞} e^{−x²} dx)
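An empirical check for this part, assuming NumPy: the sum of two independent standard Gaussians should show the mean and variance your convolution computation predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

x = rng.standard_normal(n)  # X ~ N(0, 1)
y = rng.standard_normal(n)  # Y ~ N(0, 1), independent of X
z = x + y                   # Z = X + Y

print(z.mean(), z.var())  # compare against the mu_Z, sigma_Z you derived
```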

(b) (5 pts) Let X be a standard Gaussian random variable and Y be a discrete random variable taking values {−1, 1} with equal probabilities. Is the random variable Z = XY independent of Y? Give a formal argument (proof or counterexample) justifying your answer.
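Before writing the formal argument, an empirical look can guide your answer: compare summary statistics of Z conditioned on each value of Y. This is only evidence, not a proof:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

x = rng.standard_normal(n)             # X ~ N(0, 1)
y = rng.choice([-1, 1], size=n)        # Y = ±1 with equal probability
z = x * y                              # Z = XY

# Conditional mean and spread of Z for each value of Y.
for v in (-1, 1):
    zs = z[y == v]
    print(v, zs.mean(), zs.std())
```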

(c) (8 pts) Let X be a non-negative random variable. Let k be a positive real number. Define the binary random variable Y = 0 for X < k and Y = k for X ≥ k. Using the relation between X and Y, prove that P(X ≥ k) ≤ E[X] / k. (Hint: start with finding E[Y].)
- In this problem, we will empirically observe some of the results we obtained above and also the convergence properties of certain distributions. You may use the Python libraries NumPy and Matplotlib.

(a) (5 pts) In 3.a you found the distribution of Z = X + Y. Let X and Y be Gaussian random variables with µX = −1, µY = 3, σX² = 1, and σY² = 4. Sample 100000 pairs of X and Y and plot their sum Z = X + Y as a histogram. Are the shape of Z and its apparent mean consistent with what you have learned in the lectures?
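One possible starting point for this part, using NumPy (the histogram itself can then be drawn with Matplotlib, e.g. `plt.hist(z, bins=100, density=True)`):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

x = rng.normal(-1.0, 1.0, n)  # mu_X = -1, sigma_X^2 = 1
y = rng.normal(3.0, 2.0, n)   # mu_Y = 3,  sigma_Y^2 = 4 (std = 2)
z = x + y

# Bin the samples; feed counts/edges to a bar plot, or call plt.hist directly.
counts, edges = np.histogram(z, bins=100, density=True)

print(z.mean(), z.var())  # compare with the mean and variance you expect for Z
```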

(b) (5 pts) Let X ∼ B(n, p) be a binomially distributed random variable. One can use the normal distribution as an approximation to the binomial distribution when n is large and/or p is close to 0.5. In this case, X ≈ N(np, np(1 − p)). Show how this approximation behaves by drawing 10000 samples from the binomial distribution for n = 5, 10, 20, 30, 40, 100 and p = 0.2, 0.33, 0.50, and plotting the distribution of samples for each case as a histogram. Report for which values of n and p the distribution resembles that of a Gaussian.
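A sampling loop for this part might look like the following sketch (plotting each case as a histogram, e.g. with `plt.hist`, is left as the problem describes):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

for n in (5, 10, 20, 30, 40, 100):
    for p in (0.2, 0.33, 0.50):
        samples = rng.binomial(n, p, n_samples)
        # The approximation suggested by the problem is N(np, np(1-p));
        # compare the sample mean/variance against np and np(1-p).
        print(n, p, samples.mean(), samples.var(), n * p, n * p * (1 - p))
```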

(c) (5 pts) You were introduced to the concept of KL-divergence analytically. Now, you will estimate the divergence KL(p||q), where p(x) = N(0, 1) and q(x) = N(0, 4). Draw 1000 samples from a Gaussian with mean 0 and variance 1; call them x1, x2, …, x1000. Estimate the KL divergence as

(1/1000) · Σ_{i=1}^{1000} ln (p(xi) / q(xi))

Then calculate the divergence analytically for KL(p||q). Is the result consistent with your estimate?
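The estimator described above can be sketched as follows (the analytical value is still yours to derive and compare):

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, 1000)  # 1000 samples from p = N(0, 1)

# Log densities of p = N(0, 1) and q = N(0, 4) (variance 4, i.e. std 2).
log_p = -0.5 * np.log(2 * np.pi * 1.0) - xs**2 / (2 * 1.0)
log_q = -0.5 * np.log(2 * np.pi * 4.0) - xs**2 / (2 * 4.0)

# Sample average of ln(p(x_i)/q(x_i)) over the 1000 draws.
kl_estimate = np.mean(log_p - log_q)
print(kl_estimate)  # compare against your analytical KL(p||q)
```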

Sample Solution