HØit Nr. 1-99
Previous article Next article TOC: Nr. 1, 1999 Previous Issue Next Issue About HØit

Information Measurement

Ky Van Ha

The purpose of communications systems is to transmit information from a source to a sink. However, what exactly is information, and how do we measure it? The field of Information Theory has its origin in Claude Shannon's 1948 paper "A Mathematical Theory of Communications"[7].

1. Introduction

Written information must be coded before it can be transmitted over a digital system. We can say that information has the property of reducing the uncertainty of a situation. The measurement of information is thus the measurement of the uncertainty. That measurement is called Entropy. If entropy is large, then a large amount information is required to clarify the situation. If entropy is small, then only a small amount of information is required for clarification. Noise in a communications channel is the principal cause of uncertainty. Entropy is a major consideration of modern code. By reducing entropy per bit coding can reduce transmission errors (uncertainties) due to the transmission medium [2]. In this paper, we will review the entropy notation and look at some examples in coding techniques and neural network applications.

2. Entropy

Consider a random variable x, each presentation of which may be regarded as a message. It is usually that x is quantized into a finite number of discrete levels. Therefore we can assume that x is a discrete random variable [1,4, 6, 7]:


Let be the probability for the event , we have



Can we find a measure of how much information is produced by the random variable x taking in the discrete value xk? In general, the information content will vary from message to message, because the pk will not be equal.

The information sent from a digital source when the k-th message is transmitted is given by


Eq. (3) exhibits the following properties:




The latter property means that the less probable an event is, the more information we gain through its occurrence. The average information measure of a digital source is defined by


H is called the entropy. The unit for H is bits per event. The term bit here denotes a unit of information and is not to be confused with bit, means "binary digit", a unit of data. We have


The lower bound on entropy corresponds to no uncertainty, when p = 1 for some k, and the remaining probabilities in the set are all zeros. The upper bound on entropy corresponds to maximum uncertainty, when pk=1/(L+1) for all k.

Average Information Content in the English Language

a. Calculate the average information in bits/character for the English language, assuming that each of the 26 characters in the alphabet occurs with equal likelihood. Neglect spaces and punctuation.

We have


b. Since the alphabetic characters do not appear with equal frequency, the answer in part (a) will represent the upper bound on average information content per character. Repeat part (a) under assumption that the alphabetic characters occur with the following probabilities:

p = 0.10 for the letters a, e, o, t
= 0.07 for the letters h, i, n, r, s
= 0.02 for the letters c, d, f, l, m, p, u, y
= 0.01 for the letters b, g, j, k, q, v, w, x, z

We have


The source rate of information is given by


where H is entropy and T is the time required to send a message.

To deal with the noisy environment, another notation of information theory, Mutual Information, must be observed.

3. Mutual Information

Lets us assume that the output y representing a noisy version of the input x. How can we measure the uncertainty about x after observing y?

The conditional entropy of x given y is defined as follows:




is the entropy of the joint event, and p(x,y) is the joint probability of x and y.

The conditional entropy in (6) that is also called equivocation represents the amount of uncertainty about the transmitted message x, having receiving message y.

Shannon showed that [7] the average effective information content, I(x,y), at the receiver, is obtained by subtracting the equivocation from the entropy of the source :


This quantity is called the average mutual information. The entropy is a special case of this mutual information since we have I(x, x) = H(x). Some properties of the mutual information can be listed as follows:




The notation about the entropy and the mutual information can be applied to the continuous random variables as describes in the next section.

4. Differential Entropy

The differential entropy of a continuous random variable x with the probability density function f(x) is defined by


Similar to the case of discrete random variables, we can define the average mutual information between two continuous random variables x, y as follows:




is the conditional differential entropy of y given x.

The average mutual information has all properties of the discrete case.

One of the applications of differential entropy is to find the probability density function f(x) for which h(x) is maximum, subject to the two constraints:


Where is the mean of x and is its variance.

By using the LaGrange Multiplier method, it is not difficult to find that


Which is recognized as the probability density of a Gaussian random variable x. In this case, the differential entropy becomes:


In the next two sections, we will consider two applications of the entropy. The first is minimum entropy coding, and the second is the principle of maximum information preservation that is used in neural network training.

5. Minimum Entropy Coding

Let x be a random variable taking values an a finite alphabet X. A binary source code C for X is a mapping from X into the set of the finite length binary sequences called code words.

The aim of minimum entropy coding is to find a set of symbols to represent the message such that, in the normal environment, the occurrence of each symbol is independent of the occurrence of any of the others. If such a set can be found it is called the factorial coding, since the probabilities of the 1s and 0s in the code word are factors of the input state the word represents.


X = {1,2,3,4,5,6,7}

C = {1,010,011,0000,0001,0010,0011}


Such a code is called prefix free since no code word is the prefix of another code word. For each i in X, let C(i) be the code word associated to the symbol that x = i, with li its length. Let L = maxili and pi be the probability that x = i. The Kraft inequality is a necessary condition on the code word lengths of any prefix free code [3]:


The problem of finding the best code is just to find a set of lengths that satisfies the Kraft inequality and minimizes the average code word length

Such a code is called optimal. One can show that


A simple algorithm that generates optimal prefix free codes is the Huffman algorithm, which can be described as follows:

  1. Arrange the probabilities in decreasing order so that

  2. Form a subtree by combining the last two probability pm and pm-1 into a single node of weight

    pm = pm-1+ pm

  3. Iterate step 1 and 2, decreasing the number of nodes each time, until a single node is obtained.
Apply the Huffman algorithm for our example, we get the following binary tree:


6. The Maximum Information Preservation

Let be a transformation defined by


where is the processing noise. Assume that y is a Gaussian random variable with variance and is also a Gaussian random variable of zero mean and variance . Assume that the processing noise is uncorrelated with any of the input components, i.e.

We have from (10)

where is the information about y when we know x. It is not anything else than the processing noise, i.e.


can be seen as the "signal-to-noise ratio". If is assumed to be fixed, then is maximum whenever is maximum. It is called the principle of maximum information preservation.

Let us look at an example.


Suppose that is the input data to a neural network and is the target output data that is of course independent of . We would like to compute the mutual information .

We construct two neural networks:



Kay has shown that [5]

where r is the rank of the cross-covariance matrix , andis the normalized correlation coefficient between and for .

The weight vectors are updated as follows:

where is the learning rate that is reduced when n increases. And

with start values

Thus we can compute by computing .


[1] Leon W. Couch II, "Digital And Analog Communication Systems", Prentice Hall Inc., 1997.

[2] Roger L. Freeman, "Telecommunications Transmission Handbook", John Wiley & Sons Inc., 1998

[3] Jerry D. Gibson, The Communications Handbook, CRC Press, Inc., 1997, pp. 224-235.

[4] Simon Haykin, "Neural Networks: A Comprehensive Foundation", Macmillan College Publishing Company, Inc., 1994, pp. 444-471.

[5] Kay, J., "Feature discovery under contextual supervision using mutual information", International Joint Conference on Neural Networks, vol. 4, Baltimore MD, 1992, pp79-84.

[6] Bernard Sklar, Digital Communications, Prentice Hall Inc., 1988.

[7] N. J. A. Sloane & Aaron D. Wyner, "Claude E. Shannon Collected Papers", IEEE Press, 1993, pp. 5-83.

Previous article Next article TOC: Nr. 1, 1999 Previous Issue Next Issue About HØit
HØit Nr. 1-99

Copyright: 1998, 1999, Høgskolen i Østfold. Last Update: March.99, Jan Høiberg.