## Information Measurement

Ky Van Ha
The purpose of communications systems is to transmit information from a source to a sink. However, what exactly is information, and how do we measure it? The field of Information Theory has its origin in Claude Shannon's 1948 paper "A Mathematical Theory of Communication" [7].

## 1. Introduction

Written information must be coded before it can be transmitted over a digital system. We can say that information has the property of reducing the uncertainty of a situation. The measurement of information is thus the measurement of uncertainty. That measurement is called entropy. If entropy is large, then a large amount of information is required to clarify the situation; if entropy is small, then only a small amount of information is required for clarification. Noise in a communications channel is the principal cause of uncertainty. Entropy is a major consideration in modern coding: by reducing entropy per bit, coding can reduce transmission errors (uncertainties) due to the transmission medium [2]. In this paper, we review the notion of entropy and look at some examples in coding techniques and neural network applications.
## 2. Entropy

Consider a random variable x, each presentation of which may be regarded as a message. Usually x is quantized into a finite number of discrete levels, so we can assume that x is a discrete random variable taking values in a finite set [1, 4, 6, 7]:
$$x \in \{x_1, x_2, \ldots, x_K\} \qquad (1)$$

Let $p_k$ be the probability of the event $x = x_k$; we have

$$p_k \ge 0 \quad \text{and} \quad \sum_{k=1}^{K} p_k = 1 \qquad (2)$$

Can we find a measure of how much information is produced by the random variable? In general the probabilities $p_k$ will not be equal.
The information sent from a digital source when the k-th message is transmitted is given by

$$I_k = \log_2 \frac{1}{p_k} \quad \text{bits} \qquad (3)$$

Eq. (3) exhibits the following properties:

- $I_k \ge 0$ for $0 \le p_k \le 1$
- $I_k \to 0$ as $p_k \to 1$
- $I_k > I_j$ for $p_k < p_j$

The latter property means that the less probable an event is, the more information we gain through its occurrence. The average information measure of a digital source is defined by

$$H = \sum_{k=1}^{K} p_k \log_2 \frac{1}{p_k} \quad \text{bits} \qquad (4)$$
This average, called the entropy H of the source, satisfies

$$0 \le H \le \log_2 K \qquad (5)$$

The lower bound on entropy corresponds to no uncertainty, when $p_k = 1$ for some k and the remaining probabilities in the set are all zero. The upper bound on entropy corresponds to maximum uncertainty, when $p_k = 1/K$ for all k.
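The entropy of Eq. (4) and its bounds in Eq. (5) can be verified with a few lines of Python; the distributions below are illustrative:

```python
import math

def entropy(probs):
    """Average information H = sum(p * log2(1/p)) of a discrete source, Eq. (4)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# No uncertainty: one outcome is certain -> H = 0 (lower bound of Eq. (5))
print(entropy([1.0, 0.0, 0.0, 0.0]))          # 0.0

# Maximum uncertainty: K = 4 equiprobable outcomes -> H = log2(4) = 2 bits
print(entropy([0.25, 0.25, 0.25, 0.25]))      # 2.0

# An intermediate case falls strictly between the two bounds
print(entropy([0.5, 0.25, 0.125, 0.125]))     # 1.75
```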
Example: consider a source emitting English text, with letter probabilities estimated from relative frequencies, for instance

- p = 0.02 for the letters c, d, f, l, m, p, u, y
- p = 0.01 for the letters b, g, j, k, q, v, w, x, z

and correspondingly larger probabilities for the more frequent letters. Substituting the complete set of letter probabilities into (4) gives the average information of English text in bits/character.
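The letter-frequency example can be made concrete. In the sketch below, the probabilities of the 17 rare letters are the ones listed above; the values assumed for the remaining nine letters are purely illustrative, chosen only so that the distribution sums to one:

```python
import math

def entropy(probs):
    """H = sum(p * log2(1/p)), Eq. (4), here in bits per character."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

p = {}
for c in "cdflmpuy":
    p[c] = 0.02          # from the example above
for c in "bgjkqvwxz":
    p[c] = 0.01          # from the example above
# Illustrative values for the nine remaining letters (not from the source)
p.update({"e": 0.13, "t": 0.10, "a": 0.08, "h": 0.08, "o": 0.08,
          "i": 0.07, "n": 0.07, "r": 0.07, "s": 0.07})

H = entropy(p.values())
print(round(H, 2), "bits/character")   # well below log2(26) = 4.70 bits for equiprobable letters
```

The unequal letter frequencies push the entropy below the equiprobable upper bound of Eq. (5), which is why English text is compressible.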
The source rate of information is given by

$$R = \frac{H}{T} \quad \text{bps}$$

where $T$ is the time required to send a message. To deal with a noisy environment, another notion of information theory is needed.

## 3. Mutual Information

Let us assume that the output $y$ of a communications channel is a noisy version of its input $x$. The conditional entropy of $x$ given $y$ is

$$H(x \mid y) = H(x, y) - H(y) \qquad (6)$$

where

$$H(x, y) = -\sum_{j}\sum_{k} p(x_j, y_k) \log_2 p(x_j, y_k) \qquad (7)$$

is the entropy of the joint event, and $H(y)$ is the entropy of the output. The conditional entropy in (6), also called the equivocation, measures the uncertainty that remains about $x$ after $y$ has been observed. Shannon showed [7] that the average effective information content is

$$I(x; y) = H(x) - H(x \mid y) \qquad (8)$$

This quantity is called the average mutual information. The entropy is a special case of the mutual information, since $H(x) = I(x; x)$. The average mutual information satisfies

- $I(x; y) = I(y; x)$
- $I(x; y) \ge 0$
- $I(x; y) = H(x) + H(y) - H(x, y)$

The notions of entropy and mutual information can also be applied to continuous random variables, as described in the next section.

## 4. Differential Entropy

The differential entropy of a continuous random variable $x$ with probability density function $f(x)$ is defined by
$$h(x) = -\int_{-\infty}^{\infty} f(x) \log_2 f(x)\, dx \qquad (9)$$

Similar to the case of discrete random variables, we can define the average mutual information between two continuous random variables $x$ and $y$:

$$I(x; y) = h(x) - h(x \mid y) \qquad (10)$$

where

$$h(x \mid y) = -\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y) \log_2 f(x \mid y)\, dx\, dy \qquad (11)$$

is the conditional differential entropy of $x$ given $y$. The average mutual information has all the properties of the discrete case. One application of differential entropy is to find the probability density function $f(x)$ that maximizes $h(x)$ subject to the constraints

$$\int_{-\infty}^{\infty} f(x)\, dx = 1$$

and

$$\int_{-\infty}^{\infty} (x - \mu)^2 f(x)\, dx = \sigma^2$$
where $\mu$ is the mean of $x$ and $\sigma^2$ is its variance. By using the Lagrange multiplier method, it is not difficult to find that

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \qquad (12)$$

which is recognized as the probability density of a Gaussian random variable $x$. In this case, the differential entropy becomes

$$h(x) = \frac{1}{2} \log_2 (2\pi e \sigma^2) \qquad (13)$$

In the next two sections, we consider two applications of entropy. The first is minimum entropy coding, and the second is the principle of maximum information preservation, which is used in neural network training.

## 5. Minimum Entropy Coding

Let $x$ be a random variable taking values in a finite alphabet X. A binary source code C for X is a mapping from X into the set of finite-length binary sequences, called code words.
The aim of minimum entropy coding is to find a set of symbols to represent the message such that, in the normal environment, the occurrence of each symbol is independent of the occurrence of any of the others. If such a set can be found, the code is called a factorial code, since the probabilities of the 1s and 0s in the code word are factors of the probability of the input state the word represents. Example:
Consider the code C(1) = 0, C(2) = 10, C(3) = 110, C(4) = 111. Such a code is called prefix free, since no code word is the prefix of another code word. For each $i \in X$, let $l_i$ be the length of the code word C(i), let $L = \max_i l_i$, and let $p_i$ be the probability that $x = i$. The Kraft inequality is a necessary condition on the code word lengths of any prefix-free code [3]:
$$\sum_{i \in X} 2^{-l_i} \le 1 \qquad (14)$$

The problem of finding the best code is thus to find a set of lengths that satisfies the Kraft inequality and minimizes the average code word length

$$\bar{L} = \sum_{i \in X} p_i l_i$$
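The Kraft inequality (14) is straightforward to check for a candidate set of code words; a minimal sketch, with illustrative code words:

```python
def kraft_sum(codewords):
    """Sum of 2^(-l_i) over the code word lengths; Eq. (14) requires this to be <= 1
    for any prefix-free code."""
    return sum(2.0 ** -len(w) for w in codewords)

# A prefix-free code with lengths 1, 2, 3, 3: the inequality holds with equality
print(kraft_sum(["0", "10", "110", "111"]))   # 1.0

# Lengths 1, 1, 2 cannot belong to any prefix-free code: the sum exceeds 1
print(kraft_sum(["0", "1", "10"]))            # 1.25
```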
Such a code is called optimal. One can show that the average length of an optimal code satisfies

$$H(x) \le \bar{L} < H(x) + 1 \qquad (15)$$

A simple algorithm that generates optimal prefix-free codes is the Huffman algorithm, which can be described as follows:

- Arrange the probabilities in decreasing order, so that $p_1 \ge p_2 \ge \cdots \ge p_m$.
- Form a subtree by combining the last two probabilities $p_m$ and $p_{m-1}$ into a single node of weight $p'_{m-1} = p_m + p_{m-1}$.
- Iterate steps 1 and 2, decreasing the number of nodes each time, until a single node is obtained.
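The steps above can be sketched as follows. The symbol probabilities are illustrative, and ties between equal weights are broken by insertion order:

```python
import heapq

def huffman(probs):
    """Build an optimal prefix-free code with the Huffman algorithm.

    probs: dict mapping symbol -> probability. Returns dict symbol -> code word.
    """
    # Each heap entry: (weight, tie_breaker, {symbol: partial code word})
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        # Combine the two smallest-weight nodes into one (steps 1 and 2 above)
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        count += 1
        heapq.heappush(heap, (p1 + p2, count, merged))
    return heap[0][2]

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman(probs)
avg_len = sum(probs[s] * len(w) for s, w in code.items())
print(avg_len)   # 1.75 (equals the entropy, since all probabilities are powers of 1/2)
```

Because every probability here is a power of 1/2, the average length meets the lower bound of (15) exactly; for other distributions it falls strictly between the two bounds.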
## 6. The Maximum Information Preservation

Let

$$y = \sum_{i} w_i x_i + n \qquad (16)$$

be a transformation defined on the input vector $x$, where $n$ is the processing noise. Assume that $y$ is a Gaussian random variable with variance $\sigma_y^2$, and that $n$ is also a Gaussian random variable, of zero mean and variance $\sigma_n^2$. Assume that the processing noise is uncorrelated with any of the input components, i.e. $E[n x_i] = 0$ for all $i$.
We have from (10)

$$I(y; x) = h(y) - h(y \mid x)$$

where $h(y \mid x)$ is the uncertainty about $y$ that remains once $x$ is known; since the noise is uncorrelated with the input, this is just the entropy of the noise, $h(y \mid x) = h(n)$. Therefore, applying (13) to the two Gaussian variables,

$$I(y; x) = \frac{1}{2} \log_2 \frac{\sigma_y^2}{\sigma_n^2}$$

The ratio $\sigma_y^2 / \sigma_n^2$ can be seen as the "signal-to-noise ratio". If $\sigma_n^2$ is assumed to be fixed, then $I(y; x)$ is maximum whenever $\sigma_y^2$ is maximum. This is called the principle of maximum information preservation. Let us look at an example.
Suppose that $x$ is the input data to a neural network and $d$ is the target output data, which is of course not independent of $x$. We would like to compute the mutual information $I(x; d)$. We construct two neural networks:

$$a = \sum_{i} w_i x_i \qquad (17)$$

$$b = \sum_{i} v_i d_i \qquad (18)$$

Kay has shown [5] that

$$I(a; b) = -\frac{1}{2} \sum_{i=1}^{r} \log_2 \left(1 - \rho_i^2\right)$$
where $r$ is the rank of the cross-covariance matrix of $x$ and $d$, and $\rho_i$ is the normalized correlation coefficient between $a_i$ and $b_i$, for $i = 1, \ldots, r$. The weight vectors are updated iteratively so as to maximize $I(a; b)$, using a learning rate $\eta$ that is reduced as training proceeds.
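Kay's expression can be evaluated directly from the correlation coefficients. A minimal sketch, assuming base-2 logarithms so that the result is in bits, with illustrative coefficients:

```python
import math

def mutual_info_from_corr(rhos):
    """Kay's expression: I(a; b) = -1/2 * sum over i of log2(1 - rho_i^2),
    where rhos are the normalized correlation coefficients for i = 1..r."""
    return -0.5 * sum(math.log2(1.0 - r ** 2) for r in rhos)

print(round(mutual_info_from_corr([0.3]), 3))        # 0.068  (weak correlation, little information)
print(round(mutual_info_from_corr([0.9]), 3))        # 1.198  (strong correlation, much more)
print(round(mutual_info_from_corr([0.5, 0.5]), 3))   # 0.415  (contributions add over the r components)
```

Note that $I \to 0$ as every $\rho_i \to 0$ and $I \to \infty$ as any $\rho_i \to 1$, matching the intuition that perfectly correlated Gaussian variables determine each other exactly.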
The start values are $w(0)$ and $v(0)$. Thus we can compute $I(x; d)$ by computing $I(a; b)$.

## References

[1] Leon W. Couch II, "Digital and Analog Communication Systems", Prentice Hall, 1997.
[2] Roger L. Freeman, "Telecommunications Transmission Handbook".

[3] Jerry D. Gibson, "The Communications Handbook".

[4] Simon Haykin, "Neural Networks: A Comprehensive Foundation".

[5] Kay, J., "Feature discovery under contextual supervision using mutual information".

[6] Bernard Sklar, "Digital Communications".

[7] N. J. A. Sloane & Aaron D. Wyner (eds.), "Claude E. Shannon: Collected Papers", IEEE Press, 1993, pp. 5-83.
Copyright: 1998, 1999, Høgskolen i Østfold. Last update: March 1999, Jan Høiberg.