# [math] What is “entropy and information gain”?

To begin with, it would be best to understand `the measure of information`.

# How do we `measure` the information?

When something unlikely happens, we say it's big news. Conversely, when something predictable happens, it's not really interesting. So to quantify this `interesting-ness`, the function should satisfy:

- if the probability of the event is 1 (predictable), then the function gives 0
- if the probability of the event is close to 0, then the function should give a high number
- if an event with probability 0.5 happens, it gives `one bit` of information

One natural measure that satisfies these constraints is

```
I(X) = -log_2(p)
```

where *p* is the probability of the event `X`. The unit is the `bit`, the same bit computers use: 0 or 1.

## Example 1

Fair coin flip:

How much information can we get from one coin flip?

Answer: `-log_2(p) = -log_2(1/2) = 1 (bit)`

## Example 2

If a meteor strikes the Earth tomorrow with `p = 2^{-22}`, then we get 22 bits of information.

If the Sun rises tomorrow with `p ≈ 1`, then it is 0 bits of information.
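The two examples above are easy to check numerically. A minimal sketch (the function name `self_information` is my own):

```python
import math

def self_information(p):
    """Information content I(X) = -log2(p), in bits, of an event with probability p."""
    return -math.log2(p)

print(self_information(0.5))      # fair coin flip -> 1.0 bit
print(self_information(2 ** -22)) # unlikely meteor strike -> 22.0 bits
```

Note how the rarer the event, the more bits of information its occurrence carries.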

# Entropy

So if we take the expectation of the `interesting-ness` of an event `Y`, we get the entropy; i.e. entropy is the expected value of the interesting-ness of an event.

```
H(Y) = E[I(Y)]
```

More formally, the entropy is the expected number of bits of an event.

## Example

Y = 1 : the event X occurs, with probability p

Y = 0 : the event X does not occur, with probability 1-p

```
H(Y) = E[I(Y)] = p I(Y=1) + (1-p) I(Y=0)
     = -p log p - (1-p) log(1-p)
```

All logs are base 2.
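This binary entropy formula translates directly into code. A small sketch (the name `binary_entropy` is my own), with the usual convention that `0 log 0 = 0`:

```python
import math

def binary_entropy(p):
    """H(Y) = -p log2(p) - (1-p) log2(1-p), taking 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0  # a certain event carries no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))  # maximal uncertainty -> 1.0 bit
print(binary_entropy(1.0))  # fully predictable  -> 0.0 bits
```

Entropy peaks at p = 0.5, exactly when the outcome is hardest to predict, and falls to 0 as the event becomes certain either way.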

I am reading this book (NLTK) and it is confusing. **Entropy** is defined as:

Entropy is the sum of the probability of each label times the log probability of that same label

How can I apply *entropy* and *maximum entropy* in terms of text mining? Can someone give me an easy, simple example (visual)?

I really recommend you read about Information Theory, Bayesian methods, and MaxEnt. The place to start is this (freely available online) book by David MacKay:

http://www.inference.phy.cam.ac.uk/mackay/itila/

Those inference methods are really far more general than just text mining, and I can't really see how one would learn to apply this to NLP without learning some of the general basics contained in this book or in other introductory books on Machine Learning and MaxEnt Bayesian methods.

The connection of entropy and probability theory to information processing and storage is really, really deep. To give a taste of it, there's a theorem due to Shannon stating that the maximum amount of information you can pass without error through a noisy communication channel is equal to the entropy of the noise process. There's also a theorem that connects how much you can compress a piece of data, to occupy the minimum possible memory in your computer, to the entropy of the process that generated the data.

I don't think it's really necessary that you go learning about all those theorems in communication theory, but it's not possible to learn this without learning the basics of what entropy is, how it's calculated, what its relationship with information and inference is, etc.

As you are reading a book about NLTK, it would be interesting for you to read about the MaxEnt Classifier module: http://www.nltk.org/api/nltk.classify.html#module-nltk.classify.maxent

For text mining classification, the steps could be: pre-processing (tokenization, stemming, feature selection with Information Gain, ...), transformation to numeric features (frequency or TF-IDF) (I think this is the key step to understand when using text as input to an algorithm that only accepts numeric input), and then classification with MaxEnt. Of course, this is just one example.
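To tie the feature-selection step back to the entropy discussion above, here is a minimal sketch of Information Gain on a toy spam/ham corpus. All names and the tiny corpus are my own illustrations, not NLTK APIs: information gain is just the drop in label entropy after splitting the documents on whether they contain a word.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (bits) of a label distribution: -sum p(label) * log2 p(label)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(docs, labels, word):
    """Entropy reduction from splitting the corpus on presence of `word`."""
    with_word = [lab for doc, lab in zip(docs, labels) if word in doc]
    without   = [lab for doc, lab in zip(docs, labels) if word not in doc]
    p = len(with_word) / len(labels)
    return entropy(labels) - p * entropy(with_word) - (1 - p) * entropy(without)

# toy corpus (hypothetical): each document is a set of tokens
docs = [{"free", "money"}, {"free", "win"}, {"meeting", "notes"}, {"lunch", "notes"}]
labels = ["spam", "spam", "ham", "ham"]

print(entropy(labels))                            # 1.0 bit: labels are 50/50
print(information_gain(docs, labels, "free"))     # "free" splits perfectly -> 1.0
```

Words with high information gain are the most useful features to keep; a word that appears equally often in both classes would have a gain near 0.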