Wednesday, December 19, 2012

That's a pretty good guess... Naïve Bayesian Classifiers

... so, I don't need to write any more, correct? Classification problem? Apply a naïve Bayesian system to it, and you are way ahead of the curve, and you've expended a fraction of the effort to stand the system up.

... shortest posting ever written.

So, anyway.

So, a Bayesian classifier is a predictor.  It says, that is, if it were to speak, it says, "Hm, I've seen this before, so I think this is a that." And, well, given a reasonable tolerance for error or noise, it's pretty good at doing its job. In fact, it works supremely well in noisy environments, shines in them, even, where other models that are more rigidly defined tend rapidly to become unstable at the slightest irregularity (the inductive-logic-programming model comes to mind).

Yes, I've worked on classification systems. We've had a hybrid system composed of a deductive rule-based system (that we hashed out with Subject Matter Experts over months of establishing, parameterizing, then fine tuning each rule) running in parallel with a Bayesian classifier. Guess where 99% of our usable results came from?

Sir Bayes walked away with the win, nearly every time.

How does the naïve Bayesian classifier work?

The algorithm is simplicity itself ... embarrassingly so. It starts from an observation: there is a probability of something happening or being so, and that probability depends on all of its antecedents.

Obviously. All these factors occurred, lining up to cause this thing to be.

Let's give a real-world example.

The probability of Tennessee winning last Monday's game (they did) was predicated on:
  • It was a Monday night game -- Monday or F1
  • They were playing on their home field -- Home or F2
  • They were favored to win (with an even 2-point spread advantage) -- 2.0 spread or F3
  • They were playing against the New York Jets -- vs(NYJ) or F4
or p(win(Tennessee) | Monday, Home, 2.0 spread, vs(NYJ))

That works, but only when everything lines up exactly so.

So the brilliance of Bayes, plus the 'naïve' assumption that the antecedents are independent given the class, is that the above probability can be rewritten (up to a normalizing constant) as:

p(win(Tennessee)) * Πi p(Fi | win(Tennessee))

And there you have your classifier: you run each class through the probabilities of its antecedents, and the class with the greatest probability is the winner.

Simple, and the more input data the system gets, matched against labeled classes, the better it gets at classification.
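Here's a minimal sketch of that scoring rule in Python. The class priors, feature names, and likelihoods below are made-up toy numbers for illustration, not real game statistics or anything from a real system:

```python
# Toy priors and per-class feature likelihoods, p(F_i | class).
priors = {"win": 0.55, "loss": 0.45}

likelihoods = {
    "win":  {"monday": 0.30, "home": 0.60, "favored": 0.70},
    "loss": {"monday": 0.25, "home": 0.45, "favored": 0.40},
}

def classify(features):
    """Multiply the prior by each feature likelihood; the largest score wins."""
    best_class, best_score = None, 0.0
    for cls, prior in priors.items():
        score = prior
        for f in features:
            score *= likelihoods[cls][f]
        if score > best_score:
            best_class, best_score = cls, score
    return best_class

print(classify(["monday", "home", "favored"]))  # -> "win" with these toy numbers
```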

Problems

Well, I never promised you a rose garden...

One of the problems is a real-world concern, not a mathematical one: many of these probabilities are minute, and multiplying them together drives the scale down to the minuscule very quickly. Computers are ... well: 'eh' at handling very small fractional numbers; floating-point representation rolls imprecision into them, compounding the problem with each multiplication, until the product eventually underflows.

How to handle these very small numbers?

Well, obviously, scale them. And this is very simply handled, for, after all, if it's a sense of scale that concerns you, you just rescale your domain. The logarithm does that. 'Automagically.'

So, the above classifier can be rewritten:

log(p(win(Tennessee))) + Σi log(p(Fi | win(Tennessee)))

And the scaling issue for computers just goes away.
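The same toy classifier as before, rescored in log space (sums of logarithms instead of a product of probabilities), looks something like this; again, the numbers passed in would be the hypothetical priors and likelihoods from the earlier sketch:

```python
import math

def classify_log(features, priors, likelihoods):
    """Score each class as log(prior) + sum of log-likelihoods; largest wins."""
    best_class, best_score = None, float("-inf")
    for cls, prior in priors.items():
        score = math.log(prior)
        for f in features:
            score += math.log(likelihoods[cls][f])
        if score > best_score:
            best_class, best_score = cls, score
    return best_class
```

Because only the ordering of the scores matters, working in log space changes nothing about which class wins; it only keeps the numbers in a range the hardware handles comfortably.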

There's another problem, however, and that is how to deal with the messiness of real-world input data.

The naïve Bayesian classifier assumes that every piece of evidence has an observed probability for each class. If a single piece of evidence is missing, its probability is zero, and that one zero factor voids the entire product.

So, what to do?

I've observed that voiding the formula for one datum often leads to classifiers that oscillate too wildly and are therefore unusable.

So, I neutralize the nulled probability: instead of letting it void the formula, I set it to one (unity), so it acts as a 'pass-through.'

This is an artifice, I know. I'm intentionally saying a missing datum is actually fully present, but I've found that this works 'eh' ... that is 'passably well.'
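In log space the pass-through is just contributing log(1) = 0 to the sum. A small sketch of that idea, using the same hypothetical likelihoods table as above:

```python
import math

def feature_log_prob(likelihoods, cls, feature):
    """Return log p(feature | cls), or 0.0 (i.e., log 1) when the datum is missing."""
    p = likelihoods.get(cls, {}).get(feature)
    if p is None or p == 0.0:
        return 0.0  # pass-through: the missing datum neither helps nor hurts
    return math.log(p)
```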

I've done this on two classifiers so far, and the classifiers are pretty darn good, but not good enough to my mind.

This is a fundamental problem with the naïve Bayesian classifier, and I'm sure everyone who undertakes this approach has to deal with this, one way or the other. What are good ways to make the naïve Bayesian classifier much more effective, but without excessive effort or thought around null data points that come up?

A problem, but not a crippling one.

I won last week's picks, after all, and that's not the first time.

1 comment:

Anonymous said...

With regards to your second issue (zero probabilities for unobserved events), this is a well-researched area and the solution is called "smoothing". There are, of course, numerous different smoothing methods, each with their own ups and downs. Chen and Goodman did an excellent review of a slew of these (1996 original, 1998 TR, 1999 summary), albeit biased towards the applications in natural language processing.