Friday, March 7, 2014

Bayesian Football Picker: an implementation

Some time ago I wrote an article about applying Bayes to pick winners for the weekly football pool we had at our office. So, that pool went on for seventeen weeks, and I came out as one of the winners. Success!

So, how did that all work? 

I'll give you the business process of it and provide a high level overview of my Bayesian system. After that, I'll get down into the guts of the implementation.

I wrote this entire system in Haskell. It was something less than a day's effort.

Business Process/Big Picture

Each week, our boss, Dharni, provided the USA Today's predictions for the teams as a spreadsheet, which I saved in CSV format ('comma-separated values'). The weekly input spreadsheet (as CSV) looked like this:

,Wed Sep 5,
NYG GIANTS,3½ ,Dallas
,Sun Sep 9,
CHICAGO,10 ,Indianapolis
Philadelphia,7½ ,CLEVELAND
NYJ JETS,2½ ,Buffalo
NEW ORLEANS,7 ,Washington
New England,5 ,TENNESSEE
MINNESOTA,3½ ,Jacksonville
HOUSTON,11 ,Miami
DETROIT,6½ ,St. Louis Rams
Atlanta,2½ ,KANSAS CITY
GREEN BAY,5 ,San Francisco
Carolina,2½ ,TAMPA BAY
Seattle,2½ ,ARIZONA
DENVER,2½ ,Pittsburgh
,Mon Sep 10,,,,,,,
BALTIMORE,6 ,Cincinnati,TS,35-20,,,,
OAKLAND,2½ ,San Diego,,,,,,

where the number (e.g.: '') was the predicted point-spread, the team on the left was  the team predicted to win by the point-spread, and the capitalized team was the home team. The 'TS' indicator was if there were ties of people in the pool who guess the same number of picks correctly, then the TS was a tie-breaker: you had to guess the 'Total Score' for that game and the person who got the closest total score won the tie-breaker. I didn't worry my head over this particular aspect, so I just had my system predict the winner for that game and threw out a random total score.

I fed these data to my system which took each data point as a class-value for the Bayesian classifier, and my system spit out ('spit out' is a technical term, just like: 'run off') its own predictions which I turned right around back to Dharni: "I've highlighted the teams my system picked to win, Dharni."

The output result from the system was something like this:


Just as before, right? Well, sort of. In this case my system output teams as (discriminated type) values, the spreads were floats, and the Home/Away indicator told me where the left team played. 'LEFT' or 'RIGHT' indicated which team the system predicted would win that game.

My coworkers were enchanted that I wrote myself a little football-picker. They'd exclaim in mock-dismay: "Something only geophf would come up with!"

Then my system won, and won, and won again. Not consistently, nor convincingly, but, to my satisfaction (that it even ever won at all): profitably. Pretty good. Pretty darn good enough!

Then, with what really happened that week (you know: that pesky ITRW result-set messing with my conceptual-frameworks!), I'd feed the results back into my system (the feedback look), training it with what were good and bad guesses on its part. Week after week my system accumulated the historical data and results into a data-set I named basis/training-set.csv. 

As the season progressed, my historical training data set grew, and I did, indeed, win pools more often later in the season rather than earlier. Which was nice: my predictive system was predicable.

The Guts/Little Picture

It's actually embarrassing how small my baby classifier is, and really, it'd be a lot smaller if Haskell natively read football spread predictions: it'd just be the Bayesian classifier.

But since Haskell doesn't natively read CSV files nor football spreads, I wrote those modules, too.

Here's the system.

The football picker is in the Football (parent) module and contains the module named Go:

> module Football.Go where

> import Football.Data.Week
> import Football.Analytics.Bayes
> import Football.Trainer

> go picks training = snarf training >>= \basis ->
>                     slurp picks >>= readWrite pick Thursday basis writeln

> choose picks = putStrLn "<table border='1'>" >>
>              slurp picks >>= readWrite justShow Thursday () writeRow >>
>              putStrLn "</table>"

> writeRow (match, _) = writeWeek HTML match >> putStrLn ""

What you do is load the Go module and say "go" to Haskell with the teams to be picked for this week with their data (the 'picks') and the training data set, both in CSV-format.

The system then 'snarfs' the training data, training the Bayesian classifier, yielding a training 'basis,' then 'slurps' in the teams for this week to give the picks.

As to the names of the functions: don't look at me. I don't know. The dude who wrote system had an oddball sense of humor.

And the 'choose' and 'writeRow' functions. Eh. I had made my system both CSV and HTML (output) friendly. I had this grandiose plan to have everything printed out all prettily in HTML and have an HTML input interface, setting this up as a web service with micro-transactions and everything so I could eventually sell the system off to Microsoft for 1.21 gigabucks, but in practice I just 'go'ed with the console-output results and highlighted the winning teams on the spreadsheet Dharni provided, and never really used the HTML functionality beyond some initial prototyping. Idle hands, lesson learned, simplicity, ... you know: all that.

So, the 'Go' module is just an entry-point to the 'readWrite' function which reads the input UTF-8 data, converts it to Haskell types and then writes the results out as CSV. Let's look at that function, then, and keep digging down until we're all dig-dugged out.


The 'readWrite' function is in module Football.Trainer, its definition is as follows:

> readWrite _ _ _ _ [] = return ()
> readWrite fn day basis w (row:rows) = if   length (csv row) == 1
>                                       then readWrite fn (succ day) basis w rows 
>                                       else w (fn day row basis) >>
>                                            readWrite fn day basis w rows

... the whole if-clause is to handle whether this row is a football match row or a row declaring the day of the week.

If we query the type of readWrite from the Haskell system we get this 'interesting' type-value:

  :: (Monad m, Enum a) =>
     (a -> String -> t -> t1)
     -> a
     -> t
     -> (t1 -> m a1)
     -> [String]
     -> m ()

which is my lovely way of making this generalized function that reads an enumerable type and puts it out to some monadic domain.


But, since I'm not working with HTML any more, readWrite is always called with the higher-order function arguments of

readWrite pick day basis writeln

where the functions pick and writeln are declared as:

> pick :: DayOfWeek -> String -> ([Match], [Match]) -> (Match, Call)


> writeln :: (Show t) => (Match, t) -> IO ()
> writeln (match, winner) = writeWeek CS match >> putStrLn (show winner)

which reduces readWrite, once the higher-order function arguments are applied, to a more compact type of:

> justDoIt :: DayOfWeek -> ([Match], [Match]) -> [String] -> IO ()
> justDoIt day basis rows = readWrite pick day basis writeln rows

which, of course, means we need to explain the function pick now, with its calls, but let's first clarify the types used, now that the function readWrite has been reduced to its used form.


DayOfWeek is an example right out of the book for a disjoint type:

> data DayOfWeek = Thursday | Sunday | Monday
>      deriving (Show, Read, Eq, Ord, Enum)

(apparently there are only three days of the week ... in football, anyway, and that's all that matters, right?)

And the Match-type is the representation of a row of data (a football match):

> data Match = Match DayOfWeek Team Field Float Team deriving Show

Team is just simply the teams as disjoint enumerated values (e.g.: GREENBAY, STLOUISRAMS, etc.), The Field-type is Home or Away, and there you go!

There is a bit of a hex reading in the floating point value from, e.g.  '' and for that particular hack I created a separate type in a separate module:

> module Football.Data.Spread where


So, the spread is a number but represented in unicode ... do we wish to
capture the original datum as it was represented in the picks sheet for any
reason? Mucking with data always leads to sorrow, but ... well, this module
provides a Num interface to Spread, so you can swing either way with it.

> import Data.Char

> data Spread = Spread String
>      deriving Show

> toNum :: Spread -> Float
> toNum (Spread x) = toNum' x 0.0

> toNum' [] x = x
> toNum' (c:rest) x | c == ' ' = x
>                   | c > '9' = x + 0.5
>                   | otherwise = toNum' rest (10 * x
>                                              + fromIntegral (ord c - ord '0'))

> instance Read Spread where
>    readsPrec _ datum = [(Spread datum, "")]

> instance Eq Spread where
>    x == y = toNum x == toNum y

You'll note that the Spread-type isn't actually a Num instance, as the spread is a typed class value, I don't care, in my Bayesian system, whether the spread is lesser or greater; I just care if it's different from another spread.


Okay, back to the program!

So readWrite slurps in training data and the matches for this week and sends it to the 'pick' function which we've declared above, and we now define as:

> pick day row basis = let (match, _) = mkWeek day row
>                          winner = classify match basis
>                      in (match, winner)

The mkWeek function just slurps in the row as a string and converts it to a Match-type and with that reified typed value we ask the Bayesian system to classify this n-tuple as a Left or Right (predicted-to-)win.

So, let's look at the classify-function.

> classify :: Match -> ([Match], [Match]) -> Call
> classify query basis@(l, r) = let ll  = fromIntegral $ length l
>                             lr  = fromIntegral $ length r
>                             tot = ll + lr
>                         in  if (bayes query l + log (ll / tot)) >
>                                (bayes query r + log (lr / tot))
>                             then LEFT
>                             else RIGHT

The classify-function is your classic Bayes algorithm: classify the n-tuple as the type with the greatest probabilistic outcome. In our case we have two possible results, LEFT and RIGHT, so if the probablistic (logarithmic) sum for LEFT is greater, then it's LEFT, otherwise it's RIGHT.

Very 'Either'escque ... without the packaged data. So, actually, not very 'Either'escque at all but more 'Boolean'escque. Kinda. Sorta.

Our bayes-function is just summing the logarithms of the probabilities:

> bayes :: Match -> [Match] -> Float
> bayes (Match dow fav field spred dis) side = 
>       let pdow =   logp (\(Match d _ _ _ _) -> d == dow) side
>           pfav =   logp (\(Match _ f _ _ _) -> f == fav) side
>           pfield = logp (\(Match _ _ f _ _) -> f == field) side
>           psprd =  logp (\(Match _ _ _ s _) -> s == spred) side
>           pdis   = logp (\(Match _ _ _ _ d) -> d == dis) side
>       in sum [pdow, pfav, pfield, psprd, pdis]

We need a fix here for 0% match ... the log return -infinite, skewing the results, so let's just return 0 for 0, okay?[1]

> logp :: (Match -> Bool) -> [Match] -> Float
> logp fn matches = let num = fromIntegral (length $ filter fn matches)
>                       dem = fromIntegral (length matches)
>                   in  if (num == 0.0 || dem == 0.0)
>                       then 0.0
>                       else log (num / dem) 

That's basically everything, just some window-dressing functions of reading in the training set, which requires us to read in CSV files, so I present these functions here for completeness.

To read a CSV file, we simply rewrite the words-function from the Prelude to allow separators other than spaces:

> module CSV where

> import Char

-- I modified Prelude.words to accept a set of alternative delimiters to space

> wordsBy :: String -> String -> [String]
> wordsBy delims line
>        = let isDelim = flip elem delims
>          in  case dropWhile isDelim line of
>                           "" -> []
>                           s' -> w : wordsBy delims s''
>                                 where (w, s'') =
>                                        break isDelim s'

> csv = wordsBy ","

The CSV module is useful not only here for picking football teams, but also for anywhere you need to parse in CSV data as linearized Haskell-types.

and then readResults (the results of the training data, that is) is as follows:

> readResults :: ([Match], [Match]) -> [String] -> ([Match], [Match])
> readResults map [] = map
> readResults map (row:rows) = let [dow, fav, field, spred, dis, call] = csv row
>                                  day = read dow
>                                  match = Match day (read fav) (read field)
>                                                (read spred) (read dis)
>                                  winner = read call
>                                  res = Result match winner
>                              in  readResults (addmap winner map match) rows

where addmap is the obvious add-to-left-or-right-branch-of-the-training-data-function:

> addmap :: Call -> ([a], [a]) -> a -> ([a], [a])
> addmap LEFT (l, r) m = (m:l, r)
> addmap RIGHT (l, r) m = (l, m:r)

The 'a'-type here is invariably a Match-(Row)-type that I use as my training-set basis, but the addmap-function works with any set of paired-lists.

The readResults-function is used by the snarf-function:

> snarf :: FilePath -> IO ([Match], [Match])
> snarf file = readFile file >>=
>              return . lines >>=
>              return . readResults ([], [])

So that just leaves the slurp-function for reading in this week's matches:

> slurp :: FilePath -> IO [String]
> slurp file = readFile file >>= return . tail . lines

(we skip, or 'tail', by the first line, because that's the day of the first football games, which we invariably call Thursday)

Conclusion, or: please send your 1.21 gigabuck checks to geophf

And there you have it, folks: a little Bayesian system that picks (we hope) winning football teams from a set of accumulating historical data. The 'we hope'-caveat is the usual standard disclaimer when using Bayesian systems: as one of my colleagues pointed out, using such a system for classification is like driving your car with the only guidance being the view in the rear-mirror, but, for me, such guidance paid off. I walked away from this experience building yet another (naïve) Bayesian classifier, and my little pet project helped me walk away from this pool with a few dollars more than I invested into it. Yay!

('Yay!' is a technical term, c.f. Despicable Me 2)



[1] There was a comment on my original blog entry about normalizing the absent probability instead of what I do here, which is to zero it out. My system worked okay without linearization, so it would be an area of further research to see how the results are improved from linearization.