Monday, June 2, 2014

That's totes my Bag!

So, does that mean I like tote-bags?

So, today's question on @1HaskellADay was this:

write a function countOccurences :: [Stirng] -> Map Char Int

(typos faithfully reproduced)

such that

lookup 'l' $ countOccurences "Hello" ~> Just 2
lookup 'q' $ countOccurences "Hello" ~> Nothing

Okay, that can be done easily enough, I suppose, by torquing Map into something that it isn't, so one gets wrapped around the axel of creating a mapping from characters to occurrences.

But why?

First of all, countOccurences maps a String (not a List of Strings) to a Map, and that map is a very specialized kind of map that has existed in the literature for quite a while, and that map is known as the Bag data type, and is also, nowadays, called the MultiSet by people too embarrassed to say the word 'bag' in a sentence, because of their prior drug convictions.

("I got two months for selling a dime bag.")

So they now twist the word 'Set' (a 'collection of unique objects') to mean something that's not a set at all, the 'Multi'Set, which is a 'collection of unique objects, but you can have multiples of these unique objects, so they're not unique at all, so it isn't a set at all, but we need to say the word 'set' because we can't say the word 'bag' because saying the word 'bag' would make us sound plebeian for some reason.'

Yeah, that. 'MultiSet.'

What. Ev. Er.

But I digress.

As always.

So I COULD write the countOccurences as a String -> Map Char Int function, but then: why bother? You can either write tons of algorithmic code that obscures the intent or just simply use the appropriate data type.

I went for the latter.

Now, I wuz gonna do a dependently-typed pair to represent an occurrence...

... notice how countOccurences is so badly misspelled, by the way?

SOMEbody didn't QA-check their problem for the day today, I'm thinking.

... but then I said: 'eh!'

I mean: WHY is lookup 'q' $ countOccurences "Hello" ~> Nothing?

WHY can't it be that count 'q' for a Bag Char representation of "Hello" be 0? 0 is a valid answer and it keeps everything nice and monoidal without having to lift everything unnecessarily into the monadic domain.

So, yeah. Let's do that, instead.

So, here we go, and in Idris, because that's how I'm rolling these days. The advantages of dependent types have been enumerated elsewhere, so we'll just go with that they're better as an assumption and move on, using them, instead of extolling them, in this post.


So, my first attempt at Bag crashed and burned, because I did this:

data Bag : (x : Type) -> Type where
    add : Bag x -> x -> Bag x
    emptyBag : Bag x

and the compiler was fine with that. Hey, I can declare any type I'd like, so long as the types just stay as types, but as soon as I tried to define these things:

emptyList : List x
emptyList = []

emptyBag = Bag emptyList
add (Bag []) x = Bag [(x, 1)]
add (Bag ((x, y) :: rest)) x = Bag ((x, y + 1) :: rest)
add (Bag ((z, y) :: rest)) x = Bag ((z, y) :: (add rest x))

The compiler looked at me and asked: 'geophf, what in tarnation are you-ah tryin' to do?'

And about the only intelligent answer I could muster was: 'Ummmm... idk.'

I had gotten too clever for myself by half, trying to reshape a data type you learn in Comp.Sci. 101 as a purely functional type.

Back to Basics ... (but not BASIC)

So, let's just declare Bag to be what it is and KISS: 'keep it simple, stupid!'

Yes, let's.

data Bag x = Air | Stuffed (x, Nat) (Bag x)

Now, I so totally could've gone with the balanced binary-tree representation instead of the simple and standard linked list, but, you know: 'live and learn!'

With this declaration the emptyBag becomes so trivial as to be unnecessary, and then add is simplicity, itself, too, but add is, either way, so that's not saying much.

add : Eq x => Bag x -> x -> Bag x
add Air x = Stuffed (x, 1) Air
add (Stuffed (z, y) rest) x =
    case x == z of
        True  => Stuffed (x, y + 1) rest
        False => Stuffed (z, y) (add rest x)

Now, you see me relying on the case-statement, here. Unhappily.

I'd like my dependent types to say, 'unify x with x (reflexive) for the isomorphic case, and don't unify x with z for the other case.' But we're not there yet, or my coding isn't on par with being there yet, so I forced total coverage bifurcating the result-set into isomorphic and not with a hard-case statement.

Ick. I hate explicit case-statements! Where is really, really, really smart pattern-matching when I need it?

But with add, constructing a Bag becomes easy, and then counting elements of that bag is easy, too (again, with another case-statement, sigh!):

count : Eq x => x -> Bag x -> Nat
count _ Air = 0
count x (Stuffed (z, y) rest) =
    case x == z of
        True  => y
        False => count x rest

countOccurences (with one-too-few 'r's in the function name) becomes easy, given the Bag data type:

countOccurences : String -> Bag Char
countOccurences str = co' (unpack str) where
    co' [] = Air
    co' (char :: rest) = add (co' rest) char


But look at this:

depth : Bag x -> Nat
depth Air = 0
depth (Stuffed _ rest) = 1 + depth rest

sample : ?bag
sample = countOccurences "The quick, brown fox jumped over the lazy dog."

bag = proof search

When we do a depth sample, we get the not-surprising answer of 29 : Nat

Perhaps this could be made a tad bit more efficient?

Just perhaps.

Well, then, let's do that!

data Bag x = Air | Stuffed (x, Nat) (Bag x) (Bag x)

We make Bag balanced, with the add-function, doing the work of (very simply) branching off new nodes:

add : Ord x => Bag x -> x -> Bag x
add Air x = Stuffed (x, 1) Air Air
add (Stuffed (z, y) less more) x =
    case (compare x z) of
        LT => Stuffed (z, y) (add less x) more
        GT => Stuffed (z, y) less (add more x)
        EQ => Stuffed (z, y + 1) less more

Then all the other functions change ('morph') to work with a tree, not a list and work with Ord elements, not with (simply) Eq ones.

And so, the redefined depth-function gives a very different result:

depth sample ~> 9 : Nat

Not bad! Not bad! The improved data-structure improves efficiency across the board from O(N) to O(log N).

Hm, perhaps I'll have count return a dependently-typed pair, just as the library function filter does on List types, but not tonight.

Good night, Moon!

No comments: