Thursday, November 15, 2012

CoReader x CoState == Silver Bullet


Thesis:

The title reads: 'CoReader x CoState == Silver Bullet' ... of course, it isn't a silver bullet, but it surely reduces (exponential) complexity AND provides just-in-time realized constants, the kind that, generated eagerly, would (and do) strain and break production systems working on very large data sets.

So, it isn't a silver bullet, but it sure-as-heavens-to-betsy puts the 'oh, I'll just generate the (resource intensive) constant' approach and then 'oh, I just wrap each layer in another layer of if-then-else (kill me with code complexity)' approach up and down walls and mops them off the floor, please, and thank you.

Antithesis:

Okay, so the CoReader (co)monad is supposed to give you a constant, and it does ...

CoReader readValue context = (context, readValue) -> context

with the askC function:

askC :: (CoReader r c) -> r

and so you have your constant.
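To make that concrete, here is a minimal sketch of one way to model the CoReader as a pair of a read-only value and a context. The constructor layout and the helper names extractC and extendC are assumptions made for illustration; only askC and its type come from the text above.

```haskell
-- Sketch only: a CoReader modeled as a read-only value paired with a context.
-- The field order and the names extractC/extendC are assumptions.
data CoReader r c = CoReader r c

-- Ask for the constant carried alongside the context.
askC :: CoReader r c -> r
askC (CoReader r _) = r

-- Comonadic 'extract': get the context back out.
extractC :: CoReader r c -> c
extractC (CoReader _ c) = c

-- Comonadic 'extend': run a context-dependent computation.
-- The constant is carried through untouched.
extendC :: (CoReader r c -> d) -> CoReader r c -> CoReader r d
extendC f w@(CoReader r _) = CoReader r (f w)
```

Whatever you extend over the context, askC keeps answering with the same value, which is the 'constant' part of the bargain.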

The thing is, the CoReader depends on the context already existing, which can be troublesome for data that are realized from an environment established at some point after the program begins to run. The problem is this: if you want a realized constant, you can't have the CoReader be a final value in some program component (the context isn't there yet). And if you wait to establish the CoReader until after the data set is realized, then you have all this checking code and establishment code that can be rerun at any time (and, most likely, every time) you ask for that constant, which means that you don't have a constant at all, but a new value every time you ask over a data stream.

Sub-optimal.

So, how does one get a realized constant from a data stream that takes time to establish and guarantee that, yes, that constant I asked for upstream or in another branch is the same constant that I'm getting now from my CoReader, because I'm not re-creating it each time I ask for a constant from the data stream I get after the program is started and initialized?

... e.g. (or, 'i.e.' in my particular case): a database connection takes time, and is usually not available with a system that needs a CoReader to be constant from the get-go.

My solution, after much head-banging with trying to make the CoReader, itself, be something that it isn't, is to have the CoReader not read the stream directly, because it may not be there or be ready, but to have the CoReader read a different 'stream' entirely: the State monad ...

Or actually, since I don't want to push all my computations thereafter up into the monadic domain, to use the CoState comonad, a.k.a. the Context comonad:

Context context value = Context (context -> value) context

with the very helpful putC function:

putC :: Context context value -> newValue -> Context context newValue

You're probably scratching your head, saying, "I don't see any putC in the literature..."

And, sure, you are correct, there's 'experiment' and 'modify' but then you see where I'm going ... a little this plus that gets you to putC and that's what we want from our CoState comonad: I'm in a context, gimme the thing I stowed away in there for later use without me having to thread all the baggage of a monad throughout my computation.
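One possible reading of that 'little this plus that', as a sketch: the Context comonad is an accessor function plus the position it is focused on, and putC swaps the accessor for one that always returns the stowed value. The names extractC and modifyC, and this particular definition of putC, are assumptions for illustration rather than anything from a published library.

```haskell
-- Sketch only: the CoState (Context) comonad as an accessor plus a position.
data Context s a = Context (s -> a) s

-- 'extract': read the value at the current position.
extractC :: Context s a -> a
extractC (Context f s) = f s

-- A 'modify'-like helper (hypothetical): move the position, keep the accessor.
modifyC :: (s -> s) -> Context s a -> Context s a
modifyC g (Context f s) = Context f (g s)

-- One reading of putC: stow a value so that every later read returns it,
-- whatever the accessor used to compute.
putC :: Context s a -> b -> Context s b
putC (Context _ s) v = Context (const v) s
```

After putC, extractC hands back the stowed value without touching the underlying stream again, and nothing had to be lifted into a monad to get there.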

You wrap that in a CoReader and then you have the constifying functionality you need to say, hey, once the state is established, then give me that constant value from then on.

So, given a data stream, context, that comes to us somewhere down the road, we have a CoState comonad, which I'll call box (for a mutable boxed type indexing into that stream, or 'context'), paired with the CoReader, reader, to give you a constant value in the pure-functional (not monadic) domain.

That's half of it.

The other half is this: now that you've got CoState paired with CoReader, you, functionally, know whether this is the first read into the context, or whether this is the nth read where you're just returning the read/realized value.

So, what if you want to do something the first time that you don't want to do anytime thereafter?

AND, what if you don't wish to encumber yourself with the old

if this-is-the-first-time-i'm-reading-and-initializing-this-stupid-constant
then do-something-special-here
else just-return-the-stupid-constant

logic that, well, encumbers too much programming logic everywhere?

The answer is surprisingly easy, once you have the CoState x CoReader pairing ...

With a little bit of monadic magic to provide the glue.

Because, the first time you read the constant, or, that is to say, just before the first time the constant value is ground to what (it is just about to) be(come), its value is ... wait for it ...

Nothing.

As in Nothing :: Maybe.

That is, if you lift that (non)value into the monadic domain:

uninitializedValue = Nothing
value x = Just x

And, boom! You use mplus (or (+)) to do something magical the first time:

val (+) magical-initialization-code

equals your solution, dispensing with stupid boolean flags called 'inited' and stupid if-then-else tests everywhere that are now totally unnecessary ... just redefine askC:

askC reader = realAskC reader (+) magical-initialization-code

and then you just ask for the constant, the reader gives it to you, in the pure-functional domain, and automagically does initialization code only on the first read of that constant.
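The first-read trick can be sketched in a few lines, under some loud assumptions: the box is the Context type sketched as a function-plus-position pair, the initializer is a pure Maybe value (a real one, such as opening a database connection, would need more machinery), and askWithInit and emptyBox are hypothetical names. The (+) in the text is spelled mplus here.

```haskell
import Control.Monad (mplus)

-- Sketch only: the Context comonad as an accessor plus a position.
data Context s a = Context (s -> a) s

extractC :: Context s a -> a
extractC (Context f s) = f s

-- Stow a value so that every later read returns it (an assumed definition).
putC :: Context s a -> b -> Context s b
putC (Context _ s) v = Context (const v) s

-- The box starts out uninitialized: reading it yields Nothing.
emptyBox :: s -> Context s (Maybe a)
emptyBox = Context (const Nothing)

-- Read the box, falling back on the initializer if nothing is stowed yet,
-- and stow whatever came out. mplus on Maybe keeps the left value when it
-- is a Just, so the initializer only matters on the first read.
askWithInit :: Maybe a -> Context s (Maybe a) -> (Maybe a, Context s (Maybe a))
askWithInit initializer box = (value, putC box value)
  where value = extractC box `mplus` initializer
```

Asking an empty box with initializer Just 42 yields Just 42 and a filled box; asking that filled box again with a different initializer still yields Just 42, and thanks to laziness the second initializer is never evaluated. No 'inited' flag, no if-then-else.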

Synthesis:

Real-world example.

"Yeah, geophf," you're saying, "nice whitepaper, but what can I use it for?"

Um, impress hot babes at cocktail parties?

Get a free 'tall' latte at sbux, ... that is, after you pass over a fiver?

(Okay, seriously, 5 bucks for a small coffee? What has the world come to?)

AND:

So, let's say you're working on one of the world's largest databases (every company claims to have at least one of them ...), and you're getting 10,000 records per hour, and one section is generating up to 5 indices, but you don't know which indices you need, and which you don't, as it depends on data down-stream, and since these records are highly hierarchical XML files, with each of the layers of the hierarchy optional, we're talking WAY down stream.

So, one approach, you just getSequence from the database for all the sequences you may need, and then you're good, you insert rows with the prepopulated sequence.

Hm, 10000 records per hour, 24x7 data feed, and int is how big? 2 billion ... only?

(start-rant: Okay, does EACH AND EVERY TEAM have to relearn that you exhaust int much sooner than expected so that your 'vital system' crashes with dev and production teams working round the clock to change int into a much bigger number, when, after all, these are sequences, not (necessarily) numbers? GUIDs are hexa-strings-of-numerals, so why not make your precious sequences strings? You will NEVER do arithmetic on them anyway, so you've set them as an int, why? :end-rant)

So, from a purely optional data input, you are probably generating all these sequences, and bringing your project to a screeching halt much sooner.

How about this.

Wrap your sequence generator in a CoReader comonad, and wrap each optional insert in the Maybe monad.

What does this give you?

In a hierarchy, it gives you that you don't create a sequence you don't use, ever.
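A stripped-down sketch of that claim, with the database left out entirely: the sequence is a plain counter threaded by hand, each layer is a Maybe String, and the names nextSeq, eagerInsert, deferredInsert and consumed are all hypothetical. It is a simplification for illustration, not the CoReader-based code from the system described here.

```haskell
import Data.List (mapAccumL)

-- Hypothetical stand-in for a database sequence: the next value to hand out.
type Counter = Int

-- Draw one value and advance the counter.
nextSeq :: Counter -> (Counter, Int)
nextSeq c = (c + 1, c)

-- Eager style: draw a sequence value for every layer, present or not.
eagerInsert :: Counter -> Maybe String -> (Counter, Maybe (Int, String))
eagerInsert c layer = (c', fmap (\v -> (s, v)) layer)
  where (c', s) = nextSeq c

-- Deferred style: only a layer that is actually present draws a value.
deferredInsert :: Counter -> Maybe String -> (Counter, Maybe (Int, String))
deferredInsert c = maybe (c, Nothing) insert
  where insert v = let (c', s) = nextSeq c in (c', Just (s, v))

-- Number of sequence values consumed by one record's layers.
consumed :: (Counter -> Maybe String -> (Counter, b)) -> [Maybe String] -> Int
consumed step = fst . mapAccumL step 0
```

For a record with five optional layers of which two are present, consumed eagerInsert burns five sequence values while consumed deferredInsert burns two.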

And, five levels down, when you actually do do an insert of a child record whose parent wasn't there in the data feed? It 'automagically' generates the parent record from the dataflow trace, inserts that parent first, and then inserts your child record without throwing an 'integrity violation, parent not found' error, and since it's done in the monadic domain, you wrote 0, zero, zip, nada conditional code at each layer.

> return layer1 (+) deferInsert1 >>= \deferred1 -> insertm (askC reader1) [] >>
> return layer2 (+) deferInsert2 >>= \deferred2 -> insertm (askC reader2) [deferred1] >>

etc, etc, to your heart's content.

Which is a very different thing than:

let seq1 = getSequence
    seq2 = getSequence
    seq3 = getSequence
...
in do val1 <- return layer1
        val2 <- return layer2
        val3 <- return layer3
...
        insertm seq1 val1
        insertm seq2 val2
        insertm seq3 val3
...

This example trivializes things a bit. Looking at it, it's obvious you want to defer the creation of the sequence until you know you have that row to insert into your database. Unfortunately, it is that obvious, and we intend to create things only when needed, but then the 'real world' rears its ugly head, and you're looking at a branching data flow where your sequence may be needed downstream if any of the branches result in an insert, and your tree is not binary, but n-ary, and several layers deep.

What do you do as a programmer?

Well, you just create the sequence, because you know you're going to use it, because your test data says so, but the real stream is never that dogmatic and a whole lot sparser than your 'in-a-perfect-world' test cases anticipated.

And you have hundreds of tables, ... and just one of the tables has 100 million records added to it, each month.

Do you see where 'just create the sequence' can be a key factor in crippling your application soon after it goes live?
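A back-of-the-envelope sketch of the arithmetic, assuming the sequence is a signed 32-bit int that starts near zero and that values are drawn at a steady monthly rate (monthsUntilExhausted is a hypothetical helper, and the 4x figure is the upper end of the waste reported at the end of this post):

```haskell
import Data.Int (Int32)

-- Whole months until a signed 32-bit sequence runs out, assuming it starts
-- near zero and a fixed number of values is drawn every month.
monthsUntilExhausted :: Integer -> Integer
monthsUntilExhausted drawnPerMonth =
  toInteger (maxBound :: Int32) `div` drawnPerMonth
```

At 100 million rows a month with one value per row, monthsUntilExhausted 100000000 gives 21 months. Draw four values for every one you use, and monthsUntilExhausted 400000000 gives 5.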

'Just create the sequence, because for each layer, a layer of complexity is added to this already complex system,' is one approach, and in industry, pretty much the standard approach.

I propose another way; in sum: the way proposed in this paper. Use the CoState x CoReader comonadic pairing to take all the complexity, put it into a paper bag, and then throw that bag into a black hole where you'll never see it again. If your data set is rife with optionality, that is: semideterminism, then wrap that semideterminism in a Maybe monad, and you have a powerful framework that is simple to understand and easy to use.

In a DBMS I'm working on, the SLOC (lines of code) count decreased, AND the number of sequences generated decreased by a factor of 2-4. We were generating two to four times as many sequences as we needed, ... per input record. With the CoState x CoReader x Maybe framework in place, this spurious sequence generation stopped, and the code became more declarative: I was saying what was to be done, much more so than how to go about doing it.