Thursday, August 13, 2015

Uploading Data to GrapheneDB


We look at how to upload a set of data (potentially massive) to GrapheneDB. I say '(potentially massive)' because an option, of course, is to enter Cypher, line-by-line in the Neo4J web admin interface, but this becomes onerous when there are larger data sets with complex (or, strike that, even 'simple') relations.

An ETL-as-copy-paste is not a solution for the long term, no matter how you slice it (trans: no matter for how long you have that intern).

So, let's look at a viable long-term solution using a specific example.

Let's do this.

The Data

The data is actually a problem in and of itself, as it is the set of Top 5 securities by group, and it is reported by various outlets, but the reports are (deeply) embedded into ((very) messy) HTML, in the main, or have a nice, little fee attached to them if you want to tap into a direct feed.

As I'm a start-up, I have more time than money (not always true for all start-ups, but that's a good rule of thumb for this one), so, instead of buying a subscription to the top 5s-feed, I built an HTML-scraper in Haskell to download the sets of Top 5s. Scraping HTML is not in the scope of this article, so if you wish to delve into that topic, please review Tagsoup in all its glory.

Okay, prerequisite,

Step 0. scraped data from HTML: done (without explanation. Deal.)

Next, I save the data locally. I suppose I could go into a database instance, such as MySQL, but for now, I have only 50 or so days worth of data, which I'm happily managing in a file and a little map in memory.

Step 1. store data we care about locally: done

Okay, so we have scraped data, automatically stored away for us. What does it all mean? That's when I got the idea of having a way of visualizing and querying these data. Neo4J was a fit, and GrapheneDB, being DaaS (you just have to need to know that 'DaaS' means 'Data as a Service'), makes sense for collaborating as a geographically-dispersed team.

Two Options

So, how do I get the data there? Two options that we explored. One was to load the data into a local neo4j-instance and then snap-restore in the Cloud with that image. That works, the first time, but it seems to me to be rather ponderous, as this is a daily process, and I do not wish to replicate my database locally daily and snap restore to the Cloud. So, I chose the latter option, which was to build a system that takes the local map, translate that into Cypher queries (to save as graph nodes and edges), then translate those Cypher queries into JSON, then create a web client that ships that JSON-of-Cypher-queries over the wire to the targeted web service.

... Neo4J and GrapheneDB are web services that allow REST data queries... (very) helpful, that.

Step 2. Translate the local data to Cypher queries

Okay, this is not going to be a Cypher tutorial. And this is not going to be the Cypher you like. I have my Cypher, you have yours, critique away. Whatevs. Let's get on with it.

The data is of the following structure, as you saw above:

Date -> { ("Mkt_Cap", ([highs], [lows])), ("Price", ([highs], [lows])), ("Volume", [leaders]) }

And we wish to follow this structure in our graph-relational data. Let's look at the top-tier of the data-structure:

You see I've also added a 'Year'-node. I do this so I can anchor around a locus of days if I wish to explore data across a set of days.

So, okay, from there, do I then create grouping nodes of 'leaders' and 'losers' for the categorization of stocks? This gets into the data-modelling question. I chose to label the relations to the stocks as such instead of creating grouping nodes. There're tradeoffs in these modeling decisions, but I'm happy with the result:

The module that converts the internal data structures is named Analytics.Trading.Web.Upload.Cypher. Looking at that module you see it's very MERGE-centric. Why? Here's why:

What you see here is that symbols, such as, well, primarily $AAPL, and others like $T and $INTC find themselves on the Top 5s lists, over and over again. By using MERGE we make sure the symbol is created if this is its first reference, or linked-to if we've seen it before in this data set.

In this domain, MERGE and I are like this: very close friends.

Okay, Map-to-Cypher, it's a very close ... well, mapping, because the relational calculus and domain-to-co-domain-mappings have a high correspondence.

I'm at a 'disadvantage' here: I come to functional programming from a Prolog-background: I think functional data structures relationally, so, usually, mappings of my functional data structures fall very nicely into graphs.

I don't know how to help you with your data structures, especially if you've been doing the Java/SQL object/relation-mapping stuff using JPA ... I mean, other than saying: 'Switch to ... Haskell?' Sorry.

Okay, so we have a set of Cypher queries, grouped in the following structures:

Date -> [groups] where groups are Mkt_Cap, Volume, and Price

Then, for each group for that date

group -> Leader [symbols] -> Losers [symbol]

So we have (with the three groups), four sets of Cypher queries, each of the grouped Cypher query weighing in with thirty MERGE statements each (three MERGE statements for each stock symbol node). Not bad.

How do we convert this list of grouped Cypher queries into JSON that Neo4J understands?

Two things make this very easy.

  1. The JSON-structure that Neo4J accepts is very simple, it is simply a group of "statements" which are individuated into single Cypher-"statement" elements. Very simple JSON! (thank you, Neo4J!)
  2. There is a module in Haskell, Data.Aeson, that facilitates converting from data structures in Haskell into JSON-structure, so the actual code to convert the Cypher queries reduces to three definitions:

With that, we have Cypher queries packaged up in JSON.

Step 3: SHIP IT!
So now that we have our data set, converted to Cypher, and packaged as JSON, we want to send it to GrapheneDB. Before I went right to that database (I could have, but I didn't), I tested my results on a Neo4J instance running on my laptop, ran the rest call and verified the results. BAM! It worked for the one day I uploaded.

After I got that feel-good confirmation in the small, I simply switched the URL from localhost:7474 to the URL GrapheneDB provides in the "Connection" tab, and voilà: we have data here!

(lots of it!)

Step n: Every day
So now that I have the back-data uploaded, I just simply run my scraper->ETL-over-REST->GrapheneDB little system and I have up-to-the-day Top 5s stock securities for my analysis, on the Cloud.


No comments: