Monday, June 29, 2015

Tabular and Visual Representations of Data using Neo4J

Corporate and Employee Relationships
Both Graphical and Tabular Results

There are many ways to view data, and people have different needs for representing it: either visualization (a graph view of nodes and edges) or tabulation and sorting (your standard spreadsheet view).

So, can Neo4J cater to both these needs?

Yes, it can.

Scenario 1: Relationships of owners of multiple companies

Let's say I'm doing some data exploration, and I wish to know who has an interest or ownership in multiple companies. Why? Well, let's say I'm interested in the Peter-Paul problem: I want to know whether Joe, who owns company X, is paying company Y under some artificial scheme to inflate or deflate the numbers of either business, and thereby profit illegally.

Piece of cake. Neo4J, please show me the owners, sorted by the number of companies owned:

MATCH (o:OWNER)--(p:PERSON)-[r:OWNS]->(c:CORP)
RETURN p.ssn AS Owner, collect(c.name) as Companies, count(r) as Count 
ORDER BY Count DESC


Diagram 1: Owners by Company Ownership

Boom! Granted, this isn't a very exciting data set, as few owners in it own multiple companies, but there you go.
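
For readers who think in spreadsheets first, the aggregation that tabular query performs can be sketched outside the database. This is only an illustration in Python, not how Neo4J evaluates Cypher; the ownership pairs and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical (owner SSN, company name) pairs standing in for the
# (p:PERSON)-[:OWNS]->(c:CORP) relationships; not the post's real data.
OWNS = [
    (101, "AAA"),
    (102, "BBB"),
    (101, "CCC"),
    (103, "AAA"),
]

def owners_by_company_count(owns):
    """Group companies by owner, then sort owners by how many they own."""
    companies = defaultdict(list)
    for owner, company in owns:
        companies[owner].append(company)  # plays the role of collect(c.name)
    rows = [(owner, names, len(names)) for owner, names in companies.items()]
    # Plays the role of ORDER BY Count DESC; ties keep first-seen order.
    return sorted(rows, key=lambda row: row[2], reverse=True)

for owner, names, count in owners_by_company_count(OWNS):
    print(owner, names, count)
```

The owner holding two companies sorts to the top, which is all the table above is doing: one row per owner, with a collected list and a count.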

What does it look like as a graph, however?

MATCH (o:OWNER)--(p:PERSON)-[r:OWNS]->(c:CORP)-[:EMPLOYS]->(p1) 
WHERE p.ssn in [2879,815,239,5879] 
RETURN o,p,c,p1


Diagram 2: Some companies with multiple owners

To me, this is a richer result, because it now shows that owners of more than one company sometimes own shares in companies that have multiple owners. This may yield interesting results when investigating associates who own companies related to you. This was something I didn't see in the tabular result.

This is not a weakness of Neo4J; it was a weakness on my part in writing the tabular query. I wasn't looking for this result in my query, so the table doesn't show it.

Tellingly, the graph does.

Scenario 2: Contract-relationships of companies 

Let's explore a different path. I wish to know, by company, the contractual-relationships between companies, sorted by companies with the most contractual-relationships on down. How do I do that in Neo4J?

MATCH (c:CORP)-[cc:CONTRACTS]->(c1:CORP) 
RETURN c.name as Contractor, collect(c1.name) as Contractees, count(cc) as Count 
ORDER BY Count DESC


Diagram 3: Contractual-Relationships between companies

This is somewhat more fruitful, it seems. Let's, then, put this up into the graph-view, looking at the top contractor:

MATCH (p:PERSON)--(c:CORP)-[:CONTRACTS*1..2]->(c1:CORP)--(p1:PERSON) 
WHERE c.name in ['YFB'] 
RETURN p,c,c1,p1


Diagram 4: Contractual-Relationships of YFB

Looking at YFB, we can see contractual-relationships 'blossom out' from it, as it were, and this is just the immediate contracts plus one step beyond them! If we go out even one step more in the contracts, the screen fills with employees, and then, again, you have the forest-trees problem, where too much data hides the useful results.

Let's prune these trees, then. Do circular relations appear?

MATCH (c:CORP)-[:CONTRACTS*1..5]->(c1:CORP) WHERE c.name in ['YFB'] RETURN c,c1


Diagram 5: Circular Relationship found, but not in YFB! Huh!

Well, would you look at that. This shows the power of the visualization aspect of graph databases. I was examining a hot-spot in corporate trades, YFB, looking for irregularities there. I didn't find any, but as I probed there, a circularity did surface in downstream, unrelated companies: the obvious one being between AZB and MZB, but there's also a circular-relationship that becomes apparent starting with 4ZB, as well. Yes, this particular graph is noisy, but it did materialize an interesting area to explore that may very well have been overlooked with legacy methods of investigation.
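
What the eye picks out of that diagram can also be stated as a small computation: collect every company downstream of the starting point, then keep those that can reach themselves again. Here is a rough Python sketch of that idea. The contract edges are hypothetical (only the names YFB, AZB and MZB are borrowed from the post; H1 and H2 are placeholders), and the function names are mine.

```python
# Hypothetical contract graph: contractor -> list of contractees.
# These edges are made up for illustration; they are not the post's data.
CONTRACTS = {
    "YFB": ["H1", "H2"],
    "H1": ["AZB"],
    "AZB": ["MZB"],
    "MZB": ["AZB"],
    "H2": [],
}

def reachable(graph, start, max_hops):
    """Companies reachable from start in 1..max_hops contract steps."""
    seen, frontier = set(), {start}
    for _ in range(max_hops):
        # Expand one tier, dropping companies already visited.
        frontier = {nxt for node in frontier for nxt in graph.get(node, ())} - seen
        seen |= frontier
    return seen

def companies_on_cycles(graph, start, max_hops):
    """Downstream companies that can reach themselves, i.e. sit on a cycle."""
    downstream = reachable(graph, start, max_hops)
    # A cycle cannot be longer than the number of companies with outgoing edges.
    return {c for c in downstream if c in reachable(graph, c, len(graph))}

print(sorted(companies_on_cycles(CONTRACTS, "YFB", 5)))  # ['AZB', 'MZB']
```

In this toy graph the starting company is clean while two downstream companies contract with each other, the same shape of surprise as in Diagram 5.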

Graph Databases.


BAM.

Saturday, June 20, 2015

Business Interrelations as a Graph

We look at a practical application of Graph Theory to take a complex representation of information and distill it to something 'interesting.' And by 'interesting,' I mean: finding a pattern in the sea of information otherwise obscured (sometimes intentionally) by all the overwhelming availability of data.

We'll be working with a graph database created from biz.csv, so to get this started, load that table into neo4j:

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///[...]/biz.csv" AS csvLine
MERGE (b1:CORP { name: csvLine.contractor })
MERGE (b2:CORP { name: csvLine.contractee })

MERGE (b1)-[:CONTRACTS]->(b2)
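
If you want to sanity-check the CSV before loading it, the effect of those MERGE clauses (one node per distinct company name, one relationship per distinct contractor-contractee pair) can be approximated in a few lines of Python. This is a sketch under assumptions: only the column names contractor and contractee come from the statement above; the sample rows and the function name are hypothetical.

```python
import csv
import io

# Hypothetical stand-in for biz.csv. Only the two column names are taken
# from the LOAD CSV statement; the rows are invented for illustration.
SAMPLE = """contractor,contractee
YPB,H1
YPB,H2
YPB,H1
H1,H3
"""

def load_contracts(handle):
    """Return (companies, contracts) as sets, so duplicate rows collapse."""
    companies, contracts = set(), set()
    for row in csv.DictReader(handle):
        contractor, contractee = row["contractor"], row["contractee"]
        companies.update((contractor, contractee))  # like MERGE on :CORP
        contracts.add((contractor, contractee))     # like MERGE on :CONTRACTS
    return companies, contracts

companies, contracts = load_contracts(io.StringIO(SAMPLE))
print(len(companies), len(contracts))  # 4 3
```

The repeated YPB,H1 row yields a single contract, which is the point of using MERGE rather than CREATE in the load.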

We got to a graph of business interrelations this past week (see, e.g.: http://lpaste.net/5444203916235374592) showing who contracts with whom, and came to a pretty picture like this from the Cypher query:

MATCH (o:OWNER)--(p:PERSON)-[]->(c:CORP) RETURN o,p,c

diagram 1: TMI

That is to say: the sea of information. Yes, we can tell that businesses are conducting commerce, but ... so what? That's all well and good, but let's say that some businesses want to mask how much they are actually making by selling their money to other companies, and then getting it paid back to them. This is not an impossibility, and perhaps it's not all that common, but companies are savvy to watchdogs, too, so it's not (usually) going to be an obvious A -[contracts]-> B -[contracts]-> A relationship.

Not usually. Sometimes you have to drill deeper, but if you drill too deeply, you get the above diagram, which tells you nothing because it shares too much information.

(Which you never want to do ... not on the first date, anyway, right?)

But the above, even though noisy, does show some companies contracting with other companies, and then, in turn, being contracted by other companies.

So, let's pick one of them. How about company 'YPB,' for example? (Company names are changed to protect the modeled shenanigans)

MATCH (c:CORP)-[:CONTRACTS]->(c1:CORP) WHERE c.name='YPB' RETURN c, c1

diagram 2: tier 1 of YPB

So, in this first tier we see YPB contracting with four other companies. Very nice. Very ... innocuous. Let's push this inquiry to the next tier. Is there something interesting happening here?

MATCH (c:CORP)-[:CONTRACTS*1..2]->(c1:CORP) WHERE c.name='YPB' RETURN c, c1

diagram 3: tier 2 of YPB

Nope. Or what is interesting is to see the network of businesses and their relationships (at this point, not interrelationships) begin to extend the reach. You tell your friends, they tell their friends, and soon you have the MCI business-model.

But we're not looking at MCI. We're looking at YPB, which is NOT MCI, I'd like to say for the record.

Okay. Next tier:

MATCH (c:CORP)-[:CONTRACTS*1..3]->(c1:CORP) WHERE c.name='YPB' RETURN c, c1

diagram 4: tier 3 of YPB

Okay, a little more outward growth. Okay. (trans: 'meh') How about the next tier, that is to say: tier 4?

MATCH (c:CORP)-[:CONTRACTS*1..4]->(c1:CORP) WHERE c.name='YPB' RETURN c, c1

diagram 5: tier 4 of YPB

So, we've gone beyond our observation cell, but still we have no loop-back to YPB. Is there none (that is to say: no circular return to YPB)? Let's push it one more time to tier 5 and see if we have a connection.

MATCH (c:CORP)-[:CONTRACTS*1..5]->(c1:CORP) WHERE c.name='YPB' RETURN c, c1


diagram 6: tier 5 with a (nonobvious) cycle of contracts

Bingo. At tier 5, we have a call-back.

But from whom?

Again, we've run into the forest-trees problem: we see that YPB is the source, and that YPB is the destination as well, but what is the chain of companies that closes this loop? We can't see this well in this diagram, as we have so many companies. So let's zoom into the company that feeds money back to YPB and see if that answers our question.

MATCH (c:CORP)-[:CONTRACTS]->(c1:CORP)-[:CONTRACTS]->(c2:CORP)
      -[:CONTRACTS]->(c3:CORP)-[:CONTRACTS]->(c4:CORP)-[:CONTRACTS]->(c5:CORP)
WHERE c.name='YPB' AND c5.name='GUB'
RETURN c, c1, c2, c3, c4, c5

diagram 7: cycle of contracts from YPB

Aha! There we go. By focusing our query, the information leaps right out at us. Behold, we're paying Peter, who pays Paul to pay us back, and it's there, plain as day.
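
The routine we just walked through by hand (widen one tier at a time until the start reappears, then pull out the chain) can be sketched as a bounded search. The Python below is a minimal illustration, not the method used for the diagrams. The graph is a toy one: only the names YPB and GUB are borrowed from the post, H1 to H3 are placeholders, and the hop counts here are not those of the real data.

```python
# Hypothetical contract graph: contractor -> list of contractees.
CONTRACTS = {
    "YPB": ["H1", "H2"],
    "H1": ["H3"],
    "H2": ["GUB"],
    "H3": [],
    "GUB": ["YPB"],
}

def find_cycle(graph, start, max_hops):
    """Return one chain start -> ... -> start of at most max_hops, else None."""
    if max_hops < 1:
        return None

    def extend(path):
        for nxt in graph.get(path[-1], ()):
            if nxt == start:
                return path + [nxt]            # loop closed
            if nxt not in path and len(path) < max_hops:
                found = extend(path + [nxt])   # go one contract deeper
                if found:
                    return found
        return None

    return extend([start])

def loop_back(graph, start, max_tier):
    """Widen the search tier by tier, like *1..2, *1..3, ... above."""
    for tier in range(1, max_tier + 1):
        chain = find_cycle(graph, start, tier)
        if chain:
            return tier, chain
    return None

print(loop_back(CONTRACTS, "YPB", 5))  # (3, ['YPB', 'H2', 'GUB', 'YPB'])
```

Searching tier by tier means the first chain reported is also a shortest one, so the analyst gets the tightest loop rather than whichever one the search stumbles on first.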

Now, lock them up and throw away the key? No. We've just noted a cyclical flow of contracts, but as to the legality of it, that is: whether it is allowed or whether this is fraudulent activity, there are teams of analysts and lawyers who can sink their mandibles into this juicy case.

No, we haven't determined innocence or guilt, but we have done something, and that is: we've revealed an interesting pattern, a cycle of contracts, and we've also identified the parties to these contracts. Bingo. Win.

The problem analysts face today is diagram 1: they have just too much information, and they spend the vast majority of their time weeding out the irrelevant information to winnow down to something that may be interesting. We were presented with the same problem: TMI. But, by using graphs, we were able to see, firstly, that there are some vertices (or 'companies') that have contracts in and contracts out, and, by investigating further, we were able to see a pattern develop that eventually cycled. My entire inquiry lasted minutes of queries and response. Let's be generous and say it took me a half-hour to expose this cycle.
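
That first observation, that some vertices have contracts in as well as contracts out, is a simple filter, and it is a cheap way to shortlist where to start digging. A tiny Python sketch, with hypothetical edges and a function name of my own choosing:

```python
# Hypothetical (contractor, contractee) pairs; apart from YPB and GUB the
# names are placeholders, and none of these edges come from the post's data.
CONTRACTS = [("YPB", "H1"), ("H1", "GUB"), ("GUB", "YPB"), ("H1", "H2")]

def contracts_in_and_out(edges):
    """Companies that appear both as a contractor and as a contractee."""
    contractors = {a for a, _ in edges}
    contractees = {b for _, b in edges}
    return contractors & contractees

print(sorted(contracts_in_and_out(CONTRACTS)))  # ['GUB', 'H1', 'YPB']
```

A company that only pays out, or only gets paid, cannot sit on a cycle, so everything outside this set can be ignored when hunting for loops.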

Data analysts working on these kinds of problems are not so fortunate. Working with analysts, I've observed that:

  1. First of all, they never see the graph: all they see are spreadsheets.
  2. Secondly, it takes days to get to even just the spreadsheets of information.
  3. Third, where do they go from there? How do they see these patterns? The learning curve for them is prohibitive, making training a bear, and niching their work to just a few brilliant experts and shunting out able-bodied analysts who are more than willing to help, but just don't see the patterns in grids of numbers.

With the graph-representation, you can run into the same problems, but:

  1. Training is much easier for those who can work with these visual patterns.
  2. Information can be overloaded, leaving one overwhelmed, but then one can very easily reset to just one vertex and expand from there (simplifying the problem-space). And then, the problem grows in scope when you decide to expand your view, and if you don't like that expanse, it's very easy either to reset or to contract that view.
  3. An analyst can focus on a particular thread or materialize that thread on which to focus, or the analyst can branch from a point or branch to associated sets. If a thread is not yielding interesting results, then they can pursue other, more interesting, areas of the graph, all without losing one's place.

The visual impact of the underlying graph (theory) cannot be over-emphasized. An analyst who says, "Boss, we have a cycle of contracts!" and presents a spreadsheet faces explanation, checking and verification. That same analyst comes into the boss' office with diagram 7, and the cycle practically leaps off the page and explains itself. That, coupled with the ease and speed with which these cycles are explored and materialized visually, makes a compelling case for modeling related data as graphs.

We present, for your consideration: graphs.



Models presented above are derived from various Governmental sources, including the Census Bureau, Department of Labor, Department of Commerce, and the Small Business Administration.

Graphs calculated in Haskell and stored and visualized in Neo4J

Friday, June 5, 2015

May 2015 1HaskellADay problems and solutions

May 2015
  • May 29th, 2015: The President of the United States wants YOU to solve today's #haskell problem http://lpaste.net/5394990454380953600 Super-secret decoder ring sez: http://lpaste.net/4232036193933459456
  • May 28th, 2015: So, you got the for-loop down. Good. Now: how about the 'no'-loop. HUH? http://lpaste.net/419764621470072832 Ceci n'est pas un for-loop http://lpaste.net/4585222543073345536
  • May 27th, 2015: For today's #haskell problem we look at Haskell by Example and bow before the mighty for-loop! http://lpaste.net/188189811754926080 The for-loop solution. I count even. http://lpaste.net/3919145189309939712 *groans
  • May 26th, 2015: TMI! So how to distinguish the stand-outs from the noise? Standard deviations FTW! http://lpaste.net/7326574484482162688 for today's #haskell problem. First rule: don't lose! http://lpaste.net/5153008470756163584
  • May 25th, 2015: We look at monad transformers to help in logging http://lpaste.net/8301304912738779136 for today's #haskell problem. Ooh! What does the log say? (that foxy log!) http://lpaste.net/109687 Ooh! Ouch! Bad investment strategy! Onto the next strategy!
  • May 22nd, 2015: Tomorrow's another day for today's #haskell problem http://lpaste.net/8939124091818868736 #trading Huh! http://lpaste.net/5738095651988701184 we get very similar returns as from previously essayed. Well, now we are confident of a more accurate model.
  • May 21st, 2015: For today's #haskell problem http://lpaste.net/1900634933852897280 we talk of categories (not 'The'), so we need a cat-pic: LOL
    Is 'parfait' a Category? http://lpaste.net/3432870087273480192 … If so, I would eat categories every day! 
  • May 20th, 2015: Let's view a graph as, well, a #neo4j-graph for today's #haskell problem http://lpaste.net/8344081425502306304 Ooh! Pretty Graph(ic)s! http://lpaste.net/621084277097889792 
  • May 19th, 2015: "Lemme put that on my calendar," sez the trader http://lpaste.net/7002659561530195968 for today's #haskell problem So ... next Monday, is that a #trading day? #inquiringmindswanttoknow http://lpaste.net/729791673181143040
  • May 18th, 2015: So, what, precisely, does a bajillion clams look like? We find out by doing today's #haskell problem http://lpaste.net/3228618210228043776 #trading #clams Were those cherry clams? No: apple http://lpaste.net/6731000367502327808
  • May 15th, 2015: We learn that π is a Π-type in today's #haskell problem of co-constants http://lpaste.net/2567253109898215424 (#typehumour) 'I' like how Id is the I-combinator http://lpaste.net/7574047523665346560
  • May 14th, 2015: Today's #haskell problem brings out the statistician in you that you didn't know you had. Hello, Standard Deviations! http://lpaste.net/3520547034257948672 Wow, rational roots? Sure, why not? And: 'σ'? ... so cute! http://lpaste.net/8056787266321776640
  • May 13th, 2015: Today's #haskell problem teaches us that variables ... AREN'T! http://lpaste.net/7856430804354727936 Well, if variables aren't, then they certainly do. Except in Haskell – A solution to this problem is posted at http://lpaste.net/767712970229678080
  • May 12th, 2015: For Today's #haskell problem we look at values and their types, but NOT types and their values, please! http://lpaste.net/7773498464792477696 *glares* taf, tof, tiff, tough solution today http://lpaste.net/7739074628332552192
  • May 11th, 2015: Hello, world! Haskell-style! http://lpaste.net/7162836591558262784 -- A solution to this 'Hello, world!'-problem is posted at http://lotz84.github.io/haskellbyexample/ex/hello-world
  • Octo de Mayo, 2015: Strings not enumerable in Haskell? Balderdash! Today's #haskell problem http://lpaste.net/7204216668719939584 Today we produce not only one, but two solutions to the enumerable-strings problem http://lpaste.net/5041390572904906752 #haskell 
  • Septo de Mayo, 2015: When all your Haskell documentation is gone, whom do you call? The web crawler, SpiderGwen! to the rescue! http://lpaste.net/4022946512970448896 
    We CURL the HTML-parsing solution at http://lpaste.net/6475977888209305600
  • Sexo de Mayonnaise, 2015: @BenVitale Offers us a multi-basal ('basel'? 'basil'?) palindromic number-puzzle for today's #haskell problem http://lpaste.net/858675562900619264
  • Cinco de Mayo, 2015: May the #maths be with you. Always. http://lpaste.net/8062431866961002496 Today's #haskell problem, we learn that 5 is the 6th number. Happy Cinco de Mayo! Pattern-matching, by any other name, would still smell as sweet #lolsweet http://lpaste.net/7419804910079705088 A solution to today's #haskell problem.
  • May 4th, 2015: "There is no trie, my young apprentice!" Today's #haskell problem proves Master Yoda wrong. FINALLY! http://lpaste.net/179758090873208832 So ... I 'trie'd ... *GROAN!* http://lpaste.net/5961897275272724480 A solution to today's 'trie'ing #haskell problem 
  • May 1st, 2015: (Coming up to) Today's #haskell problem is 'slightly' more challenging than yesterday's problem (forall, anyone?) http://lpaste.net/1400137464227561472 Existential Quantification, For(all) The Win! http://lpaste.net/4400237442641166336