Databook

Databook is an open-source project I started that aims to give companies an enriched data portal for understanding how data sources are consumed, created, transformed and moved. Some commercial tools promise to do this, but they usually focus too much on the data itself and not enough on the users and communities that form around it.

The original idea, design and philosophy come from Airbnb. You can read about all that here:

The GitHub repository is available here:

Databook on GitHub

The strength of Databook is that it draws on the log files of your end-user data visualization tools (wherever possible), information schemas, dictionaries, APIs, LDAP, audit logs and any other available sources. This keeps its picture of data consumption, transformation and transfer up to date and, through links to sources such as GitHub, shows how data assets were created, how they were worked on and by whom.

I'm looking for people who are interested in helping out with the project and there are lots of opportunities:

  • The data munging is done in Airflow, so we need hooks, operators and a lot of Python code to extract log files, grab user data from GitHub, understand who's on Slack and where, grab LDAP data for users and groups, query metadata from RDBMSs, query metadata from Hive, analyze the queries used in Tableau dashboards and, in the end, tie it all together so the flow of data becomes clear.
  • There's a web interface built with Python, Flask, Bootstrap and Angular. There are plenty of places where the interaction can improve and the web pages can be made more user-friendly and useful.
  • The back end uses Elasticsearch and a Neo4j graph database. The reason is that this kind of work involves many relationships between nodes of varying types. Modeling the relationships rather than the 'nodes' is an effective way to avoid dealing with the many link tables a typical RDBMS would need.
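
To make the last point concrete, here is a minimal sketch of relationship-first modeling in plain Python. It does not use Neo4j or reflect Databook's actual schema: the node names, the relation types (MEMBER_OF, CREATED, QUERIES and so on) and the helper functions are all hypothetical, chosen only to illustrate the idea. Each relationship is one typed edge, so adding a new kind of link means adding a new relation label rather than a new link table.

```python
from collections import defaultdict

# Hypothetical typed relationships: (source, relation, target).
# None of these names come from Databook's actual schema.
EDGES = [
    ("user:alice", "MEMBER_OF", "group:analytics"),
    ("user:alice", "CREATED", "dashboard:sales"),
    ("dashboard:sales", "QUERIES", "table:orders"),
    ("job:load_orders", "WRITES", "table:orders"),
    ("job:load_orders", "READS", "table:raw_orders"),
    ("user:bob", "VIEWED", "dashboard:sales"),
]


def build_index(edges):
    """Index edges by source node so outgoing relationships can be followed."""
    outgoing = defaultdict(list)
    for source, relation, target in edges:
        outgoing[source].append((relation, target))
    return outgoing


def reachable(outgoing, start):
    """Return every node reachable from `start` by following edges forward."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for _relation, target in outgoing.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def consumers_of(edges, target, relations=("QUERIES", "READS", "VIEWED")):
    """Return the nodes that consume `target` directly via one of `relations`."""
    return sorted(s for s, r, t in edges if t == target and r in relations)


if __name__ == "__main__":
    index = build_index(EDGES)
    print(reachable(index, "user:alice"))
    print(consumers_of(EDGES, "table:orders"))
```

In a graph database the same questions would be expressed as traversals over labeled relationships; the point of the sketch is only that users, groups, dashboards, jobs and tables can all share one edge structure.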

Get involved and offer your help! Connect with the group at:

Google Group

See you at Databook!
