Databook
Databook is an open-source project I started that aims to give companies an enriched data portal for understanding how data sources are consumed, created, transformed and moved. Some commercial tools promise to do this, but they usually focus too much on the data itself and not enough on the users and communities that form around the data.
The original idea, design and philosophy come from Airbnb. You can read about all that here:
The GitHub repository is available here:
The strength of Databook is that it draws on the log files of your end-user data visualization tools (wherever possible), information schemas, dictionaries, APIs, LDAP, audit logs and any other available sources. That keeps its picture of data consumption, transformation and transfer fresh, and links to systems like GitHub show how data sources were created and worked on, and by whom.
I'm looking for people who are interested in helping out with the project, and there are lots of opportunities:
- The data munging is done in Airflow, so we need hooks, operators and a lot of Python code to extract log files, grab user data from GitHub, understand who's on Slack and where, grab LDAP data for users and groups, query metadata from RDBMSs and Hive, analyze the queries used in Tableau dashboards and, in the end, tie it all together so the flow of data becomes clear.
- There's a web interface built with Python, Flask, Bootstrap and Angular. There are plenty of places where the interaction can improve and the pages can be made more user-friendly and useful.
- The back end uses Elasticsearch and a Neo4j graph database. The reason is that this kind of work involves many relationships between nodes of varying types. Modeling the relationships directly, rather than only the 'nodes', is an effective way to avoid the many link tables a typical RDBMS would need.
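
To make the point about relationships a bit more concrete, here is a minimal sketch in plain Python of what modeling relationships as first-class items can look like. It is an in-memory stand-in for illustration only and is not Databook's actual schema or code: the node kinds (User, Table, Dashboard), the relationship names (CREATED, FEEDS, CONSUMES) and the example data are all assumptions made up for this example.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    kind: str  # hypothetical node type, e.g. "User", "Table", "Dashboard"
    name: str


class Graph:
    """Tiny in-memory property graph; a stand-in for a real graph database."""

    def __init__(self):
        # Outgoing edges: source node -> list of (relationship type, target node)
        self._out = defaultdict(list)

    def relate(self, source, rel, target):
        """Record a typed, directed relationship between two nodes."""
        self._out[source].append((rel, target))

    def neighbors(self, node, rel=None):
        """Targets reachable in one hop, optionally filtered by relationship type."""
        return [t for r, t in self._out.get(node, []) if rel is None or r == rel]

    def downstream(self, start, rel):
        """Breadth-first walk that follows a single relationship type."""
        seen, queue, order = {start}, deque([start]), []
        while queue:
            current = queue.popleft()
            for nxt in self.neighbors(current, rel):
                if nxt not in seen:
                    seen.add(nxt)
                    order.append(nxt)
                    queue.append(nxt)
        return order


# Hypothetical example data: one user, two tables and a dashboard.
graph = Graph()
alice = Node("User", "alice")
raw = Node("Table", "raw.events")
clean = Node("Table", "analytics.events_clean")
dash = Node("Dashboard", "Weekly Usage")

graph.relate(alice, "CREATED", clean)
graph.relate(raw, "FEEDS", clean)
graph.relate(clean, "FEEDS", dash)
graph.relate(alice, "CONSUMES", dash)

# Everything downstream of the raw table, following only FEEDS edges.
lineage = [node.name for node in graph.downstream(raw, "FEEDS")]
```

Adding a new kind of node or relationship here needs no new table: users, tables and dashboards all share one edge store, and a lineage question becomes a walk over one relationship type. In a relational model, each of those pairings would typically need its own link table.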
Get involved and offer your help! Connect with the group at:
See you at Databook!