Selecting Data Systems and Tools
Insights from a Data Professional
In this series, I will discuss topics directly related to data. My goal is to help others understand the importance of a specific topic, share any insights I might have, and suggest actions that can be taken.
TL;DR
- Standardize where possible; specialize when necessary.
- Optimize for the challenges of today.
- Consider how to move attributes through your ecosystem.
“Standardize wherever possible; specialize where necessary.” - Jack Spirko
Admittedly, I have often been known to pull inspiration from unusual places. This quote came from an entrepreneur who hosts a podcast (http://www.thesurvivalpodcast.com/) that often covers business principles and permaculture (a design science) principles, two topics I am always interested in. In full disclosure, if memory serves, Jack was speaking about batteries and why someone might want to have as many things in their home running on the same type of battery as possible. However, the efficiencies of applying this to data infrastructure might include:
- Standardized drivers, connection libraries, and syntax for interacting with data. This contributes to faster development times.
- Standardized operationalization of data systems (minimal changes to automated build processes, security measures, monitoring, backups, and disaster recovery plans). This translates to faster provisioning times and fewer delays in development or deployment.
- Updating and patching can be automated, or the time required can be greatly reduced.
- Tooling and maintenance of systems can be implemented and improved more quickly. This includes common practices such as archiving older records, auditing connections, and maintaining indexes (see the sketch after this list).
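To make that last point concrete, here is a minimal sketch of what routine maintenance can look like once everything runs on the same platform (PostgreSQL syntax; the orders and orders_archive tables and the index name are hypothetical):

```sql
-- Archive records older than seven years into a like-shaped history table,
-- deleting and inserting in a single statement.
WITH moved AS (
    DELETE FROM orders
    WHERE order_date < now() - INTERVAL '7 years'
    RETURNING *
)
INSERT INTO orders_archive SELECT * FROM moved;

-- Rebuild a bloated index (PostgreSQL 12+ also supports REINDEX ... CONCURRENTLY).
REINDEX INDEX orders_order_date_idx;

-- Audit active connections to spot stale or unexpected sessions.
SELECT usename, client_addr, state, query_start
FROM pg_stat_activity
WHERE state <> 'idle';
```

Because the syntax is the same everywhere, scripts like these can be rolled out across every instance instead of being rewritten per vendor.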
“You never optimize for something that will happen in the future, you optimize for now.” - Andreas M. Antonopoulos
These days, it seems organizations from all sectors are struggling to maximize the value of their dispersed data systems (generally dispersed due to vendor requirements). Master Data Management (MDM) goes by many names, such as Unified Data Vision or Master Record, but it remains largely misunderstood.
MDM is a technical solution to a business problem. The business needs to determine the process (rules) to ensure data sanity and quality, and the technical side needs to implement the procedures. I like to think of MDM as retroactively applying domain-driven design: identifying the owner of an entity and ensuring that owner becomes the focal point for handling any changes to that data. Methods vary, but each data subject (domain) needs a single owner that acts as the focal point, or source of truth, to settle any disputes.
A quick example of this might be a business decision to add product descriptions to billing for clarity. Finance may be the authority on billing information such as prices, while marketing may be the authority on product descriptions. Each may use different systems to meet their needs. However, if product descriptions are to be added to billing, they need to match what marketing has.
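As a minimal sketch of what that ownership can look like in SQL (the finance and marketing schemas and table names here are hypothetical):

```sql
-- Marketing owns product descriptions: the source of truth for that domain.
CREATE SCHEMA marketing;
CREATE TABLE marketing.products (
    product_id  bigint PRIMARY KEY,
    description text NOT NULL
);

-- Finance owns pricing. It references the product rather than copying the
-- description, so there is nothing to drift out of sync.
CREATE SCHEMA finance;
CREATE TABLE finance.billing_lines (
    line_id    bigserial PRIMARY KEY,
    product_id bigint NOT NULL REFERENCES marketing.products (product_id),
    price      numeric(12,2) NOT NULL
);

-- Billing output joins to the owning domain at read time, so descriptions
-- always match what marketing has published.
CREATE VIEW finance.billing_output AS
SELECT b.line_id, b.price, p.description
FROM finance.billing_lines b
JOIN marketing.products p USING (product_id);
```

The dispute-settling rule falls out of the structure: if a description looks wrong on a bill, the fix happens in marketing's table, never in finance's.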
Standardizing on a set of systems sounds good on paper, but there are many data systems out there today. To list a few:
- Relational Database Management Systems (aka RDBMS, e.g. DB2, MariaDB/MySQL, Microsoft SQL Server, Oracle, PostgreSQL, or SQLite)
- Not Only SQL (aka NoSQL, e.g. Couchbase, Datomic, DynamoDB, MongoDB, or Redis) and distributed SQL systems (e.g. CockroachDB or Spanner)
- Graph Databases (e.g. Neo4j or Neptune)
- Network Databases (e.g. IDMS)
- Data Warehouse Systems (e.g. Redshift or BigQuery)
- Time Series Database Systems (e.g. InfluxDB or TimescaleDB)
Each of these solutions comes with its own benefits, drawbacks, and costs. Choosing the wrong solution or implementing the wrong pattern can be costly, as it will require re-engineering everything built on top of it and finding a way to move and validate the data.
Personally, I favor starting with a good general-purpose platform; not requiring developers and staff to learn an entirely new skill set can save a lot of time and money. Choosing a solution dedicated to conforming well with the ISO/IEC 9075 Structured Query Language (SQL) standard is a sound idea, and if that system is highly customizable and extensible, it becomes even more attractive. Arguably, IBM offers the best paid solutions I have seen demonstrated to date for MDM and managing dispersed data systems in large organizations. Smaller companies may need to consider other options to become the focal point for their data and analytics.
“PostgreSQL does that too...” - me
PostgreSQL is something like a high-quality Swiss Army knife of data systems at this point. In addition to being a solid relational database management system with some of the best documentation around, it offers:
- JSON / JSONB (a binary, indexable representation) handling with functions, operators, and ways to index the data efficiently (see the sketch after this list)
- Key-value storage via the hstore extension (for flexible source data ingestion)
- Spatial and geographic objects via the PostGIS extension
- Time-series ("as of") functionality via extensions such as TimescaleDB
- Graph database functionality via extensions
- Publisher/subscriber features (LISTEN/NOTIFY and logical replication)
- The ability to read data from foreign sources via foreign data wrappers
- Considerable horizontal scaling for real-time analytics (e.g. via the Citus extension)
- Close to feature parity with other major DBMSs (DB2, Informix, Microsoft SQL Server, Oracle, Sybase, etc.)
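As one illustration of the first two items, here is a minimal sketch of JSONB with an efficient index plus hstore key-value storage (the events and staging_attrs tables are hypothetical):

```sql
-- JSONB payloads with a GIN index that supports containment queries.
CREATE TABLE events (
    id      bigserial PRIMARY KEY,
    payload jsonb NOT NULL
);
CREATE INDEX events_payload_idx ON events USING GIN (payload);

-- The @> containment operator can use the GIN index.
SELECT id FROM events WHERE payload @> '{"type": "signup"}';

-- Key-value storage via hstore, handy for loosely structured source data.
CREATE EXTENSION IF NOT EXISTS hstore;
CREATE TABLE staging_attrs (
    id    bigserial PRIMARY KEY,
    attrs hstore
);
SELECT attrs -> 'color' AS color FROM staging_attrs;
```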
PostgreSQL is readily available in the cloud or on premises. Redshift is derived from PostgreSQL, and AWS is rumored to be using it behind their new MongoDB-compatible DocumentDB as well.
Top this off with the facts that PostgreSQL is open source, can be downloaded for free, and currently ranks in the top five database management systems in popularity, and you quickly see that growing or finding talent is not likely to be an issue. If you work with software engineers, they can easily set up sandbox instances on their local systems for development before touching a shared environment.
"Apache Kafka is more disruptive than merely being faster ETL." - Gwen Shapira (The Future of ETL Isn’t What It Used To Be)
When publish/subscribe (pub/sub) is paired with something like Apache Kafka, you get a very powerful solution for standardizing, moving, and operationalizing your data within the organization. Using these tools, you can easily add metadata defining the source, action type, dates, etc., to streamline other processes or add clarity. Instead of batch Extract, Transform, and Load (ETL) processes, everything is handled more or less in real time by streaming. By design, these solutions are more scalable and performant than batch processing could hope to be.
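As a minimal sketch of the PostgreSQL side of this, built on the logical replication features mentioned above (PostgreSQL 10+; the publication, subscription, and connection values are placeholders, and finance.billing_lines is the hypothetical table from the earlier sketch):

```sql
-- On the source database: publish changes to selected tables.
-- Requires wal_level = 'logical' in postgresql.conf.
CREATE PUBLICATION billing_pub FOR TABLE finance.billing_lines;

-- On a downstream PostgreSQL instance: subscribe to that change stream.
CREATE SUBSCRIPTION billing_sub
    CONNECTION 'host=source-db dbname=app user=replicator'
    PUBLICATION billing_pub;
```

The same logical decoding stream can also feed Kafka through a change-data-capture connector such as Debezium, which is where the streaming ETL story above comes together.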
Feedback:
Questions, critiques and feedback are always welcome. My goal is to provide value to those who read these articles.
What would you like me to cover in my next article?
1) A deep dive or evaluation of a specific technology (suggestions welcome).
2) Concepts of Database (or Systems) Administration.
3) Architecture / Design Methodology.
4) Interview with another professional.
5) Other (please specify).