Choosing the right Database
Operational Data reference guide
What is an Operational Data Source?
An Operational Data Source refers to the data sets in either Systems of Record or Systems of Engagement; the nature of the Data Source is always OLTP. It is important to note that OLAP processes are also becoming critical in augmenting Operational System processes.
Operational Data in Management Information Systems
MIS Systems are typically used to run the operations of a business by providing Business Intelligence reports etc. This class of Data Source is always OLAP.
Is OLAP really Operational Data ?
Digital business processes increasingly have to react to the outcome of some analytics. A Next Best Action micro-service, for example, will be developed on an Analytical Data Source and still be used operationally.
This Guide consists of the following technical considerations:
Key Considerations
- Availability and Recovery - Data Sources are typically the only persisted state of an Application System. This should be the first and primary non-functional consideration.
- Integrity - The application ultimately depends on the Database to persist the data and manage the various states of the values.
- Performance - Optimising the Data Source should be considered upfront; it can be the most complex tier to optimise.
- Data Lifecycle - Data Source Retention and Archiving is key to achieving the above-mentioned considerations.
- Data Quality - The Data Source must have mechanisms to remediate data, e.g. duplicates. Schema versioning must also be considered.
- Data Services - Performance must factor in OLAP sources needing the data in near realtime. Metadata must be descriptive of the sources.
- Data Structures - More optimal Data Structures are now available for various types of Data; NoSQL Sources are now available in the forms of Document DBs, Graph DBs, Key-Value Stores, In-Memory Column Stores etc.
Considerations Unpacked
Availability & Recovery
Most, if not all RDBMS Sources would have been engineered to include ACID principles;
What is ACID ?
- Atomicity - In a transaction involving two or more discrete pieces of information, either all of the pieces are committed or none are.
- Consistency - A transaction either creates a new and valid state of data, or, if any failure occurs, returns all data to its state before the transaction was started.
- Isolation - A transaction in process and not yet committed must remain isolated from any other transaction.
- Durability - Committed data is saved by the system such that, even in the event of a failure and system restart, the data is available in its correct state.
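Atomicity in particular is easy to demonstrate. The following is a minimal sketch using Python's built-in sqlite3 module; the "accounts" table, names and amounts are illustrative assumptions, not part of this guide:

```python
# Minimal sketch of atomicity: a transfer involves two discrete updates,
# and either both commit or neither does.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, "
             "balance INTEGER NOT NULL CHECK (balance >= 0))")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    with conn:  # the context manager commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE name = 'bob'")
except sqlite3.IntegrityError:
    pass  # the CHECK constraint fired; the whole transaction was rolled back

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # both rows unchanged: {'alice': 100, 'bob': 50}
```

Because the first update would violate the balance constraint, the transaction rolls back as a unit and neither account changes.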
Whilst all RDBMS vendors would correctly claim to be ACID compliant, remaining ACID compliant in a High Availability Configuration or Disaster Recovery Cluster can become almost impossible. Database technologies such as Apache Cassandra solve the problem using a different model.
Always consider and document the tolerances for HA and DR across the database cluster.
Integrity
ACID to a large degree takes care of the integrity of an RDBMS Source. Always consider to what degree the Application relies on the Database for facilities such as two-phase commit.
Performance
Performance can be the most complex of considerations because of its multi-dimensional nature. The following are the most usual considerations:
- Table Normalisation - Under-normalisation usually hinders performance the most, while over-normalisation often leads to cryptic schemas, multitudes of Views etc.
- Data Size - Sharding has become a popular strategy where horizontal scaling is required, but Sharding can itself become very expensive and difficult to manage; take extra care when considering this strategy.
- Concurrency - Vertical Scaling is often a popular strategy for dealing with higher concurrency. For extremely high-speed read requirements, consider in-memory storage.
- Application Optimisation - Stored Procedures were popular for Application Performance Optimisation; processing at the Data tier became less popular over time as compute-tier processing increased (effectively in-memory). Processing at the Data tier is once again becoming popular with In-Memory Databases. Note that Stored Procedures often led to Batch Systems.
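The in-memory point above can be sketched simply: for extreme read volumes, a hot lookup table can be served from an in-memory copy rather than hitting the primary store on every request. The "countries" table and its rows are illustrative assumptions:

```python
# Hedged sketch: serve read-heavy reference data from an in-memory copy.
import sqlite3

primary = sqlite3.connect(":memory:")  # stands in for the system of record
primary.execute("CREATE TABLE countries (code TEXT PRIMARY KEY, name TEXT)")
primary.executemany("INSERT INTO countries VALUES (?, ?)",
                    [("ZA", "South Africa"), ("GB", "United Kingdom")])

# Load once; subsequent reads never touch the database.
cache = dict(primary.execute("SELECT code, name FROM countries"))

def country_name(code):
    return cache.get(code)

print(country_name("ZA"))  # South Africa
```

The trade-off is staleness: the cache must be refreshed or invalidated when the primary changes.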
Data Lifecycle
As the value of Data increases, applications inevitably store more parameters. When the breadth of Data increases, it becomes more important economically to have a clear Data Retention Strategy. Vertical Scaling also has physical limitations, and ultimately performance can drop off drastically once the tipping point is reached in an RDBMS. Retention processes must be put in place for all OLTP Systems upfront.
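A retention process can be as simple as periodically moving rows older than a configured window into an archive table and purging them from the hot table. The sketch below assumes an "orders" table and a 90-day window, both illustrative:

```python
# Hedged sketch of an upfront retention process: archive then purge.
import sqlite3

RETENTION_DAYS = 90
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT);
    CREATE TABLE orders_archive (id INTEGER PRIMARY KEY, created_at TEXT);
    INSERT INTO orders VALUES (1, '2020-01-01'), (2, date('now'));
""")

cutoff = f"-{RETENTION_DAYS} days"
with conn:  # archive and purge atomically
    conn.execute("INSERT INTO orders_archive "
                 "SELECT * FROM orders WHERE created_at < date('now', ?)", (cutoff,))
    conn.execute("DELETE FROM orders WHERE created_at < date('now', ?)", (cutoff,))

hot = conn.execute("SELECT count(*) FROM orders").fetchone()[0]
archived = conn.execute("SELECT count(*) FROM orders_archive").fetchone()[0]
print(hot, archived)  # 1 1
```

Running such a job on a schedule keeps the OLTP tables within their performance envelope.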
Data Quality
The requirement for Data Quality should always be pushed as close to the point of capture as possible. Data Sources can assist by following strict structures, e.g. data types, lengths etc. must be strictly enforced by the Database Schema. Views to identify duplicates can be set up without having to move the data over to an analytical source.
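Both mechanisms can be sketched together: schema constraints reject bad data at the point of capture, and a view surfaces duplicates in place. The "customers" table, its constraint, and the sample rows are illustrative assumptions:

```python
# Hedged sketch: constraints at the point of capture plus a duplicate view.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL CHECK (length(email) <= 254 AND email LIKE '%_@_%')
    );
    -- Duplicates identified without moving data to an analytical source.
    CREATE VIEW duplicate_emails AS
        SELECT email, count(*) AS occurrences
        FROM customers GROUP BY email HAVING count(*) > 1;
    INSERT INTO customers (email) VALUES ('a@x.com'), ('a@x.com'), ('b@x.com');
""")

rejected = False
try:
    conn.execute("INSERT INTO customers (email) VALUES ('not-an-email')")
except sqlite3.IntegrityError:
    rejected = True  # the schema refused malformed input outright

dupes = conn.execute("SELECT email, occurrences FROM duplicate_emails").fetchall()
print(rejected, dupes)  # True [('a@x.com', 2)]
```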
Data Services
All Data Sources can eventually be used for MI / BI and Analytics, and some Data Sources may also be required operationally by other Systems; Performance must be factored in for this. Limit batches (especially long-running ones) as far as possible to enable anytime access.
Data Structures
Database technologies have advanced rapidly in the past 5-10 years. New (No) SQL Sources are fast gaining popularity, largely because of their agility and light-weight, fit-for-purpose nature.
Databases like MongoDB, for example, seem like good candidates for Web Applications, but attention must be given to the above-mentioned considerations, as transactional integrity may be compromised for performance etc.
Make appropriate use of Key-Value, Graph and other NoSQL data sources.
Advanced Data Source Considerations
Data Retention & Schema Versioning
Schemas should be Version Controlled, Managed and Archived. Data Archives should be tied back to the appropriate Schema version; otherwise archived data may become impossible to interpret or restore.
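One common pattern is to record the schema version in the database itself, so every backup or archive carries the version it was written under. The migration list and table names below are illustrative assumptions:

```python
# Hedged sketch: track applied schema versions alongside the data.
import sqlite3

MIGRATIONS = [
    (1, "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE customers ADD COLUMN email TEXT"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schema_version ("
             "version INTEGER PRIMARY KEY, "
             "applied_at TEXT DEFAULT (datetime('now')))")

def migrate(conn):
    current = conn.execute(
        "SELECT coalesce(max(version), 0) FROM schema_version").fetchone()[0]
    for version, ddl in MIGRATIONS:
        if version > current:
            with conn:  # apply the DDL and record it atomically
                conn.execute(ddl)
                conn.execute("INSERT INTO schema_version (version) VALUES (?)",
                             (version,))

migrate(conn)
current = conn.execute("SELECT max(version) FROM schema_version").fetchone()[0]
print(current)  # 2
```

An archive restored later can be matched against this table to select the correct schema.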
Security
Not all RDBMS Systems are equal; a fundamental yet often least-understood aspect is the various layers of security offered by each DBMS.
Depending on the sensitivity of the System's Data, the correct security considerations need to apply.
Considerations include but are not limited to :
- Does the Database integrate into an Enterprise Directory, or does it rely on an internal Security Directory
- Does the Database support role based access
- Does the Database support encryption
- What level of Audit Logging can be configured
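Role-based access, for example, reduces to mapping roles to grants and users to roles. The sketch below models this in application code purely for illustration; in production this would be the DBMS's own GRANT/REVOKE machinery, and the roles, user and table names are assumptions:

```python
# Hedged sketch of role-based access: roles hold grants, users hold roles.
ROLE_GRANTS = {
    "reporting_ro": {("orders", "SELECT")},
    "order_entry":  {("orders", "SELECT"), ("orders", "INSERT")},
}
USER_ROLES = {"amit": ["reporting_ro"]}

def is_allowed(user, table, action):
    # A user may perform an action if any of their roles grants it.
    return any((table, action) in ROLE_GRANTS[role]
               for role in USER_ROLES.get(user, []))

print(is_allowed("amit", "orders", "SELECT"))  # True
print(is_allowed("amit", "orders", "INSERT"))  # False
```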
Scalability
As a System's usage scales, so typically does the Database; scaling ability can often determine a System's Performance.
Horizontal vs Vertical Scaling
"Horizontal scaling means that you scale by adding more machines into your pool of resources whereas Vertical scaling means that you scale by adding more power (CPU, RAM) to an existing machine."
Vertical Scaling
Vertical scaling rarely impacts the calling application and is usually the simpler route; it is recommended as the first alternative. Vertical scaling does, however, have a physical tipping point; when it is reached, it is logical to adopt a horizontal scaling approach.
Horizontal Scaling
Databases designed for Big Data typically follow a Horizontal Scaling pattern from the get-go. These Databases usually hide the complexity of Data Sharding by providing single points of access, typically through the use of Access Nodes.
Where Database Systems don't hide the complexity, the application may need to deal with many Data Sources to perform a single function, which adds a lot of complexity to the ecosystem. If it is envisaged that an application may require large amounts of Data, select a Database Management System that supports Horizontal Scaling, and ensure that it hides the complexities that come with the strategy.
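What an Access Node does behind the scenes can be sketched as routing each key to a shard so the application sees a single point of access. The shard count and hash choice below are illustrative assumptions:

```python
# Hedged sketch of hash-based shard routing behind a single access point.
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2"]

def shard_for(key: str) -> str:
    # A stable hash keeps the same key on the same shard across processes
    # (unlike Python's built-in hash(), which is salted per process).
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

print(shard_for("customer-42") == shard_for("customer-42"))  # True: stable routing
```

Note that naive modulo routing reshuffles most keys when the shard count changes; real systems use consistent hashing or range maps for this reason, which is part of the management cost the section warns about.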
Always ensure a good retention strategy is in place before considering scaling.
Database Automation
Automation of Database platform provisioning and updates must be planned for. Database setup usually requires robust configuration, all of which should be automated as far as possible; Technology selection must include Automation as a parameter.
Scaling should also be automated where possible.
Data extracts
Data extracts should be catered for in the initial Design; the Database platform must support both application and extract load. Some Systems will require realtime access for both OLAP and operational purposes; consider support for realtime extract.
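One way to support near-realtime extract without long-running batches is change capture: a trigger writes each change to a log table that a downstream consumer can poll. The "orders" table and log columns below are illustrative assumptions:

```python
# Hedged sketch: a trigger feeds a change log for near-realtime extract.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER);
    CREATE TABLE change_log (
        seq        INTEGER PRIMARY KEY AUTOINCREMENT,
        order_id   INTEGER,
        amount     INTEGER,
        changed_at TEXT DEFAULT (datetime('now'))
    );
    CREATE TRIGGER orders_extract AFTER INSERT ON orders
    BEGIN
        INSERT INTO change_log (order_id, amount) VALUES (NEW.id, NEW.amount);
    END;
""")

conn.execute("INSERT INTO orders (amount) VALUES (120)")
rows = conn.execute("SELECT order_id, amount FROM change_log").fetchall()
print(rows)  # [(1, 120)]
```

The extract load still hits the same platform, which is why both application and extract load must be sized together.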
Support & Resilience
Integrated Monitoring, Predictive Performance Analysis etc. should be planned for each Database System. Machine metrics and DB logs should be appropriately configured in production environments, and a proper Retention Strategy for Database System logs should be in place; these logs are most informative for Performance, Scaling Requirements etc.
Predictable failures like "out of space" should be identified through proper monitoring.
Database Selection Matrix
The following is a quick matrix for Database Selection.
Development Considerations
- What Data Structures need to be supported ?
- What is the typical size of Data Types ?
- Does the Application Require High Degree of Data Integrity ?
- What Concurrency would the application require ?
Operational Considerations
- Who will Administrate the Database Platform ?
- How will the platform be monitored ?
- How will the Platform Scale ?
- What degree of Automation is achievable ?
Commercial Considerations
- Is the Technology Commercially viable for the Systems Use Cases ?
- What Training is Available ?
- What is the TCO ?
- What is the SLA required by the System ?
Security Considerations
- What types of Data Access Control is Required ?
- What types of Authorisation Controls are available ?
- Is Encryption Required and Does the Platform Support this ?
References :
- http://searchsqlserver.techtarget.com/definition/ACID
- http://cassandra.apache.org/
- https://datahq.co.uk/knowledge-centre/blog/top-10-considerations-whenchoosing-a-database-management-system