Driven by Big Data – Governance & Security

My definition of this topic is “a necessary component of any big data solution with a perfect blend of skepticism and confidence by all of those involved”. Think of it not just as securing a file with permissions and capturing its name and glossary entry, but as designing sustainable architecture.

Our recent blog “Make Data your BAE – Best Asset Ever!” by Roger Wahman helps in justifying the enablement aspect of data governance and is a great starting point for this discussion.

With data being the primary asset driving organizations, setting up controls around it at an early stage is becoming a top priority for all of us practitioners.

I believe we all recognize why governance and security are worth having. With that said, this article focuses on how you could start your journey down this path.

Big Data Governance: Fortunately, we now live in a world where what was once a “clash of Business and IT” is now a collaborative force, and the maturing practice itself is driving innovation in this field. We are now implementing methods to discover and profile data, tag relevant data points, and add a meaningful glossary, all of which help us build and manage our data asset inventory. I see it driving not just analytical tasks but also the overall application design.

One of the notable open source projects in this space is Apache Atlas. Initially targeted at providing end-to-end data lineage, it is now a full-suite solution with a business catalog, cross-component lineage, and support for existing access policies. Its integration with Apache Ranger and Apache Falcon to act on defined tag-based policies is one of its most prominent features. Waterline Data is another great solution, with support for numerous big data suites. It provides automated data profiling and data tagging, and it lets users perform faceted search, maintain a metadata repository, and export this information to other platforms such as Apache Atlas. The value proposition of bringing in such technologies is strongest when they are coupled with compliance models such as FISMA, HIPAA, SOX, or PCI DSS.
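To make the idea of a tag-based policy concrete, here is a minimal sketch in plain Python. It does not use the Apache Atlas or Apache Ranger APIs; the names `TagPolicy`, `DataAsset`, and `is_access_allowed`, along with the sample tags and groups, are hypothetical and only illustrate how a tag attached to an asset can drive an access decision.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TagPolicy:
    """Hypothetical policy: only these groups may read assets carrying this tag."""
    tag: str
    allowed_groups: frozenset


@dataclass
class DataAsset:
    """Hypothetical catalog entry: an asset name plus the tags attached to it."""
    name: str
    tags: set = field(default_factory=set)


def is_access_allowed(asset, user_groups, policies):
    """Deny access if any policy matching one of the asset's tags excludes the user."""
    groups = set(user_groups)
    for policy in policies:
        if policy.tag in asset.tags and not (policy.allowed_groups & groups):
            return False
    return True


# Illustrative data only.
policies = [TagPolicy("PII", frozenset({"compliance", "data_stewards"}))]
customers = DataAsset("customers.email", {"PII"})
clicks = DataAsset("web.clicks")
```

The point of the design is that the policy is written once against the tag, not against each table or file, so newly tagged assets inherit the control without anyone editing the policy.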

Big Data Security: This is one of my favorite topics of discussion, and I see it as a holistic approach to defining security across the overall architecture. The best approach is often to assume that an attack is inevitable. With DDoS and ransom attacks on the rise, the ability to recover with minimal damage must be the basis for any solution architecture. Inter-cluster mirroring with a small RPO (recovery point objective) and RTO (recovery time objective) is the way to go. A reliable, no-loss data streaming application hosted in the cloud is another great option that many organizations are now adopting.
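As a rough illustration of what a small RPO means for mirroring, the sketch below compares the replication lag between two clusters against an RPO target. The function names and the timestamps are assumptions made for the example; in practice the timestamps would come from whatever mirroring tool you run.

```python
from datetime import datetime, timedelta


def replication_lag(last_source_write, last_replicated_write):
    """Data at risk: the time between the newest source write and the newest mirrored write."""
    return max(last_source_write - last_replicated_write, timedelta(0))


def meets_rpo(lag, rpo):
    """True if losing the source cluster now would lose no more data than the RPO allows."""
    return lag <= rpo


# Illustrative timestamps only.
source_ts = datetime(2018, 1, 1, 12, 0)
mirror_ts = datetime(2018, 1, 1, 11, 48)
lag = replication_lag(source_ts, mirror_ts)
```

Tracking this lag continuously, and alerting when it exceeds the target, is one simple way to know whether the recovery assumptions in the architecture still hold.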

There is a healthy list of solutions now for achieving your security goals. One such solution is Apache Knox. It integrates with Kerberos and LDAP, supports SSO to other services, and provides a decent level of audit data. You may also want to look at solutions that support Linux-based block-device and field-level encryption. Another proactive approach is to set up a security analytics framework that allows analysis of audit trails and external IPs, and helps build key KPIs for the security team.
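A security analytics framework can start very small. The sketch below, using only the Python standard library, derives a few KPIs from audit events. The event fields (`user`, `client_ip`, `result`) and the `DENIED` value are assumptions for illustration and do not reflect the audit format of Apache Knox or any other product.

```python
import ipaddress
from collections import Counter


def is_external(ip):
    """Treat any address outside private, loopback, and other reserved ranges as external."""
    return not ipaddress.ip_address(ip).is_private


def security_kpis(events):
    """Summarize audit events into a few simple KPIs for the security team."""
    denied = [e for e in events if e["result"] == "DENIED"]
    return {
        "total_events": len(events),
        "denied_ratio": len(denied) / len(events) if events else 0.0,
        "external_ips": sorted({e["client_ip"] for e in events if is_external(e["client_ip"])}),
        "denied_by_user": Counter(e["user"] for e in denied),
    }


# Illustrative audit events only.
events = [
    {"user": "alice", "client_ip": "10.0.0.5", "result": "ALLOWED"},
    {"user": "bob", "client_ip": "8.8.8.8", "result": "DENIED"},
    {"user": "bob", "client_ip": "8.8.8.8", "result": "DENIED"},
    {"user": "carol", "client_ip": "192.168.1.20", "result": "ALLOWED"},
]
kpis = security_kpis(events)
```

Even a summary this simple gives the security team a baseline, so that a sudden rise in the denied ratio or a new external address stands out.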

Governance requires not just adding a technology to your architecture but also setting up processes and dedicated resources to help define and evangelize its long-term benefits. This would also drive the policies behind a more secure application. Security is multifaceted, and my recommendation is to take a bottom-up approach, starting from the underlying infrastructure and working all the way up to how external users access the application.

My Blog Series - Driven by Big Data

Nicely put broad thoughts. The need for data privacy has always been there, but GDPR is now driving it. When it comes to privacy, it is important that the responsible function integrates its op-model well with the Metadata, Data Quality, Architecture and Risk functions. Early discovery of privacy entitlements and security controls is required across the entire information lifecycle. But the metamodels and governance op-models in firms need to be improved to accommodate discovery and ownership of privacy requirements. An entitlement can be as simple as a data owner saying "third party sourced PII data must not be stored within the organization and should be used specifically for lead verification alone". These requirements can then be realized as controls in Design, Architecture, DB or application. Organizations are tackling many challenges in integrating Data Privacy with the Data Management and Governance functions. Vendors are yet to accommodate most of these needs in their solutions, though some, as you stated, are available.


Thanks Ayush Kumar for the mention. We're enabling some interesting use cases in the compliance and risk management space including GDPR and others that you mention. Look forward to working with your team.
