Implementing Kerberos and sssd to protect HDFS data
Healthcare and financial services are two industries that require controls on access to data at rest in a relational database, because of federal laws enacted to protect patient privacy and to prevent alteration of financial data reported to the SEC. HIPAA regulates patient privacy concerns around PII and PHI (personally identifiable information and protected health information). Sarbanes-Oxley regulates financial data such as that included in the 10-Q reports submitted to the Securities and Exchange Commission. These are just two of the many reasons why governing access to data at rest is now more important than ever.
In my work as a Solutions Architect I recently implemented Kerberos and sssd (the System Security Services Daemon) for a major healthcare organization here in Boston, to protect research data at rest in a Hadoop environment. However, the same techniques can be applied to a relational database environment such as Greenplum, Postgres or SQL Server.
This discussion focuses on Linux as the platform; however, the same techniques can be applied, with some modification, to a proprietary OS such as HP-UX, AIX, Solaris or Windows.
Kerberos is a two-way (mutual) authentication solution: the user and the service each prove their identity to the other through a trusted third party, using secret keys and time-limited tickets that are presented whenever someone accesses a particular service. Principals and their keys are stored in the database of a KDC (Key Distribution Center), which runs on a server that has one or several standby servers to provide fault tolerance. Each service (e.g. HDFS, YARN, Hive) has a service principal in the KDC, and a keytab file holding that principal's secret key is stored on the servers in the cluster that run the service. Because anyone who can read a keytab can use the keys in it, the permissions on the keytab files are set to 400 on Linux and ownership is set to the service account for that service (e.g. hive, hdfs, postgres, oracle), so that only root or the service account can read them. The service uses its keytab to validate the ticket a user presents, and the user is authorized for the service they are requesting only after they have been successfully granted a ticket.
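The keytab permission rule described above can be checked with a few lines of Python. This is a minimal sketch, not part of the original deployment: the function name keytab_is_locked_down is made up for illustration, and the caller supplies the path and the numeric uid of the service account.

```python
import os
import stat


def keytab_is_locked_down(path, expected_uid):
    """Return True if the file is mode 0400 and owned by expected_uid.

    Hypothetical helper for illustration only.
    """
    info = os.stat(path)
    mode = stat.S_IMODE(info.st_mode)  # keep only the permission bits
    return mode == 0o400 and info.st_uid == expected_uid
```

On a real cluster the uid would typically come from pwd.getpwnam("hdfs").pw_uid or similar, and the keytab path depends on how the distribution lays out its files.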
Tickets are stored on each server in a credential cache, which is readable only by the user who owns it. When the user issues a kinit command and provides their credentials (typically the password stored with their principal in the KDC, or their LDAP/Active Directory password where the realm is integrated with a directory), they are granted a ticket that has a predefined lifetime. This ticket is then used for every interaction with the services they are using, such as a Hive query or a MapReduce job. The Kerberos realm administrator defines the ticket lifetimes for the realm in the Kerberos configuration files that reside on the KDC server and are copied out to each of the other servers in the same Kerberos realm. Typical lifetimes are measured in hours. Very rarely will someone set them to last for several days, unless it is for a high-performance compute cluster running an analytics job that lasts for several days.
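To make the lifetime rule concrete, here is a small sketch of how an expiry time follows from the requested lifetime and the realm-wide maximum. It assumes MIT Kerberos, where these are typically set with ticket_lifetime in krb5.conf and max_life in kdc.conf, and it only understands simple durations such as 10h or 7d (Kerberos itself accepts more formats). The function names are hypothetical.

```python
import re
from datetime import datetime, timedelta

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_lifetime(text):
    """Parse a simple duration such as '10h' or '7d' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def ticket_expiry(issued_at, requested, realm_max):
    """Expiry time when the realm maximum caps the requested lifetime."""
    lifetime = min(parse_lifetime(requested), parse_lifetime(realm_max))
    return issued_at + lifetime
```

For example, a user who asks for a 7d ticket in a realm capped at 24h still gets a ticket that expires 24 hours after it was issued.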
Without a valid ticket, a user cannot access any of the services in the Kerberos realm and is thereby prevented from using any service on the cluster to manipulate the data.
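A script that is about to submit work to the cluster can test for a valid credential first. The sketch below relies on the MIT Kerberos klist -s option, which prints nothing and exits with status 0 only when the credential cache is readable and unexpired. The function name and the injectable run parameter are assumptions made for illustration and testing.

```python
import subprocess


def has_valid_ticket(run=subprocess.run):
    """Return True if `klist -s` reports a usable credential cache."""
    try:
        result = run(["klist", "-s"])
    except FileNotFoundError:
        # klist is not installed, so no Kerberos credential can be verified
        return False
    return result.returncode == 0
```

A job wrapper might call has_valid_ticket() and ask the user to run kinit when it returns False, rather than letting the job fail partway through.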
Kerberos can also be set up in a cross-realm trust with a directory service such as Active Directory, so that users can provide their directory credentials (username/password) to request a ticket. This is very common in large enterprise organizations as a further ring of security in an environment. If a person's account is locked or deactivated in the directory, they won't have access to any of the servers or services within the enterprise. It also eliminates the need to maintain multiple local /etc/passwd files on hundreds or thousands of servers.
The key files to maintain are /etc/krb5.conf on each server in the Kerberos realm and the KDC configuration file on the KDC server (commonly /var/kerberos/krb5kdc/kdc.conf on Red Hat-style systems). The actual location may vary depending on your implementation.
On Linux, the sssd package provides centralized sign-on and works in conjunction with PAM (Pluggable Authentication Modules). SSSD stands for System Security Services Daemon. With sssd in place, users can authenticate against an existing LDAP directory to access a user shell on a server, rather than relying on the local /etc/passwd and /etc/group files. This is fairly straightforward to set up.
The configuration file for this is /etc/sssd/sssd.conf, and there is an /etc/pam.d/system-auth file that needs to be defined to determine which PAM modules are loaded by the login services and in what order. Finally, there is the /etc/nsswitch.conf file, which determines the order in which the OS consults its sources (local files, sssd, LDAP) when looking up users and groups. Listing the directory source first will speed things up on login in most cases. sssd also provides another ring of security around an environment: a user who does not have valid LDAP/AD credentials cannot even log in to a server where sssd is handling authentication. When a user does log in, they are dropped automatically into their home directory and their LDAP group memberships apply to their files.
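The lookup order in /etc/nsswitch.conf is easy to inspect programmatically. The following is a minimal sketch with a hypothetical function name; it assumes the usual "database: source source ..." layout and ignores bracketed action items such as [NOTFOUND=return]. With sssd, the source name that appears in the file is sss.

```python
import re


def lookup_order(nsswitch_text, database):
    """Return the sources configured for one database, in order."""
    for raw_line in nsswitch_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        name, sources = line.split(":", 1)
        if name.strip() == database:
            # Remove action items like [NOTFOUND=return] before splitting
            sources = re.sub(r"\[[^\]]*\]", " ", sources)
            return sources.split()
    return []
```

Calling lookup_order(open("/etc/nsswitch.conf").read(), "passwd") on a server shows whether sss or files is consulted first for user lookups.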
Finally, I want to talk about encryption of data at rest. Data that sits in tables is organized as two-dimensional objects consisting of rows and columns. A common use case in healthcare clinical research is that lab and medical record results that patients have signed off on for use in clinical research will almost always contain PII and PHI. Researchers won't have the time or the inclination to scrub that data before ingesting it into a relational data store. In some cases they may dispatch a post-doc to do that work for them, but in the end it's of little value to their basic research. The problem really falls on IT to prevent PII and PHI from being exposed to the wrong people. We can do this by encrypting the columns of data. The Postgres open source community has developed and maintains the pgcrypto package, which is also available in a FIPS-compliant form as the pgcrypto.fips package. Data at rest such as patient name, social security number and address is encrypted in each column, so that someone who queries the table retrieves the encrypted value rather than clear text. If the data needs to be decrypted, the appropriate key (the passphrase for symmetric encryption, or the secret key when the data was encrypted with a public key) must be supplied in the query, which will then decrypt those values and display them in plain text.
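As a rough illustration of column-level encryption, the sketch below builds the SQL for pgcrypto's symmetric functions pgp_sym_encrypt and pgp_sym_decrypt. The symmetric (passphrase) variant is used here for brevity; pgcrypto also offers pgp_pub_encrypt and pgp_pub_decrypt for key pairs. Everything else is an assumption: the helper names, a hypothetical patients table with a bytea column ssn_enc, and %s placeholders in the style of a driver such as psycopg2.

```python
def _check_identifier(name):
    """Reject anything that is not a plain SQL-safe identifier."""
    if not name.isidentifier():
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def encrypt_insert_sql(table, column):
    """INSERT that encrypts the value server-side; params: (value, passphrase)."""
    table, column = _check_identifier(table), _check_identifier(column)
    return f"INSERT INTO {table} ({column}) VALUES (pgp_sym_encrypt(%s, %s))"


def decrypt_select_sql(table, column):
    """SELECT that returns clear text; params: (passphrase,)."""
    table, column = _check_identifier(table), _check_identifier(column)
    return f"SELECT pgp_sym_decrypt({column}, %s) FROM {table}"
```

With a database cursor this would be used as cursor.execute(encrypt_insert_sql("patients", "ssn_enc"), (ssn, passphrase)). A plain SELECT of the column returns only the encrypted bytes, while the decrypt query returns clear text when the correct passphrase is supplied.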