Unity Catalog is a robust tool designed to provide centralized access control, auditing, lineage, and data discovery capabilities across Databricks workspaces. It offers a unified platform for administering data access policies, ensuring that these policies apply consistently across all workspaces and user personas. It supports a standards-compliant security model based on ANSI SQL, which allows for familiar syntax when granting permissions. Additionally, it automatically captures detailed audit logs and lineage data to track data access and usage, and facilitates data discovery through tagging, documentation, and an intuitive search interface.
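As a sketch of that ANSI SQL-based model, granting and revoking access uses familiar syntax; the catalog, schema, table, and group names below are placeholders:

```sql
-- Grant read access on a table to a group (all names are illustrative)
GRANT USE CATALOG ON CATALOG main TO `data-analysts`;
GRANT USE SCHEMA ON SCHEMA main.sales TO `data-analysts`;
GRANT SELECT ON TABLE main.sales.orders TO `data-analysts`;

-- Revoking works symmetrically
REVOKE SELECT ON TABLE main.sales.orders FROM `data-analysts`;
```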
One of the key advantages of Unity Catalog is its ability to work seamlessly with your existing data catalogs, data storage systems, and governance solutions. This means you can leverage your current investments and build a future-proof governance model without incurring expensive migration costs.
The primary objective of Unity Catalog is to enable teams to efficiently manage and collaborate on their data assets. By implementing best practices for utilizing Unity Catalog, organizations can unlock the full potential of their data and enhance collaboration across teams.
Here are some best practices for using Unity Catalog:
- Configure a Unity Catalog Metastore: Create a single Metastore for each region where you use Azure Databricks and link it to all the workspaces in that region. This ensures that you have a centralized governance solution for data and AI on the Databricks Lakehouse. It is crucial to prevent direct user access to the designated root storage location for managed tables to maintain security and auditability.
- Use Cluster Configurations to Control Data Assets: Enforce standardized cluster configurations to prevent resource misuse, optimize utilization, and help control costs. Accurately tagging clusters supports precise chargeback processes and transparent cost allocation across teams.
- Use Audit Logs: Monitor and track the access and usage of data assets through audit logs. These logs record various events such as queries, updates, grants, and revokes, which can be used to analyze user behavior, detect anomalies, enforce compliance, and troubleshoot issues.
- Share Data Using Delta Sharing: Use Delta Sharing to share data between Metastores or with external parties. Delta Sharing is a secure and open protocol for sharing Delta Lake tables across organizations and platforms, enabling cross-metastore queries, federated analytics, and data collaboration.
- Mind DBFS Access When Launching Unity Catalog Clusters: When designing cluster configurations, choose between single-user and shared access modes. Single-user mode lets you run queries and commands on the cluster as yourself, while shared mode lets multiple users share the cluster's resources; keep in mind that DBFS access is more restricted on clusters using shared access mode.
- Secure Your Unity Catalog-Managed Storage: Ensure that the storage locations used by Unity Catalog are not accessible by any users directly. Encrypt your data both at rest and in transit to prevent unauthorized access.
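As an illustration of putting audit logs to work, the query below surfaces users with repeated denied actions, which can hint at misconfigured permissions or probing. It assumes the `system.access.audit` system table is enabled in your account; check the column names against the audit log schema in your workspace before relying on them:

```sql
-- Count denied (HTTP 403) actions per user over the last 7 days
-- (table and column names assume the system.access.audit schema)
SELECT
  user_identity.email            AS user_email,
  count(*)                       AS denied_actions
FROM system.access.audit
WHERE response.status_code = 403
  AND event_time >= current_timestamp() - INTERVAL 7 DAYS
GROUP BY user_identity.email
ORDER BY denied_actions DESC;
```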
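On the provider side, Delta Sharing can be driven entirely with SQL. A minimal sketch, assuming a table `main.sales.orders` and a recipient named `partner_org` (both names are placeholders):

```sql
-- Create a share and add a table to it (names are illustrative)
CREATE SHARE sales_share COMMENT 'Orders shared with partners';
ALTER SHARE sales_share ADD TABLE main.sales.orders;

-- Create a recipient and give it access to the share
CREATE RECIPIENT partner_org;
GRANT SELECT ON SHARE sales_share TO RECIPIENT partner_org;
```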
However, there are some general limitations to be aware of when using Unity Catalog.
- Language and Runtime Support: Scala, R, and workloads using Databricks Runtime for Machine Learning are supported only on clusters using the Single User access mode. Additionally, Python UDFs (including UDAFs, UDTFs, and Pandas on Spark) are not supported on shared clusters in Databricks Runtime 13.1 and below.
- Shallow Clones: Shallow clones are supported for creating Unity Catalog managed tables from existing Unity Catalog managed tables only in Databricks Runtime 13.1 and above.
- Bucketing: Bucketing is not supported for Unity Catalog tables. Commands that try to create a bucketed table in Unity Catalog will throw an exception.
- Cross-Region Performance: Writing to the same path or Delta Lake table from workspaces in multiple regions can lead to unreliable performance if some clusters access Unity Catalog and others do not.
- Custom Partition Schemes: Custom partition schemes created using commands like ALTER TABLE ADD PARTITION are not supported for tables in Unity Catalog. However, Unity Catalog can access tables that use directory-style partitioning.
- Overwrite Mode: Overwrite mode for DataFrame write operations into Unity Catalog is supported only for Delta tables, not for other file formats. To overwrite a table, the user must have the CREATE privilege on the parent schema and must either be the owner of the existing object or have the MODIFY privilege on it.
- Cluster Access Mode: Spark-submit jobs are supported on clusters that use single user access mode, but not on shared clusters.
- Group Management: Groups that were previously created in a workspace (workspace-level groups) cannot be used in Unity Catalog GRANT statements. To use groups in GRANT statements, you need to create your groups at the account level.
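For example, a grant to an account-level group looks the same as any other grant; the table and group names below are placeholders, and the statement fails if `analysts` exists only as a workspace-local group:

```sql
-- `analysts` must be an account-level group, not a workspace-local one
GRANT SELECT ON TABLE main.sales.orders TO `analysts`;
```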
These limitations are important to consider when planning and implementing Unity Catalog in your organization to ensure smooth and efficient data management.
By following these best practices, users can effectively utilize Unity Catalog to govern their data assets, improve data organization, enhance access control, and ensure data security. These recommendations help users optimize their workflows and leverage the capabilities of Unity Catalog for efficient and secure data management.