Row Level Security for S3 Data on Redshift Spectrum

Row Level Security for S3 Data on Redshift Spectrum

Part of the challenge of using some reporting applications on AWS is the limited connectors that are often available.

A recent challenge required reporting on data that was held in S3 and also to apply row-level security to that data. The reporting application in question did not have connectors for S3 but could connect to a database.

This article describes how I achieved this aim using AWS services. This article is in two parts - Part I (this part) describes the configuration of the components at a high level. The next article will discuss how the objective can be automated.

Challenge

The obvious approach was to make use of AWS Spectrum External Schema against a Glue Catalog.

However, The first challenge in implementing row-level security is the way that the external schema is treated in Redshift from a permissions perspective. There is no way to restrict access to certain objects in the external schema. A user either has access to the schema or not. It was an "all or nothing" scenario.

The second challenge is that even if you create a view on top of the External Schema, the user would not be able to see any data unless they had access to the External Schema. Therefore, we are back to where we began - an "all or nothing scenario".

Solution

The solution was to create a layer between the External Schema and the User that was not a basic view. This layer was not to be a standard view, but a materialized view. A materialized view - for all intents and purposes - is a table. When the view is created, all of the data is read from the source (in this case, S3) and persisted in the Redshift Database. This has both benefits and drawbacks:

Benefits

  1. Query performance is improved - your data is stored directly in Redshift, there is no call to S3 when a query is performed;
  2. Data types can be sensible - The Glue Data Catalog is very liberal when it comes to determining column data types. When using a materialized view, you can create the view with much tighter data types than the types that Glue would determine itself - without the need for manually updating the Glue Catalog

Drawbacks

  1. Because data is physically stored in Redshift, you pay for the volume of data this is stored;
  2. The Materialized views need to be refreshed when the data is updated in S3 (how to do this automatically will be discussed in Part II).

Basic Infrastructure Diagram

The solution involved the following high-level components:

  1. An S3 Bucket to contain the files (of course);
  2. A Glue Catalog that points to the S3 Bucket;
  3. An External Schema in Redshift (i.e. Spectrum) that points to the Glue Catalog;
  4. A Materialized View that included an additional "record type" metadata column;
  5. A User View that would provide access to the Materialized view, and filter the data that the user could see based on the record type.

The following architecture shows how these components are connected:

No alt text provided for this image

Amazon Redshift Data Model

The Redshift data model contains 4x different schemas:

  1. An External Schema that is connected to the Glue Catalog. The user does not have access to this schema;
  2. A middleware schema that contains materialized views. When the materialized view is created, a derived "Record Type" column is added to the view to indicate the security marking for that row (or, indeed, all rows) - and this is used to implement row-level security. The user also does not have access to objects in this schema;
  3. A permissions schema that contains a Permissions table with a list of User Names and "Record Types" the user has access to. The user does not have access to the objects in this schema either;
  4. Finally - a Query Schema, containing query Views. This view returns all records from the materialized view where the Record Type exists in the Permissions Table for the user running the query (the user name can be retrieved in this view using the Redshift CURRENT_USER function). The user requires read access to all objects in this schema.

The following diagram shows how these objects are linked:

No alt text provided for this image

Conclusion

And that's it!

With the above approach:

  1. A user can run all Query Views - a requirement, for example, to avoid errors when you have Dashboard Applications that are connecting to Redshift;
  2. When a user retrieves a view, the view is filtered to return only the records they have access to. If a user has access to none of the records, the view will still return - but will contain no data;
  3. There is no way for the user to access data other than through the query View;
  4. The same approach works with multiple Glue Catalog Tables and therefore also supports table-level security for users.
No alt text provided for this image

Note: This article is an opinion and does not constitute advice - any actions taken by a reader based on this article are at the discretion of the reader, who is solely responsible for the outcome of those actions.

To view or add a comment, sign in

More articles by Stephen McMaster

  • We are hiring AWS and Azure Cloud Engineers and Architects

    Calling all Cloud Engineers!!! If you have been continuously resident in the UK for 5+ years, and are based in Northern…

  • Downloadable Threat Landscape Diagram

    What is a Threat Landscape? A threat landscape is a way of illustrating major threats that a user or organisation may…

  • First Look: AWS SNS Message Data Protection

    AWS have a released a new Beta feature for SNS called "AWS SNS Message Data Protection". Given that I am one of those…

    1 Comment
  • Hands-On with AWS Skill Builder

    AWS Skill Builder is a new offering from AWS, allowing access to online training courses, hands-on AWS Labs, as well as…

  • Using Custom RDS Databases

    Is is commonly known that RDS supports a number of different database formats and that AWS provides full-stack support…

  • Using AWS SNS with Private HTTPS Endpoints

    Introduction In today's security-conscious world, many companies avoid Public Endpoints as a necessity. Private subnets…

  • Automatically Logging S3 File Entries to CloudWatch

    A recent challenge I have faced is the need to write AWS Service Logs to CloudWatch where the AWS Service did not write…

    2 Comments
  • AWS Architect Professional (Recertification)

    This article discusses my experience of re-sitting the AWS Architect Professional Exam (Spoiler: 3 years makes quite a…

    4 Comments
  • Building Docker Images using AWS Tooling

    For some time now, I have been frustrated with the limited means by which you can add your own Libraries to AWS Lambda.…

    4 Comments
  • Row Level Security for S3 Data on Redshift Spectrum (Part II)

    In the previous article of this 2-part blog, I outlined how Row-Level security can be implemented using Amazon Spectrum…

Others also viewed

Explore content categories