Using Redshift's external tables to query data stored in AWS S3 with SQL.

AWS Redshift’s external tables feature, part of Redshift Spectrum, allows you to query data stored in AWS S3 directly from your Redshift cluster using SQL. This lets you work with large amounts of data without having to load it into Redshift first. In this blog post, we’ll walk through the steps of setting up and querying external tables in Redshift.

What are External Tables?

External tables in Redshift are virtual tables that reference data files stored in AWS S3. These tables do not physically store the data within Redshift, but instead provide a schema and metadata for the files in S3. This allows you to query the data in S3 using standard SQL statements.

Setting Up an External Table

To set up an external table in Redshift, you’ll need to perform the following steps:

  1. Create an IAM Role: Create an IAM role with the necessary permissions to read the S3 bucket where your data is stored and to access the external data catalog, then associate the role with your Redshift cluster. This role will be used by Redshift to access the S3 files.

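  2. Create a Redshift Database: Choose or create the Redshift database where you’ll create the external table. You can use the database that was created along with the cluster or add another one with the CREATE DATABASE SQL statement. The cluster itself can be created through the AWS Management Console or by using the AWS CLI.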

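  3. Create an External Schema: Create an external schema within the Redshift database. This schema points to a database in an external data catalog (such as the AWS Glue Data Catalog), names the IAM role that Redshift uses to reach S3, and is used to organize the external tables. You can create the schema using the CREATE EXTERNAL SCHEMA SQL statement.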

  4. Create the External Table: Use the CREATE EXTERNAL TABLE SQL statement to create the external table inside the external schema. Specify the S3 path where your data is stored and the table schema; the IAM role is taken from the external schema rather than from the table definition. You can also define column mappings, data formats, and compression options.

  5. Verify the External Table: Run a SELECT query on the external table to verify that it is correctly set up and you can access the data stored in S3.
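The steps above can be sketched in SQL roughly as follows. This is a minimal sketch under stated assumptions, not a definitive setup: the schema name spectrum_demo, the catalog database demo_catalog_db, the table sales_events, the bucket path, and the role ARN are all hypothetical placeholders, and the column list assumes comma-separated files with a single header row.

```sql
-- Step 3: external schema backed by the AWS Glue Data Catalog.
-- The IAM role is named here, on the schema (hypothetical ARN).
CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_demo
FROM DATA CATALOG
DATABASE 'demo_catalog_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Step 4: external table describing the files under an S3 prefix.
-- Columns and file layout are assumptions about the data.
CREATE EXTERNAL TABLE spectrum_demo.sales_events (
    event_id   BIGINT,
    category   VARCHAR(64),
    amount     DECIMAL(12,2),
    event_time TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-example-bucket/sales_events/'
TABLE PROPERTIES ('skip.header.line.count' = '1');

-- Step 5: confirm the table is registered, then read a few rows.
SELECT schemaname, tablename, location
FROM svv_external_tables
WHERE schemaname = 'spectrum_demo';

SELECT * FROM spectrum_demo.sales_events LIMIT 10;
```

If the final SELECT fails with an access error, the usual suspects are the role's S3 or catalog permissions, or the role not being associated with the cluster.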

Querying Data from External Tables

Once the external table is set up, you can query the data using standard SQL statements. You can join the table with regular (internal) Redshift tables, apply filters and aggregations, and use most other query features supported by Redshift.
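As an illustration of combining external and internal data, the sketch below joins an external table with a regular Redshift table and aggregates the result. Every name in it is hypothetical: spectrum_demo.sales_events stands for an external table over S3 files, public.categories for an ordinary table stored in the cluster, and the columns are assumed for the example.

```sql
-- Join S3-backed rows (external) with a lookup table stored in Redshift.
SELECT c.category_name,
       COUNT(*)      AS event_count,
       SUM(e.amount) AS total_amount
FROM spectrum_demo.sales_events AS e      -- external table (hypothetical)
JOIN public.categories AS c               -- internal table (hypothetical)
  ON c.category = e.category
WHERE e.event_time >= '2024-01-01'        -- filter on the external data
GROUP BY c.category_name
ORDER BY total_amount DESC;
```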

For example, to query data from an external table named my_external_table, you can use the following SQL statement:

SELECT * FROM my_external_table WHERE category = 'electronics';

When you execute this query, Redshift reads the data from S3 and returns the results as if they were stored in a regular Redshift table.
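To see how much S3 data a query like this actually touched, one option is the SVL_S3QUERY_SUMMARY system view. The sketch below assumes the table lives in a hypothetical external schema named spectrum_demo and that both statements run in the same session, since pg_last_query_id() refers to the most recent query of the current session.

```sql
-- Run the query against the external table (schema name is hypothetical).
SELECT *
FROM spectrum_demo.my_external_table
WHERE category = 'electronics';

-- Then inspect what was scanned in S3 for that query.
SELECT query,
       external_table_name,
       s3_scanned_rows,
       s3_scanned_bytes,
       s3query_returned_rows,
       files
FROM svl_s3query_summary
WHERE query = pg_last_query_id();
```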

Performance Considerations

When working with external tables in Redshift, there are some performance considerations to keep in mind:
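Two considerations that commonly apply to external tables are file format and partitioning: a columnar format such as Parquet lets Redshift read only the columns a query needs, and partitioning lets it skip S3 prefixes that a filter rules out. The following is a minimal sketch of a partitioned Parquet table, assuming a hypothetical schema spectrum_demo, a hypothetical bucket layout with one prefix per day, and illustrative column names.

```sql
-- External table partitioned by date; the partition column is not
-- repeated in the regular column list.
CREATE EXTERNAL TABLE spectrum_demo.sales_events_by_day (
    event_id BIGINT,
    category VARCHAR(64),
    amount   DECIMAL(12,2)
)
PARTITIONED BY (event_date DATE)
STORED AS PARQUET
LOCATION 's3://my-example-bucket/sales_events_by_day/';

-- Each partition must be registered before it can be queried.
ALTER TABLE spectrum_demo.sales_events_by_day
ADD IF NOT EXISTS PARTITION (event_date = '2024-01-15')
LOCATION 's3://my-example-bucket/sales_events_by_day/event_date=2024-01-15/';

-- Filtering on the partition column limits the scan to that prefix.
SELECT category, SUM(amount) AS total_amount
FROM spectrum_demo.sales_events_by_day
WHERE event_date = '2024-01-15'
GROUP BY category;
```

Selecting only the columns you need, rather than SELECT *, also helps with columnar files because unused columns are not read.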

Conclusion

Using external tables in Redshift provides a powerful way to query and analyze your data stored in AWS S3 directly. By leveraging the scalability and performance benefits of Redshift, you can perform complex analytics on large datasets without the need to load the data into Redshift first.

In this blog post, we covered the basics of setting up and querying external tables in Redshift. By following the outlined steps and keeping performance in mind, you can analyze your S3 data from Redshift without loading it first.