An introduction to Redshift's COPY command for loading data from various sources.

25 Oct 2023

In Amazon Redshift, the COPY command is a powerful and efficient way to load data into your Redshift cluster from various sources. It is especially useful when dealing with large volumes of data.

What is the COPY command?
Loading data from S3
Loading data from other sources
COPY options
Conclusion

What is the COPY command?

The COPY command in Redshift allows you to load large amounts of data in parallel from multiple sources such as Amazon S3, Amazon DynamoDB, or even data in other Redshift clusters. It efficiently distributes the data to optimize load performance.

Loading data from S3

One of the most common use cases for the COPY command is loading data from Amazon S3. To load data from S3, you need to have the necessary IAM permissions and provide the S3 bucket and object key information in the COPY command. Here’s an example of loading a CSV file from S3:

COPY my_table
FROM 's3://my-bucket/my-file.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
CSV
DELIMITER ',';

You can specify additional options such as the file format, delimiter, encoding, and more. The COPY command automatically handles parallel data transfer and distribution across your Redshift cluster.

Loading data from other sources

Besides S3, Redshift’s COPY command supports loading data from other sources as well. This includes copying data from other Redshift clusters, Amazon DynamoDB, and even data from remote hosts via SSH.

To load data from other sources, you need to provide the appropriate credentials and connection information in the COPY command. Redshift has built-in support for various data formats such as CSV, JSON, Avro, and more, making it versatile for different data sources.

COPY options

The COPY command provides a wide range of options to customize the loading process based on your specific requirements. Some of the commonly used options include:

FORMAT: Specify the data format of the source file, such as CSV, JSON, AVRO, etc.
REGION: Specify the AWS region where the data is located.
IAM_ROLE: Specify the IAM role that has permission to access the data.
IGNOREHEADER: Specify the number of header rows to skip in the source file.
NULL AS: Specify the string representation of NULL in the source file.
MAXERRORS: Specify the maximum number of errors allowed before the COPY command fails.

For a full list of options, please refer to the Redshift COPY command documentation.

Conclusion

The COPY command in Amazon Redshift is a powerful tool for loading data from various sources. Whether it’s loading data from S3, other Redshift clusters, or external sources, the COPY command efficiently distributes and loads the data in parallel, making it ideal for handling large volumes of data. Experiment with different options to optimize your data loading process in Redshift.

#redshift #dataloading

Table of Contents

What is the COPY command?

Loading data from S3

Loading data from other sources

COPY options

Conclusion