How To Easily Extract Data With Headers From Amazon Redshift

The process to extract data from Redshift can be as simple as running an UNLOAD command. However, the UNLOAD command has some limitations. The biggest limitation is not allowing you to include a header row in your output.

We created a service that wraps the Redshift UNLOAD command with some extra capabilities. One of these capabilities is that the service automatically retrieves and adds headers to the extract from the convenience of a Docker container. Everything is built into the container, including psycopg, OS and other Python packages. What you will need to have is the connection details to your Redshift host and the AWS credentials to write the UNLOADED data to an S3 bucket.

Get Redshift Extracts package from GitHub

How To Run

The service requires a configuration file named config.json in the same directory. The config file is comprised of the following parameters set here to obtain database connection info, AWS credentials and any UNLOAD options you prefer to use.

A sample configuration file is below.

{
    "db": {
        "host": "test.redshift.io",
        "port": "5439",
        "database": "db1",
        "user": "username",
        "password": "password"
    },
    "aws_access_key_id": "myawsaccesskeyid",
    "aws_secret_access_key": "myawssecretaccesskey",
    "unload_options": [
    	"ADDQUOTES",
    	"PARALLEL OFF",
    	"ALLOWOVERWRITE",
    	"GZIP",
    	"DELIMITER ','"
    ]
}

Examples

This command will unload the data in the table mytable using thedatecol to specify a export period. It will then export to the specified S3 location.

Running inside a container:

python unload.py -t mytable -f s3://dest-bucket/foo/bar/output_file.csv -r datecol -r1 2017-01-01 -r2 2017-06-01

Running it via Docker command:

docker run -it openbridge/redshift-unload python /unload.py -t mytable -f s3://dest-bucket/foo/bar/output_file.csv -r datecol -r1 2017-01-01 -r2 2017-06-01

Runtime parameters

The service can accept different runtime parameters:

-t: The table you wish to UNLOAD
-f: The S3 key at which the file will be placed
-s (Optional): The file you wish to read a custom valid SQL WHERE clause from. This will be sanitized then inserted into the UNLOAD command.
-r (Optional): The range column you wish to use to constrain the results. Any type supported by Redshift's BETWEEN function is accepted here (date, integer, etc.)
-r1 (Optional): The desired start range to constrain the result set
-r2 (Optional): The desired end range to constrain the result setNote: -s and -d are mutually exlusive and cannot be used together. If neither is used, the script will default to not specifying a WHERE clause and output the entire table.

The -s Option

As mentioned previously, it is possible to supply your own WHERE clause to be used in the UNLOAD command. This is done via an external SQL file.

Here is an example. To use this functionality to UNLOAD only new users, you createD a SQL file called new-users.sql that contains WHERE is_new = true. Pretty simple.

python unload.py -t mytable -f s3://dest-bucket/foo/bar/output_file.csv -r datecol -r1 2017-01-01 -r2 2017-06-01 -s /new-users.sql

NOTE: The -s option will only work for the WHERE clause. If you try other clauses it will fail. The script would need to be ehanced to do more.

Security

It may be wise to create a read-only user on the database for the script to use. This will improve the security of the script by further protecting against SQL injection. For more information on how to do this, check the manual for your database type.

References

Unloading a file from Redshift to S3 (with headers) | Medium.com

Unloading from redshift to s3 with headers | StackOverflow

How to Extract Data from Amazon Redshift | FlyData

How to download data from a Redshift table as a CSV file | andreigridnev.com