The process to extract data from Redshift can be as simple as running an UNLOAD
command. However, the UNLOAD
command has some limitations. The biggest limitation is not allowing you to include a header row in your output.
We created a service that wraps the Redshift UNLOAD
command with some extra capabilities. One of these capabilities is that the service automatically retrieves and adds headers to the extract from the convenience of a Docker container. Everything is built into the container, including psycopg,
OS and other Python packages. What you will need to have is the connection details to your Redshift host and the AWS credentials to write the UNLOADED
data to an S3 bucket.
How To Run
The service requires a configuration file named config.json
in the same directory. The config file is comprised of the following parameters set here to obtain database connection info, AWS credentials and any UNLOAD
options you prefer to use.
A sample configuration file is below.
{
"db": {
"host": "test.redshift.io",
"port": "5439",
"database": "db1",
"user": "username",
"password": "password"
},
"aws_access_key_id": "myawsaccesskeyid",
"aws_secret_access_key": "myawssecretaccesskey",
"unload_options": [
"ADDQUOTES",
"PARALLEL OFF",
"ALLOWOVERWRITE",
"GZIP",
"DELIMITER ','"
]
}
Examples
This command will unload the data in the table mytable
using thedatecol
to specify a export period. It will then export to the specified S3 location.
Running inside a container:
python unload.py -t mytable -f s3://dest-bucket/foo/bar/output_file.csv -r datecol -r1 2017-01-01 -r2 2017-06-01
Running it via Docker command:
docker run -it openbridge/redshift-unload python /unload.py -t mytable -f s3://dest-bucket/foo/bar/output_file.csv -r datecol -r1 2017-01-01 -r2 2017-06-01
Runtime parameters
The service can accept different runtime parameters:
-t: The table you wish to UNLOAD
-f: The S3 key at which the file will be placed
-s (Optional): The file you wish to read a custom valid SQL WHERE clause from. This will be sanitized then inserted into the UNLOAD command.
-r (Optional): The range column you wish to use to constrain the results. Any type supported by Redshift's BETWEEN function is accepted here (date, integer, etc.)
-r1 (Optional): The desired start range to constrain the result set
-r2 (Optional): The desired end range to constrain the result setNote: -s and -d are mutually exlusive and cannot be used together. If neither is used, the script will default to not specifying a WHERE clause and output the entire table.
The -s Option
As mentioned previously, it is possible to supply your own WHERE clause to be used in the UNLOAD command. This is done via an external SQL file.
Here is an example. To use this functionality to UNLOAD only new users, you createD a SQL file called new-users.sql
that contains WHERE is_new = true
. Pretty simple.
python unload.py -t mytable -f s3://dest-bucket/foo/bar/output_file.csv -r datecol -r1 2017-01-01 -r2 2017-06-01 -s /new-users.sql
NOTE: The -s option will only work for the WHERE
clause. If you try other clauses it will fail. The script would need to be ehanced to do more.
Security
It may be wise to create a read-only user on the database for the script to use. This will improve the security of the script by further protecting against SQL injection. For more information on how to do this, check the manual for your database type.