The process to extract data from Redshift can be as simple as running an
UNLOAD command. However, the
UNLOAD command has some limitations. The biggest limitation is not allowing you to include a header row in your output.
We created a service that wraps the Redshift
UNLOAD command with some extra capabilities. One of these capabilities is that the service automatically retrieves and adds headers to the extract from the convenience of a Docker container. Everything is built into the container, including
psycopg, OS and other Python packages. What you will need to have is the connection details to your Redshift host and the AWS credentials to write the
UNLOADED data to an S3 bucket.
How To Run
The service requires a configuration file named
config.json in the same directory. The config file is comprised of the following parameters set here to obtain database connection info, AWS credentials and any
UNLOAD options you prefer to use.
A sample configuration file is below.
This command will unload the data in the table
mytable using the
datecol to specify a export period. It will then export to the specified S3 location.
Running inside a container:
python unload.py -t mytable -f s3://dest-bucket/foo/bar/output_file.csv -r datecol -r1 2017-01-01 -r2 2017-06-01
Running it via Docker command:
docker run -it openbridge/redshift-unload python /unload.py -t mytable -f s3://dest-bucket/foo/bar/output_file.csv -r datecol -r1 2017-01-01 -r2 2017-06-01
The service can accept different runtime parameters:
-t: The table you wish to UNLOAD
-f: The S3 key at which the file will be placed
-s (Optional): The file you wish to read a custom valid SQL WHERE clause from. This will be sanitized then inserted into the UNLOAD command.
-r (Optional): The range column you wish to use to constrain the results. Any type supported by Redshift's BETWEEN function is accepted here (date, integer, etc.)
-r1 (Optional): The desired start range to constrain the result set
-r2 (Optional): The desired end range to constrain the result setNote: -s and -d are mutually exlusive and cannot be used together. If neither is used, the script will default to not specifying a WHERE clause and output the entire table.
The -s Option
As mentioned previously, it is possible to supply your own WHERE clause to be used in the UNLOAD command. This is done via an external SQL file.
Here is an example. To use this functionality to UNLOAD only new users, you createD a SQL file called
new-users.sql that contains
WHERE is_new = true. Pretty simple.
python unload.py -t mytable -f s3://dest-bucket/foo/bar/output_file.csv -r datecol -r1 2017-01-01 -r2 2017-06-01 -s /new-users.sql
NOTE: The -s option will only work for the
WHERE clause. If you try other clauses it will fail. The script would need to be ehanced to do more.
It may be wise to create a read-only user on the database for the script to use. This will improve the security of the script by further protecting against SQL injection. For more information on how to do this, check the manual for your database type.