The best way to load data into LOFT is through the pg2disk utility. It uses psql to connect to a Postgres, Greenplum or Deepgreen database and downloads the data to the local disk.
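Since pg2disk drives psql, it is worth confirming that psql can reach the source database before setting up any schedules. A quick connectivity check against the example source used later in this section (hostname, port, user and database are illustrative) is:
% psql -h dbhost1 -p 5432 -U scott -d production -c 'select 1'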
Usage: pg2disk -c -D BASEDIR SCHEDULENAME SCHEDULEFILE
The -c flag triggers a check for another active pg2disk. If another pg2disk is already running, the current one exits normally. This is useful for preventing overlapping invocations of pg2disk when it is launched by a cron daemon.
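For example, a crontab entry along the following lines (assuming pg2disk is on cron's PATH and that the schedule file sits at an illustrative /home/scott/schedfile) runs the hourly schedule at the top of every hour, with -c guarding against a previous run that has not yet finished:
0 * * * * pg2disk -c -D /data/loftdata/base hourly /home/scott/schedfile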
The -D flag specifies the base directory where the data files are stored. Continuing with the example in the introduction, this would be /data/loftdata/base.
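If the base directory does not exist yet, it is simplest to create it before the first run (this guide does not assume pg2disk creates missing directories for you):
% mkdir -p /data/loftdata/base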
SCHEDULENAME is an arbitrary job-schedule name, as specified in the SCHEDULEFILE.
SCHEDULEFILE is the path to a file that lists the job schedules and their purposes. Each job is specified by one line of the file, in this format:
schedulename:source-clause:destination-clause:pname:condition-clause
Again, the schedulename is an arbitrary name given to identify a group of jobs. For example, if we want to separate the jobs into three kinds depending on how often they are invoked, we may name the schedules monthly, daily and hourly.
The source-clause specifies the origin of the data. It is of the form:
src(host, port, dbuser, dbname, schema, tablename) e.g. src(dbhost1, 5432, scott, glog, public, weblog)
The destination-clause specifies the target of the copy operation. It allows the user to specify a different name for the replica. In practice, the names are usually the same for source and target, and you may leave the destination names unspecified, in which case they default to the source names.
dst(dbname, schema, tablename) e.g. dst(glog, public, weblog) or dst()
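As an illustration, a hypothetical destination clause that stores the replica under different database and table names could look like this (the names are made up for this example):
dst(analytics, public, weblog_copy)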
The optional pname specifies a partition name. If unspecified, it will be set to the partition named default.
The condition-clause is also optional. It lets you specify a SQL WHERE filter to copy only a subset of the data.
For the purpose of illustration, suppose our schedule file ~/schedfile lists the following lines:
hourly:src(dbhost1, 5432, scott, production, public, nations):dst()::
bluemoon:src(dbhost1, 5432, scott, production, public, orders):dst():p2015q1:o_date between '2015-01-01' and '2015-03-31'
bluemoon:src(dbhost1, 5432, scott, production, public, orders):dst():p2015q2:o_date between '2015-04-01' and '2015-06-30'
bluemoon:src(dbhost1, 5432, scott, production, public, orders):dst():p2015q3:o_date between '2015-07-01' and '2015-09-30'
bluemoon:src(dbhost1, 5432, scott, production, public, orders):dst():p2015q4:o_date between '2015-10-01' and '2015-12-31'
daily:src(dbhost1, 5432, scott, production, public, orders):dst():p2016q1:o_date between '2016-01-01' and '2016-03-31'
hourly:src(dbhost1, 5432, scott, production, public, orders):dst():p2016q2:o_date between '2016-04-01' and '2016-06-30'
never:src(dbhost1, 5432, scott, production, public, orders):dst():p2016q3:o_date between '2016-07-01' and '2016-09-30'
never:src(dbhost1, 5432, scott, production, public, orders):dst():p2016q4:o_date between '2016-10-01' and '2016-12-31'
We can invoke pg2disk on the hourly schedule to copy the nations table in its entirety, along with the fragment of the orders table holding the Q2 2016 records. Note that you may need to create a .pgpass file for this to run successfully.
% pg2disk -c -D /data/loftdata/base hourly ~/schedfile
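If the source database requires a password, psql reads it from ~/.pgpass. An entry for the source above follows the standard host:port:database:user:password format (the password here is a placeholder):
dbhost1:5432:production:scott:yourpassword
The file must be readable only by its owner (chmod 600 ~/.pgpass), or psql will ignore it.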
There is also a special name __all__ that matches all schedule names. Therefore, the following command will copy all tables from the production database as user scott:
% pg2disk -c -D /data/loftdata/base __all__ ~/schedfile
When you start LOFT, it will create these directories under /data/loftdata:
When you load data, LOFT will create three levels of directories under /data/loftdata/base, corresponding to your database name, schema name, and table name. A data file is then created for each partition of the table under the table directory. In addition, two SQL files are created to reflect the DDL of the table for Vitesse DB and Deepgreen DB.
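To check the result on disk, an ordinary directory listing is enough; for the example source database the levels are production, then public, then one directory per table (nations and orders):
% find /data/loftdata/base -type d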
For the example schedule file above, you would expect to see these files: