The best way to load data into LOFT is through the pg2disk utility. It uses psql to connect to a Postgres, Greenplum, or Deepgreen database and downloads the data to the local disk.
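
Because pg2disk drives psql under the hood, a quick sanity check before scheduling any jobs is to connect to the source database with psql directly, using the same host, port, user, and database that your schedule file will name. The values below match the src() example shown later in this section:

% psql -h dbhost1 -p 5432 -U scott -d glog -c '\conninfo'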

Usage: pg2disk [-c] -D BASEDIR SCHEDULENAME SCHEDULEFILE

The -c flag makes pg2disk check for another active pg2disk instance. If one is already running, the current invocation exits normally. This is useful for preventing overlapping invocations of pg2disk when it is launched by a cron daemon.
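
For example, a crontab entry along these lines would launch the hourly schedule at the top of every hour; the install path and schedule file path are illustrative, and absolute paths are used because cron does not expand ~:

0 * * * * /usr/local/bin/pg2disk -c -D /data/loftdata/base hourly /home/scott/schedfile

With -c in place, an hour in which the previous run is still copying simply results in the new invocation exiting without doing any work.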

The -D flag specifies the base directory where the data files are placed. Continuing with our example in the introduction, this would be /data/loftdata/base.

SCHEDULENAME is an arbitrary job schedule name, as specified in the SCHEDULEFILE.

SCHEDULEFILE specifies the path to a file that describes the job schedules. Each job is specified by one line in the file, in this format:

schedulename:source-clause:destination-clause:pname:condition-clause

Again, the schedulename is an arbitrary name given to identify a group of jobs. For example, if we want to separate the jobs into three kinds depending on how often they are invoked, we may name the schedules monthly, daily and hourly.

The source-clause specifies the origin of the data. It is of the form:

src(host, port, dbuser, dbname, schema, tablename)

e.g. src(dbhost1, 5432, scott, glog, public, weblog)

The destination-clause specifies the target of the copy operation. It allows the user to give the replica a different name. In practice, the source and target names are usually the same, so you may leave the destination names unspecified and they will default to the source names.

dst(dbname, schema, tablename)

e.g. dst(glog, public, weblog) or dst()

The optional pname specifies a partition name. If unspecified, it defaults to the partition named default.

The condition-clause is also optional. It specifies a SQL WHERE filter so that only a subset of the data is copied.
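
Putting the clauses together, here is a complete entry that reuses the weblog source from above; the schedule name, partition name, column, and cutoff date are illustrative:

daily:src(dbhost1, 5432, scott, glog, public, weblog):dst():p2024:log_date >= '2024-01-01'

This would copy only the weblog rows from 2024 onward into a partition named p2024.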

Example

For the purpose of illustration, suppose our schedule file ~/schedfile lists the following lines:

hourly:src(dbhost1, 5432, scott, production, public, nations):dst()::

bluemoon:src(dbhost1, 5432, scott, production, public, orders):dst():p2015q1:o_date between '2015-01-01' and '2015-03-31'
bluemoon:src(dbhost1, 5432, scott, production, public, orders):dst():p2015q2:o_date between '2015-04-01' and '2015-06-30'
bluemoon:src(dbhost1, 5432, scott, production, public, orders):dst():p2015q3:o_date between '2015-07-01' and '2015-09-30'
bluemoon:src(dbhost1, 5432, scott, production, public, orders):dst():p2015q4:o_date between '2015-10-01' and '2015-12-31'
   daily:src(dbhost1, 5432, scott, production, public, orders):dst():p2016q1:o_date between '2016-01-01' and '2016-03-31'
  hourly:src(dbhost1, 5432, scott, production, public, orders):dst():p2016q2:o_date between '2016-04-01' and '2016-06-30'
   never:src(dbhost1, 5432, scott, production, public, orders):dst():p2016q3:o_date between '2016-07-01' and '2016-09-30'
   never:src(dbhost1, 5432, scott, production, public, orders):dst():p2016q4:o_date between '2016-10-01' and '2016-12-31'

We can invoke pg2disk with the hourly schedule to copy the nations table in its entirety, along with the fragment of the orders table holding the Q2 2016 records. Note that you may need to create a ~/.pgpass file for this to run successfully (see the example after the command below).

% pg2disk -c -D /data/loftdata/base hourly ~/schedfile
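
The ~/.pgpass file follows the standard Postgres format of hostname:port:database:username:password, one entry per line, and psql ignores it unless it is readable only by its owner. An entry for the example source would look like this, with secret standing in for the real password:

dbhost1:5432:production:scott:secret

% chmod 600 ~/.pgpass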

There is also a special name __all__ that matches every schedule name. Therefore, the following command will copy every table listed in the schedule file, all of which in this example come from the production database as user scott:

% pg2disk -c -D /data/loftdata/base __all__ ~/schedfile

Data Organization

When you start LOFT, it will create these directories under /data/loftdata:

/data/loftdata/log/
    Stores the log files for diagnostics.
/data/loftdata/base/
    This is where LOFT stores the data.

When you load data, LOFT will create three levels of directories under /data/loftdata/base, corresponding to your database name, schema name, and table name. One data file is then created per partition of the table, under the table directory. In addition, two SQL files are created that capture the DDL of the table, one for Vitesse DB and one for Deepgreen DB.

For the example schedule file above, you would expect to see these files:
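
(The data file and SQL file names below are illustrative, assuming data files are named after their partitions; only the directory levels are specified above.)

/data/loftdata/base/production/public/nations/default
/data/loftdata/base/production/public/nations/nations.vitesse.sql
/data/loftdata/base/production/public/nations/nations.deepgreen.sql
/data/loftdata/base/production/public/orders/p2015q1
/data/loftdata/base/production/public/orders/p2015q2
/data/loftdata/base/production/public/orders/p2015q3
/data/loftdata/base/production/public/orders/p2015q4
/data/loftdata/base/production/public/orders/p2016q1
/data/loftdata/base/production/public/orders/p2016q2
/data/loftdata/base/production/public/orders/p2016q3
/data/loftdata/base/production/public/orders/p2016q4
/data/loftdata/base/production/public/orders/orders.vitesse.sql
/data/loftdata/base/production/public/orders/orders.deepgreen.sql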