Greenplum Data Connection - FineDataLink Help Document

Last update: September 27, 2024

Overview

Version

FineDataLink Version	Functional Change
4.0.4	Supported connection to Greenplum (Parallel Loading) under System Management > Data Connection > Data Connection Management. Data Input supported Greenplum and Greenplum (Parallel Loading). Data Output supported Greenplum and Greenplum (Parallel Loading).
4.0.14	The deployment package included a built-in gpfdist file for Greenplum (Parallel Loading).
4.0.29	Greenplum (Parallel Loading) and Pivotal Greenplum Database were merged into a single option, Pivotal Greenplum Database, under New Data Connection.
4.1.2	Scheduled Task supported using the COPY command to write data (including binary fields and JSON fields) into Greenplum databases. In the Parallel Loading mode, writing JSON fields was supported. In the Parallel Loading mode, the write method Append/Update/Delete Data Based on Identifier Field and the following strategies for primary key conflict were supported: (1) Ignore Source Data If Same Primary Key Exists; (2) Record as Dirty Data If Same Primary Key Exists; (3) Overwrite Data in Target Table If Same Primary Key Exists.

Function Description

Scheduled Task supports reading data from and writing data into Greenplum databases.

Pipeline Task supports writing data into Greenplum databases.

Data Service supports Greenplum databases.

Configuration Instruction

Pipeline Task

If Greenplum is selected as the target data source in a pipeline task, the COPY loading mode is used.

COPY loading requires the following database privileges.

1. Grant users who need to use a Greenplum data connection the privilege to create schemas in the corresponding database.

2. Create an fdl_temp schema in the target database to store temporary tables, and grant users the privilege to create tables in this schema.

The example command is as follows:

-- Allow trans_user to use the fdl_temp schema and create tables in it.
GRANT USAGE, CREATE ON SCHEMA fdl_temp TO trans_user;
-- Give trans_user these privileges on tables that the current role later creates in fdl_temp.
ALTER DEFAULT PRIVILEGES IN SCHEMA fdl_temp GRANT SELECT, INSERT, UPDATE, DELETE, REFERENCES, TRIGGER ON TABLES TO trans_user;

Scheduled Task

When Greenplum is selected as the data source in a scheduled task, three load methods are supported: Parallel Loading, COPY Loading, and Common Loading. The following table shows how they differ.

Load Method	Difference
Common Loading	1. This method is not suitable for writing data into Greenplum databases. 2. If you only need to read data from a Greenplum database, configure the data connection following the steps in the "Configuration Without Parallel Loading Setting" section of this document.
Parallel Loading	1. FineDataLink 4.1.2 and later versions support writing JSON fields but do not support writing binary fields. 2. Parallel loading outperforms COPY loading in scenarios with large data volumes and large-scale clusters. 3. Configure the data connection following the steps in the "Configuration with Parallel Loading Setting" section of this document. Note: Parallel Loading requires specific privileges. See the "Assigning the Privilege for Parallel Loading" section of this document.
COPY Loading (New in V4.1.2)	1. It supports writing binary fields and JSON fields. 2. Configure the data connection following the steps in the "Configuration Without Parallel Loading Setting" section of this document. Note: To use COPY Loading, you need to create an fdl_temp schema in the target database to store temporary tables and grant users the privilege to create tables in that schema. (If the database administrator has already created the schema and granted privileges on it, database users do not need the privilege to create schemas.)

Assigning the Privilege for Parallel Loading

Using Greenplum as the target data source in the parallel loading mode requires the following privileges.

1. Assign privileges to create tables and read existing tables in the gpfdist_temp schema.

Note:

If you don't want to assign the read privilege on existing tables, stop the task that uses parallel loading and delete the ext_gpload_* and staging_gpload_* tables in the gpfdist_temp schema. After that, you only need to assign the privilege to create tables in the schema.

GRANT USAGE,CREATE ON SCHEMA gpfdist_temp TO trans_user ;

2. Assign privileges to create external tables.

ALTER ROLE trans_user WITH CREATEEXTTABLE;

3. Assign read privileges on the target table. Using Auto Table Creation requires table creation privilege on corresponding databases.

ALTER DEFAULT PRIVILEGES IN SCHEMA gpfdist_temp GRANT SELECT, INSERT, UPDATE, DELETE, 
REFERENCES, TRIGGER ON TABLES TO trans_user ;
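
The three statements above can be collected into one script per role. The following Python sketch only builds the SQL text and does not connect to any database. The function name parallel_loading_grants and the identifier check are illustrative assumptions, not part of FineDataLink; trans_user and gpfdist_temp are the names used in this section.

```python
import re

# Conservative pattern for unquoted SQL identifiers (an assumption of this sketch).
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def parallel_loading_grants(role: str, schema: str = "gpfdist_temp") -> list[str]:
    """Return the SQL statements from steps 1 to 3 for the given role."""
    for name in (role, schema):
        # Reject anything that is not a plain identifier to avoid SQL injection.
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return [
        # Step 1: create tables in the temporary schema.
        f"GRANT USAGE, CREATE ON SCHEMA {schema} TO {role};",
        # Step 2: create external tables.
        f"ALTER ROLE {role} WITH CREATEEXTTABLE;",
        # Step 3: default privileges on tables created in the schema.
        f"ALTER DEFAULT PRIVILEGES IN SCHEMA {schema} "
        f"GRANT SELECT, INSERT, UPDATE, DELETE, REFERENCES, TRIGGER "
        f"ON TABLES TO {role};",
    ]


if __name__ == "__main__":
    for statement in parallel_loading_grants("trans_user"):
        print(statement)
```

Review the printed statements before a database administrator runs them in the target database.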

Assigning the Privilege for COPY Loading

For details, see the Pipeline Task section of this article.

Data Service

Using a Greenplum database in Data Service requires configuring Parallel Loading Setting. For details on data services, see Overview of Data Service.

Configuration with Parallel Loading Setting

Prerequisite

Confirming the Database Version

Greenplum Database (Parallel Loading) 5.x and 6.x are supported.

Confirming the Data Type

Binary fields cannot be synchronized in the parallel loading mode; attempting to load them triggers an error message. Binary fields can only be loaded via JDBC. For details, see the data connection procedure in this section.

Placing the gpfdist File

The related operations and storage location of the gpfdist file are shown in the following table.

FineDataLink Project	Operation	File Location
1. Projects before 4.0.55 2. Projects upgraded from a version before 4.0.14 to a version before 4.0.21	See the following content of this section.	\webapps\webroot\WEB-INF
Projects upgraded from a version before 4.0.14 to 4.0.21 and later versions	See the following content of this section.	\webapps\webroot\WEB-INF\assist
Projects deployed with 4.0.14 and later installation packages	The driver is built in. Ignore this section.	-

Linux System:

You can download the package for Linux systems: gpfdist_linux.tar.gz

1. Upload the downloaded package to the Linux server, and then extract it to the \webapps\webroot\WEB-INF\assist directory.

Note:

The installation directory path must not contain spaces; otherwise, gpload cannot read the file.

2. Place the gpfdist file (in the bin folder) at the same level as the lib folder, and then delete the bin folder.

3. Rename the gpfdist_linux folder to gpfdist.

The effect is shown in the following figure.
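
The resulting layout can also be checked with a short script. The following Python sketch is a minimal check under the assumptions of the Linux steps above (a gpfdist folder under assist that contains the gpfdist file next to the lib folder, with no bin folder left). The function name check_gpfdist_layout is hypothetical and not part of FineDataLink.

```python
from pathlib import Path


def check_gpfdist_layout(web_inf: Path) -> list[str]:
    """Return a list of problems found; an empty list means the layout looks right."""
    problems = []
    base = web_inf / "assist" / "gpfdist"
    # The installation directory must not contain a space.
    if " " in str(web_inf.resolve()):
        problems.append("installation path contains a space")
    # The gpfdist file must sit at the same level as the lib folder.
    if not (base / "gpfdist").is_file():
        problems.append(f"missing file: {base / 'gpfdist'}")
    if not (base / "lib").is_dir():
        problems.append(f"missing folder: {base / 'lib'}")
    # The bin folder should have been deleted in step 2.
    if (base / "bin").exists():
        problems.append(f"leftover folder: {base / 'bin'}")
    return problems
```

Call it with the WEB-INF directory of your project, for example check_gpfdist_layout(Path("webapps/webroot/WEB-INF")) from the installation directory, and fix any problem it reports.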

Windows System:

1. Obtain the installation package.

Create a gpfdist folder in the \webapps\webroot\WEB-INF\assist directory, convert the obtained package into an EXE file (see the notes below), and place the EXE file in the folder.

2. Check that the server hosting the database can access port 15500 of the FineDataLink project server, because the database needs to read the CSV files that FineDataLink generates for loading.

3. Check that the account used to create the Greenplum data connection has the privileges to create schemas and tables.

Note:

1. For Windows systems, the gpfdist file (which is pre-compiled for Linux systems only) must be compiled into an EXE file from the source code. The gpfdist-related components can be integrated into Linux systems but not into Windows systems.

2. On Windows systems, the maximum data size of a single row is 1 MB, and this limit cannot be modified.
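
The port check in step 2 can be sketched with a plain TCP connection test, run from the server where the database is located. This is a minimal sketch: the function name can_reach and the host name fdl-server.example.com are placeholders, and only port 15500 comes from this document. A False result can also mean that nothing is listening on the port at that moment, so treat it as a hint rather than proof of a firewall problem.

```python
import socket


def can_reach(host: str, port: int = 15500, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder host name; replace it with the FineDataLink project server.
    host = "fdl-server.example.com"
    print(f"{host}:15500 reachable: {can_reach(host)}")
```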

Data Connection Procedure

Uploading the Driver

Download the driver package and upload it to FineDataLink. For the specific steps of uploading the driver package, see Driver Management.

Driver Package Download
Download the latest version of the PostgreSQL driver.

Data Connection Configuration

1. Log in to FineDataLink, choose System Management > Data Connection > Data Connection Management > New Data Connection, and click Pivotal Greenplum Database.

Note:

1. If you are not the admin, you can configure data connections only after the admin assigns you permission on Data Connection under Permission Management > System Management. For details, see Data Connection Management Permission.

2. For FineDataLink before 4.0.29, select Greenplum (Parallel Loading) when creating the data connection.

2. Fill in the connection information. Select Custom, and select the uploaded driver mentioned in the Uploading the Driver section.

You can set Pattern only after the database is connected. Click the Click to Connect Database button, and then click Pattern, as shown in the following figure.

3. Configure Parallel Loading Setting if you need to write data into Greenplum databases.

Configuration Item	Description
Server Address - Node 1	Enter the path of the gpfdist file mentioned in the Placing the gpfdist File section, and make sure that the Greenplum segments (SEG) can access it on the FineDataLink server. If the project is deployed in a clustered environment, one configuration item is displayed per node in the format Server Address - Node x. Type the path in the drop-down box.
Temporary Table Reuse	Determine whether to reuse temporary tables. (Reusing temporary tables can effectively reduce the table growth rate during high-frequency loading.) If it is set to Yes, the gpfdist_temp schema will be automatically created and used during runtime.
Limit on Temporary File Quantity	Set the maximum number of temporary files that can be written into the disk. Adjust the value according to the disk size and the network speed. Default value: 100,000. Range: 10,000 to 100,000,000. Required.
Limit on Temporary File Size (MB)	Set the maximum size of the file that can be written into the disk. When either Limit on Temporary File Quantity or Limit on Temporary File Size (MB) is reached, data file writing stops, and file loading starts immediately. Default value: 1024. Range: 10 to 102400. Required.

4. Click Test Connection. If the connection is successful, click Save to save the configuration.
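
The interplay between Limit on Temporary File Quantity and Limit on Temporary File Size (MB) in step 3 can be sketched as follows. This is an illustrative model, not FineDataLink's implementation: the class name TempFileLimits and its methods are hypothetical, and the sketch assumes that the size limit applies to the total data written so far, which the table does not state explicitly. The defaults and ranges are the ones listed in the table.

```python
from dataclasses import dataclass


@dataclass
class TempFileLimits:
    """Hypothetical model of the two limits; defaults mirror the table."""

    max_files: int = 100_000
    max_size_mb: int = 1024

    def validate(self) -> None:
        """Raise ValueError if a limit is outside the documented range."""
        if not 10_000 <= self.max_files <= 100_000_000:
            raise ValueError("file quantity limit must be 10,000 to 100,000,000")
        if not 10 <= self.max_size_mb <= 102_400:
            raise ValueError("file size limit must be 10 to 102400 MB")

    def should_start_loading(self, files_written: int, size_written_mb: float) -> bool:
        """Return True once either limit is reached (writing stops, loading starts)."""
        return files_written >= self.max_files or size_written_mb >= self.max_size_mb
```

With the defaults, loading would start as soon as 100,000 files or 1024 MB have been written, whichever happens first.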

Configuration Without Parallel Loading Setting

Read the Configuration Instruction section of this document carefully before you proceed.

Database Version

Greenplum 5.x and 6.x are supported.

Data Connection Procedure

The procedure is the same as that in the Configuration with Parallel Loading Setting section, except that you do not need to configure Parallel Loading Setting.

Data Source Usage

For details on using Greenplum data sources in FineDataLink, see Instruction on Greenplum Data Sources.

Scheduled Task supports reading data from and writing data into Greenplum databases. For details, see Overview of Data Development.

Pipeline Task supports writing data into Greenplum databases. For details, see Overview of Data Pipeline.

Using a Greenplum database in Data Service requires configuring Parallel Loading Setting. For details on data services, see Overview of Data Service.
