Greenplum Data Connection

  • Last update: September 27, 2024
  • Overview

    Version

    FineDataLink Version    Functional Change

    4.0.4
    • Supported connection to Greenplum (Parallel Loading) under System Management > Data Connection > Data Connection Management.
    • Data Input supported Greenplum and Greenplum (Parallel Loading).
    • Data Output supported Greenplum and Greenplum (Parallel Loading).

    4.0.14
    • The deployment package had a built-in gpfdist file of Greenplum (Parallel Loading).

    4.0.29
    • Greenplum (Parallel Loading) and Pivotal Greenplum Database were merged as Pivotal Greenplum Database at New Data Connection.

    4.1.2
    • Scheduled Task supported the use of the COPY command to write data (including binary fields and JSON fields) into Greenplum databases.
    • In the Parallel Loading mode, the writing of JSON fields was supported.
    • In the Parallel Loading mode, the write method Append/Update/Delete Data Based on Identifier Field and the following strategies for primary key conflict were supported:
        Ignore Source Data If Same Primary Key Exists
        Record as Dirty Data If Same Primary Key Exists
        Overwrite Data in Target Table If Same Primary Key Exists

    Function Description 

    Scheduled Task supports reading data from and writing data into the Greenplum database.

    Pipeline Task supports data writing into the Greenplum database.

    Data Service supports the Greenplum database.

    Configuration Instruction 

    Pipeline Task

    If Greenplum is selected as the target data source in a pipeline task, the COPY loading mode is used.

    Using COPY loading requires specific database privileges.

    1. Assign users who need to use a Greenplum data connection the privilege to create schemas in the corresponding database.

    2. Create an fdl_temp schema in the target database to store temporary tables, and assign users the privilege to create tables in this schema.

    The example command is as follows:

    -- Allow trans_user to use the fdl_temp schema and create tables in it.
    GRANT USAGE, CREATE ON SCHEMA fdl_temp TO trans_user;
    -- Grant trans_user access to tables created in fdl_temp from now on.
    ALTER DEFAULT PRIVILEGES IN SCHEMA fdl_temp GRANT SELECT, INSERT, UPDATE, DELETE, REFERENCES, TRIGGER ON TABLES TO trans_user;
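
    The two steps above can be sketched end to end as follows, run by a database administrator. This is a minimal sketch under stated assumptions, not a required procedure: target_db is a hypothetical database name, trans_user is the example role used in this section, and the final query only verifies the result with the has_schema_privilege function.

```sql
-- Step 1 (assumption: target_db is a placeholder for your database name).
-- CREATE on a database lets the role create schemas in it.
GRANT CREATE ON DATABASE target_db TO trans_user;

-- Step 2: create the schema that holds the temporary tables.
-- Skip this statement if fdl_temp already exists.
CREATE SCHEMA fdl_temp;
GRANT USAGE, CREATE ON SCHEMA fdl_temp TO trans_user;

-- Verify: both columns should be true for trans_user.
SELECT has_schema_privilege('trans_user', 'fdl_temp', 'USAGE')  AS can_use,
       has_schema_privilege('trans_user', 'fdl_temp', 'CREATE') AS can_create;
```

    According to the COPY Loading note in this document, database users do not need the privilege to create schemas when the database administrator has already created the schema and assigned privileges on it, so the first statement may be unnecessary in that case.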

    Scheduled Task 

    When Greenplum is selected as the data source in a scheduled task, three load methods are supported: Parallel Loading, COPY Loading, and Common Loading. The differences among the three load methods are shown in the following table.

    Load Method    Difference
    Common Loading

    1. This method is not suitable for writing data into Greenplum databases.

    2. If you only need to read data from a Greenplum database, configure the data connection following the steps in the section "Configuration Without Parallel Loading Setting" of this document.

    Parallel Loading

    1. FineDataLink 4.1.2 and later releases support the writing of JSON fields, but do not support the writing of binary fields.

    2. Parallel loading outperforms COPY loading in scenarios with large data volumes and large-scale clusters.

    3. Configure the data connection following the steps in the section "Configuration with Parallel Loading Setting" of this document.

    Note:
    Using Parallel Loading requires specific privileges. For details, see the Assigning the Privilege for Parallel Loading section of this document.


    COPY Loading (New in V4.1.2)

    1. It supports the writing of binary fields and JSON fields.

    2. Configure the data connection following the steps in the section "Configuration Without Parallel Loading Setting" of this document.

    Note:
    To use COPY Loading, you need to create an fdl_temp schema in the target database to store temporary tables and assign users the privilege to create tables in that schema. (If the database administrator has created the schema and assigned privileges on it, database users do not need the privilege to create schemas.)


    Assigning the Privilege for Parallel Loading

    Using Greenplum as the target data source in the parallel loading mode requires specific privileges.

    1. Assign privileges to create tables and read existing tables in the gpfdist_temp schema.

    Note:
    If you do not want to assign the read privilege on existing tables, stop the tasks that use parallel loading and delete the ext_gpload_* and staging_gpload_* tables in the gpfdist_temp schema. After that, you only need to assign the privilege to create tables in the schema.


    GRANT USAGE, CREATE ON SCHEMA gpfdist_temp TO trans_user;

    2. Assign privileges to create external tables.

    ALTER ROLE trans_user WITH CREATEEXTTABLE;

    3. Assign read privileges on the target table. Using Auto Table Creation also requires the privilege to create tables in the corresponding database.

    ALTER DEFAULT PRIVILEGES IN SCHEMA gpfdist_temp GRANT SELECT, INSERT, UPDATE, DELETE, REFERENCES, TRIGGER ON TABLES TO trans_user;
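
    Two points in the steps above have no example command: the cleanup described in the note under step 1, and the read privilege on the target table in step 3. The following is a minimal sketch of both, assuming the catalog layout of a PostgreSQL-based Greenplum database. The name target_schema.target_table is a hypothetical placeholder, and trans_user is the example role used in this section.

```sql
-- Note under step 1: list leftover parallel-loading relations before deleting them.
-- "!" is declared as the LIKE escape character so that "_" is matched literally.
SELECT n.nspname AS schema_name, c.relname AS relation_name
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'gpfdist_temp'
  AND (c.relname LIKE 'ext!_gpload!_%' ESCAPE '!'
       OR c.relname LIKE 'staging!_gpload!_%' ESCAPE '!');

-- Step 3 (assumption: target_schema.target_table is a placeholder name).
GRANT SELECT ON TABLE target_schema.target_table TO trans_user;
```

    Reviewing the list before dropping anything helps avoid removing tables that a running task still uses.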

    Assigning the Privilege for COPY Loading

    For details, see the Pipeline Task section of this article.

    Data Service

    Using a Greenplum database in Data Service requires configuring Parallel Loading Setting. For details on data services, see Overview of Data Service.

    Configuration with Parallel Loading Setting

    Prerequisite 

    Confirming the Database Version

    Greenplum Database (Parallel Loading) 5.x and 6.x are supported.
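
    One way to confirm the version is to query the server directly. This is a minimal sketch; the exact wording of the returned string varies by build, so look for the Greenplum Database version number within it.

```sql
-- Returns a single text row describing the server build,
-- including the Greenplum Database version number.
SELECT version();
```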

    Confirming the Data Type

    Binary fields cannot be synchronized in the parallel loading mode; attempting to load them triggers an error message. Binary fields can only be loaded via JDBC. For details, see the data connection procedure in this section.
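
    To find binary fields before choosing a load method, the table definitions on the Greenplum side can be inspected. The sketch below assumes that binary fields are stored as bytea columns and uses public as a placeholder schema name; adjust both to your environment.

```sql
-- Assumption: 'public' is a placeholder for the schema of the tables you plan to load.
SELECT table_schema, table_name, column_name
FROM information_schema.columns
WHERE data_type = 'bytea'
  AND table_schema = 'public'
ORDER BY table_name, ordinal_position;
```

    Any table returned by this query contains at least one binary column and is therefore a candidate for a load method other than parallel loading.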

    Placing the gpfdist File

    The related operations and storage location of the gpfdist file are shown in the following table.

    FineDataLink Project:
    1. Projects before 4.0.55
    2. Projects upgraded from a version before 4.0.14 to a version before 4.0.21
    Operation: See the following content of this section.
    File Location: \webapps\webroot\WEB-INF

    FineDataLink Project: Projects upgraded from a version before 4.0.14 to 4.0.21 and later versions
    Operation: See the following content of this section.
    File Location: \webapps\webroot\WEB-INF\assist

    FineDataLink Project: Projects deployed with 4.0.14 and later installation packages
    Operation: The driver is built in. Ignore this section.


    Linux System:

    You can download the package for Linux systems: gpfdist_linux.tar.gz

    1. Upload the downloaded package to the Linux server, and then extract it to the \webapps\webroot\WEB-INF\assist directory.

    Note:
    The installation directory must not contain spaces. Otherwise, gpload cannot read the file.


    2. Place the gpfdist file (in the bin folder) at the same level as the lib folder, and then delete the bin folder.

    3. Rename the gpfdist_linux folder to gpfdist.

    The effect is shown in the following figure.

    Windows System:

    1. Obtain the installation package.

    Create a gpfdist folder in the \webapps\webroot\WEB-INF\assist directory, compile the obtained package into an EXE file, and place the file in the folder.

    2. Check whether the server where the database is located can access port 15500 of the FineDataLink project server, because the database needs to read the CSV file generated by FineDataLink for loading.

    3. Check whether the account used to create the Greenplum data connection has the privilege to create schemas and tables.

    Note:

    1. For Windows systems, the gpfdist file (which has been pre-compiled for Linux systems) must be compiled into an EXE file from the source code. Windows systems do not support the integration of gpfdist-related components (which can be integrated into Linux systems).

    2. On Windows systems, the maximum data size of a single row is 1 MB, which cannot be modified.
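
    Checks 2 and 3 above can be approximated from the database side. The following is a sketch under several assumptions rather than a documented FineDataLink procedure: fdl-host and probe.csv are hypothetical placeholders, a gpfdist process is assumed to be listening on port 15500 of the FineDataLink server, and the role running the statements is assumed to be allowed to create external tables. trans_user is the example role used earlier in this document.

```sql
-- Check 3: CREATE on the database allows the role to create schemas in it.
SELECT has_database_privilege('trans_user', current_database(), 'CREATE') AS can_create_schema;

-- Check 2 (assumptions: fdl-host and probe.csv are placeholders, and a gpfdist
-- process is already listening on port 15500 of the FineDataLink server).
CREATE EXTERNAL TABLE gpfdist_port_probe (line text)
LOCATION ('gpfdist://fdl-host:15500/probe.csv')
FORMAT 'TEXT';

-- The read is performed by the segments, so a connection failure here suggests
-- that the port is not reachable from the database side.
SELECT count(*) FROM gpfdist_port_probe;

-- Remove the probe table afterwards.
DROP EXTERNAL TABLE gpfdist_port_probe;
```

    An error about a missing file, as opposed to a connection error, would still indicate that the port itself is reachable.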

    Data Connection Procedure 

    Uploading the Driver

    Download the driver package and upload it to FineDataLink. For the specific steps of uploading the driver package, see Driver Management.

    Driver Package Download
    Download the latest version of the PostgreSQL driver.

    Data Connection Configuration

    1. Log in to FineDataLink, choose System Management > Data Connection > Data Connection Management > New Data Connection, and click Pivotal Greenplum Database.

    Note:

    1. If you are not the admin, you can configure data connections only after the admin assigns you permission on Data Connection under Permission Management > System Management. For details, see Data Connection Management Permission.

    2. For FineDataLink versions before 4.0.29, select Greenplum (Parallel Loading) when creating the data connection.

    2. Fill in the connection information. Select Custom, and then select the driver uploaded in the Uploading the Driver section.

    Pattern cannot be set until the database is connected. Click the Click to Connect Database button, and then click Pattern, as shown in the following figure.

    3. Configure Parallel Loading Setting if you need to write data into Greenplum databases.

    Configuration Item    Description

    Server Address - Node 1

    Enter the path of the gpfdist file mentioned in the Placing the gpfdist File section, and ensure that the segments (SEG) of the database can access it on the FineDataLink server.

    If the project is deployed in a clustered environment, multiple configuration items will be displayed in the format of Server Address - Node x. Type the path in the drop-down box.

    Temporary Table Reuse

    Determine whether to reuse temporary tables. (Reusing temporary tables can effectively reduce the table growth rate during high-frequency loading.)

    If it is set to Yes, the gpfdist_temp schema will be automatically created and used during runtime.

    Limit on Temporary File Quantity

    Set the maximum number of temporary files that can be written into the disk. Adjust the value according to the disk size and the network speed.

    Default value: 100,000. Range: 10,000 to 100,000,000. Required.

    Limit on Temporary File Size (MB)

    Set the maximum size of the file that can be written into the disk. When either Limit on Temporary File Quantity or Limit on Temporary File Size (MB) is reached, data file writing stops, and file loading starts immediately.

    Default value: 1024. Range: 10 to 102400. Required.

    4. Click Test Connection. If the connection is successful, click Save to save the configuration.
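
    If Temporary Table Reuse is set to Yes, the gpfdist_temp schema is created automatically at runtime, as described in the table above. After a task has run, its presence can be confirmed with a catalog query. This is a minimal sketch that assumes the standard pg_namespace catalog of a PostgreSQL-based Greenplum database.

```sql
-- Returns one row if the gpfdist_temp schema exists, and no rows otherwise.
SELECT nspname
FROM pg_namespace
WHERE nspname = 'gpfdist_temp';
```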

    Configuration Without Parallel Loading Setting

    Read the Configuration Instruction section of this document carefully.

    Database Version

    Greenplum 5.x and 6.x are supported.

    Data Connection Procedure

    The procedure is the same as that in the Configuration with Parallel Loading Setting section, except that you do not need to configure Parallel Loading Setting.


    Data Source Usage 

    For details on using Greenplum data sources in FineDataLink, see Instruction on Greenplum Data Sources.

    Scheduled Task supports reading data from and writing data into the Greenplum database. For details, see Overview of Data Development.

    Pipeline Task supports data writing into the Greenplum database. For details, see Overview of Data Pipeline.

    Using a Greenplum database in Data Service requires configuring Parallel Loading Setting. For details on data services, see Overview of Data Service.

