This document explains terms specific to FineDataLink to help you use the product.
The function modules in FineDataLink include Data Development, Data Pipeline, Data Service, Task O&M, and others, meeting a series of needs such as data synchronization, processing, and cleaning.
You can develop and orchestrate tasks with SQL statements and visual operations.
For details about real-time tasks, see Overview of Real-Time Task.
Scheduled Task O&M
Pipeline Task O&M
Data Service O&M
Database Table Management
Lineage Analysis
1. Database Table Management:
For databases supporting SQL statements, you can write SQL statements to query and modify table data.
You can view data and structures of tables, modify table names and descriptions, empty, delete, and copy tables, and so on.
2. Lineage Analysis:
You can view the lineage relationship between tables used in data development tasks, pipeline tasks, and data services.
FineDataLink provides Development Mode and Production Mode for scheduled tasks, with code isolated between the two environments, as shown in the following figure.
For details, see Development Mode and Production Mode.
Development Mode:
Development Mode serves as a test environment for task design and editing, where all modifications remain isolated from tasks in Production Mode. Tasks developed in this mode can be published to Production Mode.
Production Mode:
You can publish production-ready tasks from Development Mode to create view-only counterparts in Production Mode, for which you can configure scheduling plans for conditional execution. You can filter tasks by publication status on the Task Management page in O&M Center.
A folder can contain multiple scheduled tasks or real-time tasks, as shown in the following figure.
The Data Development module supports the development of two types of tasks:
Scheduled Task
The source and target ends of scheduled tasks support more than 30 types of data sources. For details, see Data Sources Supported by FineDataLink.
Scheduled Task provides various nodes and operators for you to extract, transform, and load data on visual pages, facilitating the construction of offline data warehouses. You can configure scheduling plans for scheduled tasks for automatic execution, thus achieving efficient and stable data production.
Real-Time Task
Real-Time Task enables real-time data delivery from Point A to Point B. It supports data processing during delivery, such as parsing data for real-time data warehouses, providing downstream businesses with available and accurate data to meet business requirements.
For details, see Overview of Real-Time Task.
In FineDataLink documentation, the elements inside the Data Transformation node are termed operators, while the Data Transformation node itself and the elements outside it are termed nodes.
A node is a basic unit of a scheduled task. Multiple nodes form an execution process after being connected by lines and further form a complete scheduled task.
Nodes in a task run in turn according to the dependencies between nodes.
ETL processing is carried out in the Data Transformation node. The encapsulated visual functions enable efficient data cleaning, processing, and loading.
A data flow (a Data Transformation node) only provides the following three types of operators, and does not contain combinatorial and process-type operators:
An example of input operators: DB Table Input
An example of processing operators: Data Association
An example of output operators: DB Table Output
Terms involved in the Data Transformation node are explained in the following table.
Input
DB Table Input
Dataset Input
Jiandaoyun Input
File Input
Data Output
DB Table Output
Comparison-Based Deletion
You can identify data rows that exist in the target table but are absent in the input source by comparing field values and then perform deletion operations. The supported deletion modes include:
Physical Deletion: The data is actually deleted from the target table.
Logical Deletion: The data is not deleted; a deletion identifier is added instead.
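A minimal sketch of the two deletion modes, assuming a pandas DataFrame stands in for the target table and that id is the comparison key; this only illustrates the behavior, not the operator's implementation:

```python
import pandas as pd

# Hypothetical target table and input source; "id" is the assumed comparison key.
target = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
source = pd.DataFrame({"id": [1, 3], "name": ["a", "c"]})

# Rows that exist in the target but are absent from the input source.
missing_in_source = ~target["id"].isin(source["id"])

# Physical Deletion: the identified rows are actually removed from the target.
physically_deleted = target[~missing_in_source]

# Logical Deletion: the rows are kept and only a deletion identifier is added.
logically_deleted = target.copy()
logically_deleted["is_deleted"] = missing_in_source.astype(int)

print(physically_deleted)
print(logically_deleted)
```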
Parameter Output
API Output
Jiandaoyun Output
File Output
MongoDB Output
Dataset Output
Connection
Data Association
It is used to join multiple input sources and output the join results.
It supports cross-database and cross-source joins.
The join methods include:
Left Join
Right Join
Inner Join
Full Outer Join
These join methods are consistent with how database tables are joined. You can get the join results by defining the join fields and conditions. Data Association requires two or more input sources and produces only one output.
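As a rough analogy only (not the operator itself), the four join methods behave like pandas merges. The sketch below assumes two hypothetical input tables joined on customer_id:

```python
import pandas as pd

# Two hypothetical input sources, possibly from different databases.
orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, 20, 30]})
customers = pd.DataFrame({"customer_id": [10, 20], "customer_name": ["Ann", "Bob"]})

# A left join on the defined join field; replace "left" with "right",
# "inner", or "outer" for the other three join methods.
result = orders.merge(customers, on="customer_id", how="left")
print(result)
```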
Data Comparison
Procedure:
Select two tables to be compared.
Configure the logical primary key.
Configure the comparison field.
Set the identification relationship.
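A minimal sketch of the comparison logic under the assumption of a single-field logical primary key (id) and one comparison field (price); the operator's actual internals may differ:

```python
import pandas as pd

# Hypothetical tables to be compared; "id" is the logical primary key
# and "price" is the comparison field.
old = pd.DataFrame({"id": [1, 2, 3], "price": [10, 20, 30]})
new = pd.DataFrame({"id": [2, 3, 4], "price": [20, 35, 40]})

merged = new.merge(old, on="id", how="outer", suffixes=("_new", "_old"), indicator=True)

def label(row):
    if row["_merge"] == "left_only":
        return "Added"        # present only in the new table
    if row["_merge"] == "right_only":
        return "Removed"      # present only in the old table
    return "Identical" if row["price_new"] == row["price_old"] else "Changed"

merged["comparison_type"] = merged.apply(label, axis=1)
print(merged[["id", "comparison_type"]])
```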
Transformation
Field Setting
It provides the following functions:
Set columns: You can select and delete fields.
Modify columns: You can modify the field name and type.
Column to Row
You can convert columns in the input data table to rows.
Column to Row (also known as unpivot): It can convert one-row multi-column data into multi-row one-column data. After unpivoting, source column names are transposed into values within a designated attribute column, enabling traceability to their original data context.
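A small hedged example of unpivoting with pandas (an analogy for the operator's effect, not its implementation), assuming hypothetical quarterly sales columns:

```python
import pandas as pd

# One-row, multi-column data: sales per quarter for one product (hypothetical).
wide = pd.DataFrame({"product": ["A"], "Q1": [100], "Q2": [120], "Q3": [90]})

# Unpivot: the quarter column names become values in the "quarter" attribute
# column, so each value can still be traced back to its original column.
long = wide.melt(id_vars="product", var_name="quarter", value_name="sales")
print(long)
```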
Row to Column
You can convert rows in the input data table to columns.
Row to Column (also known as pivot): It can convert multi-row one-column data into one-row multi-column data. After pivoting, distinct categorical values from a source column become the new column headers, while corresponding values across multiple rows are aggregated into a single row per category.
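The reverse direction, again as a pandas analogy with hypothetical data:

```python
import pandas as pd

# Multi-row, one-column data (hypothetical): one row per product/quarter pair.
long = pd.DataFrame({
    "product": ["A", "A", "A"],
    "quarter": ["Q1", "Q2", "Q3"],
    "sales": [100, 120, 90],
})

# Pivot: distinct quarter values become new column headers, and the
# corresponding sales values are gathered into a single row per product.
wide = long.pivot(index="product", columns="quarter", values="sales").reset_index()
print(wide)
```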
JSON Parsing
XML Parsing
JSON Generation
New Calculation Column
It is used to generate new columns through calculation.
Data Filtering
Group Summary
Field-to-Row Splitting
A step flow is composed of nodes.
Each step is a closed loop from input to output.
The terms involved in a step flow are explained in the following table.
General
Data Synchronization
It provides multiple data fetching methods, such as API Input, SQL Statement, and File Input. Memory calculation is not required as no data processing occurs during synchronization, making this node suitable for scenarios where:
Rapid synchronization of data tables is needed.
The calculation needs to be finished during data fetching, and no calculation or transformation is needed during synchronization.
The target database has strong computing ability or the data volume is large. In this case, you can synchronize data to the target database and then use SQL statements for further development.
It can meet the requirements of data conversion and processing between input and output.
It supports complex data processing, such as data association, transformation, and cleaning, between input and output during data synchronization.
Data Transformation fundamentally operates as a data flow. It relies on an in-memory computing engine for data processing, making it suitable for development tasks with smaller datasets (up to 10 million records). Its computational performance scales with allocated memory resources.
Script
SQL Script
Shell Script
Python Script
Bat Script
Process
A virtual node is a no-operation element that connects multiple upstream branches to multiple downstream branches in process design.
Notification channels include email, SMS, platform messages, WeCom messages (through chatbots and app messages), and DingTalk messages.
You can customize the notification content.
You can right-click a connector in a step flow and choose the execution condition. Options include Execute Unconditionally, Execute on Success, and Execute on Failure.
You can right-click a node in a step flow and click Execution Judgment to open the Execution Judgment window, where you can customize the judgment logic of multiple conditions (All or Any) to determine whether the node executes, allowing flexible control over node dependencies in the task.
An instance is generated each time a scheduled task runs, which can be viewed under O&M Center > Scheduled Task > Running Record.
When a task runs, the time when instance construction starts is displayed in the log, as shown in the following figure.
If you have set the execution frequency for a scheduled task, the instance construction may start slightly later than the set time. For example, if the task is set to run at 11:00:00 every day, the instance construction may start at 11:00:02.
For details about data synchronization schemes, see Overview of Data Synchronization Schemes.
1. Incremental Update:
This scheme applies to target table update scenarios where the source table only experiences data inserts.
2. Full Update:
This scheme replaces all existing data in the target table with the latest data from the source table.
3. Comparison-Based Update:
This scheme applies to target table update scenarios where the source table experiences data inserts, modifications, and deletions.
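The following rough sketch, expressed as SQL strings in Python, illustrates how the schemes differ. The table and column names (target_sales, source_sales, created_at) are hypothetical, and in FineDataLink these behaviors are configured visually rather than hand-written:

```python
# 1. Incremental Update: only rows inserted after the last synchronization
#    time are appended to the target table.
incremental_sql = """
INSERT INTO target_sales
SELECT * FROM source_sales
WHERE created_at > :last_sync_time
"""

# 2. Full Update: all existing target data is replaced with the latest source data.
full_update_sql = """
TRUNCATE TABLE target_sales;
INSERT INTO target_sales SELECT * FROM source_sales;
"""

# 3. Comparison-Based Update: rows are inserted, updated, or deleted according
#    to a key-based comparison between source and target
#    (see the Data Comparison sketch above).
```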
A zipper table maintains historical states alongside the latest data, allowing convenient reconstruction of customer records at any specific point in time. This table is suitable for scenarios that require recording all data changes for auditing or tracing purposes.
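A hypothetical zipper table layout: each change closes the previous version's validity interval and opens a new one, so the record's state at any point in time can be reconstructed. The field names below are illustrative only:

```python
import pandas as pd

# Hypothetical zipper table for one customer: each row is a version of the
# record with a validity interval; "9999-12-31" marks the current version.
zipper = pd.DataFrame({
    "customer_id": [1, 1],
    "city":        ["Beijing", "Shanghai"],
    "start_date":  ["2023-01-01", "2024-06-01"],
    "end_date":    ["2024-05-31", "9999-12-31"],
})

# Reconstruct the record as of a given date.
as_of = "2024-01-15"
snapshot = zipper[(zipper["start_date"] <= as_of) & (zipper["end_date"] >= as_of)]
print(snapshot)  # the Beijing version was valid on 2024-01-15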
The identifier field marks whether data rows are inserted, modified, or deleted.
Identifier values are values of this field, where different values correspond to different data change types.
During data output, insert/update/delete operations are executed based on the identifier field and its values.
1. Scenario One: The identifier field and its values are generated by the Data Comparison operator.
When you use the combination of Data Comparison and DB Table Output/Jiandaoyun Output to synchronize data operations (insert, delete, and update), the fdl_comparison_type field (automatically added by the Data Comparison operator) serves as the identifier field. Its values, including Identical (marking unchanged records), Changed (marking modification), Added (marking addition), and Removed (marking deletion), serve as the identifier values, as shown in the following figure.
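Continuing the earlier comparison sketch, the following is a rough illustration (not the DB Table Output implementation) of how an identifier field could drive the write operations, assuming the field is named fdl_comparison_type as described above:

```python
# Hypothetical mapping from identifier values to write operations; in the
# product these operations are performed by the output operator itself.
OPERATION_BY_IDENTIFIER = {
    "Added":     "INSERT",   # insert the row into the target table
    "Changed":   "UPDATE",   # update the matching row in the target table
    "Removed":   "DELETE",   # delete (physically or logically) from the target
    "Identical": None,       # unchanged rows are left untouched
}

def plan_operation(row):
    return OPERATION_BY_IDENTIFIER[row["fdl_comparison_type"]]

rows = [
    {"id": 4, "fdl_comparison_type": "Added"},
    {"id": 3, "fdl_comparison_type": "Changed"},
    {"id": 1, "fdl_comparison_type": "Removed"},
]
for row in rows:
    print(row["id"], plan_operation(row))
```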
2. Scenario Two: The source table contains an identifier field with valid values, and you want to synchronize data insert/delete/update operations.
For details, see Adding/Modifying/Deleting Data Based on the Identifier Field.
The source table Product includes an identifier field Status, whose values include Hot-selling (indicating that the record needs to be added), Normal (indicating that the record needs to be deleted), and Viral (indicating that the record needs to be updated).
In the Status column in the source table, the data whose Product ID value is 15 is marked as Hot-selling, whose Product ID value is 16 is marked as Normal, and whose Product ID value is 14 is marked as Viral. You want to change the data in the target table Product Data accordingly.
When the data volume is large, you can enable Parallel Read to accelerate data reading.
For details, see Data Synchronization - Data Source.
1. Execution Log
After a scheduled task runs, execution logs are displayed on the Log tab page, where you can check whether the task runs successfully and view failure reasons.
You can set Log Level Setting to adjust the level of detail of the output logs.
2. Running Record
You can view task execution information, such as execution status, task duration, and trigger method, under O&M Center > Scheduled Task > Running Record.
Pipeline Task
After a pipeline task runs, execution logs can be viewed. For details, see Single Pipeline Task O&M.
For details about the execution records of pipeline tasks, see Real-Time Pipeline Task O&M - Task Management.
In case of substantial historical data, incremental updates must be executed periodically during data synchronization to ensure data timeliness.
If abnormal field values or dirty data are encountered during incremental updates, the synchronization task may fail after partial completion. In such cases, data in the target table requires rollback to its state before the current incremental update.
FineDataLink provides native support for this scenario from V4.1.5.2. For details, see DB Output (Transaction).
You can also implement rollback by referring to the help document. For details, see Data Rollback After Extraction Failures.
You can set execution priority levels for scheduled tasks. Options include HIGHEST, HIGH, MEDIUM, LOW, and LOWEST.
During thread resource contention, tasks with higher priority in the queue execute first, and tasks with equal priority follow the FIFO (First-In-First-Out) execution order.
For details, see Task Control - Task Attribute.
You can set the execution frequency for scheduled tasks to have them executed automatically at regular intervals, ensuring prompt data updates.
For details, see Overview of Scheduling Plan.
For details about the task retry function, see Task Record: Task Retry.
The task retry function is required in the following scenarios:
1. A scheduled task fetches data of 24 hours preceding the scheduling time each day and synchronizes data to the target database. During a three-day holiday, the system crashes and the scheduled task does not run, resulting in a lack of data for those three days in the target database.
2. During the execution of a scheduled task, dirty data appears in an output component. The scheduled task continues running as the configured dirty data threshold has not been reached. The existence of dirty data is not perceived by the O&M personnel until they receive notification after task completion.
The O&M personnel then check the causes on the dirty data processing page and find that the dirty data results from field values exceeding the length limit. After increasing the field length at the target end, they want to rerun the task.
Data Pipeline provides real-time data synchronization functionality, enabling convenient single-table or entire-database synchronization to replicate data changes from partial or all tables in source databases to target databases in real time, ensuring continuous data correspondence between target and source systems.
During real-time synchronization, data from source databases is temporarily stored via the data pipeline to facilitate writing to target databases.
Therefore, configuring middleware for data staging is a prerequisite before setting up pipeline tasks. FineDataLink employs Kafka as synchronization middleware to temporarily store transmitted data.
A failed pipeline task can resume from the breakpoint: if the full data load has not finished when the task fails, synchronization restarts from the beginning; if the full load has finished, synchronization resumes from the breakpoint.
The following is an example of resynchronization from the breakpoint:
A pipeline task started reading data on March 21, stopped on March 23, and restarted on March 27. Data changes from March 23 to March 27 would be synchronized.
Binlog
MySQL's binary log (binlog) is a critical feature that records all database change operations (such as INSERT, UPDATE, and DELETE) with precise SQL statement execution timestamps. It serves as the foundation for data replication, point-in-time recovery, and audit analysis for MySQL databases.
When using MySQL databases as the source ends of pipeline tasks in FineDataLink, ensure you have enabled Binlog-based CDC. For details, see MySQL Environment Preparation.
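A quick way to check whether a MySQL source meets the binlog prerequisites, assuming the pymysql package and hypothetical credentials; binlog-based CDC typically requires log_bin = ON and binlog_format = ROW (confirm the exact requirements in MySQL Environment Preparation):

```python
import pymysql

# Hypothetical connection details for the MySQL source database.
conn = pymysql.connect(host="127.0.0.1", user="root", password="secret")
with conn.cursor() as cur:
    cur.execute("SHOW VARIABLES WHERE Variable_name IN ('log_bin', 'binlog_format')")
    for name, value in cur.fetchall():
        print(name, value)   # expect log_bin = ON and binlog_format = ROW
conn.close()
```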
CDC
Change Data Capture (CDC) extracts incremental changes to data and schemas from source databases, and propagates them to other databases or app systems in near real time. This enables efficient and low-latency data transfer to data warehouses, facilitating timely data transformation and delivery to analytical applications.
Physical Deletion: If the data is deleted from the source table, the corresponding data will be deleted from the target table.
Logical Deletion: If the data is deleted from the source table, the corresponding data in the target table is marked as deleted using an identifier column rather than actually deleted.
Data Service enables the one-click release of processed data as APIs, facilitating cross-domain data transmission and sharing.
AppCode is an API authentication method unique to FineDataLink. An AppCode can be regarded as a long-term valid token. Once set, it takes effect on the APIs bound to the specified apps.
To access an API, you need to specify the AppCode value in the Authorization request header in the form of AppCode AppCode value (with a space between AppCode and the AppCode value).
For details, see Binding an API to an Application.
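A minimal sketch of calling a published API with AppCode authentication, assuming the requests package and a hypothetical endpoint URL and AppCode value; note the required space between AppCode and its value in the header:

```python
import requests

# Hypothetical endpoint and AppCode value.
url = "https://example.com/finedatalink/api/sales"
app_code = "your-appcode-value"

# The Authorization header takes the form "AppCode <AppCode value>".
headers = {"Authorization": f"AppCode {app_code}"}

response = requests.get(url, headers=headers, timeout=30)
print(response.status_code, response.json())
```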
Concurrency refers to the maximum number of scheduled tasks and pipeline tasks that can be started simultaneously.
Field mapping establishes source-to-target field correspondence. For details, see Table Field Mapping.
For source tables with multiple fields, only mapped fields in the target table will be updated. You can unmap fields that require no synchronization.
There are two types of mapping methods:
Map Fields with Same Name: Use this method when source and target fields share identical names. For specific logic, see Table Field Mapping.
Map Fields in Same Row: Use this method when source and target fields follow identical column order. For specific logic, see Table Field Mapping.
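A rough illustration of the two mapping methods with hypothetical source and target field lists; the product's actual matching logic is described in Table Field Mapping:

```python
source_fields = ["id", "name", "amount"]
target_fields = ["id", "amount", "name", "remark"]

# Map Fields with Same Name: pair fields whose names match exactly.
by_name = {f: f for f in source_fields if f in target_fields}

# Map Fields in Same Row: pair fields by position (column order).
by_row = dict(zip(source_fields, target_fields))

print(by_name)  # {'id': 'id', 'name': 'name', 'amount': 'amount'}
print(by_row)   # {'id': 'id', 'name': 'amount', 'amount': 'name'}
```

The second result shows why same-row mapping should only be used when source and target fields follow an identical column order.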
Definition of Dirty Data in Scheduled Task
1. Data that fails to be written due to a mismatch between source and target fields (such as length/type mismatch, target field missing, and violation of NOT NULL constraints of target tables) is regarded as dirty data.
2. Data with conflicting primary key values is regarded as dirty data if Strategy for Primary Key Conflict in Write Method is set to Record as Dirty Data.
Definition of Dirty Data in Pipeline Task
Data that fails to be written due to a mismatch between source and target fields (such as length/type mismatch, target field missing, and violation of NOT NULL constraints of target tables) is regarded as dirty data.
You can view the lineage relationships of tables used in data development tasks, pipeline tasks, and data services in FineDataLink, as shown in the following figure.
For details, see Lineage Analysis.
In database management systems, Data Definition Language (DDL) comprises commands for defining and modifying database architectures.
Common DDL statements include:
CREATE: used to create database objects.
ALTER: used to modify the structure of existing database objects.
DROP: used to delete database objects.
TRUNCATE: used to remove all rows from a table without affecting its structure.
DDL serves as a critical tool for database administrators and developers to design and maintain database architectures.
In FineDataLink, DDL synchronization refers to the function that automatically synchronizes source-end DDL operations (such as table deletion, field adding/deletion, field renaming, and field type modification) to the target end, requiring no manual intervention to modify the target table structure. This synchronization mechanism ensures structural consistency and reduces manual maintenance effort.
For details, see Data Pipeline - Synchronizing Source Table Structure Changes.
Data updates, deletion, and insertion are typically performed based on logical or physical primary keys to ensure data uniqueness.
Logical Primary Key
The logical primary key is business-defined, supporting multiple fields. Its strength lies in meeting business requirements while guaranteeing data uniqueness. For example, in an employee table, the logical primary key can be the employee ID, ID card number, or other unique identifiers.
Physical Primary Key
The database generates a unique identifier as the primary key value for each record through auto-increment or unique indexes. Physical primary keys offer simplicity and efficiency, operating independently of business rules.
Note: You can refer to authoritative sources for distinctions between logical and physical primary keys.
A task interrupted due to network fluctuations or other reasons can be executed successfully if you rerun it after a while. To prevent such task interruption, you can configure the number of retries and the interval between retries in Retry After Failure to automatically rerun the task upon failure.
For details, see Task Control - Fault Tolerance Mechanism and Pipeline Task Configuration - Pipeline Control.
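A minimal sketch of the retry idea, assuming a hypothetical run_task() callable, a retry count of 3, and a 60-second interval; in FineDataLink the equivalent values are set in Retry After Failure rather than coded:

```python
import time

def run_with_retry(run_task, retries=3, interval_seconds=60):
    """Rerun a task up to `retries` times, waiting `interval_seconds` between attempts."""
    for attempt in range(retries + 1):
        try:
            return run_task()
        except Exception:
            if attempt == retries:
                raise                     # all retries exhausted; surface the failure
            time.sleep(interval_seconds)  # wait before the next automatic rerun
```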
FineDataLink employs task edit locks to prohibit concurrent editing of scheduled tasks, pipeline tasks, API tasks, and data service apps by multiple users.
A task currently being edited by one user is locked against editing by others. Other users opening it can only view the task and will receive a prompt "The current task/service/application is being edited by Username."
You can select ERROR, WARN, or INFO in Log Level Setting.
Log levels ranked by severity (from highest to lowest): ERROR > WARN > INFO
Log levels ranked by detail (from simplest to most detailed): ERROR < WARN < INFO
For details about log levels in scheduled tasks, see Task Control - Task Attribute.
For details about log levels in pipeline tasks, see Pipeline Task Configuration - Pipeline Control.
You can set Dirty Data Threshold to enhance the fault tolerance of tasks. The scheduled task continues running despite dirty data and does not trigger the error until the set limit of dirty data is reached.
For details about this function in scheduled tasks, see Dirty Data Tolerance.
After a dirty data threshold is set, a synchronization task can proceed despite issues such as field type/length mismatches and primary key conflicts. The pipeline task is aborted automatically when the threshold is reached.
A pipeline task whose Dirty Data Threshold is set to 1000 Row(s) will be aborted when the number of dirty data records reaches 1000 during runtime. The dirty data threshold limits the total number of dirty data records in a task since task creation.
For details, see Pipeline Task Configuration - Pipeline Control.
Prior to formal production use, functions including data source usage, Data Pipeline, Data Service, Data Development, Database Table Management, Lineage Analysis, and System Management require registration. For details, see Registration Introduction.
The FineDB database stores all the platform configuration information of FineDataLink projects, including pipeline tasks, scheduled tasks, permission control, and system management. FineDataLink contains a built-in HSQL database used as the FineDB database.
The HSQL database does not support multi-threaded access. It may become unstable in a clustered environment or when handling large volumes of data. It is suitable for a local trial of product functionality.
FineDataLink supports the use of an external FineDB database. For details, see External Database Configuration.
You must configure an external database for formal projects.
Subform is a concept in Jiandaoyun. For details, see SubForm.
For the explanation of terms related to real-time tasks, see Overview of Real-Time Task.