Real-Time Pipeline Task - Dirty Data Processing - FineDataLink Help Document

Last update: March 04, 2026

Note:

This section applies to FineDataLink of V4.2.11.3 and later versions.

Overview

Version

FineDataLink Version	Functional Change
4.0.5	/
4.0.27	Allowed skipping and retrying dirty data in single tables and multiple tables, and performing resynchronization.
4.1.2	Optimized dirty data display. Optimized dirty data processing.
4.2.1.1	Adjusted the impact scope of dirty data configuration from task-level to table-level.
4.2.11.3	Adjusted display conditions for the Skip Dirty Data, Retry Dirty Data, and Resync buttons.
4.2.15.1	Added support for dirty data retries during the full synchronization phase. Made the task report a table synchronization abortion error when dirty data storage fails. Optimized the dirty data processing logs.

Application Scenario

You want to quickly locate any dirty data generated during synchronization of a table in a pipeline task, analyze the cause, and process relevant data on the page to ensure normal task operation.

Function Description

FineDataLink provides three dirty data processing methods: Skip Dirty Data, Retry Dirty Data, and Resync, as shown in the following figure.

Dirty Data Definition

Data that fails to be written due to a mismatch between source and target fields (such as length/type mismatch, target field missing, and violation of NOT NULL constraints of target fields) is regarded as dirty data.
Data that fails to be written due to overall database errors (such as network exceptions, target database crashes, account permission issues, and database disk issues) is regarded as dirty data.

Note:

Primary key conflicts in real-time pipeline tasks do not result in dirty data because the new data overwrites the old data.

Explanation of Dirty Data Storage Logic

FineDataLink of Versions Before V4.2.15.1

Dirty data is stored in the FineDB database (in the fdl_pipe_dirty_record table) and Kafka. If dirty data storage fails, the real-time pipeline task does not report an error or abort. However, the task reports an error later when the dirty data is processed, and recovery then requires resynchronization or specifying a start time, which is costly.

FineDataLink of V4.2.15.1 and Later Versions

When dirty data storage fails, an error indicating single-table synchronization abortion will be reported.

Note:

If dirty data storage fails, the task checkpoint will not be updated. This ensures that synchronization resumes from the correct position upon recovery, avoiding data inconsistency.
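The checkpoint rule in the note above can be illustrated with a minimal Python sketch. This is not FineDataLink's implementation: the TableSync class, the exception names, and the strict_write stand-in are all hypothetical and exist only to show the ordering. The point is that the checkpoint moves forward only after a row has either been written or been persisted as dirty data.

```python
from dataclasses import dataclass, field


class DirtyStoreError(Exception):
    """Raised by the hypothetical dirty-data store when a record cannot be persisted."""


class TableSyncAborted(Exception):
    """Signals that synchronization of a single table was aborted."""


@dataclass
class TableSync:
    checkpoint: int = 0                      # last source position fully handled
    dirty: list = field(default_factory=list)
    store_available: bool = True             # toggled to simulate a storage failure

    def _store_dirty(self, position, row):
        if not self.store_available:
            raise DirtyStoreError("dirty data store unreachable")
        self.dirty.append((position, row))

    def handle(self, position, row, write):
        """Write one row; advance the checkpoint only when the row is accounted for."""
        try:
            write(row)
        except ValueError:
            # The write failed, so the row is dirty data and must be persisted first.
            try:
                self._store_dirty(position, row)
            except DirtyStoreError as exc:
                # Storage failed: abort the table and leave the checkpoint untouched.
                raise TableSyncAborted(f"table aborted at position {position}") from exc
        self.checkpoint = position


def strict_write(row):
    # Hypothetical stand-in for a target write that rejects NULL names.
    if row.get("name") is None:
        raise ValueError("NOT NULL constraint violated")
```

Because the checkpoint stays at the last accounted-for position, a restart in this sketch would re-read the failed row rather than silently lose it.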

Viewing Specific Dirty Data

Viewing Dirty Data in Task Details

1. Click a pipeline task. You can view the number of dirty data records generated by the task on the Pipeline Activity tab page, as shown in the following figure.

2. Click the filter icon in the Dirty Data column. You can filter tables to obtain those generating dirty data, as shown in the following figure.

3. Click the dirty data count in the Dirty Data column. You can view the dirty data details of the corresponding table, as shown in the following figure.

Viewing Dirty Data in FineDB

You can view dirty data information in the fdl_pipe_dirty_record table.

Processing Dirty Data

Note:

1. Dirty data cannot currently be processed for a table group as a whole. All processing operations must be performed on the individual tables within the group.

2. When designing a real-time pipeline task, you can set Single-Table Dirty Data Threshold so that the task is aborted automatically only after the dirty data volume in a single table reaches this threshold.

3. Starting from FineDataLink V4.2.15.1, you can manually retry dirty data generated during the full synchronization phase.
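The Single-Table Dirty Data Threshold in note 2 can be pictured as a per-table counter, as in the sketch below. The DirtyThresholdMonitor class is hypothetical, and treating "reaches" as a greater-than-or-equal comparison is an assumption rather than documented product behavior.

```python
from collections import Counter


class DirtyThresholdMonitor:
    """Hypothetical per-table dirty data counter with a single shared threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = Counter()   # table name -> number of dirty records seen

    def record_dirty(self, table):
        """Count one dirty record and report whether the table reached the threshold."""
        self.counts[table] += 1
        # Assumption: "reaches" means the count is equal to or above the threshold.
        return self.counts[table] >= self.threshold
```

Counts are kept per table, so dirty data in one table does not push another table toward the threshold.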

Usage Instruction

Dirty Data Event Recording Logic

If a record with a primary key value A is regarded as dirty data at time t1, subsequent writes with the same primary key value A after t1 are handled as follows: if the write succeeds, the historical dirty data with this primary key value will be cleared; if the write fails, only the latest dirty data with this primary key value will be retained.
If the primary key value of a record is updated from A to B at time t1, the pipeline breaks the update into two events that are processed sequentially at the target end: Delete A and Insert B.
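The two recording rules above can be sketched in Python as follows. The DirtyRecordLog class, the event dictionaries, and split_key_update are hypothetical names chosen for illustration and do not reflect how FineDataLink stores events internally.

```python
class DirtyRecordLog:
    """Keeps at most one dirty record per primary key value (the latest failure)."""

    def __init__(self):
        self.latest = {}  # primary key value -> latest failed event and its error

    def apply(self, event, write):
        """Try to write an event; update the dirty record for its primary key."""
        pk = event["pk"]
        try:
            write(event)
        except ValueError as exc:
            # Failed write: keep only the latest dirty record for this key.
            self.latest[pk] = {"event": event, "error": str(exc)}
            return False
        # Successful write: clear any historical dirty record for this key.
        self.latest.pop(pk, None)
        return True


def split_key_update(old_pk, new_pk, row):
    """Break a primary key change into Delete(old) followed by Insert(new)."""
    return [
        {"op": "delete", "pk": old_pk},
        {"op": "insert", "pk": new_pk, "row": row},
    ]
```

Keying the log by primary key value is what makes a later successful write for A clear the earlier dirty record for A.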

Constraints When Output Ends Use Batch Loading

If batch loading is enabled for the output end, data is typically submitted in one large batch, so the output end may be unable to identify which specific records in the batch are dirty data. The output ends that support batch loading behave as follows:

1. GaussDB 200 supports two write methods, COPY loading and parallel loading:

COPY Loading:

In the full synchronization phase, if a single batch fails to be submitted, the data in this batch will be written by calling the JDBC API (currently 1024 records per batch). Error data records can be obtained during JDBC writing, enabling the display of dirty data details, but the performance is poor.
In the incremental synchronization phase, if a single batch fails to be submitted, the data in the entire batch will be recorded as dirty data. (The current maximum batch size is 5 MB, roughly within 10,000 rows, depending on the individual record size.)

Parallel Loading:

All the required data in a single task is submitted in one go. An API is provided to query error data, allowing you to query dirty data details in FineDataLink.

2. Hadoop Hive (HDFS write method):

The data is written to the HDFS file in one go. Detailed error data cannot be obtained, so dirty data management is not supported.
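The difference between the two COPY loading phases described above can be sketched as follows. This is a simplified illustration, not the product's code: load_batch, bulk_write, and row_write are hypothetical, and the fallback here writes one row at a time, whereas the source states that the JDBC fallback uses batches of 1024 records.

```python
def load_batch(rows, bulk_write, row_write, phase):
    """Return the rows that would be recorded as dirty data for one batch."""
    try:
        bulk_write(rows)
        return []
    except RuntimeError:
        if phase == "incremental":
            # Incremental phase: the whole failed batch is recorded as dirty data.
            return list(rows)
        # Full phase: fall back to a slower write path that can pinpoint bad rows.
        dirty = []
        for row in rows:
            try:
                row_write(row)
            except ValueError:
                dirty.append(row)
        return dirty
```

The fallback trades throughput for the ability to show dirty data details, which matches the performance caveat in the full synchronization description.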

Steps for Processing Single/Multiple Dirty Data Records

Single Dirty Data Record Processing

1. To process a single dirty data record, you can click the dirty data count of a specific table in the Dirty Data column and then click Retry or Skip to process the required record, as shown in the following figure.

2. To retry all dirty data in a table in the pop‑up window, select Select All and click Retry, as shown in the following figure.

Batch Dirty Data Processing

To process all dirty data in the current task or all dirty data in the specified table, tick a single table or multiple tables in the required source database and click Skip Dirty Data, Retry Dirty Data, or Resync, as shown in the following figure.

Explanation of Three Processing Methods

1. The following table describes the three dirty data processing methods, namely Skip Dirty Data, Retry Dirty Data, and Resync.

Processing Method	Description
Skip Dirty Data	For a single table or multiple specified tables, clicking Skip Dirty Data deletes the cached dirty data, which cannot be retrieved afterward. These data rows are no longer included in the dirty data statistics.
Retry Dirty Data	For a single table or multiple specified tables, clicking Retry Dirty Data resubmits the cached dirty data and updates the dirty data statistics.
Resync	1. This function performs full resynchronization for single tables or the entire pipeline task. If you click Resync, the task clears the target table, performs a full synchronization again, and starts incremental synchronization after the full synchronization completes. 2. When full resynchronization is performed, the statistics (input and output row counts) are reset. Note: If Logical Deletion is enabled, the target table is cleared and the data is rewritten using the INSERT logic during resynchronization. A prompt appears, indicating that logically deleted data generated during task operation will be cleared.

2. The following table describes the conditions for the three buttons (Skip Dirty Data, Retry Dirty Data, and Resync) to appear.

Note:

1. Resync is unavailable if the data source is Kafka. To retry expired dirty data, you can only rerun the table synchronization or manually insert dirty data and skip it in the pipeline.

2. If you select a paused or aborted task and click Resync, the resynchronization is performed only after the task is started.

Processing Method	Scenario
Skip Dirty Data	1. The table status is Incremental Synchronization in Progress, and the table contains dirty data. 2. The table status is Aborted, and the table contains dirty data. 3. The table status is Paused, and the table contains dirty data.
Retry Dirty Data	1. The table status is Incremental Synchronization in Progress, and the table contains dirty data. 2. The table status is Paused, and the table contains dirty data. 3. The table status is Aborted, and the table contains dirty data.
Resync	1. The table status is Paused, and the table contains dirty data. 2. The table status is Aborted, and the table contains dirty data.
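The display conditions in the table above can be summarized as a small lookup, sketched below in Python. The available_actions function and its source_is_kafka parameter are hypothetical. The sketch encodes only the conditions listed in the table plus the Kafka note, and it treats "unavailable" as "not offered", which is an assumption.

```python
# Table statuses under which each button appears, taken from the table above.
BUTTON_STATUSES = {
    "Skip Dirty Data": {"Incremental Synchronization in Progress", "Paused", "Aborted"},
    "Retry Dirty Data": {"Incremental Synchronization in Progress", "Paused", "Aborted"},
    "Resync": {"Paused", "Aborted"},
}


def available_actions(status, dirty_count, source_is_kafka=False):
    """Return the buttons offered for a table, in the order used by the table above."""
    if dirty_count <= 0:
        # Every listed scenario requires the table to contain dirty data.
        return []
    actions = [name for name, statuses in BUTTON_STATUSES.items() if status in statuses]
    if source_is_kafka and "Resync" in actions:
        # Resync is unavailable when the data source is Kafka.
        actions.remove("Resync")
    return actions
```

For example, a table in the Incremental Synchronization in Progress status with dirty data gets Skip Dirty Data and Retry Dirty Data but not Resync.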

To process dirty data generated during the full synchronization phase, you can export the detailed dirty data and manually adjust and insert it, as shown in the following figure.

Viewing Logs After Dirty Data Processing

For details, see Execution Details of Real-Time Pipeline Tasks.

After processing dirty data, you can view the execution details on the Execution Log tab page, as shown in the following figure.
