Real-Time Pipeline Task - Dirty Data Processing

  • Last update: March 04, 2026
  • Note:
    This section applies to FineDataLink of V4.2.11.3 and later versions.

    Overview

    Version

    FineDataLink Version    Functional Change
    4.0.5/
    4.0.27

    Allowed skipping and retrying dirty data in single tables and multiple tables, and performing resynchronization.

    4.1.2
    • Optimized dirty data display.

    • Optimized dirty data processing.

    4.2.1.1

    Adjusted the impact scope of dirty data configuration from task-level to table-level.

    4.2.11.3

    Adjusted the display conditions for the Skip Dirty Data, Retry Dirty Data, and Resync buttons.

    4.2.15.1
    • Added support for dirty data retries during the full synchronization phase.

    • Made the task report a table synchronization abortion error when dirty data storage fails.

    • Optimized the dirty data processing logs.

    Application Scenario

    You want to quickly locate any dirty data generated during synchronization of a table in a pipeline task, analyze the cause, and process relevant data on the page to ensure normal task operation.

    Function Description

    FineDataLink provides three dirty data processing methods: Skip Dirty Data, Retry Dirty Data, and Resync, as shown in the following figure.

    Dirty Data Definition

    • Data that fails to be written due to a mismatch between source and target fields (such as length/type mismatch, target field missing, and violation of NOT NULL constraints of target fields) is regarded as dirty data.

    • Data that fails to be written due to overall database errors (such as network exceptions, target database crashes, account permission issues, and database disk issues) is regarded as dirty data.

    Note:
    Primary key conflicts in real-time pipeline tasks do not produce dirty data because the new data overwrites the old data.

    Explanation of Dirty Data Storage Logic

    FineDataLink of Versions Before V4.2.15.1

    Dirty data is stored in the FineDB database (in the fdl_pipe_dirty_record table) and Kafka. If dirty data storage fails, the real-time pipeline task does not report an error or abort. However, the task reports an error later, when the dirty data is processed. Recovery then requires resynchronization or specifying a start time, which is costly.

    FineDataLink of V4.2.15.1 and Later Versions

    When dirty data storage fails, an error indicating single-table synchronization abortion will be reported.

    Note:
    If dirty data storage fails, the task checkpoint will not be updated. This ensures that synchronization resumes from the correct position upon recovery, avoiding data inconsistency.
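
    The ordering described above (store the dirty record first, advance the checkpoint only if storage succeeds) can be sketched as follows. This is a minimal illustration, not FineDataLink code: handle_dirty_row, TableSyncAborted, the store_dirty callback, and the dict-based checkpoint are all hypothetical names introduced for the example.

```python
from typing import Callable, Dict


class TableSyncAborted(Exception):
    """Hypothetical error standing in for a single-table synchronization abortion."""


def handle_dirty_row(
    row: dict,
    offset: int,
    store_dirty: Callable[[dict], None],
    checkpoint: Dict[str, int],
) -> None:
    """Persist a dirty row, then advance the checkpoint (assumed design)."""
    try:
        store_dirty(row)
    except Exception as exc:
        # Storage failed: leave the checkpoint untouched so that a later
        # recovery resumes from the last position known to be safe.
        raise TableSyncAborted(f"dirty data storage failed: {exc}") from exc
    # Storage succeeded: it is now safe to move past this row.
    checkpoint["offset"] = offset
```

    Because the checkpoint is written only after the dirty record is stored, a storage failure never leaves the task pointing past a row that was neither written nor recorded.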

    Viewing Specific Dirty Data

    Viewing Dirty Data in Task Details

    1. Click a pipeline task. You can view the number of dirty data records generated by the task on the Pipeline Activity tab page, as shown in the following figure.

    2. Click the filter icon in the Dirty Data column. You can filter tables to obtain those generating dirty data, as shown in the following figure.

    3. Click the dirty data count in the Dirty Data column. You can view the dirty data details of the corresponding table, as shown in the following figure.

    Viewing Dirty Data in FineDB

    You can view dirty data information in the fdl_pipe_dirty_record table.

    Processing Dirty Data

    Note:

    1. Dirty data cannot currently be processed for a table group as a whole. All processing operations must be performed on the individual tables within the group.

    2. When designing a real-time pipeline task, you can set Single-Table Dirty Data Threshold to enable automatic task abortion only after the dirty data volume in a single table reaches this threshold.

    3. Starting from FineDataLink V4.2.15.1, you can manually retry dirty data generated during the full synchronization phase.

    Usage Instruction

    Dirty Data Event Recording Logic

    • If a record with a primary key value A is regarded as dirty data at time t1, subsequent writes with the same primary key value A after t1 are handled as follows: if the write succeeds, the historical dirty data with this primary key value will be cleared; if the write fails, only the latest dirty data with this primary key value will be retained.

    • If a record with a primary key value A is updated with a new value B at time t1, the pipeline will break this down into two events to be processed sequentially in the target end, namely Delete A and Insert B.
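
    The two rules above can be sketched as a small in-memory model. This is an illustration under stated assumptions, not the product's implementation: DirtyStore, record_write, and split_key_update are hypothetical names, and a dict keyed by primary key stands in for the real dirty data storage.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class DirtyStore:
    """Hypothetical per-table dirty data store, keyed by primary key."""

    latest_by_key: Dict[Any, dict] = field(default_factory=dict)

    def record_write(self, key: Any, row: dict, succeeded: bool) -> None:
        if succeeded:
            # A successful write clears historical dirty data for this key.
            self.latest_by_key.pop(key, None)
        else:
            # A failed write retains only the latest dirty record for this key.
            self.latest_by_key[key] = row


def split_key_update(
    old_key: Any, new_key: Any, row: dict
) -> List[Tuple[str, Any, Optional[dict]]]:
    """Break a primary key change into Delete old, then Insert new."""
    return [("DELETE", old_key, None), ("INSERT", new_key, row)]
```

    For example, two failed writes for key A leave a single dirty record (the second one), and a later successful write for A removes it entirely.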

    Constraints When Output Ends Use Batch Loading

    If batch loading is enabled for the output end, data is typically submitted in one large batch, so the output end may be unable to identify which specific records in the batch produced dirty data. The output ends that support batch loading behave as follows:

    1. GaussDB 200 supports two write methods, COPY loading and parallel loading:

    COPY Loading:

    • In the full synchronization phase, if a single batch fails to be submitted, the data in this batch will be written by calling the JDBC API (currently 1024 records per batch). Error data records can be obtained during JDBC writing, enabling the display of dirty data details, but the performance is poor.

    • In the incremental synchronization phase, if a single batch fails to be submitted, the data in the entire batch will be recorded as dirty data. (The current maximum batch size is 5 MB, roughly within 10,000 rows, depending on the individual record size.)

    Parallel Loading:

    • All the required data in a single task is submitted in one go. An API is provided to query error data, allowing you to query dirty data details in FineDataLink.

    2. Hadoop Hive (HDFS write method):

    The data is written to the HDFS file in one go. Detailed error data cannot be obtained, so dirty data management is not supported.
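
    The difference between the two COPY loading phases can be sketched as follows. This is a simplified illustration under assumptions: write_with_fallback, bulk_write, and row_write are hypothetical callbacks, and the sketch writes rows one at a time in the fallback path, whereas the text above describes JDBC writes in batches of 1024 records.

```python
from typing import Callable, List, Sequence, Tuple

Row = dict
DirtyRecord = Tuple[Row, str]  # (row, error message)


def write_with_fallback(
    batch: Sequence[Row],
    bulk_write: Callable[[Sequence[Row]], None],
    row_write: Callable[[Row], None],
    phase: str,
) -> Tuple[int, List[DirtyRecord]]:
    """Return (rows written, dirty records) for one submitted batch."""
    try:
        bulk_write(batch)
        return len(batch), []
    except Exception as bulk_exc:
        if phase == "incremental":
            # Incremental phase: the whole batch is recorded as dirty data.
            return 0, [(row, f"batch failed: {bulk_exc}") for row in batch]

    # Full phase: retry through a row-level path so that the specific
    # failing records can be identified (slower, but more precise).
    written = 0
    dirty: List[DirtyRecord] = []
    for row in batch:
        try:
            row_write(row)
            written += 1
        except Exception as row_exc:
            dirty.append((row, str(row_exc)))
    return written, dirty
```

    The trade-off matches the description above: the row-level path isolates the bad records at the cost of throughput, while the incremental path keeps throughput but marks every row in the failed batch as dirty.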

    Steps for Processing Single/Multiple Dirty Data Records

    Single Dirty Data Record Processing

    1. To process a single dirty data record, you can click the dirty data count of a specific table in the Dirty Data column and then click Retry or Skip to process the required record, as shown in the following figure.

    2. To retry all dirty data in a table in the pop‑up window, select Select All and click Retry, as shown in the following figure. 

    Batch Dirty Data Processing

    To process all dirty data in the current task or all dirty data in the specified table, tick a single table or multiple tables in the required source database and click Skip Dirty Data, Retry Dirty Data, or Resync, as shown in the following figure.

    Explanation of Three Processing Methods

    1. The following table describes the three dirty data processing methods, namely Skip Dirty Data, Retry Dirty Data, and Resync.

    Processing Method    Description
    Skip Dirty Data

    For a single table and specified multiple tables, if you click Skip Dirty Data, the cached dirty data will be deleted and cannot be retrieved.

    Meanwhile, these data rows will no longer be included in the dirty data statistics.

    Retry Dirty Data

    For a single table and specified multiple tables, if you click Retry Dirty Data, the cached dirty data will be resubmitted, and the dirty data statistics will be updated.

    Resync

    1. This function enables full resynchronization for single tables or the entire pipeline task. If you click Resync, the task performs a full synchronization again after clearing the target table and triggers incremental synchronization after the full synchronization completes.

    2. When full resynchronization is triggered, the input and output row count statistics are reset.

    Note:
    If Logical Deletion is enabled, the target table will be cleared for data rewriting during resynchronization. The rewriting uses the INSERT logic. A prompt will appear, indicating that logically deleted data generated during task operation will be cleared.

    2. The following table describes the conditions for the three buttons (Skip Dirty Data, Retry Dirty Data, and Resync) to appear.

    Note:

    1. Resync is unavailable if the data source is Kafka. To retry expired dirty data, you can only rerun the table synchronization or manually insert dirty data and skip it in the pipeline.

    2. If you select a paused or aborted task and click Resync, the resynchronization will only be performed after task startup.

    Processing Method    Scenario
    Skip Dirty Data

    1. The table status is Incremental Synchronization in Progress, and the table contains dirty data.

    2. The table status is Aborted, and the table contains dirty data.

    3. The table status is Paused, and the table contains dirty data.

    Retry Dirty Data

    1. The table status is Incremental Synchronization in Progress, and the table contains dirty data.

    2. The table status is Paused, and the table contains dirty data.

    3. The table status is Aborted, and the table contains dirty data.

    Resync

    1. The table status is Paused, and the table contains dirty data.

    2. The table status is Aborted, and the table contains dirty data.
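
    The display conditions in the table above, together with the Kafka restriction from the note, can be summarized as a small lookup. This is only a sketch of the rules as stated on this page: available_buttons and its parameters are hypothetical names, not a FineDataLink API.

```python
from typing import Set

INCREMENTAL = "Incremental Synchronization in Progress"
PAUSED = "Paused"
ABORTED = "Aborted"


def available_buttons(
    status: str, dirty_count: int, source_is_kafka: bool = False
) -> Set[str]:
    """Return the buttons shown for a table, per the conditions listed above."""
    buttons: Set[str] = set()
    if dirty_count <= 0:
        # Every listed scenario requires the table to contain dirty data.
        return buttons
    if status in (INCREMENTAL, PAUSED, ABORTED):
        buttons.update({"Skip Dirty Data", "Retry Dirty Data"})
    if status in (PAUSED, ABORTED) and not source_is_kafka:
        # Resync is unavailable when the data source is Kafka.
        buttons.add("Resync")
    return buttons
```

    For instance, a table in the Incremental Synchronization in Progress state with dirty data offers Skip Dirty Data and Retry Dirty Data but not Resync.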

    To process dirty data generated during the full synchronization phase, you can export the detailed dirty data and manually adjust and insert it, as shown in the following figure.

    Viewing Logs After Dirty Data Processing

    For details, see Execution Details of Real-Time Pipeline Tasks.

    After processing dirty data, you can view the execution details on the Execution Log tab page, as shown in the following figure.
