Allowed skipping and retrying dirty data in single tables and multiple tables, and performing resynchronization.
Optimized dirty data display.
Optimized dirty data processing.
Adjusted the impact scope of dirty data configuration from task-level to table-level.
Adjusted display conditions for the Skip Dirty Data, Retry Dirty Data, and Resync buttons.
Added support for dirty data retries during the full synchronization phase.
Made the task report a table synchronization abortion error when dirty data storage fails.
Optimized the dirty data processing logs.
You want to quickly locate any dirty data generated during synchronization of a table in a pipeline task, analyze the cause, and process relevant data on the page to ensure normal task operation.
FineDataLink provides three dirty data processing methods, including Skip Dirty Data, Retry Dirty Data, and Resync, as shown in the following figure.
Data that fails to be written due to a mismatch between source and target fields (such as length/type mismatch, target field missing, and violation of NOT NULL constraints of target fields) is regarded as dirty data.
Data that fails to be written due to overall database errors (such as network exceptions, target database crashes, account permission issues, and database disk issues) is regarded as dirty data.
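The field-mismatch causes listed above (length/type mismatch, target field missing, NOT NULL violation) can be illustrated with a small check. This is a minimal sketch and not FineDataLink's own validation: the names TargetColumn and find_mismatches are hypothetical, and Python types stand in for database column types.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class TargetColumn:
    """Hypothetical description of one target-table column."""
    name: str
    py_type: type
    nullable: bool = True
    max_length: Optional[int] = None


def find_mismatches(row: dict, columns: list) -> list:
    """Return the reasons a source row would be rejected by the target table."""
    reasons = []
    known = {col.name for col in columns}
    # A source field with no counterpart in the target table.
    for field in row:
        if field not in known:
            reasons.append(f"target field missing: {field}")
    for col in columns:
        value: Any = row.get(col.name)
        if value is None:
            # NULL (or absent) value written to a NOT NULL column.
            if not col.nullable:
                reasons.append(f"NOT NULL violated: {col.name}")
            continue
        if not isinstance(value, col.py_type):
            reasons.append(f"type mismatch: {col.name}")
        elif (col.max_length is not None and isinstance(value, str)
              and len(value) > col.max_length):
            reasons.append(f"length exceeded: {col.name}")
    return reasons
```

A row that yields an empty list would be written normally in this model; any non-empty result corresponds to a record that would be counted as dirty data.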
FineDataLink of Versions Before V4.2.15.1
Dirty data is stored in the FineDB database (in the fdl_pipe_dirty_record table) and Kafka. If dirty data storage fails, the real-time pipeline task does not report an error or abort. However, the task will subsequently report an error when processing dirty data, requiring resynchronization or specifying a start time for recovery, leading to high costs.
FineDataLink of V4.1.9.3 and Later Versions
When dirty data storage fails, an error indicating single-table synchronization abortion will be reported.
1. Click a pipeline task. You can view the number of dirty data records generated by the task on the Pipeline Activity tab page, as shown in the following figure.
2. Click the filter icon in the Dirty Data column. You can filter tables to obtain those generating dirty data, as shown in the following figure.
3. Click the dirty data count in the Dirty Data column. You can view the dirty data details of the corresponding table, as shown in the following figure.
You can view dirty data information in the fdl_pipe_dirty_record table.
1. Dirty data processing for a table group as a whole is currently not supported. All processing operations must be performed on the individual tables within the group.
2. When designing a real-time pipeline task, you can set Single-Table Dirty Data Threshold to enable automatic task abortion only after the dirty data volume in a single table reaches this threshold.
3. Starting from FineDataLink V4.2.15.1, you can manually retry dirty data generated during the full synchronization phase.
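The Single-Table Dirty Data Threshold in note 2 can be pictured as a per-table counter. The sketch below is illustrative only: DirtyDataThreshold is a hypothetical name, and it assumes that the threshold counts as reached when the count equals the configured value. It only signals that the threshold has been reached and does not model what FineDataLink aborts.

```python
from collections import defaultdict


class DirtyDataThreshold:
    """Hypothetical per-table dirty data counter with an abort threshold."""

    def __init__(self, threshold: int):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.counts = defaultdict(int)  # table name -> dirty record count

    def record(self, table: str) -> bool:
        """Count one dirty record; return True once the table reaches the threshold."""
        self.counts[table] += 1
        return self.counts[table] >= self.threshold
```

Because the counter is keyed by table name, dirty data in one table does not push another table toward its threshold, which matches the table-level scope of the configuration.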
Dirty Data Event Recording Logic
If a record with a primary key value A is regarded as dirty data at time t1, subsequent writes with the same primary key value A after t1 are handled as follows: if the write succeeds, the historical dirty data with this primary key value will be cleared; if the write fails, only the latest dirty data with this primary key value will be retained.
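The retention rule above amounts to keeping at most one dirty entry per primary key. A minimal sketch, assuming an in-memory dictionary in place of the real dirty data storage (DirtyStore and its methods are hypothetical names):

```python
class DirtyStore:
    """Hypothetical store that keeps at most one dirty entry per primary key."""

    def __init__(self):
        self._latest = {}  # primary key -> (payload, error message)

    def on_write(self, key, payload, succeeded, error=None):
        if succeeded:
            # A later successful write clears the historical dirty data for this key.
            self._latest.pop(key, None)
        else:
            # A later failed write replaces the entry, so only the latest is retained.
            self._latest[key] = (payload, error)

    def get(self, key):
        return self._latest.get(key)

    def dirty_keys(self):
        return sorted(self._latest)
```

For example, two consecutive failed writes for key A leave a single entry holding the second payload, and a subsequent successful write for A removes it.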
If a record with a primary key value A is updated with a new value B at time t1, the pipeline will break this down into two events to be processed sequentially in the target end, namely Delete A and Insert B.
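The decomposition of a primary key update can be sketched as follows. The Event type and split_key_update function are hypothetical, and the sketch reads the paragraph above as describing a change of the primary key value from A to B.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical change event applied at the target end."""
    op: str                      # "delete" or "insert"
    key: Any                     # primary key value the event applies to
    row: Optional[dict] = None   # full row for inserts, None for deletes


def split_key_update(old_key, new_key, new_row):
    """Break an update that changes the primary key into Delete old, Insert new."""
    # The list order is the processing order: the delete runs before the insert.
    return [
        Event("delete", old_key),
        Event("insert", new_key, new_row),
    ]
```

Since the two events are processed separately, either one can fail on its own; in this model a failed Insert B would be recorded under key B rather than key A.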
Constraints When Output Ends Use Batch Loading
If batch loading is enabled for the output end, data is typically submitted in one large batch, so the output end may be unable to identify which specific records in the batch are dirty. The output ends that support batch loading behave as follows:
1. GaussDB 200 supports two write methods, copy loading and parallel loading:
COPY Loading:
In the full synchronization phase, if a single batch fails to be submitted, the data in this batch will be written by calling the JDBC API (currently 1024 records per batch). Error data records can be obtained during JDBC writing, enabling the display of dirty data details, but the performance is poor.
In the incremental synchronization phase, if a single batch fails to be submitted, the data in the entire batch will be recorded as dirty data. (The current maximum batch size is 5 MB, roughly within 10,000 rows, depending on the individual record size.)
Parallel Loading:
All the required data in a single task is submitted in one go. An API is provided to query error data, allowing you to query dirty data details in FineDataLink.
2. Hadoop Hive (HDFS write method):
The data is written to the HDFS file in one go. Detailed error data cannot be obtained, so dirty data management is not supported.
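The two COPY loading behaviors described for GaussDB 200 differ in how precisely dirty data can be located. The sketch below contrasts them under stated assumptions: bulk_write and row_write are hypothetical callables standing in for the COPY submission and the JDBC write path, and writing row by row is a simplification of however the JDBC path actually reports failing records. Only the chunk size of 1024 comes from the description above.

```python
JDBC_CHUNK_SIZE = 1024  # records per JDBC batch, as stated in the text above


def load_full_sync(batch, bulk_write, row_write, chunk_size=JDBC_CHUNK_SIZE):
    """Full phase: on bulk failure, replay in chunks and collect only failing rows."""
    try:
        bulk_write(batch)
        return []
    except Exception:
        pass  # fall back to the slower path that can pinpoint bad records
    dirty = []
    for start in range(0, len(batch), chunk_size):
        for row in batch[start:start + chunk_size]:
            try:
                row_write(row)
            except Exception as exc:
                dirty.append((row, str(exc)))
    return dirty


def load_incremental(batch, bulk_write):
    """Incremental phase: on bulk failure, the entire batch is recorded as dirty."""
    try:
        bulk_write(batch)
        return []
    except Exception as exc:
        return [(row, str(exc)) for row in batch]
```

In this model the full-phase fallback trades throughput for per-record error details, while the incremental path marks valid rows in a failed batch as dirty along with the invalid ones.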
Single Dirty Data Record Processing
1. To process a single dirty data record, you can click the dirty data count of a specific table in the Dirty Data column and then click Retry or Skip to process the required record, as shown in the following figure.
2. To retry all dirty data in a table in the pop‑up window, select Select All and click Retry, as shown in the following figure.
Batch Dirty Data Processing
To process all dirty data in the current task or all dirty data in the specified table, tick a single table or multiple tables in the required source database and click Skip Dirty Data, Retry Dirty Data, or Resync, as shown in the following figure.
1. The following table describes the three dirty data processing methods, namely Skip Dirty Data, Retry Dirty Data, and Resync.
For a single table and specified multiple tables, if you click Skip Dirty Data, the cached dirty data will be deleted and cannot be retrieved.
Meanwhile, these data rows will no longer be included in the dirty data statistics.
1. This function enables full resynchronization for single tables or the entire pipeline task. If you click Resync, the task performs a full synchronization again after clearing the target table and triggers incremental synchronization after the full synchronization completes.
2. If full resynchronization is performed for the task, the input and output row count statistics will be reset.
2. The following table describes the conditions for the three buttons (Skip Dirty Data, Retry Dirty Data, and Resync) to appear.
1. Resync is unavailable if the data source is Kafka. To retry expired dirty data, you can only rerun the table synchronization or manually insert dirty data and skip it in the pipeline.
2. If you select a paused or aborted task and click Resync, the resynchronization will only be performed after task startup.
1. The table status is Incremental Synchronization in Progress, and the table contains dirty data.
2. The table status is Aborted, and the table contains dirty data.
3. The table status is Paused, and the table contains dirty data.
2. The table status is Paused, and the table contains dirty data.
3. The table status is Aborted, and the table contains dirty data.
1. The table status is Paused, and the table contains dirty data.
To process dirty data generated during the full synchronization phase, you can export the detailed dirty data and manually adjust and insert it, as shown in the following figure.
For details, see Execution Details of Real-Time Pipeline Tasks.
After processing dirty data, you can view the execution details on the Execution Log tab page, as shown in the following figure.