Data Deduplication

  • Last update: January 27, 2026
  • Overview

    Expected Effect

    When building a data warehouse, you may need to deduplicate data, because duplicate records are dirty data. You can remove them with the GROUP BY clause in the Spark SQL operator.

    For example, if an order is accidentally triggered twice, the table contains two identical records for the same order. Deduplication retains only one of them, as shown in the following figure.

    Implementation Method

    You can perform deduplication by using the GROUP BY clause in Spark SQL.
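
    The idea can be tried outside the product with a minimal sketch. It assumes Python's built-in sqlite3 module as a stand-in for Spark SQL (the GROUP BY behavior shown is the same), a hypothetical table named orders, and made-up sample rows that are not taken from Orderlist.xlsx. Only the column names come from this page.

```python
import sqlite3

# Hypothetical sample rows: the second row duplicates the first,
# as if one order had been triggered twice.
rows = [
    ("Alice", "East", "2026-01-05", 120.0),
    ("Alice", "East", "2026-01-05", 120.0),
    ("Bob", "West", "2026-01-06", 80.0),
]

# In-memory SQLite database used as a stand-in for the Spark SQL operator.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer TEXT, area TEXT, date TEXT, sales REAL)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)

# Grouping by every selected column collapses identical rows into one.
deduplicated = conn.execute(
    """
    SELECT customer, area, date, sales
    FROM orders
    GROUP BY customer, area, date, sales
    ORDER BY customer
    """
).fetchall()

for row in deduplicated:
    print(row)
```

    Because every selected column also appears in the GROUP BY list, the query returns each distinct combination of values exactly once. Rows that differ in any of the four columns are kept as separate records.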

    Procedure

    You can download the example data: Orderlist.xlsx

    1. Click Data Development and create a scheduled task.

    Add a Data Transformation node on the scheduled task editing page, as shown in the following figure.

     

    Enter the Data Transformation node, add a DB Table Input operator, and configure it to extract the data table that needs to be deduplicated, as shown in the following figure.

    2. Add a Spark SQL operator and use the GROUP BY clause to deduplicate the data, as shown in the following figure.

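    -- Grouping by every selected column collapses identical rows into one.
    select
        `customer`, `area`, `date`, `sales`
    from
        -- DB Table Input is the upstream operator; select it rather than typing its name.
        DB Table Input
    group by
        `customer`, `area`, `date`, `sales`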
    Note:
    In the SQL statement, specify the data table and the fields to query by selecting them rather than typing their names.

    3. Click Data Preview to view the deduplicated data, as shown in the following figure.

    Then use the DB Table Output operator to output the deduplicated data to a specified data table, as shown in the following figure.

    Note:
    If you need to write data back to the original table or check for duplicate entries, see Function Description of Data Comparison.
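
    To see which records are duplicated before removing them, a plain SQL alternative is to count the rows in each group and keep only the groups with more than one row. This is a different technique from the Data Comparison function mentioned in the note. The sketch below again assumes Python's built-in sqlite3 module as a stand-in for Spark SQL, a hypothetical orders table, a hypothetical helper named find_duplicates, and made-up sample rows.

```python
import sqlite3


def find_duplicates(conn):
    """Return (customer, area, date, sales, copies) for rows stored more than once."""
    return conn.execute(
        """
        SELECT customer, area, date, sales, COUNT(*) AS copies
        FROM orders
        GROUP BY customer, area, date, sales
        HAVING COUNT(*) > 1
        """
    ).fetchall()


# Hypothetical table and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer TEXT, area TEXT, date TEXT, sales REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        ("Alice", "East", "2026-01-05", 120.0),
        ("Alice", "East", "2026-01-05", 120.0),
        ("Bob", "West", "2026-01-06", 80.0),
    ],
)

duplicates = find_duplicates(conn)
print(duplicates)
```

    The copies column reports how many times each duplicated record occurs, which helps confirm that the deduplicated output dropped only the extra copies.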

     

    Topic: Data Development - Scheduled Task