When building a data warehouse, you may need to deduplicate data, as duplicate records constitute dirty data. In this case, you can use the GROUP BY clause in the Spark SQL operator for processing.
For example, if an order is accidentally triggered twice, two duplicate records are produced for the same order, resulting in dirty data. You can remove the duplicates and retain only one row, as shown in the following figure.
You can perform deduplication by using the GROUP BY clause in Spark SQL.
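The duplicate-order scenario above can be illustrated locally. This is a minimal sketch using hypothetical order records (the order IDs, areas, and values are made up for illustration): counting full rows finds the duplicates, and keeping each distinct row once is exactly what grouping by all columns does.

```python
from collections import Counter

# Hypothetical order records; an accidental double trigger has produced
# two identical rows for order "A001".
orders = [
    ("A001", "East", "2023-01-05", 120.0),
    ("A001", "East", "2023-01-05", 120.0),  # duplicate of the row above
    ("A002", "West", "2023-01-06", 80.0),
]

# Count how many times each full row appears; counts > 1 mark duplicates.
counts = Counter(orders)
duplicates = {row: n for row, n in counts.items() if n > 1}
print(duplicates)

# Keeping each distinct row once mirrors what GROUP BY on all columns does.
deduplicated = list(counts.keys())
print(len(deduplicated))  # 2 rows remain
```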
You can download the example data: Orderlist.xlsx
1. Click Data Development and create a scheduled task.
Add a Data Transformation node on the scheduled task editing page, as shown in the following figure.
Enter the Data Transformation node, add a DB Table Input operator, and configure it to extract the data table that needs to be deduplicated, as shown in the following figure.
2. Add a Spark SQL operator and use the GROUP BY clause to deduplicate the data, as shown in the following figure.
select `customer`, `area`, `date`, `sales` from DB Table Input group by `customer`, `area`, `date`, `sales`
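As a runnable sketch of what this query does, the following uses SQLite in place of the Spark SQL operator. The column names come from the query above; the `orders` table and its sample rows are stand-ins for the table extracted by the DB Table Input operator.

```python
import sqlite3

# In-memory database standing in for the table read by DB Table Input.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, area TEXT, date TEXT, sales REAL)")
rows = [
    ("Alice", "East", "2023-01-05", 120.0),
    ("Alice", "East", "2023-01-05", 120.0),  # duplicate record
    ("Bob",   "West", "2023-01-06", 80.0),
]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)

# Grouping by every column keeps exactly one row per distinct record,
# which is the deduplication performed by the Spark SQL step above.
deduped = conn.execute(
    "SELECT customer, area, date, sales FROM orders "
    "GROUP BY customer, area, date, sales"
).fetchall()
print(deduped)  # the duplicate Alice row collapses to one
```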
3. Click Data Preview to view the deduplicated data, as shown in the following figure.
Then use the DB Table Output operator to output the deduplicated data to a specified data table, as shown in the following figure.