The following issue may occur when you parse JSON data at scale:
If the data contains invalid JSON, parsing it terminates the entire scheduled task. JSON parsing is a data processing step, so the dirty data tolerance mechanism of scheduled tasks cannot shield the task from invalid JSON data.
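To see why a single bad record is enough to stop a run, consider the following minimal sketch in plain Python. It is not the platform's JSON Parsing operator; the sample rows and the parse_all helper are hypothetical. An unguarded json.loads call raises json.JSONDecodeError on the first malformed row, so the rows after it are never processed.

```python
import json

# Hypothetical sample rows; the second one is missing its closing brace.
rows = ['{"id": 1}', '{"id": 2', '{"id": 3}']

def parse_all(json_strings):
    """Parse every string without any guard; one bad row aborts the loop."""
    return [json.loads(s) for s in json_strings]

try:
    parsed = parse_all(rows)
except json.JSONDecodeError as err:
    # Nothing is returned, not even the rows that were valid.
    parsed = None
    print(f"Batch aborted: {err.msg} at character {err.pos}")
```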
You may want to:
Filter out invalid JSON data to prevent it from affecting scheduled task execution.
Quickly identify invalid JSON data in large-volume data scenarios.
You can define an is_valid_json function using Python to verify the validity of JSON data and parse valid JSON data only.
1. The Python operator is required in the solution outlined in this document. You need to refer to the Python Operator document to prepare the environment and understand its usage.
2. In this solution, JSON data is stored in a TXT file. Therefore, you need to prepare either of the following data connections: FTP/SFTP Data Connection, or Data Connection to a Local Server Directory.
The JSON data to be parsed is as follows:
You can download the example data: test_1.txt
1. Create a scheduled task, drag a Data Transformation node onto the page, and enter the Data Transformation editing page.
2. Drag in a File Input operator and configure it to read JSON data. In this solution, JSON data is stored in a TXT file. You can configure the operator based on actual conditions, as shown in the following figure.
Set Filename Extension to TXT and Column Separator to None, untick First Row As Field Name, select Manual Acquisition in Output Field, name the output field column, and set its data type to varchar.
Click Data Preview, as shown in the following figure.
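The File Input settings above amount to reading each line of the TXT file as one string value in a single field named column. A rough stand-alone equivalent in plain Python is sketched below, assuming one JSON document per line. The file content is hypothetical sample data (not the content of test_1.txt), and read_single_column is an illustrative helper, not a platform function.

```python
import os
import tempfile

# Hypothetical sample content: one JSON document per line, no header row.
sample = '{"name": "Alice", "age": 30}\n{"name": "Bob", "age": }\n'

def read_single_column(path, field="column"):
    """Read a text file line by line into rows with one string field."""
    with open(path, encoding="utf-8") as fh:
        # No column separator: the whole line is the field value.
        return [{field: line.rstrip("\r\n")} for line in fh]

# Write the sample to a temporary TXT file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)

rows = read_single_column(tmp.name)
os.remove(tmp.name)
print(rows)
```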
1. Drag in a Python operator and define an is_valid_json function to determine whether the JSON data is valid, as shown in the following figure.
import pandas as pd
# You must use pandas.
# If there is a connected data source, you can click the data source above to use it. Data from the input source exists in a pandas DataFrame, and can be processed through the DataFrame method.
# ----------------------------------------
import json

def is_valid_json(json_string):
    # Return True if the string parses as JSON, otherwise False.
    try:
        json.loads(json_string)
        return True
    except (json.JSONDecodeError, TypeError):
        # TypeError covers non-string values such as empty (null) cells.
        return False

# Example
input = File Input  # Reference to the upstream data source, inserted by clicking the data source above.
a = []
for row in input.index:
    json_string = input.loc[row]['column']
    a.append(is_valid_json(json_string))
input['isvalid'] = a
# ----------------------------------------
output = input
# Assign the data to be output to the downstream operator to an output variable. If the data is of the DataFrame data type, output it in the form of a two-dimensional table. If the data is of other data types, output it in the form of a string.
Code explanation:
The script iterates through each row of the input data, parses the JSON string in the specific column (column), and checks if it conforms to a valid JSON format. Then, the script adds the check result to a new column isvalid, indicating whether each JSON string is valid. Finally, it passes the input data with the new column as output to the downstream operator.
2. Drag a Data Filtering operator onto the page and configure it to obtain data where the isvalid field value is true (indicating valid JSON data), as shown in the following figure.
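Outside the platform, the validate-then-filter logic of the two steps above can be sketched with the standard library alone. The check_json helper and the sample rows are hypothetical. Besides the boolean flag, the helper also records the decoder's error message, which may help locate bad records in large inputs.

```python
import json

def check_json(json_string):
    """Return (is_valid, detail); detail describes the parse error, if any."""
    try:
        json.loads(json_string)
        return True, ""
    except (json.JSONDecodeError, TypeError) as err:
        return False, str(err)

# Hypothetical rows in the same shape as the File Input output.
rows = [
    {"column": '{"name": "Alice", "age": 30}'},
    {"column": '{"name": "Bob", "age": }'},
    {"column": "not json"},
]

for row in rows:
    row["isvalid"], row["error"] = check_json(row["column"])

valid_rows = [r for r in rows if r["isvalid"]]        # passed on for parsing
invalid_rows = [r for r in rows if not r["isvalid"]]  # kept for inspection

print(len(valid_rows), "valid;", len(invalid_rows), "invalid")
for r in invalid_rows:
    print(r["column"], "->", r["error"])
```

Keeping the invalid rows in a separate list, rather than discarding them, makes it possible to review or repair them later without rescanning the source file.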
1. Drag in a JSON Parsing operator and configure it to parse the correctly formatted data, as shown in the following figure.
2. Click Data Preview, as shown in the following figure.
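Conceptually, the JSON Parsing step turns each validated string into fields. A minimal sketch in plain Python follows, assuming flat JSON objects and the hypothetical field names name and age. The platform operator is configured through its UI and may behave differently.

```python
import json

# Hypothetical rows that already passed the validity filter.
valid_rows = ['{"name": "Alice", "age": 30}', '{"name": "Carol"}']

def extract_fields(json_string, fields):
    """Parse one JSON object and pick the requested top-level keys."""
    obj = json.loads(json_string)
    # Missing keys become None instead of raising KeyError.
    return {f: obj.get(f) for f in fields}

parsed = [extract_fields(s, ["name", "age"]) for s in valid_rows]
print(parsed)
```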
1. Drag a DB Table Output operator onto the page and configure the operator, as shown in the following figure. Select Append Data to Existing File as the write method, as shown in the following figure. You can configure it according to actual conditions.
Click Run to execute the task. After a successful execution, the log is displayed, as shown in the following figure.
Data in the database table is shown in the following figure.
Click the Publish button to publish the scheduled task to Production Mode, as shown in the following figure.
In Production Mode, click the icon to configure the scheduling plan, where you can set the task execution frequency.