Looping Through Files in the Linux Environment for Data Synchronization- FineDataLink Help Document

Last update: January 07, 2025

Overview

Application Scenario

Application Scenario:

You have multiple Excel files in the same format on the Windows system and need to read all the file data and perform row-to-column conversion.

When you use the File Input operator to read files in batches, all file data is concatenated row-wise. The concatenated data may contain duplicate values, which will cause an error in row-to-column conversion, as the Row to Column operator requires unique names of the fields to be converted.

Implementation Method

Execute a Shell script to scan the CSV or Excel file in the specified folder to output a file containing the absolute path of the scanned files.
Extract the absolute path in the output file to obtain the relative path, and output the relative path as a parameter.
Loop through the relative path to read data from these files. You can process the file data, and then output the result data to a database table.
Scan the file regularly, summarize data, and output it into a database table.

Note:

1. The solution in this document supports data reading from Excel and CSV files on the local FineDataLink server and the FTP/SFTP server.

2. Since this solution uses the Shell Script node, the FineDataLink project must run on a Linux system.

Procedure

This example in this document demonstrates how to summarize data from multiple CSV files on the FineDataLink server and output it into a database table.

Scenario Description

A company uploads files to the CSV folder on a monthly schedule, and the file data is manually summarized into a specific database table on a regular basis, leading to inefficiency and a high likelihood of errors.

In this example, the CSV folder contains two CSV files. The file data is shown in the following figure.

Solution Description

1. Use the Shell Script node to execute a shell script that scans the CSV/Excel file in the specified folder and outputs the file containing the absolute path information of the scanned file.

2. Use the Spark SQL operator to obtain the relative path from the absolute path of the scanned file, and use the Parameter Output operator to output the relative path as a parameter.

3. Use the parameter in a Loop Container node to pass relative file paths, which is looped through by the File Input operator in Loop Container for data reading. Use a DB Table Output operator to output the file data to a database table.

4. Configure a Timed Scheduling plan for regular file scans, data summary, and data writing into the database table.

Preparation

This document uses the Shell Script node to scan CSV files in a specified folder on the FineDataLink server. Therefore, the following preparations are required.

1. To execute a specified shell script on the target server, you must have established an SSH connection to the server.

For details, see Shell Script.

2. You have created a folder in the /data/FDL_local/webapps/webroot/WEB-INF/assets/local_files/df_files path of the FineDataLink project and uploaded the files to be scanned to the folder every month, as shown in the following figure. (In this example, a df_files folder is created in the /data/FDL_local/webapps/webroot/WEB-INF/assets/local_files/df_files path.)

1 p路径.png

Note:

Note: To read data from CSV/Excel files on the local FineDataLink server via a data connection to the local server directory, the folder storing files to be scanned must be placed in the /data/FDL_local/webapps/webroot/WEB-INF/assets/local_files/df_files path of the FineDataLink project.

3. Create a Local Server Directory data connection to connect FineDataLink to the path of the folder storing CSV files.

4. Prepare the shell script as this example uses the Shell Script node to scan the CSV file in the specified folder.

Create a file_search.sh file. The code is as follows.

You can download the example data (source data): file_search.zip

#!/bin/bash
# Get input parameters
path=$1  # Specify the folder path to scan
output=$2 # Specify the output file path
# Output the header
echo Filename, Absolute Path, Modification Time > $output
# Traverse the folder
for file in $(ls $path) do
# Get the filename, absolute path, and modification time
filename=$(basename $file)
abspath=$(realpath $path/$file)   modtime=$(stat -c %y $path/$file | awk '{print $1,$2}')
# Output the filename, absolute path, and modification time to the CSV file
echo "$filename,$abspath,$modtime" >> $output
done
# Output completion message
echo "Scan completed! Output file path: $output" The output file path is $output.
Expand

5. Upload the file_search.sh file to the path of the CSV folder.

6p路径.png.jpg

Configuring the Shell Script Node

The goal of this section is to scan the CSV files in the specified folder and output the results to the file_list.csv file. The file_list.csv file will contain information about the files in the CSV folder, including file names, absolute paths, and modification time.

1. Create a scheduled task and add a Shell Script node.

2. Add two parameters of Text type in Parameter List, as shown in the following table.

Parameter

Description

Value

source_folder

Specify the folder path to be scanned. In this example, the folder contains CSV files.

/data/demo_FR/webroot/WEB-INF/assets/local_files/df_files/csv

target_csv

Specify the path of the output file. You can customize the name of the output file, which is generated automatically.

It contains file names, absolute paths, and modification time of CSV files in the folder.

/data/demo_FR/webroot/WEB-INF/assets/local_files/df_files/file_list.csv

补.png

3. Enter the full path of the shell script on the target terminal in Script Path in the Shell Script node. Add the parameters set in Step 2 of this section in the Script Parameters field, as shown in the following figure.

4. Right-click the Shell Script node and click Run Node. The file_list.csv file will be generated, as shown in the following figure.

12（1）.png

Configuring the Data Transformation Node

This section is to read the output file file_list.csv mentioned in the "Configuring the Shell Script Node" section, obtain the relative path of the scanned file from the absolute path, and output the relative path as a parameter.

Reading the Output File

1. Add a Data Transformation node and connect it to the Shell Script node.

2. Enter the Data Transformation node and drag a File Input operator onto the page to read the file_list.csv file, as shown in the following figure.

Click Data Preview, as shown in the following figure.

Obtaining the Relative Path of the Scanned File

In this solution, the File Input operator is used to read Excel and CSV file data. Since the file address setting item in this operator requires a relative file path, you need to obtain the relative path of the scanned file.

Add a Spark SQL operator to obtain the relative paths of the scanned files, as shown in the following figure.

Click Data Preview, as shown in the following figure.

Outputting the Relative Path of Scanned Files as a Parameter

1. Add a Parameter Output operator to output the relative path of the scanned file as a parameter, as shown in the following figure.

Note:

You are advised to set a debug value (to any valid parameter value) to facilitate effect viewing at subsequent nodes where the parameter is referenced. The debug value does not participate in the actual execution.

2. Click the Back button in the upper right corner.

Configuring the Loop

1. Add a Loop Container node and configure it as shown in the following figure.

2. Drag a Data Transformation node into the Loop Container node, and click the Data Transformation node to enter the Data Transformation editing page.

3. Add a File Input operator to read file data through the relative path of files to be read, as shown in the following figure.

In this example, enter ${filePathList} in File Path and press Enter.

Click Data Preview, as shown in the following figure.

4. Add other data processing operators (such as the Row to Column operator and the Field Setting operator) behind the File Input operator as needed. This step is not included in the current example.

5. Add a DB Table Output operator and configure it to output the fetched data into a database table, as shown in the following figure.

5. Run the task.

6. Click the Publish button to publish the task to Production Mode.

Configuring a Timed Scheduling Plan

Configure the timed scheduling plan to summarize data at the end of every month, as shown in the following figure.

Result Display

huizong table data is shown in the following figure.

Previous：MongoDB Output

Next：API Output Logic

Helpful
Not helpful
Only read

中文（简体）

English

Looping Through Files in the Linux Environment for Data Synchronization

Overview

Application Scenario

Implementation Method

Procedure

Scenario Description

Solution Description

Preparation

Configuring the Shell Script Node

Configuring the Data Transformation Node

Obtaining the Relative Path of the Scanned File

Outputting the Relative Path of Scanned Files as a Parameter

Configuring the Loop

Configuring a Timed Scheduling Plan

Result Display

附件列表