反馈已提交
网络繁忙
direct: New support for left and right merge for data tables in "Hadoop Hive database" with Excel
The system actively interrupts the update when there is too much data in the "union combined"
In the process of using data in practice, there are often situations where two tables need to be joined together to form a new table.
Suppose there are two such data tables.
① Table A:record the name and English grade of the students.
② Table B:record the name and math grade of the students.
If you want to see the student's name, math grade and English grade in one table, you can use the "Name" field as the "merge basis" for left and right merge.
The different merge methods are shown below.
Now you have two tables below,
The effect of "union combined".
The effect of "intersection".
The effect of "left merge".
The effect of "right merge".
Direct connection table and Excel left and right merge
In the Direct Connect version, there are four databases "Microsoft SQL Server 2016, Oracle, Pivotal Greenplum Database, Hadoop Hive" that support merging data tables with Excel, while the rest of the databases do not.
Before merging, you need to change the field value of SystemOptimizationConfig.excelExtractDataBase in FINE_CONF_ENTITY table to "true", please refer to FINE_CONF_ENTITY Visual Configuration for details.
In the directly connected version, the two tables for left and right merging cannot be from two data sources.
Otherwise, an error is reported after merging:[DIRECT-ETL] unsupported data source: databases on different hosts/ports
For example, in the estate industry package, there are two data tables: the contract fact table (FACT_SALE_CONTRACT) and the contract dimension table (DIM_CONTRACT), which need to be joined together to form a new table based on the contract ID field and ensure that the data in the contract fact table will not be lost. This can be achieved by merging left and right.
1)Go to "Data Preparation > Package" and click "Add Table > Self-Service Dataset", as shown below.
2)Enter the self-service dataset configuration interface, select the fields to be added to the final table from "FACT_SALE_CONTRACT" of the "estate industry" package.
Note: Only data tables that have been added and updated with data can be selected here; data tables that have not been updated cannot be used to create self-service data sets.
Note: Users are recommended to reduce the length of table field names before performing left-right merging to avoid the problem of saving self-service datasets caused by too-long field names.
After adding the fields, click "+" in the left column and select "Left and right merge", as shown below.
Select the table to be merged as "DIM_CONTRACT", and select the fields to be added, click "OK", as shown in the following figure.
The meaning of the merge basis as described in the "Name" field in Section 1.2.
1)When you enter the left and right merge settings screen, you need to select the "left merge" method to ensure the data integrity of the contract fact table.
At this time, a field common to both tables will automatically appear in the merge basis field as "merge basis", here is the "CONTRACTGUID" contract ID field. The following figure shows.
Note: The fields in the merged basis of the two tables added here must be of the same type, and field types that are inconsistent cannot be merged left or right and require field type conversion.
You can also manually change the merge basis field or add multiple merge basis. If you add multiple merge basis, you can
Description of the consolidation method:
2)After adding the merge basis, a preview of the new merged table will appear in the data preview box below, as shown in the following figure.
Extract data version: The merge basis will be merged into one field and only the merge result "CONTRACTUID" will be kept
Directly connected data version: the merge basis fields of the left and right merge two tables are still two fields, as shown in the following figure.
3)If you want to modify the tables/fields to be merged that have been added previously, you can hover your mouse over the table column to be merged and click the Edit button to reselect the tables and fields. As shown in the figure below.
After clicking the "Save/Save As" button in the upper right corner and updating the data, you see the processed self-service dataset in the package, as shown below.
Implement "CONTRACTGUID" and "Signing Time" into one table, as shown below.
It is not possible to use a single field as the basis for merging, so the fields "CONTRACTGUID, Signing Time" are required as the basis for merging. The meaning of multiple merge basis is to ensure the uniqueness of data when merging left and right, as shown in the following figure.
Since the merge is equivalent to N:N, it generates a Cartesian product, which leads to data volume expansion. In order not to affect the stability of the system, the system actively interrupts the update of the self-service dataset that has undergone N:N operations when the amount of data after the merge satisfies the following conditions.
Performing union combined merger with data volumes exceeding 10 million.
Expansion index >= 5; Expansion index: MIN (and aggregate and result data volume/left table data, and aggregate and result data volume/right table data). For example, if 1000 thousand data and 5000 thousand data N:N results in 2000w data, then the N:N expansion factor is: 20000 thousand/5000 thousand = 4
If the Cartesian product appears, the user is prompted at the update message whether the update was successful.
Data set generation is successful: Cartesian product is detected when updating, data expansion, and update time increase, please check if there are duplicate values in the data of the merge basis column in the left and right merge steps.
Data set generation failed: The cartesian product was detected during the update, and the operation was aborted due to severe data inflation. Please check if there are duplicate values in the data of the merge base column in the left and right merge steps, and please make sure that there should not be too many duplicate values.
If the "Update Information" indicates that the generation has failed, but the user still wants to use N:N for analysis, you can reduce the amount of data by filtering and other means to meet the requirement that the system does not interrupt.
Please make sure there are no spaces in the fields of the merged data table before doing the left and right merge. The following figure shows.
If the self-service dataset is created by the data analysis user or by the data processing user in the self-service dataset.
When the amount of data is large, the calculation can be made smoother by using the first 5000 rows in the editing process of the dataset, as shown in the following figure.
售前咨询电话
400-811-8890转1
在线技术支持
在线QQ:800049425
热线电话:400-811-8890转2
总裁办24H投诉
热线电话:173-1278-1526
文 档反 馈
鼠标选中内容,快速反馈问题
鼠标选中存在疑惑的内容,即可快速反馈问题,我们将会跟进处理。
不再提示
10s后关闭