Dataset Development Suggestions

  • Last update: December 18, 2023
  • Basic Dataset

    General Specifications

    1. It is suggested that a new basic dataset contain no more than 100 fields. An excessive number of fields may impact performance.

    2. If you need to convert the field type of a basic table, change it to an SQL table and perform the conversion with SQL statements (see the example after this list).

    3. If querying the table data takes more than 5 s in the database, do not create a DB table from it; optimizing the SQL statements (in both direct connection and extraction modes) is recommended instead.

    4. It is recommended that the basic table contain no more than 10 million records, or that you filter the records to below 10 million before analysis.
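
    For reference, a minimal sketch of item 2 above: pushing a field type conversion into an SQL table instead of changing the basic table. The table name `sales` and the columns `order_id`, `order_date`, and `amount` are hypothetical; substitute your own, and adjust the CAST target types to your database.

    ```sql
    -- Hypothetical example: convert text columns to proper types inside an SQL table,
    -- rather than changing field types on the basic table itself.
    SELECT
        order_id,
        CAST(order_date AS DATE)           AS order_date,   -- text -> date
        CAST(amount     AS DECIMAL(18, 2)) AS amount        -- text -> number
    FROM sales
    WHERE order_date >= '2023-01-01';  -- filter early to keep the record count below the recommended limit
    ```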

     

    Usage of Excel Datasets

    Importing an Excel file that contains too much data slows down the import process and may even cause the front end to freeze.

    Therefore, it is recommended that:

    1. Limit the size of a single imported Excel file to 100 MB.

    2. Remove useless rows/empty rows in the Excel file and keep only the valid data.

    3. Add SQL tables or DB tables for use after the Excel file is imported into the database.

    If you create numerous self-service datasets from Excel files in direct connection mode and these datasets have complex data lineage, editing, previewing, designing components with, or otherwise operating on them can consume large amounts of memory and even cause system downtime.

    Therefore, it is recommended that:

    1. Add Excel files in data extraction mode for analysis unless real-time data is necessary.

    2. In direct connection mode, use Excel files that contain no more than 100,000 rows.

    3. Keep the data lineage within 3 levels when conducting joint analysis involving Excel files and datasets.

     

    Usage of SQL Datasets

    1. Avoid complex SQL statements when creating SQL datasets, so that the query takes no more than 5 s. If the query behind an SQL table is slow, dashboards and self-service datasets built on that table will also be slow. You can improve efficiency by dividing the table into several parts to reduce the amount of data scanned at one time.

    2. Follow the SQL statement writing specifications below to ensure query performance (illustrative sketches follow this list).

      • Specify column names in the SELECT statement and select only the necessary fields. All operations should reference explicit column names; do not use column numbers or * to select all columns.

      • Filter the data in the larger table before joining, and do not express the join condition in the WHERE clause.

      • Parameterize subqueries that repeat frequently in the SQL statement to reduce the number of database queries. Keep at most two subqueries (nested queries) in a statement (i.e., no more than 3 joins or UNION ALL operations). Over-nesting will degrade the final performance.

      • Reduce unnecessary scanning. If UNION ALL can meet the requirements, use UNION ALL instead of UNION, because the latter involves an extra process of comparison and deduplication.

      • Avoid full scans of large tables. For frequently queried large tables, create indexes on the underlying database table, and filter and sort on the indexed columns to improve query efficiency. Avoid using functions and arithmetic operations in the WHERE clause.

      • Set clear boundaries for interval range comparisons (especially comparisons on indexed columns) to reduce the precision required at comparison time. Use >= and <= instead of > and <.

      • Avoid unnecessary sorting when querying a large table; sorting should be the last operation in a statement. Avoid ORDER BY, DISTINCT, and similar clauses in subqueries, because ORDER BY sorts the results while DISTINCT and GROUP BY sort during the calculation process. Use EXISTS instead of DISTINCT if the subquery involves a large amount of data.

      • Ensure that the SQL statement does not produce a Cartesian product. Check the data size, and if the volume is large, apply measures on the database side such as sharding, partitioning, slimming, merging, and indexing.

    3. Use the direct connection cache reasonably. For datasets that are queried frequently but perform poorly on each query, configure the cache according to data timeliness to reduce database queries and improve response efficiency.
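
    The sketches below illustrate several of the specifications above. All table and column names (`orders`, `customers`, `order_date`, and so on) are hypothetical, and the exact syntax may vary by database.

    First, explicit column selection, filtering the larger table before the join, and joining on an explicit key (which also rules out an accidental Cartesian product):

    ```sql
    -- Select only the needed columns; never use * or column numbers.
    -- Filter the larger table (orders) before joining, and join with an explicit ON condition.
    SELECT o.order_id,
           o.amount,
           c.customer_name
    FROM (
        SELECT order_id, customer_id, amount
        FROM orders
        WHERE order_date >= '2024-01-01'
          AND order_date <= '2024-03-31'      -- shrink the large table first
    ) AS o
    INNER JOIN customers AS c
            ON o.customer_id = c.customer_id; -- explicit join key, no Cartesian product
    ```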
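    Second, UNION ALL instead of UNION when duplicates are acceptable or impossible, since UNION adds a comparison and deduplication pass:

    ```sql
    -- UNION ALL simply appends the two result sets; UNION would also compare and deduplicate rows.
    SELECT order_id, amount FROM orders_2023
    UNION ALL
    SELECT order_id, amount FROM orders_2024;
    ```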
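    Third, keeping WHERE conditions on indexed columns free of functions and arithmetic, and using closed (>= / <=) boundaries, so that the index can be used:

    ```sql
    -- Avoid: wrapping the indexed column in a function forces a full table scan.
    -- SELECT order_id, amount FROM orders WHERE YEAR(order_date) = 2024;

    -- Prefer: a plain range with closed boundaries that an index on order_date can serve.
    SELECT order_id, amount
    FROM orders
    WHERE order_date >= '2024-01-01'
      AND order_date <= '2024-12-31';
    ```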
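    Finally, EXISTS instead of DISTINCT when a large subquery only checks whether matching rows exist:

    ```sql
    -- EXISTS can stop at the first matching row; DISTINCT would materialize and deduplicate the whole subquery result.
    SELECT c.customer_id, c.customer_name
    FROM customers AS c
    WHERE EXISTS (
        SELECT 1
        FROM orders AS o
        WHERE o.customer_id = c.customer_id
          AND o.order_date >= '2024-01-01'
    );
    ```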

    Self-Service Dataset

    Field Number

    Select only the necessary fields when creating a self-service dataset. Selecting all fields wastes physical space on the server and increases the time consumed for updates. It is recommended that the number of fields not exceed 30.

    Step Number

    • Keep the number of data processing steps for a single dataset within 15.

    • Add no more than 10 columns.

    • Keep the number of joins within 3.

    • Keep the number of group summary steps within 3.

    Convert the dataset into a database table if the step number exceeds the recommended value.

    Step Sequence

    • Filter the data first, and do not use all the data for calculation.

    • Place sorting after filtering, because sorting performs poorly on large data volumes.

    • Minimize sorting and summarizing after adding a new column, as this results in poor performance.

    Step Type

    • If the number of data records is greater than 10 million, filter the data first to reduce the amount, and minimize the use of Join, Group Summary, Column from Other Tables, and row-column conversions. Otherwise, performance will be compromised.

    • When using Join or Column from Other Tables, avoid configuring N : N (many-to-many) relationships. If this cannot be avoided, keep the data volume after the Cartesian product within 10 million rows.

    • The total number of grouped and summarized fields should not exceed 10, the number of fields containing summary conditions should not exceed 3, and the result should not exceed 10 million rows.

    • The calculation performance of a new column is strongly related to the complexity of its function: the more complex the formula, the worse the performance. No more than 3 fields should use the DEF function, and nesting should be avoided. For new columns, Sum of All Values and Sum of All Values in Group perform poorly. Date-related calculations such as YoY and MoM should be carried out with Quick Calculation in the component.

    • Use SQL statements for field type conversion in SQL tables. For simple conversions, add a Field Setting step and extract the data locally. If the formula involved is too complex, convert the dataset into a database table and then perform the relevant operations.

    Lineage Level

    A self-service dataset with complex lineage will cause slow updates for both the basic table and the self-service dataset, affecting normal usage of the related basic table.

    Therefore, it is recommended that:

    • Keep at most 5 levels of lineage for the extracted self-service datasets.

    • Keep at most 3 levels of lineage for the direct connection datasets.

    If a deeper level of lineage is required, express the lineage with SQL datasets or complete the processing in a database table for use.

     

    Performance Reference

    Standard Performance of Self-Service Datasets

    | Scenario (Single Step) | Details | Response Time | Magnitude |
    | --- | --- | --- | --- |
    | Join, Union All, Field Settings, Sort, Filter | All scenarios involving the use of these functions | 3 s | 10 million |
    | Group Summary | Scenarios involving calculating the variance, standard deviation, count, distinct count, average value, maximum and minimum value, and sum of value fields | 3 s | 10 million |
    | Column Adding | All scenarios requiring adding columns, except for the assignment column | 3 s | 10 million |


    Non-conforming Performance of Self-Service Datasets

    | Scenario (Single Step) | Details | Response Time | Magnitude |
    | --- | --- | --- | --- |
    | Group Summary | Scenarios involving calculating the median, last period value, and period-on-period ratio of value fields | Longer than 10 s | 10 million |
    | Group Summary | Scenarios involving calculating the count and distinct count of text fields; calculating the count, distinct count, and earliest/latest time of date fields; and viewing custom grouping and interval grouping | 3 to 5 s | 10 million |
    | Column Adding | Scenarios involving the use of the assignment column | Longer than 5 s | 10 million |

     

