Dataset Development Suggestions

  • Last update: December 18, 2023
  • Basic Dataset

    General Specifications

    1. It is suggested that a new basic dataset contain no more than 100 fields. An excessive number of fields may impact performance.

    2. If you need to convert the field type of a basic table, change it to an SQL table and perform the conversion with SQL statements (see the example after this list).

    3. If querying the table data takes more than 5 s in the database, do not create a DB table from it; optimizing the SQL statements (in both direct connection and extraction modes) is recommended instead.

    4. It is recommended that the basic table contain no more than 10 million records, or that you filter the records to below 10 million before analysis.
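
    For reference, a minimal sketch of item 2 above: pushing a field type conversion into an SQL table instead of changing the basic table. The table name `sales` and the columns `order_id`, `order_date`, and `amount` are hypothetical; substitute your own, and adjust the CAST target types to your database.

    ```sql
    -- Hypothetical example: convert text columns to proper types inside an SQL table,
    -- rather than changing field types on the basic table itself.
    SELECT
        order_id,
        CAST(order_date AS DATE)           AS order_date,   -- text -> date
        CAST(amount     AS DECIMAL(18, 2)) AS amount        -- text -> number
    FROM sales
    WHERE order_date >= '2023-01-01';  -- filter early to keep the record count below the recommended limit
    ```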

     

    Usage of Excel Datasets

    Importing an Excel file that contains too much data slows down the import process and may even cause the front end to freeze.

    Therefore, it is recommended that:

    1. Limit the size of a single imported Excel file to 100 MB.

    2. Remove useless rows/empty rows in the Excel file and keep only the valid data.

    3. Add SQL tables or DB tables for use after the Excel file is imported into the database.

    If you create numerous self-service datasets from Excel files in direct connection mode and these datasets have complex data lineage, editing, previewing, designing components with, or otherwise operating on them can consume large amounts of memory and even cause system downtime.

    Therefore, it is recommended that:

    1. Add Excel files in data extraction mode for analysis unless real-time data is necessary.

    2. In direct connection mode, use Excel files that contain no more than 100,000 rows.

    3. Keep the data lineage within 3 levels when conducting joint analysis involving Excel files and datasets.

     

    Usage of SQL Datasets

    1. Avoid complex SQL statements when creating SQL datasets, so that the query takes no more than 5 s. If the query behind an SQL table is slow, dashboards and self-service datasets built on that table will also be slow. You can improve efficiency by dividing the table into several parts to reduce the amount of data scanned at one time.

    2. Follow the SQL statement writing specifications below to ensure query performance (illustrative sketches follow this list).

      • Specify column names in the SELECT statement and select only the necessary fields. All operations should reference explicit column names; do not use column numbers or * to select all columns.

      • Filter the data in the larger table before joining, and do not express the join condition in the WHERE clause.

      • Parameterize subqueries that repeat frequently in the SQL statement to reduce the number of database queries. Keep at most two subqueries (nested queries) in a statement (i.e., no more than 3 joins or UNION ALL operations). Over-nesting will degrade the final performance.

      • Reduce unnecessary scanning. If UNION ALL can meet the requirements, use UNION ALL instead of UNION, because the latter involves an extra process of comparison and deduplication.

      • Avoid full scans of large tables. For frequently queried large tables, create indexes on the underlying database table, and filter and sort on the indexed columns to improve query efficiency. Avoid using functions and arithmetic operations in the WHERE clause.

      • Set clear boundaries for interval range comparisons (especially comparisons on indexed columns) to reduce the precision required at comparison time. Use >= and <= instead of > and <.

      • Avoid unnecessary sorting when querying a large table; sorting should be the last operation in a statement. Avoid ORDER BY, DISTINCT, and similar clauses in subqueries, because ORDER BY sorts the results while DISTINCT and GROUP BY sort during the calculation process. Use EXISTS instead of DISTINCT if the subquery involves a large amount of data.

      • Ensure that the SQL statement does not produce a Cartesian product. Check the data size, and if the volume is large, apply measures on the database side such as sharding, partitioning, slimming, merging, and indexing.

    3. Use the direct connection cache reasonably. For datasets that are queried frequently but perform poorly on each query, configure the cache according to data timeliness to reduce database queries and improve response efficiency.
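
    The sketches below illustrate several of the specifications above. All table and column names (`orders`, `customers`, `order_date`, and so on) are hypothetical, and the exact syntax may vary by database.

    First, explicit column selection, filtering the larger table before the join, and joining on an explicit key (which also rules out an accidental Cartesian product):

    ```sql
    -- Select only the needed columns; never use * or column numbers.
    -- Filter the larger table (orders) before joining, and join with an explicit ON condition.
    SELECT o.order_id,
           o.amount,
           c.customer_name
    FROM (
        SELECT order_id, customer_id, amount
        FROM orders
        WHERE order_date >= '2024-01-01'
          AND order_date <= '2024-03-31'      -- shrink the large table first
    ) AS o
    INNER JOIN customers AS c
            ON o.customer_id = c.customer_id; -- explicit join key, no Cartesian product
    ```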
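    Second, UNION ALL instead of UNION when duplicates are acceptable or impossible, since UNION adds a comparison and deduplication pass:

    ```sql
    -- UNION ALL simply appends the two result sets; UNION would also compare and deduplicate rows.
    SELECT order_id, amount FROM orders_2023
    UNION ALL
    SELECT order_id, amount FROM orders_2024;
    ```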
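    Third, keeping WHERE conditions on indexed columns free of functions and arithmetic, and using closed (>= / <=) boundaries, so that the index can be used:

    ```sql
    -- Avoid: wrapping the indexed column in a function forces a full table scan.
    -- SELECT order_id, amount FROM orders WHERE YEAR(order_date) = 2024;

    -- Prefer: a plain range with closed boundaries that an index on order_date can serve.
    SELECT order_id, amount
    FROM orders
    WHERE order_date >= '2024-01-01'
      AND order_date <= '2024-12-31';
    ```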
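    Finally, EXISTS instead of DISTINCT when a large subquery only checks whether matching rows exist:

    ```sql
    -- EXISTS can stop at the first matching row; DISTINCT would materialize and deduplicate the whole subquery result.
    SELECT c.customer_id, c.customer_name
    FROM customers AS c
    WHERE EXISTS (
        SELECT 1
        FROM orders AS o
        WHERE o.customer_id = c.customer_id
          AND o.order_date >= '2024-01-01'
    );
    ```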

    Self-Service Dataset

    Field Number

    Select only the necessary fields when creating a self-service dataset. Selecting all fields wastes physical space on the server and increases the time consumed for updates. It is recommended that the number of fields not exceed 30.

    Step Number

    • Keep the number of data processing steps for a single dataset within 15.

    • Add no more than 10 columns.

    • Keep the number of joins within 3.

    • Keep the number of group summary steps within 3.

    Convert the dataset into a database table if the step number exceeds the recommended value.

    Step Sequence

    • Filter the data first, and do not use all the data for calculation.

    • Place sorting after filtering, because sorting performs poorly on large data volumes.

    • Minimize sorting and summarizing after adding a new column, as this results in poor performance.

    Step Type

    • If the number of data records is greater than 10 million, filter the data first to reduce the amount, and minimize the use of Join, Group Summary, Column from Other Tables, and row-column conversions. Otherwise, performance will be compromised.

    • When using Join or Column from Other Tables, avoid configuring N : N (many-to-many) relationships. If this cannot be avoided, keep the data volume after the Cartesian product within 10 million rows.

    • The total number of grouped and summarized fields should not exceed 10, the number of fields containing summary conditions should not exceed 3, and the result should not exceed 10 million rows.

    • The calculation performance of a new column is strongly related to the complexity of its function: the more complex the formula, the worse the performance. No more than 3 fields should use the DEF function, and nesting should be avoided. For new columns, Sum of All Values and Sum of All Values in Group perform poorly. Date-related calculations such as YoY and MoM should be carried out with Quick Calculation in the component.

    • Use SQL statements for field type conversion in SQL tables. For simple conversions, add a Field Setting step and extract the data locally. If the formula involved is too complex, convert the dataset into a database table and then perform the relevant operations.

    Lineage Level

    A self-service dataset with complex lineage will cause slow updates for both the basic table and the self-service dataset, affecting normal usage of the related basic table.

    Therefore, it is recommended that:

    • Keep at most 5 levels of lineage for the extracted self-service datasets.

    • Keep at most 3 levels of lineage for the direct connection datasets.

    If a deeper level of lineage is required, express the lineage with SQL datasets or complete the processing in a database table for use.

     

    Performance Reference

    Standard Performance of Self-Service Datasets

    | Scenario (Single Step) | Details | Response Time | Magnitude |
    | --- | --- | --- | --- |
    | Join, Union All, Field Settings, Sort, Filter | All scenarios involving the use of these functions | 3 s | 10 million |
    | Group Summary | Scenarios involving calculating the variance, standard deviation, count, distinct count, average value, maximum and minimum value, and sum of value fields | 3 s | 10 million |
    | Column Adding | All scenarios requiring adding columns, except for the assignment column | 3 s | 10 million |


    Non-conforming Performance of Self-Service Datasets

    | Scenario (Single Step) | Details | Response Time | Magnitude |
    | --- | --- | --- | --- |
    | Group Summary | Scenarios involving calculating the median, last period value, and period-on-period ratio of value fields | Longer than 10 s | 10 million |
    | Group Summary | Scenarios involving calculating the count and distinct count of text fields; calculating the count, distinct count, and earliest/latest time of date fields; and viewing custom grouping and interval grouping | 3 to 5 s | 10 million |
    | Column Adding | Scenarios involving the use of the assignment column | Longer than 5 s | 10 million |

     

