Spark SQL Window Function

  • Last update: February 13, 2025
  • Overview

    Window functions are supported in the Spark SQL operator of FineDataLink.

    Window functions are online analytical processing (OLAP) functions.

    Window functions share similar functionality with Spark SQL aggregate functions. For details, see Spark SQL Aggregate Function. Both can be used to conduct statistical analysis on a specified group of rows (known as a window), yet they differ in several ways.

    • Window functions return a value for each row from a query as they have specified the window size, which may change as the current row  changes, enabling them to operate on a set of rows in the sliding window. In contrast, aggregate functions return only one row because they group query rows into a single result row.

    • Aggregation functions fail to extract detailed data of columns other than the grouping column along with the aggregate value simultaneously whereas window functions can easily achieve this.

    Syntax

    PARTITION BY: Divides data into partitions by the specified dimension. If PARTITION BY is omitted, query rows are ungrouped.

    ORDER BY: Sorts data in each partition by the specified field. If ORDER BY is omitted, partition rows are unordered.

    ROWS BETWEEN: Specifies both window endpoints. If ROWS BETWEEN is omitted, the default range is from the start point to the current row.

    PRECEDING: Refers to the rows before the current row.

    FOLLOWING: Refers to the rows after the current row.

    CURRENT ROW: Refers to the current row.

    UNBOUNDED: Refers to the start point.

    UNBOUNDED PRECEDING: Specifies that the window starts at the first partition row.

    UNBOUNDED FOLLOWING: Specifies that the window ends at the last partition row.

    SUM

    The source table is shown in the following figure.


    Use the following statement in the Spark SQL operator.

    select 
    Order ID,Product ID,Unit Price
    SUM(Unit Price) OVER(PARTITION BY Product ID ORDER BY Order ID) AS t1,
    SUM(Unit Price) OVER(PARTITION BY Product ID ORDER BY Order ID ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS t2,
    SUM(Unit Price) OVER(PARTITION BY Product ID) AS t3,
    SUM(Unit Price) OVER(PARTITION BY Product ID ORDER BY Order ID ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS t4,
    SUM(Unit Price) OVER(PARTITION BY Product ID ORDER BY Order ID ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING) AS t5,
    SUM(Unit Price) OVER(PARTITION BY Product ID ORDER BY Order ID ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS t6 
    from DB Table Input

    1.png

    The following table describes the statement.

    Statement

    Meaning

    SUM(Unit Price) OVER(PARTITION BY Product ID ORDER BY Order ID) AS t1

    Sums the Unit Price values from the start point to the current row within each Product ID group.

    2.png

    SUM(Unit Price) OVER(PARTITION BY Product ID ORDER BY Order ID ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS t2

    Sums the Unit Price values from the start point to the current row within each Product ID group. The result is the same as t1.

    SUM(Unit Price) OVER(PARTITION BY Product ID) AS t3

    Sums all Unit Price values within each Product ID group.

    SUM(Unit Price) OVER(PARTITION BY Product ID ORDER BY Order ID ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS t4

    Sums the Unit Price values on the current row and its three preceding rows within each Product ID group.

    SUM(Unit Price) OVER(PARTITION BY Product ID order by Order ID ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING) AS t5

    Sums the Unit Price values on the current row, its two preceding rows, and its one following row within each Product ID group.

    SUM(Unit Price) OVER(PARTITION BY Product ID ORDER BY Order ID ROWS   BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS t6

    Sums the Unit Price values on the current row and all following rows within each Product ID group.

    AVG

    The usage of AVERAGE is similar to that of SUM. The following is a statement example.


    select 
    Order ID ,Product ID , Unit Price , 
    avg(Unit Price) OVER(PARTITION BY Product ID ORDER BY Order ID) AS t1 
    from DB Table Input


    MAX

    The usage of MAX is similar to that of SUM.

    MIN

    The usage of MIN is similar to that of SUM.

    NTILE

    NTILE(N): Divides the data rows in each partition into N buckets and returns the number of the bucket to which the current row belongs.

    iconNote:
    NTILE cannot be used with ROWS BETWEEN. If the number of rows is not divisible by the number of buckets, the extra rows are distributed to the first bucket by default.

    The source table is shown in the following figure.

    Use the following statement in the Spark SQL operator.

    select 
    Order IDProduct IDUnit Price
    NTILE(2) OVER(PARTITION BY Product ID ORDER BY Order ID) AS t1,
    NTILE(3) OVER(PARTITION BY Product ID ORDER BY Order ID) AS t2
    NTILE(4) OVER(PARTITION BY Product ID ORDER BY Order ID) AS t3 
    from DB Table Input





    The execution result is shown in the following figure.

    The following table describes the statement.

    Statement

    Meaning

    NTILE(2) OVER(PARTITION BY Product ID ORDER BY Order ID) AS t1

    Divides data within each Product ID group into two buckets.

    NTILE(3) OVER(PARTITION BY Product ID ORDER BY Order ID) AS t2

    Divides data within each Product ID group into three buckets.

    NTILE(4) OVER(PARTITION BY Product ID ORDER BY Order ID) AS t3

    Divides data within each Product ID group into four buckets.

    Adding Sequential Numbers

    ROW_NUMBER(): Returns the number of the current row within its partition. Rows numbers range from 1 to the number of partition rows.

    The source table is shown in the following figure.

    Example:

    select Order IDProduct IDUnit Price, ROW_NUMBER() OVER(PARTITION BY Product ID ORDER BY Unit Price asc) AS t1 from DB Table Input


    iconNote:
    ROW_NUMBER() OVER(PARTITION BY Product ID ORDER BY Unit Price asc) AS t1 adds the sequential number in ascending order by the unit price for rows within the partition.

    Effect Display

    RANK

    RANK(): Returns the rank of the current row within its partition, with gaps. Peers are considered ties and receive the same rank.

    Example:

    select Order IDProduct IDUnit Price, RANK() OVER(PARTITION BY Product ID ORDER BY Unit Price desc) AS RANK from DB Table Input4


    iconNote:
    RANK() OVER(PARTITION BY Product ID ORDER BY Unit Price desc) AS RANK ranks rows within each Product ID group in descending order by the unit price.

    Effect Display

    DENSE_RANK(): Returns the rank of the current row within its partition, without gaps. Peers are considered ties and receive the same rank.

    Example:

    select Order IDProduct IDUnit Price, DENSE_RANK() OVER(PARTITION BY Product ID ORDER BY Unit Price desc) AS DENSE_RANK from DB Table Input5


    iconNote:
    DENSE_RANK() OVER(PARTITION BY Product ID ORDER BY Unit Price desc) AS DENSE_RANK ranks rows within each Product ID group in descending order by the unit price.

    The result is shown in the following figure.

    LAG and LEAD

    LAG(Column name, N, Default value): Returns the column value from the row that precedes the current row by N row(s) within its partition.

    N is optional and defaults to 1.

    If there is no such row, the return value is Default value, or NULL if the default value is missing.

    Example:

    select 
    Order IDProduct IDUnit Price
    ROW_NUMBER() OVER(PARTITION BY Product ID ORDER BY Unit Price) AS ROW_NUMBER,
    LAG(Unit Price,1,10) OVER(PARTITION BY Product ID ORDER BY Unit Price) AS last_1_time
    LAG(Unit Price,1) OVER(PARTITION BY Product ID ORDER BY Unit Price) AS last_2_time 
    from DB Table Input6


    The following table describes the statement.

    Statement

    Description

    LAG(Unit Price,1,10) OVER(PARTITION BY Product ID ORDER BY Unit Price) AS last_1_time

    Returns the Unit Price value on the preceding row of the current row within each Product ID group, with a default value of 10.

    LAG(Unit   Price,1) OVER(PARTITION BY Product   ID ORDER BY Unit Price) AS last_2_time

    Returns the Unit Price value on the preceding row of the current row within each Product ID group, with no default value.
     

     

    LEAD(Column name, N, Default value): Returns the column value from the row that follows the current row by N row(s) within its partition.

    Example:

    select 
    Order IDProduct IDUnit Price,
    LEAD(Unit Price,1,10) OVER(PARTITION BY Product ID ORDER BY Unit Price) AS last_1_time
    LEAD(Unit Price,1) OVER(PARTITION BY Product ID ORDER BY Unit Price) AS last_2_time 
    from DB Table Input7



    Explanation:

    Statement

    Description

    LEAD(Unit Price,1,10) OVER(PARTITION BY Product ID ORDER BY Unit Price) AS last_1_time

    Returns the Unit Price value on the following row of the current row within each Product ID group, with a default value of 10.

    LEAD(Unit   Price,1) OVER(PARTITION BY Product   ID ORDER BY Unit Price) AS last_2_time

    Returns the Unit Price value on the following row of the current row within each Product ID group, with no default value.
     

    FIRST_VALUE and LAST_VALUE

    FIRST_VALUE(Column): Returns the Column value on the first row within the partition after sorting.

    Example:

    select 
    Order IDProduct IDUnit Price
    FIRST_VALUE(Unit Price) OVER(PARTITION BY Product ID ORDER BY Order ID) AS FIRST_VALUE 
    from DB Table Input8


    iconNote:
    FIRST_VALUE(Unit Price) OVER(PARTITION BY Product ID ORDER BY Order ID) AS FIRST_VALUE returns the first value of Unit Price within each partition after data is grouped by Product ID and sorted by Order ID.

    The result is shown in the following figure.

    LAST_VALUE(Column): Returns the Column value on the last row within the partition after sorting.

    Example:

    select 
    Order IDProduct IDUnit Price
    LAST_VALUE(Unit Price) OVER(PARTITION BY Product ID ORDER BY Order ID) AS LAST_VALUE 
    from DB Table Input9
    iconNote:
    LAST_VALUE(Unit Price) OVER(PARTITION BY Product ID ORDER BY Order ID) AS LAST_VALUE returns the last value of Unit Price within each partition after data is grouped by Product ID and sorted by Order ID.


     


    附件列表


    主题: Data Development - Scheduled Task
    Previous
    Next
    • Helpful
    • Not helpful
    • Only read

    滑鼠選中內容,快速回饋問題

    滑鼠選中存在疑惑的內容,即可快速回饋問題,我們將會跟進處理。

    不再提示

    10s後關閉

    Get
    Help
    Online Support
    Professional technical support is provided to quickly help you solve problems.
    Online support is available from 9:00-12:00 and 13:30-17:30 on weekdays.
    Page Feedback
    You can provide suggestions and feedback for the current web page.
    Pre-Sales Consultation
    Business Consultation
    Business: international@fanruan.com
    Support: support@fanruan.com
    Page Feedback
    *Problem Type
    Cannot be empty
    Problem Description
    0/1000
    Cannot be empty

    Submitted successfully

    Network busy