1. 概述
FineDataLink 中的 SparkSQL 支援使用視窗函式。
視窗函式是一種分析型的OLAP函式。
視窗函式與聚合函式的功能相似,兩者都可以對指定資料視窗進行統計分析,但視窗函式與聚合函式又有所差別。
視窗函式可以為每行資料進行一次計算,因為視窗函式指定了資料視窗大小,這個資料視窗大小可能會隨着行的變化而滑動變化,可以在這個滑動視窗裏進行計算並傳回一個值。而聚合函式只傳回一行,因為它只能對分組下的所有資料進行統計。
如果想要把除了分組列之外的其他明細資料和聚合值同時提取,聚合函式是實現不了的 ,而視窗函式就可以方便地實現這一點。
2. 文法
PARTITION BY 表示將資料先按指定維度進行分組,如果不指定 PARTITION BY,則不對資料進行分組。
ORDER BY:將各個分組內的資料按指定欄位排序,如果不指定 ORDER BY,則不對分割槽做排序。
ROWS BETWEEN:指定視窗區間,如果不指定 ROWS BETWEEN,預設為從起點到當前行;
PRECEDING:往前
FOLLOWING:往後
CURRENT ROW:當前行
UNBOUNDED:起點
UNBOUNDED PRECEDING:從前面的起點
UNBOUNDED FOLLOWING:到後面的終點
3. 求和
原資料表:

在 SparSQL 中使用如下語句:
select
`訂單ID` ,`產品ID` , `單價` ,
SUM(`單價`) OVER(PARTITION BY `產品ID` ORDER BY `訂單ID`) AS t1,
SUM(`單價`) OVER(PARTITION BY `產品ID` ORDER BY `訂單ID` ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS t2,
SUM(`單價`) OVER(PARTITION BY `產品ID`) AS t3,
SUM(`單價`) OVER(PARTITION BY `產品ID` ORDER BY `訂單ID` ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS t4,
SUM(`單價` ) OVER(PARTITION BY `產品ID` order by `訂單ID` ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING) AS t5,
SUM(`單價`) OVER(PARTITION BY `產品ID` ORDER BY `訂單ID` ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS t6
from DB表輸入
語句含義:
| 語句 | 含義 |
|---|---|
| SUM(`單價`) OVER(PARTITION BY `產品ID` ORDER BY `訂單ID`) AS t1 | 產品ID 分組內從起點到當前行的單價列的值累加
|
| SUM(`單價`) OVER(PARTITION BY `產品ID` ORDER BY `訂單ID` ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS t2 | 產品ID分組內從起點到當前行的單價列的值累加。和t1一樣
|
SUM(`單價`) OVER(PARTITION BY `產品ID`) AS t3 | 產品ID 分組內所有單價列的值累加
|
| SUM(`單價`) OVER(PARTITION BY `產品ID` ORDER BY `訂單ID` ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS t4 | 產品ID 分組內當前行加當前行的前三行的單價列的值累加
|
| SUM(`單價` ) OVER(PARTITION BY `產品ID` order by `訂單ID` ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING) AS t5 | 產品ID 分組內當前行加當前行的前兩行和當前行的往後一行的單價列值累加
|
| SUM(`單價`) OVER(PARTITION BY `產品ID` ORDER BY `訂單ID` ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS t6 | 產品ID 分組內當前行加往後所有行的單價列的值累加。
|
4. 求平均
用法與求和相似,例如:
select
`訂單ID` ,`產品ID` , `單價` ,
avg(`單價`) OVER(PARTITION BY `產品ID` ORDER BY `訂單ID`) AS t1
from DB表輸入 5. 最大值
用法與求和相似。
6. 最小值
用法與求和相似。
7. 切片
NTILE(n):用於將分組資料按照順序切分成 n 片,傳回當前切片值。
注:NTILE 不支援 ROWS BETWEEN,如果切片不均勻,預設增加第一個切片的分佈。
原表如下圖所示:

在 SparSQL 中使用如下語句:
select
`訂單ID` ,`產品ID` , `單價` ,
NTILE(2) OVER(PARTITION BY `產品ID` ORDER BY `訂單ID`) AS t1,
NTILE(3) OVER(PARTITION BY `產品ID` ORDER BY `訂單ID`) AS t2,
NTILE(4) OVER(PARTITION BY `產品ID` ORDER BY `訂單ID`) AS t3
from DB表輸入-複本
執行結果如下:

語句含義:
| 語句 | 含義 |
|---|---|
| NTILE(2) OVER(PARTITION BY `產品ID` ORDER BY `訂單ID`) AS t1 | 產品ID 分組內將資料分成 2 片 |
| NTILE(3) OVER(PARTITION BY `產品ID` ORDER BY `訂單ID`) AS t2 | 產品ID 分組內將資料分成 3 片 |
NTILE(4) OVER(PARTITION BY `產品ID` ORDER BY `訂單ID`) AS t3 | 產品ID 為分組內將資料分成 4 片 |
8. 新增序列號
ROW_NUMBER():從 1 開始,按照順序,生成分組內記錄的序列。
原表如下圖所示:

範例:
select `訂單ID` ,`產品ID` , `單價` , ROW_NUMBER() OVER(PARTITION BY `產品ID` ORDER BY `單價` asc) AS t1
from DB表輸入
注: ROW_NUMBER() OVER(PARTITION BY name ORDER BY t0 asc) AS t1 為分組內根據單價升冪排列新增序號。

效果查看:

9. 排名
RANK():在分組中的排名,排名相等會在名次中留下空缺位。
範例:
select `訂單ID` ,`產品ID` , `單價` , RANK() OVER(PARTITION BY `產品ID` ORDER BY `單價` desc) AS RANK
from DB表輸入
注:RANK() OVER(PARTITION BY `產品ID` ORDER BY `單價` desc) AS RANK 產品ID 分組內單價值排名。

效果查看:

DENSE_RANK():在分組中的排名,排名相等會在名次中不會留下空缺位。
範例:
select
`訂單ID` ,`產品ID` , `單價` ,
DENSE_RANK() OVER(PARTITION BY `產品ID` ORDER BY `單價` desc) AS DENSE_RANK
from DB表輸入
注:DENSE_RANK() OVER(PARTITION BY `產品ID` ORDER BY `單價` desc) AS DENSE_RANK 為「產品ID」分組內單價值的排名。

結果如下:

10. 錯位
LAG(col,n,DEFAULT):用於統計視窗內往上第 n 行值。
col 為列名,n 為往上第 n 行(可選, 預設為 1),
DEFAULT 為預設值(當往上第 n 行為 NULL 時候,取預設值,如不指定,則為 NULL)。
範例:
select
`訂單ID` ,`產品ID` , `單價` ,
ROW_NUMBER() OVER(PARTITION BY `產品ID` ORDER BY `單價` ) AS ROW_NUMBER,
LAG(`單價`,1,'10') OVER(PARTITION BY `產品ID` ORDER BY `單價`) AS last_1_time,
LAG(`單價`,1) OVER(PARTITION BY `產品ID` ORDER BY `單價`) AS last_2_time
from DB表輸入

解譯說明:
| 語句 | 說明 |
|---|---|
| LAG(`單價`,1,'10') OVER(PARTITION BY `產品ID` ORDER BY `單價`) AS last_1_time | 產品ID 分組內單價往上一行,預設值 10
|
| LAG(`單價`,1) OVER(PARTITION BY `產品ID` ORDER BY `單價`) AS last_2_time | 產品ID 分組內單價往上一行,無預設值![]() |
LEAD(col,n,DEFAULT):用於統計視窗內往下第 n 行值
範例:
select
`訂單ID` ,`產品ID` , `單價` ,
LEAD(`單價`,1,'10') OVER(PARTITION BY `產品ID` ORDER BY `單價`) AS last_1_time,
LEAD(`單價`,1) OVER(PARTITION BY `產品ID` ORDER BY `單價`) AS last_2_time
from DB表輸入

解譯說明:
| 語句 | 說明 |
|---|---|
| LEAD(`單價`,1,'10') OVER(PARTITION BY `產品ID` ORDER BY `單價`) AS last_1_time | 產品ID 分組內單價往下一行,預設值 10
|
| LEAD(`單價`,1) OVER(PARTITION BY `產品ID` ORDER BY `單價`) AS last_2_time | 產品ID 分組內單價往下一行,無預設值![]() |
11. 取首尾值
FIRST_VALUE(col):取分組內排序後,截止到當前行,col 列第一個值。
範例:
select
`訂單ID` ,`產品ID` , `單價` ,
FIRST_VALUE( `單價`) OVER(PARTITION BY `產品ID` ORDER BY `訂單ID` ) AS FIRST_VALUE
from DB表輸入
注:FIRST_VALUE( `單價`) OVER(PARTITION BY `產品ID` ORDER BY `訂單ID` ) AS FIRST_VALUE 為根據產品ID分組並根據訂單ID組內排序後的每組資料單價的第一個值。

結果如下:

LAST_VALUE(col):取分組內排序後,截止到當前行,col 列最後一個值。
範例 :
select
`訂單ID` ,`產品ID` , `單價` ,
LAST_VALUE( `單價`) OVER(PARTITION BY `產品ID` ORDER BY `訂單ID` ) AS LAST_VALUE
from DB表輸入注:LAST_VALUE( `單價`) OVER(PARTITION BY `產品ID` ORDER BY `訂單ID` ) AS LAST_VALUE 為根據產品ID分組並根據訂單ID組內排序後的每組資料單價的最後一個值。

