1. 概述
FineDataLink 中的 SparkSQL 支持使用窗口函数。
窗口函数是一种分析型的OLAP函数。
窗口函数与聚合函数的功能相似,两者都可以对指定数据窗口进行统计分析,但窗口函数与聚合函数又有所区别。
窗口函数可以为每行数据进行一次计算,因为窗口函数指定了数据窗口大小,这个数据窗口大小可能会随着行的变化而滑动变化,可以在这个滑动窗口里进行计算并返回一个值。而聚合函数只返回一行,因为它只能对分组下的所有数据进行统计。
如果想要把除了分组列之外的其他明细数据和聚合值同时提取,聚合函数是实现不了的 ,而窗口函数就可以方便地实现这一点。
2. 语法
PARTITION BY 表示将数据先按指定维度进行分组,如果不指定 PARTITION BY,则不对数据进行分组。
ORDER BY:将各个分组内的数据按指定字段排序,如果不指定 ORDER BY,则不对分区做排序。
ROWS BETWEEN:指定窗口区间,如果不指定 ROWS BETWEEN,默认为从起点到当前行;
PRECEDING:往前
FOLLOWING:往后
CURRENT ROW:当前行
UNBOUNDED:起点
UNBOUNDED PRECEDING:从前面的起点
UNBOUNDED FOLLOWING:到后面的终点
3. 求和
原数据表:
在 SparSQL 中使用如下语句:
select
`订单ID` ,`产品ID` , `单价` ,
SUM(`单价`) OVER(PARTITION BY `产品ID` ORDER BY `订单ID`) AS t1,
SUM(`单价`) OVER(PARTITION BY `产品ID` ORDER BY `订单ID` ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS t2,
SUM(`单价`) OVER(PARTITION BY `产品ID`) AS t3,
SUM(`单价`) OVER(PARTITION BY `产品ID` ORDER BY `订单ID` ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS t4,
SUM(`单价` ) OVER(PARTITION BY `产品ID` order by `订单ID` ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING) AS t5,
SUM(`单价`) OVER(PARTITION BY `产品ID` ORDER BY `订单ID` ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS t6
from DB表输入
语句含义:
语句 | 含义 |
---|---|
SUM(`单价`) OVER(PARTITION BY `产品ID` ORDER BY `订单ID`) AS t1 | 产品ID 分组内从起点到当前行的单价列的值累加 |
SUM(`单价`) OVER(PARTITION BY `产品ID` ORDER BY `订单ID` ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS t2 | 产品ID分组内从起点到当前行的单价列的值累加。和t1一样 |
SUM(`单价`) OVER(PARTITION BY `产品ID`) AS t3 | 产品ID 分组内所有单价列的值累加 |
SUM(`单价`) OVER(PARTITION BY `产品ID` ORDER BY `订单ID` ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS t4 | 产品ID 分组内当前行加当前行的前三行的单价列的值累加 |
SUM(`单价` ) OVER(PARTITION BY `产品ID` order by `订单ID` ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING) AS t5 | 产品ID 分组内当前行加当前行的前两行和当前行的往后一行的单价列值累加 |
SUM(`单价`) OVER(PARTITION BY `产品ID` ORDER BY `订单ID` ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS t6 | 产品ID 分组内当前行加往后所有行的单价列的值累加。 |
4. 求平均
用法与求和相似,例如:
select
`订单ID` ,`产品ID` , `单价` ,
avg(`单价`) OVER(PARTITION BY `产品ID` ORDER BY `订单ID`) AS t1
from DB表输入
5. 最大值
用法与求和相似。
6. 最小值
用法与求和相似。
7. 切片
NTILE(n):用于将分组数据按照顺序切分成 n 片,返回当前切片值。
注:NTILE 不支持 ROWS BETWEEN,如果切片不均匀,默认增加第一个切片的分布。
原表如下图所示:
在 SparSQL 中使用如下语句:
select
`订单ID` ,`产品ID` , `单价` ,
NTILE(2) OVER(PARTITION BY `产品ID` ORDER BY `订单ID`) AS t1,
NTILE(3) OVER(PARTITION BY `产品ID` ORDER BY `订单ID`) AS t2,
NTILE(4) OVER(PARTITION BY `产品ID` ORDER BY `订单ID`) AS t3
from DB表输入-副本
执行结果如下:
语句含义:
语句 | 含义 |
---|---|
NTILE(2) OVER(PARTITION BY `产品ID` ORDER BY `订单ID`) AS t1 | 产品ID 分组内将数据分成 2 片 |
NTILE(3) OVER(PARTITION BY `产品ID` ORDER BY `订单ID`) AS t2 | 产品ID 分组内将数据分成 3 片 |
NTILE(4) OVER(PARTITION BY `产品ID` ORDER BY `订单ID`) AS t3 | 产品ID 为分组内将数据分成 4 片 |
8. 添加序列号
ROW_NUMBER():从 1 开始,按照顺序,生成分组内记录的序列。
原表如下图所示:
示例:
select `订单ID` ,`产品ID` , `单价` , ROW_NUMBER() OVER(PARTITION BY `产品ID` ORDER BY `单价` asc) AS t1
from DB表输入
注: ROW_NUMBER() OVER(PARTITION BY name ORDER BY t0 asc) AS t1 为分组内根据单价升序排列添加序号。
效果查看:
9. 排名
RANK():在分组中的排名,排名相等会在名次中留下空缺位。
示例:
select `订单ID` ,`产品ID` , `单价` , RANK() OVER(PARTITION BY `产品ID` ORDER BY `单价` desc) AS RANK
from DB表输入
注:RANK() OVER(PARTITION BY `产品ID` ORDER BY `单价` desc) AS RANK 产品ID 分组内单价值排名。
效果查看:
DENSE_RANK():在分组中的排名,排名相等会在名次中不会留下空缺位。
示例:
select
`订单ID` ,`产品ID` , `单价` ,
DENSE_RANK() OVER(PARTITION BY `产品ID` ORDER BY `单价` desc) AS DENSE_RANK
from DB表输入
注:DENSE_RANK() OVER(PARTITION BY `产品ID` ORDER BY `单价` desc) AS DENSE_RANK 为「产品ID」分组内单价值的排名。
结果如下:
10. 错位
LAG(col,n,DEFAULT):用于统计窗口内往上第 n 行值。
col 为列名,n 为往上第 n 行(可选, 默认为 1),
DEFAULT 为默认值(当往上第 n 行为 NULL 时候,取默认值,如不指定,则为 NULL)。
示例:
select
`订单ID` ,`产品ID` , `单价` ,
ROW_NUMBER() OVER(PARTITION BY `产品ID` ORDER BY `单价` ) AS ROW_NUMBER,
LAG(`单价`,1,'10') OVER(PARTITION BY `产品ID` ORDER BY `单价`) AS last_1_time,
LAG(`单价`,1) OVER(PARTITION BY `产品ID` ORDER BY `单价`) AS last_2_time
from DB表输入
解释说明:
语句 | 说明 |
---|---|
LAG(`单价`,1,'10') OVER(PARTITION BY `产品ID` ORDER BY `单价`) AS last_1_time | 产品ID 分组内单价往上一行,默认值 10 |
LAG(`单价`,1) OVER(PARTITION BY `产品ID` ORDER BY `单价`) AS last_2_time | 产品ID 分组内单价往上一行,无默认值 |
LEAD(col,n,DEFAULT):用于统计窗口内往下第 n 行值
示例:
select
`订单ID` ,`产品ID` , `单价` ,
LEAD(`单价`,1,'10') OVER(PARTITION BY `产品ID` ORDER BY `单价`) AS last_1_time,
LEAD(`单价`,1) OVER(PARTITION BY `产品ID` ORDER BY `单价`) AS last_2_time
from DB表输入
解释说明:
语句 | 说明 |
---|---|
LEAD(`单价`,1,'10') OVER(PARTITION BY `产品ID` ORDER BY `单价`) AS last_1_time | 产品ID 分组内单价往下一行,默认值 10 |
LEAD(`单价`,1) OVER(PARTITION BY `产品ID` ORDER BY `单价`) AS last_2_time | 产品ID 分组内单价往下一行,无默认值 |
11. 取首尾值
FIRST_VALUE(col):取分组内排序后,截止到当前行,col 列第一个值。
示例:
select
`订单ID` ,`产品ID` , `单价` ,
FIRST_VALUE( `单价`) OVER(PARTITION BY `产品ID` ORDER BY `订单ID` ) AS FIRST_VALUE
from DB表输入
注:FIRST_VALUE( `单价`) OVER(PARTITION BY `产品ID` ORDER BY `订单ID` ) AS FIRST_VALUE 为根据产品ID分组并根据订单ID组内排序后的每组数据单价的第一个值。
结果如下:
LAST_VALUE(col):取分组内排序后,截止到当前行,col 列最后一个值。
示例 :
select
`订单ID` ,`产品ID` , `单价` ,
LAST_VALUE( `单价`) OVER(PARTITION BY `产品ID` ORDER BY `订单ID` ) AS LAST_VALUE
from DB表输入
注:LAST_VALUE( `单价`) OVER(PARTITION BY `产品ID` ORDER BY `订单ID` ) AS LAST_VALUE 为根据产品ID分组并根据订单ID组内排序后的每组数据单价的最后一个值。