This document introduces the syntax of the aggregate functions in Spark SQL.
The source table content is shown in the following figure.
count(*): Counts the number of rows retrieved, including rows with null values.
You can use the following statement in Spark SQL to obtain the number of retrieved rows, as shown in the following figure.
select count(*) from DB Table Input
count(Expression): Returns the number of non-null values of the Expression field.
You can use the following statement in Spark SQL to obtain the number of non-null values of the Ship City field, as shown in the following figure.
select count(Ship City) from DB Table Input-1
count(DISTINCT Expression 1 [, Expression2]): Returns the number of rows with different non-null expression values.
You can use the statement in Spark SQL to obtain the number of unique non-null values of the Ship City field, as shown in the following figure.
select count(distinct Ship City) from DB Table Input-2
sum(Column): Returns the sum of all values in the specified column.
You can use the following statement in Spark SQL to obtain each shipper's freight sum, as shown in the following figure.
select Shipper,sum(Freight) from DB Table Input-3group by Shipper
sum (DISTINCT Column): Returns the sum of unique values in the specified column.
You can use the following statement in Spark SQL to obtain the sum of unique Freight values of each shipper, as shown in the following figure.
select Shipper,sum(distinct Freight)from DB Table Input-4group by Shipper
The effect is shown in the following figure.
avg(Column): Returns the mean of all values in the specified column.
You can use the following statement in Spark SQL to obtain the freight mean of each shipper, as shown in the following figure.
select Shipper, avg(Freight)from DB Table Input-5group by Shipper
avg(DISTINCT Column): Returns the mean of unique values in the specified column.
You can use the following statement in Spark SQL to obtain the mean of unique Freight values of each shipper, as shown in the following figure.
select Shipper, avg(distinct Freight)from DB Table Input-6group by Shipper
min(Column): Returns the minimum value in the specified column.
You can use the following statement in Spark SQL to obtain the minimum freight, as shown in the following figure.
select min() from
max (Column): Returns the maximum value in the specified column.
You can use the following statement in Spark SQL to obtain the maximum freight, as shown in the following figure.
select max(Freight) from DB Table Input-8
variance(Column): Returns the sample variance of the values in the specified column, commonly used in statistics.
The result of the following statement is shown in the following figure.
select variance(Unit Price),variance(Cost Price) from DB Table Input-9
var_pop(Column): Returns the population variance of the values in the specified column.
select var_samp(Unit Price),var_samp(Cost Price) from DB Table Input-10
var_samp(col): Returns the sample variance of the values in the specified column.
select var_samp(Unit Price),var_samp(Cost Price) from DB Table Input-11
stddev_pop(Column): Returns the population standard deviation of the values in the specified column.
select stddev_pop(Unit Price),stddev_pop(Cost Price) from DB Table Input-12
stddev_samp(Column): Returns the sample standard deviation of the values in the specified column.
select stddev_samp(Unit Price),stddev_samp(Cost Price) from DB Table Input-13
covar_pop(Column 1, Column 2): Returns the population covariance of the values in the specified column.
select covar_pop(Unit Price) ,(Cost Price) from DB Table Input-14
covar_samp (col 1, col 2): Returns the sample covariance of the values in the specified column.
select covar_samp(Unit Price,Cost Price) from DB Table Input-15
corr(Column 1, Column 2): Returns the correlation coefficient between values in the two specified columns.
select corr(Unit Price,Cost Price) from DB Table Input-16
percentile(Column, Percentile): Returns an estimate for the specified nearest-rank percentile of values in the specified column.
The column must be an integer column. The Percentile value should be within [0, 1]. For example, percentile(Unit Price, 0.5) returns the median of values in the Unit Price column.
select percentile(Unit Price,0.5) from DB Table Input-17
This function is mainly used to generate arrays.
The syntax is collect_list (Expression), where the Expression value is the field name.
Example: For instance, if you want to transform a JSON object column into an array column, you can combine concat_ws and collect_list.
The content of the source object is shown in the following figure. (The table name is JSON Generation.)
You can use the following statement to transform JSON_DATA data into array-format data, as shown in the following figure.
SELECT CONCAT("[",concat_ws(",",collect_list(JSON_DATA)),"]") as a from JSON Generation].
Specify the table in SQL Statement by clicking the button besides Input Source.
滑鼠選中內容,快速回饋問題
滑鼠選中存在疑惑的內容,即可快速回饋問題,我們將會跟進處理。
不再提示
10s後關閉
Submitted successfully
Network busy