Sharing a good write-up on understanding Hive window functions that I came across on Jianshu: https://www.jianshu.com/p/fd86a9743045
-----------------------------
I. Original Data
Running the queries below against the business table (created in section II) produces different results; comparing them is a good way to see what the OVER() clause does.
Counting with one global window:

select name, count(*) over ()
from business
where month(orderdate) = '4';

+-------+-----------------+--+
| name  | count_window_0  |
+-------+-----------------+--+
+-------+-----------------+--+

Counting within a per-customer window:

select name, count(*) over (partition by name)
from business
where month(orderdate) = '4';

+-------+-----------------+--+
| name  | count_window_0  |
+-------+-----------------+--+
+-------+-----------------+--+
II. Window Functions
Function descriptions:
- OVER(): specifies the window of rows that the analytic function operates on; the window size can change from row to row.
- CURRENT ROW: the current row.
- n PRECEDING: the n rows before the current row.
- n FOLLOWING: the n rows after the current row.
- UNBOUNDED: the partition boundary; UNBOUNDED PRECEDING means the window starts at the first row of the partition, and UNBOUNDED FOLLOWING means it extends to the last row.
- LAG(col, n, default_val): the value of col from the nth row before the current row.
- LEAD(col, n, default_val): the value of col from the nth row after the current row.
- NTILE(n): distributes the rows of an ordered partition into n groups; the groups are numbered starting from 1, and for each row NTILE returns the number of the group that row belongs to. Note: n must be an int.
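Putting several of these together, here is a minimal sketch of the syntax (the table t and columns id, v are hypothetical, used only for illustration):

-- Hypothetical table t(id int, v int); every clause below is described above.
select id, v,
       sum(v)  over(order by id rows between 1 preceding and 1 following) as neighborhood_sum,  -- frame: previous, current, and next row
       lag(v, 1, 0)  over(order by id) as prev_v,   -- v from one row back, 0 if none
       lead(v, 1, 0) over(order by id) as next_v,   -- v from one row ahead, 0 if none
       ntile(3)      over(order by id) as bucket    -- split the ordered rows into 3 numbered groups
from t;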
Create the Hive table and load the data:

create table business(    -- table name inferred from business.txt below
    name string,
    orderdate string,
    cost int)
row format delimited fields terminated by ',';
load data local inpath "/opt/module/datas/business.txt" into table business;
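For reference, the contents of business.txt can be read back out of the result table shown later in this post; it is comma-separated, one order per line (file order is not significant):

jack,2017-01-01,10
jack,2017-01-05,46
jack,2017-01-08,55
jack,2017-02-03,23
jack,2017-04-06,42
mart,2017-04-08,62
mart,2017-04-09,68
mart,2017-04-11,75
mart,2017-04-13,94
neil,2017-05-10,12
neil,2017-06-12,80
tony,2017-01-02,15
tony,2017-01-04,29
tony,2017-01-07,50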
Query the data as required:

1. Customers who purchased in April 2017, together with the total number of purchases that month:

select name, count(*) over ()
from business
where substring(orderdate, 1, 7) = '2017-04';
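If you want each customer listed once rather than once per order, the usual pattern is to add GROUP BY; this works because window functions evaluate after GROUP BY, so the window counts the grouped rows (a sketch; the alias total_buyers is mine):

select name, count(*) over () as total_buyers  -- counts grouped rows, i.e. distinct April customers
from business
where substring(orderdate, 1, 7) = '2017-04'
group by name;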
2. Each purchase together with the purchase total of its month:

select name, sum(cost) over(partition by month(orderdate))
from business;
3. Building on the same data, accumulate each customer's cost by date:
select name, orderdate, cost,
       sum(cost) over() as sample1,  -- sum over all rows
       sum(cost) over(partition by name) as sample2,  -- sum within each name group
       sum(cost) over(partition by name order by orderdate) as sample3,  -- running total within each name group
       sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row) as sample4,  -- same as sample3: from the partition start to the current row
       sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5,  -- current row plus the preceding row
       sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and 1 FOLLOWING) as sample6,  -- preceding row, current row, and following row
       sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING) as sample7  -- current row and all following rows
from business;
ROWS must follow the ORDER BY clause; it constrains the sorted result, using a fixed number of rows to bound how much of the partition the window covers.
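One subtlety worth knowing (standard SQL window semantics, which Hive follows): when ORDER BY is present but no frame is given, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so rows that tie on orderdate all get the same running total. An explicit ROWS frame accumulates them one by one:

-- If two of a customer's orders shared the same orderdate, by_range would be
-- equal for both of them, while by_rows increases row by row.
select name, orderdate, cost,
       sum(cost) over(partition by name order by orderdate) as by_range,
       sum(cost) over(partition by name order by orderdate
                      rows between unbounded preceding and current row) as by_rows
from business;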
4. Each customer's previous purchase date (and the one before that):

select name, orderdate, cost,
       lag(orderdate, 1, '1970-01-01') over(partition by name order by orderdate) as time1,
       lag(orderdate, 2) over(partition by name order by orderdate) as time2
from business;
name  orderdate   cost  time1       time2
jack  2017-01-01  10    1970-01-01  NULL
jack  2017-01-05  46    2017-01-01  NULL
jack  2017-01-08  55    2017-01-05  2017-01-01
jack  2017-02-03  23    2017-01-08  2017-01-05
jack  2017-04-06  42    2017-02-03  2017-01-08
mart  2017-04-08  62    1970-01-01  NULL
mart  2017-04-09  68    2017-04-08  NULL
mart  2017-04-11  75    2017-04-09  2017-04-08
mart  2017-04-13  94    2017-04-11  2017-04-09
neil  2017-05-10  12    1970-01-01  NULL
neil  2017-06-12  80    2017-05-10  NULL
tony  2017-01-02  15    1970-01-01  NULL
tony  2017-01-04  29    2017-01-02  NULL
tony  2017-01-07  50    2017-01-04  2017-01-02
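LEAD works symmetrically: for example, each customer's next purchase date (a sketch, not part of the original post; the sentinel default '9999-12-31' is arbitrary):

select name, orderdate,
       lead(orderdate, 1, '9999-12-31') over(partition by name order by orderdate) as next_time
from business;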
5. NTILE(n) distributes the rows of an ordered partition into n numbered groups (numbered from 1) and returns, for each row, the number of the group it belongs to:

select name, orderdate, cost,
       ntile(5) over(order by orderdate) sorted
from business;
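This is how the classic "orders in the first 20% of time" query is built: with five buckets, bucket 1 holds the earliest fifth of the rows, so filter on it from a subquery (a sketch):

select name, orderdate, cost
from (
    select name, orderdate, cost,
           ntile(5) over(order by orderdate) as sorted
    from business
) t
where sorted = 1;  -- keep only the earliest 20% of orders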
III. Rank
1. Function descriptions
- RANK(): ties get the same rank and the following ranks are skipped, so the highest rank still equals the total row count.
- DENSE_RANK(): ties get the same rank and no ranks are skipped, so the highest rank can be smaller than the row count.
- ROW_NUMBER(): assigns strictly sequential numbers in sort order.
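A minimal sketch of the difference, on three inline rows (the FROM-less UNION ALL trick is only for illustration and assumes Hive 0.13+, which allows SELECT without FROM):

select score,
       rank()       over(order by score desc) as rp,   -- 1, 1, 3
       dense_rank() over(order by score desc) as drp,  -- 1, 1, 2
       row_number() over(order by score desc) as rmp   -- 1, 2, 3
from (select 90 as score union all select 90 union all select 80) t;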
Data preparation:
2. Create the Hive table and load the data
create table score(    -- columns inferred from the query below
    name string,
    subject string,
    score int)
row format delimited fields terminated by "\t";
load data local inpath '/opt/module/datas/score.txt' into table score;
3. Query the data as required
- [list][*][*][*][*][*][/list][code]select name,subject,score,
复制代码- rank() over(partition by subject order byscore desc) rp,
复制代码- dense_rank() over(partition by subject orderby score desc) drp,
复制代码- row_number() over(partition by subject orderby score desc) rmp
The results look like:
name  subject  score  rp  drp  rmp