未信有情皆是累
但能无病便为仙
介绍
这是一个作业,根据搜狗2011.12.30号的500w条搜索数据,做一个数据分析,阅读这条博客前,需要先看下使用和配置hive碰到的坑。
数据分析
条数统计
-
数据总条数
select count(*) from sogou.sogou_table;
-
非空查询条数:
select count(*) from sogou.sogou_ext where keyword is not null and keyword!='';
-
无重复总条数(根据 ts、uid、keyword、url):
select count(*) from (select ts, uid, keyword, url from sogou.sogou_ext group by ts, uid, keyword, url having count(*)=1) a;
-
独立 UID 总数:
select count(distinct(uid)) from sogou.sogou_ext;
关键词分析
-
关键词个数统计
select avg(a.cnt) from (select size(split(keyword,'\\s+')) as cnt from sogou.sogou_ext) a;
-
关键词长度统计
select avg(a.cnt) from (select size(split(keyword,'')) as cnt from sogou.sogou_ext) a;
UID分析
-
UID 的查询次数分布(查询 1 次的 UID 个数,…查询 N 次的 UID个数):
select SUM(IF(uids.cnt=1,1,0)),SUM(IF(uids.cnt=2,1,0)),SUM(IF(uids.cnt=3,1,0)),SUM(IF(uids.cnt=4,1,0)), SUM(IF(uids.cnt>4,1,0)) from (select uid,count(*) as cnt from sogou.sogou_ext group by uid) uids;
-
UID 平均查询次数:
select sum(a.cnt)/count(a.uid) from (select uid,count(*) as cnt from sogou.sogou_ext group by uid) a;
-
查询次数大于 2 次的用户总数:
select count(a.uid) from (select uid,count(*) as cnt from sogou.sogou_ext group by uid having cnt > 2) a; 546343
-
查询次数大于 2 次的用户占比:
select count(distinct (uid)) from sogou.sogou_ext; select uid, count(*) as cnt from sogou.sogou_ext group by uid having cnt > 2) a; 1352645
-
查询次数大于 4 次的数据展示:
select b.* from (select uid,count(*) as cnt from sogou.sogou_ext group by uid having cnt > 4) a join sogou.sogou_ext b on a.uid=b.uid limit 10;
用户行为分析
-
点击次数与 Rank 之间的关系分析:
# Rank 在 10 以内的点击次数占比 select count(*) from sogou.sogou_ext where rank < 11; 4999742 # 搜索结果中位于前十的数目 5000000 # 总条数 比例:4999742/5000000 绝大部分情况下,用户只点击了前10的搜索结果
-
直接输入 URL 作为查询词的比例:
select count(*) from sogou.sogou_ext where keyword like '%www%'; 73979
-
独立用户行为分析(搜索具备多样性,因人而异,主要注意个性化需求):
(1)查询魔兽世界且次数大于3的用户 select uid,count(*) as cnt from sogou.sogou_ext where keyword='魔兽世界' group by uid having cnt > 3; 00500c8db13dca7587b3a84edfc42c4c 4 3b535bb62eb6de7be9866f227fb62fe7 5 b6afa6d0697dbb7d0a4bedbc6bd1f90e 5 fa7fb5c4e265c5482a75175d358ac1ad 4 a64e04683b8c2dda48c8eec50f923c6f 7 e8c9d4a151434e89b6fbb9d5b7365a07 5 (2)查询用户00500c8db13dca7587b3a84edfc42c4c和用户a64e04683b8c2dda48c8eec50f923c6f的相关搜索记录 select * from sogou.sogou_ext where uid='00500c8db13dca7587b3a84edfc42c4c' and keyword like '%魔兽世界%'; select * from sogou.sogou_ext where uid='a64e04683b8c2dda48c8eec50f923c6f' and keyword like '%魔兽世界%';
2011年10-12月关键词搜索分析
-
“金正日”相关搜索频次
select count(*) as cnt from sogou.sogou_ext where keyword='金正日'; 234
-
“英雄联盟”搜索频次
select count(*) as cnt from sogou.sogou_ext where keyword='英雄联盟'; 697
-
“微博”搜索频次(微博实名制)
select count(*) as cnt from sogou.sogou_ext where keyword='微博'; 1561
-
“2012”
select count(*) as cnt from sogou.sogou_ext where keyword='2012'; 225
-
“阿里巴巴”
select count(*) as cnt from sogou.sogou_ext where keyword='阿里巴巴'; 715
-
“腾讯”
select count(*) as cnt from sogou.sogou_ext where keyword='腾讯'; 614
-
“百度”
select count(*) as cnt from sogou.sogou_ext where keyword='百度'; 38437 ???
高校热度
-
“浙江大学”
select count(*) as cnt from sogou.sogou_ext where keyword='浙江大学'; 39
-
“清华大学”
select count(*) as cnt from sogou.sogou_ext where keyword='清华大学'; 79
-
“北京大学”
select count(*) as cnt from sogou.sogou_ext where keyword='北京大学'; 44
-
“复旦大学”
select count(*) as cnt from sogou.sogou_ext where keyword='复旦大学'; 18
-
“上海交通大学”
select count(*) as cnt from sogou.sogou_ext where keyword='上海交通大学'; 10
-
“武汉大学”
select count(*) as cnt from sogou.sogou_ext where keyword='武汉大学'; 49