搜狗数据分析

Hadoop/Hive

Posted by bbkgl on November 1, 2019

未信有情皆是累

但能无病便为仙

介绍

这是一个作业,根据搜狗2011.12.30号的500w条搜索数据,做一个数据分析,阅读这条博客前,需要先看下使用和配置hive碰到的坑

数据分析

条数统计

  1. 数据总条数

    select count(*) from sogou.sogou_table;
    

    5dbbd6cbbd461d945ad8c5db

  2. 非空查询条数:

    select count(*) from sogou.sogou_ext where keyword is not null and keyword!='';
    

    5dbbd72abd461d945ad8cb97

  3. 无重复总条数(根据 ts、uid、keyword、url):

    select count(*) from (select ts, uid, keyword, url from sogou.sogou_ext group by ts, uid, keyword, url having count(*)=1) a;
    

    H9a3740f9ab7543f4a4461dee5faa7b122

  4. 独立 UID 总数:

    select count(distinct(uid)) from sogou.sogou_ext;
    

    5dbbd8afbd461d945ad8e73d

关键词分析

  1. 关键词个数统计

    select avg(a.cnt) from (select size(split(keyword,'\\s+')) as cnt from sogou.sogou_ext) a;
    

    5dbbd98bbd461d945ad8ff70

  2. 关键词长度统计

    select avg(a.cnt) from (select size(split(keyword,'')) as cnt from sogou.sogou_ext) a;
    

    Hdfbc4bfe80cb451589635069addc3a16t

UID分析

  1. UID 的查询次数分布(查询 1 次的 UID 个数,…查询 N 次的 UID个数):

    select SUM(IF(uids.cnt=1,1,0)),SUM(IF(uids.cnt=2,1,0)),SUM(IF(uids.cnt=3,1,0)),SUM(IF(uids.cnt=4,1,0)), SUM(IF(uids.cnt>4,1,0)) from
    (select uid,count(*) as cnt from sogou.sogou_ext group by uid) uids;
    

    5dbbdd75bd461d945ad93b58

  2. UID 平均查询次数:

    select sum(a.cnt)/count(a.uid) from (select uid,count(*) as cnt from sogou.sogou_ext group by uid) a;
    

    5dbbde01bd461d945ad948da

  3. 查询次数大于 2 次的用户总数:

    select count(a.uid) from (select uid,count(*) as cnt from sogou.sogou_ext group by uid having cnt > 2) a;
       
    546343
    
  4. 查询次数大于 2 次的用户占比:

    select count(distinct (uid)) from sogou.sogou_ext;
    select uid, count(*) as cnt from sogou.sogou_ext group by uid having cnt > 2) a;
       
    1352645
    
  5. 查询次数大于 4 次的数据展示:

    select b.* from (select uid,count(*) as cnt from sogou.sogou_ext group by uid having cnt > 4) a join sogou.sogou_ext b on a.uid=b.uid limit 10;
    

    Hd6c3e269c95343188ce633dfac5a14eeD

用户行为分析

  1. 点击次数与 Rank 之间的关系分析:

    # Rank  10 以内的点击次数占比
    select count(*) from sogou.sogou_ext where rank < 11;
       
    4999742    # 搜索结果中位于前十的数目
       
    5000000    # 总条数
       
    比例:4999742/5000000   绝大部分情况下,用户只点击了前10的搜索结果
    
  2. 直接输入 URL 作为查询词的比例:

    select count(*) from sogou.sogou_ext where keyword like '%www%';
       
    73979
    
  3. 独立用户行为分析(搜索具备多样性,因人而异,主要注意个性化需求):

    1)查询魔兽世界且次数大于3的用户
    select uid,count(*) as cnt from sogou.sogou_ext where keyword='魔兽世界' group by uid having cnt > 3;
       
    00500c8db13dca7587b3a84edfc42c4c	4
    3b535bb62eb6de7be9866f227fb62fe7	5
    b6afa6d0697dbb7d0a4bedbc6bd1f90e	5
    fa7fb5c4e265c5482a75175d358ac1ad	4
    a64e04683b8c2dda48c8eec50f923c6f	7
    e8c9d4a151434e89b6fbb9d5b7365a07	5
       
    2)查询用户00500c8db13dca7587b3a84edfc42c4c和用户a64e04683b8c2dda48c8eec50f923c6f的相关搜索记录
       
    select * from sogou.sogou_ext where uid='00500c8db13dca7587b3a84edfc42c4c' and keyword like '%魔兽世界%';
       
       
    select * from sogou.sogou_ext where uid='a64e04683b8c2dda48c8eec50f923c6f' and keyword like '%魔兽世界%';
    

    Hb42b5856ad564c61807e3ec20852ae3dd

    H1906ca3b5830495cb64fabd30af33612G

2011年10-12月关键词搜索分析

  1. “金正日”相关搜索频次

    select count(*) as cnt from sogou.sogou_ext where keyword='金正日';
       
    234
    
  2. “英雄联盟”搜索频次

    select count(*) as cnt from sogou.sogou_ext where keyword='英雄联盟';
       
    697
    
  3. “微博”搜索频次(微博实名制)

    select count(*) as cnt from sogou.sogou_ext where keyword='微博';
       
    1561
    
  4. “2012”

    select count(*) as cnt from sogou.sogou_ext where keyword='2012';
       
    225
    
  5. “阿里巴巴”

    select count(*) as cnt from sogou.sogou_ext where keyword='阿里巴巴';
       
    715
    
  6. “腾讯”

    select count(*) as cnt from sogou.sogou_ext where keyword='腾讯';
       
    614
    
  7. “百度”

    select count(*) as cnt from sogou.sogou_ext where keyword='百度';
       
    38437   ???
    

高校热度

  1. “浙江大学”

    select count(*) as cnt from sogou.sogou_ext where keyword='浙江大学';
       
    39
    
  2. “清华大学”

    select count(*) as cnt from sogou.sogou_ext where keyword='清华大学';
       
    79
    
  3. “北京大学”

    select count(*) as cnt from sogou.sogou_ext where keyword='北京大学';
       
    44
    
  4. “复旦大学”

    select count(*) as cnt from sogou.sogou_ext where keyword='复旦大学';
       
    18
    
  5. “上海交通大学”

    select count(*) as cnt from sogou.sogou_ext where keyword='上海交通大学';
       
    10
    
  6. “武汉大学”

    select count(*) as cnt from sogou.sogou_ext where keyword='武汉大学';
       
    49