上一节中,在nhl
数据库中创建好了表,这一节继续对里面的表进行查询分析。
先将所有表加载到内存:
use nhl;
cache table nhl.game;
cache table nhl.game_goalie_stats;
cache table nhl.game_plays;
cache table nhl.game_plays_players;
cache table nhl.game_shifts;
cache table nhl.game_skater_stats;
cache table nhl.game_teams_stats;
cache table nhl.player_info;
cache table nhl.team_info;
注意加载内存是一个比较耗时的操作:
执行group by查询:
SELECT nationality, COUNT(*) FROM player_info GROUP BY nationality;
由于数据已经加载到内存,查询速度非常快:
测试其他复杂的查询:
select player_id, sum(saves) as saves from nhl.game_goalie_stats group by player_id order by saves desc ;
select first_value(pi.player_id),first_value(pi.firstName),first_value(pi.nationality), sum(ggs.saves) as saves
from nhl.game_goalie_stats ggs,nhl.player_info pi
where pi.player_id=ggs.player_id
group by ggs.player_id
order by saves desc;
查看执行计划:
explain select first_value(ti.teamName),sum(gts.goals) as goals
from nhl.team_info ti,nhl.game_teams_stats gts
where ti.team_id =gts.team_id
group by gts.team_id
order by goals desc;
通过这个项目,我们经历了从数据准备
,到数据清洗及建表
,最后到查询分析
的过程。从中可以体会到spark内存计算的优势。