Spark项目 - 分析数据

上一节中,在nhl数据库中创建好了表,这一节继续对里面的表进行查询分析。

先将所有表加载到内存:

use nhl;

cache table nhl.game;
cache table nhl.game_goalie_stats;
cache table nhl.game_plays;
cache table nhl.game_plays_players;
cache table nhl.game_shifts;
cache table nhl.game_skater_stats;
cache table nhl.game_teams_stats;
cache table nhl.player_info;
cache table nhl.team_info;

注意加载内存是一个比较耗时的操作:

image-20240310003603093

执行group by查询:

SELECT nationality, COUNT(*) FROM player_info GROUP BY nationality;

由于数据已经加载到内存,查询速度非常快:

image-20240310003859632

测试其他复杂的查询:

select player_id, sum(saves) as saves from nhl.game_goalie_stats group by player_id order by saves desc ;

select first_value(pi.player_id),first_value(pi.firstName),first_value(pi.nationality), sum(ggs.saves) as saves 
from nhl.game_goalie_stats ggs,nhl.player_info pi 
where pi.player_id=ggs.player_id 
group by ggs.player_id 
order by saves desc;

image-20240310003942328

查看执行计划:

explain select first_value(ti.teamName),sum(gts.goals) as goals
from nhl.team_info ti,nhl.game_teams_stats gts
where ti.team_id =gts.team_id
group by gts.team_id
order by goals desc;

image-20240310004122422

通过这个项目,我们经历了从数据准备,到数据清洗及建表,最后到查询分析的过程。从中可以体会到spark内存计算的优势。