Introduction > Spark SQL Shell > Spark项目 - 分析数据

Spark项目 - 分析数据

上一节中，在nhl数据库中创建好了表，这一节继续对里面的表进行查询分析。

先将所有表加载到内存：

use nhl;

cache table nhl.game;
cache table nhl.game_goalie_stats;
cache table nhl.game_plays;
cache table nhl.game_plays_players;
cache table nhl.game_shifts;
cache table nhl.game_skater_stats;
cache table nhl.game_teams_stats;
cache table nhl.player_info;
cache table nhl.team_info;

注意加载内存是一个比较耗时的操作：

执行group by查询：

SELECT nationality, COUNT(*) FROM player_info GROUP BY nationality;

由于数据已经加载到内存，查询速度非常快：

测试其他复杂的查询：

select player_id, sum(saves) as saves from nhl.game_goalie_stats group by player_id order by saves desc ;

select first_value(pi.player_id),first_value(pi.firstName),first_value(pi.nationality), sum(ggs.saves) as saves 
from nhl.game_goalie_stats ggs,nhl.player_info pi 
where pi.player_id=ggs.player_id 
group by ggs.player_id 
order by saves desc;

查看执行计划：

explain select first_value(ti.teamName),sum(gts.goals) as goals
from nhl.team_info ti,nhl.game_teams_stats gts
where ti.team_id =gts.team_id
group by gts.team_id
order by goals desc;

通过这个项目，我们经历了从数据准备，到数据清洗及建表，最后到查询分析的过程。从中可以体会到spark内存计算的优势。