In the previous section we covered some basic DataFrame operations; this section continues with common DataFrame query expressions.
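The examples below assume a stock-price DataFrame df with columns such as Date, High, and Volume, created in the previous section. As a reminder, a minimal sketch of loading such data might look like this (the file name "stock.csv" and the read options are assumptions about how the data was prepared):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StockQueries").getOrCreate()
# Load the stock data; header and inferSchema let Spark pick up column names and types
df = spark.read.csv("stock.csv", header=True, inferSchema=True)
df.printSchema()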
Count the number of trading days on which the trading volume exceeded 5,000,000:
df.filter("Volume>5000000").count()
# 或者
df.filter(df["Volume"] > 5000000).count()
Group the data by year and find the maximum stock price for each year:
from pyspark.sql.functions import year
df_with_year = df.withColumn("Year", year(df['Date']))
df_with_year.show()
max_year_df = df_with_year.groupBy("Year").max()
max_year_df.show()
max_year_df.select("Year", "max(High)").show()
Compute the monthly average of the High column:
from pyspark.sql.functions import month
df_with_month = df.withColumn("Month", month("Date"))
df_with_month.select("Month", "High").groupBy("Month").mean().show()
df_with_month.select("Month", "High").groupBy("Month").mean().orderBy("Month").show()
The PySpark DataFrame query syntax has a learning curve. If you are already comfortable with SQL, you can write SQL queries directly instead:
from pyspark.sql import SparkSession
# getOrCreate() returns the existing SparkSession if one is already running
app = SparkSession.builder.appName("SQL").getOrCreate()
# Register the DataFrame as a temporary view named "stock"
df.createOrReplaceTempView("stock")
app.sql("select * from stock").show()
app.sql("select count(*) from stock where Volume > 5000000").show()