DataFrame Operations - II

In the previous section we covered some basic DataFrame operations; in this section we continue with common DataFrame queries.
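
The examples below keep using the stock-price DataFrame df from the previous section. As a reminder, here is a minimal sketch of how such a DataFrame might be loaded (the file name "appl_stock.csv" and the app name are only placeholders, use whatever you loaded last time):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe").getOrCreate()
# Read the stock CSV with a header row and let Spark infer column types
df = spark.read.csv("appl_stock.csv", header=True, inferSchema=True)
df.printSchema()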

Count the trading days on which Volume is greater than 5,000,000:

df.filter("Volume>5000000").count()
# or, equivalently
df.filter(df["Volume"] > 5000000).count()
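
For completeness, the same filter can also be written with col() from pyspark.sql.functions; this is just another equivalent way of referencing the column, not a different query:

from pyspark.sql.functions import col

df.filter(col("Volume") > 5000000).count()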

Group the data by year and compute the maximum stock price for each year:

from pyspark.sql.functions import year

df_with_year = df.withColumn("Year", year(df['Date']))
df_with_year.show()

max_year_df = df_with_year.groupBy("Year").max()
max_year_df.show()
max_year_df.select("Year", "max(High)").show()
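
Note that groupBy().max() aggregates every numeric column at once, which is why the extra select("Year", "max(High)") step is needed above. A more targeted alternative is to aggregate only the column you care about; the alias "MaxHigh" below is just an illustrative name:

from pyspark.sql.functions import max as spark_max

df_with_year.groupBy("Year").agg(spark_max("High").alias("MaxHigh")).orderBy("Year").show()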

Compute the average High for each month:

from pyspark.sql.functions import month

df_with_month = df.withColumn("Month", month("Date"))
df_with_month.select("Month", "High").groupBy("Month").mean().show()
df_with_month.select("Month", "High").groupBy("Month").mean().orderBy("Month").show()
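
If you want a tidier result, you can aggregate just the High column and round the averages; avg and format_number come from pyspark.sql.functions, but the column alias "AvgHigh" is only an example:

from pyspark.sql.functions import avg, format_number

monthly = df_with_month.groupBy("Month").agg(avg("High").alias("AvgHigh"))
monthly.select("Month", format_number("AvgHigh", 2).alias("AvgHigh")).orderBy("Month").show()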

Querying with SQL

The PySpark DataFrame query API has a learning curve; if you are already comfortable with SQL, you can write SQL queries directly:

from pyspark.sql import SparkSession

app = SparkSession.builder.appName("SQL").getOrCreate()

# Register the DataFrame as a temporary view
df.createOrReplaceTempView("stock")
app.sql("select * from stock").show()

app.sql("select count(*) from stock where Volume > 5000000").show()
