In the previous section we covered some basic DataFrame operations; this section continues with common DataFrame query expressions.
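The examples below assume a stock-price DataFrame df with columns such as Date, High, and Volume, created in the previous section. As a reminder, a minimal sketch of loading such data might look like this (the file name "stock.csv" and the read options are assumptions about how the data was prepared):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StockQueries").getOrCreate()
# Load the stock data; header and inferSchema let Spark pick up column names and types
df = spark.read.csv("stock.csv", header=True, inferSchema=True)
df.printSchema()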
Count the number of trading days on which the trading volume exceeded 5,000,000:
df.filter("Volume>5000000").count()
# 或者
df.filter(df["Volume"] > 5000000).count()
Group the data by year and find the maximum stock price for each year:
from pyspark.sql.functions import year
df_with_year = df.withColumn("Year", year(df['Date']))
df_with_year.show()
max_year_df = df_with_year.groupBy("Year").max()
max_year_df.show()
max_year_df.select("Year", "max(High)").show()
Compute the monthly average of the High column:
from pyspark.sql.functions import month
df_with_month = df.withColumn("Month", month("Date"))
df_with_month.select("Month", "High").groupBy("Month").mean().show()
df_with_month.select("Month", "High").groupBy("Month").mean().orderBy("Month").show()
The PySpark DataFrame query syntax has a learning curve. If you are already comfortable with SQL, you can write SQL queries directly instead:
from pyspark.sql import SparkSession
# getOrCreate() returns the existing SparkSession if one is already running
app = SparkSession.builder.appName("SQL").getOrCreate()
# Register the DataFrame as a temporary view named "stock"
df.createOrReplaceTempView("stock")
app.sql("select * from stock").show()
app.sql("select count(*) from stock where Volume > 5000000").show()