Spark can automatically infer the schema of a JSON dataset and load it as a DataFrame (a SchemaRDD in pre-1.3 versions of Spark). This "automatic" behavior is exposed in two forms:
spark.read.json(path) (formerly jsonFile): loads data from a file or directory in which every line is a complete JSON string (a JSON document that spans multiple lines will fail to parse unless the multiLine option is enabled, as shown later)
spark.read.json(jsonDataset) (formerly jsonRDD): loads data from an existing Dataset[String] (or RDD[String] in older versions) in which every element is a JSON string; a minimal sketch of this form follows
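The second form is not demonstrated below, so here is a minimal sketch, assuming a spark-shell session (the two sample records are made up for illustration):

import spark.implicits._
// Each element of the Dataset is one complete JSON string.
val jsonStrings = Seq(
  """{"name": "John Doe", "age": 30}""",
  """{"name": "Jane Smith", "age": 25}"""
).toDS()
val df = spark.read.json(jsonStrings)  // schema is inferred from the strings
df.printSchema
df.show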
Let's now test the first form, where every line is a complete JSON string; files in this shape are also known as JSON Lines (JSONL).
Save the following content to /home/ec2-user/file.json:
{"name": "John Doe", "age": 30, "city": "New York", "email": "john.doe@example.com", "tags": ["developer", "programmer", "engineer"], "address": {"street": "123 Main St", "city": "New York", "zip": "10001"}}
{"name": "Jane Smith", "age": 25, "city": "Los Angeles", "email": "jane.smith@example.com", "tags": ["designer", "artist"], "address": {"street": "456 Elm St", "city": "Los Angeles", "zip": "90001"}}
{"name": "Alice Johnson", "age": 35, "city": "Chicago", "email": "alice.johnson@example.com", "tags": ["manager"], "address": {"street": "789 Oak St", "city": "Chicago", "zip": "60601"}}
Run spark-shell and load the JSON file:
val zipDF = spark.read.json("/home/ec2-user/file.json")
zipDF.printSchema
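For the three records above, the printed schema should look roughly like this (Spark sorts top-level fields alphabetically and infers JSON integers as long):

root
 |-- address: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- street: string (nullable = true)
 |    |-- zip: string (nullable = true)
 |-- age: long (nullable = true)
 |-- city: string (nullable = true)
 |-- email: string (nullable = true)
 |-- name: string (nullable = true)
 |-- tags: array (nullable = true)
 |    |-- element: string (containsNull = true)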
As the output shows, Spark has parsed the structure of the JSON records, including the nested address struct and the tags array.
Now register the DataFrame as a temporary view and run a computation over it:
zipDF.createOrReplaceTempView("zipTable")
val cityDF = spark.sql("select distinct city from zipTable")
cityDF.count
cityDF.show
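The same query can also be expressed with the DataFrame API instead of SQL; a minimal equivalent sketch:

// select("city") projects the column; distinct() removes duplicate cities.
val cityDF2 = zipDF.select("city").distinct()
cityDF2.count
cityDF2.show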
Next, save the following content to /home/ec2-user/multi-line.json:
[
  {
    "name": "John Doe",
    "age": 30,
    "city": "New York",
    "email": "john.doe@example.com",
    "tags": [
      "developer",
      "programmer",
      "engineer"
    ],
    "address": {
      "street": "123 Main St",
      "city": "New York",
      "zip": "10001"
    }
  },
  {
    "name": "Jane Smith",
    "age": 25,
    "city": "Los Angeles",
    "email": "jane.smith@example.com",
    "tags": [
      "designer",
      "artist"
    ],
    "address": {
      "street": "456 Elm St",
      "city": "Los Angeles",
      "zip": "90001"
    }
  },
  {
    "name": "Alice Johnson",
    "age": 35,
    "city": "Chicago",
    "email": "alice.johnson@example.com",
    "tags": [
      "manager"
    ],
    "address": {
      "street": "789 Oak St",
      "city": "Chicago",
      "zip": "60601"
    }
  }
]
Run spark-shell and load this file, this time with the multiLine option enabled:
val js = spark.read.option("multiLine", true).option("mode", "PERMISSIVE").json("/home/ec2-user/multi-line.json")
js.show
js.printSchema
With multiLine set to true, Spark parses the JSON structure successfully as well, and the inferred schema matches the one from the JSONL file. Without this option, each physical line would be treated as a separate JSON document and would fail to parse; in PERMISSIVE mode such unparseable lines are collected into a _corrupt_record column instead of failing the whole read.
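Since the nested fields and arrays are part of the inferred schema, they can be queried directly. A minimal sketch (explode and the $ column syntax are standard Spark SQL; the column names come from the data above):

import org.apache.spark.sql.functions.explode
import spark.implicits._
// Pull a field out of the nested address struct.
js.select($"name", $"address.zip").show
// Flatten the tags array into one row per tag.
js.select($"name", explode($"tags").as("tag")).show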
Reference: https://stackoverflow.com/questions/38545850/read-multiline-json-in-apache-spark