Reading JSON Data with Spark

Spark can automatically infer the schema of a JSON dataset and load it as a DataFrame (known as a SchemaRDD in very early Spark versions). This "automatic" behavior was originally exposed through two methods on the old SQLContext API:

jsonFile: loads data from a directory whose files contain one JSON string per line (a JSON string that spans multiple lines will likely cause a parse error)

jsonRDD: loads data from an existing RDD in which every element is a JSON string
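
In modern Spark both entry points are superseded by spark.read.json, which accepts either a path or a Dataset[String]. A minimal sketch of the two forms (the sample strings and path here are illustrative):

import spark.implicits._

// Form 1: read from files, one JSON object per physical line
val fromFile = spark.read.json("/path/to/data.json")

// Form 2: read from in-memory JSON strings (the modern replacement for jsonRDD)
val jsonDS = Seq("""{"name": "a", "age": 1}""", """{"name": "b", "age": 2}""").toDS()
val fromStrings = spark.read.json(jsonDS)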

One JSON string per line

Let's test the first approach, where every line is a single JSON string. Files in this form are also called JSONL (JSON Lines).

Save the following content to /home/ec2-user/file.json:

{"name": "John Doe", "age": 30, "city": "New York", "email": "john.doe@example.com", "tags": ["developer", "programmer", "engineer"], "address": {"street": "123 Main St", "city": "New York", "zip": "10001"}}
{"name": "Jane Smith", "age": 25, "city": "Los Angeles", "email": "jane.smith@example.com", "tags": ["designer", "artist"], "address": {"street": "456 Elm St", "city": "Los Angeles", "zip": "90001"}}
{"name": "Alice Johnson", "age": 35, "city": "Chicago", "email": "alice.johnson@example.com", "tags": ["manager"], "address": {"street": "789 Oak St", "city": "Chicago", "zip": "60601"}}

Start spark-shell and load the JSON file:

val zipDF = spark.read.json("/home/ec2-user/file.json")
zipDF.printSchema

The inferred schema (Spark's JSON inference lists fields alphabetically) looks like this:

root
 |-- address: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- street: string (nullable = true)
 |    |-- zip: string (nullable = true)
 |-- age: long (nullable = true)
 |-- city: string (nullable = true)
 |-- email: string (nullable = true)
 |-- name: string (nullable = true)
 |-- tags: array (nullable = true)
 |    |-- element: string (containsNull = true)

As the output shows, Spark has parsed the nested JSON structure.

Now run a computation on the DataFrame:

zipDF.createOrReplaceTempView("zipTable")

val cityDF = spark.sql("select distinct city from zipTable")

cityDF.count

cityDF.show

count returns 3, and show lists the three distinct cities: New York, Los Angeles, and Chicago.
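
The same query can also be written with the DataFrame API instead of SQL; a short equivalent sketch (the variable name is illustrative):

val cityDF2 = zipDF.select("city").distinct()
cityDF2.count()
cityDF2.show()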

Multi-line JSON

Save the following content to /home/ec2-user/multi-line.json:

[
  {
    "name": "John Doe",
    "age": 30,
    "city": "New York",
    "email": "john.doe@example.com",
    "tags": [
      "developer",
      "programmer",
      "engineer"
    ],
    "address": {
      "street": "123 Main St",
      "city": "New York",
      "zip": "10001"
    }
  },
  {
    "name": "Jane Smith",
    "age": 25,
    "city": "Los Angeles",
    "email": "jane.smith@example.com",
    "tags": [
      "designer",
      "artist"
    ],
    "address": {
      "street": "456 Elm St",
      "city": "Los Angeles",
      "zip": "90001"
    }
  },
  {
    "name": "Alice Johnson",
    "age": 35,
    "city": "Chicago",
    "email": "alice.johnson@example.com",
    "tags": [
      "manager"
    ],
    "address": {
      "street": "789 Oak St",
      "city": "Chicago",
      "zip": "60601"
    }
  }
]
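
Note that reading this file without the multiLine option would treat each physical line as a separate record; in the default PERMISSIVE mode every line then fails to parse and the result collapses to a single _corrupt_record column. A quick sketch to illustrate:

val broken = spark.read.json("/home/ec2-user/multi-line.json")
broken.printSchema()
// root
//  |-- _corrupt_record: string (nullable = true)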

Start spark-shell and load the file:

val js = spark.read.option("multiLine", true).option("mode", "PERMISSIVE").json("/home/ec2-user/multi-line.json")
js.show
js.printSchema

js.show displays the three records as rows, with the nested address struct and the tags array rendered as columns.

In multiLine mode, Spark also parses the JSON structure successfully:

The printed schema is the same nested schema as in the single-line case: an address struct, a long age, string columns, and a tags array of strings.
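
With the nested schema in place, struct fields and array columns can be queried directly. A sketch using the column names from the sample data (the aliases are illustrative):

import org.apache.spark.sql.functions._
import spark.implicits._

// Pull a field out of the address struct and flatten tags into one row per tag
js.select($"name", $"address.city".alias("addr_city"), explode($"tags").alias("tag")).show()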


Reference: https://stackoverflow.com/questions/38545850/read-multiline-json-in-apache-spark