Working with JSON

Working with JSON in Apache Arrow is a bit different from working with CSV or Parquet files. Because JSON is inherently hierarchical and schema-less, Arrow's goal is to convert that "messy" nested structure into a rigid, high-performance Struct Array or Map Array as quickly as possible.

In PyArrow, JSON support is handled by the pyarrow.json module.


The Core Reading Process

Like the CSV reader, the JSON reader is a multi-threaded C++ engine. It specifically targets JSONL (Line-delimited JSON), where each line is a valid JSON object. This is the standard for big data because it allows the reader to scan the file in parallel.
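For example, a JSONL file is simply one complete JSON object per line (hypothetical sample data):

```json
{"id": 1, "name": "alpha"}
{"id": 2, "name": "beta"}
{"id": 3, "name": "gamma"}
```

Because each line is independent, different threads can parse different blocks of the file simultaneously.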


The Two Pillars of JSON Options

Unlike CSVs, which have three pillars, JSON has two: ReadOptions and ParseOptions.

I. ReadOptions

Controls the infrastructure of the read.

  • use_threads: (Default: True) Read blocks of the file in parallel.

  • block_size: How much text to buffer in memory (default is ~1MB).

II. ParseOptions

This is where the magic happens. Since JSON can have different structures on every line, you use ParseOptions to define the "target" shape.

  • explicit_schema: If you don't provide this, Arrow will "guess" (infer) the types by reading the first block. Providing a schema is much faster and safer.

  • unexpected_field_behavior: What should Arrow do if it finds a key that isn't in your schema? Options are ignore, error, or infer (the default).


Handling Nested Data (The Struct Array)

If your JSON looks like this:

{"id": 1, "location": {"lat": 51.1, "lon": 71.4}}

Arrow will automatically convert the location field into a Struct Array.


Converting Arrow Back to JSON

When you want to go the other direction (Arrow to JSON), you have two main paths:

A. The "I need a file" Path

PyArrow does not include a native JSON writer, so the usual approach is to convert rows to Python objects and write JSONL with the standard json module.

B. The "I need a string" Path (Streaming)

If you are building an API, you likely want to convert your RecordBatches into JSON strings to send to a frontend.


JSON vs. Parquet vs. Arrow

| Feature        | JSON                | Parquet           | Arrow (IPC/Feather)   |
| -------------- | ------------------- | ----------------- | --------------------- |
| Human Readable | Yes                 | No                | No                    |
| Nested Data    | Native (but slow)   | Native (Complex)  | Native (StructArrays) |
| Performance    | Slow (Text parsing) | Fast (Compressed) | Blazing (Zero-copy)   |
| Schema         | Implicit/None       | Explicit          | Explicit              |

The Verdict for Data Engineering

JSON is almost always your "Ingestion" format (from an API or a NoSQL DB like MongoDB). Your first step in a pipeline should be to use pyarrow.json to "rigidify" that data into a Parquet or Feather file.

Once it's in Arrow format, you get the Vectorization and Zero-copy benefits we've discussed, which are impossible to do while the data is still just a "blob" of JSON text.

