# Working with JSON

***

{% embed url="https://arrow.apache.org/docs/python/json.html" %}

***

## Working with JSON

Working with JSON in Apache Arrow is a bit different from working with CSVs or Parquet. Because JSON is inherently **hierarchical** and **schema-less**, Arrow's goal is to convert that "messy" nested structure into a rigid, high-performance **Struct Array** or **Map Array** as quickly as possible.

In PyArrow, JSON support is handled by the `pyarrow.json` module.

***

### The Core Reading Process

Like the CSV reader, the JSON reader is a multi-threaded C++ engine. It specifically targets **JSONL** (Line-delimited JSON), where each line is a valid JSON object. This is the standard for big data because it allows the reader to scan the file in parallel.

```python
from pyarrow import json
import pyarrow as pa

# Simple read
table = json.read_json("data.jsonl")

print(table.schema)
```

***

### The Two Pillars of JSON Options

Unlike CSVs, which have three pillars, JSON has two: **`ReadOptions`** and **`ParseOptions`**.

#### I. ReadOptions

Controls the infrastructure of the read.

* `use_threads`: (Default: True) Read blocks of the file in parallel.
* `block_size`: How much text to buffer in memory (default is \~1MB).

#### II. ParseOptions

This is where the magic happens. Since JSON can have different structures on every line, you use `ParseOptions` to define the "target" shape.

* `explicit_schema`: If you don't provide this, Arrow will "guess" (infer) the types by reading the first block. Providing a schema is **much** faster and safer.
* `unexpected_field_behavior`: What should Arrow do if it finds a key that isn't in your schema? Options are `ignore`, `error`, or `infer` (the default).

***

### Handling Nested Data (The Struct Array)

If your JSON looks like this:

`{"id": 1, "location": {"lat": 51.1, "lon": 71.4}}`

Arrow will automatically convert the `location` field into a **Struct Array**.

```python
import pyarrow as pa
from pyarrow import json

# Define the nested structure
schema = pa.schema([
    ("id", pa.int64()),
    ("location", pa.struct([
        ("lat", pa.float64()),
        ("lon", pa.float64())
    ]))
])

parse_options = json.ParseOptions(explicit_schema=schema)
table = json.read_json("nested_data.jsonl", parse_options=parse_options)

# To work with just 'lat', flatten the struct column into its children:
lats = table.column("location").flatten()[0]
```

***

### Converting Arrow Back to JSON

When you want to go the other direction (Arrow to JSON), you have two main paths:

#### A. The "I need a file" Path

PyArrow has no native JSON writer, so the usual bridge is pandas, which can write JSONL directly.

```python
# PyArrow currently has no dedicated 'json.write_json' function.
# The common bridge is Table -> pandas DataFrame -> JSONL.
df = table.to_pandas()  # requires pandas to be installed
df.to_json("output.jsonl", orient="records", lines=True)
```

#### B. The "I need a string" Path (Streaming)

If you are building an API, you likely want to convert your **RecordBatches** into JSON strings to send to a frontend.

```python
import json as python_json

# Convert the table to a list of Python dictionaries (nulls become None)
data_dicts = table.to_pylist()
json_string = python_json.dumps(data_dicts)
```

***

### JSON vs. Parquet vs. Arrow

| **Feature**        | **JSON**            | **Parquet**       | **Arrow (IPC/Feather)** |
| ------------------ | ------------------- | ----------------- | ----------------------- |
| **Human Readable** | Yes                 | No                | No                      |
| **Nested Data**    | Native (but slow)   | Native (Complex)  | Native (StructArrays)   |
| **Performance**    | Slow (Text parsing) | Fast (Compressed) | Blazing (Zero-copy)     |
| **Schema**         | Implicit/None       | Explicit          | Explicit                |

#### The Verdict for Data Engineering

JSON is almost always your "**Ingestion**" format (from an API or a NoSQL DB like MongoDB). Your first step in a pipeline should be to use `pyarrow.json` to "rigidify" that data into a Parquet or Feather file.

Once it's in Arrow format, you get the **Vectorization** and **Zero-copy** benefits we've discussed, which are impossible while the data is still just a "blob" of JSON text.

***
