# Working with CSV

***

### Working with CSVs (The Arrow Way)

While CSVs are "text" and not "binary," PyArrow has a highly optimized, multi-threaded C++ engine for reading them.

**A. The Basic Interaction (Headers vs. No Headers)**

You can control headers through `ReadOptions`.

```python
from pyarrow import csv

# 1. With Headers (Default)
table = csv.read_csv("data.csv")

# 2. Without Headers (Specify your own names)
read_options = csv.ReadOptions(column_names=["id", "name", "score"])
table_no_header = csv.read_csv("data_no_header.csv", read_options=read_options)
```

***

### The "Three Types of Options"

This is the "Secret Sauce" of the Arrow CSV reader. It splits the work into three logical steps:

**I. ReadOptions (The "Structure" Options)**

Controls how the file is accessed and how the initial metadata is handled.

* `use_threads`: True/False (Arrow can read CSVs in parallel!).
* `block_size`: How many bytes to read at once.
* `column_names`: If the file lacks a header.

**II. ParseOptions (The "Delimiter" Options)**

Controls how the raw text is chopped into "cells."

* `delimiter`: Is it a comma, a pipe `|`, or a tab `\t`?
* `quote_char`: How are strings wrapped?
* `escape_char`: How are special characters handled?

**III. ConvertOptions (The "Type" Options)**

This is the most powerful part. It controls how the strings in the CSV are turned into **Arrow Data Types**.

* `column_types`: Force a specific column to be `int64` or `timestamp`.
* `null_values`: A list of strings to treat as Null (e.g., `["NA", "nan", "NULL"]`).
* `strings_can_be_null`: Whether strings matching `null_values` become null in *string* columns too (by default they are kept as literal strings there).

***

### Full Code Example: Using all 3 Options

```python
from pyarrow import csv
import pyarrow as pa

# 1. How to read
read_opts = csv.ReadOptions(
    use_threads=True,
    block_size=1024 * 1024 * 10,  # 10MB blocks to process at a time
    encoding='utf8',              # encoding of the file
    column_names=['user_id', 'is_active']
)

# 2. How to cut the text
parse_opts = csv.ParseOptions(
    delimiter=";",
    quote_char='"'
)

# 3. How to define the data types
convert_opts = csv.ConvertOptions(
    column_types={
        "user_id": pa.int64(),
        "is_active": pa.bool_()
    },
    null_values=["NONE", "N/A"],
    # To read only a subset of columns, e.g.:
    # include_columns=['tip', 'total_bill', 'timestamp'],
)

# Bring it all together
table = csv.read_csv(
    "messy_data.csv", 
    read_options=read_opts, 
    parse_options=parse_opts, 
    convert_options=convert_opts
)
```

#### Why this is better than `pandas.read_csv`:

Pandas' default CSV engine is single-threaded and spends a lot of time "guessing" data types, which makes it slow on large files. Arrow's CSV reader is written in C++, uses **SIMD** for parsing, and can use all your CPU cores. It's often **10x to 50x faster** than Pandas — in fact, `pd.read_csv(..., engine="pyarrow")` uses this very engine under the hood.

***

As we discussed, the Read, Parse, and Convert options give you total control over messy text files.

<details>

<summary>Advanced CSV Handling (another example with the 3 Pillars)</summary>

```python
from pyarrow import csv
import pyarrow as pa

# PILLAR 1: ReadOptions - The "How to access"
read_opts = csv.ReadOptions(
    use_threads=True, 
    column_names=["timestamp", "user_id", "status", "value"], # Use if no header
    skip_rows=1 # Skip the first row if it's junk
)

# PILLAR 2: ParseOptions - The "Physical structure"
parse_opts = csv.ParseOptions(
    delimiter="|", 
    quote_char="'",
    ignore_empty_lines=True
)

# PILLAR 3: ConvertOptions - The "Logic and Types"
convert_opts = csv.ConvertOptions(
    column_types={
        "user_id": pa.int64(),
        "value": pa.float64()
    },
    null_values=["N/A", "MISSING", "null"],
    timestamp_parsers=["%Y-%m-%d %H:%M:%S", "%d/%m/%Y"] # Handle multiple date formats
)

# Execution
table = csv.read_csv(
    "data.csv", 
    read_options=read_opts, 
    parse_options=parse_opts, 
    convert_options=convert_opts
)
```

</details>

***

### Writing to CSV

To implement a streaming CSV write in PyArrow, you use the `CSVWriter`. This is the exact pattern used for processing data that is too large to fit in memory—you iterate, create a `RecordBatch` or `Table`, and "pipe" it to the writer.

Here is a clean, executable version of that pattern. Note the use of `write_table` (or `write_batch`) inside the context manager.

```python
import pyarrow as pa
import pyarrow.csv as csv

# 1. Define the schema
schema = pa.schema([("col", pa.int64())])

# 2. Open the CSVWriter
# You can also pass 'write_options' here to change delimiters, etc.
with csv.CSVWriter("output.csv", schema=schema) as writer:
    for chunk in range(10):
        # Create a range for this chunk
        datachunk = range(chunk * 10, (chunk + 1) * 10)
        
        # Create a table for this chunk
        table = pa.Table.from_arrays(
            [pa.array(datachunk, type=pa.int64())], 
            schema=schema
        )
        
        # Write the chunk to the file
        writer.write_table(table)

print("Streaming write complete. 'output.csv' created.")
```

#### Key Details to Notice:

* **Header Handling:** The `CSVWriter` automatically writes the header based on your `schema` once when the file is opened. Subsequent calls to `write_table` only append the data rows.
* **Memory Efficiency**: Because of the `with` block and the loop, only 10 rows are in memory at any given time. This is the Streaming concept we discussed earlier applied to text files.
* **Data Consistency**: By passing the `schema` to both the `CSVWriter` and the `Table`, you ensure that every chunk matches the expected data type, preventing "dirty" data from hitting your disk.

***

#### What if you want to customize the output?

Just like the `read_csv` options we covered, you can use `WriteOptions` to change how the CSV is formatted (e.g., using a pipe `|` instead of a comma).

```python
write_options = csv.WriteOptions(include_header=True, delimiter="|")

with csv.CSVWriter("output_custom.csv", schema=schema, write_options=write_options) as writer:
    # ... same loop as above ...
    writer.write_table(table)
```

***
