.. _batch-evaluation:

========================
Batch Evaluation
========================

The batch evaluator runs the full library of historical balloon launches in
``evaluation/launches.json`` against the **current state of the codebase**, tags
the results with the current git hash, and writes everything — metrics CSV,
interactive HTML report, per-launch comparison plots, and trajectory maps —
into a self-contained timestamped folder.

It is useful for:

* establishing a **baseline** before changing the model or tuning hyperparameters,
* measuring the **effect of a code change** across many flights at once.

Every batch can be reproduced from ``batch_info.json``, which stores the git
hash, branch, commit message, and a snapshot of ``launches.json`` alongside the
results.

For a single-flight workflow with finer-grained per-flight tuning, see
:ref:`single-evaluation`.  All assumptions in
:ref:`single-evaluation-assumptions` (phase detection, smoothing, NaN handling,
reforecast construction, sunset detection) apply identically to batch runs —
each launch is processed by the exact same :py:class:`evaluation.evaluate.BalloonEvaluator`.


System Overview
---------------

The system lives entirely in the ``evaluation/`` directory:

.. list-table::
   :widths: 35 65
   :header-rows: 1

   * - File
     - Purpose
   * - ``evaluation/launches.json``
     - Master metadata file — one entry per historical launch
   * - ``evaluation/launches.example.json``
     - Template you can copy to bootstrap your own ``launches.json``
   * - ``evaluation/run_batch.py``
     - Runs all launches and writes a timestamped batch folder
   * - ``tests/test_validate_launches.py``
     - Auto-parametrized validation that every file referenced in ``launches.json`` exists and is correctly formatted


Step 1 — Populate ``launches.json``
-------------------------------------

``launches.json`` is the single source of truth for all historical flights.
Each entry describes one balloon launch and points to its trajectory and
forecast files.

**Required fields** (the batch will skip the launch if any are missing):

.. list-table::
   :widths: 30 15 55
   :header-rows: 1

   * - Field
     - Type
     - Description
   * - ``shab_name``
     - string
     - Balloon identifier (e.g. ``"SHAB14V"``)
   * - ``organization``
     - string
     - Launching organization (e.g. ``"UA"``, ``"NRL"``, ``"JPL"``)
   * - ``launch_time``
     - string
     - UTC launch time formatted as ``YYYY-MM-DD HH:MM:SS``, e.g. ``"2022-08-22 14:36:00"``
   * - ``sim_time_hr``
     - number
     - Simulation duration in hours.  **Set this manually** — APRS trackers often lose signal near the ground at launch and landing, so auto-detection is unreliable.  Pad by 1–2 hours past expected landing.
   * - ``aprs_file``
     - string
     - Filename only (e.g. ``"SHAB14V-APRS.csv"``).  File must exist in ``balloon_data/``.
   * - ``launch_lat``
     - number
     - Launch latitude in decimal degrees
   * - ``launch_lon``
     - number
     - Launch longitude in decimal degrees
   * - ``launch_alt_m``
     - number
     - Ground elevation at launch site in meters (also used as ``min_alt`` for landing detection)
   * - ``payload_weight_kg``
     - number
     - Payload mass in kg
   * - ``envelope_weight_kg``
     - number
     - Envelope mass in kg
   * - ``balloon_shape``
     - string
     - Envelope shape (e.g. ``"sphere"``)
   * - ``balloon_size``
     - number
     - Balloon diameter in meters (sphere) or characteristic length (trapezoid)
   * - ``gfs_file`` and/or ``era5_file``
     - string
     - Forecast filename (e.g. ``"gfs_0p25_20220822_12.nc"``).  At least one is required; set the other to ``null`` if not available.
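
The required-field rule above can be sketched as a small check.  This is a
minimal illustration, not the runner's actual code: ``missing_required_fields``
is a hypothetical helper, and treating ``null`` the same as a missing key is an
assumption.

.. code-block:: python

   REQUIRED_FIELDS = (
       "shab_name", "organization", "launch_time", "sim_time_hr", "aprs_file",
       "launch_lat", "launch_lon", "launch_alt_m", "payload_weight_kg",
       "envelope_weight_kg", "balloon_shape", "balloon_size",
   )
   FORECAST_FIELDS = ("gfs_file", "era5_file")


   def missing_required_fields(entry):
       """Return the problems that would cause a launch to be skipped."""
       # Assumption: a null (None) value counts the same as a missing key.
       problems = [name for name in REQUIRED_FIELDS if entry.get(name) is None]
       # At least one forecast file must be set; the other may be null.
       if all(entry.get(name) is None for name in FORECAST_FIELDS):
           problems.append("gfs_file or era5_file")
       return problems

An empty list means the entry has everything the table asks for; any other
result names the fields that still need to be filled in.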

**Optional fields** (fall back to current ``config_earth.py`` defaults if omitted):

``callsign``, ``campaign``, ``landing_time``, ``launch_type``,
``areaDensityEnv``, ``cp``, ``absEnv``, ``emissEnv``, ``Upsilon``, and any
``earth_properties`` field (``Cp_air0``, ``Cv_air0``, ``Rsp_air``, ``P0``,
``emissGround``, ``albedo``).

.. tip::
    When the ``campaign`` field is included, the HTML report groups all
    launches that share a campaign name into their own sub-table with its own
    campaign-average row, provided the campaign contains at least two distinct
    balloons (see :ref:`batch-html-summary`).


.. _launch-types:

Launch Types
~~~~~~~~~~~~

The optional ``launch_type`` field describes how the SHAB got off the ground.
EarthSHAB's solar-balloon physics model only describes a self-ascending
solar balloon, so for non-standard deployments the ascent metrics are not
meaningful — see the rules below.

.. list-table::
   :widths: 22 78
   :header-rows: 1

   * - Value
     - Meaning
   * - ``"standard"`` (default)
     - Conventional ground release — SHAB ascends under solar buoyancy alone.
       Full ascent / float / descent metrics are scored.
   * - ``"helium_augmented"``
     - SHAB is partially filled with helium so that buoyancy carries it up
       faster than solar heating alone could.  Float and descent are still
       physically comparable to the model, but the helium-driven ascent rate
       is not — **ascent metrics are reported as N/A** for these flights.
   * - ``"grand_slam"``
     - SHAB is carried aloft by a separate weather balloon and released
       *above* its natural float altitude.  After release, the SHAB
       *descends* through the air column until it reaches its float, then
       floats normally before landing.

       For Grand Slam flights, two evaluator behaviors change:

       * **Ascent metrics are reported as N/A** (the carry-up is not the
         model's ascent).
       * The float-detection bracket is widened from "near peak altitude"
         to the **entire post-apex region of the trajectory**.  Without this
         the detector would clip to the brief weather-balloon release
         plateau and miss the actual SHAB float that follows the descent.

If the field is omitted, the evaluator behaves as if it were ``"standard"``.
The ``compare_batches.py`` and ``summary.html`` reports include a ``Type``
column so you can see at a glance which model assumption was applied to
each row.

.. note::
   ``launch_type`` does **not** change the forward simulation itself —
   EarthSHAB still simulates a self-ascending solar balloon either way.
   The flag only changes which phase metrics are scored against ground
   truth, so non-physical comparisons don't pollute the per-campaign
   averages.
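
The scoring rules for the three launch types can be summarized in a few lines.
The sketch below only restates the table; ``scored_phases`` and the phase names
are hypothetical and are not taken from the evaluator's code.

.. code-block:: python

   ALL_PHASES = ("ascent", "float", "descent")

   # Launch types whose ascent is not driven by solar heating alone.
   ASCENT_NOT_SCORED = {"helium_augmented", "grand_slam"}


   def scored_phases(entry):
       """Return the phases compared against ground truth for one launch."""
       # An omitted (or null) launch_type is treated as "standard".
       launch_type = entry.get("launch_type") or "standard"
       if launch_type in ASCENT_NOT_SCORED:
           return tuple(p for p in ALL_PHASES if p != "ascent")
       return ALL_PHASES

For example, ``scored_phases({"launch_type": "grand_slam"})`` returns only the
float and descent phases, which corresponds to the N/A ascent cells in the
reports.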

**Example entry:**

.. code-block:: json

   {
     "shab_name": "SHAB14V",
     "organization": "UA",
     "callsign": "SHAB14V",
     "campaign": "Schuler-ABQ",
     "launch_time": "2022-08-22 14:36:00",
     "landing_time": null,
     "sim_time_hr": 14,
     "aprs_file": "SHAB14V-APRS.csv",
     "gfs_file": "gfs_0p25_20220822_12.nc",
     "era5_file": "SHAB14V_ERA5_20220822_20220823.nc",
     "launch_lat": 34.60,
     "launch_lon": -106.80,
     "launch_alt_m": 1000.0,
     "payload_weight_kg": 0.9,
     "envelope_weight_kg": 2.1,
     "balloon_shape": "sphere",
     "balloon_size": 5.8
   }

.. note::

   If a launch has both a GFS and an ERA5 forecast file, the batch runner will
   produce **two separate evaluations** — one per forecast type — in the same
   launch output folder.  Both rows appear independently in ``summary.csv``
   and ``summary.html``.
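
As a rough sketch of that rule, the set of evaluations for one entry could be
derived as follows.  ``forecast_runs`` is a hypothetical helper used only for
illustration.

.. code-block:: python

   def forecast_runs(entry):
       """Return (forecast_type, filename) pairs, one per forecast file set."""
       candidates = (
           ("GFS", entry.get("gfs_file")),
           ("ERA5", entry.get("era5_file")),
       )
       # A null or missing filename means that forecast type is not evaluated.
       return [(kind, name) for kind, name in candidates if name]

The example entry above would yield two pairs, one for GFS and one for ERA5,
and therefore two rows in the summary.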


Step 2 — Validate Before Running
----------------------------------

Before running a batch, check that all referenced files exist and are correctly
formatted:

.. code-block:: bash

   pytest tests/test_validate_launches.py --spec

The tests are **auto-parametrized** — adding a new launch to ``launches.json``
instantly adds a full set of validation tests for it.
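
The idea behind auto-parametrization can be illustrated with
``pytest.mark.parametrize``.  This is a sketch, not the contents of
``tests/test_validate_launches.py``: the inline ``LAUNCHES`` list stands in for
the entries read from ``launches.json``, and both ``launch_id`` and the single
check shown are hypothetical.

.. code-block:: python

   import pytest

   # Stand-in for the entries that would be read from launches.json.
   LAUNCHES = [
       {"shab_name": "SHAB14V", "organization": "UA",
        "launch_time": "2022-08-22 14:36:00", "aprs_file": "SHAB14V-APRS.csv"},
   ]


   def launch_id(entry):
       """Readable test ID built from organization, name, and launch date."""
       date = entry["launch_time"].split(" ")[0]
       return f"{entry['organization']}_{entry['shab_name']}_{date}"


   # One test case is generated per entry, so a new launch adds new tests.
   @pytest.mark.parametrize("entry", LAUNCHES, ids=launch_id)
   def test_aprs_file_is_csv(entry):
       assert entry["aprs_file"].endswith(".csv")

Because the parameter list is built from the data, no test code needs to change
when a launch is added.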


Step 3 — Run a Batch
----------------------

Run all launches against the **current codebase state**:

.. code-block:: bash

   python -m evaluation.run_batch --note "baseline evaluation"

The ``--note`` flag is **required**.  Use it to describe what changed since the
last batch (e.g. ``"tuned Upsilon coefficient"``).  The note is stored with the
results so you can remember why each batch was run.

The runner will:

1. Detect the current git hash, branch, commit message, and dirty flag
2. Snapshot the original ``config_earth`` state (so per-launch overrides cannot
   bleed into each other)
3. Create a timestamped output folder: ``evaluation/batches/2026-04-28T1423_a3f9c12/``
4. For each launch in ``launches.json``:

   * validate required fields and file paths,
   * build a complete config-override dict for each forecast type the launch
     supports,
   * run a GFS evaluation and/or an ERA5 evaluation,
   * write per-launch CSV / PNG / interactive HTML map outputs,
   * append one summary row per forecast type.

5. Skip failed launches with a full traceback printed to the console; the
   batch always continues
6. Write ``summary.csv``, ``summary.html``, and ``batch_info.json`` at the batch
   level
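
Steps 2, 4, and 5 amount to a loop that restores the configuration before each
launch and never lets one failure stop the rest.  The sketch below shows that
pattern under simplifying assumptions: the configuration is a plain ``dict``
rather than the ``config_earth`` module, and ``run_all`` and ``evaluate`` are
hypothetical names.

.. code-block:: python

   import copy
   import traceback


   def run_all(launches, config, evaluate):
       """Run every launch, restoring the config snapshot before each one."""
       baseline = copy.deepcopy(config)  # snapshot of the original state
       succeeded, failed = [], {}
       for entry in launches:
           name = entry.get("shab_name", "<unnamed>")
           # Undo any overrides left behind by the previous launch.
           config.clear()
           config.update(copy.deepcopy(baseline))
           try:
               evaluate(entry, config)
               succeeded.append(name)
           except Exception as exc:  # a failure must not stop the batch
               traceback.print_exc()
               failed[name] = str(exc)
       return succeeded, failed

The two return values correspond to the ``launches_succeeded`` and
``launches_failed`` entries recorded in ``batch_info.json``.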

**Console output example:**

.. code-block:: text

   ============================================================
     EarthSHAB Batch Evaluation
     Batch ID : 2026-04-28T1423_a3f9c12
     Note     : baseline evaluation
     Launches : 2
   ============================================================

   ── UA_SHAB14V_2022-08-22 ──
     Running GFS...
     [GFS] done

     Running ERA5...
     [ERA5] done

   ── UA_SHAB1_2020-10-01 ──
     Running GFS...
     [GFS] done

   Summary → evaluation/batches/2026-04-28T1423_a3f9c12/summary.csv
   Report  → evaluation/batches/2026-04-28T1423_a3f9c12/summary.html
   Batch info → evaluation/batches/2026-04-28T1423_a3f9c12/batch_info.json

   ============================================================
     Batch complete: 2/2 launches succeeded
     Total runtime: 142.3s
   ============================================================


Output Structure
-----------------

Each batch produces a self-contained folder:

.. code-block:: text

   evaluation/batches/
   └── 2026-04-28T1423_a3f9c12/
       ├── batch_info.json           ← git hash, note, runtime, launch status, launches.json snapshot
       ├── summary.csv               ← all metrics, one row per launch × forecast type
       ├── summary.html              ← interactive sortable, color-coded report
       ├── UA_SHAB14V_2022-08-22/
       │   ├── SHAB14V-APRS_GFS_2022_8_22.csv
       │   ├── SHAB14V-APRS_GFS_2022_8_22.png
       │   ├── EVALUATION_SHAB14V-APRS_GFS_2022_8_22.html
       │   ├── SHAB14V-APRS_ERA5_2022_8_22.csv
       │   ├── SHAB14V-APRS_ERA5_2022_8_22.png
       │   └── EVALUATION_SHAB14V-APRS_ERA5_2022_8_22.html
       └── UA_SHAB1_2020-10-01/
           ├── SHAB1-APRS_GFS_2020_10_1.csv
           ├── SHAB1-APRS_GFS_2020_10_1.png
           └── EVALUATION_SHAB1-APRS_GFS_2020_10_1.html
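
The folder and file names in this tree follow a visible pattern.  The sketch
below reproduces it from a launch entry; the pattern is inferred from the
example tree only, and ``launch_folder`` and ``output_stem`` are hypothetical
helpers.

.. code-block:: python

   from datetime import datetime
   from pathlib import Path


   def _launch_datetime(entry):
       return datetime.strptime(entry["launch_time"], "%Y-%m-%d %H:%M:%S")


   def launch_folder(entry):
       """Per-launch folder name: organization, balloon name, launch date."""
       launch = _launch_datetime(entry)
       return f"{entry['organization']}_{entry['shab_name']}_{launch:%Y-%m-%d}"


   def output_stem(entry, forecast_type):
       """Shared stem of the CSV / PNG / HTML outputs for one forecast type."""
       launch = _launch_datetime(entry)
       aprs_stem = Path(entry["aprs_file"]).stem
       # Month and day are not zero-padded in the example tree.
       return (f"{aprs_stem}_{forecast_type}_"
               f"{launch.year}_{launch.month}_{launch.day}")

For the ``SHAB14V`` example entry, these produce the folder and file stems
shown in the tree.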

``batch_info.json`` records everything needed to reproduce or understand the batch:

.. code-block:: text

   {
     "batch_id": "2026-04-28T1423_a3f9c12",
     "note": "baseline evaluation",
     "git_hash": "a3f9c12",
     "git_branch": "devel",
     "git_commit_message": "Added batch evaluator",
     "git_dirty": false,
     "earthshab_version": "1.3",
     "total_runtime_s": 142.3,
     "per_launch_avg_runtime_s": 71.15,
     "launches_attempted": ["UA_SHAB14V_2022-08-22", "UA_SHAB1_2020-10-01"],
     "launches_succeeded": ["UA_SHAB14V_2022-08-22", "UA_SHAB1_2020-10-01"],
     "launches_failed": {},
     "launches_json_snapshot": { ... full launches.json ... }
   }

.. note::
   ``git_dirty: true`` means there were uncommitted changes when the batch ran.
   Results from dirty batches should be treated as exploratory — commit your
   changes before a batch you intend to keep.
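
The git fields and the batch ID could be gathered roughly as follows.  This is
a sketch using standard ``git`` commands through ``subprocess``; the helper
names are hypothetical, and counting untracked files as "dirty" is an
assumption about how the flag is defined.

.. code-block:: python

   import subprocess
   from datetime import datetime


   def git(*args):
       """Run a git command in the current repository and return its output."""
       result = subprocess.run(
           ["git", *args], capture_output=True, text=True, check=True
       )
       return result.stdout.strip()


   def git_info():
       """Collect the git fields stored in batch_info.json."""
       return {
           "git_hash": git("rev-parse", "--short", "HEAD"),
           "git_branch": git("rev-parse", "--abbrev-ref", "HEAD"),
           "git_commit_message": git("log", "-1", "--pretty=%s"),
           # Any output from `git status --porcelain` means pending changes.
           "git_dirty": bool(git("status", "--porcelain")),
       }


   def make_batch_id(now, git_hash):
       """Combine a minute-resolution timestamp with the short git hash."""
       return f"{now:%Y-%m-%dT%H%M}_{git_hash}"

Calling ``make_batch_id(datetime.now(), git_info()["git_hash"])`` inside a
repository gives an ID in the same shape as the folder names shown above.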


.. _batch-html-summary:

Batch Summary Table
--------------------------------------


|eval_html|

.. |eval_html| image:: ../../../img/evaluation_batch_summary_html.png
   :width: 100%
   :alt: Batch summary HTML report


**Tables**

The page always contains an **Overall Summary** table covering every launch ×
forecast pair in the batch.  In addition, every distinct value of the optional
``campaign`` field becomes its own **Campaign sub-table** below — but only if
the campaign contains at least 2 distinct balloons (single-flight campaigns
are suppressed to avoid noise).  Each campaign sub-table has its own
*Campaign Average* footer row.
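
The sub-table rule can be sketched as a grouping step.  In this illustration
``campaign_tables`` is a hypothetical helper, and identifying a balloon by its
``shab_name`` is an assumption.

.. code-block:: python

   from collections import defaultdict


   def campaign_tables(rows, min_balloons=2):
       """Group summary rows by campaign, dropping single-balloon campaigns."""
       grouped = defaultdict(list)
       for row in rows:
           if row.get("campaign"):  # rows without a campaign are not grouped
               grouped[row["campaign"]].append(row)
       return {
           name: members
           for name, members in grouped.items()
           # Count distinct balloons, not rows: GFS and ERA5 rows for the
           # same balloon count once.
           if len({m["shab_name"] for m in members}) >= min_balloons
       }

A campaign with one balloon evaluated against both GFS and ERA5 has two rows
but only one distinct balloon, so it would not get its own sub-table.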

**Columns**

Each row represents one launch × one forecast type.  Columns are grouped:

.. list-table::
   :widths: 22 78
   :header-rows: 1

   * - Group
     - Columns
   * - Identity
     - ``Launch``, ``Fcst`` (GFS / ERA5), ``Type`` (see :ref:`launch-types`), ``Payload (kg)``, ``Bal Ø (m)``
   * - Float Alt (m)
     - ``Sim``, ``Truth``, ``%Δ`` (signed percentage difference Sim vs Truth)
   * - Ascent (m/s)
     - ``Sim``, ``Truth``, ``%Δ``. **N/A** for ``helium_augmented`` and ``grand_slam`` rows (the model's solar-ascent physics does not apply when buoyancy is augmented or the SHAB is carried up by a weather balloon)
   * - Descent (m/s)
     - ``Sim``, ``Truth``, ``%Δ``
   * - Time to Float (min)
     - ``Sim``, ``Truth``, ``Δ(min)`` (signed minutes, sim − truth)
   * - Time to Ground (min)
     - ``Sim``, ``Truth``, ``Δ(min)``
   * - End-to-end errors
     - ``Land Dist (km)`` great-circle landing miss, ``|Time Δ|`` (min, absolute landing-time miss), ``Temp MAE (K)``, ``Press MAE (Pa)``

The averages footer row shows the column-wise mean of all *successful* rows.
Failed launches are shown as a single full-width red row with the failure
message and are pinned to the bottom of any sort.
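
Two of the column definitions are easy to state as formulas.  The sketch below
is illustrative only: it assumes ``%Δ`` is normalized by the truth value and
that the great-circle miss uses the haversine formula with a mean Earth radius
of 6371 km; neither detail is confirmed by the report itself.

.. code-block:: python

   import math


   def percent_delta(sim, truth):
       """Signed percentage difference of sim relative to truth."""
       return 100.0 * (sim - truth) / truth


   def landing_distance_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
       """Great-circle (haversine) distance between two points, in km."""
       phi1, phi2 = math.radians(lat1), math.radians(lat2)
       dphi = phi2 - phi1
       dlam = math.radians(lon2 - lon1)
       a = (math.sin(dphi / 2) ** 2
            + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
       return 2 * radius_km * math.asin(math.sqrt(a))

With this convention, a positive ``%Δ`` means the simulation overshot the
observed value and a negative one means it undershot.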

**Color coding**

The color rules differ between *percent-difference* cells and
*absolute-error* cells:

.. list-table::
   :widths: 30 15 25 30
   :header-rows: 1

   * - Cell type
     - Green (≤)
     - Yellow (≤)
     - Red (>)
   * - ``%Δ`` columns (Float / Ascent / Descent)
     - 10 %
     - 25 %
     - 25 %
   * - Time-to-float diff (min, abs)
     - 15
     - 45
     - 45
   * - Time-to-ground diff (min, abs)
     - 30
     - 90
     - 90
   * - Land Dist (km)
     - 20
     - 50
     - 50
   * - \|Time Δ\| (min, abs)
     - 30
     - 90
     - 90
   * - Temp MAE (K)
     - 5
     - 15
     - 15
   * - Press MAE (Pa)
     - 500
     - 2000
     - 2000

Cells with no value (NaN, missing, etc.) display as ``—`` and are uncolored.
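
The thresholds in the table translate directly into a lookup.  The sketch below
is not the report's code: ``cell_color`` and the dictionary keys are
hypothetical, and judging signed differences by their absolute value is an
assumption.

.. code-block:: python

   import math

   # (green_max, yellow_max) per cell type, copied from the table above.
   THRESHOLDS = {
       "pct_delta": (10, 25),
       "time_to_float_min": (15, 45),
       "time_to_ground_min": (30, 90),
       "land_dist_km": (20, 50),
       "abs_time_delta_min": (30, 90),
       "temp_mae_k": (5, 15),
       "press_mae_pa": (500, 2000),
   }


   def cell_color(kind, value):
       """Return 'green', 'yellow', 'red', or None for an uncolored cell."""
       if value is None or math.isnan(value):
           return None  # missing values are left uncolored
       green_max, yellow_max = THRESHOLDS[kind]
       magnitude = abs(value)  # signed differences are judged by size
       if magnitude <= green_max:
           return "green"
       if magnitude <= yellow_max:
           return "yellow"
       return "red"

Both boundaries are inclusive, so a ``%Δ`` of exactly 25 is still yellow and
anything larger is red.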


.. tip::

    **Sorting**: Click any column header to sort the table by that column.
    Failed rows always sort to the bottom regardless of direction.
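
The ordering rule can be expressed in a few lines.  This Python sketch only
illustrates the rule and is not the report's sorting code; ``sort_rows`` and
the ``failed`` flag are hypothetical, and every successful row is assumed to
have a value in the chosen column.

.. code-block:: python

   def sort_rows(rows, column, descending=False):
       """Sort rows by one column, keeping failed rows at the bottom."""
       ok = [r for r in rows if not r.get("failed")]
       failed = [r for r in rows if r.get("failed")]
       # Only successful rows take part in the sort; failed rows keep
       # their relative order and are appended afterwards.
       ok.sort(key=lambda r: r[column], reverse=descending)
       return ok + failed

Reversing the direction reorders the successful rows only, so failed launches
stay last either way.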