# CHIMERA Streaming Pipeline (S3-first)

This folder provides a minimal baseline for streaming CHIMERA data directly from AWS S3 without storing the full dataset locally. It supports:
- Clinical data (from manifest columns or per-patient clinical JSON on S3),
- Histopathology (precomputed patch features on S3 are preferred, with a fallback to raw WSI thumbnails),
- MRI (NIfTI or MHA on S3),
using bounded temporary files only when a decoder needs a real file path.

## How it works
- Uses `s3fs/fsspec` to stream objects from `s3://...` URIs.
- For formats that require a real file path (e.g., OpenSlide, and nibabel in some cases), it downloads the object to a temporary file via a small LRU cache (`TempFileCache`) and deletes it on eviction.
- For feature files (`.pt`, `.npy`), it loads them directly from in-memory buffers, avoiding disk I/O.

## Requirements
- Python packages listed in `requirements.txt`.
- On macOS (required for WSI support):
  ```zsh
  brew install openslide
  ```
  Then `pip install -r requirements.txt` to get `openslide-python` bindings.
  For `.mha` MRI support, `SimpleITK` is included in requirements.

## Configure
Edit `config.yaml`:
- Set `s3.bucket` and relevant `prefixes` or point to your manifest CSVs.
- Toggle which modalities to use under `modalities`.
- Control cache size and temp dir under `runtime`.
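As a rough illustration of the shape such a config might take (the exact key names, e.g. `cache_size` and `temp_dir`, are assumptions here and should be matched against the shipped `config.yaml`):

```yaml
s3:
  bucket: your-chimera-bucket
  prefixes:
    pathology: pathology/features/
    radiology: radiology/features/
modalities:
  clinical: true
  wsi: true
  mri: true
runtime:
  cache_size: 4          # max temp files kept by TempFileCache
  temp_dir: /tmp/chimera
```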

## Manifest Format
A CSV with at least:
- `patient_id` (string),
- `label` (0/1),
- Optional modality columns:
  - `path_wsi_feat` (s3 URI to `.pt` or `.npy` patch features),
  - `path_wsi` (s3 URI to raw WSI, e.g., `.svs`/`.tif`),
  - `path_mri_feat` (s3 URI to small `.pt` MRI feature file),
  - `path_mri` (s3 URI to MRI, e.g., `.nii`/`.nii.gz` or `.mha`),
  - Clinical columns as numeric/categorical features, or `clinical_json_s3` pointing to a JSON object per patient.

Minimal example rows:
```
patient_id,label,path_wsi_feat,path_mri,age,gleason
1003,1,s3://bucket/pathology/features/1003_1.pt,s3://bucket/mri/1003.nii.gz,68,7
```

## Run Cross-Validation
```zsh
# Activate your environment, then:
python -m Code.train_eval \
  --manifest s3://your-chimera-bucket/manifests/train_manifest.csv \
  --config Code/config.yaml \
  --folds 5 --seed 42
```
This will:
- stream WSI features or raw WSIs (thumbnail-based fallback),
- stream MRI NIfTI and compute compact features (histogram or stats),
- combine with clinical features (impute+scale/one-hot),
- train a Logistic Regression baseline with 5-fold CV,
- print AUROC, AUPRC, balanced accuracy, and F1 per fold and mean±std.
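The CV loop itself amounts to something like the sketch below, with feature assembly simplified to a plain array (the real `Code.train_eval` builds features from the streamed modalities first):

```python
# Stratified k-fold logistic regression baseline reporting the four
# metrics listed above, one dict of scores per fold.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, balanced_accuracy_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def run_cv(X, y, folds=5, seed=42):
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    scores = []
    for tr, te in skf.split(X, y):
        clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
        clf.fit(X[tr], y[tr])
        prob = clf.predict_proba(X[te])[:, 1]
        pred = (prob >= 0.5).astype(int)
        scores.append({
            "auroc": roc_auc_score(y[te], prob),
            "auprc": average_precision_score(y[te], prob),
            "bacc": balanced_accuracy_score(y[te], pred),
            "f1": f1_score(y[te], pred),
        })
    return scores
```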

## Real-time Streaming vs Local Storage
- Precomputed feature files (`.pt`, `.npy`) are loaded directly from S3 into memory via streamed file-like objects.
- Raw WSIs/MRI typically need a temporary local file for decoders. The cache keeps only a few files (configurable) and reuses them within an epoch; they are evicted and deleted automatically, so no full dataset is ever stored locally.
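The bounded-temp-file idea can be sketched as a small LRU cache; the repo's actual `TempFileCache` may differ in detail:

```python
# Keep at most `max_files` local copies of remote objects; evict and
# delete the least recently used path when the bound is exceeded.
import os
import tempfile
from collections import OrderedDict

import fsspec


class TinyTempFileCache:
    def __init__(self, max_files: int = 4):
        self.max_files = max_files
        self._paths: "OrderedDict[str, str]" = OrderedDict()

    def get(self, uri: str) -> str:
        """Return a local path for `uri`, downloading on first use."""
        if uri in self._paths:
            self._paths.move_to_end(uri)       # mark as recently used
            return self._paths[uri]
        fd, local = tempfile.mkstemp(suffix="_" + os.path.basename(uri))
        with os.fdopen(fd, "wb") as out, fsspec.open(uri, "rb") as src:
            out.write(src.read())
        self._paths[uri] = local
        if len(self._paths) > self.max_files:  # evict the LRU entry
            _, old = self._paths.popitem(last=False)
            os.unlink(old)
        return local
```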

## Tips
- If your bucket uses requester-pays or a specific profile, export env vars or configure AWS profiles accordingly (e.g., `AWS_PROFILE`, `AWS_REGION`).
- For large WSIs, consider using precomputed patch features to keep the pipeline fully streaming and faster. Note: CHIMERA WSI `.pt` files can be large; use selectively or pre-aggregate.
- To switch off a modality, set it to `false` in `config.yaml` under `modalities`.

## Build a Manifest from S3
If no manifest is provided, generate one from the public bucket:
```zsh
python -m Code.build_manifest \
  --clinical-prefix s3://chimera-challenge/v2/task1/clinical_data/ \
  --wsi-feat-prefix s3://chimera-challenge/v2/task1/pathology/features/features/ \
  --mri-feat-prefix s3://chimera-challenge/v2/task1/radiology/features/ \
  --output manifest_task1.csv
```
Then run CV using `--manifest manifest_task1.csv` and set `clinical.label_key` in `Code/config.yaml` to the correct JSON key.
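At its core, a manifest builder lists objects under each prefix, keys them by a patient id parsed from the filename, and joins the listings. The sketch below assumes filenames begin with the patient id (e.g. `1003_1.pt`); the shipped `Code.build_manifest` may use different parsing rules:

```python
# Join S3 listings by patient id and write a minimal manifest CSV.
import csv
import os


def rows_from_listings(wsi_keys, mri_keys):
    """Join two object listings on a patient id parsed from each filename."""
    def by_patient(keys):
        # assumes filenames start with the patient id, e.g. "1003_1.pt"
        return {os.path.basename(k).split("_")[0].split(".")[0]: f"s3://{k}"
                for k in keys}
    wsi, mri = by_patient(wsi_keys), by_patient(mri_keys)
    return [(pid, wsi[pid], mri[pid]) for pid in sorted(wsi.keys() & mri.keys())]


def build_manifest(wsi_prefix, mri_prefix, out_csv):
    import s3fs  # third-party; pip install s3fs
    fs = s3fs.S3FileSystem(anon=True)  # public bucket; adjust for private ones
    rows = rows_from_listings(fs.ls(wsi_prefix), fs.ls(mri_prefix))
    with open(out_csv, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["patient_id", "path_wsi_feat", "path_mri_feat"])
        w.writerows(rows)
```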
