Case study on the ENSO effect

🧠 I built a Reinforcement Learning system using PSPO + LoRA, trained on ENSO climate dynamics — here's what I learned.

Most RL tutorials give you CartPole. I wanted something real.

So I built a full pipeline: a custom Proximal Softmax Policy Optimization (PSPO) trainer, a LoRA-adapted policy network, and an environment simulating ENSO (El Niño-Southern Oscillation) — one of the most complex coupled ocean-atmosphere systems on Earth.

Here's the breakdown.

🌊 The Problem Domain — ENSO

ENSO drives global weather patterns. El Niño and La Niña phases cause droughts, floods, and economic disruption across continents. Predicting and responding to them requires long-horizon reasoning — exactly the kind of task where standard RL struggles.

The RL agent had to learn optimal climate-monitoring interventions (from passive observation to full atmospheric reanalysis injection) across three phases: Neutral, El Niño, and La Niña. The 5-dimensional state space encodes sea surface temperature, the Southern Oscillation Index, thermocline depth, the trade wind index, and the Oceanic Niño Index.
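For concreteness, here's a hypothetical snapshot of that state vector (the ordering, units, and discrete action count below are my assumptions, not the project's exact encoding):

```python
import numpy as np

# [SST anomaly (degC), SOI, thermocline depth anomaly (m), trade wind index, ONI]
state = np.array([1.8, -1.2, -15.0, -0.8, 1.5])  # an El Nino-like snapshot

# Discrete interventions, ordered by intensity:
# 0 = passive observation ... N-1 = full atmospheric reanalysis injection
NUM_ACTIONS = 5  # illustrative count
```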

⚙️ Why PSPO instead of vanilla PPO?

Standard PPO uses a fixed clipping ratio. PSPO adds a softmax temperature parameter τ that scales the probability ratio before clipping:

r = exp((log π_new − log π_old) / τ)

With τ = 1.5, the policy updates are smoother — preventing overconfident steps in stochastic environments like ENSO where a single El Niño episode can cause catastrophic reward collapse.
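Here's a minimal NumPy sketch of that surrogate (the clipping threshold ε = 0.2 and the pessimistic-min loss form are my assumptions, carried over from standard PPO):

```python
import numpy as np

def pspo_surrogate(logp_new, logp_old, advantages, tau=1.5, eps=0.2):
    # Temperature-scaled probability ratio: tau > 1 softens each update
    ratio = np.exp((logp_new - logp_old) / tau)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # PPO-style pessimistic surrogate: take the worse of the two terms,
    # then average over the batch (this is the objective to maximize)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```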

PSPO also incorporates adaptive KL scheduling: if the KL divergence exceeds 1.5× the target, the learning rate drops 10%; if it falls below 0.5× the target, the rate increases 5%. This mirrors techniques used in RLHF for large language models — which is exactly the point. PSPO was designed with LLM fine-tuning in mind, and this project stress-tests those ideas in a climate control domain.
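A sketch of that scheduler logic (the KL target value and the LR bounds here are illustrative, not the project's actual settings):

```python
def adapt_lr(lr, kl, kl_target=0.01, lr_min=1e-5, lr_max=1e-2):
    # KL overshoot: the update was too aggressive, back off the LR by 10%
    if kl > 1.5 * kl_target:
        lr *= 0.9
    # KL undershoot: updates are too timid, speed up by 5%
    elif kl < 0.5 * kl_target:
        lr *= 1.05
    return max(lr_min, min(lr, lr_max))
```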

🔧 Why LoRA for a Policy Network?

LoRA (Low-Rank Adaptation) is typically used to fine-tune LLMs efficiently. The insight here: the same principle applies to RL policy networks.

Instead of updating all weights W (5→64→5 MLP), only two small matrices are trained per layer:

  • A ∈ ℝ^{r×d_in}
  • B ∈ ℝ^{d_out×r}

The effective weight delta is ΔW = (α/r) · B · A — with rank r=4 and α=16, the scale factor is 4×, and only ~0.4% of parameters are trainable. Base weights are frozen, preserving any pre-trained knowledge while adapting to the new ENSO task.

B is initialized to zero (so the low-rank delta starts at exactly zero), and A with small Gaussian noise. This is critical — it means the fine-tuned policy starts identical to the base policy and diverges gradually, which pairs perfectly with PSPO's conservative update philosophy.
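Putting the pieces together, a minimal NumPy sketch of such a layer (the initialization scales and the bias-free forward pass are my assumptions):

```python
import numpy as np

class LoRALayer:
    """Frozen base weights W plus a trainable low-rank delta (alpha/r) * B @ A."""
    def __init__(self, d_in, d_out, r=4, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.1, (d_out, d_in))  # base weights, kept frozen
        self.A = rng.normal(0.0, 0.01, (r, d_in))     # small Gaussian init
        self.B = np.zeros((d_out, r))                 # zero init: delta starts at 0
        self.scale = alpha / r                        # 16 / 4 = 4x

    def forward(self, x):
        # Until B is updated, the adapted output equals the base output
        return (self.W + self.scale * self.B @ self.A) @ x
```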

📈 Longer Context via GAE

One of the core design goals was longer temporal reasoning. ENSO events unfold over months, not steps. The agent needs to connect a warming SST anomaly today to a reward collapse 20 steps later.

I implemented Generalized Advantage Estimation (GAE) with λ=0.95 and a context window of k=10 steps. This blends multi-step TD errors via a backward recursion:

A_t = δ_t + γλ · A_{t+1},  where δ_t = r_t + γ·V(s_{t+1}) − V(s_t)
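In code, that backward recursion looks like this (a sketch: γ = 0.99 and zero bootstrapping at episode end are my assumptions, and the k=10 window truncation is omitted for brevity):

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    # Backward pass: A_t = delta_t + gamma * lam * A_{t+1}
    T = len(rewards)
    advantages = np.zeros(T)
    last_adv = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0  # bootstrap 0 at episode end
        delta = rewards[t] + gamma * next_v - values[t]
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    return advantages
```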

The longer rollout means the agent learns to anticipate phase transitions, not just react to immediate rewards. Neutral-phase episodes yield rewards of ~+15 to +20. El Niño episodes can collapse to −120. The policy learns to recognize the early SST warning signs and shift toward higher intervention actions before the crash.

📊 What the Results Show

After 200 training episodes:

  • Best reward: +20.70 (neutral phase, full stability)
  • El Niño average reward: ~−100 (high variance, hard exploration problem)
  • Adaptive LR ranged from 3e-4 → 9e-4 as KL stabilized
  • LoRA delta norm converged near zero — base policy largely preserved

The reward variance is high, and that's the honest result. ENSO is hard. The El Niño phase represents a genuinely adversarial environment that the policy struggles to stabilize without more episodes or a wider network. But the PSPO + LoRA framework holds — updates remain stable, KL stays bounded, and the adapter learns without catastrophic interference.

🛠️ The Stack

Everything is pure Python + NumPy — no PyTorch, no gym, no shortcuts:

  • Custom LoRALayer with finite-difference gradient estimation (see the sketch below)
  • Bjerknes coupled ocean-atmosphere dynamics for the environment
  • GAE, PSPO surrogate, and adaptive LR scheduler from scratch
  • Results exported to JSON, with a full interactive HTML visualization showing the architecture diagram, training curves, phase timeline, and a live episode simulator

The visualization lets you tune τ, ε, λ, and LoRA rank in real time and run a simulated ENSO episode with terminal output — useful for teaching the algorithm to anyone unfamiliar with RL.
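For reference, finite-difference gradient estimation can be sketched like this (central differences; the exact scheme the project uses may differ):

```python
import numpy as np

def finite_diff_grad(loss_fn, params, eps=1e-5):
    # Central differences: perturb each parameter up and down,
    # and measure the resulting change in the loss
    grad = np.zeros_like(params)
    for i in range(params.size):
        bump = np.zeros_like(params)
        bump.flat[i] = eps
        grad.flat[i] = (loss_fn(params + bump) - loss_fn(params - bump)) / (2 * eps)
    return grad
```

It's O(2·|params|) loss evaluations per update — viable only because LoRA keeps the trainable parameter count tiny.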

🔑 Key Takeaways

  1. PSPO's temperature scaling genuinely helps in high-variance environments. The softer ratio prevents the policy from overcorrecting during El Niño shocks.
  2. LoRA isn't just for LLMs. Parameter-efficient adapters on small policy networks give you a clean separation between "what the model already knows" and "what it's learning for this task."
  3. Context matters in RL. A longer GAE window meaningfully changes what the agent learns to value. For slow-moving, high-stakes domains (climate, finance, healthcare), k=10 or more is worth the compute.
  4. Reward shaping is where domain knowledge lives. The ENSO reward function encodes climatological intuition: penalize SST extremes quadratically, reward stability exponentially, and add a small cost to high-intervention actions to prevent trivial solutions (see the sketch after this list).
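A hypothetical shaping consistent with that description (every coefficient here is illustrative, not the project's actual reward function):

```python
import numpy as np

def enso_reward(sst_anomaly, action_level, num_actions=5):
    # Quadratic penalty on SST extremes (both El Nino and La Nina directions)
    penalty = -2.0 * sst_anomaly ** 2
    # Exponential bonus that peaks when the system sits near neutral
    stability_bonus = 5.0 * np.exp(-sst_anomaly ** 2)
    # Small linear cost on intervention strength, ruling out trivial max-action policies
    intervention_cost = -0.5 * action_level / (num_actions - 1)
    return penalty + stability_bonus + intervention_cost
```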

If you're working on RL, climate AI, or parameter-efficient fine-tuning and want to dig into the code or the visualization — happy to share. Drop a comment or connect.

https://github.com/Mathin26/ENSO-Project

#ReinforcementLearning #MachineLearning #ClimateAI #RLHF #LoRA #Python #DeepLearning #ENSO #AIResearch
