Reinforcement learning for adaptive source selection: a UCB contextual bandit and a REINFORCE policy-gradient agent trained against real, live APIs.
Technical Report
System architecture, mathematical formulation, experimental results, analysis, ethical considerations, and references.
Source Code
Full implementation including UCB Bandit, REINFORCE, training loop, reward function, real API fetchers, statistical analysis, and visualizations.
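As a sketch of the bandit component, a minimal UCB1 agent in pure Python. Class, method, and parameter names here are illustrative, not taken from the codebase, and the actual agent may incorporate context features and differ in detail:

```python
import math

class UCBBandit:
    """Minimal UCB1 bandit over k sources (illustrative sketch)."""

    def __init__(self, k, c=2.0):
        self.k = k
        self.c = c                 # exploration coefficient
        self.counts = [0] * k      # pulls per arm
        self.values = [0.0] * k    # running mean reward per arm

    def select(self):
        # Pull every arm once before applying the UCB formula.
        for arm in range(self.k):
            if self.counts[arm] == 0:
                return arm
        t = sum(self.counts)
        scores = [self.values[a] + self.c * math.sqrt(math.log(t) / self.counts[a])
                  for a in range(self.k)]
        return max(range(self.k), key=lambda a: scores[a])

    def update(self, arm, reward):
        # Incremental mean update for the pulled arm.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

With a deterministically better arm, the pull counts concentrate on it while the `c * sqrt(log t / n)` bonus keeps the other arms occasionally explored.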
Experimental Results
Per-episode reward, cumulative reward, Q-value heatmap, REINFORCE policy distribution, policy entropy over training, and source selection frequency.
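One of the tracked diagnostics, policy entropy, can be computed directly from the REINFORCE agent's softmax policy. A minimal sketch (function names are illustrative, not from the codebase):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def policy_entropy(logits):
    """Shannon entropy (nats) of the softmax policy.

    A uniform policy over k sources gives ln(k); entropy falling
    toward 0 over training means the policy is committing to one source.
    """
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0)
```

Plotting this quantity per episode yields the "policy entropy over training" curve.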
Custom Tool
A standalone, reusable multi-component reward-scoring tool built specifically for Madison's source-selection problem. It evaluates three independent quality dimensions rather than binary success/failure alone.
| Component | Condition | Value |
|---|---|---|
| r_success | Fetch succeeded | +1.0 |
| r_success | Fetch failed | −1.0 |
| r_length | Word count > 50 | +0.5 |
| r_length | Word count < 10 | −0.2 |
| r_relevance | Keyword overlap ratio | 0 to +0.3 |
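A sketch of a scoring function matching the table above. Two assumptions are labeled in the comments: word counts between 10 and 50 contribute 0, and the relevance term is the fraction of query keywords found in the text, scaled into [0, 0.3]; the actual tool may define these differently.

```python
def score_fetch(success, text, query_keywords):
    """Multi-component reward: r_success + r_length + r_relevance (sketch)."""
    # r_success: binary outcome of the fetch.
    r_success = 1.0 if success else -1.0

    # r_length: reward substantive responses, penalize near-empty ones.
    # Assumption: counts between 10 and 50 words contribute 0.
    n_words = len(text.split())
    if n_words > 50:
        r_length = 0.5
    elif n_words < 10:
        r_length = -0.2
    else:
        r_length = 0.0

    # r_relevance: assumption: fraction of keywords present, scaled to [0, 0.3].
    kws = {k.lower() for k in query_keywords}
    lowered = text.lower()
    overlap = sum(1 for k in kws if k in lowered) / len(kws) if kws else 0.0
    r_relevance = 0.3 * overlap

    return r_success + r_length + r_relevance
```

A failed, empty fetch scores −1.2 (−1.0 − 0.2); a successful long response containing every keyword scores the maximum 1.8.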
Performance Summary
Comparison across all three agents. Welch's t-test: t = 0.987, p = 0.329; not statistically significant at alpha = 0.05, consistent with high API variance over only 50 episodes.
| Metric | Random Baseline | UCB Bandit | REINFORCE | Winner |
|---|---|---|---|---|
| Avg Reward (Random baseline) | 0.778 | -- | -- | Baseline |
| Avg Reward (Late training) | 0.778 | 0.873 | 0.553 | UCB |
| Improvement over random | -- | +0.096 | −0.225 | UCB |
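The Welch statistic in the summary can be reproduced from two per-episode reward lists. A minimal pure-Python sketch (the analysis presumably used `scipy.stats.ttest_ind` with `equal_var=False`; this version computes only the t statistic and the Welch–Satterthwaite degrees of freedom, not the p-value):

```python
import math

def welch_t(xs, ys):
    """Welch's t statistic and degrees of freedom for two samples."""
    n1, n2 = len(xs), len(ys)
    m1, m2 = sum(xs) / n1, sum(ys) / n2
    # Unbiased sample variances.
    v1 = sum((x - m1) ** 2 for x in xs) / (n1 - 1)
    v2 = sum((y - m2) ** 2 for y in ys) / (n2 - 1)
    se2 = v1 / n1 + v2 / n2
    t = (m1 - m2) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df
```

Unequal variances are the expected case here, since live-API reward streams differ in noise level across agents.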
Demonstration
A 10-minute walkthrough covering the notebook structure, live training, learning-curve analysis, and a before/after performance comparison.