From the basics to deep reinforcement learning, this repo provides easy-to-read code examples. One file for each algorithm. Please feel free to create a Pull Request, or open an issue!
Grid World (1-grid-world/)
- Policy Iteration: 1-policy_iteration.py
- Value Iteration: 2-value_iteration.py
- SARSA: 3-sarsa.py
- Q-Learning: 4-q_learning.py
- Deep SARSA: 5-deep_sarsa.py
- REINFORCE: 6-reinforce.py
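
For orientation, the difference between the tabular SARSA and Q-Learning entries above comes down to one line in the update. The sketch below is not taken from the repo's files; the function names, the `defaultdict` Q-table, and the `alpha` / `gamma` defaults are assumptions chosen for illustration, and terminal-state handling is left out.

```python
from collections import defaultdict


def make_q_table(n_actions):
    """Q-table mapping state -> list of action values, all starting at 0."""
    return defaultdict(lambda: [0.0] * n_actions)


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy TD update: bootstrap from the action actually taken next."""
    target = r + gamma * q[s_next][a_next]
    q[s][a] += alpha * (target - q[s][a])


def q_learning_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy TD update: bootstrap from the greedy action in s_next."""
    target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])
```

SARSA needs the next action `a_next` before it can update, while Q-Learning only needs the next state.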
CartPole (2-cartpole/)
Atari (3-atari/)
Trained on a MacBook Pro 14" (Apple M3, 8 GB unified memory), macOS 26.2, Python 3.11, PyTorch 2.11 with the MPS backend. CPU / GPU figures are read from Activity Monitor on the python3.11 process after the run has stabilized (~5 min in); peak RAM is the process's real memory at its high-water mark. Final score is the mean per-game return over the last 20 episodes of training.
| Algorithm | Params | Train time | Final mean (per-game) | Peak RAM | CPU% | GPU% | W&B |
|---|---|---|---|---|---|---|---|
| DQN | 1.69M | ~9h | 93.5 ± 9.6 | 5.27 GB | ~60 | ~55 | report |
| PPO | 1.69M | ~3.8h | 261.9 ± 6.4 | 1.98 GB | ~62 | ~55 | report |
Single seed per row, mean ± std over the final 20 logged episodes.
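
A minimal sketch of how such a "mean ± std over the final 20 episodes" figure could be computed from a list of per-episode returns. The function name is hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption; the text above does not say whether population or sample std was used.

```python
import statistics


def final_score(episode_returns, window=20):
    """Return (mean, std) over the last `window` episode returns."""
    tail = episode_returns[-window:]
    mean = statistics.fmean(tail)
    # Assumption: population std; swap in statistics.stdev for sample std.
    std = statistics.pstdev(tail)
    return mean, std
```

With a single seed, this std reflects episode-to-episode variation within one run, not variation across runs.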
Params counts only trainable network weights. CPU% is the single-process value reported by Activity Monitor (sum across cores, so >100% means multi-core use); GPU% is the same column for the Apple GPU. Sticky actions (repeat_action_probability=0.25) make absolute scores lower than those reported on the deterministic *-v4 environments often cited in older papers.
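
To make the sticky-actions effect concrete, here is a small illustration of the idea: with probability `repeat_action_probability`, the requested action is ignored and the previous one is repeated. ALE handles this inside the emulator, so the class below is only a sketch of the mechanism; the class name, the seeding, and the initial "previous action" of 0 are assumptions.

```python
import random


class StickyActions:
    """With probability p, repeat the previous action instead of the new one."""

    def __init__(self, repeat_action_probability=0.25, seed=None):
        self.p = repeat_action_probability
        self.rng = random.Random(seed)
        self.prev_action = 0  # assumed starting action (e.g. NOOP)

    def __call__(self, action):
        # Roll once per call; on a hit the requested action is discarded.
        if self.rng.random() < self.p:
            action = self.prev_action
        self.prev_action = action
        return action
```

Because the agent cannot rely on every action being executed, frame-perfect strategies stop working, which is one reason scores tend to drop relative to deterministic settings.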
Trained on a Mac Studio (Apple M4 Max, 64 GB) — different hardware from the Breakout rows above. ALE/MontezumaRevenge-v5 with sticky actions, 512 parallel environments (envpool), single seed. Score = mean per-game return over the last 100 training episodes.
| Algorithm | Params | Train time | Final mean (per-game) | Frames | W&B |
|---|---|---|---|---|---|
| PPO + RND | 3.90M | ~3.4h | ~3120 (single seed) | 65M agent steps | report |
Random Network Distillation (Burda et al. 2018) for hard exploration. With 512 envs the first key is found reliably (~327k steps) and the extrinsic value bootstraps around 10M steps; with 128 envs the same code never scored in 50M steps — parallel breadth is what cracks the first-key bottleneck. Stopped at ~65M agent steps after the score plateaued above the paper's PPO baseline (2497); not run to a fixed budget. Still far below RND's headline 8152, which used 128–1024 envs × 1.97B frames (~30× more experience).
Params = trainable weights (actor-critic 1.69M + RND predictor 2.20M; the frozen RND target adds 1.68M). Single seed, so no ± std; a 3-seed run is the next step for a defensible number.
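
The predictor / frozen-target split mentioned above is the core of RND: the intrinsic reward is the predictor's error against a fixed random network, so it shrinks for observations seen often and stays high for novel ones. The toy below uses single linear layers in pure Python to show that behaviour. It is a sketch of the idea only, not the repo's network: the class name, sizes, and learning rate are assumptions, and the observation and reward normalization used in practice is omitted.

```python
import random


def make_linear(n_in, n_out, rng):
    """Random weight matrix with n_out rows and n_in columns."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]


def forward(weights, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class TinyRND:
    def __init__(self, n_in=4, n_out=8, lr=0.1, seed=0):
        rng = random.Random(seed)
        self.target = make_linear(n_in, n_out, rng)     # frozen, never updated
        self.predictor = make_linear(n_in, n_out, rng)  # trained to match target
        self.lr = lr

    def intrinsic_reward(self, obs):
        """Mean squared error between predictor and target outputs."""
        t = forward(self.target, obs)
        p = forward(self.predictor, obs)
        return sum((pi - ti) ** 2 for pi, ti in zip(p, t)) / len(t)

    def update(self, obs):
        """One SGD step on the predictor's MSE for this observation."""
        t = forward(self.target, obs)
        p = forward(self.predictor, obs)
        n = len(t)
        for j, row in enumerate(self.predictor):
            grad_scale = 2.0 * (p[j] - t[j]) / n
            for i in range(len(row)):
                row[i] -= self.lr * grad_scale * obs[i]
```

Calling `update` repeatedly on one observation drives its `intrinsic_reward` toward zero, while an observation the predictor has not trained on keeps a large reward.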
Requires Python 3.11 and uv.
git clone <this repo>
cd reinforcement-learning
uv sync

# Grid World
cd 1-grid-world && uv run python 3-sarsa.py
# CartPole — train
cd 2-cartpole && uv run python 1-dqn.py
# CartPole — watch training (slower)
cd 2-cartpole && uv run python 1-dqn.py --render
# CartPole — replay a trained checkpoint
cd 2-cartpole && uv run python 1-dqn.py --test

Both Atari scripts (1-dqn.py, 2-ppo.py) can stream training metrics to your own Weights & Biases account. After a one-time login, pass --wandb:
uv run wandb login # paste the API key from https://wandb.ai/authorize
cd 3-atari && uv run python 2-ppo.py --env breakout --wandb
cd 3-atari && uv run python 1-dqn.py --env breakout --wandb

Runs land in your rl-atari-ppo / rl-atari-dqn project; nothing is shared by default. Omit --wandb and the script runs without ever touching the network.
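
One way an opt-in flag like this can be wired up is to import `wandb` only when the flag is set, so a run without `--wandb` never loads the library. This is a sketch under that assumption, not the scripts' actual code; `parse_args`, `make_logger`, and the `breakout` default are hypothetical, while `--env`, `--wandb`, and the `rl-atari-ppo` project name come from the text above.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--env", default="breakout")
    parser.add_argument("--wandb", action="store_true",
                        help="stream metrics to Weights & Biases")
    return parser.parse_args(argv)


def make_logger(args, project="rl-atari-ppo"):
    """Return log(metrics, step); a no-op unless --wandb was passed."""
    if not args.wandb:
        return lambda metrics, step: None
    import wandb  # lazy import: offline runs never load the library
    run = wandb.init(project=project, config=vars(args))
    return lambda metrics, step: run.log(metrics, step=step)
```

The training loop can then call `log({"episode_return": ret}, step)` unconditionally, whether or not logging is enabled.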
Modernized from the 2017 original:
- Framework: Keras + TensorFlow 1.0 → PyTorch 2.11
- Env: gym 0.8 → gymnasium 1.2
- Rendering: tkinter → pygame (cross-platform with no system Tk)
- Tooling: requirements.txt → pyproject.toml + uv
- Scope: pruned to 9 core algorithms; dropped Monte Carlo / DDQN / A3C / Atari / mountaincar; added PPO
- Layout: flat 1-grid-world/3-sarsa.py instead of nested 1-grid-world/4-sarsa/sarsa_agent.py
- Docs: each algorithm file now opens with a paper citation and the core update equation
