Name | Type | Week | Tags | Image | Published | Date | Featured |
---|---|---|---|---|---|---|---|
1 - Motivation, States, Actions and Rewards | Post | 1 | | | | | |
2 - Return, Value Functions & Bellman Equations | Post | 1 | | | | | |
3 - Learning from Experience, TD-Learning, epsilon-Greedy | Post | 1 | | | | | |
4 - Generalised Policy Iteration | Post | 2 | | | | | |
5 - Curse of Dimensionality, Function Approximation & Parameters | Post | 2 | | | | | |
6 - Feature Vectors | Post | 2 | | | | | |
7 - Intro to Neural Networks | Post | 3 | | | | | |
8 - PyTorch | Post | 3 | | | | | |
9 - Training in PyTorch | Post | 3 | | | | | |
10 - Model-free Control & Q-Learning | Post | 4 | | | | | |
11 - Deep Q-Networks | Post | 4 | | | | | |
12 - Double DQN | Post | 4 | | | | | |
What is PyTorch? | Post | | | | | | |
Mean Squared Error in PyTorch | Post | | | | | | |
Stochastic Gradient Descent | Post | | | | | | |
nn.Sequential modules | Post | | | | | | |
nn.ReLU layers | Post | | | | | | |
nn.Linear layers | Post | | | | | | |
Activation functions | Post | | | | | | |
Tensors | Post | | | | | | |
Neurons to layers | Post | | | | | | |
Polynomial function approximation | Post | | | | | | |
Greedy Actions | Post | | | | | | |
1-Step Lookahead | Post | | | | | | |
Reward Functions | Post | | | | | | |
State Transition Functions | Post | | | | | | |
Episodic MDPs | Post | | | | | | |
The Env class | Post | | | | | | |
Reinforcement Learning in Python | Post | | | | | | |
Epsilon Greedy Actions | Post | | | | | | |
Policy π(s) | Post | | | | | | |
Discounting the Future | Post | | | | | | |
Supervised Learning the Value Function | Post | | | | | | |
Generalised Policy Iteration | Post | | | | | | |
Rewards | Post | | | | | | |
Neural Networks | Post | | | | | | |
Policy Evaluation | Post | | | | | | |
Loss Functions in PyTorch | Post | | | | | | |
Activation Functions | Post | | | | | | |
Training a Model in PyTorch | Post | | | | | | |
Backpropagation | Post | | | | | | |
Q-learning | Post | | | | | | |
Temporal Difference Learning | Post | | | | | | |
Normalising Neural Network Inputs | Post | | | | | | |
Problems with DQN and Value-based Approaches | Post | | | | | | |
Value-Function Approximation | Post | | | | | | |
Tensors and NumPy | Post | | | | | | |
Autograd | Post | | | | | | |
Value Functions | Post | | | | | | |
Deep Q-Networks | Post | | | | | | |
Training Machine Learning Models | Post | | | | | | |
Linear Combination of Features | Post | | | | | | |
Neural Network in PyTorch | Post | | | | | | |
Exploration-Exploitation Trade-off | Post | | | | | | |
Function Approximation for Policies | Post | | | | | | |
Model-based and model-free approaches | Post | | | | | | |
Optimizers | Post | | | | | | |
Learning from Experience | Post | | | | | | |
Policy Gradients | Post | | | | | | |
Gradient Descent | Post | | | | | | |
Curse of Dimensionality | Post | | | | | | |
Double DQN | Post | | | | | | |
Bellman Optimal Action-Value Equation | Post | | | | | | |
Mean Squared Error | Post | | | | | | |
Feature Table Lookup | Post | | | | | | |
Limitations of model-based approaches | Post | | | | | | |
Difficulties with DQN | Post | | | | | | |
Neural Network Dimensions | Post | | | | | | |
Choosing a Learning Rate | Post | | | | | | |
Terminal State Value | Post | | | | | | |
Loss Functions | Post | | | | | | |
Neural Network Architectures | Post | | | | | | |
What is a Model? | Post | | | | | | |
Action-Value (Q) Functions | Post | | | | | | |
Neurons | Post | | | | | | |
Experience Replay | Post | | | | | | |
Bellman Equation | Post | | | | | | |
Approximation and Convergence Guarantees | Post | | | | | | |
Limitations of Polynomials | Post | | | | | | |
Generalization | Post | | | | | | |
Markov Decision Processes | Post | | | | | | |
Temporal Difference Update Equation | Post | | | | | | |
Return | Post | | | | | | |
Function Approximation methods | Post | | | | | | |
Policy Improvement | Post | | | | | | |
Actions | Post | | | | | | |
What is Reinforcement Learning? | Post | | | | | | |
States | Post | | | | | | |
Why is Reinforcement Learning useful? | Post | | | | | | |
Designing Features | Post | | | | | | |
Stochastic State Transition Functions | Post | | | | | | |
MCTS as Policy-Improvement Operator | Post | | | | | | |
Residual Learning | Post | | | | | | |
AlphaGo Zero: Architecture | Post | | | | | | |
Batch Normalisation | Post | | | | | | |
AlphaGo Zero: Backup | Post | | | | | | |
AlphaGo Zero: Loss Functions | Post | | | | | | |
AlphaGo Zero: Selection | Post | | | | | | |
AlphaGo Zero: Self-play | Post | | | | | | |
AlphaGo Zero: Expansion and Evaluation | Post | | | | | | |
AlphaGo Zero: Training | Post | | | | | | |
AlphaGo Zero: Action Selection | Post | | | | | | |
AlphaGo: Value Network Training | Post | | | | | | |
AlphaGo Zero Algorithm | Post | | | | | | |
AlphaGo: Why two Policy networks? | Post | | | | | | |
AlphaGo: RL Policy Network Training | Post | | | | | | |
AlphaGo: SL Policy Network Training | Post | | | | | | |
AlphaGo Zero: Neural Network | Post | | | | | | |
AlphaGo versus AlphaGo Zero | Post | | | | | | |
AlphaGo: Neural Networks Used | Post | | | | | | |
AlphaGo: Action Selection | Post | | | | | | |
AlphaGo: Simulation | Post | | | | | | |
AlphaGo: Selection | Post | | | | | | |
AlphaGo: Expansion | Post | | | | | | |
AlphaGo: Backup | Post | | | | | | |
AlphaGo: Rollout Policy | Post | | | | | | |
Why Go is hard to solve | Post | | | | | | |
Convolutional Neural Networks: Padding | Post | | | | | | |
Convolutional Neural Networks: Pooling | Post | | | | | | |
Convolutional Neural Networks in Reinforcement Learning | Post | | | | | | |
AlphaGo Algorithm | Post | | | | | | |
Convolutional Neural Networks: Stride | Post | | | | | | |
Rules of Go | Post | | | | | | |
Convolutional Neural Networks: Kernel Size | Post | | | | | | |
Convolutional Layers | Post | | | | | | |
Convolutional Neural Networks | Post | | | | | | |
Handcrafted features for Computer Vision | Post | | | | | | |
Convolutional Neural Networks: Number of Filters | Post | | | | | | |
Convolution | Post | | | | | | |
Filters for Feature Detection | Post | | | | | | |
Monte Carlo Tree Search: Selection | Post | | | | | | |
Monte Carlo Tree Search: Simulation | Post | | | | | | |
Tree Pruning | Post | | | | | | |
Computer Vision | Post | | | | | | |
Monte Carlo Tree Search: Expansion | Post | | | | | | |
Monte Carlo Tree Search: Backup | Post | | | | | | |
Parallelisation | Post | | | | | | |
Tree Policies | Post | | | | | | |
Half-States | Post | | | | | | |
Rollout Policies | Post | | | | | | |
Sharing Rollout Updates | Post | | | | | | |
Monte Carlo Tree Search | Post | | | | | | |
Two-Player Monte Carlo Tree Search | Post | | | | | | |
Environment and Acting Loop | Post | | | | | | |
Unifying Learning and Planning | Post | | | | | | |
Rollout Algorithms | Post | | | | | | |
Types of Updates | Post | | | | | | |
Decision-Time Planning | Post | | | | | | |
Limitations of Dynamic Programming | Post | | | | | | |
Trajectory Sampling | Post | | | | | | |
What is Planning? | Post | | | | | | |
Policy Iteration | Post | | | | | | |
Value Iteration | Post | | | | | | |
Distribution and Sample Models | Post | | | | | | |
Model-Based Reinforcement Learning | Post | | | | | | |
Dynamic Programming | Post | | | | | | |
Legal Actions Masking | Post | | | | | | |
Proximal Policy Optimization (PPO) | Post | | | | | | |
Clipped Proximal Policy Optimization (PPO) | Post | | | | | | |
Ideal Property of Policy Gradient Updates | Post | | | | | | |
Trust-Region Policy Optimization (TRPO) | Post | | | | | | |
Limitations of TRPO | Post | | | | | | |
Clipped PPO with GAE | Post | | | | | | |
Policy Gradients Instability | Post | | | | | | |
Online Eligibility Traces | Post | | | | | | |
Offline Forward-View GAE(λ) | Post | | | | | | |
Algorithms for Generalised Advantage Estimation (GAE) | Post | | | | | | |
Batch GAE(λ) | Post | | | | | | |
TD-Lambda | Post | | | | | | |
Advantage Function | Post | | | | | | |
Advantage Estimation | Post | | | | | | |
Generalised Advantage Estimation | Post | | | | | | |
Lambda-return | Post | | | | | | |
Actor Critic | Post | | | | | | |
Monte Carlo Learning | Post | | | | | | |
Bias-Variance Trade-Off | Post | | | | | | |
n-Step Bootstrapping | Post | | | | | | |
Averaging n-Step Returns | Post | | | | | | |
Bootstrapping | Post | | | | | | |
Learning a Baseline | Post | | | | | | |
High Variance of Parameter Updates | Post | | | | | | |
Model-Free Prediction | Post | | | | | | |
Losses in Torch | Post | | | | | | |
Vanilla Policy Gradients Update Equation | Post | | | | | | |
Vanilla Policy Gradients | Post | | | | | | |
Entropy Regularization | Post | | | | | | |
Continuous Action Spaces | Post | | | | | | |
Policy Gradient Theorem | Post | | | | | | |
Softmax | Post | | | | | | |
Benefits of Approximating Policies | Post | | | | | | |
Stochastic Policies | Post | | | | | | |
Supervised Learning | Post | | | | | | |
Overfitting | Post | | | | | | |