Project Overview
This project implements a console-based Tic-Tac-Toe game where a human player competes against an AI that learns through Q-learning, a model-free reinforcement learning algorithm. Unlike hard-coded game AI, this agent discovers winning strategies by playing thousands of games and learning from outcomes.
What is Q-Learning?
Q-learning is a reinforcement learning technique that learns the value (quality) of actions in different states. The agent builds a "Q-table" mapping state-action pairs to expected future rewards. Over time, it learns which moves lead to wins and avoids those that lead to losses — all without being explicitly programmed with game strategy.
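For intuition, a few Q-table entries might look like the snippet below. The state keys follow the string format produced by the `board_to_key` helper shown later in this document, and the values are invented for illustration, not learned:

```python
# Hypothetical Q-table entries: (state key, cell index) -> estimated future reward.
# The values here are made up for illustration, not learned from play.
q_table = {
    ("[0 0 0 0 0 0 0 0 0]", 4): 0.61,  # opening in the center has tended to win
    ("[0 0 0 0 0 0 0 0 0]", 0): 0.35,  # a corner opening is somewhat promising
    ("[0 0 0 0 0 0 0 0 0]", 1): 0.12,  # an edge opening has tended to do worse
}
```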
The Mathematics Behind Q-Learning
The algorithm updates Q-values with an update rule derived from the Bellman equation, balancing immediate rewards against expected future returns:
Q-Value Update Rule (Bellman Equation)
\[ Q(s, a) \leftarrow Q(s, a) + \alpha \Big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \Big] \]

| Parameter | Symbol | Value | Role |
|---|---|---|---|
| Learning Rate | \(\alpha\) | 0.5 | How quickly new information overrides old knowledge |
| Discount Factor | \(\gamma\) | 0.9 | Importance of future rewards vs. immediate rewards |
| Exploration Rate | \(\epsilon\) | 0.1 | Probability of trying a random move (exploration) |
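As a quick worked example (with made-up numbers), suppose the current estimate is \(Q(s,a) = 0.2\), the move gives no immediate reward (\(r = 0\)), and the best action from the next state is valued at \(\max_{a'} Q(s',a') = 0.6\). With \(\alpha = 0.5\) and \(\gamma = 0.9\), the update becomes:

\[ Q(s, a) \leftarrow 0.2 + 0.5 \big[ 0 + 0.9 \times 0.6 - 0.2 \big] = 0.2 + 0.5 \times 0.34 = 0.37 \]

The estimate moves partway (controlled by \(\alpha\)) toward the discounted value of the best continuation.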
Algorithm Steps
- Initialize Q-table — Start with all Q-values set to zero
- Observe current state — Convert the board configuration to a unique key
- Select action — Use ε-greedy: explore randomly (10%) or exploit best known move (90%)
- Execute action — Place the mark and observe the new state
- Receive reward — Win: +1, Loss: −1, Draw/Continue: 0
- Update Q-value — Apply the Bellman equation to improve future decisions
- Repeat — Continue until the game ends, then start a new episode
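The reward step above (step 5) can be read as a small outcome-to-reward mapping. The helper below is a hypothetical sketch rather than code from the project, assuming players are encoded as +1 and −1 and that `winner` is 0 while the game is undecided:

```python
def reward_for(winner, player):
    """Reward from the agent's perspective: +1 for a win, -1 for a loss, 0 otherwise."""
    if winner == 0:
        return 0                      # draw, or the game is still in progress
    return 1 if winner == player else -1
```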
Core Implementation
Below are the key functions that power the Q-learning agent:
State Representation & Action Selection
```python
import numpy as np
import random

# Hyperparameters
alpha = 0.5    # Learning rate
gamma = 0.9    # Discount factor
epsilon = 0.1  # Exploration rate

q_table = {}   # State-action value dictionary


def board_to_key(board):
    """Convert board array to hashable string key."""
    return str(board.flatten())


def available_actions(board):
    """Return indices of empty cells."""
    flat = board.flatten()
    return [i for i in range(9) if flat[i] == 0]


def choose_action(board, player):
    """Epsilon-greedy action selection."""
    state = board_to_key(board)
    actions = available_actions(board)
    if random.random() < epsilon:
        return random.choice(actions)  # Explore
    # Exploit: choose the best known action, breaking ties randomly
    q_values = [q_table.get((state, a), 0) for a in actions]
    max_q = max(q_values)
    best = [a for a, q in zip(actions, q_values) if q == max_q]
    return random.choice(best)
```
Q-Table Update
```python
def update_q_table(state, action, reward, next_board, done):
    """Apply the Q-learning update rule for one (state, action) pair."""
    old_q = q_table.get((state, action), 0)
    if done:
        # Terminal state: no future reward to bootstrap from
        new_q = old_q + alpha * (reward - old_q)
    else:
        # Estimate the best achievable future reward from the next board position
        next_state = board_to_key(next_board)
        next_actions = available_actions(next_board)
        max_future = max([q_table.get((next_state, a), 0)
                          for a in next_actions], default=0)
        new_q = old_q + alpha * (reward + gamma * max_future - old_q)
    q_table[(state, action)] = new_q
```
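To tie the pieces together with the algorithm steps listed earlier, here is a minimal training-loop sketch in which the agent plays as player 1 (X) against a purely random opponent. It reuses the functions and hyperparameters above; the `check_winner` helper, the `train` function, the +1/−1 board encoding, and the episode count are illustrative assumptions rather than code from the project.

```python
def check_winner(board):
    """Return 1 or -1 if that player has completed a line, else 0."""
    lines = list(board) + list(board.T) + [board.diagonal(), np.fliplr(board).diagonal()]
    for line in lines:
        if abs(line.sum()) == 3:
            return int(np.sign(line.sum()))
    return 0


def train(episodes=20000):
    """Train the agent (player 1) against a random opponent (player -1)."""
    for _ in range(episodes):
        board = np.zeros((3, 3), dtype=int)
        while True:
            # Steps 2-4: observe the state, pick a move epsilon-greedily, play it
            state = board_to_key(board)
            action = choose_action(board, 1)
            board[action // 3, action % 3] = 1

            # Step 5: check for a terminal outcome after the agent's move
            if check_winner(board) == 1:
                update_q_table(state, action, 1, board, True)    # win: +1
                break
            if not available_actions(board):
                update_q_table(state, action, 0, board, True)    # draw: 0
                break

            # The random opponent replies; its move is part of the environment
            opp = random.choice(available_actions(board))
            board[opp // 3, opp % 3] = -1

            if check_winner(board) == -1:
                update_q_table(state, action, -1, board, True)   # loss: -1
                break
            if not available_actions(board):
                update_q_table(state, action, 0, board, True)    # draw: 0
                break

            # Step 6: non-terminal move, zero reward, bootstrap from the new state
            update_q_table(state, action, 0, board, False)
```

After training, setting `epsilon` to 0 makes `choose_action` play greedily from the learned Q-values, which is the behavior you would want when facing a human opponent.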
Try It Yourself
The interactive browser demo lets you play against a minimax AI opponent (you are X, the AI is O).
Exploration vs. Exploitation
The ε-greedy strategy is crucial: with probability ε (10%), the agent tries a random move to discover new strategies. The remaining 90% of the time, it exploits its learned knowledge by choosing the move with the highest Q-value. This balance prevents the agent from getting stuck in suboptimal patterns.
Run the Full Implementation
The browser demo uses a simplified opponent. To train and play against a true Q-learning agent that improves over time, run the notebook: