Play Tic-Tac-Toe Against a Q-Learning Agent

A reinforcement learning implementation where an AI learns to play optimally through trial and error, balancing exploration and exploitation.


Project Overview

This project implements a console-based Tic-Tac-Toe game where a human player competes against an AI that learns through Q-learning, a model-free reinforcement learning algorithm. Unlike hard-coded game AI, this agent discovers winning strategies by playing thousands of games and learning from outcomes.

What is Q-Learning?

Q-learning is a reinforcement learning technique that learns the value (quality) of actions in different states. The agent builds a "Q-table" mapping state-action pairs to expected future rewards. Over time, it learns which moves lead to wins and avoids those that lead to losses — all without being explicitly programmed with game strategy.
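
Concretely, the Q-table here is just a Python dictionary keyed by (state, action) pairs. The entries below are made-up values for the empty board, shown only to illustrate the shape of the data; the real values emerge from training, and the exact key format is defined in the implementation further down.

# Hypothetical Q-table entries for the empty board (values are illustrative only).
# Keys are (state key, cell index 0-8); values are learned expected returns.
example_q_table = {
    ("[0 0 0 0 0 0 0 0 0]", 4): 0.62,   # taking the centre has looked strong so far
    ("[0 0 0 0 0 0 0 0 0]", 0): 0.41,   # a corner opening, somewhat weaker estimate
    ("[0 0 0 0 0 0 0 0 0]", 1): 0.17,   # an edge opening, rarely led to wins
}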

The Mathematics Behind Q-Learning

The algorithm updates Q-values using the Bellman equation, which balances immediate rewards with expected future returns:

Q-Value Update Rule (Bellman Equation)

\[ Q(s, a) \leftarrow Q(s, a) + \alpha \Big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \Big] \]
Parameter        | Symbol       | Value | Role
-----------------|--------------|-------|-----------------------------------------------------
Learning Rate    | \(\alpha\)   | 0.5   | How quickly new information overrides old knowledge
Discount Factor  | \(\gamma\)   | 0.9   | Importance of future rewards vs. immediate rewards
Exploration Rate | \(\epsilon\) | 0.1   | Probability of trying a random move (exploration)
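
For a concrete (made-up) example: suppose the current estimate is \(Q(s, a) = 0\), the move earns no immediate reward (\(r = 0\)), and the best action available from the resulting state has an estimated value of 0.6. With \(\alpha = 0.5\) and \(\gamma = 0.9\), the update gives

\[ Q(s, a) \leftarrow 0 + 0.5 \big[ 0 + 0.9 \times 0.6 - 0 \big] = 0.27 \]

so part of the promising future value flows back into the state-action pair that led there.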

Algorithm Steps

  1. Initialize Q-table — Start with all Q-values set to zero
  2. Observe current state — Convert the board configuration to a unique key
  3. Select action — Use ε-greedy: explore randomly (10%) or exploit best known move (90%)
  4. Execute action — Place the mark and observe the new state
  5. Receive reward — Win: +1, Loss: −1, Draw/Continue: 0
  6. Update Q-value — Apply the Bellman equation to improve future decisions
  7. Repeat — Continue until the game ends, then start a new episode
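
Step 5's reward scheme is the only game-specific signal the agent ever receives. A small helper along these lines keeps the mapping in one place (reward_for is an illustrative name, not taken from the project; the notebook may encode results differently):

def reward_for(winner, agent_mark):
    """Translate a game result into the reward scheme above.

    winner is the winning player's mark (+1 or -1), 0 for a draw,
    or None if the game is still in progress."""
    if winner is None or winner == 0:
        return 0                      # draw, or nothing decided yet
    return 1 if winner == agent_mark else -1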

Core Implementation

Below are the key functions that power the Q-learning agent:

State Representation & Action Selection

import numpy as np
import random

# Hyperparameters
alpha = 0.5    # Learning rate
gamma = 0.9    # Discount factor
epsilon = 0.1  # Exploration rate
q_table = {}   # State-action value dictionary

def board_to_key(board):
    """Convert board array to hashable string key."""
    return str(board.flatten())

def available_actions(board):
    """Return indices of empty cells."""
    flat = board.flatten()
    return [i for i in range(9) if flat[i] == 0]

def choose_action(board, player):
    """Epsilon-greedy action selection."""
    state = board_to_key(board)
    actions = available_actions(board)

    if random.random() < epsilon:
        return random.choice(actions)  # Explore

    # Exploit: choose best known action
    q_values = [q_table.get((state, a), 0) for a in actions]
    max_q = max(q_values)
    best = [a for a, q in zip(actions, q_values) if q == max_q]
    return random.choice(best)
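
As a quick usage sketch, assuming the board is a 3x3 NumPy integer array (which is what the helpers above imply):

# Pick the agent's first move on an empty board
board = np.zeros((3, 3), dtype=int)
move = choose_action(board, player=1)   # an index 0-8 into the flattened board
print(move)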

Q-Table Update

def update_q_table(state, action, reward, next_board, done):
    """Apply the Q-learning update rule.

    next_board is the board after the move, used to look up the
    next state's key and its legal actions."""
    old_q = q_table.get((state, action), 0)

    if done:
        # Terminal state: no future reward
        new_q = old_q + alpha * (reward - old_q)
    else:
        # Estimate future reward from the best action in the next state
        next_state = board_to_key(next_board)
        next_actions = available_actions(next_board)
        max_future = max((q_table.get((next_state, a), 0)
                          for a in next_actions), default=0)
        new_q = old_q + alpha * (reward + gamma * max_future - old_q)

    q_table[(state, action)] = new_q
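
Neither block above shows how training actually happens. The sketch below is one minimal way to wire them into a training loop, using the reward_for helper sketched earlier plus a hypothetical check_winner; the project's notebook may structure its loop differently.

# The agent (mark 1) learns by playing many games against a random opponent (mark -1).
# WIN_LINES, check_winner and train are illustrative helpers, not taken from the project.
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def check_winner(board):
    """Return the winning mark (1 or -1), 0 for a draw, or None if play continues."""
    flat = board.flatten()
    for a, b, c in WIN_LINES:
        if flat[a] != 0 and flat[a] == flat[b] == flat[c]:
            return int(flat[a])
    return 0 if not available_actions(board) else None

def train(episodes=10000):
    """Train by self-play: update the Q-table after every agent move."""
    for _ in range(episodes):
        board = np.zeros((3, 3), dtype=int)
        while True:
            state = board_to_key(board)
            action = choose_action(board, player=1)
            board.flat[action] = 1                      # agent places its mark
            result = check_winner(board)
            if result is not None:                      # agent's move ended the game
                update_q_table(state, action, reward_for(result, 1), board, done=True)
                break
            # Random opponent replies
            board.flat[random.choice(available_actions(board))] = -1
            result = check_winner(board)
            done = result is not None
            update_q_table(state, action, reward_for(result, 1), board, done)
            if done:
                break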

Try It Yourself

Play against a minimax AI opponent (you are X, AI is O)


Exploration vs. Exploitation

The ε-greedy strategy is crucial: with probability ε (10%), the agent tries a random move to discover new strategies. The remaining 90% of the time, it exploits its learned knowledge by choosing the move with the highest Q-value. This balance prevents the agent from getting stuck in suboptimal patterns.
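
A quick way to see the split in action (an illustrative check, not part of the project):

# Roughly one decision in ten should be exploratory at epsilon = 0.1
trials = 100_000
exploratory = sum(random.random() < epsilon for _ in range(trials))
print(f"{exploratory / trials:.1%} of decisions were random")   # prints about 10.0%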

Run the Full Implementation

The browser demo uses a simplified opponent. To train and play against a true Q-learning agent that improves over time, run the notebook: