Play Tic-Tac-Toe Against a Q-Learning Agent

A reinforcement learning implementation where an AI learns to play optimally through trial and error, balancing exploration and exploitation.


Project Overview

This project implements a console-based Tic-Tac-Toe game where a human player competes against an AI that learns through Q-learning, a model-free reinforcement learning algorithm. Unlike hard-coded game AI, this agent discovers winning strategies by playing thousands of games and learning from outcomes.

What is Q-Learning?

Q-learning is a reinforcement learning technique that learns the value (quality) of actions in different states. The agent builds a "Q-table" mapping state-action pairs to expected future rewards. Over time, it learns which moves lead to wins and avoids those that lead to losses — all without being explicitly programmed with game strategy.
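Concretely, the Q-table can be an ordinary dictionary keyed by (state, action) pairs. A minimal sketch with hypothetical values (the key format matches the `board_to_key` helper shown later):

```python
# A Q-table as a plain dict: (state_key, action_index) -> estimated value.
q_table = {}

state = "[0 0 0 0 1 0 0 0 0]"  # example key: X has taken the center
q_table[(state, 0)] = 0.25     # hypothetical: corner reply has fared well
q_table[(state, 1)] = -0.10    # hypothetical: edge reply has fared poorly

# Unseen (state, action) pairs default to 0
best = max(range(9), key=lambda a: q_table.get((state, a), 0))
print(best)  # 0
```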

The Mathematics Behind Q-Learning

The algorithm updates Q-values using the Bellman equation, which balances immediate rewards with expected future returns:

Q-Value Update Rule (Bellman Equation)

\[ Q(s, a) \leftarrow Q(s, a) + \alpha \Big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \Big] \]
| Parameter | Symbol | Value | Role |
|---|---|---|---|
| Learning rate | \(\alpha\) | 0.5 | How quickly new information overrides old knowledge |
| Discount factor | \(\gamma\) | 0.9 | Importance of future rewards vs. immediate rewards |
| Exploration rate | \(\epsilon\) | 0.1 | Probability of trying a random move (exploration) |
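As a quick numeric sanity check of the update rule (with hypothetical values), suppose the current estimate is \(Q(s, a) = 0.2\), the move yields no immediate reward, and the best action in the resulting state is currently valued at 0.6:

```python
alpha, gamma = 0.5, 0.9  # values from the table above

old_q = 0.2        # current estimate Q(s, a)
reward = 0         # game continues: no immediate reward
max_future = 0.6   # max Q-value over actions in the next state

# Move the old estimate partway toward r + gamma * max(Q(s', a'))
new_q = old_q + alpha * (reward + gamma * max_future - old_q)
print(round(new_q, 2))  # 0.37
```

With \(\alpha = 0.5\), the estimate moves halfway from 0.2 toward the target \(0 + 0.9 \times 0.6 = 0.54\).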

Algorithm Steps

  1. Initialize Q-table — Start with all Q-values set to zero
  2. Observe current state — Convert the board configuration to a unique key
  3. Select action — Use ε-greedy: explore randomly (10%) or exploit best known move (90%)
  4. Execute action — Place the mark and observe the new state
  5. Receive reward — Win: +1, Loss: −1, Draw/Continue: 0
  6. Update Q-value — Apply the Bellman equation to improve future decisions
  7. Repeat — Continue until the game ends, then start a new episode
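The seven steps above can be sketched as a single self-play episode. This is a minimal, self-contained sketch: it repeats the hyperparameters, and for brevity applies only a terminal-style update with the loss signal omitted (the full bootstrapped update appears in the implementation below):

```python
import random
import numpy as np

alpha, epsilon = 0.5, 0.1
q_table = {}  # step 1: empty table; unseen pairs default to 0

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(flat):
    """Return +1 or -1 if that player has three in a row, else 0."""
    for a, b, c in LINES:
        if flat[a] != 0 and flat[a] == flat[b] == flat[c]:
            return int(flat[a])
    return 0

def play_episode():
    """Steps 2-7 for one self-play game (simplified update)."""
    board = np.zeros(9, dtype=int)
    player = 1
    while True:
        state = str(board)                          # step 2: observe state
        actions = [i for i in range(9) if board[i] == 0]
        if random.random() < epsilon:               # step 3: explore...
            action = random.choice(actions)
        else:                                       # ...or exploit
            qs = [q_table.get((state, a), 0) for a in actions]
            action = actions[qs.index(max(qs))]
        board[action] = player                      # step 4: execute
        w = winner(board)
        done = w != 0 or not (board == 0).any()
        reward = 1 if w == player else 0            # step 5: +1 for a winning move
        old_q = q_table.get((state, action), 0)     # step 6: terminal-style update
        q_table[(state, action)] = old_q + alpha * (reward - old_q)
        if done:
            return w
        player = -player                            # other player moves next

for _ in range(200):                                # step 7: repeat episodes
    play_episode()
```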

Core Implementation

Below are the key functions that power the Q-learning agent:

State Representation & Action Selection

import numpy as np
import random

# Hyperparameters
alpha = 0.5    # Learning rate
gamma = 0.9    # Discount factor
epsilon = 0.1  # Exploration rate
q_table = {}   # State-action value dictionary

def board_to_key(board):
    """Convert board array to hashable string key."""
    return str(board.flatten())

def available_actions(board):
    """Return indices of empty cells."""
    flat = board.flatten()
    return [i for i in range(9) if flat[i] == 0]

def choose_action(board, player):
    """Epsilon-greedy action selection."""
    state = board_to_key(board)
    actions = available_actions(board)

    if random.random() < epsilon:
        return random.choice(actions)  # Explore

    # Exploit: choose best known action
    q_values = [q_table.get((state, a), 0) for a in actions]
    max_q = max(q_values)
    best = [a for a, q in zip(actions, q_values) if q == max_q]
    return random.choice(best)

Q-Table Update

def update_q_table(state, action, reward, next_board, done):
    """Apply the Q-learning update rule."""
    old_q = q_table.get((state, action), 0)

    if done:
        # Terminal state: no future reward to bootstrap from
        new_q = old_q + alpha * (reward - old_q)
    else:
        # Estimate the best achievable reward from the next state
        next_state = board_to_key(next_board)
        next_actions = available_actions(next_board)
        max_future = max((q_table.get((next_state, a), 0)
                          for a in next_actions), default=0)
        new_q = old_q + alpha * (reward + gamma * max_future - old_q)

    q_table[(state, action)] = new_q

Try It Yourself

Play against a minimax AI opponent (you are X, AI is O)


Exploration vs. Exploitation

The ε-greedy strategy is crucial: with probability ε (10%), the agent tries a random move to discover new strategies. The remaining 90% of the time, it exploits its learned knowledge by choosing the move with the highest Q-value. This balance prevents the agent from getting stuck in suboptimal patterns.
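The 10/90 split is easy to verify empirically. A quick check with the article's ε = 0.1:

```python
import random

random.seed(0)
epsilon = 0.1
trials = 100_000

# Count how often the epsilon test would trigger exploration
explored = sum(random.random() < epsilon for _ in range(trials))

print(f"explored {explored / trials:.1%} of moves")  # close to 10%
```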

Training Process and Convergence

The Q-learning agent is trained through self-play, where two instances of the agent compete against each other over thousands of episodes. During training, the exploration rate ε starts high (around 1.0) and gradually decays toward 0.1, allowing the agent to explore broadly at first and then exploit learned strategies as training progresses. This technique is known as epsilon decay and is fundamental to ensuring the agent converges toward an optimal policy.
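A common decay schedule multiplies ε by a fixed factor after each episode, with a floor. A sketch (the exact factor here is illustrative, not taken from the notebook):

```python
epsilon = 1.0      # start fully exploratory
epsilon_min = 0.1  # floor: never stop exploring entirely
decay = 0.9995     # multiplicative decay applied once per episode

for episode in range(10_000):
    # ... play one self-play training episode using the current epsilon ...
    epsilon = max(epsilon_min, epsilon * decay)

print(epsilon)  # 0.1 -- the floor is reached after roughly 4,600 episodes
```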

Tic-Tac-Toe has approximately 5,478 valid board states, making the Q-table relatively compact compared to more complex games like chess. After around 50,000 training episodes, the agent typically converges to a near-optimal policy that can consistently draw against perfect play and capitalize on opponent mistakes. The training progress can be monitored by tracking the win rate against a random opponent, which increases rapidly in the first few thousand episodes before plateauing near 95 percent.
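The state count quoted above can be checked by brute-force enumeration from the empty board, a small self-contained sketch:

```python
def reachable_states():
    """Breadth-first enumeration of every legal Tic-Tac-Toe position."""
    LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

    def won(b):
        return any(b[i] != 0 and b[i] == b[j] == b[k] for i, j, k in LINES)

    start = (0,) * 9          # empty board as a hashable tuple
    seen = {start}
    frontier = [start]
    while frontier:
        nxt = []
        for b in frontier:
            if won(b):
                continue      # terminal: no moves follow a win
            player = 1 if b.count(1) == b.count(-1) else -1
            for i in range(9):
                if b[i] == 0:
                    child = b[:i] + (player,) + b[i + 1:]
                    if child not in seen:
                        seen.add(child)
                        nxt.append(child)
        frontier = nxt
    return seen

print(len(reachable_states()))  # 5478
```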

Why Tic-Tac-Toe Is Perfect for Learning RL

Tic-Tac-Toe provides an ideal environment for understanding reinforcement learning concepts because the game has a manageable state space, clear terminal conditions (win, lose, or draw), and well-defined rewards. These properties allow the Q-learning algorithm to converge quickly, making it easy to observe learning dynamics and experiment with hyperparameter tuning. Unlike continuous control tasks, every state and action is discrete, which makes the Q-table approach directly applicable without the need for function approximation.

From Q-Tables to Deep Q-Networks

While the tabular Q-learning approach works perfectly for Tic-Tac-Toe, it becomes impractical for games with larger state spaces like Connect Four, Checkers, or Chess. Deep Q-Networks (DQN), introduced by DeepMind in 2015, replace the Q-table with a neural network that approximates Q-values for any given state. This breakthrough enabled reinforcement learning agents to master Atari games directly from pixel inputs and laid the foundation for more advanced techniques like AlphaGo and AlphaZero. Understanding tabular Q-learning first provides the essential conceptual foundation needed to grasp these more sophisticated deep reinforcement learning methods.

Run the Full Implementation

The browser demo uses a simplified opponent. To train and play against a true Q-learning agent that improves over time, run the notebook: