Software Programming

Kunuk Nykjaer

Hybrid AI example with Java, TicTacToe Reinforcement-Learning and NN


Training an AI player for the Tic Tac Toe game.
A simple example combining reinforcement learning with a neural network.
I want to train the AI player to be able to beat an opponent that plays randomly.

The training method is simple.
The AI also plays randomly, and after the game finishes a positive reward is given if it wins or a negative reward if it loses.
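The reward assignment above can be sketched as follows. This is a minimal illustration, assuming rewards of +1 for a win, -1 for a loss and 0 for a draw; the names `GameResult` and `reward` are illustrative, not taken from the project.

```java
// Sketch of the end-of-game reward, assuming +1 / -1 / 0 values.
public class Reward {
    enum GameResult { WIN, LOSS, DRAW }

    static double reward(GameResult result) {
        switch (result) {
            case WIN:  return 1.0;   // positive reward when the AI wins
            case LOSS: return -1.0;  // negative reward when it loses
            default:   return 0.0;   // no reward on a draw
        }
    }
}
```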

The input size for the NN is 18: 9 inputs for the AI's pieces and 9 for the opponent's pieces, which together represent the state. A value of 0 means no piece and 1 means a piece at that index. The hidden layer also has 18 neurons. The output size is 9, and the output with the maximum value is the preferred action. Back-propagation is applied to only one output neuron, the chosen action.
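The 18-input encoding described above can be sketched like this. The class and constant names are illustrative assumptions; the split into two 9-cell halves (AI pieces first, opponent pieces second) follows the description in the post.

```java
// Sketch of the 18-input board encoding: indices 0..8 mark the AI's pieces,
// indices 9..17 the opponent's; 1 = piece present, 0 = empty.
public class BoardEncoder {
    public static final int EMPTY = 0, HERO = 1, VILLAIN = 2;

    // board: 9 cells holding EMPTY / HERO / VILLAIN values
    static double[] encode(int[] board) {
        double[] input = new double[18];
        for (int i = 0; i < 9; i++) {
            if (board[i] == HERO)    input[i] = 1.0;      // first 9: AI pieces
            if (board[i] == VILLAIN) input[9 + i] = 1.0;  // last 9: opponent pieces
        }
        return input;
    }
}
```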

Because the next state is not known at the time of the Q-learning update, the states are 'delayed' in the Q-learning algorithm:
it uses the previous state and the current state instead of the current state and the next state.
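The delayed update can be sketched as below: once the current state is observed, the value of the action taken in the *previous* state is updated using the standard Q-learning target, shifted one step. The learning rate, discount factor and method names are illustrative assumptions, not values from the project.

```java
// Sketch of the "delayed" Q-learning update: the target for the action taken
// in the previous state is computed only once the current state is known.
public class DelayedQ {
    static final double ALPHA = 0.1;  // learning rate (assumed value)
    static final double GAMMA = 0.9;  // discount factor (assumed value)

    // qPrev:  the network's 9 outputs for the previous state
    // qCurr:  the network's 9 outputs for the current state
    // action: index of the move taken in the previous state
    // reward: 0 during the game, +1 / -1 at the end
    static double targetForAction(double[] qPrev, double[] qCurr,
                                  int action, double reward) {
        double maxNext = qCurr[0];
        for (double q : qCurr) maxNext = Math.max(maxNext, q);
        // Standard Q-learning target, applied one step late:
        return qPrev[action] + ALPHA * (reward + GAMMA * maxNext - qPrev[action]);
    }
}
```

The returned target would then be used to back-propagate error through the single output neuron for the chosen action, matching the training scheme described above.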

Winning results after 70,000 training games, measured over 1,000 games played:

Random starts
Hero (ai) = 616
Villain = 304
Draw = 80

Villain starts
Hero (ai) = 475
Villain = 449
Draw = 76

Hero starts
Hero (ai) = 807
Villain = 109
Draw = 84

Eclipse project download options:
Remote download:
Local download (rename the doc file to a zip file):


Written by kunuk Nykjaer

October 24, 2010 at 2:28 pm

4 Responses


  1. Hello kunuk. Brilliant article, but I can't seem to download any of the associated files. I just wanted to ask if you could elaborate on the number of inputs and outputs. I understand the 9 outputs because we need a Q-value for each of the squares in the next move, but why are there 18 inputs? Is it 9 inputs with a max value of 2 being "0 = empty", "1 = player", "2 = computer" for the current state and 9 inputs for the next state?


    November 6, 2011 at 10:11 pm

    • It should be possible to download. I just tried and it worked for me. There are two links. Try the other one if one doesn’t work.

      I made this a long time ago. As I recall, the input is either 0 or 1:
      0 for empty and 1 for occupied.
      This project was based on a school assignment.

      There is a specific reason for choosing 18 inputs and not 9.
      Some of it has to do with not mixing hero and villain pieces in the same input.
      I can't remember the detailed reasoning for this.

      The 18 inputs carry only piece information; there is no separate state input. The previous state and current state are managed internally.

      kunuk Nykjaer

      November 7, 2011 at 7:05 pm

  2. Reblogged this on Howl.


    August 4, 2012 at 8:27 am

    • Hiya, I want to implement a similar learning approach in chess using TD-learning, where the algorithm learns to tune the weights of an evaluation function. How do I implement it, since chess has a practically infinite number of states compared to tic tac toe?


      August 24, 2014 at 9:00 am
