Q-learning example in Java
Q-learning is a reinforcement-learning technique that lets an agent learn on its own by giving it rewards or punishments.
This example shows Q-learning used for path finding. A robot learns where it should go from any state.
The robot starts at a random place and keeps a memory of the score values while it explores the area; whenever it reaches the goal, we repeat with a new random start. After enough repetitions the score values become stationary (convergence).
In this example the action outcome is deterministic (the transition probability is 1) and the action selection is random. The score values are calculated with the Q-learning update rule for Q(s,a).
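The update rule can be distilled into a few lines. The sketch below is a minimal illustration only; the method name `update` and the sample numbers are assumptions for this post's setting (alpha = 0.1, gamma = 0.9, reward 100 for reaching the goal):

```java
public class QUpdate {
    // Q(s,a) <- Q(s,a) + alpha * (R(s,a) + gamma * max_a' Q(s',a') - Q(s,a))
    static double update(double q, double r, double maxQNext,
                         double alpha, double gamma) {
        return q + alpha * (r + gamma * maxQNext - q);
    }

    public static void main(String[] args) {
        // One step of B -> C: reward 100, and C has no future value (maxQ = 0)
        double q = update(0.0, 100.0, 0.0, 0.1, 0.9);
        System.out.println(q); // prints 10.0: the value moves 10% toward 100
    }
}
```

Each visit moves the stored value a fraction alpha toward the observed reward plus the discounted best future value, which is why repeated episodes are needed before the values settle.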
The image shows the states (A, B, C, D, E, F), the possible actions from each state, and the rewards given.
Result Q*(s,a)
Policy Π*(s)
The full Java source is listed below.
Qlearning.java
import java.text.DecimalFormat;
import java.util.Random;

/**
 * @author Kunuk Nykjaer
 */
public class Qlearning {
    final DecimalFormat df = new DecimalFormat("#.##"); // path finding
    final double alpha = 0.1;
    final double gamma = 0.9;

    // States A,B,C,D,E,F
    // e.g. from A we can go to B or D
    // from C we can only go to C
    // C is the goal state; reward 100 when B->C or F->C
    //
    //  _______
    // |A|B|C|
    // |_____|
    // |D|E|F|
    // |_____|

    final int stateA = 0;
    final int stateB = 1;
    final int stateC = 2;
    final int stateD = 3;
    final int stateE = 4;
    final int stateF = 5;

    final int statesCount = 6;
    final int[] states = new int[]{stateA, stateB, stateC, stateD, stateE, stateF};

    // http://en.wikipedia.org/wiki/Q-learning
    // http://people.revoledu.com/kardi/tutorial/ReinforcementLearning/Q-Learning.htm

    // Q(s,a) = Q(s,a) + alpha * (R(s,a) + gamma * Max(next state, all actions) - Q(s,a))
    int[][] R = new int[statesCount][statesCount];       // reward lookup
    double[][] Q = new double[statesCount][statesCount]; // Q learning

    int[] actionsFromA = new int[]{stateB, stateD};
    int[] actionsFromB = new int[]{stateA, stateC, stateE};
    int[] actionsFromC = new int[]{stateC};
    int[] actionsFromD = new int[]{stateA, stateE};
    int[] actionsFromE = new int[]{stateB, stateD, stateF};
    int[] actionsFromF = new int[]{stateC, stateE};
    int[][] actions = new int[][]{actionsFromA, actionsFromB, actionsFromC,
            actionsFromD, actionsFromE, actionsFromF};

    String[] stateNames = new String[]{"A", "B", "C", "D", "E", "F"};

    public Qlearning() {
        init();
    }

    public void init() {
        R[stateB][stateC] = 100; // from B to C
        R[stateF][stateC] = 100; // from F to C
    }

    public static void main(String[] args) {
        long BEGIN = System.currentTimeMillis();

        Qlearning obj = new Qlearning();
        obj.run();
        obj.printResult();
        obj.showPolicy();

        long END = System.currentTimeMillis();
        System.out.println("Time: " + (END - BEGIN) / 1000.0 + " sec.");
    }

    void run() {
        /*
         * 1. Set the parameters and the environment reward matrix R
         * 2. Initialize matrix Q as a zero matrix
         * 3. For each episode:
         *    Select a random initial state
         *    Do while the goal state is not reached:
         *      - Select one among all possible actions for the current state
         *      - Using this possible action, consider going to the next state
         *      - Get the maximum Q value of this next state based on all possible actions
         *      - Compute the new Q value
         *      - Set the next state as the current state
         */

        // For each episode
        Random rand = new Random();
        for (int i = 0; i < 1000; i++) { // train episodes
            // Select a random initial state
            int state = rand.nextInt(statesCount);
            while (state != stateC) { // goal state
                // Select one among all possible actions for the current state
                int[] actionsFromState = actions[state];

                // Selection strategy is random in this example
                int index = rand.nextInt(actionsFromState.length);
                int action = actionsFromState[index];

                // Action outcome is set to deterministic in this example
                // (transition probability is 1)
                int nextState = action; // data structure

                // Using this possible action, consider going to the next state
                double q = Q(state, action);
                double maxQ = maxQ(nextState);
                int r = R(state, action);

                double value = q + alpha * (r + gamma * maxQ - q);
                setQ(state, action, value);

                // Set the next state as the current state
                state = nextState;
            }
        }
    }

    double maxQ(int s) {
        int[] actionsFromState = actions[s];
        // Note: Double.NEGATIVE_INFINITY, not Double.MIN_VALUE (which is the
        // smallest positive double), so the comparison also works for zeros
        double maxValue = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < actionsFromState.length; i++) {
            int nextState = actionsFromState[i];
            double value = Q[s][nextState];
            if (value > maxValue)
                maxValue = value;
        }
        return maxValue;
    }

    // get policy from state
    int policy(int state) {
        int[] actionsFromState = actions[state];
        double maxValue = Double.NEGATIVE_INFINITY;
        int policyGotoState = state; // default: go to self if not found
        for (int i = 0; i < actionsFromState.length; i++) {
            int nextState = actionsFromState[i];
            double value = Q[state][nextState];
            if (value > maxValue) {
                maxValue = value;
                policyGotoState = nextState;
            }
        }
        return policyGotoState;
    }

    double Q(int s, int a) {
        return Q[s][a];
    }

    void setQ(int s, int a, double value) {
        Q[s][a] = value;
    }

    int R(int s, int a) {
        return R[s][a];
    }

    void printResult() {
        System.out.println("Print result");
        for (int i = 0; i < Q.length; i++) {
            System.out.print("out from " + stateNames[i] + ": ");
            for (int j = 0; j < Q[i].length; j++) {
                System.out.print(df.format(Q[i][j]) + " ");
            }
            System.out.println();
        }
    }

    // policy is maxQ(states)
    void showPolicy() {
        System.out.println("\nshowPolicy");
        for (int i = 0; i < states.length; i++) {
            int from = states[i];
            int to = policy(from);
            System.out.println("from " + stateNames[from] + " goto " + stateNames[to]);
        }
    }
}
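The listing always picks actions at random. A common variant (not used in the original code) is epsilon-greedy selection, which mostly follows the best known action but still explores with probability epsilon. A minimal sketch, where the class name, method name, and the epsilon value are assumptions for illustration:

```java
import java.util.Random;

public class EpsilonGreedy {
    // Pick a random action with probability epsilon,
    // otherwise pick the action with the highest Q value.
    static int select(double[] qRow, int[] actionsFromState,
                      double epsilon, Random rand) {
        if (rand.nextDouble() < epsilon) {
            return actionsFromState[rand.nextInt(actionsFromState.length)];
        }
        int best = actionsFromState[0];
        for (int a : actionsFromState) {
            if (qRow[a] > qRow[best]) best = a;
        }
        return best;
    }

    public static void main(String[] args) {
        double[] qRow = {0.0, 81.0, 0.0, 72.9, 0.0, 0.0}; // Q values out of state A
        int[] actionsFromA = {1, 3};                      // B or D, as in the post
        // With epsilon = 0.0 the choice is purely greedy
        int action = select(qRow, actionsFromA, 0.0, new Random());
        System.out.println(action); // prints 1 (state B, the higher Q value)
    }
}
```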
Print result
out from A: 0 90 0 72.9 0 0
out from B: 81 0 100 0 81 0
out from C: 0 0 0 0 0 0
out from D: 81 0 0 0 81 0
out from E: 0 90 0 72.9 0 90
out from F: 0 0 100 0 81 0
showPolicy
from a goto B
from b goto C
from c goto C
from d goto A
from e goto B
from f goto C
Time: 0.025 sec.
Convergence
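The code simply runs a fixed 1000 episodes and relies on that being enough. One way to check convergence explicitly (an assumption, not something the original code does) is to track the largest change to any Q value per sweep and stop when it falls below a threshold. A toy sketch with an invented update that pulls every value toward 1.0:

```java
public class ConvergenceCheck {
    // Stop when no value changed by more than `threshold` during a full sweep.
    static int sweepsToConverge(double[][] Q, double threshold) {
        double maxDelta;
        int sweeps = 0;
        do {
            maxDelta = 0.0;
            for (int s = 0; s < Q.length; s++) {
                for (int a = 0; a < Q[s].length; a++) {
                    double old = Q[s][a];
                    // toy update pulling every value toward 1.0
                    Q[s][a] = old + 0.5 * (1.0 - old);
                    maxDelta = Math.max(maxDelta, Math.abs(Q[s][a] - old));
                }
            }
            sweeps++;
        } while (maxDelta > threshold);
        return sweeps;
    }

    public static void main(String[] args) {
        int sweeps = sweepsToConverge(new double[2][2], 1e-6);
        System.out.println("converged after " + sweeps + " sweeps"); // 20 sweeps
    }
}
```

In the real trainer the same maxDelta bookkeeping would wrap the `setQ` call inside each episode.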
The next post about Q-learning can be read here:
https://kunuk.wordpress.com/2012/01/14/q-learning-framework-example-with-c/
Heh, nice!
Angielski Online
September 25, 2010 at 2:09 am
I have a question about the number of states and RL.
If we cannot create a countable state space for the robot, what should we do?
Imagine that the robot is on a floor with several corridors.
mohammad
July 23, 2011 at 12:10 am
That’s a good question. I think you should ask this in http://stackoverflow.com/
I am sure lots of smart people can answer this better than me.
kunuk Nykjaer
January 15, 2012 at 6:39 pm
Can you please explain the results? I'm a bit confused.
cjb
October 20, 2012 at 4:53 pm
Thank you for this, it has helped me a lot in understanding Q-learning. There is nothing like a real example to clarify a concept 🙂
lalala
January 31, 2013 at 12:58 am
Hi!
Congratulations on your blog! =)
But could you explain the training phase?
I want to implement the Q-learning algorithm on a mobile robot, but, for example,
I don't know what to do when the robot chooses an invalid action, such as exiting the workspace. Imagine the robot in 'position 1': it can only go forward, but it decides to go backward and so it exits the workspace. In this case, must I program an action that takes it back to 'position 1', or what?
Mah
February 16, 2013 at 2:01 am
You could give it a negative reward and do a new random start. Just like the goal state but with negative points.
kunuk Nykjaer
February 20, 2013 at 12:39 am
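The suggestion above (a negative reward for leaving the workspace, then a restart) could look roughly like this; the class name, the -100 penalty value, and the `INVALID` marker are assumptions for illustration:

```java
import java.util.Random;

public class InvalidActionPenalty {
    // Treat leaving the workspace like a terminal state with a negative
    // reward, then restart the episode from a random state.
    static final int INVALID = -1;        // marker for "outside the workspace"
    static final double PENALTY = -100.0; // assumed penalty value

    static double reward(int nextState) {
        return nextState == INVALID ? PENALTY : 0.0;
    }

    public static void main(String[] args) {
        Random rand = new Random();
        int state = rand.nextInt(6);
        int nextState = INVALID; // suppose the robot drove out of bounds
        double r = reward(nextState);
        System.out.println(r); // prints -100.0
        if (nextState == INVALID) {
            state = rand.nextInt(6); // end the episode with a new random start
        }
        System.out.println("restarted in state " + state);
    }
}
```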
This is a very useful blog; it helped me enormously in understanding Q-learning and how to actually implement it. Very good.
raj
January 28, 2014 at 10:40 am
Hey, Kunuk!
May I please use your provided code as a blueprint for my computer science project?
Thanks. 🙂
Valters
April 9, 2014 at 12:22 am
Sure, you're welcome.
kunuk Nykjaer
April 9, 2014 at 9:58 am
Kunuk, hope you are doing well. Can you please walk me through some sample code on how to learn the weights of an evaluation function in chess using temporal difference?
mishaelharry2013
August 24, 2014 at 8:47 am
This is very clearly explained, but I didn't see where convergence was implemented in the code?
Mike
August 30, 2014 at 9:04 pm
Thank you, brother, for this work. I want to apply RL to the game 2048, but the number of states is huge and we can't program that with Q-learning. So can anyone here suggest another algorithm better adapted to this problem?
Seifou
December 3, 2014 at 4:50 pm
This is awesome and very nicely explained! Thanks a lot.
ardi
June 10, 2016 at 2:09 pm
I think there shouldn’t be any notion of “goal”. There is a set of states and a set of actions possible to invoke (and get reward/penalty) while being on a certain state. Looking at the author’s schema, one might think that the ultimate goal of algorithm is to find the cheapest path to C, while in reality the goal is to find the action with highest long-term reward for every state.
mangusta1
July 4, 2016 at 7:07 am