Software Programming

Kunuk Nykjaer

Q-learning example with Java


Q-learning is a reinforcement learning technique that lets an AI learn by itself from rewards and punishments.

This example shows Q-learning used for path finding. A robot learns where it should go from any state.

The robot starts at a random place and keeps a memory of the score values while it explores the area. Whenever it reaches the goal, we repeat with a new random start. After enough repetitions the score values become stationary (convergence).

In this example the action outcome is deterministic (the transition probability is 1) and the action selection is random. The score values are calculated with the Q-learning update rule Q(s,a).
The image shows the states (A, B, C, D, E, F), the possible actions from each state and the rewards given. A hand-worked update follows the figures below.

[Image: states, possible actions and rewards]

Result Q*(s,a)
[Image: Q*(s,a) values]

Policy Π*(s)
[Image: Π*(s) policy]
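
To make the update rule concrete, here is a hand-worked update using the same parameters as the code below (alpha = 0.1, gamma = 0.9) and the reward of 100 for entering C. All Q values start at 0, so the first time the robot moves from B to C:

Q(B,C) = Q(B,C) + alpha * (R(B,C) + gamma * maxQ(C) - Q(B,C))
       = 0 + 0.1 * (100 + 0.9 * 0 - 0)
       = 10

Every later visit moves Q(B,C) another 10% of the remaining distance towards R(B,C) + gamma * maxQ(C) = 100, which is why the result table below converges to Q*(B,C) = 100.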

The Java code is listed below.
Qlearning.java

import java.text.DecimalFormat;
import java.util.Random;

/**
 * @author Kunuk Nykjaer
 */
public class Qlearning {
	final DecimalFormat df = new DecimalFormat("#.##");

	// path finding
	final double alpha = 0.1;
	final double gamma = 0.9;


// states A,B,C,D,E,F
// e.g. from A we can go to B or D
// from C we can only go to C 
// C is goal state, reward 100 when B->C or F->C
// 
// _______
// |A|B|C|
// |_____|
// |D|E|F|
// |_____|
//

	final int stateA = 0;
	final int stateB = 1;
	final int stateC = 2;
	final int stateD = 3;
	final int stateE = 4;
	final int stateF = 5;

	final int statesCount = 6;
	final int[] states = new int[]{stateA,stateB,stateC,stateD,stateE,stateF};

	// http://en.wikipedia.org/wiki/Q-learning
	// http://people.revoledu.com/kardi/tutorial/ReinforcementLearning/Q-Learning.htm

	// Q(s,a) = Q(s,a) + alpha * (R(s,a) + gamma * maxQ(next state) - Q(s,a))

	int[][] R = new int[statesCount][statesCount]; // reward lookup
	double[][] Q = new double[statesCount][statesCount]; // Q learning

	int[] actionsFromA = new int[] { stateB, stateD };
	int[] actionsFromB = new int[] { stateA, stateC, stateE };
	int[] actionsFromC = new int[] { stateC };
	int[] actionsFromD = new int[] { stateA, stateE };
	int[] actionsFromE = new int[] { stateB, stateD, stateF };
	int[] actionsFromF = new int[] { stateC, stateE };
	int[][] actions = new int[][] { actionsFromA, actionsFromB, actionsFromC,
			actionsFromD, actionsFromE, actionsFromF };

	String[] stateNames = new String[] { "A", "B", "C", "D", "E", "F" };

	public Qlearning() {
		init();
	}

	public void init() {		
		R[stateB][stateC] = 100; // from b to c
		R[stateF][stateC] = 100; // from f to c		
	}

	public static void main(String[] args) {
		long BEGIN = System.currentTimeMillis();

		Qlearning obj = new Qlearning();

		obj.run();
		obj.printResult();
		obj.showPolicy();

		long END = System.currentTimeMillis();
		System.out.println("Time: " + (END - BEGIN) / 1000.0 + " sec.");
	}

	void run() {
		/*
		 1. Set the parameters alpha and gamma, and the environment reward matrix R
		 2. Initialize matrix Q as the zero matrix
		 3. For each episode: select a random initial state
			Do while the goal state has not been reached
				Select one among all possible actions for the current state
				Using this possible action, consider going to the next state
				Get the maximum Q value of this next state based on all possible actions
				Compute Q(state, action) and set the next state as the current state
		 */

		// For each episode
		Random rand = new Random();
		for (int i = 0; i < 1000; i++) { // train episodes
			// Select random initial state
			int state = rand.nextInt(statesCount);
			while (state != stateC) // goal state
			{
				// Select one among all possible actions for the current state
				int[] actionsFromState = actions[state];
				
				// Selection strategy is random in this example
				int index = rand.nextInt(actionsFromState.length);
				int action = actionsFromState[index];

				// Action outcome is set to deterministic in this example
				// Transition probability is 1
				int nextState = action; // in this model the chosen action is the next state

				// Using this possible action, consider to go to the next state
				double q = Q(state, action);
				double maxQ = maxQ(nextState);
				int r = R(state, action);

				double value = q + alpha * (r + gamma * maxQ - q);
				setQ(state, action, value);

				// Set the next state as the current state
				state = nextState;
			}
		}
	}

	// Highest Q value among all actions available from state s
	double maxQ(int s) {
		int[] actionsFromState = actions[s];
		// Double.MIN_VALUE is the smallest positive double, so start from negative infinity instead
		double maxValue = Double.NEGATIVE_INFINITY;
		for (int i = 0; i < actionsFromState.length; i++) {
			int nextState = actionsFromState[i];
			double value = Q[s][nextState];

			if (value > maxValue)
				maxValue = value;
		}
		return maxValue;
	}

	// get policy from state: the next state with the highest Q value
	int policy(int state) {
		int[] actionsFromState = actions[state];
		double maxValue = Double.NEGATIVE_INFINITY;
		int policyGotoState = state; // default goto self if not found
		for (int i = 0; i < actionsFromState.length; i++) {
			int nextState = actionsFromState[i];
			double value = Q[state][nextState];

			if (value > maxValue) {
				maxValue = value;
				policyGotoState = nextState;
			}
		}
		return policyGotoState;
	}

	double Q(int s, int a) {
		return Q[s][a];
	}

	void setQ(int s, int a, double value) {
		Q[s][a] = value;
	}

	int R(int s, int a) {
		return R[s][a];
	}

	void printResult() {
		System.out.println("Print result");
		for (int i = 0; i < Q.length; i++) {
			System.out.print("out from " + stateNames[i] + ":  ");
			for (int j = 0; j < Q[i].length; j++) {
				System.out.print(df.format(Q[i][j]) + " ");
			}
			System.out.println();
		}
	}

	// the policy picks the action (next state) with the highest Q value for each state
	void showPolicy() {
		System.out.println("\nshowPolicy");
		for (int i = 0; i < states.length; i++) {
			int from = states[i];
			int to =  policy(from);
			System.out.println("from "+stateNames[from]+" goto "+stateNames[to]);
		}			
	}
}

Print result
out from A:  0 90 0 72.9 0 0
out from B:  81 0 100 0 81 0
out from C:  0 0 0 0 0 0
out from D:  81 0 0 0 81 0
out from E:  0 90 0 72.9 0 90
out from F:  0 0 100 0 81 0

showPolicy
from A goto B
from B goto C
from C goto C
from D goto A
from E goto B
from F goto C
Time: 0.025 sec.
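
Reading the result: each row of the Q table holds the learned values for going from that state to A, B, C, D, E and F, and the policy simply picks the column with the highest value. Out from A the best value is 90 in column B, hence "from A goto B"; out from E both B and F score 90 and the loop keeps the first maximum, so B is printed. The values also agree with the update rule at convergence:

Q*(B,C) = 100 + 0.9 * maxQ(C) = 100 + 0.9 * 0   = 100
Q*(A,B) =   0 + 0.9 * maxQ(B) =   0 + 0.9 * 100 = 90
Q*(A,D) =   0 + 0.9 * maxQ(D) =   0 + 0.9 * 81  = 72.9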

[Image: Convergence]

The next post about Q-learning can be read here:
https://kunuk.wordpress.com/2012/01/14/q-learning-framework-example-with-c/

Written by kunuk Nykjaer

September 24, 2010 at 12:42 pm

21 Responses

  1. Heh, nice!

    Angielski Online

    September 25, 2010 at 2:09 am

  2. I have a question about the number of states and RL.
    If we cannot create a countable state space for the robot, what should we do?
    Imagine that the robot is on a floor with several corridors.

    mohammad

    July 23, 2011 at 12:10 am

  5. Can you please explain the results? I'm a bit confused.

    cjb

    October 20, 2012 at 4:53 pm

    • Thank you for this, it has helped me a lot in understanding Q-learning. There is nothing like a real example to clarify a concept 🙂

      lalala

      January 31, 2013 at 12:58 am

  6. Hi!

    Congratulations on your blog! =)
    But could you explain the training phase?
    I want to implement the Q-learning algorithm on a mobile robot, but, for example,
    I don't know what to do when the robot chooses an invalid action, such as exiting the workspace. Imagine the robot in 'position 1': it can only go forward, but it decides to go backward and so it exits the workspace. In this case, must I program an action that takes it back to 'position 1', or what?

    Mah

    February 16, 2013 at 2:01 am

    • You could give it a negative reward and do a new random start. Just like the goal state, but with negative points.

      kunuk Nykjaer

      February 20, 2013 at 12:39 am

  7. this is a very useful blog, helped me enormously in understanding Q-learning and how to actually implement it….Very good…

    raj

    January 28, 2014 at 10:40 am

  8. Hey, Kunuk!
    May I please use your provided code as a blueprint for my project in C. Science?
    Thanks. 🙂

    Valters

    April 9, 2014 at 12:22 am

    • Sure, you're welcome.

      kunuk Nykjaer

      April 9, 2014 at 9:58 am

      • Kunuk, hope you are doing well. Please can you walk me through some sample code on how to learn the weights of an evaluation function in chess using temporal difference?

        mishaelharry2013

        August 24, 2014 at 8:47 am

  9. This is very clearly explained but I didn’t see where convergence was implemented in the code?

    Mike

    August 30, 2014 at 9:04 pm

  10. Thank you, brother, for this work. I want to apply RL to the game 2048, but the number of states is very huge and with Q-learning we can't program this. So can anyone here suggest another algorithm better adapted to this problem?

    Seifou

    December 3, 2014 at 4:50 pm

  14. this is awesome and very nicely explained! thanks a lot

    ardi

    June 10, 2016 at 2:09 pm

  15. I think there shouldn’t be any notion of “goal”. There is a set of states and a set of actions possible to invoke (and get reward/penalty) while being on a certain state. Looking at the author’s schema, one might think that the ultimate goal of algorithm is to find the cheapest path to C, while in reality the goal is to find the action with highest long-term reward for every state.

    mangusta1

    July 4, 2016 at 7:07 am
