Q-learning example in Java
Q-learning is a reinforcement-learning technique that lets an agent learn on its own by giving it rewards or punishments.
This example shows Q-learning used for path finding. A robot learns where it should go from any state.
The robot starts at a random place and keeps a memory of the score values while it explores the area; whenever it reaches the goal, we repeat with a new random start. After enough repetitions the score values become stationary (convergence).
In this example the action outcome is deterministic (the transition probability is 1) and the action selection is random. The score values are calculated with the Q-learning update rule for Q(s,a).
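The update rule can be distilled into a few lines. The sketch below is a minimal illustration only; the method name `update` and the sample numbers are assumptions for this post's setting (alpha = 0.1, gamma = 0.9, reward 100 for reaching the goal):

```java
public class QUpdate {
    // Q(s,a) <- Q(s,a) + alpha * (R(s,a) + gamma * max_a' Q(s',a') - Q(s,a))
    static double update(double q, double r, double maxQNext,
                         double alpha, double gamma) {
        return q + alpha * (r + gamma * maxQNext - q);
    }

    public static void main(String[] args) {
        // One step of B -> C: reward 100, and C has no future value (maxQ = 0)
        double q = update(0.0, 100.0, 0.0, 0.1, 0.9);
        System.out.println(q); // prints 10.0: the value moves 10% toward 100
    }
}
```

Each visit moves the stored value a fraction alpha toward the observed reward plus the discounted best future value, which is why repeated episodes are needed before the values settle.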
The image shows the states (A, B, C, D, E, F), the possible actions from each state, and the rewards given.
Result Q*(s,a)
Policy Π*(s)
The full Java source is listed below.
Qlearning.java
import java.text.DecimalFormat;
import java.util.Random;

/**
 * @author Kunuk Nykjaer
 */
public class Qlearning {
    final DecimalFormat df = new DecimalFormat("#.##"); // path finding
    final double alpha = 0.1;
    final double gamma = 0.9;

    // States A,B,C,D,E,F
    // e.g. from A we can go to B or D
    // from C we can only go to C
    // C is the goal state; reward 100 when B->C or F->C
    //
    //  _______
    // |A|B|C|
    // |_____|
    // |D|E|F|
    // |_____|

    final int stateA = 0;
    final int stateB = 1;
    final int stateC = 2;
    final int stateD = 3;
    final int stateE = 4;
    final int stateF = 5;

    final int statesCount = 6;
    final int[] states = new int[]{stateA, stateB, stateC, stateD, stateE, stateF};

    // http://en.wikipedia.org/wiki/Q-learning
    // http://people.revoledu.com/kardi/tutorial/ReinforcementLearning/Q-Learning.htm

    // Q(s,a) = Q(s,a) + alpha * (R(s,a) + gamma * Max(next state, all actions) - Q(s,a))
    int[][] R = new int[statesCount][statesCount];       // reward lookup
    double[][] Q = new double[statesCount][statesCount]; // Q learning

    int[] actionsFromA = new int[]{stateB, stateD};
    int[] actionsFromB = new int[]{stateA, stateC, stateE};
    int[] actionsFromC = new int[]{stateC};
    int[] actionsFromD = new int[]{stateA, stateE};
    int[] actionsFromE = new int[]{stateB, stateD, stateF};
    int[] actionsFromF = new int[]{stateC, stateE};
    int[][] actions = new int[][]{actionsFromA, actionsFromB, actionsFromC,
            actionsFromD, actionsFromE, actionsFromF};

    String[] stateNames = new String[]{"A", "B", "C", "D", "E", "F"};

    public Qlearning() {
        init();
    }

    public void init() {
        R[stateB][stateC] = 100; // from B to C
        R[stateF][stateC] = 100; // from F to C
    }

    public static void main(String[] args) {
        long BEGIN = System.currentTimeMillis();

        Qlearning obj = new Qlearning();
        obj.run();
        obj.printResult();
        obj.showPolicy();

        long END = System.currentTimeMillis();
        System.out.println("Time: " + (END - BEGIN) / 1000.0 + " sec.");
    }

    void run() {
        /*
         * 1. Set the parameters and the environment reward matrix R
         * 2. Initialize matrix Q as a zero matrix
         * 3. For each episode:
         *    Select a random initial state
         *    Do while the goal state is not reached:
         *      - Select one among all possible actions for the current state
         *      - Using this possible action, consider going to the next state
         *      - Get the maximum Q value of this next state based on all possible actions
         *      - Compute the new Q value
         *      - Set the next state as the current state
         */

        // For each episode
        Random rand = new Random();
        for (int i = 0; i < 1000; i++) { // train episodes
            // Select a random initial state
            int state = rand.nextInt(statesCount);
            while (state != stateC) { // goal state
                // Select one among all possible actions for the current state
                int[] actionsFromState = actions[state];

                // Selection strategy is random in this example
                int index = rand.nextInt(actionsFromState.length);
                int action = actionsFromState[index];

                // Action outcome is set to deterministic in this example
                // (transition probability is 1)
                int nextState = action; // data structure

                // Using this possible action, consider going to the next state
                double q = Q(state, action);
                double maxQ = maxQ(nextState);
                int r = R(state, action);

                double value = q + alpha * (r + gamma * maxQ - q);
                setQ(state, action, value);

                // Set the next state as the current state
                state = nextState;
            }
        }
    }

    double maxQ(int s) {
        int[] actionsFromState = actions[s];
        // Note: Double.NEGATIVE_INFINITY, not Double.MIN_VALUE (which is the
        // smallest positive double), so the comparison also works for zeros
        double maxValue = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < actionsFromState.length; i++) {
            int nextState = actionsFromState[i];
            double value = Q[s][nextState];
            if (value > maxValue)
                maxValue = value;
        }
        return maxValue;
    }

    // get policy from state
    int policy(int state) {
        int[] actionsFromState = actions[state];
        double maxValue = Double.NEGATIVE_INFINITY;
        int policyGotoState = state; // default: go to self if not found
        for (int i = 0; i < actionsFromState.length; i++) {
            int nextState = actionsFromState[i];
            double value = Q[state][nextState];
            if (value > maxValue) {
                maxValue = value;
                policyGotoState = nextState;
            }
        }
        return policyGotoState;
    }

    double Q(int s, int a) {
        return Q[s][a];
    }

    void setQ(int s, int a, double value) {
        Q[s][a] = value;
    }

    int R(int s, int a) {
        return R[s][a];
    }

    void printResult() {
        System.out.println("Print result");
        for (int i = 0; i < Q.length; i++) {
            System.out.print("out from " + stateNames[i] + ": ");
            for (int j = 0; j < Q[i].length; j++) {
                System.out.print(df.format(Q[i][j]) + " ");
            }
            System.out.println();
        }
    }

    // policy is maxQ(states)
    void showPolicy() {
        System.out.println("\nshowPolicy");
        for (int i = 0; i < states.length; i++) {
            int from = states[i];
            int to = policy(from);
            System.out.println("from " + stateNames[from] + " goto " + stateNames[to]);
        }
    }
}
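The listing always picks actions at random. A common variant (not used in the original code) is epsilon-greedy selection, which mostly follows the best known action but still explores with probability epsilon. A minimal sketch, where the class name, method name, and the epsilon value are assumptions for illustration:

```java
import java.util.Random;

public class EpsilonGreedy {
    // Pick a random action with probability epsilon,
    // otherwise pick the action with the highest Q value.
    static int select(double[] qRow, int[] actionsFromState,
                      double epsilon, Random rand) {
        if (rand.nextDouble() < epsilon) {
            return actionsFromState[rand.nextInt(actionsFromState.length)];
        }
        int best = actionsFromState[0];
        for (int a : actionsFromState) {
            if (qRow[a] > qRow[best]) best = a;
        }
        return best;
    }

    public static void main(String[] args) {
        double[] qRow = {0.0, 81.0, 0.0, 72.9, 0.0, 0.0}; // Q values out of state A
        int[] actionsFromA = {1, 3};                      // B or D, as in the post
        // With epsilon = 0.0 the choice is purely greedy
        int action = select(qRow, actionsFromA, 0.0, new Random());
        System.out.println(action); // prints 1 (state B, the higher Q value)
    }
}
```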
Print result
out from A: 0 90 0 72.9 0 0
out from B: 81 0 100 0 81 0
out from C: 0 0 0 0 0 0
out from D: 81 0 0 0 81 0
out from E: 0 90 0 72.9 0 90
out from F: 0 0 100 0 81 0
showPolicy
from a goto B
from b goto C
from c goto C
from d goto A
from e goto B
from f goto C
Time: 0.025 sec.
Convergence
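The code simply runs a fixed 1000 episodes and relies on that being enough. One way to check convergence explicitly (an assumption, not something the original code does) is to track the largest change to any Q value per sweep and stop when it falls below a threshold. A toy sketch with an invented update that pulls every value toward 1.0:

```java
public class ConvergenceCheck {
    // Stop when no value changed by more than `threshold` during a full sweep.
    static int sweepsToConverge(double[][] Q, double threshold) {
        double maxDelta;
        int sweeps = 0;
        do {
            maxDelta = 0.0;
            for (int s = 0; s < Q.length; s++) {
                for (int a = 0; a < Q[s].length; a++) {
                    double old = Q[s][a];
                    // toy update pulling every value toward 1.0
                    Q[s][a] = old + 0.5 * (1.0 - old);
                    maxDelta = Math.max(maxDelta, Math.abs(Q[s][a] - old));
                }
            }
            sweeps++;
        } while (maxDelta > threshold);
        return sweeps;
    }

    public static void main(String[] args) {
        int sweeps = sweepsToConverge(new double[2][2], 1e-6);
        System.out.println("converged after " + sweeps + " sweeps"); // 20 sweeps
    }
}
```

In the real trainer the same maxDelta bookkeeping would wrap the `setQ` call inside each episode.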
The next post about Q-learning can be read here:
https://kunuk.wordpress.com/2012/01/14/q-learning-framework-example-with-c/
Heh, nice!
Angielski Online
September 25, 2010 at 2:09 am
I have a question about the number of states and RL.
If we cannot create a countable state space for the robot, what should we do?
Imagine that the robot is on a floor with several corridors.
mohammad
July 23, 2011 at 12:10 am
That’s a good question. I think you should ask this in http://stackoverflow.com/
I am sure lots of smart people can answer this better than me.
kunuk Nykjaer
January 15, 2012 at 6:39 pm
Can you please explain the results? I'm a bit confused.
cjb
October 20, 2012 at 4:53 pm
Thank you for this, it has helped me a lot in understanding Q-learning. There is nothing like a real example to clarify a concept 🙂
lalala
January 31, 2013 at 12:58 am
Hi!
Congratulations on your blog! =)
But could you explain the training phase?
I want to implement the Q-learning algorithm on a mobile robot, but, for example,
I don't know what to do when the robot chooses an invalid action, such as exiting the workspace. Imagine the robot in 'position 1': it can only go forward, but it decides to go backward and so it exits the workspace. In this case, must I program an action that takes it back to 'position 1', or what?
Mah
February 16, 2013 at 2:01 am
You could give it a negative reward and do a new random start. Just like the goal state but with negative points.
kunuk Nykjaer
February 20, 2013 at 12:39 am
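The suggestion above (a negative reward for leaving the workspace, then a restart) could look roughly like this; the class name, the -100 penalty value, and the `INVALID` marker are assumptions for illustration:

```java
import java.util.Random;

public class InvalidActionPenalty {
    // Treat leaving the workspace like a terminal state with a negative
    // reward, then restart the episode from a random state.
    static final int INVALID = -1;        // marker for "outside the workspace"
    static final double PENALTY = -100.0; // assumed penalty value

    static double reward(int nextState) {
        return nextState == INVALID ? PENALTY : 0.0;
    }

    public static void main(String[] args) {
        Random rand = new Random();
        int state = rand.nextInt(6);
        int nextState = INVALID; // suppose the robot drove out of bounds
        double r = reward(nextState);
        System.out.println(r); // prints -100.0
        if (nextState == INVALID) {
            state = rand.nextInt(6); // end the episode with a new random start
        }
        System.out.println("restarted in state " + state);
    }
}
```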
This is a very useful blog; it helped me enormously in understanding Q-learning and how to actually implement it. Very good.
raj
January 28, 2014 at 10:40 am
Hey, Kunuk!
May I please use your provided code as a blueprint for my computer science project?
Thanks. 🙂
Valters
April 9, 2014 at 12:22 am
Sure, you're welcome.
kunuk Nykjaer
April 9, 2014 at 9:58 am
Kunuk, hope you are doing well. Can you please walk me through some sample code on how to learn the weights of an evaluation function in chess using temporal difference?
mishaelharry2013
August 24, 2014 at 8:47 am
This is very clearly explained, but I didn't see where convergence was implemented in the code?
Mike
August 30, 2014 at 9:04 pm
Thank you, brother, for this work. I want to apply RL to the game 2048, but the number of states is huge and we can't program that with Q-learning. So can anyone here suggest another algorithm better adapted to this problem?
Seifou
December 3, 2014 at 4:50 pm
This is awesome and very nicely explained! Thanks a lot.
ardi
June 10, 2016 at 2:09 pm
I think there shouldn’t be any notion of “goal”. There is a set of states and a set of actions possible to invoke (and get reward/penalty) while being on a certain state. Looking at the author’s schema, one might think that the ultimate goal of algorithm is to find the cheapest path to C, while in reality the goal is to find the action with highest long-term reward for every state.
mangusta1
July 4, 2016 at 7:07 am