# Software Programming

Kunuk Nykjaer

## Q-learning example with Java

Q-learning is a reinforcement-learning technique that lets an AI learn by itself from rewards and punishments.

This example shows Q-learning used for path finding: a robot learns which way it should go from any state.

The robot starts at a random place and keeps a memory of the score values while it explores the area. Whenever it reaches the goal, we repeat with a new random start. After enough repetitions the score values become stationary (convergence).

In this example the action outcome is deterministic (the transition probability is 1) and the action selection is random. The score values are calculated with the Q-learning update rule Q(s,a).
The image shows the states (A, B, C, D, E, F), the possible actions from each state and the rewards given.
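To see the update rule in action, here is a minimal worked sketch of a single Q-update (not part of the original program; the class name is mine): the very first time the agent takes B → C, it earns reward 100 while all Q-values are still zero, so Q(B,C) moves a tenth of the way toward the target.

```java
public class QUpdateExample {
    public static void main(String[] args) {
        final double alpha = 0.1; // learning rate, as in the post
        final double gamma = 0.9; // discount factor, as in the post

        // First time the agent takes B -> C: reward 100, Q-values still zero.
        double q = 0.0;        // current Q(B, C)
        double r = 100.0;      // reward for B -> C
        double maxQNext = 0.0; // max over Q(C, *), all zero initially

        double updated = q + alpha * (r + gamma * maxQNext - q);
        System.out.println("Q(B,C) after one update: " + updated); // 10.0
    }
}
```

Repeating such updates over many episodes is what drives Q(B,C) toward its converged value of 100 shown in the results below.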

The learned values Q*(s,a) and the derived policy Π*(s) are shown in the program output below.

Qlearning.java:

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;
import java.util.Random;

/**
 * @author Kunuk Nykjaer
 */
public class Qlearning {
    // Locale pinned so the output always uses '.' as the decimal separator
    final DecimalFormat df =
            new DecimalFormat("#.##", DecimalFormatSymbols.getInstance(Locale.ENGLISH));

    // path finding
    final double alpha = 0.1; // learning rate
    final double gamma = 0.9; // discount factor

    // states A,B,C,D,E,F
    // e.g. from A we can go to B or D
    // from C we can only go to C
    // C is the goal state, reward 100 when B->C or F->C
    //
    //  _______
    // |A|B|C|
    // |_____|
    // |D|E|F|
    // |_____|
    //

    final int stateA = 0;
    final int stateB = 1;
    final int stateC = 2;
    final int stateD = 3;
    final int stateE = 4;
    final int stateF = 5;

    final int statesCount = 6;
    final int[] states = new int[] { stateA, stateB, stateC, stateD, stateE, stateF };

    // http://en.wikipedia.org/wiki/Q-learning
    // http://people.revoledu.com/kardi/tutorial/ReinforcementLearning/Q-Learning.htm

    // Q(s,a) = Q(s,a) + alpha * (R(s,a) + gamma * Max(next state, all actions) - Q(s,a))

    int[][] R = new int[statesCount][statesCount];       // reward lookup
    double[][] Q = new double[statesCount][statesCount]; // Q values

    int[] actionsFromA = new int[] { stateB, stateD };
    int[] actionsFromB = new int[] { stateA, stateC, stateE };
    int[] actionsFromC = new int[] { stateC };
    int[] actionsFromD = new int[] { stateA, stateE };
    int[] actionsFromE = new int[] { stateB, stateD, stateF };
    int[] actionsFromF = new int[] { stateC, stateE };
    int[][] actions = new int[][] { actionsFromA, actionsFromB, actionsFromC,
            actionsFromD, actionsFromE, actionsFromF };

    String[] stateNames = new String[] { "A", "B", "C", "D", "E", "F" };

    public Qlearning() {
        init();
    }

    public void init() {
        R[stateB][stateC] = 100; // from B to C
        R[stateF][stateC] = 100; // from F to C
    }

    public static void main(String[] args) {
        long BEGIN = System.currentTimeMillis();

        Qlearning obj = new Qlearning();

        obj.run();
        obj.printResult();
        obj.showPolicy();

        long END = System.currentTimeMillis();
        System.out.println("Time: " + (END - BEGIN) / 1000.0 + " sec.");
    }

    void run() {
        /*
         1. Set the parameters alpha and gamma, and the environment reward matrix R
         2. Initialize matrix Q as the zero matrix
         3. For each episode:
            - Select a random initial state
            - While the goal state has not been reached:
              - Select one among all possible actions for the current state
              - Using this possible action, consider going to the next state
              - Get the maximum Q value of this next state based on all possible actions
              - Compute the new Q value and set the next state as the current state
         */

        // For each episode
        Random rand = new Random();
        for (int i = 0; i < 1000; i++) { // train episodes
            // Select a random initial state
            int state = rand.nextInt(statesCount);
            while (state != stateC) // goal state
            {
                // Select one among all possible actions for the current state
                int[] actionsFromState = actions[state];

                // Selection strategy is random in this example
                int index = rand.nextInt(actionsFromState.length);
                int action = actionsFromState[index];

                // Action outcome is set to deterministic in this example
                // Transition probability is 1
                int nextState = action; // action == next state in this data structure

                // Using this possible action, consider going to the next state
                double q = Q(state, action);
                double maxQ = maxQ(nextState);
                int r = R(state, action);

                double value = q + alpha * (r + gamma * maxQ - q);
                setQ(state, action, value);

                // Set the next state as the current state
                state = nextState;
            }
        }
    }

    double maxQ(int s) {
        int[] actionsFromState = actions[s];
        // NEGATIVE_INFINITY, not Double.MIN_VALUE (the smallest positive double)
        double maxValue = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < actionsFromState.length; i++) {
            int nextState = actionsFromState[i];
            double value = Q[s][nextState];

            if (value > maxValue)
                maxValue = value;
        }
        return maxValue;
    }

    // get policy from state
    int policy(int state) {
        int[] actionsFromState = actions[state];
        double maxValue = Double.NEGATIVE_INFINITY;
        int policyGotoState = state; // default: stay put
        for (int i = 0; i < actionsFromState.length; i++) {
            int nextState = actionsFromState[i];
            double value = Q[state][nextState];

            if (value > maxValue) {
                maxValue = value;
                policyGotoState = nextState;
            }
        }
        return policyGotoState;
    }

    double Q(int s, int a) {
        return Q[s][a];
    }

    void setQ(int s, int a, double value) {
        Q[s][a] = value;
    }

    int R(int s, int a) {
        return R[s][a];
    }

    void printResult() {
        System.out.println("Print result");
        for (int i = 0; i < Q.length; i++) {
            System.out.print("out from " + stateNames[i] + ":  ");
            for (int j = 0; j < Q[i].length; j++) {
                System.out.print(df.format(Q[i][j]) + " ");
            }
            System.out.println();
        }
    }

    // policy is maxQ(states)
    void showPolicy() {
        System.out.println("\nshowPolicy");
        for (int i = 0; i < states.length; i++) {
            int from = states[i];
            int to = policy(from);
            System.out.println("from " + stateNames[from] + " goto " + stateNames[to]);
        }
    }
}
```

Print result
out from A:  0 90 0 72.9 0 0
out from B:  81 0 100 0 81 0
out from C:  0 0 0 0 0 0
out from D:  81 0 0 0 81 0
out from E:  0 90 0 72.9 0 90
out from F:  0 0 100 0 81 0

showPolicy
from A goto B
from B goto C
from C goto C
from D goto A
from E goto B
from F goto C
Time: 0.025 sec.
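The program above simply trains for a fixed 1000 episodes; convergence itself is never tested in the code. A minimal sketch of an explicit stopping criterion, under my own assumptions (class name, threshold, seed and the per-episode max-delta test are not from the original post): keep training until the largest change to any Q value within an episode falls below a small threshold.

```java
import java.util.Random;

// Sketch: same 6-state grid (A..F = 0..5, goal C = 2), but training stops when
// the biggest Q-value change within an episode drops below THRESHOLD.
public class QConvergence {
    static final double ALPHA = 0.1, GAMMA = 0.9, THRESHOLD = 1e-4;
    static final int GOAL = 2; // state C
    static final int[][] ACTIONS = { {1, 3}, {0, 2, 4}, {2}, {0, 4}, {1, 3, 5}, {2, 4} };

    static double[][] train(Random rand) {
        double[][] Q = new double[6][6];
        int[][] R = new int[6][6];
        R[1][GOAL] = 100; // B -> C
        R[5][GOAL] = 100; // F -> C

        double maxDelta;
        do {
            maxDelta = 0.0;
            int state;
            do { state = rand.nextInt(6); } while (state == GOAL); // non-goal start
            while (state != GOAL) {
                int next = ACTIONS[state][rand.nextInt(ACTIONS[state].length)];
                double maxQ = 0.0; // Q values are never negative here
                for (int a : ACTIONS[next]) maxQ = Math.max(maxQ, Q[next][a]);
                double updated = Q[state][next]
                        + ALPHA * (R[state][next] + GAMMA * maxQ - Q[state][next]);
                maxDelta = Math.max(maxDelta, Math.abs(updated - Q[state][next]));
                Q[state][next] = updated;
                state = next;
            }
        } while (maxDelta > THRESHOLD); // stationary enough: stop
        return Q;
    }

    public static void main(String[] args) {
        double[][] Q = train(new Random(42));
        System.out.println("Q(B,C) = " + Q[1][2] + ", Q(F,C) = " + Q[5][2]);
    }
}
```

A single episode's deltas are a noisy signal, so a production version might require several consecutive quiet episodes before declaring convergence; for this tiny grid the simple test suffices, because every episode ends with a transition into C whose update stays large until that Q value is close to 100.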

# Convergence

See also the C# framework version of this example:

https://kunuk.wordpress.com/2012/01/14/q-learning-framework-example-with-c/

Written by kunuk Nykjaer

September 24, 2010 at 12:42 pm

### 21 Responses

1. Heh, nice!

Angielski Online

September 25, 2010 at 2:09 am

2. I have a question about the number of states and RL.
If we cannot create a countable state space for the robot, what should we do?
Imagine that the robot is on a floor with several corridors.

July 23, 2011 at 12:10 am

5. Can you please explain the results? I'm a bit confused.

cjb

October 20, 2012 at 4:53 pm

• Thank you for this, it has helped me a lot in understanding Q-learning. There is nothing like a real example to clarify a concept 🙂

lalala

January 31, 2013 at 12:58 am

6. Hi!

Could you explain the training phase?
I want to implement the Q-learning algorithm on a mobile robot, but, for example,
I don't know what to do when the robot chooses an invalid action, such as exiting the workspace. Imagine the robot at 'position 1': it can only go forward, but it decides to go backward and so it exits the workspace. In this case, must I program an action that takes it back to 'position 1', or what?

Mah

February 16, 2013 at 2:01 am

• You could give it a negative reward and do a new random start, just like the goal state but with negative points.

kunuk Nykjaer

February 20, 2013 at 12:39 am
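As a sketch of this suggestion (the penalty value and the "exit workspace" transition are illustrative assumptions, not part of the post's example): give the forbidden transition a negative entry in the reward table, and a single Q-update already pushes its value below zero, so a greedy policy learns to avoid it.

```java
public class PenaltyExample {
    public static void main(String[] args) {
        final double alpha = 0.1, gamma = 0.9;

        // Hypothetical reward table: goal reward as in the post,
        // plus a penalty for an action that leaves the workspace.
        int[][] R = new int[6][6];
        R[1][2] = 100;  // B -> C: goal, reward 100
        R[0][0] = -100; // illustrative: leaving the workspace from A is punished

        // One update after the robot tries the invalid action:
        double q = 0.0;        // current Q value for the invalid action
        double maxQNext = 0.0; // the next state's Q values are still zero
        q = q + alpha * (R[0][0] + gamma * maxQNext - q);
        System.out.println("Q after the penalty: " + q); // -10.0
    }
}
```

After the penalized update the episode would restart from a new random state, exactly as when the goal is reached.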

7. This is a very useful blog; it helped me enormously in understanding Q-learning and how to actually implement it. Very good…

raj

January 28, 2014 at 10:40 am

8. Hey, Kunuk!
May I please use your provided code as a blueprint for my project in C. Science?
Thanks. 🙂

Valters

April 9, 2014 at 12:22 am

kunuk Nykjaer

April 9, 2014 at 9:58 am

• Kunuk, hope you are doing well. Could you walk me through some sample code for learning the weights of a chess evaluation function using temporal difference?

mishaelharry2013

August 24, 2014 at 8:47 am

9. This is very clearly explained but I didn’t see where convergence was implemented in the code?

Mike

August 30, 2014 at 9:04 pm

10. Thank you, brother, for this work. I want to apply RL to the game 2048, but the number of states is huge, so we can't program it with tabular Q-learning. Can anyone here suggest another algorithm better adapted to this problem?

Seifou

December 3, 2014 at 4:50 pm


14. this is awesome and very nicely explained! thanks a lot

ardi

June 10, 2016 at 2:09 pm

15. I think there shouldn't be any notion of a "goal". There is a set of states and a set of actions that can be invoked (earning a reward or penalty) from a certain state. Looking at the author's schema, one might think that the ultimate goal of the algorithm is to find the cheapest path to C, while in reality the goal is to find the action with the highest long-term reward for every state.

mangusta1

July 4, 2016 at 7:07 am
