Software Programming

Kunuk Nykjaer

Q-learning Library example with C#

with 10 comments


This is a follow-up to my previous post, Q-learning example with Java:
https://kunuk.wordpress.com/2010/09/24/q-learning/

[Image: q-learn-bender]
Imagine you had a robot, just like Bender here, and you wanted it to learn to navigate to the destination (the beer). The possible moves are up, right, down or left from any room, and you want some kind of automatic system that gives you the solution. That is what this library can do for you. You configure the code with the possible states, the possible actions and the action outcomes. For example, when Bender moves right from the bedroom the action result is that he is now in the hallway, while if he moves left from the bedroom he hits the wall and stays in the bedroom. You also need to set a reward value for when Bender reaches the beer, e.g. 100 points. That’s it!
No logic programming on your part, just some configuration.
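
To make the configuration idea concrete, here is a condensed sketch of that setup. It only uses two of the rooms and the class name BenderSketch is made up for this sketch; the full four-room version is PathFindingBenderDemo.cs at the bottom of the post.

using System;
using QLearningFramework;

namespace ConsoleQLearning
{
    // Condensed illustration of the configuration pattern (two rooms only);
    // see PathFindingBenderDemo.cs below for the full example.
    class BenderSketch
    {
        static void Main(string[] args)
        {
            QLearning q = new QLearning();
            q.EndStates.Add("Kitchen");   // the beer is in the kitchen

            QState stairwell = new QState("Stairwell", q);
            q.AddState(stairwell);

            // moving left from the stairwell reaches the kitchen, reward 100
            QAction left = new QAction("Stairwell", new QActionName("left"));
            stairwell.AddAction(left);
            left.AddActionResult(new QActionResult(left, "Kitchen", 1.0) { Reward = 100 });

            // moving right from the stairwell hits the wall, Bender stays put
            QAction right = new QAction("Stairwell", new QActionName("right"));
            stairwell.AddAction(right);
            right.AddActionResult(new QActionResult(right, "Stairwell", 1.0));

            q.RunTraining();
            q.ShowPolicy();
        }
    }
}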

DISCLAIMER:
There might be a bug in the implementation. See Jens's comment in the comments section below for more info.

In my previous post I made a quick example in Java; this time I will use C#.
This version supports non-deterministic action outcomes, and I include examples of how to use the library. The code examples are at the bottom of this post; you need QLearning.cs to run them.

  • Path finding Bender
  • Path finding demo (same example as in my previous post)
  • Rock paper scissors demo
  • Path finding advanced demo

Path finding Bender example
Path finding Bender result:

** Q-Learning structure **
State Bedroom
  Action up
     ActionResult State Bedroom, Prob. 1, Reward 0, PrevState Bedroom, QE 72.9
  Action right
     ActionResult State Hallway, Prob. 1, Reward 0, PrevState Bedroom, QE 81
  Action down
     ActionResult State Bedroom, Prob. 1, Reward 0, PrevState Bedroom, QE 72.9
  Action left
     ActionResult State Bedroom, Prob. 1, Reward 0, PrevState Bedroom, QE 72.9
State Hallway
  Action up
     ActionResult State Hallway, Prob. 1, Reward 0, PrevState Hallway, QE 81
  Action right
     ActionResult State Hallway, Prob. 1, Reward 0, PrevState Hallway, QE 81
  Action down
     ActionResult State Stairwell, Prob. 1, Reward 0, PrevState Hallway, QE 90
  Action left
     ActionResult State Bedroom, Prob. 1, Reward 0, PrevState Hallway, QE 72.9
State Stairwell
  Action up
     ActionResult State Hallway, Prob. 1, Reward 0, PrevState Stairwell, QE 81
  Action right
     ActionResult State Stairwell, Prob. 1, Reward 0, PrevState Stairwell, QE 90
  Action down
     ActionResult State Stairwell, Prob. 1, Reward 0, PrevState Stairwell, QE 90
  Action left
     ActionResult State Kitchen, Prob. 1, Reward 100, PrevState Stairwell, QE 100

** Show Policy **
From state Bedroom do action right, max QEstimated is 81
From state Hallway do action down, max QEstimated is 90
From state Stairwell do action left, max QEstimated is 100

Bender has learned the fastest way to get some beer. He knows which direction to take from any room.

Path finding example
[Image: q-learn1]

Path finding result:

** Q-Learning structure **
State A
  Action from_A_to_B
     ActionResult State B, Prob. 1, Reward 0, PrevState A, QE 90
  Action from_A_to_D
     ActionResult State D, Prob. 1, Reward 0, PrevState A, QE 72.9
State B
  Action from_B_to_A
     ActionResult State A, Prob. 1, Reward 0, PrevState B, QE 81
  Action from_B_to_C
     ActionResult State C, Prob. 1, Reward 100, PrevState B, QE 100
  Action from_B_to_E
     ActionResult State E, Prob. 1, Reward 0, PrevState B, QE 81
State C
  Action from_C_to_C
     ActionResult State C, Prob. 1, Reward 0, PrevState C, QE 0
State D
  Action from_D_to_A
     ActionResult State A, Prob. 1, Reward 0, PrevState D, QE 81
  Action from_D_to_E
     ActionResult State E, Prob. 1, Reward 0, PrevState D, QE 81
State E
  Action from_E_to_B
     ActionResult State B, Prob. 1, Reward 0, PrevState E, QE 90
  Action from_E_to_D
     ActionResult State D, Prob. 1, Reward 0, PrevState E, QE 72.9
  Action from_E_to_F
     ActionResult State F, Prob. 1, Reward 0, PrevState E, QE 90
State F
  Action from_F_to_C
     ActionResult State C, Prob. 1, Reward 100, PrevState F, QE 100
  Action from_F_to_E
     ActionResult State E, Prob. 1, Reward 0, PrevState F, QE 81

** Show Policy **
From state A do action from_A_to_B, max QEstimated is 90
From state B do action from_B_to_C, max QEstimated is 100
From state C do action from_C_to_C, max QEstimated is 0
From state D do action from_D_to_A, max QEstimated is 81
From state E do action from_E_to_B, max QEstimated is 90
From state F do action from_F_to_C, max QEstimated is 100

The results are exactly the same as in my previous post.

Rock paper scissors example

[Image: rock-paper-scissors]

Rock paper scissors result:

** Q-Learning structure **
State Begin
  Action from_Begin_to_Rock
     ActionResult State Rock, Prob. 0.2, Reward 0, PrevState Begin, QE 0
     ActionResult State Paper, Prob. 0.5, Reward -10, PrevState Begin, QE -0.91
     ActionResult State Scissor, Prob. 0.3, Reward 100, PrevState Begin, QE 4.11
  Action from_Begin_to_Paper
     ActionResult State Rock, Prob. 0.2, Reward 100, PrevState Begin, QE 2.44
     ActionResult State Paper, Prob. 0.5, Reward 0, PrevState Begin, QE 0
     ActionResult State Scissor, Prob. 0.3, Reward -10, PrevState Begin, QE -0.41
  Action from_Begin_to_Scissor
     ActionResult State Rock, Prob. 0.2, Reward -10, PrevState Begin, QE -0.24
     ActionResult State Paper, Prob. 0.5, Reward 100, PrevState Begin, QE 9.09
     ActionResult State Scissor, Prob. 0.3, Reward 0, PrevState Begin, QE 0

** Show Policy **
From state Begin do action from_Begin_to_Scissor, max QEstimated is 9.09

** Opponent style **
style is rock 0.2 paper 0.5 scissor 0.3

Here the opponent's style is to pick paper 50% of the time. The robot correctly learns to pick scissors as the action policy that maximizes the expected reward: scissors gives 0.2 * (-10) + 0.5 * 100 + 0.3 * 0 = 48 on average, compared to 25 for rock and 17 for paper.

Path finding advanced example
[Image: q-learn-adv]

In the advanced version of path finding, there is a hill between state B and state C.
Whenever the robot tries to climb the hill from state B to C, there is a 10% chance of success and a 90% chance that it slides back to state A.
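
In the library this is expressed by giving the single B-to-C action two QActionResult entries whose probabilities sum to 1. The relevant excerpt from PathFindingAdvDemo.cs further down looks like this:

// B -> C: non-deterministic hill climb, one action with two possible outcomes
stateNameNext = StateNameEnum.C.EnumToString();
state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext)));
// 10% chance of reaching C (reward 100), 90% chance of sliding back to A
fromTo.AddActionResult(new QActionResult(fromTo, StateNameEnum.C.EnumToString(), 0.1) { Reward = 100 });
fromTo.AddActionResult(new QActionResult(fromTo, StateNameEnum.A.EnumToString(), 0.9));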

Path finding advanced example result:

** Q-Learning structure **
State A
  Action from_A_to_B
     ActionResult State B, Prob. 1, Reward 0, PrevState A, QE 72.9
  Action from_A_to_D
     ActionResult State D, Prob. 1, Reward 0, PrevState A, QE 72.9
State B
  Action from_B_to_A
     ActionResult State A, Prob. 1, Reward 0, PrevState B, QE 65.61
  Action from_B_to_C
     ActionResult State C, Prob. 0.1, Reward 100, PrevState B, QE 1.1
     ActionResult State A, Prob. 0.9, Reward 0, PrevState B, QE 31.08
  Action from_B_to_E
     ActionResult State E, Prob. 1, Reward 0, PrevState B, QE 81
State C
  Action from_C_to_C
     ActionResult State C, Prob. 1, Reward 0, PrevState C, QE 0
State D
  Action from_D_to_A
     ActionResult State A, Prob. 1, Reward 0, PrevState D, QE 65.61
  Action from_D_to_E
     ActionResult State E, Prob. 1, Reward 0, PrevState D, QE 81
State E
  Action from_E_to_B
     ActionResult State B, Prob. 1, Reward 0, PrevState E, QE 72.9
  Action from_E_to_D
     ActionResult State D, Prob. 1, Reward 0, PrevState E, QE 72.9
  Action from_E_to_F
     ActionResult State F, Prob. 1, Reward 0, PrevState E, QE 90
State F
  Action from_F_to_C
     ActionResult State C, Prob. 1, Reward 100, PrevState F, QE 100
  Action from_F_to_E
     ActionResult State E, Prob. 1, Reward 0, PrevState F, QE 81

** Show Policy **
From state A do action from_A_to_B, max QEstimated is 72.9
From state B do action from_B_to_E, max QEstimated is 81
From state C do action from_C_to_C, max QEstimated is 0
From state D do action from_D_to_E, max QEstimated is 81
From state E do action from_E_to_F, max QEstimated is 90
From state F do action from_F_to_C, max QEstimated is 100

The robot has learned to take the action ‘go to state E’ instead of ‘go to state C’ when it is in state B.
A 10% chance of success is not good enough, so the robot wisely chooses a longer but more rewarding path.

The data structure is as follows:
the QLearning object contains multiple states, every state has multiple possible actions, and every action has multiple possible action outcomes.
[Image: q-learn]

PathFindingBenderDemo.cs

using System;
using QLearningFramework;

namespace ConsoleQLearning
{
    class PathFindingBenderDemo
    {
        // ----------- Insert the state names here -----------
        internal enum StateNameEnum
        {
            Bedroom, Hallway, Stairwell, Kitchen
        }
        // ----------- End Insert the state names here -------

        static void Main(string[] args)
        {
            DateTime starttime = DateTime.Now;

            PathFinding();            

            double timespend = DateTime.Now.Subtract(starttime).TotalSeconds;
            Console.WriteLine("\n{0} sec. press a key ...", timespend.Pretty()); Console.ReadKey();
        }         

        static void PathFinding()
        {
            QLearning q = new QLearning();
            QAction fromTo;
            QState state;
            string stateName;
            string stateNameNext;

            // ----------- Begin Insert the path setup here -----------
            // insert the end states here, e.g. goal state
            q.EndStates.Add(StateNameEnum.Kitchen.EnumToString()); 

            // State Bedroom           
            stateName = StateNameEnum.Bedroom.EnumToString();
            q.AddState(state = new QState(stateName, q));
            // action up
            stateNameNext = StateNameEnum.Bedroom.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName("up")));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));
            // action right
            stateNameNext = StateNameEnum.Hallway.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName("right")));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));
            // action down
            stateNameNext = StateNameEnum.Bedroom.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName("down")));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));
            // action left
            stateNameNext = StateNameEnum.Bedroom.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName("left")));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));

            // State Hallway           
            stateName = StateNameEnum.Hallway.EnumToString();
            q.AddState(state = new QState(stateName, q));
            // action up
            stateNameNext = StateNameEnum.Hallway.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName("up")));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));
            // action right
            stateNameNext = StateNameEnum.Hallway.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName("right")));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));
            // action down
            stateNameNext = StateNameEnum.Stairwell.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName("down")));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));
            // action left
            stateNameNext = StateNameEnum.Bedroom.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName("left")));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));

            // State Stairwell           
            stateName = StateNameEnum.Stairwell.EnumToString();
            q.AddState(state = new QState(stateName, q));
            // action up
            stateNameNext = StateNameEnum.Hallway.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName("up")));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));
            // action right
            stateNameNext = StateNameEnum.Stairwell.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName("right")));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));
            // action down
            stateNameNext = StateNameEnum.Stairwell.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName("down")));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));
            // action left
            stateNameNext = StateNameEnum.Kitchen.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName("left")));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0){Reward = 100});            
            // ----------- End Insert the path setup here -----------

            q.RunTraining();
            q.PrintQLearningStructure();
            q.ShowPolicy();
        }
    }
}

PathFindingDemo.cs

using System;
using QLearningFramework;

namespace ConsoleQLearning
{
    class PathFindingDemo
    {
        // ----------- Insert the state names here -----------
        internal enum StateNameEnum
        {
            A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, 
            V, W, X, Y, Z
        }
        // ----------- End Insert the state names here -------

        static void Main(string[] args)
        {
            DateTime starttime = DateTime.Now;

            PathFinding();            

            double timespend = DateTime.Now.Subtract(starttime).TotalSeconds;
            Console.WriteLine("\n{0} sec. press a key ...", timespend.Pretty()); Console.ReadKey();
        }         

        static void PathFinding()
        {
            QLearning q = new QLearning { Episodes = 1000, Alpha = 0.1, Gamma = 0.9, 
                MaxExploreStepsWithinOneEpisode = 1000 };

            QAction fromTo;
            QState state;
            string stateName;
            string stateNameNext;

            // ----------- Begin Insert the path setup here -----------
            // insert the end states here, e.g. goal state
            q.EndStates.Add(StateNameEnum.C.EnumToString()); 


            // State A           
            stateName = StateNameEnum.A.EnumToString();
            q.AddState(state = new QState(stateName, q));
            // action A -> B
            stateNameNext = StateNameEnum.B.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext) ));
            // action outcome probability
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));
            // action A -> D
            stateNameNext = StateNameEnum.D.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext)));
            // action outcome probability
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));

            // State B
            stateName = StateNameEnum.B.EnumToString();
            q.States.Add(state = new QState(stateName, q));
            // B -> A
            stateNameNext = StateNameEnum.A.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext)));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));
            // B -> C
            stateNameNext = StateNameEnum.C.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext)));
            // action outcome probability
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0) { Reward = 100 });
            // B -> E
            stateNameNext = StateNameEnum.E.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext)));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));

            // State C
            stateName = StateNameEnum.C.EnumToString();
            q.States.Add(state = new QState(stateName, q));
            // C -> C
            stateNameNext = stateName;
            state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext)));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));

            // State D
            stateName = StateNameEnum.D.EnumToString();
            q.States.Add(state = new QState(stateName, q));
            // D -> A
            stateNameNext = StateNameEnum.A.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext)));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));
            // D -> E
            stateNameNext = StateNameEnum.E.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext)));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));

            // State E
            stateName = StateNameEnum.E.EnumToString();
            q.States.Add(state = new QState(stateName, q));
            // E -> B
            stateNameNext = StateNameEnum.B.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext)));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));
            // E -> D
            stateNameNext = StateNameEnum.D.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext)));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));
            // E -> F
            stateNameNext = StateNameEnum.F.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext)));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));

            // State F
            stateName = StateNameEnum.F.EnumToString();
            q.States.Add(state = new QState(stateName, q));
            // F -> C
            stateNameNext = StateNameEnum.C.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext)));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0) { Reward = 100 });
            // F -> E
            stateNameNext = StateNameEnum.E.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext)));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));

            // ----------- End Insert the path setup here -----------

            q.RunTraining();
            q.PrintQLearningStructure();
            q.ShowPolicy();
        }
    }
}

PathFindingAdvDemo.cs

using System;
using QLearningFramework;

namespace ConsoleQLearning
{
    class PathFindingAdvDemo
    {
        // ----------- Insert the state names here -----------
        internal enum StateNameEnum
        {
            A, B, C, D, E, F
        }
        // ----------- End Insert the state names here -------

        static void Main(string[] args)
        {
            DateTime starttime = DateTime.Now;

            PathFinding();            

            double timespend = DateTime.Now.Subtract(starttime).TotalSeconds;
            Console.WriteLine("\n{0} sec. press a key ...", timespend.Pretty()); Console.ReadKey();
        }         

        static void PathFinding()
        {
            QLearning q = new QLearning { Episodes = 1000, Alpha = 0.1, Gamma = 0.9, 
                MaxExploreStepsWithinOneEpisode = 1000 };

            QAction fromTo;
            QState state;
            string stateName;
            string stateNameNext;

            // ----------- Begin Insert the path setup here -----------
            // insert the end states here, e.g. goal state
            q.EndStates.Add(StateNameEnum.C.EnumToString()); 


            // State A           
            stateName = StateNameEnum.A.EnumToString();
            q.AddState(state = new QState(stateName, q));
            // action A -> B
            stateNameNext = StateNameEnum.B.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext) ));
            // action outcome probability
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));
            // action A -> D
            stateNameNext = StateNameEnum.D.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext)));
            // action outcome probability
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));

            // State B
            stateName = StateNameEnum.B.EnumToString();
            q.States.Add(state = new QState(stateName, q));
            // B -> A
            stateNameNext = StateNameEnum.A.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext)));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));
            // B -> C
            stateNameNext = StateNameEnum.C.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext)));
            // action outcome probability
            fromTo.AddActionResult(new QActionResult(fromTo, StateNameEnum.C.EnumToString(), 0.1) { Reward = 100 });
            fromTo.AddActionResult(new QActionResult(fromTo, StateNameEnum.A.EnumToString(), 0.9));
            // B -> E
            stateNameNext = StateNameEnum.E.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext)));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));

            // State C
            stateName = StateNameEnum.C.EnumToString();
            q.States.Add(state = new QState(stateName, q));
            // C -> C
            stateNameNext = stateName;
            state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext)));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));

            // State D
            stateName = StateNameEnum.D.EnumToString();
            q.States.Add(state = new QState(stateName, q));
            // D -> A
            stateNameNext = StateNameEnum.A.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext)));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));
            // D -> E
            stateNameNext = StateNameEnum.E.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext)));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));

            // State E
            stateName = StateNameEnum.E.EnumToString();
            q.States.Add(state = new QState(stateName, q));
            // E -> B
            stateNameNext = StateNameEnum.B.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext)));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));
            // E -> D
            stateNameNext = StateNameEnum.D.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext)));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));
            // E -> F
            stateNameNext = StateNameEnum.F.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext)));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));

            // State F
            stateName = StateNameEnum.F.EnumToString();
            q.States.Add(state = new QState(stateName, q));
            // F -> C
            stateNameNext = StateNameEnum.C.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext)));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0) { Reward = 100 });
            // F -> E
            stateNameNext = StateNameEnum.E.EnumToString();
            state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext)));
            fromTo.AddActionResult(new QActionResult(fromTo, stateNameNext, 1.0));

            // ----------- End Insert the path setup here -----------

            q.RunTraining();
            q.PrintQLearningStructure();
            q.ShowPolicy();
        }
    }
}

RockPaperScissorsDemo.cs

using System;
using System.Collections.Generic;
using QLearningFramework;

namespace ConsoleQLearning
{
    class RockPaperScissorsDemo
    {
        // ----------- Insert the state names here -----------
        internal enum StateNameEnum
        {
            Begin, Rock, Paper, Scissor
        }
        // ----------- End Insert the state names here -------

        static void Main(string[] args)
        {
            DateTime starttime = DateTime.Now;
            
            RockPaperScissors();

            double timespend = DateTime.Now.Subtract(starttime).TotalSeconds;
            Console.WriteLine("\n{0} sec. press a key ...", timespend.Pretty()); Console.ReadKey();
        }

         static void RockPaperScissors()
         {
             QLearning q = new QLearning { Episodes = 1000, Alpha = 0.1, Gamma = 0.9, 
                 MaxExploreStepsWithinOneEpisode = 1000 };            

             var opponentStyles = new List<double[]>();
             // rock paper scissor probability styles
             opponentStyles.Add(new []{ 0.33,0.33,0.33} );
             opponentStyles.Add(new [] { 0.5, 0.3, 0.2 });
             opponentStyles.Add(new [] { 0.2, 0.5, 0.3 });
             opponentStyles.Add(new[] { 0.1, 0.1, 0.8 });
             int index = new Random().Next(opponentStyles.Count);
             var opponent = opponentStyles[index];

             // opponent probability pick 
             double rockOpponent = opponent[0];
             double paperOpponent = opponent[1];
             double scissorOpponent = opponent[2];

             QAction fromTo;
             QState state;
             string stateName;
             string stateNameNext;

             // ----------- Begin Insert the path setup here -----------

             // insert the end states here
             q.EndStates.Add(StateNameEnum.Rock.EnumToString());
             q.EndStates.Add(StateNameEnum.Paper.EnumToString());
             q.EndStates.Add(StateNameEnum.Scissor.EnumToString());

             // State Begin
             stateName = StateNameEnum.Begin.EnumToString();
             q.AddState(state = new QState(stateName, q));             
             
             // action Rock
             stateNameNext = StateNameEnum.Rock.EnumToString();
             state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext)));             
             // action outcome probability
             fromTo.AddActionResult(new QActionResult(fromTo, StateNameEnum.Rock.EnumToString(), 
                 rockOpponent) { Reward = 0 });
             fromTo.AddActionResult(new QActionResult(fromTo, StateNameEnum.Paper.EnumToString(), 
                 paperOpponent) { Reward = -10 });
             fromTo.AddActionResult(new QActionResult(fromTo, StateNameEnum.Scissor.EnumToString(), 
                 scissorOpponent) { Reward = 100 });             
             
             // action paper
             stateNameNext = StateNameEnum.Paper.EnumToString();
             state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext)));
             // action outcome probability
             fromTo.AddActionResult(new QActionResult(fromTo, StateNameEnum.Rock.EnumToString(), 
                 rockOpponent) { Reward = 100 });
             fromTo.AddActionResult(new QActionResult(fromTo, StateNameEnum.Paper.EnumToString(), 
                 paperOpponent) { Reward = 0 });
             fromTo.AddActionResult(new QActionResult(fromTo, StateNameEnum.Scissor.EnumToString(), 
                 scissorOpponent) { Reward = -10 });

             // action scissor
             stateNameNext = StateNameEnum.Scissor.EnumToString();
             state.AddAction(fromTo = new QAction(stateName, new QActionName(stateName, stateNameNext)));
             // action outcome probability
             fromTo.AddActionResult(new QActionResult(fromTo, StateNameEnum.Rock.EnumToString(), 
                 rockOpponent) { Reward = -10 });
             fromTo.AddActionResult(new QActionResult(fromTo, StateNameEnum.Paper.EnumToString(), 
                 paperOpponent) { Reward = 100 });
             fromTo.AddActionResult(new QActionResult(fromTo, StateNameEnum.Scissor.EnumToString(), 
                 scissorOpponent) { Reward = 0 });
             // ----------- End Insert the path setup here -----------

             q.RunTraining();
             q.PrintQLearningStructure();
             q.ShowPolicy();

             Console.WriteLine("\n** Opponent style **");
             Console.WriteLine(string.Format("style is rock {0} paper {1} scissor {2}", 
                 opponent[0].Pretty(), opponent[1].Pretty(), opponent[2].Pretty()));
         }        
    }
}

QLearning.cs

using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;

namespace QLearningFramework
{  
    /// <summary>
    /// Version 0.1
    /// Author: Kunuk Nykjaer
    /// one to many relationships data structure
    /// QLearning [1 -- *] State [1 -- *] Action [1 -- *] ActionResult
    /// </summary>
    class QLearning
    {
        public List<QState> States { get; private set; }
        public Dictionary<string, QState> StateLookup { get; private set; }

        public double Alpha { get; internal set; }
        public double Gamma { get; internal set; }

        public HashSet<string> EndStates { get; private set; }
        public int MaxExploreStepsWithinOneEpisode { get; internal set; } //avoid infinite loop
        public bool ShowWarning { get; internal set; } // show runtime warnings regarding q-learning
        public int Episodes { get; internal set; }

        public QLearning()
        {
            States = new List<QState>();
            StateLookup = new Dictionary<string, QState>();
            EndStates = new HashSet<string>();

            // Default when not set
            MaxExploreStepsWithinOneEpisode = 1000;
            Episodes = 1000;
            Alpha = 0.1;
            Gamma = 0.9;
            ShowWarning = true;
        }

        public void AddState(QState state)
        {
            States.Add(state);
        }

        public void RunTraining()
        {
            QMethod.Validate(this);

            /*		 
		    For each episode: Select random initial state 
			Do while not reach goal state
				Select one among all possible actions for the current state 
				Using this possible action, consider to go to the next state 
				Get maximum Q value of this next state based on all possible actions 				
                Set the next state as the current state
		    */

            // For each episode
            var rand = new Random();
            long maxloopEventCount = 0;

            // Train episodes
            for (long i = 0; i < Episodes; i++)
            {
                long maxloop = 0;
                // Select random initial state			
                int stateIndex = rand.Next(States.Count);
                QState state = States[stateIndex];
                QAction action = null;
                do
                {
                    if (++maxloop > MaxExploreStepsWithinOneEpisode)
                    {
                        if(ShowWarning)
                        {
                            string msg = string.Format(
                                "{0} !! MAXLOOP state: {1} action: {2}, maybe your path setup is wrong or the end state is too difficult to reach?",
                                ++maxloopEventCount, state, action);
                            QMethod.Log(msg);
                        }
                        
                        break;
                    }

                    // no actions, skip this state
                    if(state.Actions.Count==0)
                        break;

                    // Selection strategy is random based on probability
                    int index = rand.Next(state.Actions.Count);
                    action = state.Actions[index];

                    // Using this possible action, consider to go to the next state
                    // Pick random Action outcome
                    QActionResult nextStateResult = action.PickActionByProbability();
                    string nextStateName = nextStateResult.StateName;

                    double q = nextStateResult.QEstimated;
                    double r = nextStateResult.Reward;
                    double maxQ = MaxQ(nextStateName);

                    // Q(s,a)= Q(s,a) + alpha * (R(s,a) + gamma * Max(next state, all actions) - Q(s,a))
                    double value = q + Alpha * (r + Gamma * maxQ - q); // q-learning                  
                    nextStateResult.QValue = value; // update

                    // is end state go to next episode
                    if (EndStates.Contains(nextStateResult.StateName))
                        break;

                    // Set the next state as the current state                    
                    state = StateLookup[nextStateResult.StateName];

                } while (true);
            }
        }


        double MaxQ(string stateName)
        {
            const double defaultValue = 0;

            if(!StateLookup.ContainsKey(stateName))            
                return defaultValue;                            

            QState state = StateLookup[stateName];
            var actionsFromState = state.Actions;
            double? maxValue = null;
            foreach (var nextState in actionsFromState)
            {
                foreach (var actionResult in nextState.ActionsResult)
                {
                    double value = actionResult.QEstimated;
                    if (value > maxValue || !maxValue.HasValue)
                        maxValue = value;
                }
            }

            // no update
            if (!maxValue.HasValue && ShowWarning) 
                QMethod.Log(string.Format("Warning: No MaxQ value for stateName {0}",
                    stateName) );

            return maxValue.HasValue ? maxValue.Value : defaultValue;
        }

        public void PrintQLearningStructure()
        {
            Console.WriteLine("** Q-Learning structure **");
            foreach (QState state in States)
            {
                Console.WriteLine("State {0}", state.StateName);
                foreach (QAction action in state.Actions)
                {
                    Console.WriteLine("  Action " + action.ActionName);
                    Console.Write(action.GetActionResults());
                }
            }
            Console.WriteLine();
        }

        public void ShowPolicy()
        {
            Console.WriteLine("** Show Policy **");
            foreach (QState state in States)
            {
                double max = Double.MinValue;
                string actionName = "nothing";
                foreach (QAction action in state.Actions)
                {
                    foreach (QActionResult actionResult in action.ActionsResult)
                    {
                        if (actionResult.QEstimated > max)
                        {
                            max = actionResult.QEstimated;
                            actionName = action.ActionName.ToString();
                        }
                    }
                }

                Console.WriteLine(string.Format("From state {0} do action {1}, max QEstimated is {2}",
                    state.StateName, actionName, max.Pretty()));
            }
        }
    }

    class QState
    {        
        public string StateName { get; private set; }
        public List<QAction> Actions { get; private set; }

        public void AddAction(QAction action)
        {
            Actions.Add(action);
        }

        public QState(string stateName, QLearning q)
        {            
            q.StateLookup.Add(stateName, this);
            StateName = stateName;
            Actions = new List<QAction>();
        }

        public override string ToString()
        {
            return string.Format("StateName {0}", StateName);
        }
    }

    class QAction
    {
        private static readonly Random Rand = new Random();
        public QActionName ActionName { get; internal set; }
        public string CurrentState { get; private set; }
        public List<QActionResult> ActionsResult { get; private set; }

        public void AddActionResult(QActionResult actionResult)
        {
            ActionsResult.Add(actionResult);
        }

        public string GetActionResults()
        {
            var sb = new StringBuilder();
            foreach (QActionResult actionResult in ActionsResult)
                sb.AppendLine("     ActionResult " + actionResult);

            return sb.ToString();
        }

        public QAction(string currentState, QActionName actionName = null)
        {
            CurrentState = currentState;
            ActionsResult = new List<QActionResult>();
            ActionName = actionName;
        }

        // The sum of action outcomes must be close to 1
        public void ValidateActionsResultProbability()
        {
            const double epsilon = 0.1;

            if (ActionsResult.Count == 0)
                throw new ApplicationException(string.Format(
                    "ValidateActionsResultProbability is invalid, no action results:\n {0}", 
                    this));

            double sum = ActionsResult.Sum(a => a.Probability);
            if (Math.Abs(1 - sum) > epsilon)
                throw new ApplicationException(string.Format(
                    "ValidateActionsResultProbability is invalid:\n {0}", this));
        }

        public QActionResult PickActionByProbability()
        {
            double d = Rand.NextDouble();
            double sum = 0;
            foreach (QActionResult actionResult in ActionsResult)
            {
                sum += actionResult.Probability;
                if (d <= sum)
                    return actionResult;
            }
            
            // we might get here if sum probability is below 1.0 e.g. 0.99 
            // and the d random value is 0.999
            if (ActionsResult.Count > 0) 
                return ActionsResult.Last();

            throw new ApplicationException(string.Format("No PickAction result: {0}", this));
        }

        public override string ToString()
        {
            double sum = ActionsResult.Sum(a => a.Probability);
            return string.Format("ActionName {0} probability sum: {1} actionResultCount {2}",
                ActionName, sum, ActionsResult.Count);
        }
    }

    class QActionResult
    {
        public string StateName { get; internal set; }
        public string PrevStateName { get; internal set; }
        public double QValue { get; internal set; } // Q value is stored here        
        public double Probability { get; internal set; }
        public double Reward { get; internal set; }

        public double QEstimated
        {
            get { return QValue * Probability; }
        }

        public QActionResult(QAction action, string stateNameNext = null, 
            double probability = 1, double reward = 0)
        {
            PrevStateName = action.CurrentState;
            StateName = stateNameNext;
            Probability = probability;
            Reward = reward;
        }

        public override string ToString()
        {
            return string.Format("State {0}, Prob. {1}, Reward {2}, PrevState {3}, QE {4}",
                StateName, Probability.Pretty(), Reward, PrevStateName, QEstimated.Pretty());
        }
    }

    public class QActionName
    {
        public string From { get; private set; }
        public string To { get; private set; }

        public QActionName(string from, string to = null)
        {
            From = from;
            To = to;
        }

        public override string ToString()
        {
            return GetActionName();
        }

        public string GetActionName()
        {
            if (To == null) 
                return From;
            return QMethod.ActionNameFromTo(From,To);
        }
    }

    static class QMethod
    {
        public static void Log(string s)
        {
            Console.WriteLine(s);
        }

        public static readonly CultureInfo CultureEnUs = new CultureInfo("en-US");

        public static string ToStringEnUs(this double d)
        {
            return d.ToString("G", CultureEnUs);
        }
        public static string Pretty(this double d)
        {
            return ToStringEnUs(Math.Round(d, 2));
        }

        public static string ActionNameFromTo(string a, string b)
        {
            return string.Format("from_{0}_to_{1}", a, b);
        }

        public static string EnumToString<T>(this T type)
        {
            return Enum.GetName(typeof(T), type);
        }

        public static void ValidateRange(double d, string origin = null)
        {
            if (d < 0 || d > 1)
            {
                string s = origin ?? string.Empty;
                throw new ApplicationException(string.Format("ValidateRange error: {0} {1}", d, s));
            }
        }

        public static void Validate(QLearning q)
        {
            foreach (var state in q.States)
            {
                foreach (var action in state.Actions)
                {
                    action.ValidateActionsResultProbability();
                }
            }
        }
    }
}

Currently you have to know all the probability values for the action outcomes in advance, because they are used when setting up the Q-learning structure. The source code can easily be adapted if the probabilities are only known at runtime: add a probability function lookup and adjust the Q-learning formula calculation area. Maybe I will look more into that at a later time.
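
As a rough illustration of one possible adaptation, the sketch below estimates unknown outcome probabilities by sampling an environment function before training and then builds the usual QActionResult entries from the observed frequencies. The ProbabilityEstimator class, the sampleNextState delegate and the rewards dictionary are names invented for this sketch; they are not part of QLearning.cs.

using System;
using System.Collections.Generic;
using QLearningFramework;

namespace ConsoleQLearning
{
    // Hypothetical helper, not part of QLearning.cs: when outcome probabilities are
    // unknown, sample the environment a number of times and use the observed
    // frequencies as the QActionResult probabilities.
    static class ProbabilityEstimator
    {
        public static void AddEstimatedResults(QAction action, Func<string> sampleNextState,
            IDictionary<string, double> rewards, int samples = 10000)
        {
            var counts = new Dictionary<string, int>();
            for (int i = 0; i < samples; i++)
            {
                string next = sampleNextState();   // ask the environment which state we ended up in
                counts[next] = counts.ContainsKey(next) ? counts[next] + 1 : 1;
            }

            foreach (var pair in counts)
            {
                double probability = (double)pair.Value / samples;
                double reward = rewards.ContainsKey(pair.Key) ? rewards[pair.Key] : 0;
                action.AddActionResult(new QActionResult(action, pair.Key, probability, reward));
            }
        }
    }
}

With something like this in place, the rest of the setup and RunTraining stay unchanged. The alternative hinted at above, a probability lookup evaluated inside the Q-learning formula calculation, would instead require changes to PickActionByProbability and the update step in RunTraining.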

Written by kunuk Nykjaer

January 14, 2012 at 7:53 pm

10 Responses

  1. […] next post about q-learning can be read here https://kunuk.wordpress.com/2012/01/14/q-learning-framework-example-with-c/ […]

  2. It’s actually a great and helpful piece of info. I am satisfied that you shared this helpful information with us. Please keep us up to date like this. Thank you for sharing.

    experienced cougar women

    December 16, 2012 at 6:32 am

  3. Thank you for sharing… Can we use this QLearning framework with Prey Hunter problems?

    master student

    March 19, 2013 at 9:15 am

    • This is mostly a demo of Q-Learning general usage for path-finding. I have not tried it on prey hunter problems.

      kunuk Nykjaer

      April 2, 2013 at 9:03 pm

  4. Hi, thanks for sharing your code. I studied it and tried to implement Hanoi and a maze with your framework, but I can't get the correct answer. Can you help me?

    ali

    April 13, 2013 at 6:41 pm

  5. Hi Kunuk,

    first I’d like to thank you for these very nice code samples. I think many students interested in Q-learning are happy to find working code with examples. But your code has a little conceptual bug which results in wrong q-values. The main problem is that you store your q-values within the transitions. This works if your transition probabilities are always 1.0, as in most of the examples above. But in the last example, where the action “right” from state “B” ends with a probability of 90% in state “A”, you have to use one q-value for both transitions.

    You can calculate the optimal q-values yourself (gamma = 0.9): with a probability of 0.1 you’ll get an immediate reward of 100.0 for state “C” and an estimate of further rewards of 0.9 * 0.0, because “C” is a final state. So you get an estimate of 0.1*(100.0 + 0.9*0.0) = 10.0 for this transition (B -> right -> C). With a probability of 0.9 you’ll get an immediate reward of 0.0 for state “A” and an estimate of further rewards of 0.9 * 72.9 (Q* from “A”). So you get an estimate of 0.9*(0 + 0.9*72.9) = 59.05 for the other transition (B -> right -> A). So your q-value will converge to 69.05.

    In your case it makes no difference, because the action “down” from “B” with a q-value of 81.0 is the best policy. But if you only had the choice between “left” and “right” in “B”, your policy would choose left, which would be wrong.

    Jens

    Jens

    October 14, 2013 at 2:30 pm

    • Hey Jens, thanks for the feedback. I haven’t looked further into it. I will update the post with a disclaimer that there might be a bug.

      kunuk Nykjaer

      March 13, 2014 at 10:07 pm

  6. Hi, thanks for this post, it is great. What kind of exploration policy are you using here: greedy, softmax?

    Jess

    March 13, 2014 at 5:04 am

    • No fancy stuff, I remember it as random. This is an implementation of the basic version of Q-learning. There are comments in the code that explain the process; look at QLearning.cs.

      kunuk Nykjaer

      March 13, 2014 at 10:24 pm

  7. This design is stellar! You most certainly know how
    to keep a reader amused. Between your wit and your
    videos, I was almost moved to start my own blog (well, almost…HaHa!) Excellent job.
    I really loved what you had to say, and more than that, how you presented
    it. Too cool!

