马尔可夫决策过程的值迭代是如何工作的？

Question

马尔可夫决策过程的值迭代是如何工作的？

algorithmlanguage-agnosticartificial-intelligencemarkov

40

马尔可夫决策过程（使用值迭代）我无法理解。资源使用的数学公式对我的能力来说太复杂了。

我想在一个填满了墙壁（不可到达）、硬币（可取得）和移动的敌人（必须尽量避免）的二维网格上应用它。目标是收集所有硬币而不触碰敌人。我想为主角创建一个使用马尔可夫决策过程的人工智能。它看起来像这样（与游戏相关的方面不是问题，我想要理解马尔可夫决策过程的一般概念）：

enter image description here

简化的马尔可夫决策过程是一个网格，用于确定我们需要向哪个方向前进（从网格上的某个位置开始），以收集硬币并避免敌人。使用马尔可夫决策过程的术语，它创建了一个状态集合（即网格），其中包含每个状态（网格上的位置）应采取的策略（即行动：上、下、左、右）。这些策略由每个状态的“效用”值确定，而这些值本身是通过评估在短期和长期内达到该状态对于收益有多大来计算的。

这个方程中的变量在我的情况下代表什么？我想知道。

s将是来自网格的一个方块列表，a表示特定的动作（上、下、左、右），但其他方面呢？奖励和效用函数应该如何实现？如果有人能展示与我的情况相似的伪代码，那就太好了。

- Jesse Emond

我可以问一下为什么要踩我吗？我想知道这个问题有什么问题。谢谢。 - Jesse Emond

4个回答

5

这不是一个完整的答案，而是一个澄清说明。

状态并不是单个单元格。状态同时包含有关所有相关单元格中每个单元格中存在的信息。这意味着一个状态元素包含哪些单元格是实心的，哪些是空的；哪些包含怪物；哪里有硬币；玩家在哪里。

也许您可以使用从每个单元格到其内容的映射作为状态。但这忽略了怪物和玩家的移动，这可能也非常重要。

细节取决于您想要如何建模您的问题（决定什么属于状态以及以哪种形式）。

然后，策略将每个状态映射到左、右、跳跃等操作。

首先，在考虑像价值迭代这样的算法如何工作之前，您必须理解由MDP表达的问题。

- ziggystar

5

我建议您使用Q-learning进行实现。

也许您可以将我写的文章作为灵感。这是一个带有Java源代码的 Q-learning演示。这个演示是一个有6个领域的地图，人工智能学习从每个状态到达奖励的最佳路径。

Q-learning is a technique for letting the AI learn by itself by giving it reward or punishment.

This example shows the Q-learning used for path finding. A robot learns where it should go from any state.

The robot starts at a random place, it keeps memory of the score while it explores the area, whenever it reaches the goal, we repeat with a new random start. After enough repetitions the score values will be stationary (convergence).

In this example the action outcome is deterministic (transition probability is 1) and the action selection is random. The score values are calculated by the Q-learning algorithm Q(s,a).
The image shows the states (A,B,C,D,E,F), possible actions from the states and the reward given.

Result Q*(s,a)

Policy Π*(s)

Qlearning.java

import java.text.DecimalFormat;
import java.util.Random;

/**
 * @author Kunuk Nykjaer
 */
public class Qlearning {
    final DecimalFormat df = new DecimalFormat("#.##");

    // path finding
    final double alpha = 0.1;
    final double gamma = 0.9;


// states A,B,C,D,E,F
// e.g. from A we can go to B or D
// from C we can only go to C
// C is goal state, reward 100 when B->C or F->C
//
// _______
// |A|B|C|
// |_____|
// |D|E|F|
// |_____|
//

    final int stateA = 0;
    final int stateB = 1;
    final int stateC = 2;
    final int stateD = 3;
    final int stateE = 4;
    final int stateF = 5;

    final int statesCount = 6;
    final int[] states = new int[]{stateA,stateB,stateC,stateD,stateE,stateF};

    // http://en.wikipedia.org/wiki/Q-learning
    // http://people.revoledu.com/kardi/tutorial/ReinforcementLearning/Q-Learning.htm

    // Q(s,a)= Q(s,a) + alpha * (R(s,a) + gamma * Max(next state, all actions) - Q(s,a))

    int[][] R = new int[statesCount][statesCount]; // reward lookup
    double[][] Q = new double[statesCount][statesCount]; // Q learning

    int[] actionsFromA = new int[] { stateB, stateD };
    int[] actionsFromB = new int[] { stateA, stateC, stateE };
    int[] actionsFromC = new int[] { stateC };
    int[] actionsFromD = new int[] { stateA, stateE };
    int[] actionsFromE = new int[] { stateB, stateD, stateF };
    int[] actionsFromF = new int[] { stateC, stateE };
    int[][] actions = new int[][] { actionsFromA, actionsFromB, actionsFromC,
            actionsFromD, actionsFromE, actionsFromF };

    String[] stateNames = new String[] { "A", "B", "C", "D", "E", "F" };

    public Qlearning() {
        init();
    }

    public void init() {       
        R[stateB][stateC] = 100; // from b to c
        R[stateF][stateC] = 100; // from f to c    
    }

    public static void main(String[] args) {
        long BEGIN = System.currentTimeMillis();

        Qlearning obj = new Qlearning();

        obj.run();
        obj.printResult();
        obj.showPolicy();

        long END = System.currentTimeMillis();
        System.out.println("Time: " + (END - BEGIN) / 1000.0 + " sec.");
    }

    void run() {
        /*
         1. Set parameter , and environment reward matrix R
         2. Initialize matrix Q as zero matrix
         3. For each episode: Select random initial state
            Do while not reach goal state o
                Select one among all possible actions for the current state o
                Using this possible action, consider to go to the next state o
                Get maximum Q value of this next state based on all possible actions o
                Compute o Set the next state as the current state
         */

        // For each episode
        Random rand = new Random();
        for (int i = 0; i < 1000; i++) { // train episodes
            // Select random initial state
            int state = rand.nextInt(statesCount);
            while (state != stateC) // goal state
            {
                // Select one among all possible actions for the current state
                int[] actionsFromState = actions[state];

                // Selection strategy is random in this example
                int index = rand.nextInt(actionsFromState.length);
                int action = actionsFromState[index];

                // Action outcome is set to deterministic in this example
                // Transition probability is 1
                int nextState = action; // data structure

                // Using this possible action, consider to go to the next state
                double q = Q(state, action);
                double maxQ = maxQ(nextState);
                int r = R(state, action);

                double value = q + alpha * (r + gamma * maxQ - q);
                setQ(state, action, value);

                // Set the next state as the current state
                state = nextState;
            }
        }
    }

    double maxQ(int s) {
        int[] actionsFromState = actions[s];
        double maxValue = Double.MIN_VALUE;
        for (int i = 0; i < actionsFromState.length; i++) {
            int nextState = actionsFromState[i];
            double value = Q[s][nextState];

            if (value > maxValue)
                maxValue = value;
        }
        return maxValue;
    }

    // get policy from state
    int policy(int state) {
        int[] actionsFromState = actions[state];
        double maxValue = Double.MIN_VALUE;
        int policyGotoState = state; // default goto self if not found
        for (int i = 0; i < actionsFromState.length; i++) {
            int nextState = actionsFromState[i];
            double value = Q[state][nextState];

            if (value > maxValue) {
                maxValue = value;
                policyGotoState = nextState;
            }
        }
        return policyGotoState;
    }

    double Q(int s, int a) {
        return Q[s][a];
    }

    void setQ(int s, int a, double value) {
        Q[s][a] = value;
    }

    int R(int s, int a) {
        return R[s][a];
    }

    void printResult() {
        System.out.println("Print result");
        for (int i = 0; i < Q.length; i++) {
            System.out.print("out from " + stateNames[i] + ":  ");
            for (int j = 0; j < Q[i].length; j++) {
                System.out.print(df.format(Q[i][j]) + " ");
            }
            System.out.println();
        }
    }

    // policy is maxQ(states)
    void showPolicy() {
        System.out.println("\nshowPolicy");
        for (int i = 0; i < states.length; i++) {
            int from = states[i];
            int to =  policy(from);
            System.out.println("from "+stateNames[from]+" goto "+stateNames[to]);
        }          
    }
}

Print result

out from A:  0 90 0 72,9 0 0
out from B:  81 0 100 0 81 0
out from C:  0 0 0 0 0 0
out from D:  81 0 0 0 81 0
out from E:  0 90 0 72,9 0 90
out from F:  0 0 100 0 81 0

showPolicy
from a goto B
from b goto C
from c goto C
from d goto A
from e goto B
from f goto C
Time: 0.025 sec.

- Kunukn

2

我知道这是一个相当老的帖子，但在寻找MDP相关问题时，我遇到了它。我想指出（对于来到这里的人们），你在陈述“s”和“a”时还有一些评论。

我认为对于a，您绝对正确，它是您的[上，下，左，右]列表。

然而，对于s，它实际上是网格中的位置，s'是您可以去的位置。这意味着您选择一个状态，然后选择一个特定的s'并通过所有可以将您带到该sprime的操作来计算出这些值。（从中选择最大值）。最后，您继续下一个s'并做同样的事情，当您耗尽所有s'值时，然后找到您刚完成搜索的最大值。

假设您选择了角落中的网格单元格，您只能移动到2个状态（假设在左下角），具体取决于您如何选择“命名”您的状态，在这种情况下，我们可以假设状态是x，y坐标，因此您当前的状态s是1,1，您的s'（或s prime）列表是x+1,y和x,y+1（在此示例中没有对角线）（涵盖所有s'的总和部分）

此外，您没有在方程式中列出它，但最大值是a或给您最大值的操作，因此首先选择给您最大值的s'，然后在其中选择操作（至少这是我对该算法的理解）。

所以如果您有

x,y+1 left = 10 
x,y+1 right = 5 

x+1,y left = 3
x+1,y right 2

您将选择x，y+1作为您的s'，但随后您需要选择一个在这种情况下最大化的动作，即向左移动到x，y+1。我不确定仅查找最大数字和查找状态然后最大数字之间是否存在微妙差别，因此也许有一天某个人可以为我澄清这一点。

如果您的移动是确定性的（意味着如果您说前进，您就会百分之百地前进），那么很容易您只有一个动作，但是如果它们是非确定性的，则您可能会有80%的把握考虑其他可能带您到达目的地的动作。这是Jose上面提到的滑轮的背景。

我不想削弱其他人所说的话，但只是提供一些额外的信息。

- onaclov2000

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Jose M Vidal · Accepted Answer

是的，数学符号可能使它看起来比实际复杂得多。实际上，这是一个非常简单的想法。我已经实现了一个价值迭代演示应用程序，您可以使用它来更好地理解。（点击此处）基本上，假设您有一个带有机器人的二维网格。机器人可以尝试向北、南、东、西移动（这些都是动作），但由于其左轮滑动，当它尝试向北移动时，只有 .9 的概率会停在它北面的方块上，而有 .1 的概率会停在它西边的方块上（其他 3 个动作类似）。这些概率由 T() 函数捕获。即，T(s,A,s') 将如下所示：

s    A      s'     T    //x=0,y=0 is at the top-left of the screen
x,y  North  x,y+1  .9   //we do move north
x,y  North  x-1,y  .1   //wheels slipped, so we move West
x,y  East   x+1,y  .9
x,y  East   x,y-1  .1
x,y  South  x,y+1  .9
x,y  South  x-1,y  .1 
x,y  West   x-1,y  .9
x,y  West   x,y+1  .1

您需要将奖励设置为所有状态的0，但对于目标状态（即您希望机器人到达的位置）为100。价值迭代的作用是从目标状态开始，将其效用设为100，将所有其他状态的效用设为0。然后，在第一次迭代中，这100个效用会向后分配1步，因此可以在1步内到达目标状态的所有状态（紧挨着它的4个正方形）都将获得一些效用。换句话说，它们将获得与从该状态到达目标状态的概率相等的效用。然后我们继续迭代，每次将效用向离目标更远的方向移动1步。

例如，假设您从R(5,5)= 100和R(.) = 0的所有其他状态开始。因此，目标是到达5,5。

在第一次迭代中，我们设置

R(5,6) = gamma * (.9 * 100) + gamma * (.1 * 100)

因为在5,6上，如果向北走，有.9的概率最终到达5,5，而如果向西走，则有.1的概率最终到达5,5。

对于(5,4), (4,5), (6,5)同样如此。

第一次价值迭代后，所有其他状态仍保持U = 0。