
Reinforcement Learning Homework 1


Problem 1

There is a problem with this design. Judging from human driving experience, position alone is not enough to decide whether to accelerate, brake, or turn; information such as whether obstacles are present, their positions and velocities, and the vehicle's own heading and speed should all be included in the state, since the decision depends on them.
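As a minimal sketch, a more adequate state might look like the following; every field here is my own assumption about what the decision needs, not something given in the problem:

#include <vector>

// Hypothetical richer state for the driving task (all fields are assumptions).
struct DrivingState {
    double x, y;        // own position
    double heading;     // own orientation, in radians
    double speed;       // own speed
    struct Obstacle {
        double dx, dy;  // position relative to the vehicle
        double vx, vy;  // obstacle velocity
    };
    std::vector<Obstacle> obstacles;  // nearby obstacles and their motion
};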

Problem 2

Consider assigning a larger reward weight to multiplication and division problems, and/or decaying the reward of problems that have already been practiced many times.
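For instance, a sketch of such reward shaping; the base weights and the decay rate here are arbitrary illustrative choices, not values from the problem:

#include <cmath>

// Hypothetical reward: multiplication/division weighted higher, and the reward
// of a problem decays geometrically with how often it has been practiced.
double reward(char op, int times_practiced) {
    double base = (op == '*' || op == '/') ? 2.0 : 1.0;  // weight mul/div higher
    return base * std::pow(0.9, times_practiced);        // decay repeated drills
}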

Problem 3

Tic-tac-toe is simple and its total number of states is small, so I dug up some old code and wrote a straightforward probabilistic DP following the problem statement. "A reward of 1 only after a winning move" can be read as: the reward is 1 only when X wins, and 0 otherwise.
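Written out, the recursion the code below implements is, with V(s) the probability that X eventually wins from state s:

V(s) = 1 if X has a line in s, 0 if O has a line or the board is full;
V(s) = max over empty cells i of V(s with x at i), on X's turn;
V(s) = average over empty cells i of V(s with o at i), on O's turn, except that O plays the blocking cell whenever X threatens an immediate win.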

#include <string>

using int64 = long long;
using uint64 = unsigned long long;
using uint32 = unsigned int;

// Board encoding: 9 cells, 2 bits each (01 = x, 10 = o, 00 = empty),
// packed into a single 32-bit integer.
struct state {
    uint32 u;

    inline constexpr state(uint32 u) : u(u) {}
    inline constexpr operator uint32() const { return u; }
    inline state(const std::string& str) : u(0) {
        for (int i = 0; i < 9; i++) {
            uint32 val = 0;
            if (str[i] == 'x') {
                val = 1;
            } else if (str[i] == 'o') {
                val = 2;
            } else {
                val = 0;
            }
            u |= val << (i * 2);
        }
    }

    int check_win();
    inline constexpr int get(int idx) const { return (u >> (idx * 2)) & 0x3; }
    inline constexpr void set(int idx, int val) {
        u &= ~(0x3u << (idx * 2));
        u |= uint32(val) << (idx * 2);
    }
};

// The eight winning lines, written as x patterns.
inline state final_state[] = {
    state("xxx      "), state("   xxx   "), state("      xxx"),
    state("x  x  x  "), state(" x  x  x "), state("  x  x  x"),
    state("x   x   x"), state("  x x x  "),
};

// Returns 1 if x has a line, -1 if o has a line, 0 otherwise.
// An o line mask is exactly the x line mask shifted left by one bit.
inline int state::check_win() {
    for (int i = 0; i < 8; i++) {
        if ((final_state[i].u & u) == final_state[i].u) return 1;
        if (((final_state[i].u << 1) & u) == (final_state[i].u << 1)) return -1;
    }
    return 0;
}
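As a quick sanity check of the encoding (the positions below are arbitrary): each cell occupies two bits, 01 for x and 10 for o, so shifting an x line mask left by one bit yields exactly the matching o line mask, which is what check_win exploits.

#include <cassert>

inline void sanity_check() {
    assert(state("xxx      ").check_win() == 1);   // x completes the top row
    assert(state("ooo      ").check_win() == -1);  // same cells, held by o
    assert(state("xo xo    ").check_win() == 0);   // no line yet
}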
#include <algorithm>
#include <iostream>
#include <map>
#include <utility>

void print_state(state st) {
    for (int i = 0; i < 9; i++) {
        if (st.get(i) == 0)
            std::cout << i << " ";
        else if (st.get(i) == 1)
            std::cout << "X ";
        else
            std::cout << "O ";
        if (i % 3 == 2)
            std::cout << std::endl;
    }
    std::cout << std::endl;
}

std::map<state, double> memo;

// Expected return (probability that X wins) from state st under the policy:
// X maximizes; O blocks an immediate X win if one exists, otherwise plays
// uniformly at random.
double dfs(state st, bool current_x) {
    if (memo.find(st) != memo.end())
        return memo[st];
    int res = st.check_win();
    if (res != 0) {
        // Reward 1 only when X has won; an O win yields 0.
        if (res == 1)
            return memo[st] = 1;
        else
            return memo[st] = 0;
    }
    bool full = true;
    for (int i = 0; i < 9; i++)
        if (st.get(i) == 0) full = false;
    if (full)  // draw: reward 0 (without this check the average below divides by zero)
        return memo[st] = 0;
    if (current_x) {
        // X's turn: choose the move with the highest winning probability
        double best = 0;
        for (int i = 0; i < 9; i++) {
            if (st.get(i) == 0) {
                state new_st = st;
                new_st.set(i, 1);
                best = std::max(best, dfs(new_st, false));
            }
        }
        return memo[st] = best;
    } else {
        // O's turn: block an immediate X win, otherwise move with equal probability
        double total = 0;
        int cnt = 0;
        for (int i = 0; i < 9; i++) {
            if (st.get(i) == 0) {
                state new_st = st;
                new_st.set(i, 1);
                if (new_st.check_win() == 1) {
                    // X would win at cell i, so O must play there
                    new_st.set(i, 2);
                    return memo[st] = dfs(new_st, true);
                }
                new_st.set(i, 2);
                total += dfs(new_st, true);
                cnt++;
            }
        }
        return memo[st] = total / cnt;
    }
}

int main() {
    state st("x   o    ");  // x in a corner, o in the center
    dfs(st, true);
    for (int i = 0; i < 9; i++) {
        if (st.get(i) == 0) {
            state new_st = st;
            new_st.set(i, 1);
            print_state(new_st);
            std::cout << "Winning probability: " << memo[new_st] << std::endl;
        }
    }
}

The program prints the expected return of each action under the given policy:

X X 2 
3 O 5
6 7 8

Winning probability: 0
X 1 X
3 O 5
6 7 8

Winning probability: 0
X 1 2
X O 5
6 7 8

Winning probability: 0
X 1 2
3 O X
6 7 8

Winning probability: 0.541667
X 1 2
3 O 5
X 7 8

Winning probability: 0.25
X 1 2
3 O 5
6 X 8

Winning probability: 0.541667
X 1 2
3 O 5
6 7 X

Winning probability: 0.666667

So the next move should be the opposite corner along the diagonal (cell 8, winning probability 0.666667).

Problem 4

The obvious answer: game AI. Video games clearly fit the sequential decision model, but designing reinforcement learning algorithms for many games can be difficult. The information in a game is very rich, so designing a suitable state representation is a daunting task; the action set is also huge, spanning many types of actions, some of which may even be continuous. The reward can be defined as the final outcome of a single game, but there is plenty of room for more fine-grained definitions.

What interests me most is the application of reinforcement learning to LLMs. Today's LLMs mostly use reinforcement learning (e.g., RLHF) to make them produce dialogue that matches human expectations. Presumably the conversation with the user is treated as a sequential decision process? In that model, both states and actions seem to involve representations from deep networks, and the reward is a human's subjective rating of how much they like the dialogue.