Model-based deep reinforcement learning with
heuristic search for satellite attitude control
Ke Xu, Fengge Wu and Junsuo Zhao
Institute of Software, Chinese Academy of Sciences, Beijing, China
Abstract
Purpose – Recently, deep reinforcement learning has been developing rapidly and has shown its power to solve difficult problems such as robotics and the game of Go. Meanwhile, satellite attitude control systems still rely on classical control techniques such as proportional-integral-derivative (PID) and sliding-mode control as their major solutions, and they face problems with adaptability and automation.
Design/methodology/approach – In this paper, an approach based on deep reinforcement learning is proposed to increase the adaptability and autonomy of the satellite control system. It is a model-based algorithm that can find solutions with fewer episodes of learning than model-free algorithms.
Findings – Simulation experiments show that, when classical control fails, this approach can find a solution and reach the target within hundreds of episodes of exploration and learning.
Originality/value – This approach is a non-gradient method that uses heuristic search to optimize the policy and avoid local optima. Compared with classical control techniques, it does not need prior knowledge of the satellite or its orbit, it can adapt to different situations by learning from data, and it can adapt to different satellites and different tasks through transfer learning.
Keywords Control, Artificial Intelligence, Deep reinforcement learning, Satellite attitude
Paper type Research paper
1. Introduction
With the rapid development of small satellites, satellite attitude control is facing problems of autonomy and adaptability. Owing to the lower cost and faster design cycle of small satellites, hundreds of satellites are launched each year, and the number keeps increasing. It is hard to design a control system for each small satellite individually, given their different space tasks, physical parameters, actuators and sensors, orbits and so on, and the limited ground measurement and control facilities cannot serve all of them at the same time. The unstable space environment (Williams et al., 2016) and satellite component errors (Hu et al., 2015; Mazmanyan and Ayoubi, 2014) also affect control precision and stability. Therefore, an intelligent control system that can adapt to different situations and work autonomously is urgently needed.
Meanwhile, deep reinforcement learning has recently been used to solve challenging problems successfully, from autonomous racing (Williams et al., 2017) to robotics (Levine et al., 2016). AlphaZero (Silver et al., 2017) is an impressive application of deep reinforcement learning that can master different board games without offline data or human knowledge. Guided policy search (GPS) (Levine et al., 2016) is a successful model-based solution for a variety of robot tasks that can be learnt repeatedly with the help of a guided policy. Deep deterministic policy gradient (DDPG) (Lillicrap et al., 2015) gives a model-free greedy solution for the general control problems described in the OpenAI Gym (Brockman et al., 2016). These results show that deep reinforcement learning is a promising basis for an intelligent satellite attitude control system.
Unfortunately, deep reinforcement learning demands more computing and storage resources than a typical on-board computer provides, which is why most satellite attitude control systems still use traditional control algorithms such as proportional-integral-derivative (PID) control (Gross et al., 2015; Akella et al., 2015), sliding-mode control (Xiao et al., 2014) or fuzzy control (Walker et al., 2015; Ghadiri et al., 2015). PID is still the most commonly used method in satellite attitude control because of its implementation simplicity and stability. The problem of PID is adaptability and robustness, and many adaptive approaches have been proposed to improve them. Sliding-mode control handles non-linear systems but, similar to PID, its main problem is adaptability. Intelligent approaches such as fuzzy control bring adaptability to the control system, but fuzzy control needs expert knowledge to build the controller. Beyond these approaches, only a few works use machine learning as a supporting method (Wu et al., 2015). To relieve the resource problem, a new kind of satellite called PhoneSat (Shimmin et al., 2014) has been proposed, which uses a smartphone as the on-board computer to upgrade the computing and storage capability of the satellite.
Many deep learning tools such as TensorFlow (Abadi et al., 2016) and Caffe (Jia et al., 2014) also support the Android OS, which makes it convenient to develop deep reinforcement learning applications on smartphones.
In this paper, a model-based deep reinforcement learning algorithm is proposed that learns from the data generated by the satellite attitude control system, optimizes the control system online and makes it more adaptive to the space environment and to hardware changes of the satellite itself. The algorithm has three main parts: a model network that predicts the next state from the current state and action, a policy network that outputs the control policy and a heuristic search component that optimizes the control policy. The model network can be pre-trained offline with ground simulation and then trained online with actual attitude data. The main error of the whole system comes from the model prediction error, so a feedback compensation is introduced to make up for it. The neural network captures non-linear dynamics in general; although two predicted states that are neighbours in time may differ from the true states in detail, they are always close in the state space, which makes the prediction error behave almost linearly, so a linear feedback compensation works well. The policy network can likewise be pre-trained offline with an existing control policy such as PID or sliding-mode control and then trained online with the optimized policy generated by the heuristic search. Sugimura et al. (2012) describe the ground test of a satellite attitude simulation system that could be used as such a pre-training environment. The heuristic search algorithm searches for a better policy around the initial actions given by the policy network output; if the optimized policy works better on the model network, the system uses it as the real output to control the attitude, and if the policy also works better in the real environment, the policy network uses it as training data. Simulation results show that this algorithm is able to find a better policy and reach the control target.
2. Model-based deep reinforcement learning with heuristic search
In the game of Go, the state transition is explicit, so there is no need for a model network, but a Q-value has to be built. In model-based deep reinforcement learning with heuristic search (MDRLH), once a model network is built, the Q-value can be computed from the model output, so there is no need for a separate Q-value network. DDPG uses four networks, two of which output Q-values and two of which output the policy, whereas MDRLH uses two networks, one that outputs the state prediction and one that outputs the policy. In addition, DDPG uses gradient descent to optimize the policy, whereas MDRLH uses heuristic search. Figure 1 shows the MDRLH workflow.
The algorithm works as follows:
• Randomly initialize the networks, or initialize them from ground simulation.
• Compute the policy output for the current state, propagate it through the model network and compute its Q-value to initialize the heuristic search.
• Use heuristic search to find a better policy, i.e. one with a lower Q-value than the initial policy.
• Execute the optimized policy, obtain the real output and its Q-value, and calculate the model prediction error.
• If the result is better than the initial policy, use it as policy training data.
• If the state goes beyond the safe zone, use the guided policy (for example, a PID policy) to return to the training zone.
• Use every state–action–state pair as training data for the model network.
• Stop if the target state is reached; otherwise return to Step 2.
2.1 Model network
Some works have used machine learning techniques to build satellite attitude dynamics models (Straub et al., 2013). In this paper, a neural network-based dynamics model is used as part of the MDRLH algorithm. The model network $s_{t+1} = f_{\theta_s}(s_t, a_t)$ takes the previous state–action pair as input and outputs the predicted state, where $\theta_s$ are the parameters of the network. They can be randomly initialized, or pre-trained with ground simulation or with data from other satellites and then applied to the model network through fine-tuning or transfer learning. Pre-training accelerates the network's convergence and improves the online performance of the whole algorithm.
The network uses leaky ReLU (Xu et al., 2015) as the activation function instead of ReLU. Some results point out the limitations of ReLU networks (Punjani and Abbeel, 2015) and rely on careful initial parameter settings to avoid the problem of units with vanishing gradients, whereas a leaky-ReLU network does not need to worry about this problem.
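As an illustration of this architecture, a minimal sketch of the model network is given below, shown in PyTorch for brevity (the paper mentions TensorFlow and Caffe only for on-board deployment). The hidden-layer widths and the leaky-ReLU slope are assumptions, since the paper does not report them.

```python
# Illustrative sketch only; layer widths and the leaky-ReLU slope are assumptions.
import torch
import torch.nn as nn

class ModelNetwork(nn.Module):
    """Dynamics model s_{t+1} = f_theta_s(s_t, a_t) built from fully connected
    layers with leaky-ReLU activations."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.LeakyReLU(0.01),                 # leaky ReLU instead of plain ReLU
            nn.Linear(hidden, hidden),
            nn.LeakyReLU(0.01),
            nn.Linear(hidden, state_dim),       # predicted next state
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))
```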
The loss function of the model network is $L(\theta_s) = \|s' - s\|^2$, the mean square error (MSE) between the network output $s'$ and the next state $s$ from the data. This loss function minimizes the error between the predicted state and the real state after an action. Although the model network captures the majority of the system's non-linearity, a minor error remains because of the lack of data and the limitations of the network itself, and this residual is the main source of error of the whole algorithm. Therefore, an estimate of this error is necessary when the model network is used to predict the state. The model network can be seen as the feedforward part of the algorithm and the error estimation as its feedback part: $e_{t+1} = (s'_t - s_t) \cdot w_p + e_t \cdot w_i + (e_t - e_{t-1}) \cdot w_d$, with the initial error $e_0$ set to 0. This is in fact a PID control law, in which $s'_t - s_t$ is the proportional part, $e_t$ the integral part and $e_t - e_{t-1}$ the differential part. When the algorithm uses the model network to predict the state, the error estimate is added:
$s_{t+1} = f_{\theta_s}(s_t, a_t) + e$, whereas updating $\theta_s$ does not use this error term.
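To make the preceding two paragraphs concrete, the sketch below shows one supervised training step for the model network (the MSE loss above) together with the PID-style error compensation. The gains $w_p$, $w_i$, $w_d$ and the sign convention of the error (here the correction pulls the prediction towards the observed state) are assumptions, since the paper does not spell them out.

```python
# Illustrative sketch; gains, optimizer settings and the error sign convention
# are assumptions rather than values given in the paper.
import numpy as np
import torch

def train_model_step(model, optimizer, s_t, a_t, s_next):
    """One supervised step on (s_t, a_t, s_{t+1}) data with L = ||s' - s||^2."""
    pred = model(s_t, a_t)
    loss = torch.mean((pred - s_next) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

class ErrorCompensator:
    """PID-style estimate e of the model prediction error:
    e_{t+1} = (s'_t - s_t) * w_p + e_t * w_i + (e_t - e_{t-1}) * w_d, with e_0 = 0."""
    def __init__(self, state_dim, w_p=0.5, w_i=0.1, w_d=0.05):
        self.w_p, self.w_i, self.w_d = w_p, w_i, w_d
        self.e = np.zeros(state_dim)
        self.e_prev = np.zeros(state_dim)

    def update(self, observed_state, predicted_state):
        """Update e once the real next state is observed."""
        prop = observed_state - predicted_state   # assumed sign: correct towards the data
        e_next = prop * self.w_p + self.e * self.w_i + (self.e - self.e_prev) * self.w_d
        self.e_prev, self.e = self.e, e_next

    def compensate(self, predicted_state):
        """Prediction used downstream: s_{t+1} = f_theta_s(s_t, a_t) + e."""
        return predicted_state + self.e
```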
A model-based algorithm improves the control system's performance in every episode, whereas a model-free algorithm wanders longer and can hardly improve until it reaches the target state. Moreover, the model network can keep learning the dynamics online as long as the satellite attitude control system is running, even while the system is using other control algorithms, which makes the algorithm more adaptable.
2.2 Policy network
The policy network $a_t = g_{\theta_a}(s_t)$ takes the current state as input and outputs the current policy, where $\theta_a$ are the parameters of the network. They can be randomly initialized or pre-trained with guided policies such as PID control. Pre-training reduces the probability of leaving the safe zone and improves the performance of the whole algorithm.
The network uses a similar structure to the model network, for the same reason mentioned before.
The loss function of the policy network is $L(\theta_a) = \|a'_t - a_t\|^2 = \|a'_t - g_{\theta_a}(s_t)\|^2$, the MSE between the network output $a_t = g_{\theta_a}(s_t)$ and the policy $a'_t$ from the data in the same state. This loss function minimizes the error between the output policy and the guided policy from the data. Unlike the error in the model network, there is no "correct policy" for the control system: the output of the policy network may differ from the guided policy in the data, but there is no way to tell whether the network output is better. Moreover, during the learning episodes the output of the policy network only serves as the initial input of the heuristic search rather than directly as the control output, so there is no need to model the error between them.
This network is able to find a global solution for the control system, unlike GPS, whose solution follows a certain routine. If necessary, the policy network can imitate other policies to adapt to new tasks quickly.
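A matching sketch of the policy network and its imitation-style update follows. As before, the layer widths are assumptions, and `a_target` stands for either the guided policy from data or the optimized action returned by the heuristic search.

```python
# Illustrative sketch; layer widths are assumptions.
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Policy a_t = g_theta_a(s_t), with the same leaky-ReLU structure as the
    model network."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.LeakyReLU(0.01),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.01),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def train_policy_step(policy, optimizer, s_t, a_target):
    """Fit the policy to guided/optimized actions: L(theta_a) = ||a' - g(s_t)||^2."""
    loss = torch.mean((policy(s_t) - a_target) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```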
2.3 Heuristic search
Several algorithms could be used for the heuristic search, such as simulated annealing (Kirkpatrick et al., 1987), particle swarm optimization (Kennedy and Eberhart, 2002) and evolution strategies (OpenAI, 2017). The search takes the output of the policy network as its input and outputs an optimized policy with a lower Q-value. The Q-value is defined as follows:

$Q(s_t, a_t) = \|s_{t+1} - s_T\|^2 \cdot w_s$ (1)
In equation (1), $s_{t+1}$ is the state resulting from the action taken at time t, and it can be either the model prediction or the real output of the environment; $s_T$ is the target state, decided by the space mission, such as target tracking or taking photos; and $w_s$ is a weight parameter, also decided by the mission. The Q-value measures the weighted squared error between the next state and the target state, which is different from a temporal-difference value. As is well known in reinforcement learning, different Q-value definitions lead to different agent behaviours when the algorithm converges. In satellite attitude control, for example, a state contains the quaternion of the satellite and its angular velocity. If the weight of the quaternion is larger, the satellite prefers to move fast and the attitude will be less stable, because a large angular velocity contributes little to the Q-value. If, on the other hand, the weight of the angular velocity is larger, the satellite prefers to stay still, which takes a longer time to reach the target. A proper $w_s$ should therefore be set to obtain acceptable performance; for instance, the quaternion weight should be set larger when the satellite learns a fast attitude manoeuvre task, and the angular-velocity weight should be set larger when the satellite needs to take a high-definition photo.
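As a concrete reading of equation (1) and the weighting discussion above, the snippet below computes the Q-value for a state laid out as [quaternion, angular velocity]; the state layout and the numerical weights are illustrative assumptions.

```python
# Illustrative sketch; the state layout and weight values are assumptions.
import numpy as np

def q_value(next_state, target_state, w_s):
    """Equation (1): weighted squared error between next state and target state
    (lower is better; reaching the target corresponds to Q -> 0)."""
    diff = np.asarray(next_state) - np.asarray(target_state)
    return float(np.sum(w_s * diff ** 2))

# assumed state layout: [q0, q1, q2, q3, wx, wy, wz]
w_fast_manoeuvre = np.array([10.0] * 4 + [1.0] * 3)   # emphasize quaternion error
w_stable_imaging = np.array([1.0] * 4 + [10.0] * 3)   # emphasize angular velocity
```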
Simulated annealing is used to optimize the policy. It is a kind of Markov chain Monte Carlo algorithm: the initial input is deterministic, a random change is applied to it, and the algorithm decides whether the changed input is better than the original one. The general method needs some modification to fit this setting:

$a_i' = a_i \cdot (1 + S_R)$ (2)

$\Delta = Q(s, a_i') - Q(s, a_i)$ (3)

$p = \begin{cases} \exp\!\left(\dfrac{-\Delta}{k \cdot T_0 \cdot Q(s_{t-1}, a_{t-1})}\right) & \text{if } \Delta > 0 \\ 1 & \text{if } \Delta \le 0 \end{cases}$ (4)
Equation (2) shows the random search perturbation of the input in each iteration, where $S_R \in (-\delta, \delta)$ is a uniform random variable with the same dimension as $a_i$ and $0 < \delta \ll 1$, which means the policy moves slowly to improve the stability of learning, similar to DDPG. $a_i$ is also constrained by the output capabilities of the satellite actuators, such as reaction wheels and magnetic torquers. The initial input from the policy network is $a_0$, and i is incremented by 1 each time a new action is accepted by the algorithm.
Equation (3) shows the difference between the Q-values of the input action $a_i$ and its random perturbation $a_i'$, where s is the current state and Q(s, a) is calculated with the model network and equation (1). If $\Delta < 0$, the new action is better than the current action; otherwise, it is worse.
Equation (4) gives the acceptance probability of $a_i'$. If $\Delta \le 0$, $a_i'$ is at least as good as the original $a_i$, so it is accepted. If $\Delta > 0$, $a_i'$ is worse than $a_i$, but it still has a chance of being accepted with probability p, which keeps the algorithm from getting stuck in local optima. $0 \ll k < 1$ is a decay rate of the acceptance function that makes it progressively harder to accept a policy with $\Delta > 0$; here k = 0.99. $T_0$ is the initial temperature of the simulated annealing; because $\Delta$ is computed from a Q-value difference, $T_0$ is multiplied by $Q(s_{t-1}, a_{t-1})$ to scale the initial temperature appropriately.
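The modified annealing step of equations (2)-(4) can be sketched as follows. The value k = 0.99 follows the text, while δ and T0 are illustrative assumptions; the sketch applies k as the fixed factor written in equation (4), since the per-iteration decay schedule is not spelled out in the paper.

```python
# Illustrative sketch; delta and t0 are assumptions, k = 0.99 follows the text.
import numpy as np

def propose(a_i, delta=0.05):
    """Equation (2): a_i' = a_i * (1 + S_R), with S_R ~ Uniform(-delta, delta)
    elementwise, so the policy only moves slowly."""
    s_r = np.random.uniform(-delta, delta, size=np.shape(a_i))
    return a_i * (1.0 + s_r)

def accept(q_new, q_old, q_prev, k=0.99, t0=1.0):
    """Equations (3)-(4): always accept if the new Q is not worse (Delta <= 0),
    otherwise accept with probability exp(-Delta / (k * T0 * Q(s_{t-1}, a_{t-1})))."""
    delta_q = q_new - q_old
    if delta_q <= 0.0:
        return True
    temperature = k * t0 * max(q_prev, 1e-8)     # guard against a zero temperature
    return np.random.random() < np.exp(-delta_q / temperature)
```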
In Algorithm 1, the outer while loop runs for at most 4,096 episodes, and the algorithm stops if it still has not found a solution by then. The inner while loop runs for at most 1,024 iterations of heuristic search, after which the current optimal action is output:

Algorithm 1. Model-based Deep Reinforcement Learning with Heuristic Search
1: Randomly initialize $\theta_a$ and $\theta_s$, or initialize them by transfer learning
2: while no solution has been found do
3:   Generate the initial policy: $a_{in} \leftarrow g_{\theta_a}(s_t)$; optimal action $a_{opt} \leftarrow a_{in}$
4:   while in heuristic search do
5:     Generate an exploration policy: $a_{out} \leftarrow a_{in} \cdot (1 + S_R)$
6:     Predict the state: $s_{t+1} \leftarrow f_{\theta_s}(s_t, a_{out})$
7:     $Q(s_t, a_{out}) \leftarrow \|s_{t+1} - s_T\|^2 \cdot w_s$
8:     if $Q(s_t, a_{out})$ is accepted by the heuristic search then
9:       $a_{in} \leftarrow a_{out}$
10:      if $Q(s_t, a_{out}) < Q(s_t, a_{opt})$ then
11:        $a_{opt} \leftarrow a_{out}$
12:      end if
13:    end if
14:  end while
15:  Execute $a_{opt}$ and collect data; update $\theta_s$ with the data
16:  if the executed $a_{opt}$ actually performs better than $a_{in}$ then
17:    Update $\theta_a$ with the data
18:  end if
19: end while
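Putting the pieces together, the sketch below is one possible reading of Algorithm 1 in Python, reusing the illustrative helpers sketched in the previous subsections. The environment interface (`env.current_state()`, `env.step()`), the hyper-parameters and the "actually better" test in step 16 are assumptions, and the error compensation is omitted for brevity.

```python
# One possible reading of Algorithm 1; env interface, hyper-parameters and the
# step-16 comparison are assumptions, and error compensation is omitted.
import numpy as np
import torch

def predict(model, s, a):
    """Model prediction f_theta_s(s, a) as a numpy vector."""
    with torch.no_grad():
        out = model(torch.as_tensor(s, dtype=torch.float32),
                    torch.as_tensor(a, dtype=torch.float32))
    return out.numpy()

def mdrlh_episode(env, model, policy, model_opt, policy_opt,
                  s_target, w_s, n_search=1024, delta=0.05, k=0.99, t0=1.0):
    """One outer-loop iteration: heuristic search around the policy output,
    execution of the best action found, and online updates of both networks."""
    s_t = env.current_state()                               # assumed env interface
    with torch.no_grad():
        a_in = policy(torch.as_tensor(s_t, dtype=torch.float32)).numpy()
    q_init = q_value(predict(model, s_t, a_in), s_target, w_s)   # Q of the initial policy
    a_opt, q_opt = a_in.copy(), q_init
    q_in, q_prev = q_init, q_init                           # q_prev stands in for Q(s_{t-1}, a_{t-1})

    for _ in range(n_search):                               # inner heuristic-search loop
        a_out = propose(a_in, delta)                        # equation (2)
        q_out = q_value(predict(model, s_t, a_out), s_target, w_s)
        if accept(q_out, q_in, q_prev, k, t0):              # equations (3)-(4)
            a_in, q_in = a_out, q_out
            if q_out < q_opt:                               # keep the best (lowest-Q) action
                a_opt, q_opt = a_out, q_out

    s_next = env.step(a_opt)                                # execute the optimized action
    to_t = lambda x: torch.as_tensor(x, dtype=torch.float32)
    train_model_step(model, model_opt, to_t(s_t), to_t(a_opt), to_t(s_next))
    # Step 16: update the policy only if the executed action really did better
    # than the initial policy was predicted to do (one possible interpretation).
    if q_value(s_next, s_target, w_s) < q_init:
        train_policy_step(policy, policy_opt, to_t(s_t), to_t(a_opt))
    return s_next
```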
3. Experimental results
The experiment uses a satellite attitude simulation system to show that, when a classical control algorithm performs poorly and cannot reach the control target, MDRLH can learn from data and find its way to the target without the prior knowledge of the satellite that classical control demands: the inertia of the satellite, the orbit of the satellite and the functions of the quaternion. It only needs to know the dimensions of the quaternion, the angular velocity and the actuators.
Figure 2 shows PID performance with random initial angular velocity and quaternion. It converges to an MSE of around $10^{-5}$ in fewer than 200 episodes. In the first 20 episodes, the angular velocity increases in order to reach the target quaternion faster, so the total MSE does not decrease.
Figure 3 shows a case in which, for some reason, classical control cannot reach the target although it does not diverge. Assume that a guided policy behaves like this while the satellite attitude determination system works fine, and apply MDRLH to see whether it can help the satellite reach the target. The guided policy is set to give random output when $\omega^2 < 0.5$ and to use PID control otherwise.
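For reference, the degraded guided policy used in this experiment could look like the following sketch; `pid_control`, the action dimension and the torque limit are illustrative placeholders rather than values from the paper.

```python
# Illustrative sketch; pid_control, action_dim and action_limit are placeholders.
import numpy as np

def guided_policy(state, pid_control, action_dim=3, action_limit=0.1):
    """Random actuator commands while omega^2 < 0.5, PID control otherwise
    (state assumed to be [q0, q1, q2, q3, wx, wy, wz])."""
    omega = np.asarray(state)[4:7]
    if float(np.dot(omega, omega)) < 0.5:            # the deliberately degraded regime
        return np.random.uniform(-action_limit, action_limit, size=action_dim)
    return pid_control(state)                        # otherwise fall back to PID
```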
Figure 4 shows MDRLH performance with the guided policy. In this experiment, MDRLH is initialized with random parameters and its control output may diverge after some episodes, so the guided policy is used to prevent MDRLH from diverging. In outer space, the control system cannot simply be reset to an initial state, so a guided policy is needed whenever the learning algorithm explores too much. Figure 4(a) shows the best case of the learning process, in which the algorithm finds a solution after only a few episodes of guidance. Figure 4(b) shows the normal case, in which the algorithm finds a solution after about 1,000 episodes and several trials. Figure 4(c) shows the worst case, in which the algorithm finds a solution only after tens of trials, sometimes coming close to convergence and then losing it again. The experiment is limited to 4,096 episodes, as mentioned in Algorithm 1, and sometimes no solution is found within this limit; however, the figures indicate that the algorithm still has a chance to converge after these episodes. The differences between the cases are probably due to the random initialization of the model network and policy network parameters, so if the networks could be pre-trained with ground simulation, the number of learning episodes might be reduced significantly.
An interesting property of the quaternion is $\|q\| = 1$. The model network, however, does not know this property. In Figure 5, the y-axis is the quaternion prediction of the model network. During the learning process, the model network learns this property from the training data, which is an important indicator of the quality of the satellite attitude dynamics model.
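A simple way to monitor this property during training is sketched below; the state layout (quaternion in the first four components) is an assumption.

```python
# Illustrative sketch; the state layout is an assumption.
import numpy as np

def quaternion_norm_error(predicted_states):
    """Mean absolute deviation of the predicted quaternion norm from 1,
    used as a rough indicator of dynamics-model quality."""
    q = np.asarray(predicted_states)[:, :4]          # first four entries: quaternion
    return float(np.mean(np.abs(np.linalg.norm(q, axis=1) - 1.0)))
```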
4. Conclusion
MDRLH uses deep learning to learn a dynamics model and an optimal policy from the data generated by the control system, and it uses heuristic search to find a better policy under the dynamics model. It is more of a framework, into which many deep learning and heuristic search techniques could be plugged. This work uses fully connected leaky-ReLU networks to build the models and simulated annealing to implement the heuristic search. The experiment shows that the algorithm can reach the target without prior knowledge of the satellite and orbit parameters when the guided policy does not work well, which makes the control system much more adaptive to its environment.
MDRLH is a general algorithm that could suit many other control systems, not only satellite attitude control. The algorithm has several advantages: first, deep learning means that it needs no prior knowledge of
the satellite or its orbit, which makes the algorithm easy to transfer to different types of satellites in different orbits without designing a control system for each of them. Second, it is a model-based algorithm, which can learn faster than model-free algorithms such as DDPG. Third, heuristic search is a non-gradient method, which can avoid local optima.
However, MDRLH is a greedy algorithm and is not suited to tasks that require planning. Online learning also needs a large amount of computation, which may interfere with other satellite tasks. From the viewpoint of control theory, the algorithm cannot guarantee the stability of the control system, which puts an expensive satellite in danger of losing control.
In future work, the algorithm should be improved to use fewer computing resources so that it suits online tasks, and a dynamic programming capability should be added to handle harder tasks. An experiment on a PhoneSat is also planned to test the stability and robustness of the algorithm.
References
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z.,
Citro, C., Corrado, G.S., Davis, A., Dean, J. and Devin, M.
(2016), “Tensorflow: large-scale machine learning on
heterogeneous distributed systems”
Akella, M.R., Thakur, D. and Mazenc, F. (2015), “Partial Lyapunov strictification: smooth angular velocity observers for attitude tracking control”, AIAA/AAS Astrodynamics Specialist Conference, pp. 442-451.
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J. and Zaremba, W. (2016), “OpenAI Gym”.
Ghadiri, H., Sadeghi, M., Abaspour, A. and Esmaelzadeh, R.
(2015), “Optimized fuzzy-quaternion attitude control of
satellite in large maneuver”, International Conference on Space
Operations.
Gross, K., Swenson, E. and Agte, J.S. (2015), “Optimal
attitude control of a 6u cubesat with a four-wheel pyramid
reaction wheel array and magnetic torque coils”, AIAA
Modeling and Simulation Technologies Conference.
Hu, Q., Li, L. and Friswell, M.I. (2015), “Spacecraft anti-
unwinding attitude control with actuator nonlinearities and
velocity limit”, Journal of Guidance Control & Dynamics, Vol. 38 No. 10, pp. 1-8.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J.,
Girshick, R., Guadarrama, S. andDarrell, T. (2014), “Caffe:
convolutional architecture for fast feature embedding”,
pp. 675-678.
Kennedy, J. and Eberhart, R. (2002), “Particle swarm
optimization”, IEEE International Conference on Neural
Networks, 1995, Proceedings, Vol. 4, pp. 1942-1948.
Kirkpatrick, S., Gelatt, C.D., Jr and Vecchi, M.P. (1987), “Optimization by simulated annealing”, Readings in Computer Vision, Vol. 220 No. 4598, pp. 606-615.
Levine, S., Finn, C., Darrell, T. and Abbeel, P. (2016), “End-to-end training of deep visuomotor policies”, Journal of Machine Learning Research, Vol. 17 No. 1, pp. 1334-1373.
Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T.,
Tassa, Y., Silver, D. and Wierstra, D. (2015), “Continuous
control with deep reinforcement learning”, Computer Science,
Vol. 8 No. 6, p. A187.
Mazmanyan, L. and Ayoubi, M.A. (2014), “Takagi-Sugeno fuzzy model-based attitude control of spacecraft with partially-filled fuel tank”, AIAA/AAS Astrodynamics Specialist Conference.
OpenAI (2017), “Evolution strategies as a scalable alternative
to reinforcement learning”.
Punjani, A. and Abbeel, P. (2015), “Deep learning helicopter
dynamics models”, IEEE International Conference on Robotics
and Automation, pp. 3223-3230.
Shimmin, R., Salas, A.G., Oyadomari, K., Attai, W., Priscal,
C., Gazulla, O.T. and Wolfe, J. (2014), “Phonesat in-flight
experience results”, Small Satellites Systems and Services
Symposium.
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai,
M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D. and
Graepel, T. (2017), “Mastering chess and shogi by self-play
with a general reinforcement learning algorithm”.
Straub, J., Wegerson, M. and Marsh, R. (2013), “An intelligent attitude determination and control system for a CubeSat class spacecraft”, AIAA SPACE 2014 Conference and Exposition.
Sugimura, N., Fukuda, K., Tomioka, Y. and Fukuyama, M. (2012), “Ground test of attitude control system for micro satellite RISING-2”, IEEE/SICE International Symposium on System Integration, pp. 301-306.
Walker, A.R., Putman, P.T. and Cohen, K. (2015), “Solely
magnetic genetic/fuzzy-attitude-control algorithm for a
cubesat”, Journal of Spacecraft & Rockets, Vol. 52 No. 6,
pp. 1627-1639.
Williams, T.W., Shulman, S., Sedlak, J., Ottenstein, N. and Lounsbury, B. (2016), “Magnetospheric multiscale mission attitude dynamics: observations from flight data”, AIAA/AAS Astrodynamics Specialist Conference.
Williams, G., Wagener, N., Goldfain, B., Drews, P., Rehg, J.
M., Boots, B. and Theodorou, E.A. (2017), “Information
theoretic MPC for model-based reinforcement learning”,
IEEE International Conference on Robotics and Automation,
pp. 1714-1721.
Wu, B., Wang, D. and Poh, E.K. (2015), “High precision
satellite attitude tracking control via iterative learning
control”, Journal of Guidance Control & Dynamics, Vol. 38
No. 3, pp. 528-534.
Xiao, B., Hu, Q., Zhang, Y. and Huo, X. (2014), “Fault-tolerant tracking control of spacecraft with attitude-only measurement under actuator failures”, Journal of Guidance Control & Dynamics, Vol. 37 No. 3, pp. 838-849.
Xu, B., Wang, N., Chen, T. and Li, M. (2015), “Empirical
evaluation of rectified activations in convolutional network”,
Computer Science.
Corresponding author
Ke Xu can be contacted at: xuke13@iscas.ac.cn