Industrial Robot: An International Journal

Model-based deep reinforcement learning with heuristic search for satellite attitude control

Ke Xu, Fengge Wu and Junsuo Zhao
Institute of Software, Chinese Academy of Sciences, Beijing, China

DOI: https://doi.org/10.1108/IR-05-2018-0086

Abstract
Purpose – Deep reinforcement learning has been developing rapidly and has shown its power on difficult problems such as robotics and the game of Go. Meanwhile, satellite attitude control systems still rely mainly on classical control techniques such as proportional-integral-derivative (PID) and sliding mode control as their major solutions, and face problems of adaptability and automation.
Design/methodology/approach – In this paper, an approach based on deep reinforcement learning is proposed to increase the adaptability and autonomy of the satellite control system. It is a model-based algorithm that can find solutions with fewer learning episodes than model-free algorithms.
Findings – Simulation experiments show that when classical control fails, this approach can find a solution and reach the target within hundreds of episodes of exploration and learning.
Originality/value – This approach is a non-gradient method that uses heuristic search to optimize the policy and avoid local optima. Compared with classical control techniques, it does not need prior knowledge of the satellite or its orbit, can adapt to different situations by learning from data, and can adapt to different satellites and tasks through transfer learning.

Keywords Control, Artificial intelligence, Deep reinforcement learning, Satellite attitude
Paper type Research paper

1. Introduction

With the rapid development of small satellites, satellite attitude control is facing problems of autonomy and adaptability. Owing to the lower cost and faster design of small satellites, hundreds of satellites are launched each year, and the number is increasing year by year. It is hard to design a control system for each small satellite, with their different space tasks, different physical parameters, actuators and sensors, different orbits and so on. The limited ground measurement and control infrastructure cannot serve all of them at the same time. The unstable space environment (Williams et al., 2016) and satellite component errors (Hu et al., 2015; Mazmanyan and Ayoubi, 2014) also affect control precision and stability. Therefore, an intelligent control system that can adapt to different kinds of situations and work autonomously is urgently needed.

Meanwhile, deep reinforcement learning has recently been used to solve challenging problems, from autonomous racing (Williams et al., 2017) to robotics (Levine et al., 2016). AlphaZero (Silver et al., 2017) is an impressive application of deep reinforcement learning that can master different board games without offline data or human knowledge. Guided policy search (GPS) (Levine et al., 2016) is a successful model-based approach for a variety of robot tasks, which can be learned repeatedly with the help of a guided policy. Deep deterministic policy gradient (DDPG) (Lillicrap et al., 2015) gives a model-free greedy solution for the general control problems described in OpenAI Gym (Brockman et al., 2016). These results suggest that deep reinforcement learning is a promising basis for an intelligent satellite attitude control system.

Unfortunately, deep reinforcement learning demands too much computing and storage resource for an on-board computer, which is why most satellite attitude control systems still use traditional control algorithms such as proportional-integral-derivative (PID) (Gross et al., 2015; Akella et al., 2015), sliding mode (Xiao et al., 2014) or fuzzy control (Walker et al., 2015; Ghadiri et al., 2015). PID is still the most commonly used method in satellite attitude control because of its implementation simplicity and stability; its weaknesses are adaptability and robustness, and many adaptive approaches have been proposed to improve them. Sliding mode control handles non-linear systems but, like PID, its main problem is adaptability. Intelligent approaches such as fuzzy control bring adaptability to the control system, but fuzzy control needs expert knowledge to build the controller. Beyond these approaches, few works use machine learning as a supporting method (Wu et al., 2015).
To solve this problem, a new kind of satellite called PhoneSat (Shimmin et al., 2014) has been proposed, which uses a smartphone as a new kind of on-board computer to upgrade the computing and storage capability of satellites. Many deep learning tools such as TensorFlow (Abadi et al., 2016) and Caffe (Jia et al., 2014) also support the Android OS, which makes it more convenient to develop deep reinforcement learning applications on smartphones.

In this paper, a model-based deep reinforcement learning algorithm is proposed that learns from data generated by the satellite attitude control system, optimizes the control system online and makes it more adaptive to the space environment and to hardware changes of the satellite itself. The algorithm has three main parts: a model network that predicts the next state from the current state and action, a policy network that outputs the control policy and a heuristic search part that optimizes the control policy. The model network can first be trained offline with ground simulation and then trained online with actual attitude data. The main error of the whole system comes from the model prediction error, so a feedback compensation is added to make up for it. The neural network captures the non-linear dynamics in general; however, two predicted states that are neighbours in time may differ in detail while remaining close in the state space, which makes the prediction error approximately linear, so a linear feedback compensation works well. The policy network can also first be trained offline with an existing control policy such as PID or sliding mode control, and then trained online with the optimized policy produced by the heuristic search. Sugimura et al. (2012) describe the ground test of a satellite attitude simulation system that could be used as such a pre-training environment. The heuristic search looks for a better policy around the initial actions given by the policy network output; if the optimized policy works better on the model network, the system uses it as the real output to control the attitude, and if the policy also works better in the real environment, the policy network uses it as training data. Simulation results show that this algorithm is able to find a better policy and reach the control target.

2. Model-based deep reinforcement learning with heuristic search

In the game of Go, the state transition is explicit, so there is no need for a model network, but a Q-value still has to be built. In model-based deep reinforcement learning with heuristic search (MDRLH), once a model network is built, the Q-value can be derived from the model output, so there is no need for a separate Q-value network. DDPG has four networks, two that output Q-values and two that output the policy, whereas MDRLH has two networks, one that outputs the state prediction and one that outputs the policy. In addition, DDPG uses gradient descent to optimize the policy, whereas MDRLH uses heuristic search. Figure 1 shows the MDRLH workflow.

[Figure 1: MDRLH search workflow]
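To make this two-network structure concrete, the sketch below shows how a Q-value can be computed directly from the model network's predicted next state, so that no separate critic network is needed, and how the policy network's output seeds the heuristic search. This is a minimal illustration, not the paper's code: the callable interfaces, helper names and the component weight vector w_s are assumptions.

```python
import numpy as np

def q_value(model_net, state, action, target_state, w_s):
    # Q(s_t, a_t): weighted MSE between the predicted next state and the mission
    # target, computed from the model network instead of a learned critic.
    next_state = model_net(state, action)        # s_{t+1} = f_{theta_s}(s_t, a_t)
    return float(np.sum(w_s * (next_state - target_state) ** 2))

def control_step(model_net, policy_net, heuristic_search, state, target_state, w_s):
    # One MDRLH step: the policy network proposes an action and the heuristic
    # search refines it against the model-based Q-value.
    initial_action = policy_net(state)           # a_t = g_{theta_a}(s_t)
    return heuristic_search(
        lambda a: q_value(model_net, state, a, target_state, w_s),
        initial_action,
    )
```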
The algorithm works as follows:
- Randomly initialize the networks, or initialize them from ground simulation.
- Compute the policy network output for the current state, pass it through the model network and compute its Q-value to initialize the heuristic search.
- Use heuristic search to find a better policy with a lower Q-value than the initial policy.
- Execute the optimized policy, obtain the real output and its Q-value, and calculate the model prediction error.
- If the result is better than the initial policy, use it as policy training data.
- If the state goes beyond the safe zone, use the guided policy (for example, a PID policy) to return to the training zone.
- Use every state-action-state pair as training data for the model network.
- Stop if the target state is reached; otherwise return to Step 2.

2.1 Model network

Some works have used machine learning techniques to build satellite attitude dynamics models (Straub et al., 2013). In this paper, a neural-network dynamics model is used as part of the MDRLH algorithm. The model network $s_{t+1} = f_{\theta_s}(s_t, a_t)$ takes the previous state-action pair as input and outputs the predicted state, where $\theta_s$ are the network parameters. They can be randomly initialized, or pre-trained with ground simulation or data from other satellites and then applied to the model network through fine-tuning or transfer learning. Pre-training accelerates the network's convergence and improves the online performance of the whole algorithm. The network uses leaky ReLU (Xu et al., 2015) as the activation function instead of ReLU. Some results show the limitations of ReLU networks (Punjani and Abbeel, 2015) and rely on careful initial parameter settings to avoid vanishing gradients, whereas a leaky-ReLU network does not need to worry about this kind of problem.

The loss function of the model network is $L = \|s' - s\|^2$, the mean square error (MSE) between the network output and the next state from the data. This loss minimizes the error between the predicted state and the real state after an action. Although the model network captures most of the system's non-linearity, minor errors remain because of the lack of data and the limitations of the network itself, and these cause the main error of the whole algorithm. So an estimate of this error is necessary when the model network is used to predict the state. The model network can be seen as the feedforward part of the algorithm and the error estimation as the feedback part:

$e_{t+1} = (s'_t - s_t) \cdot w_p + e_t \cdot w_i + (e_t - e_{t-1}) \cdot w_d$,

where the initial error $e_0$ is set to 0. This is effectively a PID law, with $s'_t - s_t$ (the prediction error at step $t$) as the proportional term, $e_t$ as the integral term and $e_t - e_{t-1}$ as the differential term. When the algorithm uses the model network to predict the state, the error estimate is added, $s_{t+1} = f_{\theta_s}(s_t, a_t) + e$, whereas updating $\theta_s$ does not use this error. A model-based algorithm improves the control system every episode, whereas a model-free algorithm wanders longer and can hardly improve until it reaches the target state. Moreover, the model network can keep learning the dynamics online as long as the satellite attitude control system is working, even if the system uses another control algorithm, which makes the approach more adaptable.
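As a rough illustration of the model network and the PID-style error compensation just described, the following NumPy sketch shows the prediction path $s_{t+1} = f_{\theta_s}(s_t, a_t) + e$. The layer sizes, parameter layout, gain values $w_p$, $w_i$, $w_d$ and the sign convention of the error (observed minus predicted, so that adding $e$ pulls the prediction toward the observed state) are assumptions for illustration, not values from the paper.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0.0, x, alpha * x)

def model_forward(theta_s, state, action):
    # s_{t+1} = f_{theta_s}(s_t, a_t): a small fully connected leaky-ReLU network.
    x = np.concatenate([state, action])
    h = leaky_relu(theta_s["W1"] @ x + theta_s["b1"])
    return theta_s["W2"] @ h + theta_s["b2"]

class ErrorCompensator:
    """PID-style estimate e_{t+1} = err_t * w_p + e_t * w_i + (e_t - e_{t-1}) * w_d,
    where err_t is the difference between the observed and the predicted state."""
    def __init__(self, dim, w_p=0.5, w_i=0.1, w_d=0.05):
        self.w_p, self.w_i, self.w_d = w_p, w_i, w_d
        self.e = np.zeros(dim)          # e_0 = 0, as in the text
        self.e_prev = np.zeros(dim)

    def update(self, observed_state, predicted_state):
        err = observed_state - predicted_state   # assumed sign convention
        new_e = err * self.w_p + self.e * self.w_i + (self.e - self.e_prev) * self.w_d
        self.e_prev, self.e = self.e, new_e
        return self.e

    def compensated(self, prediction):
        # Prediction used by the search: f_{theta_s}(s_t, a_t) + e.
        return prediction + self.e
```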
2.2 Policy network

The policy network $a_t = g_{\theta_a}(s_t)$ takes the current state as input and outputs the current policy, where $\theta_a$ are the network parameters. They can be randomly initialized or pre-trained with a guided policy such as PID control. Pre-training reduces the probability of leaving the safe zone and improves the performance of the whole algorithm. The network uses a similar structure to the model network, for the same reason mentioned before.

The loss function of the policy network is $L(\theta_a) = \|a'_t - a_t\|^2 = \|a'_t - g_{\theta_a}(s_t)\|^2$, the MSE between the network output $a_t = g_{\theta_a}(s_t)$ and the policy $a'_t$ from the data in the same state. This loss minimizes the error between the output policy and the guided policy from the data. Unlike the error in the model network, there is no "correct policy" for the control system: the policy network output differs from the guided policy in the data, but there is no way of telling which is better. Moreover, during learning episodes the policy network output is only the initial input to the heuristic search, not the control output itself, so there is no need to model the error between them. This network can find a global solution for the control system, which is different from GPS, where the solution follows a certain routine. If necessary, the policy network can imitate other policies to adapt to new tasks quickly.

2.3 Heuristic search

Several algorithms could serve as the heuristic search, such as simulated annealing (Kirkpatrick et al., 1987), particle swarm optimization (Kennedy and Eberhart, 2002) and natural evolution strategies (OpenAI, 2017). The search takes the output of the policy network as its input and returns an optimized policy with a lower Q-value, where the Q-value is defined as:

$Q(s_t, a_t) = \|s_{t+1} - s_T\|^2 \cdot w_s$ (1)

In equation (1), $s_{t+1}$ is the resulting state after the action taken at time $t$, and it can be either the model prediction or the real output of the environment. $s_T$ is the target state, decided by the space mission (target tracking or taking photos, for example). $w_s$ is a weight over the state components, also decided by the mission. The Q-value measures the weighted MSE between the next state and the target state, which is different from a temporal-difference value. As is well known in reinforcement learning, different Q-value definitions lead to different agent behaviours when the algorithm converges. In satellite attitude control, for example, the state contains the satellite quaternion and the angular velocity. If the quaternion weight is larger, the satellite prefers to move fast and the attitude is less stable, because a large angular velocity contributes little to the Q-value. If the velocity weight is larger, the satellite prefers to stay still, which takes longer to reach the target. A proper $w_s$ should therefore be set for acceptable performance; for instance, the quaternion weight should be larger when the satellite learns a fast attitude manoeuvre task, and the angular velocity weight should be larger when the satellite needs to take a high-definition photo.

Simulated annealing is used to optimize the policy. It is a kind of Markov chain Monte Carlo algorithm that starts from a deterministic initial input, applies a random change to it and decides whether the change is better than the original. It is rather general, so it needs modification to fit this situation:

$a_i' = a_i \cdot (1 + SR)$ (2)

$\Delta = Q(s, a_i') - Q(s, a_i)$ (3)

$p = \begin{cases} \exp\!\left(\dfrac{-\Delta}{k \cdot T_0 \cdot Q(s_{t-1}, a_{t-1})}\right) & \text{if } \Delta > 0 \\ 1 & \text{if } \Delta \le 0 \end{cases}$ (4)
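Before unpacking equations (2)-(4) below, the following sketch shows one way this annealing loop could be realized in code. It is a hedged illustration under assumptions: the q_of callable, the clipping bounds used for actuator limits, the iteration budget and the per-iteration temperature decay are choices made here, not details taken from the paper.

```python
import numpy as np

def anneal_search(q_of, a_init, q_prev, a_low, a_high,
                  delta=0.05, k=0.99, T0=1.0, iters=1024, rng=None):
    # q_of: callable returning the model-based Q-value of an action (lower is better).
    # q_prev: Q(s_{t-1}, a_{t-1}), used to scale the initial temperature as in the text.
    rng = rng if rng is not None else np.random.default_rng()
    a_in = np.asarray(a_init, dtype=float).copy()
    q_in = q_of(a_in)
    a_opt, q_opt = a_in.copy(), q_in
    T = T0 * q_prev                                   # initial temperature scaled by Q(s_{t-1}, a_{t-1})
    for _ in range(iters):
        sr = rng.uniform(-delta, delta, size=a_in.shape)
        a_new = np.clip(a_in * (1.0 + sr), a_low, a_high)        # eq. (2), plus actuator limits
        q_new = q_of(a_new)
        d = q_new - q_in                                          # eq. (3)
        if d <= 0 or rng.random() < np.exp(-d / max(T, 1e-12)):  # eq. (4) acceptance rule
            a_in, q_in = a_new, q_new
            if q_in < q_opt:                                      # remember the best accepted action
                a_opt, q_opt = a_in.copy(), q_in
        T *= k                                                    # one reading of the decay rate k
    return a_opt
```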
Equation (2) shows the random change applied to the input in each search step, where $SR \in (-\delta, \delta)$ is a uniform random variable with the same dimension as $a_i$ and $0 < \delta \ll 1$, so the policy moves slowly to improve the stability of learning, similar to DDPG. $a_i$ is also constrained by the satellite actuators' output capabilities, for example the reaction wheels and magnetic torquers. The initial input from the policy network is $a_0$, and $i$ is incremented whenever a new action is accepted by the algorithm. Equation (3) gives the difference in Q-values between the input action $a_i$ and its random perturbation $a_i'$, where $s$ is the current state and $Q(s, a)$ is computed from the model network and equation (1). If $\Delta < 0$, the new action is better than the current one; otherwise it is worse. Equation (4) gives the acceptance probability of $a_i'$. If $\Delta \le 0$, $a_i'$ is no worse than $a_i$, so it is always accepted. If $\Delta > 0$, $a_i'$ is worse than $a_i$, but it still has a chance of being accepted with probability $p$, which keeps the algorithm from getting stuck in local optima. $0 \ll k < 1$ is a decay rate that makes it progressively harder to accept policies with $\Delta > 0$; here $k = 0.99$. $T_0$ is the initial temperature of the simulated annealing; since $\Delta$ is a difference of Q-values, $T_0$ is multiplied by $Q(s_{t-1}, a_{t-1})$ to put the initial temperature on a reasonable scale.

In Algorithm 1, the outer while loop runs for at most 4,096 episodes; the algorithm stops if it still cannot find a solution. The inner while loop runs for at most 1,024 iterations, after which the heuristic search outputs the current optimum action:

Algorithm 1. Model-based Deep Reinforcement Learning with Heuristic Search
1: Randomly initialize $\theta_a$ and $\theta_s$, or initialize them by transfer learning
2: while solution not found do
3:   Generate initial policy: $a_{in} \leftarrow g_{\theta_a}(s_t)$, optimum action $a_{opt} \leftarrow a_{in}$
4:   while in heuristic search do
5:     Generate explore policy: $a_{out} \leftarrow a_{in} \cdot (1 + SR)$
6:     Predict state: $s_{t+1} \leftarrow f_{\theta_s}(s_t, a_{out})$
7:     $Q(s_t, a_{out}) \leftarrow \|s_{t+1} - s_T\|^2 \cdot w_s$
8:     if $Q(s_t, a_{out})$ is accepted by the heuristic search then
9:       $a_{in} \leftarrow a_{out}$
10:      if $Q(s_t, a_{out}) < Q(s_t, a_{opt})$ then
11:        $a_{opt} \leftarrow a_{out}$
12:      end if
13:    end if
14:  end while
15:  Execute $a_{opt}$ and collect data. Update $\theta_s$ with the data.
16:  if $a_{opt}$ is actually better than $a_{in}$ in the data then
17:    Update $\theta_a$ with the data.
18:  end if
19: end while

3. Experimental results

The experiment uses a satellite attitude simulation system to show that when a classical control algorithm performs poorly and cannot reach the control target, MDRLH can learn from data and find its way to the target without the prior knowledge of the satellite that classical control demands: the satellite's inertia, its orbit and the quaternion kinematics. It only needs to know the dimensions of the quaternion, the angular velocity and the actuators.

Figure 2 shows PID performance with random initial angular velocity and quaternion. It converges to an MSE of about $10^{-5}$ in fewer than 200 episodes. During the first 20 episodes, the angular velocity increases so as to reach the target quaternion faster, so the total MSE does not decrease. Figure 3 shows a case where, for some reason, classical control cannot reach the target but does not diverge.

[Figure 2: PID algorithm performance]
[Figure 3: Guided policy that does not work well and cannot reach the target]
Assume a guided policy that performs like this while the satellite attitude determination system works fine, and then apply MDRLH to see whether it can help the satellite reach the target. The guided policy is set to give random output when $\omega^2 < 0.5$ and to use PID control otherwise. Figure 4 shows MDRLH performance with this guided policy. In this experiment, MDRLH is initialized with random parameters and the control output may diverge after some episodes, so the guided policy is used to prevent MDRLH from diverging. In space, the control system cannot simply be reset to an initial state, so a guided policy is needed when the learning algorithm explores too much. Figure 4(a) shows the best case of the learning process; the algorithm finds a solution after only a few episodes of guidance. Figure 4(b) shows the normal case, in which the algorithm finds a solution after about 1,000 episodes and several trials. Figure 4(c) shows the worst case, in which the algorithm finds a solution only after tens of trials, sometimes getting close to convergence and then losing it again. The experiment is limited to 4,096 episodes, as mentioned in Algorithm 1, and sometimes no solution is found within that limit; however, the figures indicate that the algorithm still has a chance of converging after these episodes. The differences between these cases are probably caused by the random initialization of the model network and policy network parameters, so if the networks were pre-trained with ground simulation, the number of learning episodes might be reduced significantly.

[Figure 4: MDRLH results]

An interesting property of the quaternion is that $\|q\| = 1$. The model network is not told this property. In Figure 5, the y-axis is the model network's quaternion prediction; during the learning process, the results show that the model network learns this property from the training data. This is an important indicator of the quality of a satellite attitude dynamics model.

[Figure 5: Learning the property $\|q\| = 1$]

4. Conclusion

MDRLH uses deep learning to learn a dynamics model and an optimal policy from data generated by the control system, and uses heuristic search to find a better policy under the dynamics model. It is more of a framework, and many deep learning and heuristic search techniques could be plugged into it. This work uses fully connected leaky-ReLU networks to build the models and simulated annealing to implement the heuristic search. The experiments show that the algorithm can reach the target without prior knowledge of the satellite and orbit parameters when the guided policy does not work well, which makes the control system much more adaptive to the environment. MDRLH is a general algorithm that could suit many other control systems, not only satellite attitude control.
There are several advantages to the algorithm. First, deep learning removes the need for prior knowledge of the satellite and its orbit, which makes the algorithm easy to transfer to different types of satellites in different orbits without designing a control system for each of them. Second, it is a model-based algorithm, which can learn faster than a model-free algorithm such as DDPG. Third, heuristic search is a non-gradient method, which can avoid local optima. However, MDRLH is a greedy algorithm, which is not suitable for tasks that require planning. Online learning also needs a large amount of computational resources, which may interfere with other satellite tasks. From the viewpoint of control theory, this algorithm cannot guarantee the stability of the control system, which puts an expensive satellite in danger of losing control. In the future, the algorithm should therefore be improved to use fewer computing resources for online tasks, dynamic programming ability should be added to handle harder tasks, and an experiment on PhoneSat is planned to test the stability and robustness of the algorithm.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J. and Devin, M. (2016), "TensorFlow: large-scale machine learning on heterogeneous distributed systems".
Akella, M.R., Thakur, D. and Mazenc, F. (2015), "Partial Lyapunov strictification: smooth angular velocity observers for attitude tracking control", AIAA/AAS Astrodynamics Specialist Conference, pp. 442-451.
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J. and Zaremba, W. (2016), "OpenAI Gym".
Ghadiri, H., Sadeghi, M., Abaspour, A. and Esmaelzadeh, R. (2015), "Optimized fuzzy-quaternion attitude control of satellite in large maneuver", International Conference on Space Operations.
Gross, K., Swenson, E. and Agte, J.S. (2015), "Optimal attitude control of a 6U CubeSat with a four-wheel pyramid reaction wheel array and magnetic torque coils", AIAA Modeling and Simulation Technologies Conference.
Hu, Q., Li, L. and Friswell, M.I. (2015), "Spacecraft anti-unwinding attitude control with actuator nonlinearities and velocity limit", Journal of Guidance, Control, and Dynamics, Vol. 38 No. 10, pp. 1-8.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S. and Darrell, T. (2014), "Caffe: convolutional architecture for fast feature embedding", pp. 675-678.
Kennedy, J. and Eberhart, R. (2002), "Particle swarm optimization", IEEE International Conference on Neural Networks, 1995, Proceedings, Vol. 4, pp. 1942-1948.
Kirkpatrick, S., Gelatt, C.D. Jr and Vecchi, M.P. (1987), "Optimization by simulated annealing", Readings in Computer Vision, Vol. 220 No. 4598, pp. 606-615.
Levine, S., Finn, C., Darrell, T. and Abbeel, P. (2016), "End-to-end training of deep visuomotor policies", Journal of Machine Learning Research, Vol. 17 No. 1, pp. 1334-1373.
Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D. and Wierstra, D. (2015), "Continuous control with deep reinforcement learning", Computer Science, Vol. 8 No. 6, p. A187.
Mazmanyan, L. and Ayoubi, M.A. (2014), "Takagi-Sugeno fuzzy model-based attitude control of spacecraft with partially-filled fuel tank", AIAA/AAS Astrodynamics Specialist Conference.
OpenAI (2017), "Evolution strategies as a scalable alternative to reinforcement learning".
Punjani, A. and Abbeel, P. (2015), "Deep learning helicopter dynamics models", IEEE International Conference on Robotics and Automation, pp. 3223-3230.
Shimmin, R., Salas, A.G., Oyadomari, K., Attai, W., Priscal, C., Gazulla, O.T. and Wolfe, J. (2014), "PhoneSat in-flight experience results", Small Satellites Systems and Services Symposium.
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D. and Graepel, T. (2017), "Mastering chess and shogi by self-play with a general reinforcement learning algorithm".
Straub, J., Wegerson, M. and Marsh, R. (2013), "An intelligent attitude determination and control system for a CubeSat class spacecraft", AIAA SPACE 2014 Conference and Exposition.
Sugimura, N., Fukuda, K., Tomioka, Y. and Fukuyama, M. (2012), "Ground test of attitude control system for micro satellite RISING-2", IEEE/SICE International Symposium on System Integration, pp. 301-306.
Walker, A.R., Putman, P.T. and Cohen, K. (2015), "Solely magnetic genetic/fuzzy-attitude-control algorithm for a CubeSat", Journal of Spacecraft & Rockets, Vol. 52 No. 6, pp. 1627-1639.
Williams, T.W., Shulman, S., Sedlak, J., Ottenstein, N. and Lounsbury, B. (2016), "Magnetospheric multiscale mission attitude dynamics: observations from flight data", AIAA/AAS Astrodynamics Specialist Conference.
Williams, G., Wagener, N., Goldfain, B., Drews, P., Rehg, J.M., Boots, B. and Theodorou, E.A. (2017), "Information theoretic MPC for model-based reinforcement learning", IEEE International Conference on Robotics and Automation, pp. 1714-1721.
Wu, B., Wang, D. and Poh, E.K. (2015), "High precision satellite attitude tracking control via iterative learning control", Journal of Guidance, Control, and Dynamics, Vol. 38 No. 3, pp. 528-534.
Xiao, B., Hu, Q., Zhang, Y. and Huo, X. (2014), "Fault-tolerant tracking control of spacecraft with attitude-only measurement under actuator failures", Journal of Guidance, Control, and Dynamics, Vol. 37 No. 3, pp. 838-849.
Xu, B., Wang, N., Chen, T. and Li, M. (2015), "Empirical evaluation of rectified activations in convolutional network", Computer Science.
This work is supported by the National Natural Science Foundation of China (Grant Number 61202218). Received 7 May 2018; revised 31 May 2018; accepted 5 July 2018.

Corresponding author
Ke Xu can be contacted at: xuke13@iscas.ac.cn