Optimized Path Planning in Reinforcement Learning by Backtracking


model. Furthermore, it consists of an agent, the environment to interact with, and the goal to reach. After constructing the components and presenting the solution, we briefly discuss the exploit vs. explore dilemma in RL, a dilemma that does not arise in other machine learning methods. At the end of this paper, we present the results of our tests and optimization.
We also compare our results after optimization. The solution is driven by a set of hyperparameters, which we explain and tune before analyzing the results.
The structure of this paper is as follows. The next section details the related work in path planning. Section III introduces the problem of pathfinding in the smart lab and the pre-established assumptions. In Section IV, we elaborate on the solution components.
Section V briefly discusses the exploit vs. explore dilemma. Section VI presents the parameters of the model that affect performance and convergence time. Section VII explains how we can improve the convergence time needed to find the optimal path. Section VIII briefly compares our model with other models and, finally, the conclusion section discusses the findings.

Related Work
Robot navigation in an environment is not a new topic and, as a result, there has been great progress in this field. One instance is the use of Extended Kalman Filtering (EKF) to implement Simultaneous Localization and Mapping (SLAM) in a robot navigation task [3]. Depth First Search (DFS) and its special version, called Flooding, are utilized to search for the shortest path in a graph. Among many existing results, [5] has shown that the maze structure can change the performance of Breadth First Search (BFS) and DFS. For example, DFS tends to explore all the cells in the graph, which may be a big waste. DFS also needs more memory, as is the case for Flooding, which is a concern in mazes larger than 8x8.
In addition to the above-mentioned graph-based methods, other algorithms such as A* and D* [6] [7] have been suggested for robot path planning, and A* has been reported to outperform the D* algorithm.
The study in [8] proposed a greedy DFS for path selection. [9] suggests an obstacle avoidance model that navigates a robot through the cooperation of multiple neural networks. In this case, the robot is trained to make decisions based on existing patterns. For example, [10] suggested a solution that navigates the robot to avoid collisions by deciding based on 256 patterns. The 256 patterns are generated from 8 sonar readings that model all possible scenarios, and they are trained off-line and learned by the artificial neural network.
Moreover, [11] shows that the Dijkstra algorithm can solve maze robot path planning. The study in [12] proposed a model that maps the whole maze as a graph in the standard "adjacency-list representation" and finds the shortest path in the shortest time with the Dijkstra algorithm.
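To make the adjacency-list idea concrete, the following is a minimal Python sketch of Dijkstra's algorithm over a maze stored as an adjacency list. It is an illustration only and is not taken from [11] or [12]; the function name `dijkstra`, the junction labels and the edge weights in `maze_graph` are hypothetical.

```python
import heapq

def dijkstra(adjacency, source, target):
    """Return (cost, path) of the cheapest route, or None if unreachable.

    adjacency maps a node to a list of (neighbor, weight) pairs.
    """
    dist = {source: 0}
    prev = {}
    visited = set()
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue  # stale heap entry
        visited.add(node)
        if node == target:
            break
        for neighbor, weight in adjacency.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                prev[neighbor] = node
                heapq.heappush(heap, (nd, neighbor))
    if target not in dist:
        return None
    # Walk the predecessor links back from the target to the source.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[target], path

# Hypothetical maze with four junctions; weights are corridor lengths.
maze_graph = {
    "A": [("B", 1), ("C", 4)],
    "B": [("A", 1), ("C", 1), ("D", 5)],
    "C": [("A", 4), ("B", 1), ("D", 1)],
    "D": [("B", 5), ("C", 1)],
}
result = dijkstra(maze_graph, "A", "D")  # (3, ["A", "B", "C", "D"])
```

In this hypothetical graph the direct corridor from B to D costs 5, so the cheapest route goes through C instead.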
Recent advances in image processing and artificial intelligence have been utilized to build smarter maze robots. For example, the authors in [13] built an intelligent maze-solving robot that determines its shortest path on a line maze based on image processing and artificial intelligence algorithms. The image of the line maze is captured by a camera and sent to a computer, where a program analyzes and processes it using graph theory algorithms. The developed program solves the captured lines by examining all possible paths (hence supervised learning) in the maze that could convey the robot to the required destination point.
The shortest path is then sent to the robot through Bluetooth so that it can reach its destination point. The authors in [13] claim that the solution works faster than the traditional methods, which push the robot to move through the maze cell by cell in order to find its destination point. The research in [14] solved the maze by upgrading the line maze solving algorithm (an algorithm used to solve a maze made of lines to be traced by a mobile robot) to handle curved and zigzag turns.
Others have solved the same problem by combining Zigbee wireless robots, markers, a camera, image processing, and the BFS and DFS algorithms for trajectory calculation, path planning and trajectory execution within an indoor maze environment [15]. The Lee algorithm [16], which is implemented with a breadth-first search, is used to find the shortest path in a maze. The maze is effectively "flooded with water", and each point in the maze keeps track of the direction from which it was flooded.
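The flooding idea can be sketched in a few lines of Python. The code below is a generic breadth-first flood on a grid, written for illustration and not taken from [16]; the grid encoding ('#' for a wall, '.' for a free cell) and the name `lee_shortest_path` are assumptions.

```python
from collections import deque

def lee_shortest_path(grid, start, goal):
    """Breadth-first flood from start; returns the cell path or None.

    grid is a list of equal-length strings, cells are (row, col) tuples.
    """
    rows, cols = len(grid), len(grid[0])
    came_from = {start: None}  # cell -> cell it was flooded from
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            break
        r, c = cell
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            nxt = (nr, nc)
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and nxt not in came_from:
                came_from[nxt] = cell
                queue.append(nxt)
    if goal not in came_from:
        return None
    # Follow the recorded flood directions back from the goal.
    path = []
    cell = goal
    while cell is not None:
        path.append(cell)
        cell = came_from[cell]
    path.reverse()
    return path

# Hypothetical 3x4 maze: the only route to the top-right corner goes around the walls.
maze = [
    "..#.",
    ".##.",
    "....",
]
path = lee_shortest_path(maze, (0, 0), (0, 3))
```

Because every move has the same cost, the first time the flood reaches the goal it has done so along a shortest path.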

The Smart Lab Problem
In general, the RL problem is about finding a series of actions that result in some best output. The mapping from states to the actions that give the best result is captured in the form of a function called the policy function (ℼ). Hence, we are looking for the best ℼ, or optimal ℼ (ℼ*), in this context. The mapping function ℼ maps a state to an action, i.e. ℼ: S → A, and the job of RL is to find ℼ*. We can find ℼ* in two general ways: by "searching" the solution space and trying all possible scenarios before choosing the best one, or by "estimating" a function (function approximation) that generates the values. In this paper, we focus on the former approach, since our solution to the smart lab problem is built around the following assumptions:
I. The reward signal of doing an action in the smart lab environment is static (we always give -1 for each move that does not result in reaching the target state). This assumption rules out the idea of having stochastic rewards.
II. We do not allow our agent to move to an obstacle since the entire state space is observable and modelled. Having a model allows the agent to "plan" before moving.
III. The result of our action is not stochastic, and we do not need to use statistical techniques in solving the problem.
IV. The state space is not too big and is not continuous. This assumption allows us to explore all the states in the space and make use of that in the planning.
V. The state space does not have any moving object other than the learner agent. That means the agent is the only moving object in the state space (i.e. we do not have a multi-agent environment).
VI. The state values that we would like to learn are the criterion for finding the best path.
VII. We simulate the state environment in the form of a maze and try to find the best path in an offline manner (i.e. the robot does not move until the program finishes its planning). In the context of robot path planning, this approach is called "global path planning". In this paper, we did not assume that the robot has any sensory data and, as a result, there is no real-time path planning (local path planning).
VIII. Since there is a model of the environment in the form of a maze, we can safely assume that the robot has the map.
In the smart lab, there are obstacles like tables or chairs and surrounding walls. A robot has to move from a base station (start point) toward a target location in the lab, such as a table. The decisions that the agent makes depend on the topology and the environment model [17]. The topology of the environment is known ahead of time and is shown in [Figure 1]. The robot knows where the start location is and where it is heading, i.e. the target location.
There are different ways to model and solve the movement of a robot in the lab, such as graph theory techniques or computational geometry. In this paper, the solution method is based on modeling the indoor laboratory in the form of a grid or a maze, as shown in [Figure 2]. As a result, we abstract the problem statement into moving the robot from one cell of a grid to another cell. This scenario is very similar to one of the well-researched topics in the RL literature: moving an agent in a maze (or grid). The agent is a software component that is given the intelligence to make decisions sequentially based on the observations and the feedback received from the environment. The environment gives feedback to the agent about its last decision in the form of immediate rewards (the primary reward). In addition to rewards, it presents the changes in the state space after the last action is performed. In the RL literature, this is called the state signal [1]. A maze is a simulation of a controlled environment in which a robot (which carries the agent software) exists.

The Solution Components
In this section, we give details of the architecture and of how we found an optimal path before the search function converged.
In other words, instead of finding the best values of the states, we focus on finding the optimal route, which appears much sooner than the final state values.
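One way to see why a route can be available before the values settle is that only the ordering of neighboring state values matters when a path is read off. The Python sketch below illustrates this under stated assumptions: the function `greedy_route`, the `max_steps` guard and the numeric values are hypothetical, and the values are made up for a small 2x3 grid with one obstacle.

```python
def greedy_route(values, start, goal, max_steps=100):
    """Follow the best-valued neighbor from start until goal is reached.

    values maps each free cell (row, col) to its current state value;
    obstacles are simply absent from the dict. Returns None on failure.
    """
    route = [start]
    cell = start
    while cell != goal and len(route) <= max_steps:
        r, c = cell
        neighbors = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
        candidates = [n for n in neighbors if n in values]
        if not candidates:
            return None
        cell = max(candidates, key=lambda n: values[n])
        route.append(cell)
    return route if cell == goal else None

# Hypothetical, not yet converged values; cell (0, 1) is an obstacle.
values = {(0, 0): -4.0, (1, 0): -3.0, (1, 1): -2.0, (1, 2): -1.0, (0, 2): 0.0}
route = greedy_route(values, (0, 0), (0, 2))
```

Scaling or shifting all of these values would leave the extracted route unchanged, as long as their relative order is preserved.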

A. Agent
An agent is a piece of software that is able to learn and take actions accordingly. It has the following features:
i. An agent has to be able to choose an action in a state.
ii. An agent should know what the best action is to reach the goal. The agent does not know this at the beginning and will learn about it gradually.
"The best action" is formulated in the context of a dilemma in RL called "exploit vs. explore". The agent has to explore new actions that are not necessarily identified as the best-learned action at the moment, in the hope that it learns a new, better way to achieve its goal. For example, 20% of the time it explores and 80% of the time it picks (exploits) the best action that it has already learned.
If the agent always exploits, then it uses what is called a "greedy" approach. The problem with the greedy approach is that it may never find the true best action. The agent has to learn from its actions.
Although there are more advanced methods, such as Neural Networks (NN), to learn the best path to a destination from existing data (for instance, existing robot navigation data), we have used a simple model in this paper. It is based on the reward system and the state value of each cell in the maze at each moment.
Since the agent must learn from its actions, it should store the state values and be able to keep track of and update its state history. In addition, the agent has to memorize the states it traverses during learning, and it has to keep track of the steps it takes in each episode. This is especially important in a multi-agent environment, where the configuration of the state space constantly changes due to other agents' moves.
We should not confuse the state of the environment with the current state of the agent. For example, at each point in time, the robot is in one specific state (position) in the entire state space. In our model, the only thing that changes the state space is the movement of the robot (the occupancy of the cells changes after each move). The agent must correlate its actions with the decision-making process. The glue between its actions and the decision-making process is the value of the states. We cover this matter later in the Future Reward section of this paper. At a high level, the agent components have the functions described in [Figure 3].

B. Environment:
The environment is the "perception" of the agent about the world. We model the world for the agent in the form of its environment. It has the following features:
i. The environment has a state at each point in time. We let the agent observe that state.
ii. The agent's moves can change parts of the state space but not all of it. For example, the set of free cells in the grid changes when an agent moves from one state to another. The agent cannot pass through the obstacles.
As a result, we implemented a function that controls the valid actions/moves in a cell based on the location of the agent and its immediately adjacent cells.
Our implementation includes a function that checks whether the robot has reached its destination (goal) after each move. Since we assume only one agent in the environment (i.e. not a multi-agent environment), nothing changes in the state space other than the state of the agent. In this case, it suffices to track just the state of the robot (the agent) in the state space. In the end, the environment supports the functions shown in [Figure 5].
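A minimal Python sketch of such an environment is given below for illustration. It is not the implementation referred to in [Figure 5]; the class name `MazeEnvironment`, its method names, the 0/1 grid encoding and the small example grid are all assumptions.

```python
# Action name -> (row delta, column delta); names are illustrative.
ACTIONS = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

class MazeEnvironment:
    def __init__(self, grid, start, goal):
        self.grid = grid      # assumed encoding: 0 = free cell, 1 = obstacle
        self.state = start    # only the agent's position needs tracking
        self.goal = goal

    def is_free(self, cell):
        r, c = cell
        inside = 0 <= r < len(self.grid) and 0 <= c < len(self.grid[0])
        return inside and self.grid[r][c] == 0

    def valid_actions(self):
        """Actions that keep the agent inside the grid and off obstacles."""
        r, c = self.state
        return [a for a, (dr, dc) in ACTIONS.items() if self.is_free((r + dr, c + dc))]

    def step(self, action):
        if action not in self.valid_actions():
            raise ValueError(f"invalid action {action!r} in state {self.state}")
        dr, dc = ACTIONS[action]
        self.state = (self.state[0] + dr, self.state[1] + dc)
        return self.state

    def is_done(self):
        """True once the agent has reached the goal cell."""
        return self.state == self.goal

# Hypothetical 2x3 lab grid with one obstacle at (0, 1).
env = MazeEnvironment([[0, 1, 0], [0, 0, 0]], start=(0, 0), goal=(0, 2))
first_options = env.valid_actions()  # ["D"]
```

From the start cell only "D" is offered, because "R" would enter the obstacle and the other two moves would leave the grid.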

C. Immediate Reward:
The reward is indeed a component that we could include in the Environment section, but we intentionally discuss it separately to distinguish it from the long-term rewards discussed in the next section. The immediate reward is given to the agent based on the action that the agent has taken. Without a proper rewarding mechanism, RL as a whole will not work properly. In this research, we model a non-stochastic rewarding system based on the assumptions that the environment is static and the only moving part is the robot. This static rewarding model rules out the discussion of multi-armed bandit algorithms in our solution. The objective of the agent in its decision-making process is to maximize the reward it gets for taking actions. In this research, we did not include any criteria other than maximizing the reward function. In the real world, there are usually more objectives to achieve, such as choosing the safest path (if there is more than one path to the destination), choosing the shortest path, and avoiding collisions with moving obstacles.
A reward is a scalar value. We could give a positive reward (+1) for any action that we consider "desirable" and a negative reward (-1) for non-desirable actions. In our model, we just give -1 for the actions that do not land the agent on the goal cell.
The reward system should have the following features:
a. The reward mechanism should incentivize the agent to finish its task sooner.
b. The above requirement enables us to push the agent to find a solution that is "optimized" rather than just any solution.
c. The above requirement means we need to give a negative reward for each step that does not lead to the goal state.
d. The reward mechanism is not a prescription for solving the problem. It is simply a quantitative measure of the reward of the action in each step. It is the agent's responsibility to find a solution.
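The effect of a constant step penalty can be illustrated with a short Python sketch. The reward of -1 per non-goal move comes from the text above; the reward of 0 on reaching the goal, the function names and the two example paths are assumptions made for illustration.

```python
STEP_REWARD = -1   # from the text: every move that does not reach the goal
GOAL_REWARD = 0    # assumption: the text does not state the goal reward

def immediate_reward(next_cell, goal):
    """Reward received for a move that lands on next_cell."""
    return GOAL_REWARD if next_cell == goal else STEP_REWARD

def episode_return(path, goal):
    """Sum of immediate rewards along a path whose first cell is the start."""
    return sum(immediate_reward(cell, goal) for cell in path[1:])

goal = (0, 2)
# Two hypothetical paths to the same goal: 4 moves versus 6 moves.
short_path = [(0, 0), (1, 0), (1, 1), (1, 2), (0, 2)]
long_path = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2)]

short_return = episode_return(short_path, goal)  # -3
long_return = episode_return(long_path, goal)    # -5
```

Under these assumptions the shorter path accumulates the larger (less negative) return, which is what pushes the agent toward finishing sooner.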

D. Future Reward:
Suppose there are three allowable moves for an agent to choose from in a state. Which move is the best action for the agent to pick to achieve the goal faster? The word "best" in this context is more or less a "time" factor in achieving the goal. It means that the agent should not only get to the destination but also reach it as soon as possible.
One way to solve this problem is to let the agent know the best action to pick in each state. In this case, there is no learning. The designer of the system programs the agent to pick the best action based on logic derived from human judgment. Moreover, this approach is not scalable, since the designer cannot predict all possible scenarios that may happen to a robot. Figure 6 shows how we can program an agent to always go down when it is in a cell. In this case, the agent picks an action that gives a lesser penalty, i.e. a greater reward. Our model was set to let the agent learn these values "by itself". and 0. We also set the target state value to 0, since there is no more gain to offer after reaching the target cell.
In formula 1, the term (target - existing G value of the state) is a way to calculate the error. The G value is a cell-dependent value, i.e. each cell has its own value. So, in this case, our policy function (ℼ) selects the best move (up, down, right or left) based on the value of the states, if the idea is to exploit (not to explore).
The parameter α is what we call the "learning rate". It acts like a regulator in the formula. We have to find an appropriate value for this parameter, and it is one of the hyperparameters we tweak in our experiment.
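Formula 1 itself is not reproduced in this text. A form consistent with the description above is G(s) ← G(s) + α (target - G(s)). The Python sketch below applies that update while walking a finished episode backwards; the backward accumulation of the target, the function name `update_state_values` and the example numbers are assumptions, not the authors' confirmed implementation.

```python
def update_state_values(G, episode, alpha):
    """Update state values in place from one finished episode.

    episode is a list of (state, reward) pairs in visit order, where reward
    is the immediate reward received on entering that state. The target of
    a state is assumed to be the sum of rewards collected after it.
    """
    target = 0.0
    for state, reward in reversed(episode):
        error = target - G[state]      # the error term described in the text
        G[state] += alpha * error      # alpha regulates the step size
        target += reward
    return G

# Hypothetical three-state episode ending in the goal.
G = {"s0": -0.5, "s1": -0.5, "goal": 0.0}
episode = [("s0", -1), ("s1", -1), ("goal", 0)]
update_state_values(G, episode, alpha=0.1)
# G["goal"] stays 0.0, G["s1"] moves toward 0, G["s0"] moves toward -1
```

With α = 0.1 each value moves one tenth of the way toward its target per episode, so a larger α learns faster but reacts more strongly to a single exploratory episode.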

Exploit VS. Explore
When we know what actions are available in each cell (i.e. in each state), we must pick the one that gives the best gain in the longer term. What is the criterion for selecting the best action in each state? In our model, the criterion is the state value of the candidate cells to move to. We have to assign a value to each cell.
The value shows the long-term reward we get if we land on the cell. If we have 3 viable actions in a cell, each of those actions has a long-term reward associated with the value of the cell it leads to. We can, for example, select the best action 80% of the time and a random action the other 20% of the time. The 20% in this case is called "epsilon", and this method is called "epsilon-greedy".
In the code, we initialize the ε-greedy value to 0.25. That means that 25% of the time we choose a random action and 75% of the time we select the action that lands the robot on the cell with the best value. We anneal epsilon in the course of learning to reduce exploration and increase exploitation, reflecting the agent's growing familiarity with the environment as time passes. Epsilon is another hyperparameter that we tweaked in our optimization experiment.
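The selection rule and the annealing step can be sketched as follows. The initial epsilon of 0.25 comes from the text; the function names, the linear decay of 1e-4 per episode and the example values are assumptions, since the annealing schedule is not specified here.

```python
import random

def choose_action(values, candidates, epsilon, rng=random):
    """Epsilon-greedy choice among the valid actions of the current cell.

    candidates maps each valid action to the cell it leads to;
    values maps each cell to its current state value.
    """
    if rng.random() < epsilon:
        return rng.choice(list(candidates))                       # explore
    return max(candidates, key=lambda a: values[candidates[a]])   # exploit

def anneal(epsilon, decay=1e-4, floor=0.0):
    """Reduce epsilon after an episode; the linear schedule is an assumption."""
    return max(floor, epsilon - decay)

# Hypothetical cell with three viable moves.
values = {(0, 1): -5.0, (1, 0): -2.0, (1, 2): -7.0}
candidates = {"U": (0, 1), "L": (1, 0), "R": (1, 2)}

epsilon = 0.25
action = choose_action(values, candidates, epsilon, random.Random(0))
epsilon = anneal(epsilon)
```

Passing a seeded `random.Random` instance makes a run repeatable, which is convenient when comparing hyperparameter settings.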

Hyper-Parameters and Results
In the developed model, there are a few parameters whose impact on the learning process we have explored.
We have summarized our findings in the following tables. The hyperparameters that we have explored are:
a. the epsilon value in ε-greedy
b. the initial learning rate value (α)
c. changes in the learning rate during execution (annealing factor)
[Figure 9] illustrates one sample of the simulations. [Table 2] shows the best combination of the epsilon (ε) and alpha (α) hyperparameters in our examination. It seems that trial #5 (epsilon=0.25 and alpha=0.1) is an acceptable approximation for those hyperparameters. The configuration of the grid, the number of episodes, the values of the cells while learning was in progress and the elapsed times are shown in [Figure 9]. Initially, for an 8x8 maze, we set the number of episodes to a fixed number (3000 episodes). We observed that the final state values did not converge to the extent we wished. After increasing the number of episodes to 5000, we observed much better convergence, as shown in [Figure 10]. However, waiting from almost half an hour to one hour to finish 5000 episodes for such a small state environment is not feasible in practice, and it needs a lot of computation. As a result, we explored two optimization approaches in line with the hyperparameters of trial #5 mentioned in [Table 2].

Optimization by Backtracking
In the application of finding the optimal path, we do not need to wait such a long time to calculate those precise state values. We propose an optimization approach that finds the optimized path by backtracking recurrence.
We are convinced that the optimal path can be found much sooner by monitoring the cells visited in a few consecutive episodes and stopping when the same cells recur in consecutive episodes. By adding this idea to the logic, we experienced a 10 times improvement in the learning time.
In this section, the optimization approaches are described and the results are compared with those of the non-optimized runs. To implement the optimization approach, the optimized code monitors consecutive episodes and the steps used in those episodes to reach the destination, and if they are identical, the process stops. Initially, we monitored two consecutive episodes. However, we encountered an edge case in which two of the initial episodes selected the same cells although the path taken was not an optimized path. To avoid this edge case, we decided to monitor three consecutive episodes and to decide based on whether those three episodes are identical.
Implementing the optimization requires two changes. The code must store the track of the finished episodes and compare the result of each episode with the results of the last three episodes. These two changes add two steps to the execution of our code: 1) writing the steps to a file, and 2) comparing the last three episodes before starting the next episode. Obviously, introducing this I/O into the process carries a performance penalty, but it can be avoided by storing the results in memory instead of on disk. Although we have not implemented that, it is a way to make the code find the optimized path even faster.
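The in-memory variant mentioned above could look like the following Python sketch. It is only an illustration of the stopping rule under stated assumptions: the class name `RecurrenceMonitor` and its interface are hypothetical, and the paths are kept in a bounded queue rather than written to a file.

```python
from collections import deque

class RecurrenceMonitor:
    """Keeps the paths of the most recent episodes in memory."""

    def __init__(self, window=3):
        self.window = window
        self.recent = deque(maxlen=window)  # older paths drop out automatically

    def record(self, path):
        """Store a finished episode's path.

        Returns True when the last `window` episodes visited exactly the
        same cells in the same order, i.e. when training may stop.
        """
        self.recent.append(tuple(path))
        return len(self.recent) == self.window and len(set(self.recent)) == 1

# Hypothetical episode paths: the route stabilizes from the third episode on.
monitor = RecurrenceMonitor(window=3)
wandering = [(0, 0), (0, 1), (0, 0), (1, 0), (1, 1)]
stable = [(0, 0), (1, 0), (1, 1)]

decisions = [monitor.record(p) for p in (wandering, wandering, stable, stable, stable)]
# decisions == [False, False, False, False, True]
```

In a training loop, the call `monitor.record(path)` would be made once at the end of each episode and the loop would exit on the first True. Setting `window=2` reproduces the two-episode check that ran into the edge case described above.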
In an 8x8 maze, the elapsed time after optimization was 10 times faster or even better. This indicates that it is a better approach that brings considerable value, specifically in larger state spaces. As shown in Figure 11, the elapsed time is 22.4 minutes vs. 60.41 minutes before optimization (see Table 2, trial #5, which was the best value we obtained there). The number of episodes is reduced from 5000 to 697. [Figure 12] shows one instance of our experiment and the optimized path based on state values. We ran the optimized code three times to get an average time difference. The results are shown in [Table 3]. In our trials, the state-value based method saves around 5 minutes on average in an 8x8 maze. Although the state value optimization method is expected to outperform, and generally does, in some situations it falls behind the recurring path optimization method (an example is the third instance of our experiment in Table 3). This is more interesting when we compare the number of episodes in these two methods.
Figure 12: Performance Improvement In 8x8 Maze.
As illustrated in Table 3, the number of episodes in the state value method decreased drastically (for example, from 765 episodes to 73 episodes), almost 10 times fewer, while this is not true for the duration (i.e. we saved around 50% in time).
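The ratios behind these statements can be recomputed directly from the figures quoted in this section. The short Python sketch below does only that arithmetic; the variable names are ours and no additional measurements are introduced.

```python
def ratio(before, after):
    """How many times larger the 'before' figure is than the 'after' figure."""
    return before / after

# Elapsed time in minutes: non-optimized (Table 2, trial #5) vs. recurring-path run.
time_ratio = ratio(60.41, 22.4)           # about 2.70

# Episodes: fixed budget vs. recurring-path stopping point.
episode_ratio = ratio(5000, 697)          # about 7.17

# Episodes: recurring-path method vs. state value method (example from Table 3).
state_value_episode_ratio = ratio(765, 73)  # about 10.48
```

The episode counts shrink by a larger factor than the elapsed time does, which matches the observation that the reduction in episodes does not translate one-to-one into saved time.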

Discussion
Robot motion planning is usually decomposed into path planning and trajectory planning. In path planning, we need to generate a collision-free path in an environment with obstacles and optimize it with respect to some given criteria [18]. If the environment is static, we can generate the path in advance (i.e. planning), and this is the approach we have chosen in this paper.
As mentioned in the related work, there are many ways to solve the maze problem. Traditionally, there exist simple solutions for connected mazes. An instance of such a simple solution is the wall follower (sometimes called the "left-hand rule" or the "right-hand rule"), which simply walks forward while keeping the left hand or the right hand, respectively, on the wall at all times [19]. As shown in [Figure 12], at the point we have the state value -306.6941 (this state is shown in a rectangle in [Figure 9]), the next
Another scenario that we did not include in this paper is multi-agent environments. The main question in multi-agent environments is how all the robots get each other's updates. [20] has suggested that each agent solve a part of the maze and update the shared memory so that other robots also benefit from each other's
proves that we can gain around 3 times better performance with the above-mentioned configurations. We could use SSD and GPU processing in our configurations to get even better results. Another interesting observation is that while the average number of episodes for the recurring method on the cloud servers has not changed much compared to running the same method on the local laptop (799 episodes vs. 718 episodes), the duration has improved from 18.47 minutes on average to 5.54 minutes. This shows how much fog computing in the proximity of the robots can improve the pathfinding process.
Another observation is that the state value method in the cloud server has taken more episodes (~ 86 episodes) than running it
process. The results of our experiments showed that, in general, state value monitoring has higher performance than the recurring state monitoring approach. This is not always the case, but it held in the majority of cases. We presented how the optimized path showed itself before the 5000 episodes concluded and how we can make use of this information to find paths in a shorter time.
The two optimization approaches mentioned in this paper need more exploration and improvement, at least from the following two aspects: A) Finding out the scenarios in which the state value optimization algorithm performs slower than the recurring value algorithm. We also showed how offloading the processing to more dedicated computing servers can improve the pathfinding process.
The next step in the research is to compare the findings in this paper with state-of-the-art algorithms such as A*. In addition, such a comparison can be conducted in fog computing settings. The comparison can be made not only from the algorithmic perspective but also from the perspective of the constraints and the environment settings.
Athabasca University. This research paper contributes to the first author's PhD study at the University of Oviedo based on the MOU between the two universities. This research has been funded by the Spanish Ministry of Science and Innovation, under project MINECO-TIN2017-84804-R