A New Distributed Reinforcement Learning Approach for Multi-agent Cooperation Using Team-mate Modeling and Joint Action Generalization

Article history: Received: 02 January, 2020 Accepted: 22 February, 2020 Online: 09 March, 2020


Introduction
Reinforcement Learning (RL) addresses the question of an agent learning by interacting with its environment and analyzing the effects of these interactions, and it has been successfully applied in many single-agent systems [1,2]. In RL, learning takes place iteratively through trial and error, making it a suitable tool for dealing with complex and uncertain environments [3]. Given these properties, a growing interest has developed in recent years in extending reinforcement learning to multi-agent systems (MASs). MASs are applied to a wide variety of domains including robotic teams [4,5], air traffic management [6] and product delivery [7]. We are specifically interested in distributed cooperative mobile robots, where multiple cooperating robots can perform tasks faster and more efficiently than a single robot. The decentralized point of view provides many potential benefits such as speed-up, scalability and robustness [8,9]. In a cooperative system, the agents (robots) share common interests (e.g., the same reward function), so an increase in an individual's benefit also increases the benefit of the whole group [10,11].

In recent years, a growing body of research has extended reinforcement learning to MASs within the powerful framework of Markov games (MGs, also known as stochastic games, SGs), and many promising multi-agent reinforcement learning (MARL) algorithms have been proposed [10,12-14]. These methods fall into two broad categories: independent learners (ILs), where each agent only knows its own actions [14], and joint action learners (JALs), where each agent collects information about its own action choices as well as the choices of other agents [15]. Many algorithms derived from both ILs and JALs can learn the coordinated optimal joint behaviors and even provide certain convergence guarantees, but only in simple cooperative games (matrix games, few players, not fully stochastic environments, etc.). They fail when the domain becomes more complex [16,17], especially when the number of agents or the size of the state space increases, which leads to large memory requirements, slow learning and, consequently, more challenging coordination.

Since JALs offer good coordination results but suffer from the curse of dimensionality, a new kind of distributed multi-agent reinforcement learning algorithm, called ThMLA-JAG (Three-Model Learning Architecture based on Joint Action Generalization), is proposed here. The main idea is to decompose the coordination of all JALs into several two-agent coordinations. Validation tests on a pursuit problem show that the new method ensures an overall coordination of the multi-agent system as well as a great reduction of the amount of information managed by each learner, which considerably accelerates the learning process.
To the best of our knowledge, this is the first distributed MARL system that decomposes the multi-agent learning process into several two-agent learning systems. The system is decentralized in the sense that the learned parameters are split among the agents: each agent learns by interacting with its environment and observing the results of all agents' displacements. Hence no direct communication is required between learning agents, a communication that can be expensive or insufficient because of bad synchronization and/or limited communication range [18]. As such, the proposed approach differs from methods that learn when such communication is necessary [9,19,20]. We show in particular that the system remains effective even when agents' information is distorted by frequent environmental changes. The rest of the paper is organized as follows. In Section 2, we introduce some basic reinforcement learning concepts. In Section 3, the two existing RL methods on which our approach is based are presented. Section 4 presents the ThMLA-JAG method. Several experiments are conducted in Section 5, showing the efficiency of our proposals. Concluding remarks and future work are discussed in Section 6.

Reinforcement learning
Markov Decision Processes (MDPs) are often used to model single-agent problems. MDPs [21] are suitable for studying a wide range of optimization problems that have been solved by dynamic programming and reinforcement learning. They model scenarios where an agent has to decide how to behave based on the observation of the current state. More precisely, a Markov model is defined as a 4-tuple (S, A, T, R), where S is a discrete set of environmental states, A is a discrete set of agent actions, R : S × A → ℝ is a reward function and T : S × A → Π(S) is a state transition function (Π(S) is a probability distribution over S). We write T(s, a, s′) for the probability of making a transition from s to s′ when taking action a. The action-value function Q*(s, a) is defined as the expected infinite discounted sum of rewards that the agent will gain if it chooses action a in state s and then follows the optimal policy. Given Q*(s, a) for all state/action pairs, the optimal policy π* is obtained as the mapping from states to actions that maximizes the future reward [1]. One of the most relevant advances in reinforcement learning was the development of the Q-learning algorithm, an off-policy TD (temporal-difference) control algorithm [16,22]. Q-learning is especially useful when the model (T and R) is unknown. It directly maps states to actions by using a Q-function updated as in (1):

Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]   (1)

where α ∈ [0, 1] is the learning rate and γ ∈ [0, 1] is the discount factor.
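To make the update rule (1) concrete, the following is a minimal tabular Q-learning sketch in Python. The action names, exploration rate and overall structure are illustrative assumptions, not part of the paper.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.8, 0.9, 0.1          # learning rate, discount factor, exploration rate (assumed)
ACTIONS = ["up", "down", "left", "right", "stay"]  # assumed action set

Q = defaultdict(float)                          # Q[(state, action)] -> estimated return

def choose_action(state):
    """Epsilon-greedy action selection over the current Q estimates."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    """One application of update (1):
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```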
Despite its many successful results in single-agent cases, Q-learning cannot be directly applied to multi-agent systems. Learning becomes a much more complex task when moving from a single-agent to a multi-agent setting. One problem is the loss of the convergence hypotheses of the single-agent framework: the environment is no longer stationary from a single agent's perspective because many agents act on it, which is a commonly cited source of difficulty for multi-agent learning systems [23]. Another issue is the communication between agents: how to ensure a relevant information exchange for effective learning [24]? Finally, there is the problem of multi-agent coordination, i.e., ensuring a coherent joint behavior.
In what follows, several multi-agent learning methods derived from Q-learning are presented and classified according to how states and actions are defined.

Individual state / individual action
Many researchers have been interested in extending Q-learning to distributed MASs using cooperative ILs. They aim to design multiple agents capable of performing tasks faster and more reliably than a single agent. Examples include PA (Policy Averaging) [25], EC (Experience Counting) [26], D-DCM-Multi-Q (Distributed Dynamic Correlation Matrix based Multi-Q) [18], CMRL-MRMT (Cooperative Multi-agent learning approach based on the Most Recently Modified Transitions) [27] and CBG-LRVS [28]. Here, each entry of the Q-function corresponds to the individual state/action pair of the agent itself, as in single-agent learning, and direct communication is used to share information between the different learners. These methods also assume that all learners can be in the same state at the same time, which eliminates the coordination problem. They therefore succeed in establishing cooperative learning but become inapplicable when collisions between learners must be considered, as is the case for real mobile and autonomous robotic systems.

Joint state / individual action
To better address the coordination problem, many other learning algorithms that do take collisions between learners into account have been proposed in the literature, such as decentralized Q-learning [24], distributed Q-learning [16], WoLF-PHC [23], implicit coordination [13], recursive FMQ [17] and hysteretic learners [29]. In such methods, agents are fully cooperative, meaning that they all receive the same reward, and non-communicative, meaning that they cannot observe the actions of other agents. These methods are based on Markov games, i.e., each learner updates its Q-values using joint states (relative to all agents and possibly to other elements present in the environment) and individual actions (specific to the agent itself). Their performance can vary greatly between successful and unsuccessful coordination of the agents. As explained in [14], none of these algorithms fully solves the coordination problem. Complications arise from the need to balance exploration and exploitation to ensure efficiency, as in single-agent RL algorithms; but the exploration of one agent induces noise in the rewards received by the group and can destabilize the learned policies of the other ILs.
To address this problem, called the alter-exploration problem, many parameters on which convergence relies are introduced. The main drawbacks of these methods are therefore that parameter tuning is difficult and that, as the number of agents increases, coordination becomes harder and the alter-exploration problem becomes dominant. The learning speed is also strongly affected when the number of agents and/or the size of the environment changes, because of the use of joint states.

Joint state / joint action
Given the limitations of ILs, methods using JALs are much more widely used for stochastic games [15]. They provide distributed learning while avoiding collisions between agents. They solve the problems of ILs since each learner considers the state of the whole system, i.e., the Q-values are updated using joint states and joint actions. However, JALs suffer from a combinatorial explosion of the size of the state-action space with the number of agents, since each agent learns the value of joint actions, contrary to ILs, whose state space size is independent of the number of agents.
As an example, TM-LM-ASM (Team-mate Model - Learning Model - Action Selection Model) [30] is a JAL method that combines traditional Q-learning with a team-mate modeling mechanism. To do so, each learner memorizes a table Q storing all possible joint state/action pairs and a table P storing all possible joint state/action pairs except its own action. The Q table is used by the learning model, similarly to the Q-learning method, and the P table is used by the team-mate model to estimate the behavior strategies of the other agents. The strategy used by the learner to select an action is then implemented by the action selection model using both the P and Q tables. Experiments done on a pursuit game [30] with two predators and one moving prey show the effectiveness of the TM-LM-ASM method and that it ensures a global coordination without the need for direct communication between the different members. However, the obtained results are limited to two-agent systems. A problem of state space explosion clearly appears when the number of agents increases, because of the use of joint states and joint actions. As an example, consider the case of N_agents agents able to execute N_actions actions in an L × L grid world. If the state of every agent corresponds to its position in the grid, every learner has to store a table Q containing (L^2 · N_actions)^{N_agents} entries and a table P having (L^2 · N_actions)^{N_agents} / N_actions entries, in addition to the significant number of possible team-mate action combinations to consider when updating the table P and when choosing the next action to execute, namely N_actions^{N_agents − 1} combinations. If we examine the case of 4 agents which can each perform 5 actions in a 10 × 10 grid world, we obtain:
• 625 · 10^8 entries in the table Q,
• 125 · 10^8 entries in the table P,
• 125 possible combinations of team-mates' actions in every state.
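These combinatorics can be checked with a few lines of Python; the function name is ours, and the call simply reproduces the 4-agent, 5-action, 10 × 10 example above.

```python
def tm_lm_asm_sizes(grid_side, n_actions, n_agents):
    """Rough table sizes for a joint-state/joint-action learner (TM-LM-ASM style)."""
    n_positions = grid_side ** 2                              # individual states per agent
    q_entries = (n_positions * n_actions) ** n_agents         # joint states x joint actions
    p_entries = q_entries // n_actions                        # same, minus the learner's own action
    combos = n_actions ** (n_agents - 1)                      # team-mate action combinations per step
    return q_entries, p_entries, combos

print(tm_lm_asm_sizes(10, 5, 4))   # -> (62500000000, 12500000000, 125)
```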
As stated in the theoretical convergence analysis of the Q-learning algorithm [16], an optimal policy is only reached if every state/action pair is visited infinitely often. For the example described above, even a first visit of each state/action pair is far from obvious and requires a long time. The amount of memory needed to store the tables P and Q is also significant, and a large computing power is required so that all possible combinations of team-mates' actions can be taken into account, and this at every learning step (during the updates of P and Q and when choosing actions).

Learning by pair of agents
To solve the state space explosion problem while ensuring a satisfactory multi-agent coordination, Lawson and Mairesse [19] proposed a new method, referred to here as the JAG method, inspired by both the IL and JAL approaches. The main idea is to learn a joint Q-function for two agents and to generalize it to any number of agents. Considering a system of N_agents agents, at each learning step every agent has to communicate with its (N_agents − 1) team-mates, update its (N_agents − 1) two-agent Q tables and then identify the next joint action to follow. Note that the same joint action must be chosen by all learners. Additional computations are therefore required: the most promising joint action is not read directly from a joint table Q, as in the case of joint action learners, but requires evaluating the sum of (N_agents − 1) Q-values; likewise, (N_agents − 1) updates of Q are done instead of a single update of the joint table. However, this increase in required computing power is widely compensated by the advantages that this method presents compared with the approaches using ILs or JALs, namely:
• By learning through pairs of agents, the JAG method ensures a huge reduction in the number of state/action pairs, which accelerates the convergence of the system. As illustrated in Table 1, these memory savings widen as the environment size and/or the number of agents increases.
• Contrary to independent learners, the JAG method provides a global coordination because it evaluates all possible joint actions of every pair of agents and chooses the global joint action (of the whole multi-agent system) by maximizing the Q-value of each of these pairs. This relies on the assumption that a global coordination is obtained by coordinating each pair of agents, provided that the policy obtained for two agents is optimal.
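One plausible reading of this pairwise selection, sketched below in Python, is that the global joint action maximizes the sum of pairwise Q-values; the data layout (a table per pair, keyed by the pair's states and actions) and all names are our assumptions, not the original implementation.

```python
import itertools

def jag_joint_action(Q_pair, states, actions):
    """Pick a global joint action by maximizing the sum of pairwise Q-values.

    Q_pair[(i, j)] maps (s_i, s_j, a_i, a_j) -> value; `states` lists the agents'
    individual states and `actions` the individual actions (assumed names).
    The enumeration is exponential in the number of agents, but only over
    individual actions, not over joint states."""
    n = len(states)
    best_joint, best_value = None, float("-inf")
    for joint in itertools.product(actions, repeat=n):        # all candidate joint actions
        value = sum(
            Q_pair[(i, j)].get((states[i], states[j], joint[i], joint[j]), 0.0)
            for i in range(n) for j in range(i + 1, n)        # every pair of agents
        )
        if value > best_value:
            best_joint, best_value = joint, value
    return best_joint
```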
Several experiments were carried out by Lawson and Mairesse [19] on the problem of navigating a group of agents in a discrete and dynamic environment, so that each of them reaches its destination as quickly as possible while avoiding obstacles and other agents. The results of these experiments confirm the advantages mentioned above, namely an acceleration of learning with a global coordination, as in the case of JALs. However, the distribution of learning does not simplify the task: it results in an identical treatment of the same information by all agents. More precisely, at every iteration, agents communicate their new states and rewards to all their team-mates. As a consequence, each of them updates the Q-entries of all pairs of agents and executes the JAG procedure to choose the next joint action. Another communication round is then necessary to make sure that the same joint action is chosen by all learners. The JAG method is thus less expensive when adopting a centralized architecture: a single agent is responsible for updating the information and choosing the actions, and informs the other agents of their corresponding actions; in return, every agent executes its own action and informs the central entity of its new state and the resulting reward.

Proposed reinforcement learning algorithm
As explained earlier, the JAG method [19] ensures a global coordination while using independent learners, but it needs a centralized process to make sure that all agents choose the same joint action at each learning step, whereas the TM-LM-ASM method [30] is a fully distributed learning approach that also provides a global coordination but employs joint action learners, which makes it unsuitable for systems with many agents and/or large state spaces. Our objective is to develop a new intermediate approach between TM-LM-ASM and JAG. The main idea is to generalize the TM-LM-ASM architecture, learned by two agents, to a larger number of agents while using the JAG decomposition instead of joint state-action pairs. The new proposed approach is called ThMLA-JAG (Three-Model Learning Architecture using Joint Action Generalization).
According to the ThMLA-JAG method, the multi-agent coordination process is divided into several two-agent learning tasks. Assuming that the system contains N_agents agents, each agent must memorize (N_agents − 1) two-agent P-tables for the team-mates' model and (N_agents − 1) two-agent Q-tables for the learning model, as sketched below.
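The following minimal sketch shows what this bookkeeping could look like for one learner, assuming the tables are stored as Python dictionaries; the class and attribute names are ours, and the initial values anticipate the parameter setting reported later in the paper.

```python
from collections import defaultdict

class ThMLAJAGLearner:
    """Bookkeeping for one learner i: one P-table and one Q-table per team-mate j.

    Keys follow the pairwise decomposition described in the text:
      P[j][(s_i, s_j, a_j)]       -> estimated probability that j plays a_j in (s_i, s_j)
      Q[j][(s_i, s_j, a_i, a_j)]  -> value of the pair's joint state/action
    Initial values (0.2 and 0.04) follow the parameter setting given in the
    experimental section; they are repeated here only for completeness."""

    def __init__(self, agent_id, teammate_ids, n_actions=5):
        self.agent_id = agent_id
        self.actions = list(range(n_actions))
        self.P = {j: defaultdict(lambda: 1.0 / n_actions) for j in teammate_ids}
        self.Q = {j: defaultdict(lambda: 0.04) for j in teammate_ids}
```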

The team-mates' model
Consider a system of N_agents learning agents in a joint state s = (s_1, ..., s_{N_agents}). At a learning step t, the learning agent i under consideration executes an action a*_i and observes its partners' actions (a*_1, ..., a*_{i−1}, a*_{i+1}, ..., a*_{N_agents}), the new joint state s′ = (s′_1, ..., s′_{N_agents}) and the resulting reward r. It then updates (N_agents − 1) P-tables, one for each partner j and for each action a_j that this agent j could try in the experienced pairwise state s = (s_i, s_j). The table P_ij concerning the pair of agents (i, j) is updated using (2), as in the TM-LM-ASM method applied to two-agent systems.
where β ∈ [0, 1] is the learning rate that determines the effect of the previous action distribution, and T is the number of iterations needed for task completion. Here s and s′ are the joint states giving the instantaneous positions of the pair of agents (i, j) in the environment.
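Equation (2) itself does not survive in this version of the text. The sketch below is therefore only a plausible recency-weighted frequency update consistent with the description above (β weighting the previous action distribution); the exact role of T in the original formula may differ, and the function name is ours.

```python
def update_teammate_model(P_ij, s_i, s_j, observed_a_j, actions_j, beta=0.8):
    """Hedged sketch of a team-mate model update for the pair (i, j).

    The previous distribution over j's actions in the pairwise state (s_i, s_j)
    is weighted by beta and the observed action is reinforced, so that P_ij
    remains a probability distribution over A_j. This mirrors the description
    of (2) but is not guaranteed to match the original formula."""
    for a_j in actions_j:
        target = 1.0 if a_j == observed_a_j else 0.0
        P_ij[(s_i, s_j, a_j)] = beta * P_ij[(s_i, s_j, a_j)] + (1.0 - beta) * target
```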

The learning model
As in the JAG method, (N_agents − 1) updates are made by each learner at each learning step. Each update concerns an entry of one of the (N_agents − 1) Q-tables stored by this learner and is relative to one of its partners. If a learner i executed the transition (s_i, a_i) → s′_i while another learner j executed the transition (s_j, a_j) → s′_j, the table Q_ij corresponding to the pair (i, j), with joint state s = (s_i, s_j) and joint action a = (a_i, a_j), is updated using (3), as when applying the TM-LM-ASM method to two agents.
The team-mate model is thus exploited by the learning model. More precisely, the best move in the next state s′ depends on the next action a′_i of the agent i under consideration and on the next action a′_j of its partner j. Since the Q-function of the latter is not known by agent i, a prediction of its action is made using the memorized team-mate model, while the action that agent i itself should preferably execute in the next state is determined by exploiting its own Q-function.
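Equation (3) is likewise missing from the extracted text. The sketch below encodes the update as described: a Q-learning step on the pairwise table whose bootstrap term uses agent i's greedy action together with the team-mate action predicted by P_ij. The tables are assumed to behave like the defaultdicts of the earlier sketch, and all names are ours.

```python
def predict_teammate_action(P_ij, s_i, s_j, actions_j):
    """Most probable next action of team-mate j according to the team-mate model."""
    return max(actions_j, key=lambda a_j: P_ij[(s_i, s_j, a_j)])

def update_pair_q(Q_ij, P_ij, s, a, r, s_next, actions_i, actions_j,
                  alpha=0.8, gamma=0.9):
    """Hedged sketch of update (3) for the pair (i, j).

    s = (s_i, s_j), a = (a_i, a_j); the bootstrap value is taken at agent i's
    greedy action in s_next and at the team-mate action predicted by P_ij."""
    s_i_next, s_j_next = s_next
    a_j_pred = predict_teammate_action(P_ij, s_i_next, s_j_next, actions_j)
    best_next = max(Q_ij[(s_i_next, s_j_next, a_i, a_j_pred)] for a_i in actions_i)
    key = (s[0], s[1], a[0], a[1])
    Q_ij[key] += alpha * (r + gamma * best_next - Q_ij[key])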

The action selection model
For a better selection of the next action to execute, the action selection model proposed by Zhou and Shen [30] for 2 agents can be generalized to N_agents agents (N_agents ≥ 2) by exploiting all the P-tables and Q-tables stored by the learner. To this end, for each agent i having (N_agents − 1) team-mates and standing in a joint state s = (s_1, ..., s_i, ..., s_{N_agents}), the action a*_i to be executed in that state s is selected using (4), where (a*_1, ..., a*_{i−1}, a*_{i+1}, ..., a*_{N_agents}) are the actions predicted by agent i as the next actions of its partners in the joint state s, A_i is the set of possible actions of agent i, and V(a_i | a*_1, ..., a*_{i−1}, a*_{i+1}, ..., a*_{N_agents}) is the conditional expected value of action a_i of agent i given the team-mates' model. For all a_i ∈ A_i, V(a_i | a*_1, ..., a*_{i−1}, a*_{i+1}, ..., a*_{N_agents}) is calculated using (5), where (s_i, s_j) is the current state of agent i and its team-mate j, and A_j is the set of all possible actions of agent j.
At the beginning of learning, all possible team-mate actions are considered equally and no particular joint action is predicted. This is because, for each pair of agents (i, j), P_ij and Q_ij initially have the same values for all state/action pairs. As learning progresses or new circumstances occur (such as collisions or environmental changes), particular sets of state/action pairs become favoured through the increase of their corresponding P_ij and Q_ij values. Different values V are therefore attributed to each action a_i of agent i, and a greedy choice following (4) allows agent i to select the most promising action a*_i in the current state s.
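Since equations (4) and (5) are not reproduced in the extracted text, the following sketch only reflects the description above and the later remark that V(a_i) is obtained by summing P-entry × Q-entry products: the value of each candidate action a_i accumulates, over every team-mate j and every action a_j ∈ A_j, the product P_ij(s_i, s_j, a_j) · Q_ij(s_i, s_j, a_i, a_j), and the greedy action is returned. The tables are assumed to be the per-team-mate dictionaries of the earlier sketch; names are ours.

```python
def action_value(a_i, s_i, teammate_states, P, Q, actions):
    """Conditional value V(a_i) in the spirit of (5): for every team-mate j and
    every action a_j, accumulate P_ij(s_i, s_j, a_j) * Q_ij(s_i, s_j, a_i, a_j)."""
    value = 0.0
    for j, s_j in teammate_states.items():
        for a_j in actions:
            value += P[j][(s_i, s_j, a_j)] * Q[j][(s_i, s_j, a_i, a_j)]
    return value

def select_action(s_i, teammate_states, P, Q, actions):
    """Greedy choice in the spirit of (4): the action a_i maximizing V(a_i)."""
    return max(actions, key=lambda a_i: action_value(a_i, s_i, teammate_states, P, Q, actions))
```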

Improving multi-agent coordination
When updating a table Q_ij related to the team-mate j, the learner i has to predict the most promising joint action (a′_i, a′_j) in the next state. As described in (3), its own predicted action is the one maximizing Q_ij in that state. However, our proposal assumes that each agent keeps (N_agents − 1) Q-tables corresponding to its (N_agents − 1) partners, and all of these Q-tables are exploited when choosing the action that is actually executed. Thus, by referring to the single table Q_ij when predicting a possible next action, the predicted action will not necessarily match the one that is really executed in the next state. As a result, the learner can oscillate indefinitely between two consecutive states if their values are updated according to a prediction different from what is actually experienced. It is therefore better to predict the action maximizing the value of the next state using the maximum amount of information, namely all the Q-tables and P-tables stored by the learner. The updated version of (3) is given by (6), where a*_j is identified as before and a*_i is computed with the action selection model of (4), i.e., the same strategy used to select the real actions. This modification of the Q-function update is further justified in the experimental section.
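As equation (6) is also missing from this version of the text, the sketch below only illustrates the described change with respect to the previous update sketch: agent i's next action is predicted with the full action selection model (supplied here as a function argument, e.g. the select_action routine sketched above) rather than by maximizing the single table Q_ij. All names are ours.

```python
def update_pair_q_v2(Q_ij, P_ij, s, a, r, s_next, select_action_fn, actions,
                     alpha=0.8, gamma=0.9):
    """Hedged sketch of the variant update (6) for the pair (i, j).

    Unlike (3), a*_i is predicted by the full action selection model
    (select_action_fn, which may consult all P- and Q-tables of the learner),
    while a*_j is still the team-mate action judged most probable by P_ij."""
    s_i_next, s_j_next = s_next
    a_j_pred = max(actions, key=lambda a_j: P_ij[(s_i_next, s_j_next, a_j)])
    a_i_pred = select_action_fn(s_i_next)        # same policy as for the real action choices
    bootstrap = Q_ij[(s_i_next, s_j_next, a_i_pred, a_j_pred)]
    key = (s[0], s[1], a[0], a[1])
    Q_ij[key] += alpha * (r + gamma * bootstrap - Q_ij[key])
```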

The proposed learning algorithm
A possible extension of the algorithm proposed by Lawson and Mairesse [19] to a system with more than two agents is given by Algorithm 1 and illustrated in Figure 1. Note that all agents learn synchronously; in particular, each learner must:
• update its Q-tables (step 23 of Algorithm 1); the two alternatives for updating a Q-table, (3) and (6), will be compared in the experimental section,
• test the end of learning or the start of a new episode (steps 25 and 26 of Algorithm 1),
• move to a new iteration within the same episode (steps 27 and 28 of Algorithm 1).
It should also be noted that the reward is defined per pair of agents, in accordance with the learning architecture, i.e., every two agents receive a specific reward describing the result of their own displacement. The core of Algorithm 1, executed by each learner i at every iteration t, is the following:

    Choose an action a_{i,t} following equation (4)
    Execute a_{i,t} and observe the actions simultaneously executed by the team-mates
    for each team-mate j = 1, ..., N_agents, j ≠ i do
        Determine the new state s_{t+1} = (s_{i,t+1}, s_{j,t+1}) and the corresponding reward r_{t+1} = r_{ij}
        Update P_{ij}(s_{i,t}, s_{j,t}, a_j) following equation (2), for all a_j ∈ A_j
        Update Q_{ij}(s_{i,t}, s_{j,t}, a_{i,t}, a_{j,t}) following equation (3) or equation (6),
            where a_{j,t} is the simultaneous action of agent j
    end for
    if the new state s_{t+1} is the goal state then
        the episode ends, go to step 11
    else
        t ← t + 1, go to step 15
    end if

In what follows, we examine two test cases:
• In the first case (S1: a temporarily dynamic environment), the prey moves once during learning and each predator should be able to build an optimal path from its starting position to the target, avoiding obstacles, the other predators and the prey, and to correct this path after each environmental change. An episode ends when the prey is captured or after 1000 iterations. Initially, the environment is in the form of Figure 2; after 1200 episodes, the prey moves to a new position as shown in Figure 2-b (Figure 2: the testing environment of S1). Figure 3 describes the environment used to test this scenario. Once an episode is over, the agents are relocated at the right extremity of the labyrinth and a new trial begins. Because collisions between agents are not permitted, and in order to test a system containing more than four agents, we consider that the prey is captured when each hunter is in one of its eight neighboring cells, corners included; Figure 3-a shows a possible capture position.
• In the second case (S2: a permanently dynamic environment), the prey moves throughout learning in an environment devoid of obstacles. At each learning step, the prey has a probability of 0.2 to remain motionless or to move towards vertically or horizontally neighboring cells. The hunters may share the same position, except the one containing the prey. These agents start at random positions and move according to the 5 above-mentioned actions in a synchronized way until they catch the prey or the current episode exceeds 1000 iterations. Once an episode is over, the agents are relocated at new random positions and a new episode begins. The prey is captured when all hunters are positioned in the vertically and/or horizontally neighboring cells of the prey.
In the rest of the paper, the term agent refers only to a predator (a learner). Some experiments are conducted while varying the number of agents. The results reported below are averaged over 30 runs of 2200 episodes each.

Communication between agents
Communication between the different agents must be ensured by an autonomous multi-agent network whose purpose is to exchange data between the different nodes while meeting and maintaining certain communication performance requirements (coordination, synchronization of messages and cooperation). As wireless technology has opened a new era for robotics, where robots are networked and cooperate with sensors and actuators, we have defined a wireless communication scheme allowing the different agents to cooperate and to exchange data locally within the multi-agent system. A wireless access point operates on the command link of a mobile agent; this link carries the data signals for the other agents.

Parameter setting
As with most RL methods, the main difficulty met during the implementation of the ThMLA-JAG algorithm (Algorithm 1) is the initialization of its parameters. Several values were tried, and the best configuration was then adopted for the test cases described above. The same distribution of rewards is used in both scenarios, namely:
• a penalty of 0.9 is applied to every agent striking an obstacle and/or to every pair of colliding agents,
• a reward of 3 is received by every pair of agents having captured the prey,
• a reward of −0.05 is given to every agent moving to a new state without colliding or capturing the prey.
The other parameters are initialized as follows:
• the learning rates α = 0.8 and β = 0.8,
• the discount factor γ = 0.9,
• the Q-values (values of the Q-tables) are set to 0.04,
• the P-values (values of the P-tables) are set to 0.2.
Finally, the ThMLA-JAG method is tested in scenarios S1 and S2 while comparing the two ways of updating the table Q:
• according to (3), where the Q-table concerning one team-mate is updated using only the information saved by the current learner about this team-mate; in this case, the tested learning method is denoted ThMLA-JAG1,
• according to (6), where the Q-table concerning one team-mate is updated using all the information saved by the current learner, including that concerning the other team-mates; this variant is denoted ThMLA-JAG2.
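For reference, the reward scheme and hyperparameters above can be grouped in one place as below; the values are copied from the text, while the structure and the sign convention for the penalty (encoded as a negative reward) are our assumptions.

```python
# Hyperparameters and pairwise reward scheme reported in the text
# (the penalty of 0.9 is encoded as -0.9; this sign convention is an assumption).
PARAMS = {
    "alpha": 0.8,      # learning rate of the Q-update
    "beta": 0.8,       # learning rate of the team-mate model
    "gamma": 0.9,      # discount factor
    "q_init": 0.04,    # initial Q-values
    "p_init": 0.2,     # initial P-values (uniform over 5 actions)
}

def pair_reward(collided, captured):
    """Reward received by a pair of agents after a synchronized move."""
    if collided:
        return -0.9    # striking an obstacle or colliding with the team-mate
    if captured:
        return 3.0     # the pair has captured the prey
    return -0.05       # ordinary move: small step cost
```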

Memory savings
With the ThMLA-JAG method, and regardless of how the Q-tables are updated, a large amount of memory is saved compared with the TM-LM-ASM method (a JAL method) as well as with the hysteretic Q-learning method (an IL method), and this saving grows as the number of agents increases. Table 2 illustrates this result.

Computation savings
In addition to the memory savings, important computation savings are provided by the ThMLA-JAG1 method; they are described in Table 3. Note that a P-entry (resp. a Q-entry) designates an entry in the table P (resp. Q).

Comparing the two variants of ThMLA-JAG
In this section, we will compare ThMLA-JAG1 and ThMLA-JAG2 using 4-agent systems in both test cases S1 and S2.
Testing S1

As shown in Figure 4 and Table 4, both ThMLA-JAG1 and ThMLA-JAG2 lead to convergence of the system before and after the environmental change. Figure 4-a shows the number of iterations needed for each episode over time, while Figure 4-b gives the number of collisions in each learning episode. In both environments, the agents succeed in catching the prey without colliding with each other or with obstacles, after about 850 episodes in the first environmental form and 450 episodes after moving the target. On the other hand, when a Q-table is updated using only the information of the corresponding team-mate (ThMLA-JAG1), the number of collisions is lower than when exploiting the information of all the other team-mates (ThMLA-JAG2). This is because, with ThMLA-JAG2, an action that previously led to collisions is more likely to be chosen again, especially if those collisions involved only some members of the multi-agent system.
In contrast, learning with ThMLA-JAG1 is much slower than with ThMLA-JAG2, as shown in Figure 4-a and Table 4. With ThMLA-JAG1, the first episodes are longer and the adaptation to the new environmental form requires more iterations than with ThMLA-JAG2. Moreover, after the target is displaced, agents following ThMLA-JAG1 may fail to find a path leading to its new position: some runs of the ThMLA-JAG1 method ended without a successful adaptation to the new environmental form. This explains why the average length of the constructed path is 80 steps with ThMLA-JAG1 and only 12 steps with ThMLA-JAG2. Besides, the collision curves of ThMLA-JAG1 (Figure 4-b) show a few collisions remaining even after the system has converged; these collisions correspond to the failed runs. The weakness of ThMLA-JAG1 comes from the possible difference between the action predicted when updating the Q-value and the action actually chosen by the adopted policy. This difference can trap the learner between consecutive states: since the updates of Q no longer depend on the recent movements, the Q-values remain invariant after some updates and the agent cannot modify or correct them.

Testing S2
The same holds for scenario S2. Figure 5 shows the number of captures achieved every 1000 episodes by two systems of 4 learning agents following ThMLA-JAG1 and ThMLA-JAG2, respectively. The results show that the number of captures after convergence is much larger and more stable with ThMLA-JAG2: after 16 · 10^4 iterations, the number of captures varies only slightly with ThMLA-JAG2 (because of the movements of the prey) but degrades significantly with ThMLA-JAG1. This difference in learning performance lies in the fact that, with ThMLA-JAG2, the Q-update is more consistent with the adopted exploration/exploitation policy, whereas with ThMLA-JAG1 the system fails to converge to a final solution because the saved information can be considerably distorted by the movement of the prey, which can occur at every stage of learning.

Effect of increasing the number of agents on the learning performance
In this section, we aim to evaluate the ThMLA-JAG2 method in both cases S1 and S2 while varying the number of agents from 2 to 6.
Case of S1

From Figure 6, we can see that the multi-agent system converges to a near optimal and collision-free path, regardless of the number of agents. Likewise, the agents succeed in adapting to the new environmental form once the prey has been moved, and the learning time is not considerably lengthened by the addition of new agents. As shown in Table 5, the 4-agent system needs more episodes to converge than the 2-agent system but fewer iterations and collisions. As for the 6-agent system, learning is slightly slower than in the other cases.

Case of S2
Promising results are also obtained in case S2. Figure 7 shows the number of captures achieved every 1000 episodes by three systems containing 2, 4 and 6 agents and using the ThMLA-JAG2 method. The results show that learning is accelerated by adding new agents:
• From the beginning of learning to the 15 · 10^4-th iteration, the number of captures per 1000 episodes increases over time and is all the larger as the agents are more numerous.
• From the 15 · 10^4-th to the 42 · 10^4-th iteration, the number of captures keeps increasing for the 2-agent system and converges to approximately 140 captures per 1000 episodes for the 4-agent and 6-agent systems, with small oscillations due to the movements of the prey.

Conclusion
In this paper, we have studied the problem of cooperative learning while avoiding collisions between agents. For that purpose, a new learning method, called ThMLA-JAG, has been proposed. With this method, a global coordination is ensured between agents, together with a great reduction in the amount of stored information and in the learning computations with respect to classic joint RL methods. This is due to the decomposition of learning into pairs of agents, which considerably accelerates the learning process and partially resolves the problem of state space explosion. This research is still in its early stages. The experimental results so far indicate that the proposed method is a good alternative to RL algorithms for distributed decision making in cooperative multi-agent systems. However, all the tests conducted on the ThMLA-JAG method are restricted to small environments with a limited number of agents. We expect to further improve this work by extending it to more complex scenarios. Possible test cases include more agents, several targets, as well as continuous state spaces. Such complex elements typically describe real-world applications.
Furthermore, all agents in our current Markov game model can observe the global state space, which may not be feasible in reality. More research into how agents can still collaborate efficiently when only partial state information is available would be worthwhile. In addition, the study of fully competitive MARL methods is also a promising direction for future research.