lib.agents package

Submodules

lib.agents.Agent_DQN module

class lib.agents.Agent_DQN.Agent_DQN(env, memory, qNetworkSlow, qNetworkFast, numActions, gamma, device='cpu')[source]

Bases: object

A class allowing the training of the DQN

This class is intended to be used by functions within the lib.agents.trainAgents module.

The DQN algorithm was proposed a few years ago as a way of improving upon traditional reinforcement learning by extending it to deep reinforcement learning. This class allows you to easily set up a DQN learning framework. The class does not care about the type of environment; it only requires that the agent chooses one of a finite number of actions, and that each action at a particular state has an associated Q-value. The algorithm attempts to find the right Q-value for each action.

The class itself does not care about the specifics of the state, or about the internals of the Q-networks that calculate the results. It is up to the user to specify the right environment and the associated networks that will allow the algorithm to solve the Bellman equation.

Link to the paper: https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf

Parameters:
  • env (instance of an Env class) – The environment that will be used for generating the result of a particular action in the current state
  • memory (instance of the Memory class) – The memory buffer that allows one to store and retrieve previously observed states that can be trained upon.
  • qNetworkSlow (neural network instance) – A neural network instance that can be used for converting a state into a set of Q-values. This is the slower version, used for making predictions, and is never trained directly. Its parameters are updated gradually over time so that it slowly converges to the right values.
  • qNetworkFast (neural network instance) – The instance of the faster network that is used for training the Q-learning algorithm. This is the main network that implements the Bellman equation.
  • numActions (int) – The number of discrete actions that the current environment can accept.
  • gamma (float) – The discount factor (currently not used).
  • device (str, optional) – The device where you want to run your algorithm, by default 'cpu'. If you want to run the optimization on a particular GPU, you may specify that, for example with 'cuda:0'.
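
As a rough orientation, a typical construction might look like the sketch below. The Agent_DQN signature follows the documentation above; the environment wrapper and replay memory are hypothetical placeholders for whatever Env and Memory implementations you actually use.

```python
# Sketch only: myEnvWrapper and myReplayMemory are hypothetical stand-ins for
# your own Env and Memory implementations.
import torch
from lib.agents.Agent_DQN import Agent_DQN
from lib.agents.qNetwork import qNetworkDiscrete

stateSize, numActions = 8, 4

env = myEnvWrapper()         # hypothetical Env instance
memory = myReplayMemory()    # hypothetical Memory instance

qSlow = qNetworkDiscrete(stateSize, numActions)   # slow (target) network
qFast = qNetworkDiscrete(stateSize, numActions)   # fast (trained) network

agent = Agent_DQN(env, memory, qSlow, qFast,
                  numActions=numActions, gamma=0.99,
                  device='cuda:0' if torch.cuda.is_available() else 'cpu')
```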

checkTrainingMode()[source]

prints whether the networks are in training or eval mode

This function allows us to determine whether the networks are in training or evaluation mode. This is important for several reasons: specifically, to make sure that the networks are not evaluated unexpectedly, and that layers such as batch normalization and dropout behave appropriately.

epsGreedyAction(state, eps=0.999)[source]

epsilon greedy action

This is the epsilon-greedy action. In general, it selects the greedy (maximum-Q) action a fraction eps of the time, and a random action the rest of the time. It is assumed that the value of eps is between 0 and 1.

Parameters:
  • state ({ndarray}) – numpy array or tensor containing the state(s) for which actions should be selected.
  • eps (float, optional) – Determines the fraction of times the max action is selected rather than a random action. (the default is 0.999)
Returns:

The 1d tensor that has an action for each state provided.

Return type:

tensor
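
The selection rule described above can be summarised with the following standalone sketch (not the library's implementation); note that here eps is the fraction of greedy choices, matching the description above.

```python
# Standalone sketch of the rule described above: with probability eps take the
# greedy (max-Q) action, otherwise take a uniformly random action.
import torch

def eps_greedy(qValues, eps=0.999):
    # qValues: tensor of shape (nStates, nActions); returns one action per state
    nStates, nActions = qValues.shape
    greedy = qValues.argmax(dim=1)
    random = torch.randint(0, nActions, (nStates,))
    takeGreedy = torch.rand(nStates) < eps
    return torch.where(takeGreedy, greedy, random)
```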

eval()[source]

put both the networks in eval mode

This makes sure that the networks are not inadvertently used in training mode, so that layers such as dropout and batch normalization behave deterministically during evaluation.

fastUpdate(tau=1)[source]

update the fast network slightly

This updates the fast network slightly. The amount is dictated by tau, which should be a number between 0 and 1: a fraction tau of the fast network's weights is replaced with the slow network's weights. This is done to provide stability to the network.

Parameters:tau ({number}, optional) – Determines how much of the slow network's weights are copied into the fast network's parameters (the default is 1)
load(folder, name, map_location=None)[source]

load the model

An agent saved with the save command can be safely loaded with this command. This loads both the qNetworks as well as the memory buffer. If you do not want to load the model onto the same device it was saved from, specify the device to load the model onto via map_location.

Parameters:
  • folder ({str}) – folder from which the model should be loaded.
  • name ({str}) – The name associated with the model to load. It is entirely possible to save a number of models within the same folder, and the name identifies which model to retrieve.
  • map_location ({str}, optional) – The device onto which to load the file. This is a string like 'cpu', 'cuda:0' etc. (the default is None, which results in the model being loaded onto the original device)
maxAction(state)[source]

returns the action that maximizes the Q function

Given a set of states, this function returns a set of actions, one for each supplied state, that maximize the value of the Q network.

Parameters:state ({nd_array or tensor}) – numpy array or tensor containing the state. The columns represent the different parts of the state.
Returns:The actions that maximize the Q value for each supplied state
Return type:ndarray
memoryUpdateEpisode(policy, maxSteps=1000, minScoreToAdd=None)[source]

update the memory

Given a particular policy, this method uses the policy to generate a series of memories and updates the memory buffer. Generating memories is easier to do using this function than an external function …

Parameters:
  • policy ({function}) – This is a function that takes a state and returns an action. This defines how the agent will explore the environment by changing the exploration/exploitation scale.
  • maxSteps ({number}, optional) – The maximum number of steps that one should have within an episode. (the default is 1000)
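
For example, assuming an agent constructed as in the earlier sketch, the memory buffer could be filled with an epsilon-greedy exploration policy like this (illustrative only):

```python
# Illustrative sketch: wrap the agent's own epsilon-greedy action as the policy.
eps = 0.9
policy = lambda state: agent.epsGreedyAction(state, eps=eps)
agent.memoryUpdateEpisode(policy, maxSteps=1000)
```
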
randomAction(state)[source]

returns a set of random actions for the given states

Given the number of available actions, this function returns one random action per supplied state. For example, if state.shape == (10, ?) then the result will be a vector of size 10. This is in accordance with the reduction in dimensionality of the maxAction space.

Parameters:state ({nd_array or tensor}) – numpy array or tensor containing the state. The columns represent the different parts of the state.
Returns:A set of random actions, one per supplied state
Return type:ndarray
save(folder, name)[source]

save the model

This function allows one to save the model into the specified folder, including the fast and the slow qNetworks as well as the memory buffer. Sometimes there may be more than a single agent, and under those circumstances the name comes in handy. If the supplied folder does not exist, it will be created.

Parameters:
  • folder ({str}) – folder into which the model should be saved.
  • name ({str}) – A name to associate the current model with. It is entirely possible to save a number of models within the same folder.
sigmaMaxAction(state, sigma=0)[source]

returns the action that maximizes the noisy Q function

Given a set of states, this function returns a set of actions that maximize the value of the Q network for each of the supplied states, after adding Gaussian noise to the layers. This is an alternative to using an epsilon-greedy policy, and has been shown to provide better results under most circumstances.

Parameters:state ({nd_array or tensor}) – numpy array or tensor containing the state. The columns represent the different parts of the state.
Returns:The actions that maximize the noisy Q value for each supplied state
Return type:ndarray
softUpdate(tau=0.1)[source]

update the slow network slightly

This updates the slow network slightly. The amount is dictated by tau, which should be a number between 0 and 1: a fraction tau of the slow network's weights is replaced with the fast network's weights. This is done to provide stability to the network.

Parameters:tau ({number}, optional) – Determines how much of the fast network's weights are copied into the slow network's parameters (the default is 0.1)
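
The update described above amounts to a Polyak-style average; a standalone sketch of the assumed form (not the library's exact code) is:

```python
# Sketch of the assumed update: slowParam <- tau * fastParam + (1 - tau) * slowParam
import torch

@torch.no_grad()
def soft_update(qNetworkSlow, qNetworkFast, tau=0.1):
    for pSlow, pFast in zip(qNetworkSlow.parameters(), qNetworkFast.parameters()):
        pSlow.copy_(tau * pFast + (1.0 - tau) * pSlow)
```
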
step(nSamples=100, sigma=0)[source]

optimize the fast Q network via the Bellman equation

This function obtains a number of samples from the replay memory and trains the fast network on this dataset. It optimizes based upon the idea that

Given:
    Qf = fast network
    Qs = slow network
    s  = current state
    a  = action that maximizes the current state
    r  = reward
    s' = next state

    Qf(s, a) = r + max(Qs(s'))

Parameters:
  • nSamples (int, optional) – The number of samples to retrieve from the replay memory for training, by default 100
  • sigma (float, optional) – The amount by which the fast network should be jittered so that the network introduces some Gaussian noise in the learning process, by default 0, which does not introduce noise in the learning algorithm.
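
In code, the stated relation corresponds to a target of the following form. This is a standalone sketch of the idea, not the library's implementation; the discount factor is omitted because the text states that gamma is currently unused.

```python
# Sketch of the DQN target described above: Qf(s, a) = r + max(Qs(s'))
import torch
import torch.nn.functional as F

def dqn_loss(qFast, qSlow, states, actions, rewards, nextStates):
    # actions must be an int64 tensor of shape (batch,)
    qPred = qFast(states).gather(1, actions.view(-1, 1)).squeeze(1)
    with torch.no_grad():
        qTarget = rewards + qSlow(nextStates).max(dim=1).values
    return F.mse_loss(qPred, qTarget)
```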

lib.agents.Agent_DoubleDQN module

class lib.agents.Agent_DoubleDQN.Agent_DoubleDQN(env, memory, qNetworkSlow, qNetworkFast, numActions, gamma, device='cpu')[source]

Bases: object

A class allowing the training of the Double DQN

This class is intended to be used by functions within the lib.agents.trainAgents module.

This is intended to be an improvement over DQN. Details of the idea behind Double DQN are presented in the original paper:

Deep Reinforcement Learning with Double Q-learning https://arxiv.org/pdf/1509.06461.pdf

See the step() function for the details of its implementation.

Parameters:
  • env (instance of an Env class) – The environment that will be used for generating the result of a particular action in the current state
  • memory (instance of the Memory class) – The memory buffer that allows one to store and retrieve previously observed states that can be trained upon.
  • qNetworkSlow (neural network instance) – A neural network instance that can be used for converting a state into a set of Q-values. This is the slower version, used for making predictions, and is never trained directly. Its parameters are updated gradually over time so that it slowly converges to the right values.
  • qNetworkFast (neural network instance) – The instance of the faster network that is used for training the Q-learning algorithm. This is the main network that implements the Bellman equation.
  • numActions (int) – The number of discrete actions that the current environment can accept.
  • gamma (float) – The discount factor (currently not used).
  • device (str, optional) – The device where you want to run your algorithm, by default 'cpu'. If you want to run the optimization on a particular GPU, you may specify that, for example with 'cuda:0'.

checkTrainingMode()[source]

prints whether the networks are in training or eval mode

This function allows us to determine whether the networks are in training or evaluation mode. This is important for several reasons: specifically, to make sure that the networks are not evaluated unexpectedly, and that layers such as batch normalization and dropout behave appropriately.

epsGreedyAction(state, eps=0.999)[source]

epsilon greedy action

This is the epsilon-greedy action. In general, it selects the greedy (maximum-Q) action a fraction eps of the time, and a random action the rest of the time. It is assumed that the value of eps is between 0 and 1.

Parameters:
  • state ({ndarray}) – numpy array or tensor containing the state(s) for which actions should be selected.
  • eps (float, optional) – Determines the fraction of times the max action is selected rather than a random action. (the default is 0.999)
Returns:

The 1d tensor that has an action for each state provided.

Return type:

tensor

eval()[source]

put both the networks in eval mode

This makes sure that the networks are not inadvertently used in training mode, so that layers such as dropout and batch normalization behave deterministically during evaluation.

fastUpdate(tau=1)[source]

update the fast network slightly

This updates the fast network slightly. The amount is dictated by tau, which should be a number between 0 and 1: a fraction tau of the fast network's weights is replaced with the slow network's weights. This is done to provide stability to the network.

Parameters:tau ({number}, optional) – Determines how much of the slow network's weights are copied into the fast network's parameters (the default is 1)
load(folder, name, map_location=None)[source]

load the model

An agent saved with the save command can be safely loaded with this command. This loads both the qNetworks as well as the memory buffer. If you do not want to load the model onto the same device it was saved from, specify the device to load the model onto via map_location.

Parameters:
  • folder ({str}) – folder from which the model should be loaded.
  • name ({str}) – The name associated with the model to load. It is entirely possible to save a number of models within the same folder, and the name identifies which model to retrieve.
  • map_location ({str}, optional) – The device onto which to load the file. This is a string like 'cpu', 'cuda:0' etc. (the default is None, which results in the model being loaded onto the original device)
maxAction(state)[source]

returns the action that maximizes the Q function

Given a set of states, this function returns a set of actions, one for each supplied state, that maximize the value of the Q network.

Parameters:state ({nd_array or tensor}) – numpy array or tensor containing the state. The columns represent the different parts of the state.
Returns:The actions that maximize the Q value for each supplied state
Return type:ndarray
memoryUpdateEpisode(policy, maxSteps=1000, minScoreToAdd=None)[source]

update the memory

Given a particular policy, this method uses the policy to generate a series of memories and updates the memory buffer. Generating memories is easier to do using this function than an external function …

Parameters:
  • policy ({function}) – This is a function that takes a state and returns an action. This defines how the agent will explore the environment by changing the exploration/exploitation scale.
  • maxSteps ({number}, optional) – The maximum number of steps that one should have within an episode. (the default is 1000)
randomAction(state)[source]

returns a set of random actions for the given states

Given the number of available actions, this function returns one random action per supplied state. For example, if state.shape == (10, ?) then the result will be a vector of size 10. This is in accordance with the reduction in dimensionality of the maxAction space.

Parameters:state ({nd_array or tensor}) – numpy array or tensor containing the state. The columns represent the different parts of the state.
Returns:A set of random actions, one per supplied state
Return type:ndarray
save(folder, name)[source]

save the model

This function allows one to save the model into the specified folder, including the fast and the slow qNetworks as well as the memory buffer. Sometimes there may be more than a single agent, and under those circumstances the name comes in handy. If the supplied folder does not exist, it will be created.

Parameters:
  • folder ({str}) – folder into which the model should be saved.
  • name ({str}) – A name to associate the current model with. It is entirely possible to save a number of models within the same folder.
sigmaMaxAction(state, sigma=0)[source]

returns the action that maximizes the noisy Q function

Given a set of states, this function returns a set of actions that maximize the value of the Q network for each of the supplied states, after adding Gaussian noise to the layers. This is an alternative to using an epsilon-greedy policy, and has been shown to provide better results under most circumstances.

Parameters:state ({nd_array or tensor}) – numpy array or tensor containing the state. The columns represent the different parts of the state.
Returns:The actions that maximize the noisy Q value for each supplied state
Return type:ndarray
softUpdate(tau=0.1)[source]

update the slow network slightly

This updates the slow network slightly. The amount is dictated by tau, which should be a number between 0 and 1: a fraction tau of the slow network's weights is replaced with the fast network's weights. This is done to provide stability to the network.

Parameters:tau ({number}, optional) – Determines how much of the fast network's weights are copied into the slow network's parameters (the default is 0.1)
step(nSamples=100, sigma=0)[source]

optimize the fast Q network via the Bellman equation

This function obtains a number of samples from the replay memory and trains the fast network on this dataset. The idea is that the target should not automatically use the maximum value of the slow network for the next state; instead the fast network selects the best action for the next state and the slow network evaluates it. Because the original DQN tends to overestimate Q values by taking the maximum over its own estimates, this decoupling should reduce that overestimation.

This will optimize based upon the idea that

Given:
    Qf = fast network
    Qs = slow network
    s  = current state
    a  = action that maximizes the current state
    r  = reward
    s' = next state
    a' = argmax Qf(s')

    Qf(s, a) = r + Qs(s', a')

Parameters:
  • nSamples (int, optional) – The number of samples to retrieve from the replay memory for training, by default 100
  • sigma (float, optional) – The amount by which the fast network should be jittered so that the network introduces some Gaussian noise in the learning process, by default 0, which does not introduce noise in the learning algorithm.
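
The stated relation corresponds to the following standalone sketch (not the library's implementation; discounting omitted as in the text): the fast network selects the next action and the slow network evaluates it.

```python
# Sketch of the Double DQN target described above:
#   a' = argmax Qf(s'),   Qf(s, a) = r + Qs(s', a')
import torch
import torch.nn.functional as F

def double_dqn_loss(qFast, qSlow, states, actions, rewards, nextStates):
    qPred = qFast(states).gather(1, actions.view(-1, 1)).squeeze(1)
    with torch.no_grad():
        aNext = qFast(nextStates).argmax(dim=1, keepdim=True)                  # selection: fast network
        qTarget = rewards + qSlow(nextStates).gather(1, aNext).squeeze(1)      # evaluation: slow network
    return F.mse_loss(qPred, qTarget)
```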

lib.agents.policy module

class lib.agents.policy.epsGreedyPolicy(agent, randomAgent)[source]

Bases: object

An epsilon-greedy policy.

Combines a greedy agent with a random agent: the greedy agent's action is used with a frequency governed by eps, and the random agent supplies the action the rest of the time.

Parameters:
  • agent ({[type]}) – [description]
  • randomAgent ({[type]}) – [description]
act(states, eps)[source]

[summary]

[description]

Parameters:
  • states ({[type]}) – [description]
  • eps ({[type]}) – [description]
Returns:

[type] – [description]

lib.agents.qNetwork module

class lib.agents.qNetwork.qNetworkDiscrete(stateSize, actionSize, layers=[10, 5], activations=[<function tanh>, <function tanh>], batchNormalization=False, lr=0.01)[source]

Bases: torch.nn.modules.module.Module

This is a Q network with discrete actions

This takes a state and returns a Q value for each action. Hence, the input is a state and the output is a set of Q values, one for each action in the action space. The actions are assumed to be discrete, i.e. represented by a 1 for the particular action that is desired. The input state is assumed to be 1D in nature; a different network will have to be chosen if 2D or 3D inputs are desired.

Parameters:
  • stateSize ({int}) – Size of the state. Since this is a 1D network, this represents the number of values that will be used to represent the current state.
  • actionSize ({int}) – The number of discrete actions that will be used.
  • layers ({list of int}, optional) – The number of nodes in each hidden layer (the default is [10, 5], which will create two hidden layers with 10 and 5 nodes respectively)
  • activations ({list of activations}, optional) – The activation functions to be used for each layer (the default is [F.tanh, F.tanh], which will generate tanh activations for each of the hidden layers)
  • batchNormalization ({bool}, optional) – Whether batch normalization is to be used (the default is False, in which case batch normalization is not applied)
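
A small instantiation sketch, using the documented signature (the layer sizes and activations here are arbitrary choices):

```python
# Sketch: a Q network for an 8-dimensional state and 4 discrete actions.
import torch
import torch.nn.functional as F
from lib.agents.qNetwork import qNetworkDiscrete

qNet = qNetworkDiscrete(stateSize=8, actionSize=4,
                        layers=[64, 32],
                        activations=[F.relu, F.relu],
                        batchNormalization=False, lr=0.001)
qValues = qNet(torch.randn(16, 8))   # expected shape: (16, 4), one Q value per action
```
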
forward(x, sigma=0)[source]

forward function that is called during the forward pass

This is the forward function that will be called during a forward pass. It takes the states and gives the Q value corresponding to each of the actions associated with that state.

Parameters:x (Tensor) – This is a 2D tensor.
Returns:The Q values corresponding to each action for the supplied states
Return type:tensor
step(v1, v2)[source]

Uses the optimizer to update the weights

This calculates the MSE loss given two inputs, one of which must be calculated with this current nn.Module, and the other one that is expected.

Note that this allows arbitrary functions to be used for calculating the loss.

Parameters:
  • v1 ({Tensor}) – Tensor for calculating the loss function
  • v2 ({Tensor}) – Tensor for calculating the loss function
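
Continuing the sketch above, a single optimisation step might look as follows. Which argument is the module's own output and which is the expected value is an assumption based on the description.

```python
# Sketch: v1 is computed with this module (so gradients can flow), v2 is the target.
import torch

states = torch.randn(16, 8)
expected = torch.randn(16, 4)      # hypothetical target Q values
predicted = qNet(states)           # computed with this nn.Module
qNet.step(predicted, expected)     # MSE loss between the two, then one optimizer update
```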

lib.agents.randomActor module

class lib.agents.randomActor.randomDiscreteActor(stateShape, numActions)[source]

Bases: object

a random discrete actor

A discrete action space is one where the actor will return an integer which will represent one of n actions. This actor will return a random action independent of the state the environment is in. This will be roughly uniformly distributed.

Parameters:
  • stateShape ({tuple}) – Tuple of integers that describes the dimensions of the state space
  • numActions ({integer}) – The number of actions that the agent can take.
act(state)[source]

return an action based on the state


Parameters:state ({nd-array}) – nd-array with the shape described by stateShape in the __init__ function.
Returns:An integer between 0 and the number of actions available.
Return type:integer
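
A minimal usage sketch, following the documented constructor (the state shape and action count are arbitrary):

```python
# Sketch: the returned action ignores the state and is roughly uniform over numActions.
import numpy as np
from lib.agents.randomActor import randomDiscreteActor

actor = randomDiscreteActor(stateShape=(8,), numActions=4)
action = actor.act(np.zeros((8,)))   # an integer between 0 and numActions
```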

lib.agents.sequentialActor module

class lib.agents.sequentialActor.SequentialDiscreteActor(stateSize, numActions, layers=[10, 5], activations=[<function tanh>, <function tanh>], batchNormalization=True)[source]

Bases: torch.nn.modules.module.Module

[summary]

[description]

Parameters:
  • stateSize ({[type]}) – [description]
  • numActions ({[type]}) – [description]
Keyword Arguments:
  • layers ({list}) – [description] (default: [10, 5])
  • activations ({list}) – [description] (default: [F.tanh, F.tanh])
  • batchNormalization ({bool}) – [description] (default: True)
forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

lib.agents.sequentialCritic module

class lib.agents.sequentialCritic.SequentialCritic(stateSize, actionSize, layers=[10, 5], activations=[<function tanh>, <function tanh>], mergeLayer=0, batchNormalization=True)[source]

Bases: torch.nn.modules.module.Module

[summary]

[description]

Parameters:
  • stateSize ({[type]}) – [description]
  • actionSize ({[type]}) – [description]
  • layers ({list}, optional) – [description] (the default is [10, 5])
  • activations ({list}, optional) – [description] (the default is [F.tanh, F.tanh])
  • batchNormalization ({bool}, optional) – [description] (the default is True)
forward(x, action)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

lib.agents.trainAgents module

lib.agents.trainAgents.trainAgentGymEpsGreedy(configAgent)[source]

Module contents

module that contains a plethora of agents and policies

This module contains a number of agents and policies that can be specified easily in a uniform manner. This way, different agents can easily be swapped for one another.

Agents

Agents take a state and return an action. Unfortunately, both states and actions come in a variety of shapes and sizes.

There are essentially two different types of states: either a vector, or an nd-array. Typically, nd-arrays are dealt with using convolution operators while vectors are dealt with using simple sequential networks. While nd-arrays can be flattened, the reverse operation is generally not practical. In any case, it is assumed that the user has enough intuition to be able to distinguish between the two.

Actions are typically vectors. However, sometimes actions can be discrete, sometimes continuous, and sometimes any combination of the two. Furthermore, actions typically have bounds. In more general cases (like chess), valid actions depend on the current state. There is no generic way of solving this problem, so we shall create different types of agents that return different types of actions.

All agents will definitely have the following methods (a usage sketch follows the list):

  • act : action to take given a state
  • save : save the current state of the agent
  • restore : restore the agent from a state saved earlier
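
Because of this shared interface, driver code can stay agnostic about which agent it is given. The sketch below is illustrative only; the environment's reset()/step() convention is an assumption, not part of this module's contract.

```python
# Illustrative sketch of agent-agnostic driver code (interfaces assumed).
def run_episode(env, agent, maxSteps=1000):
    state = env.reset()                          # assumed environment API
    for _ in range(maxSteps):
        action = agent.act(state)                # every agent exposes act()
        state, reward, done = env.step(action)   # assumed return convention
        if done:
            break
```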

Currently the following agents are available:

Policies

Policies determine what action to take given the result of an agent. In one way, they are similar in that they also take a state, and return an action. However, a policy determines how much exploration vs. exploitation should be done over the period that the agent is learning.