Using a neural network to solve OpenAI's CartPole balancing environment




Reinforcement Learning - Playing Games with Neural Networks and Artificial Intelligence

Introduction to AI within Environments

Because deep learning craves large datasets, anything that can be modeled or simulated is a natural fit for AI: we can generate as much training data as we like.

With Python, we can easily create our own environments, but there are also quite a few libraries out there that do this for you. The most popular that I know of is OpenAI's Gym.

There are also many domains, like mathematics or even encryption, where we can generate hundreds of thousands, or even millions, of samples easily.
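For example, here is a tiny sketch (purely illustrative, not part of this tutorial's code) of minting unlimited labeled samples for a simple arithmetic task:

import random

def generate_addition_samples(n):
    # each sample pairs two random integers with their sum,
    # so we can generate as much labeled data as we want
    samples = []
    for _ in range(n):
        a, b = random.randint(0, 99), random.randint(0, 99)
        samples.append(([a, b], a + b))
    return samples

print(generate_addition_samples(3))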

For this tutorial, we're going to use the "CartPole" environment.

To follow along, you will need the following:

Requirements

TensorFlow - pip install tensorflow OR pip install tensorflow-gpu

Installing the GPU version of TensorFlow in Ubuntu

Installing the GPU version of TensorFlow on a Windows machine

Using TensorFlow and concept tutorials:

Introduction to deep learning with neural networks

Introduction to TensorFlow

TFLearn - pip install tflearn

Intro to TFLearn

OpenAI's gym - pip install gym

Solving the CartPole balancing environment

The idea of CartPole is that there is a pole standing up on top of a cart. The goal is to balance this pole by wiggling/moving the cart from side to side to keep the pole balanced upright.

The environment is considered solved if we can balance for 200 frames; a run ends in failure when the pole tips more than 12 degrees from vertical (or the cart drifts too far from center).

Every frame that we keep the pole balanced, our "score" gets +1, and our target is a score of 200.

Now, how do we do this? There are endless ways, some very complex, and some very specific. I'd like to solve this very generally, and in a way that we could easily apply this same solution to a wide variety of problems.

This will also give me the ability to illustrate a very interesting property of neural networks. If you've ever taken a statistics course, you may be familiar with the idea of combining several weak signals, each with some predictive power of its own, into something with more predictive power than the sum of the parts.

Neural networks are fully capable of doing this on their own entirely.

To illustrate this, we're going to start by creating an agent that just randomly chooses actions (left or right) in this CartPole environment. Recall that our goal is to get a score of 200, but we'll go ahead and learn from any game where we scored above 50.

From here, the input layer is the observation from the environment, which includes pole position and such. The output layer is just one of two actions: left or right.
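If you want to confirm this for yourself, you can inspect the environment's spaces directly (a quick check, assuming gym is installed):

import gym

env = gym.make("CartPole-v0")
# the observation is 4 floats: cart position, cart velocity,
# pole angle, and pole velocity at the tip
print(env.observation_space.shape)  # (4,)
# the action space is discrete: 0 (push left) or 1 (push right)
print(env.action_space.n)  # 2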

Alright, let's get started:

import gym
import random
import numpy as np
import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.estimator import regression
from statistics import median, mean
from collections import Counter

LR = 1e-3
env = gym.make("CartPole-v0")
env.reset()
goal_steps = 500
score_requirement = 50
initial_games = 10000
[2017-03-02 18:38:06,633] Making new env: CartPole-v0

Now, let's just get a quick impression of what a random agent looks like.

def some_random_games_first():
    # Each of these is its own game.
    for episode in range(5):
        env.reset()
        # this is each frame, up to 200... but we won't make it that far.
        for t in range(200):
            # This will display the environment
            # Only display if you really want to see it.
            # Takes much longer to display it.
            env.render()
            
            # This will just create a sample action in any environment.
            # In this environment, the action can be 0 or 1, which is left or right
            action = env.action_space.sample()
            
            # this executes the environment with an action, 
            # and returns the observation of the environment, 
            # the reward, if the env is over, and other info.
            observation, reward, done, info = env.step(action)
            if done:
                break
                
some_random_games_first()

Each time you see the scene start over, that's because the environment was "done." In our case, we kept losing.

Now that you've seen what random is, can we learn from it? Absolutely.

def initial_population():
    # [OBS, MOVES]
    training_data = []
    # all scores:
    scores = []
    # just the scores that met our threshold:
    accepted_scores = []
    # iterate through however many games we want:
    for _ in range(initial_games):
        # reset the env before each game; stepping an environment
        # that has already returned done=True is undefined behavior
        env.reset()
        score = 0
        # moves specifically from this environment:
        game_memory = []
        # previous observation that we saw
        prev_observation = []
        # for each frame, up to goal_steps (CartPole-v0 itself caps episodes at 200)
        for _ in range(goal_steps):
            # choose random action (0 or 1)
            action = random.randrange(0,2)
            # do it!
            observation, reward, done, info = env.step(action)
            
            # notice that the observation is returned FROM the action
            # so we'll store the previous observation here, pairing
            # the prev observation to the action we'll take.
            if len(prev_observation) > 0:
                game_memory.append([prev_observation, action])
            prev_observation = observation
            score+=reward
            if done: break

        # IF our score is higher than our threshold, we'd like to save
        # every move we made
        # NOTE the reinforcement methodology here. 
        # all we're doing is reinforcing the score, we're not trying 
        # to influence the machine in any way as to HOW that score is 
        # reached.
        if score >= score_requirement:
            accepted_scores.append(score)
            for data in game_memory:
                # convert to one-hot (this is the output layer for our neural network)
                if data[1] == 1:
                    output = [0,1]
                elif data[1] == 0:
                    output = [1,0]
                    
                # saving our training data
                training_data.append([data[0], output])

        # save overall scores
        scores.append(score)
    
    # just in case you wanted to reference later
    training_data_save = np.array(training_data)
    np.save('saved.npy',training_data_save)
    
    # some stats here, to further illustrate the neural network magic!
    print('Average accepted score:',mean(accepted_scores))
    print('Median score for accepted scores:',median(accepted_scores))
    print(Counter(accepted_scores))
    
    return training_data
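Before building the network, it can help to sanity-check what a training sample actually looks like: each element pairs a 4-value observation with a one-hot action. Here's a quick check (assuming initial_population() has already run and written saved.npy; note that newer versions of numpy require allow_pickle=True to load object arrays like this one):

import numpy as np

data = np.load('saved.npy')
# each sample is [observation, one_hot_action], e.g.
# [array([-0.03,  0.21, -0.04, -0.3 ]), [0, 1]]
print(len(data))
print(data[0])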

Now we will make our neural network. We're just going to use a simple multilayer perceptron model.

def neural_network_model(input_size):

    # the input is the observation, shaped [None, input_size, 1]
    network = input_data(shape=[None, input_size, 1], name='input')

    # five fully connected hidden layers; note that in TFLearn,
    # dropout's second argument is the *keep* probability (keep 80%)
    network = fully_connected(network, 128, activation='relu')
    network = dropout(network, 0.8)

    network = fully_connected(network, 256, activation='relu')
    network = dropout(network, 0.8)

    network = fully_connected(network, 512, activation='relu')
    network = dropout(network, 0.8)

    network = fully_connected(network, 256, activation='relu')
    network = dropout(network, 0.8)

    network = fully_connected(network, 128, activation='relu')
    network = dropout(network, 0.8)

    # output layer: a softmax over our two actions (left, right)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=LR, loss='categorical_crossentropy', name='targets')
    model = tflearn.DNN(network, tensorboard_dir='log')

    return model


def train_model(training_data, model=False):

    # observations become X, one-hot actions become y
    X = np.array([i[0] for i in training_data]).reshape(-1,len(training_data[0][0]),1)
    y = [i[1] for i in training_data]

    # accept an existing model so it can be trained further
    if not model:
        model = neural_network_model(input_size = len(X[0]))
    
    model.fit({'input': X}, {'targets': y}, n_epoch=5, snapshot_step=500, show_metric=True, run_id='openai_learning')
    return model

If you do not understand the neural network code, see the linked tutorials at the beginning of this notebook. I've already covered neural networks extensively, no sense in repeating myself!
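One practical note: if you want to keep a trained model around between sessions, TFLearn's DNN wrapper can save and restore weights (a minimal sketch; the filename here is just an example):

# save the trained weights to disk...
model.save('cartpole.model')

# ...and later, rebuild the same architecture and load them back
# (input_size is 4 for CartPole's observation)
model = neural_network_model(input_size=4)
model.load('cartpole.model')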

Let's produce the training data:

training_data = initial_population()
Average accepted score: 61.45325779036827
Median score for accepted scores: 58.0
Counter({51.0: 30, 53.0: 29, 50.0: 25, 58.0: 23, 55.0: 19, 59.0: 17, 57.0: 16, 52.0: 14, 54.0: 14, 56.0: 14, 62.0: 13, 60.0: 12, 61.0: 12, 66.0: 11, 72.0: 9, 63.0: 8, 68.0: 8, 65.0: 7, 67.0: 6, 69.0: 6, 75.0: 5, 81.0: 5, 64.0: 4, 73.0: 4, 74.0: 4, 78.0: 4, 79.0: 4, 77.0: 3, 85.0: 3, 87.0: 3, 76.0: 2, 89.0: 2, 92.0: 2, 101.0: 2, 70.0: 1, 71.0: 1, 80.0: 1, 82.0: 1, 83.0: 1, 84.0: 1, 86.0: 1, 88.0: 1, 90.0: 1, 97.0: 1, 100.0: 1, 107.0: 1, 110.0: 1})

Take note here that the average accepted score is about 61, the median is 58, and the HIGHEST example here is 110, with only a handful of games above 100. Now, let's train our neural network on this data that gave us these scores...

model = train_model(training_data)
Training Step: 1669  | total loss: 0.65939 | time: 1.893s
| Adam | epoch: 005 | loss: 0.65939 - acc: 0.6086 -- iter: 21312/21340
Training Step: 1670  | total loss: 0.65900 | time: 1.898s
| Adam | epoch: 005 | loss: 0.65900 - acc: 0.6118 -- iter: 21340/21340
--

Now we're going to use code very similar to the initial_population function. The only major difference is that, rather than choosing a random action, we'll generate the action FROM our neural network instead. We'll go ahead and visualize these games as well, and then save some stats:

scores = []
choices = []
for each_game in range(10):
    score = 0
    game_memory = []
    prev_obs = []
    env.reset()
    for _ in range(goal_steps):
        env.render()

        # on the first frame there is no previous observation, so take
        # a random action; after that, the network picks the action
        if len(prev_obs)==0:
            action = random.randrange(0,2)
        else:
            action = np.argmax(model.predict(prev_obs.reshape(-1,len(prev_obs),1))[0])

        choices.append(action)
                
        new_observation, reward, done, info = env.step(action)
        prev_obs = new_observation
        game_memory.append([new_observation, action])
        score+=reward
        if done: break

    scores.append(score)

print('Average Score:',sum(scores)/len(scores))
print('choice 1:{}  choice 0:{}'.format(choices.count(1)/len(choices),choices.count(0)/len(choices)))
print(score_requirement)
Average Score: 195.9
choice 1:0.5074017355793773  choice 0:0.49259826442062277
50

Solved.
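Remember, the goal was to solve this generally. Nothing above was specific to CartPole beyond the environment name and the two-unit output layer, so the same pipeline can be pointed at other Gym environments with discrete actions. Here's a rough sketch of what changes (MountainCar-v0 is just an example; you would also need to pick a sensible score_requirement for the new environment, and size the network's output layer to n_actions instead of 2):

env = gym.make("MountainCar-v0")
env.reset()

# the observation size and action count come from the env itself:
input_size = env.observation_space.shape[0]  # 2 for MountainCar
n_actions = env.action_space.n               # 3 for MountainCar

# one-hot encode actions generically instead of hard-coding [0,1]/[1,0]:
action = 2
output = [0] * n_actions
output[action] = 1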

 

That's all for now, head back home for more tutorials.




