Reading game frames in Python with OpenCV - Python Plays GTA V







When OpenAI's Universe came out, and various articles suggested even games like Grand Theft Auto V were ready to go, I was very excited to check it out. Then, however, somewhat mysteriously, GTA V was completely removed from Universe with no explanation whatsoever.

I gave up and forgot about it for a while, but the idea still seemed exciting. Finally, I decided to put some more mental energy into it, and questioned whether or not I even needed OpenAI at all for a task like this. Sure, it's nice for simpler games that can be run en masse, so you can train thousands of iterations in moments, but, with something like GTA V, this is really not going to be much of an option anyway.

Just in case it's not totally obvious, why GTA V? At least for me, Grand Theft Auto 5 is a great environment to practice in for a variety of reasons. It's an open world with endless things you can do, but let's consider even just a simple one: self-driving cars. With GTA V, we can use mods to control the time of day, weather, traffic, speeds, what happens when we crash...all kinds of things (mainly using mods, but mods aren't absolutely required). It's just a completely customizable environment.

Some of my tutorials are planned fully, others sort of, and some not at all. This is not planned at all, and is going to be me working through this problem. I realize not everyone has Grand Theft Auto 5, but it is my expectation that you have SOME similar games to do the tasks we're going to be working on, and that this method can be done on a variety of games. Because you may have to translate some things and tweak to get things working on your end, this is probably not going to be a beginner-friendly series.

My initial goal is to just create a sort of self-driving car. Any game with lanes and cars should be just fine for you to follow along. The method I will use to access the game should be doable on almost any game. A simpler game will likely make for a simpler task, too. Things like sun glare in GTA V will make the computer vision that much more challenging, but also more realistic.

I may also try other games with this method, since I also think we can teach an AI to play games by simply showing it how to play for a bit, using a Convolutional Neural Network on that information, and then letting the AI poke around.

Here are my initial thoughts:

Despite not having a pre-packaged solution already with Python:

  1. We can surely access frames from the screen.
  2. We can mimic key-presses (sendkeys, pyautogui...and probably many other options).

This is already enough for more rudimentary tasks, but what about for something like deep learning? Really, the only extra thing we might want is something that can also log various events from the game world. That said, since most games are played almost completely visually, we can handle that already, and we can also track mouse position and key presses, allowing us to engage in deep learning.
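One way to sketch that logging idea: pair each captured frame with the inputs active at capture time, so frames can later serve as training features and inputs as labels. Everything below is hypothetical scaffolding (the `InputLogger` name is made up); the actual key and mouse reading would come from a keyboard hook or a library like pyautogui:

```python
import time

class InputLogger:
    """Hypothetical recorder pairing captured frames with input state.

    The actual key/mouse reading is stubbed out here; in practice it
    would come from a keyboard hook or a library such as pyautogui.
    """
    def __init__(self):
        self.records = []

    def log(self, frame_id, keys_down, mouse_pos):
        # Store (timestamp, frame reference, inputs) so frames can later
        # be used as training features and inputs as training labels.
        self.records.append({
            'time': time.time(),
            'frame_id': frame_id,
            'keys': sorted(keys_down),
            'mouse': mouse_pos,
        })

# Hypothetical usage: log two frames' worth of input state.
logger = InputLogger()
logger.log(frame_id=0, keys_down={'w'}, mouse_pos=(400, 300))
logger.log(frame_id=1, keys_down={'w', 'a'}, mouse_pos=(380, 300))
print(len(logger.records))  # 2
```

In practice you would store the frame array itself (or a path to it) rather than just an ID, but the pairing of pixels and inputs is the core of the idea.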

I doubt this will be sunshine and rainbows, but I think it's at least possible, and will make for a great, or at least interesting, project. My main concern is processing everything fast enough, but I think we can do it, and it's at least worth a shot.

So this is quite a large project; if we don't break it down and take some baby steps, we're going to be overwhelmed. The way I see it, we need to try to do the bare minimum first. Thus, the initial goals are:

  1. Access the game screen at a somewhat decent FPS. Anything over 5ish should be workable for us. Unpleasant to watch, but workable, and we can always watch the actual live game, rather than the processed frames, for our enjoyment.
  2. Send keyboard input to the game screen. I am assuming I can do this very simply, but we need to make sure.
  3. Try some form of joystick input if possible (especially considering throttle and turning).
  4. Use OpenCV on game frames and hope to not take a huge hit in processing FPS.
  5. Build a simple self-driving car that stays in some lanes under simple conditions (high sun, clear day, no rain, no traffic...etc).

Alright, so step 1: how should we actually access our screen? I am fairly certain it's been done, but I don't really know how. For this, I take to Google! I find quite a few examples, most of which don't actually loop, but this one does: http://stackoverflow.com/questions/24129253/screen-capture-with-opencv-and-python-2-7. It just appears to have a typo on the import; ImageGrab is part of PIL.

import numpy as np
import ImageGrab
import cv2

while(True):
    printscreen_pil =  ImageGrab.grab()
    printscreen_numpy =   np.array(printscreen_pil.getdata(),dtype=uint8)\
    .reshape((printscreen_pil.size[1],printscreen_pil.size[0],3)) 
    cv2.imshow('window',printscreen_numpy)
    if cv2.waitKey(25) & 0xFF == ord('q'):
        cv2.destroyAllWindows()
        break
        
----------------------------------------------------------------------
ImportError                          Traceback (most recent call last)
<ipython-input-3-00f897cb4216> in <module>()
      1 import numpy as np
----> 2 import ImageGrab
      3 import cv2
      4 
      5 while(True):

ImportError: No module named 'ImageGrab'

Odd, okay, ImageGrab is part of PIL from what I understand, so we fix that import:

import numpy as np
from PIL import ImageGrab
import cv2

while(True):
    printscreen_pil =  ImageGrab.grab()
    printscreen_numpy =   np.array(printscreen_pil.getdata(),dtype=uint8)\
    .reshape((printscreen_pil.size[1],printscreen_pil.size[0],3)) 
    cv2.imshow('window',printscreen_numpy)
    if cv2.waitKey(25) & 0xFF == ord('q'):
        cv2.destroyAllWindows()
        break
----------------------------------------------------------------------
NameError                            Traceback (most recent call last)
<ipython-input-4-545ecbe36422> in <module>()
      5 while(True):
      6     printscreen_pil =  ImageGrab.grab()
----> 7     printscreen_numpy =   np.array(printscreen_pil.getdata(),dtype=uint8)    .reshape((printscreen_pil.size[1],printscreen_pil.size[0],3))
      8     cv2.imshow('window',printscreen_numpy)
      9     if cv2.waitKey(25) & 0xFF == ord('q'):

NameError: name 'uint8' is not defined

More fighting. The dtype should be a string like 'uint8' (or np.uint8), not what appears to be a bare variable name that's obviously not defined. Did this person run the code?
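For reference, NumPy accepts the dtype either as the string 'uint8' or as the attribute np.uint8; the bare name uint8 only exists if you've done a star-import from numpy. A quick sketch:

```python
import numpy as np

# Both spellings name the same 8-bit unsigned integer dtype.
a = np.array([0, 128, 255], dtype='uint8')
b = np.array([0, 128, 255], dtype=np.uint8)

print(a.dtype, b.dtype, a.dtype == b.dtype)  # uint8 uint8 True
```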

import numpy as np
from PIL import ImageGrab
import cv2

def screen_record(): 
    while True:
        printscreen_pil =  ImageGrab.grab()
        printscreen_numpy =   np.array(printscreen_pil.getdata(),dtype='uint8')\
        .reshape((printscreen_pil.size[1],printscreen_pil.size[0],3)) 
        cv2.imshow('window',printscreen_numpy)
        if cv2.waitKey(25) & 0xFF == ord('q'):
            cv2.destroyAllWindows()
            break

screen_record()

Great, this one actually works to some degree. It's a bit large though. And slow. Let's solve for the size.

import numpy as np
from PIL import ImageGrab
import cv2


def screen_record(): 
    while True:
        # 800x600 windowed mode
        printscreen_pil =  ImageGrab.grab(bbox=(0,40,800,640))
        printscreen_numpy =   np.array(printscreen_pil.getdata(),dtype='uint8')\
        .reshape((printscreen_pil.size[1],printscreen_pil.size[0],3))
        cv2.imshow('window',cv2.cvtColor(printscreen_numpy, cv2.COLOR_BGR2RGB))
        if cv2.waitKey(25) & 0xFF == ord('q'):
            cv2.destroyAllWindows()
            break

screen_record()
        

Okay great, this will work for size... but this is still very slow. I am currently getting about 2-3 frames per second. Let's find out why.

import numpy as np
from PIL import ImageGrab
import cv2
import time

def screen_record(): 
    last_time = time.time()
    while True:
        # 800x600 windowed mode
        printscreen_pil =  ImageGrab.grab(bbox=(0,40,800,640))
        printscreen_numpy =   np.array(printscreen_pil.getdata(),dtype='uint8')\
        .reshape((printscreen_pil.size[1],printscreen_pil.size[0],3))
        print('loop took {} seconds'.format(time.time()-last_time))
        last_time = time.time()

    ##    cv2.imshow('window',cv2.cvtColor(printscreen_numpy, cv2.COLOR_BGR2RGB))
    ##    if cv2.waitKey(25) & 0xFF == ord('q'):
    ##        cv2.destroyAllWindows()
    ##        break

screen_record()

This is still ~2-3 FPS, so the imshow is not the culprit.

import numpy as np
from PIL import ImageGrab
import cv2
import time

def screen_record(): 
    last_time = time.time()
    while True:
        # 800x600 windowed mode
        printscreen_pil =  ImageGrab.grab(bbox=(0,40,800,640))
    ##    printscreen_numpy =   np.array(printscreen_pil.getdata(),dtype='uint8')\
    ##    .reshape((printscreen_pil.size[1],printscreen_pil.size[0],3))
        print('loop took {} seconds'.format(time.time()-last_time))
        last_time = time.time()
    ##    
    ##    cv2.imshow('window',cv2.cvtColor(printscreen_numpy, cv2.COLOR_BGR2RGB))
    ##    if cv2.waitKey(25) & 0xFF == ord('q'):
    ##        cv2.destroyAllWindows()
    ##        break

screen_record()

Oooh. We're on to something now:

loop took 0.05849909782409668 seconds
loop took 0.044053077697753906 seconds
loop took 0.04760456085205078 seconds
loop took 0.04805493354797363 seconds
loop took 0.05989837646484375 seconds
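To put those loop times in frames-per-second terms: the average is about 0.0516 seconds per grab, so the capture alone runs at roughly 19 FPS. A quick check:

```python
# Loop times printed by the profiling run above.
loop_times = [
    0.05849909782409668,
    0.044053077697753906,
    0.04760456085205078,
    0.04805493354797363,
    0.05989837646484375,
]

avg = sum(loop_times) / len(loop_times)
fps = 1.0 / avg
print(round(fps, 1))  # roughly 19.4 FPS for the grab alone
```

So the grab itself isn't the bottleneck; the getdata/reshape step is eating most of our frame budget.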

Now, for OpenCV's imshow, we really need a NumPy array. What if, rather than doing the whole .getdata() and reshape dance, we just convert ImageGrab.grab(bbox=(0,40,800,640)) directly to a NumPy array? Why the reshape at all? The image is already the size we need, and maybe .getdata(), despite being a method, won't be required.
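A quick sanity check (using a blank PIL image as a stand-in for a real screen grab, and assuming Pillow is installed) shows that np.array() on a PIL image already produces the (height, width, 3) uint8 array that cv2.imshow wants, with no getdata()/reshape needed:

```python
import numpy as np
from PIL import Image

# Stand-in for ImageGrab.grab(bbox=(0,40,800,640)): an 800x600 RGB image.
pil_img = Image.new('RGB', (800, 600))

arr = np.array(pil_img)      # direct conversion, no getdata()/reshape
print(arr.shape, arr.dtype)  # (600, 800, 3) uint8
```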

import numpy as np
from PIL import ImageGrab
import cv2
import time

def screen_record(): 
    last_time = time.time()
    while(True):
        # 800x600 windowed mode
        printscreen =  np.array(ImageGrab.grab(bbox=(0,40,800,640)))
        print('loop took {} seconds'.format(time.time()-last_time))
        last_time = time.time()
        cv2.imshow('window',cv2.cvtColor(printscreen, cv2.COLOR_BGR2RGB))
        if cv2.waitKey(25) & 0xFF == ord('q'):
            cv2.destroyAllWindows()
            break

screen_record()

Great, this gives me ~12-13 FPS. That's certainly not amazing, but we can work with that.

I chose the bbox dimensions to match an 800x600 resolution of GTA V in windowed mode.
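For reference, ImageGrab.grab's bbox is (left, top, right, bottom) in screen pixels. The top starts at 40 to skip roughly 40 pixels of Windows title bar, which makes the bottom 600 + 40 = 640. A small helper makes the arithmetic explicit (the 40-pixel title bar height is an assumption; measure your own setup):

```python
def game_bbox(width, height, title_bar=40, left=0):
    """Build an ImageGrab-style (left, top, right, bottom) box for a
    game window anchored at the top-left corner of the screen.

    title_bar: estimated height in pixels of the window's title bar,
    skipped so only the game's client area is captured.
    """
    top = title_bar
    return (left, top, left + width, top + height)

print(game_bbox(800, 600))  # (0, 40, 800, 640)
```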

The next tutorial, and the rest of the series:
  • Reading game frames in Python with OpenCV - Python Plays GTA V
  • OpenCV basics - Python Plays GTA V
  • Direct Input to Game - Python Plays GTA V
  • Region of Interest for finding lanes - Python Plays GTA V
  • Hough Lines - Python Plays GTA V
  • Finding Lanes for our self driving car - Python Plays GTA V
  • Self Driving Car - Python Plays GTA V
  • Next steps for Deep Learning self driving car - Python Plays GTA V
  • Training data for self driving car neural network- Python Plays GTA V
  • Balancing neural network training data- Python Plays GTA V
  • Training Self-Driving Car neural network- Python Plays GTA V
  • Testing self-driving car neural network- Python Plays GTA V
  • A more interesting self-driving AI - Python Plays GTA V
  • Object detection with Tensorflow - Self Driving Cars in GTA
  • Determining other vehicle distances and collision warning - Self Driving Cars in GTA
  • Getting the Agent a Vehicle- Python Plays GTA V
  • Acquiring a Vehicle for the Agent - Python Plays GTA V