Training Dataset - Creating a Chatbot with Deep Learning, Python, and TensorFlow Part 6

Welcome to part 6 of the chatbot with Python and TensorFlow tutorial series. In this part, we're going to work on creating our training data. There are two overall models and workflows that I am considering for this series: one I know works (shown in the beginning and running live on the Twitch stream), and another that can probably work better, but that I am still poking around with. Either way, our setup for the training data is relatively similar. We need to create a pair of "parent" and "reply" text files, where each line is one sample: line 15 in the parent file is a parent comment, and line 15 in the reply file is the response to that parent comment.
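
To make that pairing concrete: once the files exist, you could read them back side by side, something like the minimal sketch below. This only works because, back when we built the database, each comment was cleaned onto a single line (newlines swapped for a token), so line N of one file always matches line N of the other.

# A minimal sketch: read the two aligned files back as (parent, reply) pairs.
with open('train.from', encoding='utf8') as f_from, open('train.to', encoding='utf8') as f_to:
    for parent, reply in zip(f_from, f_to):
        print('PARENT:', parent.strip())
        print('REPLY:', reply.strip())
        break  # just show the first pair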

To create these files, we just need to grab pairs from the database, and append them to the respective training files. Let's begin with:

import sqlite3
import pandas as pd

timeframes = ['2015-05']


for timeframe in timeframes:

For the run-through here, I am just running through a single month, having created only one database. You might instead want to create a single database with tables named by month and year, or you could create a bunch of sqlite databases with tables like we have, then iterate over them to create your files (a sketch of discovering several monthly databases automatically appears a bit further below). Anyway, I just have the one, so I will leave timeframes as a single-item list. Let's continue building this loop:

for timeframe in timeframes:
    connection = sqlite3.connect('{}.db'.format(timeframe))
    c = connection.cursor()
    limit = 5000
    last_unix = 0
    cur_length = limit
    counter = 0
    test_done = False

The first line just establishes our connection, then we define the cursor (which, as it turns out, we won't actually use, since pandas queries through the connection directly), then the limit. The limit is the size of the chunk that we're going to pull at a time from the database; again, we're working with data that is plausibly much larger than the RAM we have. We'll set the limit to 5000 for now so we can quickly get some testing data, and we can raise it later. We'll use last_unix to keep track of where in time to pull from next, cur_length will tell us when we're done, counter will allow us to show some debugging information, and test_done flags when we're done building the testing data.
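
As an aside, if you did go the route of one sqlite database per month, you could build the timeframes list automatically rather than typing it out. A minimal sketch, assuming the databases sit in the working directory with names like 2015-05.db; this would replace the hard-coded timeframes list at the top:

import glob
import os

# Hypothetical: collect every sqlite database in the working directory,
# stripping the .db extension so each entry matches the connect() call above.
timeframes = [os.path.splitext(path)[0] for path in sorted(glob.glob('*.db'))]
print(timeframes)  # e.g. ['2015-05', '2015-06']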

    while cur_length == limit:

        df = pd.read_sql("SELECT * FROM parent_reply WHERE unix > {} and parent NOT NULL and score > 0 ORDER BY unix ASC LIMIT {}".format(last_unix,limit),connection)
        cur_length = len(df)
        if cur_length == 0:
            break
        last_unix = df.tail(1)['unix'].values[0]

So long as cur_length is the same as our limit, we've still got more pulling to do. Then, we'll pull the data from the database and slap it into a dataframe. For now, we're not doing much with the dataframe, but later we could use it to set more restrictions on the types of data we want to consider. We take note of the length of the return, and, if the pull came back empty (which can happen when the table's row count is an exact multiple of the limit), we break out rather than trying to read a unix time from an empty dataframe. Otherwise, we store the last_unix so we know what time to draw from on the next pull. Now, take this data, and build our training/testing files. We'll start with the testing:

        if not test_done:
            with open('test.from','a', encoding='utf8') as f:
                for content in df['parent'].values:
                    f.write(str(content)+'\n')

            with open('test.to','a', encoding='utf8') as f:
                for content in df['comment'].values:
                    f.write(str(content)+'\n')

            test_done = True

Now, if you want, you could also raise the limit at this point: after test_done = True, you could re-define limit to be something like 100,000. One catch: the while loop only continues while cur_length == limit, so if you raise limit mid-loop, you also need to update cur_length to match, or the loop will exit on the next check (see the sketch after the training code below). Now, let's do the training code:

        else:
            with open('train.from','a', encoding='utf8') as f:
                for content in df['parent'].values:
                    f.write(str(content)+'\n')

            with open('train.to','a', encoding='utf8') as f:
                for content in df['comment'].values:
                    f.write(str(content)+'\n')
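
As promised, a minimal sketch of that limit-raising tweak, dropped into the if not test_done branch (the 100,000 figure is just an example):

        if not test_done:
            # ... write the test files exactly as above ...
            test_done = True

            # Raise the chunk size for the training pulls. cur_length has to
            # be bumped as well, since the loop only continues while
            # cur_length == limit.
            limit = 100000
            cur_length = limit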

We could make this code simpler and better by making it a function instead, so we're not copying and pasting basically the same code (a sketch of that refactor comes right after the progress counter below). But... instead... let's keep this ball rolling:

        counter += 1
        if counter % 20 == 0:
            print(counter*limit,'rows completed so far')

Here, we'll see output every 20 steps, so every 100,000 pairs if we keep the limit at 5,000.
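
For the curious, the function refactor mentioned above might look something like this; write_pairs is a hypothetical helper name, not something from the earlier parts:

def write_pairs(df, from_path, to_path):
    # Append each parent to from_path and each matching reply to to_path,
    # one per line, keeping the two files aligned.
    with open(from_path, 'a', encoding='utf8') as f:
        for content in df['parent'].values:
            f.write(str(content)+'\n')
    with open(to_path, 'a', encoding='utf8') as f:
        for content in df['comment'].values:
            f.write(str(content)+'\n')

The if/else branch above would then collapse to write_pairs(df, 'test.from', 'test.to') followed by test_done = True in the first case, and write_pairs(df, 'train.from', 'train.to') in the other.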

Full code up to this point:

import sqlite3
import pandas as pd

timeframes = ['2015-05']

for timeframe in timeframes:
    connection = sqlite3.connect('{}.db'.format(timeframe))
    c = connection.cursor()
    limit = 5000  # rows per pull; raise this once the test files are done
    last_unix = 0  # unix time of the newest row pulled so far
    cur_length = limit  # loop sentinel: a short pull means we're finished
    counter = 0  # for progress output
    test_done = False  # True once the test files have been written

    while cur_length == limit:

        # Pull the next chunk of (parent, reply) pairs, ordered by time.
        df = pd.read_sql("SELECT * FROM parent_reply WHERE unix > {} and parent NOT NULL and score > 0 ORDER BY unix ASC LIMIT {}".format(last_unix,limit),connection)
        cur_length = len(df)
        if cur_length == 0:
            break  # nothing left; don't index into an empty dataframe
        last_unix = df.tail(1)['unix'].values[0]

        if not test_done:
            with open('test.from','a', encoding='utf8') as f:
                for content in df['parent'].values:
                    f.write(str(content)+'\n')

            with open('test.to','a', encoding='utf8') as f:
                for content in df['comment'].values:
                    f.write(str(content)+'\n')

            test_done = True

        else:
            with open('train.from','a', encoding='utf8') as f:
                for content in df['parent'].values:
                    f.write(str(content)+'\n')

            with open('train.to','a', encoding='utf8') as f:
                for content in df['comment'].values:
                    f.write(str(content)+'\n')

        counter += 1
        if counter % 20 == 0:
            print(counter*limit,'rows completed so far')

Alright, run that, and I'll see you when you've got the data ready!
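
One optional sanity check before moving on: each .from file must have exactly the same number of lines as its .to partner, or the pairs are out of sync. A quick way to verify:

# Count lines in each pair of files and flag any mismatch.
for pair in [('test.from', 'test.to'), ('train.from', 'train.to')]:
    counts = []
    for path in pair:
        with open(path, encoding='utf8') as f:
            counts.append(sum(1 for _ in f))
    print(pair, counts, 'OK' if counts[0] == counts[1] else 'MISMATCH')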

The next tutorial: Training a Model - Creating a Chatbot with Deep Learning, Python, and TensorFlow Part 7

  • Creating a Chatbot with Deep Learning, Python, and TensorFlow Part 1
  • Chat Data Structure - Creating a Chatbot with Deep Learning, Python, and TensorFlow Part 2
  • Buffering Data - Creating a Chatbot with Deep Learning, Python, and TensorFlow Part 3
  • Insert Logic - Creating a Chatbot with Deep Learning, Python, and TensorFlow Part 4
  • Building Database - Creating a Chatbot with Deep Learning, Python, and TensorFlow Part 5
  • Training Dataset - Creating a Chatbot with Deep Learning, Python, and TensorFlow Part 6
  • Training a Model - Creating a Chatbot with Deep Learning, Python, and TensorFlow Part 7
  • Exploring concepts and parameters of our NMT Model - Creating a Chatbot with Deep Learning, Python, and TensorFlow Part 8
  • Interacting with our Chatbot - Creating a Chatbot with Deep Learning, Python, and TensorFlow Part 9