Insert Logic - Creating a Chatbot with Deep Learning, Python, and TensorFlow Part 4




Welcome to part 4 of the chatbot with Python and TensorFlow tutorial series. Leading up to this, we've gotten our data and begun to iterate through it. Now we're ready to begin building the actual logic for inputting the data.

To begin, I would like to impose a restriction on *all* comments, regardless if there are any others, and that is that we only want to deal with non-pointless comments. For this reason, I am going to say we only want to consider comments with a vote of two or higher. The code leading up to now:

import sqlite3
import json
from datetime import datetime

timeframe = '2015-05'
sql_transaction = []

connection = sqlite3.connect('{}.db'.format(timeframe))
c = connection.cursor()

def create_table():
    c.execute("CREATE TABLE IF NOT EXISTS parent_reply(parent_id TEXT PRIMARY KEY, comment_id TEXT UNIQUE, parent TEXT, comment TEXT, subreddit TEXT, unix INT, score INT)")

def format_data(data):
    data = data.replace('\n',' newlinechar ').replace('\r',' newlinechar ').replace('"',"'")
    return data

def find_parent(pid):
    try:
        sql = "SELECT comment FROM parent_reply WHERE comment_id = '{}' LIMIT 1".format(pid)
        c.execute(sql)
        result = c.fetchone()
        if result != None:
            return result[0]
        else: return False
    except Exception as e:
        #print(str(e))
        return False


if __name__ == '__main__':
    create_table()
    row_counter = 0
    paired_rows = 0

    with open('J:/chatdata/reddit_data/{}/RC_{}'.format(timeframe.split('-')[0],timeframe), buffering=1000) as f:
        for row in f:
            row_counter += 1
            row = json.loads(row)
            parent_id = row['parent_id']
            body = format_data(row['body'])
            created_utc = row['created_utc']
            score = row['score']
            comment_id = row['name']
            subreddit = row['subreddit']
            parent_data = find_parent(parent_id)

Now let's require the score to be two or higher, and then let's also see if there's already an existing reply to the parent, and what its score is:

if __name__ == '__main__':
    create_table()
    row_counter = 0
    paired_rows = 0

    with open('J:/chatdata/reddit_data/{}/RC_{}'.format(timeframe.split('-')[0],timeframe), buffering=1000) as f:
        for row in f:
            row_counter += 1
            row = json.loads(row)
            parent_id = row['parent_id']
            body = format_data(row['body'])
            created_utc = row['created_utc']
            score = row['score']
            comment_id = row['name']
            subreddit = row['subreddit']
            parent_data = find_parent(parent_id)
            # maybe check for a child, if child, is our new score superior? If so, replace. If not...

            if score >= 2:
                existing_comment_score = find_existing_score(parent_id)

Now, we need to create the find_existing_score function:

def find_existing_score(pid):
    try:
        sql = "SELECT score FROM parent_reply WHERE parent_id = '{}' LIMIT 1".format(pid)
        c.execute(sql)
        result = c.fetchone()
        if result != None:
            return result[0]
        else: return False
    except Exception as e:
        #print(str(e))
        return False

If there is an existing comment, and if our score is higher than the existing comment's score, we'd like to replace it:

            if score >= 2:
                existing_comment_score = find_existing_score(parent_id)
                if existing_comment_score:
                    if score > existing_comment_score:

Next, many comments are either deleted or removed, but also some comments are very long, or very short. We want to make sure comments are of an acceptable length for training, and that the comment wasn't removed or deleted:

def acceptable(data):
    if len(data.split(' ')) > 50 or len(data) < 1:
        return False
    elif len(data) > 1000:
        return False
    elif data == '[deleted]':
        return False
    elif data == '[removed]':
        return False
    else:
        return True

Alright, at this point, we're ready to actually start inserting data, which is what we'll be doing in the next tutorial.

The next tutorial:





  • Creating a Chatbot with Deep Learning, Python, and TensorFlow Part 1
  • Chat Data Structure - Creating a Chatbot with Deep Learning, Python, and TensorFlow Part 2
  • Buffering Data - Creating a Chatbot with Deep Learning, Python, and TensorFlow Part 3
  • Insert Logic - Creating a Chatbot with Deep Learning, Python, and TensorFlow Part 4
  • Building Database - Creating a Chatbot with Deep Learning, Python, and TensorFlow Part 5
  • Training Dataset - Creating a Chatbot with Deep Learning, Python, and TensorFlow Part 6
  • Training a Model - Creating a Chatbot with Deep Learning, Python, and TensorFlow Part 7
  • Exploring concepts and parameters of our NMT Model - Creating a Chatbot with Deep Learning, Python, and TensorFlow Part 8
  • Interacting with our Chatbot - Creating a Chatbot with Deep Learning, Python, and TensorFlow Part 9