Welcome to part 4 of the chatbot with Python and TensorFlow tutorial series. Leading up to this, we've gathered our data and begun iterating through it. Now we're ready to start building the actual logic for inputting the data into our database.
To begin, I would like to impose a restriction on *all* comments, regardless of anything else: we only want to deal with comments that aren't pointless. For this reason, we'll only consider comments with a score of two or higher. The code leading up to now:
```python
import sqlite3
import json
from datetime import datetime

timeframe = '2015-05'
sql_transaction = []

connection = sqlite3.connect('{}.db'.format(timeframe))
c = connection.cursor()

def create_table():
    c.execute("CREATE TABLE IF NOT EXISTS parent_reply(parent_id TEXT PRIMARY KEY, comment_id TEXT UNIQUE, parent TEXT, comment TEXT, subreddit TEXT, unix INT, score INT)")

def format_data(data):
    # Normalize newlines and double quotes so the comment plays nicely with the database and later tokenization
    data = data.replace('\n',' newlinechar ').replace('\r',' newlinechar ').replace('"',"'")
    return data

def find_parent(pid):
    # Return the body of the parent comment if we've already stored it, otherwise False
    try:
        sql = "SELECT comment FROM parent_reply WHERE comment_id = '{}' LIMIT 1".format(pid)
        c.execute(sql)
        result = c.fetchone()
        if result != None:
            return result[0]
        else:
            return False
    except Exception as e:
        #print(str(e))
        return False

if __name__ == '__main__':
    create_table()
    row_counter = 0
    paired_rows = 0

    with open('J:/chatdata/reddit_data/{}/RC_{}'.format(timeframe.split('-')[0],timeframe), buffering=1000) as f:
        for row in f:
            row_counter += 1
            row = json.loads(row)
            parent_id = row['parent_id']
            body = format_data(row['body'])
            created_utc = row['created_utc']
            score = row['score']
            comment_id = row['name']
            subreddit = row['subreddit']
            parent_data = find_parent(parent_id)
```
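As a quick sanity check on format_data, here's what the substitution does to a string containing a newline and double quotes (this example string is just for illustration, not from the dataset):

```python
# Illustrative only: newlines become ' newlinechar ' and double quotes become single quotes
print(format_data('First line\nSecond "quoted" line'))
# First line newlinechar Second 'quoted' line
```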
Now let's require the score to be two or higher, and then let's also see if there's already an existing reply to the parent, and what its score is:
```python
if __name__ == '__main__':
    create_table()
    row_counter = 0
    paired_rows = 0

    with open('J:/chatdata/reddit_data/{}/RC_{}'.format(timeframe.split('-')[0],timeframe), buffering=1000) as f:
        for row in f:
            row_counter += 1
            row = json.loads(row)
            parent_id = row['parent_id']
            body = format_data(row['body'])
            created_utc = row['created_utc']
            score = row['score']
            comment_id = row['name']
            subreddit = row['subreddit']
            parent_data = find_parent(parent_id)

            # maybe check for a child, if child, is our new score superior? If so, replace. If not...
            if score >= 2:
                existing_comment_score = find_existing_score(parent_id)
```
Now we need to create the find_existing_score function:
```python
def find_existing_score(pid):
    # Return the score of the reply already stored for this parent, or False if there isn't one
    try:
        sql = "SELECT score FROM parent_reply WHERE parent_id = '{}' LIMIT 1".format(pid)
        c.execute(sql)
        result = c.fetchone()
        if result != None:
            return result[0]
        else:
            return False
    except Exception as e:
        #print(str(e))
        return False
```
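On its own, the helper returns either the stored reply's score or False. A quick illustration of the call shape (the parent id below is made up):

```python
# Hypothetical parent id, purely illustrative
existing = find_existing_score('t3_36ebrs')
if existing:
    print('Existing reply score:', existing)
else:
    print('No reply stored for this parent yet')
```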
If there is an existing comment, and if our score is higher than the existing comment's score, we'd like to replace it:
```python
            if score >= 2:
                existing_comment_score = find_existing_score(parent_id)
                if existing_comment_score:
                    if score > existing_comment_score:
                        pass  # the higher-scored comment will replace the stored row (covered in the next part)
```
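To make the intent of that innermost branch concrete, here is a minimal sketch of the kind of insert-or-replace helper it will eventually call. The function name sql_insert_replace_comment, the parameterized query, and the immediate commit are my assumptions for illustration only; the tutorial builds its own version (with transaction batching) in the next part:

```python
def sql_insert_replace_comment(commentid, parentid, parent, comment, subreddit, time, score):
    # Hypothetical sketch: overwrite the stored reply for this parent with the higher-scored comment.
    # Parameterized placeholders avoid quoting problems with comment text.
    try:
        c.execute("""UPDATE parent_reply
                     SET parent_id = ?, comment_id = ?, parent = ?, comment = ?, subreddit = ?, unix = ?, score = ?
                     WHERE parent_id = ?""",
                  (parentid, commentid, parent, comment, subreddit, time, score, parentid))
        connection.commit()
    except Exception as e:
        print('SQL UPDATE error:', str(e))
```

A call like sql_insert_replace_comment(comment_id, parent_id, parent_data, body, subreddit, created_utc, score) would then sit in the body of that innermost if.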
Next, many comments have been deleted or removed, and some comments are very long or very short. We want to make sure comments are of an acceptable length for training, and that the comment wasn't deleted or removed:
```python
def acceptable(data):
    # Reject comments that are too long, too short/empty, or were deleted/removed
    if len(data.split(' ')) > 50 or len(data) < 1:
        return False
    elif len(data) > 1000:
        return False
    elif data == '[deleted]':
        return False
    elif data == '[removed]':
        return False
    else:
        return True
```
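A few quick calls make the filter's behavior concrete (these test strings are mine, purely for illustration):

```python
print(acceptable('[deleted]'))          # False - comment was deleted
print(acceptable('word ' * 60))         # False - more than 50 space-separated tokens
print(acceptable('This looks fine.'))   # True
```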
Alright, at this point, we're ready to actually start inserting data, which is what we'll be doing in the next tutorial.