In this part of our PRAW (Python Reddit API Wrapper) tutorial, we're going to get more familiar with PRAW and the Reddit API by parsing comments and giving them some structure.
To do this, let's dive into a subreddit submission:
import time

# Look at the three hottest submissions in the subreddit
hot_python = subreddit.hot(limit=3)
for submission in hot_python:
    if not submission.stickied:
        print('Title: {}, ups: {}, downs: {}, Have we visited?: {}'.format(submission.title,
              submission.ups,
              submission.downs,
              submission.visited))
        comments = submission.comments
        for comment in comments:
            print(20*'-')
            print(comment.body)
            # Only goes one level down: replies to replies are not shown
            if len(comment.replies) > 0:
                for reply in comment.replies:
                    print('REPLY:')
                    print("\t"+reply.body)
So this is one option, but then we've got a recursion problem: we don't really know how deep the comments go. There are several ways to handle this, but PRAW already has a built-in solution. Calling .list() on the comments, as in submission.comments.list(), returns every comment in the tree as one flat list.
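To make the recursion problem concrete, here is a minimal sketch of the do-it-yourself alternative: a recursive walk that follows replies to any depth. The Node class is a hypothetical stand-in for a PRAW comment (it only mimics the body and replies attributes), so the example runs without touching the API.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a comment: a body plus nested replies."""
    body: str
    replies: List["Node"] = field(default_factory=list)


def walk(comments: List[Node], depth: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (depth, body) for every comment, however deep the tree goes."""
    for comment in comments:
        yield depth, comment.body
        # Recurse into the replies; the recursion ends at comments with none.
        yield from walk(comment.replies, depth + 1)


tree = [
    Node("top-level A", [Node("reply to A", [Node("reply to the reply")])]),
    Node("top-level B"),
]

for depth, body in walk(tree):
    print("\t" * depth + body)
```

This works, but you'd have to maintain it yourself, which is why the built-in flattening is the easier route.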
That said, Reddit also has a "load more comments" link on longer comment trees, which we need to handle as well. Once again PRAW comes in with the save, via the replace_more() method. This replaces MoreComments objects for you, up to a default limit of 32. Each MoreComments replacement requires another API call, which counts against your quota (30 API requests per minute at the time of writing).
PRAW automatically handles your request limit, so you shouldn't need to worry about breaching the rules. The only thing you might want to note is that PRAW is not thread-safe.
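Since every replacement costs an API call, the limit you pass to replace_more() is a trade-off between completeness and speed. Here is a small sketch of a helper that makes that choice explicit. The names all_comments and more_limit are made up for illustration; the only PRAW calls it relies on are replace_more(limit=...) and comments.list().

```python
def all_comments(submission, more_limit=0):
    """Return a flat list of a submission's comments.

    more_limit is handed straight to replace_more():
      0     drops every MoreComments placeholder without fetching it
      None  keeps fetching until no placeholders remain (slow on big threads)
      N     replaces at most N placeholders
    """
    submission.comments.replace_more(limit=more_limit)
    return submission.comments.list()
```

With a real submission you would call something like all_comments(submission) for a quick look, or all_comments(submission, more_limit=None) when you want the whole thread and don't mind waiting.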
Alright, so let's see an example that uses both comments.list() and replace_more():
hot_python = subreddit.hot(limit=3)
for submission in hot_python:
    if not submission.stickied:
        print('Title: {}, ups: {}, downs: {}, Have we visited?: {}, subid: {}'.format(submission.title,
              submission.ups,
              submission.downs,
              submission.visited,
              submission.id))
        submission.comments.replace_more(limit=0)
        # limiting to 15 results to save output
        for comment in submission.comments.list()[:15]:
            print(20*'#')
            print('Parent ID:',comment.parent())
            print('Comment ID:',comment.id)
            # limiting output for space-saving-sake, feel free to not do this
            print(comment.body[:200])
Alright, so what do we have here? If you compare the output to the actual thread in your browser, you should find that the comments are here (apart from any hidden behind "load more comments", since limit=0 removes those placeholders rather than fetching them), but not necessarily in the order you were expecting. comments.list() gives you all of the top-level comments first, followed by the 2nd-level comments, then the 3rd level, and so on. So these still aren't necessarily sorted how you want them, but you have every loaded comment, and can access each comment's id and its parent's id.
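Since every comment carries its own id and its parent's id, the flat list can be put back into thread order after the fact. Here's a minimal sketch of one way to do it, using plain (comment_id, parent_id, body) tuples as hypothetical stand-ins for the values you'd pull off each PRAW comment. The ids below are made up.

```python
from collections import defaultdict

# Flat, level-by-level records: (comment_id, parent_id, body).
# "sub1" is a made-up submission id; the rest are made-up comment ids.
flat = [
    ("c1", "sub1", "first top-level comment"),
    ("c2", "sub1", "second top-level comment"),
    ("c3", "c1", "reply to c1"),
    ("c4", "c2", "reply to c2"),
    ("c5", "c3", "reply to c3"),
]


def thread_order(records, root_id):
    """Return (depth, comment_id, body) tuples in the order a thread reads."""
    children = defaultdict(list)
    for comment_id, parent_id, body in records:
        children[parent_id].append((comment_id, body))

    ordered = []
    # Explicit stack instead of recursion, so very deep threads are fine.
    stack = [(0, cid, body) for cid, body in reversed(children[root_id])]
    while stack:
        depth, cid, body = stack.pop()
        ordered.append((depth, cid, body))
        for child_id, child_body in reversed(children[cid]):
            stack.append((depth + 1, child_id, child_body))
    return ordered


for depth, cid, body in thread_order(flat, "sub1"):
    print("  " * depth + cid + ": " + body)
```

Each reply now prints directly under the comment it answers, indented by its depth.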
My reason for wanting to use PRAW is to get context-specific conversational data. Thus, for me, I am interested in comment-response pairs. How might we build something to specifically get comments and their responses?
One option could be to just build a dictionary keyed by comment id, since a single comment might have multiple responses.
conversedict = {}
hot_python = subreddit.hot(limit=3)
for submission in hot_python:
    if not submission.stickied:
        print('Title: {}, ups: {}, downs: {}, Have we visited?: {}, subid: {}'.format(submission.title,
              submission.ups,
              submission.downs,
              submission.visited,
              submission.id))
        submission.comments.replace_more(limit=0)
        for comment in submission.comments.list():
            if comment.id not in conversedict:
                conversedict[comment.id] = [comment.body,{}]
                if comment.parent() != submission.id:
                    parent = str(comment.parent())
                    conversedict[parent][1][comment.id] = [comment.ups, comment.body]
Alright, so that dictionary might be pretty dense and confusing. Here's a pseudocode-ish breakdown:
conversedict = {post_id: [parent_content, {reply_id:[votes, reply_content],
                                           reply_id:[votes, reply_content],
                                           reply_id:[votes, reply_content]}],
                post_id: [parent_content, {reply_id:[votes, reply_content],
                                           reply_id:[votes, reply_content],
                                           reply_id:[votes, reply_content]}],
                post_id: [parent_content, {reply_id:[votes, reply_content],
                                           reply_id:[votes, reply_content],
                                           reply_id:[votes, reply_content]}],
               }
In this case, we have every message, that message's contents, and then every reply to it, along with that reply's votes, just in case we want some metric to sort or filter by. Now, for example, we can iterate through this like so:
for post_id in conversedict:
    message = conversedict[post_id][0]
    replies = conversedict[post_id][1]
    if len(replies) > 1:
        print(35*'_')
        print('Original Message: {}'.format(message))
        print('Replies:')
        for reply in replies:
            print('--')
            print(replies[reply][1][:200]) # again, limiting to 200 characters for space-saving, not necessary
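Because each reply is stored with its vote count, you can also use the votes to pick the best response to each comment. Here's a minimal sketch, using a small hand-made dictionary in the same shape as conversedict. The ids, scores, and text are made up, and best_pairs and min_votes are hypothetical names for illustration.

```python
# Same layout as conversedict: {comment_id: [body, {reply_id: [votes, body]}]}
sample = {
    "c1": ["What is a decorator?", {
        "c3": [12, "A function that wraps another function."],
        "c4": [3, "Magic."],
    }],
    "c2": ["Thanks!", {}],
}


def best_pairs(convo, min_votes=1):
    """Return (comment, top_reply) pairs, keeping replies with enough votes."""
    pairs = []
    for body, replies in convo.values():
        if not replies:
            continue  # nobody answered this comment
        # The highest-voted reply wins; each reply is [votes, body].
        votes, reply_body = max(replies.values(), key=lambda r: r[0])
        if votes >= min_votes:
            pairs.append((body, reply_body))
    return pairs


for comment, reply in best_pairs(sample):
    print(comment, "->", reply)
```

Raising min_votes is a cheap way to filter out low-quality responses before using the pairs for anything.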
Instead of doing a comment-reply styled dictionary, you could also re-create the comment tree in dictionary/JSON form, or whatever you want.
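As a rough sketch of that idea, the snippet below nests each comment under its parent and dumps the result as JSON. It works on plain (comment_id, parent_id, body) tuples with made-up ids rather than live PRAW objects, and it assumes parents appear before their children, as they do in a level-by-level listing. build_tree is a hypothetical helper name.

```python
import json

# (comment_id, parent_id, body) records; parents appear before their children.
flat = [
    ("c1", "sub1", "first top-level comment"),
    ("c2", "sub1", "second top-level comment"),
    ("c3", "c1", "reply to c1"),
]


def build_tree(records, root_id):
    """Nest each comment under its parent and return the top-level list."""
    # The root entry stands for the submission itself.
    nodes = {root_id: {"replies": []}}
    for comment_id, parent_id, body in records:
        node = {"id": comment_id, "body": body, "replies": []}
        nodes[comment_id] = node
        nodes[parent_id]["replies"].append(node)
    return nodes[root_id]["replies"]


tree = build_tree(flat, "sub1")
print(json.dumps(tree, indent=2))
```

The nested form is handy when you want to keep whole conversations together rather than just single comment-reply pairs.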
Full code up to this point:
import praw

reddit = praw.Reddit(client_id='clientid',
                     client_secret='secret', password='password',
                     user_agent='PrawTut', username='username')

subreddit = reddit.subreddit('python')

conversedict = {}
hot_python = subreddit.hot(limit=3)
for submission in hot_python:
    if not submission.stickied:
        print('Title: {}, ups: {}, downs: {}, Have we visited?: {}, subid: {}'.format(submission.title,
              submission.ups,
              submission.downs,
              submission.visited,
              submission.id))
        submission.comments.replace_more(limit=0)
        for comment in submission.comments.list():
            if comment.id not in conversedict:
                conversedict[comment.id] = [comment.body,{}]
                if comment.parent() != submission.id:
                    parent = str(comment.parent())
                    conversedict[parent][1][comment.id] = [comment.ups, comment.body]

# Dictionary Format#
'''
conversedict = {post_id: [parent_content, {reply_id:[votes, reply_content],
                                           reply_id:[votes, reply_content],
                                           reply_id:[votes, reply_content]}],
                post_id: [parent_content, {reply_id:[votes, reply_content],
                                           reply_id:[votes, reply_content],
                                           reply_id:[votes, reply_content]}],
                post_id: [parent_content, {reply_id:[votes, reply_content],
                                           reply_id:[votes, reply_content],
                                           reply_id:[votes, reply_content]}],
               }
'''

for post_id in conversedict:
    message = conversedict[post_id][0]
    replies = conversedict[post_id][1]
    if len(replies) > 1:
        print('Original Message: {}'.format(message))
        print(35*'_')
        print('Replies:')
        for reply in replies:
            print(replies[reply])
In the next tutorial, we're going to cover streaming comments and submissions live from Reddit.