Parsing Reddit Comments - Python Reddit API Wrapper (PRAW) tutorial p.2




p2. Parsing Comments

In this part of our PRAW (Python Reddit API Wrapper) tutorial, we're going to get more familiar with PRAW and the Reddit API by parsing comments and giving them some structure.

To do this, let's dive into a subreddit submission:

# `subreddit` is the Subreddit object we created in part 1 of this tutorial.
hot_python = subreddit.hot(limit=3)
for submission in hot_python:
    if not submission.stickied:
        print('Title: {}, ups: {}, downs: {}, Have we visited?: {}'.format(submission.title,
                                                                           submission.ups,
                                                                           submission.downs,
                                                                           submission.visited))
        # Top-level comments only; each one exposes its direct replies.
        comments = submission.comments
        for comment in comments:
            print(20*'-')
            print(comment.body)
            # One level of replies; anything nested deeper is not printed.
            for reply in comment.replies:
                print('REPLY:')
                print("\t"+reply.body)
Title: Why is Python 50% faster under Windows Subsystem for Linux?, ups: 152, downs: 0, Have we visited?: False
--------------------
Try again with Python 3.6, that uses a much more recent version of MSVC. See https://wiki.python.org/moin/WindowsCompilers for a table matching Visual Studio versions to Python versions.
--------------------
I don't think psf builds from the website have all the optimisations on in case of edge case bugs.

I can't read your post due to formatting so I can't tell what your running, so probably an optimised build.

I noticed that Ubuntu's build by Ubuntu is faster in an Ubuntu VM on a Mac, then native Mac build from psf. So the compiler optimisation gain is more than the overhead for running a VM.
REPLY:
	Added some formatting, but the important part is:

 - Native python.exe: 200 410 pystones/second
 - WSL python binary: 299 065 pystones/second

The WSL python was compiled by me as there is no Python 2.7.13 in APT yet.
--------------------
One thing that's worth noting is that WSL is a _subsystem_, just like Win32 is a subsystem: you're actually bypassing parts of normal Windows, and using just the underlying NT kernel + the Linux interface on top of it, instead of the usual Win32 interface. This includes what UNIX people would call the C library, which includes things like memory allocation. It's entirely possible that the glibc memory allocator is algorithmically faster or just more efficient (talks to the OS less often and maintains a larger pool of memory) than the Windows one.

I think you'd need to find some sort of profiling tool to get a real answer. [Windows Performance Toolkit](https://docs.microsoft.com/en-us/windows-hardware/test/wpt/) sounds like the right place to start, and _probably_ it's capable of tracing WSL processes. See if you can generate flamegraphs of the two and see where things are slower in the Windows version.
--------------------
Just a guess: it is compiled with gcc, which is very good at optimizing code.  A bit surprising if a Microsoft compiler is that much slower, I would have expected only a marginal difference.  But this is my best guess. 

REPLY:
	It might be that Microsoft compiler don't support computed goto which can make  a huge difference in an interpreter. I believe other specialties of gcc are also used which also could make difference. I believe the main reason is that the main part developers has primary target unixes and so target to gcc or clang. So they use gcc and clang idioms when possible. It could exists some idioms which makes Visual C code faster but there is not enough windows coder amoung core python developer to use it.
	It has nothing to do with compilers. The author said he BUILT python for WSL from source. It's the difference between generic binary build and custom optimized build.
--------------------
Ubuntu's python package use PGO (some overview: https://www.activestate.com/blog/2014/06/python-performance-boost-using-profile-guided-optimization) - I suspect Windows builds don't (2.7 is ancient).

Also, pystones don't necessarily correlate to real world performance.
--------------------
Part of me was hoping this was a rhetorical question and some explaining would be presented when i clicked the link. I was wrong.
--------------------
Out of curiosity, can you run the same test on your machine under native GNU/Linux?
--------------------
I'm just going to put this out there for you. Your statement is invalid. There are a number of issues with your test. The first of which is you have not run your tests for long enough. The second big one is that you are testing how fast `test.pystone` runs and `sys.version` can print the version number out. They both seem like things that don't matter in a production app. I can't think of a time I needed to run `test.pystone` or print the version in a production app.

So this is one option, but it only reaches two levels deep, and we don't really know how deep the comments go. Handling that in general means recursion. There are several ways to do it, but PRAW already has a built-in solution: calling .list() on the comments, as in submission.comments.list().
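
To make the recursion problem concrete, here is a minimal sketch of walking a comment tree of unknown depth by hand. FakeComment is a hypothetical stand-in for a PRAW comment (only a body and a list of replies), so the example runs without touching Reddit; the same idea would apply to real comment objects.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class FakeComment:
    """Hypothetical stand-in for a PRAW comment: a body plus its replies."""
    body: str
    replies: List["FakeComment"] = field(default_factory=list)


def walk(comments: List[FakeComment], depth: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (depth, body) for every comment, each parent before its replies."""
    for comment in comments:
        yield depth, comment.body
        # Recurse into the replies, one level deeper each time.
        yield from walk(comment.replies, depth + 1)


thread = [
    FakeComment("top-level A", [
        FakeComment("reply to A", [FakeComment("reply to the reply")]),
    ]),
    FakeComment("top-level B"),
]

for depth, body in walk(thread):
    print("\t" * depth + body)
```

This visits the thread depth-first, so every reply is printed directly under its parent, however deep the nesting goes.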

That said, Reddit also shows a "load more comments" link on longer comment trees, which we need to handle as well. Once again PRAW comes to the rescue, this time with the replace_more function. This replaces MoreComments objects for you, up to a default limit of 32. Each MoreComments replacement requires another API call, which counts against your quota (30 API requests per minute).

PRAW automatically handles your request limit, so you shouldn't need to worry about breaching the rules. The one thing you might want to note is that PRAW is not thread-safe.

Alright, so let's see an example for both the comments.list() and replace_more() functionalities:

hot_python = subreddit.hot(limit=3)
for submission in hot_python:
    if not submission.stickied:
        print('Title: {}, ups: {}, downs: {}, Have we visited?: {}, subid: {}'.format(submission.title,
                                                                                      submission.ups,
                                                                                      submission.downs,
                                                                                      submission.visited,
                                                                                      submission.id))
        # limit=0 removes every MoreComments placeholder without fetching it
        submission.comments.replace_more(limit=0)
        # limiting to 15 results to save output
        for comment in submission.comments.list()[:15]:
            print(20*'#')
            # parent() returns an object; printing it shows the parent's ID
            print('Parent ID:',comment.parent())
            print('Comment ID:',comment.id)
            # limiting output for space-saving-sake, feel free to not do this
            print(comment.body[:200])
Title: My code for Tic-Tac-Toe (beginner), compared to my mate that works at google's code., ups: 243, downs: 0, Have we visited?: False, subid: 6qvu38
####################
Parent ID: 6qvu38
Comment ID: dl0hgpl
Your version is slightly better in the sense that pos 1-2-3 is considered a win :)
####################
Parent ID: 6qvu38
Comment ID: dl0f8oo
Thanks for sharing. I hope your learn from his code. It's about removing a lot of the dublication and redundancies. Code is shorter, more concise (therefore easier to read), and if you want to change 
####################
Parent ID: 6qvu38
Comment ID: dl0mmxm
Not a true comparison as he wrote his after he saw yours. Would be better to compare it before he saw yours. 
####################
Parent ID: 6qvu38
Comment ID: dl0l6ll
Why are his functions capitalized?
####################
Parent ID: 6qvu38
Comment ID: dl0rgxt
Avoid the `global` statement. Use `return` in your functions instead.
####################
Parent ID: 6qvu38
Comment ID: dl0iieq
There's still room for improvements. If you define the win combinations as a constant, the `win` function could be a one liner.

    WIN_COMBINATIONS = (
        (1, 2, 3), (4, 5, 6), (7, 8, 9), (1, 4
####################
Parent ID: 6qvu38
Comment ID: dl0hm8p
His `PlayerWin` is wrong. Take it as a win ;).
####################
Parent ID: 6qvu38
Comment ID: dl0knk3
2 space indentation

Yep, he works at Google alright.
####################
Parent ID: 6qvu38
Comment ID: dl0okyr
This habit of starting sentences with an unnecessary 'so' is out of control. The googler does it inside a print()
####################
Parent ID: 6qvu38
Comment ID: dl0mty7
Here is my solution. Can anyone come up with a better way to validate vertical/horizontal/diagnal without specific cases?

    import os
    from collections import OrderedDict
    
    turn = 1
    

####################
Parent ID: 6qvu38
Comment ID: dl0na2y
A few less than conventional choices in my version, but I'm prepared to defend them all:

    from itertools import cycle
    
    
    def legal_moves(board):
        return set(board) - set('XO')
  
####################
Parent ID: 6qvu38
Comment ID: dl0usyh
[This](https://github.com/Ema0/PythonStuff/blob/master/tris.py) is what I came up with a week ago. If someone has suggestions/improvements on my code feel free to comment.
####################
Parent ID: 6qvu38
Comment ID: dl0hqud
Thanks for posting. My employer does tictactoe live coding exercise as part of interview.  It's really interesting seeing the variety if solutions. But sad how many supposed developers can't read, ask
####################
Parent ID: 6qvu38
Comment ID: dl0icq0
his code is pretty bad too though.
####################
Parent ID: 6qvu38
Comment ID: dl0hr1q
Hey this was super interesting, thanks for sharing!

So what do we have here? If you compare the output to the actual thread in your browser, you should find the comments are all here (apart from any hidden behind "load more comments", since limit=0 removes those placeholders instead of fetching them), but not necessarily in the order you were expecting. When you use comments.list(), you get all of the top-level comments first, followed by the 2nd-level comments, then the 3rd level, and so on. They still aren't sorted the way the thread displays them, but you have every loaded comment and can access each comment's ID and its parent's ID.
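
The level-by-level order described above is what a breadth-first traversal produces. Here is a minimal sketch of such a flatten, again using a hypothetical FakeComment stand-in rather than real PRAW objects. It illustrates the ordering only and is not PRAW's actual source.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeComment:
    """Hypothetical stand-in for a PRAW comment: an ID plus its replies."""
    id: str
    replies: List["FakeComment"] = field(default_factory=list)


def flatten_breadth_first(top_level: List[FakeComment]) -> List[FakeComment]:
    """Return every comment, all of one level before any of the next."""
    queue = deque(top_level)
    flat = []
    while queue:
        comment = queue.popleft()
        flat.append(comment)
        # Replies go to the back of the queue, behind the current level.
        queue.extend(comment.replies)
    return flat


thread = [
    FakeComment("a", [FakeComment("a1", [FakeComment("a1x")]), FakeComment("a2")]),
    FakeComment("b", [FakeComment("b1")]),
]

print([comment.id for comment in flatten_breadth_first(thread)])
```

Note that "a1x" comes last even though it belongs under "a": a flat list in this order keeps no visible nesting, which is why the parent ID matters.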

My reason for wanting to use PRAW is to get context-specific conversational data, so I am interested in comment-response pairs. How might we build something to collect comments and their responses?

One option is to build a dictionary keyed by comment ID, where each entry holds the comment's text and a nested dictionary of its replies, since many comments have multiple responses.

conversedict = {}
hot_python = subreddit.hot(limit=3)

for submission in hot_python:
    if not submission.stickied:
        print('Title: {}, ups: {}, downs: {}, Have we visited?: {}, subid: {}'.format(submission.title,
                                                                                      submission.ups,
                                                                                      submission.downs,
                                                                                      submission.visited,
                                                                                      submission.id))

        submission.comments.replace_more(limit=0)
        for comment in submission.comments.list():
            if comment.id not in conversedict:
                # every comment gets an entry: [its text, {its replies}]
                conversedict[comment.id] = [comment.body,{}]
                # a parent other than the submission means this is a reply
                if comment.parent() != submission.id:
                    parent = str(comment.parent())
                    # parents come earlier in .list(), so the entry exists
                    conversedict[parent][1][comment.id] = [comment.ups, comment.body]
Title: My code for Tic-Tac-Toe (beginner), compared to my mate that works at google's code., ups: 237, downs: 0, Have we visited?: False, subid: 6qvu38

That dictionary might look pretty dense and confusing, so here's a pseudocode-ish breakdown:

conversedict = {post_id: [parent_content, {reply_id:[votes, reply_content],
                                            reply_id:[votes, reply_content],
                                            reply_id:[votes, reply_content]}],

                post_id: [parent_content, {reply_id:[votes, reply_content],
                                            reply_id:[votes, reply_content],
                                            reply_id:[votes, reply_content]}],
                                            
                post_id: [parent_content, {reply_id:[votes, reply_content],
                                            reply_id:[votes, reply_content],
                                            reply_id:[votes, reply_content]}],
                }

In this case, we have every message, that message's contents, and every reply to it, along with each reply's votes, just in case we want some metric to sort or filter by. Note that post_id here is a comment's ID. Now, for example, we can iterate through this, printing only the messages that have more than one reply, like so:

for post_id in conversedict:
    message = conversedict[post_id][0]
    replies = conversedict[post_id][1]
    if len(replies) > 1:
        print(35*'_')
        print('Original Message: {}'.format(message))
        
        print('Replies:')
        for reply in replies:
            print('--')
            print(replies[reply][1][:200]) # again, limiting to 200 characters for space-saving, not necessary
___________________________________
Original Message:     perl -wle 'print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/'
Replies:
--
Perl is a write only language. Everyone knows that so you can't just whip that out as the example. ;)



--
> !~ /^1?$|^(11+?)\1+$/

Perl always looks to me like the programmer had a stroke and landed face down on the keyboard.
___________________________________
Original Message: really? Tic Tac Toe was a high school programming exercise for me.
Replies:
--
You're pretty privileged to have gone to a HS that has that.  
--
Even today, there are vast numbers of high schools that don't offer any programming classes.
___________________________________
Original Message: Why are his functions capitalized?
Replies:
--
Input was capitalized as a Py2/3 input/raw_input hybrid, if you didn't notice.
--
Google style guide I think.
___________________________________
Original Message: Yeah I'm really not impressed by it. There's a lot of pep8 and other style violations which surprised me. I assumed one of the major differences between the novice and "pro" was going to be the better design patterns and adherence to style guide. 
Replies:
--
Indeed, don't just blindly take his code as an example of how you should've done it OP.
--
Just because the code doesn't conform to PEP8 doesn't mean it's bad. PEP8, by the way, [is originally intended as a style guide for the Python standard library](https://www.python.org/dev/peps/pep-000
___________________________________
Original Message: >...therefore easier to read...

Not always.
Replies:
--
Was referring only to more concise, not shorter
--
> > ...therefore easier to read...

> Not always.

actually, almost never. 

I think OP's code is actually better from a maintainability point of view. 
--
Care to elaborate? 
___________________________________
Original Message: Thanks for posting. My employer does tictactoe live coding exercise as part of interview.  It's really interesting seeing the variety if solutions. But sad how many supposed developers can't read, ask questions and formulate a plan/architecture to solve this fairly simple  problem.  Many struggle with looking to display board or how to detect winners or when game is over. They have 45min and most never get close to working code.

So, just being able to complete this you are already ahead if 75% of people I've interviewed. Good job!
Replies:
--
really? Tic Tac Toe was a high school programming exercise for me.
--
What sorts of questions are you looking for in situations like that? Or at least, what would be a good question to ask?
___________________________________
Original Message: Care to elaborate? 
Replies:
--
Some short code can be more difficult to understand than a longer version that does the same thing. For example, consider:
```
void xorSwap (int *x, int *y) {
    *x^=*y^(*y=*x);
}
```
Compared to
```
--
shorter and more concise is not necessarily easier to read. Code can be short and concise and difficult to read. See anything on [codegolf](https://codegolf.stackexchange.com/) for example.
--
    perl -wle 'print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/'
___________________________________
Original Message: Input was capitalized as a Py2/3 input/raw_input hybrid, if you didn't notice.
Replies:
--
...all functions are written in PascalCase. It's not the greatest of style.
--
Huh? Yeah I see that, how does that have anything to do with why the function name is capitalized?
___________________________________
Original Message: his code is pretty bad too though.
Replies:
--
Yeah I'm really not impressed by it. There's a lot of pep8 and other style violations which surprised me. I assumed one of the major differences between the novice and "pro" was going to be the better
--
Why?
___________________________________
Original Message: > !~ /^1?$|^(11+?)\1+$/

Perl always looks to me like the programmer had a stroke and landed face down on the keyboard.
Replies:
--
You should look up J or APL. Here's a implementation of quicksort in J:

    quicksort=: (($:@(<#[), (=#[), $:@(>#[)) ({~ ?@#)) ^: (1<#) 
--
There's just something wrong with a language where you can enter any random sequence of characters and probably get a working program. No idea what it'll do, but it'll do *something*.
--
> 1987 - Larry Wall falls asleep and hits Larry Wall's forehead on the keyboard. Upon waking Larry Wall decides that the string of characters on Larry Wall's monitor isn't random but an example progra
--
the user TheTerrasque gave an example of very short code which is cryptic but gets the job done -- rebuffing the silly assertion that shorter code is more readable.

You can write cryptic code in any 
--
Which I believe is exactly what happens when you try to debug anyone else's perl code.
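
Since every reply is stored with its vote count, one way to turn this dictionary into comment-response pairs is to keep only the highest-voted reply for each message. The sketch below is one possible approach, not the only one. The helper name top_reply_pairs, the min_votes parameter, and the sample data are all made up for illustration; only the dictionary shape matches conversedict.

```python
def top_reply_pairs(conversedict, min_votes=1):
    """Return (message, best_reply) pairs, one per message that has replies."""
    pairs = []
    for message, replies in conversedict.values():
        if not replies:
            continue
        # Each reply is stored as [votes, reply_content]; take the top-voted one.
        votes, reply_body = max(replies.values(), key=lambda reply: reply[0])
        if votes >= min_votes:
            pairs.append((message, reply_body))
    return pairs


# Made-up data in the same shape as conversedict, for illustration only.
sample = {
    "aaa": ["parent comment", {"bbb": [3, "popular reply"],
                               "ccc": [1, "less popular reply"]}],
    "bbb": ["popular reply", {"ddd": [0, "reply with no votes"]}],
    "ccc": ["less popular reply", {}],
    "ddd": ["reply with no votes", {}],
}

for message, reply in top_reply_pairs(sample):
    print("{!r} -> {!r}".format(message, reply))
```

Raising min_votes is a cheap way to filter out low-quality responses before using the pairs anywhere else.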

Instead of a comment-reply style dictionary, you could also re-create the full comment tree in dictionary/JSON form, or whatever structure you want.
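
As a rough sketch of the tree idea, the snippet below nests flat (comment_id, parent_id, body) records under their parents and dumps the result as JSON. The build_tree helper and the sample records are hypothetical. With PRAW you could build such records from comment.id, str(comment.parent()) and comment.body while looping over submission.comments.list(). The sketch assumes parents appear before their replies, which the level-by-level order of .list() gives you.

```python
import json


def build_tree(records):
    """Nest flat (comment_id, parent_id, body) records under their parents.

    Assumes each parent appears before its replies in `records`.
    """
    nodes = {}
    roots = []
    for comment_id, parent_id, body in records:
        node = {"id": comment_id, "body": body, "replies": []}
        nodes[comment_id] = node
        if parent_id in nodes:
            nodes[parent_id]["replies"].append(node)
        else:
            # Unknown parent: treat it as the submission, so this is top-level.
            roots.append(node)
    return roots


# Made-up records; "sub" plays the role of the submission ID.
records = [
    ("c1", "sub", "first top-level comment"),
    ("c2", "sub", "second top-level comment"),
    ("c3", "c1", "reply to the first comment"),
    ("c4", "c3", "reply to that reply"),
]

tree = build_tree(records)
print(json.dumps(tree, indent=2))
```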

Full code up to this point:

import praw

reddit = praw.Reddit(client_id='clientid',
                     client_secret='secret', password='password',
                     user_agent='PrawTut', username='username')

subreddit = reddit.subreddit('python')

conversedict = {}
hot_python = subreddit.hot(limit=3)

for submission in hot_python:
    if not submission.stickied:
        print('Title: {}, ups: {}, downs: {}, Have we visited?: {}, subid: {}'.format(submission.title,
                                                                                      submission.ups,
                                                                                      submission.downs,
                                                                                      submission.visited,
                                                                                      submission.id))

        submission.comments.replace_more(limit=0)
        for comment in submission.comments.list():
            if comment.id not in conversedict:
                conversedict[comment.id] = [comment.body,{}]
                if comment.parent() != submission.id:
                    parent = str(comment.parent())
                    conversedict[parent][1][comment.id] = [comment.ups, comment.body]


# Dictionary Format#
'''
conversedict = {post_id: [parent_content, {reply_id:[votes, reply_content],
                                            reply_id:[votes, reply_content],
                                            reply_id:[votes, reply_content]}],

                post_id: [parent_content, {reply_id:[votes, reply_content],
                                            reply_id:[votes, reply_content],
                                            reply_id:[votes, reply_content]}],
                                            
                post_id: [parent_content, {reply_id:[votes, reply_content],
                                            reply_id:[votes, reply_content],
                                            reply_id:[votes, reply_content]}],
                }


'''

for post_id in conversedict:
    message = conversedict[post_id][0]
    replies = conversedict[post_id][1]
    if len(replies) > 1:
        print('Original Message: {}'.format(message))
        print(35*'_')
        print('Replies:')
        for reply in replies:
            print(replies[reply])

In the next tutorial, we're going to cover streaming comments and submissions live from Reddit.
