Analyzing reddit comments using Python
2021-01-18 by Adam Goth · 8 min read
In this post, we'll take a look at how to build a simple Python script for word analysis. We will then apply it to the comment section of any given reddit post.
Overview
Between my job and side projects, I typically spend most of my time building web applications using React and Node. That means writing almost exclusively JavaScript. To keep my perspective on programming fresh and not strictly confined to a single language, I wanted to take a little time to step out of the world of JavaScript and explore the world of another programming language. I decided to come up with a little project idea and to build it with Python. Python is a powerful yet friendly programming language that is popular with beginners and experienced programmers alike. It was created in 1991 by Guido van Rossum, but continues to rise in popularity almost 30 years later. In 2020, Python was at or near the top of the list for in-demand languages for programming jobs. It was also deemed by Wired magazine to be more popular than ever before. After spending just a short time writing code with Python, it's not hard to see why it's a popular choice. Let's jump in.
Setting up
This post will assume you have basic programming knowledge and that you have
Python 3 installed. For more detailed information on installing Python,
start here. The repo for
this project can be found
here. You will notice a
.py
file containing the full script, as well as a .ipynb
file containing a
Jupyter Notebook for the script. The Jupyter Notebook is
an open-source web application that allows you to create and share documents
that contain live code, equations, visualizations, and narrative text, which can
make it easier to follow along and learn how a python script works.
The script
The script in its entirety can be found here.
The first thing we need to do is import the requests library. This is what we
will use to make the HTTP request to reddit to get the comment data from the
reddit post. After that, we will initialize a few global variables. We will use
these global variables to keep track of data as we parse through comments.
comment_count
is an integer and will track the number of comments we parse,
comment_array
is an array and will hold the actual comment strings, and
more_comment_ids
is another array that will hold ID strings that we will need
in order to fetch additional comments that are not returned in the initial
payload (commonly found in posts with many comments).
1# imports2import requests # The requests library for HTTP requests in Python34# globals5comment_count = 06comment_array = []7more_comment_ids = []
Next, we need to fetch the data for the reddit post. To do that, we can append
.json
to the end of any reddit post URL.
An example would be:
https://www.reddit.com/r/redditdev/comments/krolrb/multicomments.json
.
What we get back is JSON that will have a basic format that looks like this:
1{2 "kind": "Listing",3 "data": {4 "children": [5 "kind": "t1",6 "data": {7 "body": "",8 "replies": ""9 }10 ]11 }12}
A reddit post is referred to as a
"Listing". Listings can contain many
kinds of children. A child with a kind
of t1
indicates that the child
represents a comment. Within the comments data
property, among many other
properties, the text of the comment can be found on the body
property, along
with any possible replies which are located on the replies
property. Replies
are structured the same way as comments. They contain children and the children
has kind
and data
properties. Within every reply to a comment, we may see
another reply to that reply comment. Each of these contains their own
identically formatted children. So in order to analyze all comments within a
thread, we'll have to recursively sift through all comments and replies.
If having to follow each individual comment tree recursively to its end wasn't
tricky enough, there's another issue we have to worry about. Since comment
threads can become quite long, not every comment is always displayed on the
initial thread load. When this happens, reddit shows "load more replies" buttons
within threads. So how do we get these as well? To handle these instances, the
API will deliver a child with a kind
property value of more
.
1{2 "kind": "more",3 "data": {4 "count": 2,5 "name": "t1_ghp1m6v",6 "id": "ghp1m6v",7 "parent_id": "t1_ghozojl",8 "depth": 2,9 "children": [10 "ghp1m6v"11 ]12 }13}
The array of children
within the more
object will contain a list of thread
IDs that can be used to fetch additional comments. In the code example above,
there is just one child ID, ghp1m6v
. So in addition to parsing all comment
trees recursively, we will also have to collect any additional comment thread
IDs and then do the same thing for those.
Hopefully, you are still with me at this point. Talking about all of this without writing any code can be confusing, so let's try to break it down with some functions that will help us achieve this goal.
The first function we'll write is parse_children_for_comments
.
1def parse_children_for_comments(children):2 global comment_count3 global comment_array4 for child in children:5 if child['kind'] == "more":6 children = child['data']['children']7 for id in children:8 more_comment_ids.append(id)9 if child['kind'] == "t1":10 comment_count += 111 comment_array.append(child['data']['body'])12 get_replies(child['data'])
It will take an array of children
objects that are sent back in the response
data and will pull out the comment text which is found in the body
property.
For each child in the array argument of children
, we will check its kind
. If
the kind
is more
, we will loop through and add each id to the global array
we created, more_comment_ids
. We will eventually come back to this array of
ids and parse through it.
Next, if the kind
is t1
, that means we have a comment and we want to read
its text. In order to do that, we simply get the text with
child['data']['body']
and append it to our global comment_array
variable.
After appending the comment to the comment_array
, we need to check if there
are any replies to that comment. Since we will be doing this check many times,
it's best that we write a helper function for it. We'll call it get_replies
:
1def get_replies(comment):2 global comment_count3 if comment['replies'] != "":4 children = comment['replies']['data']['children']5 parse_children_for_comments(children)
First, we check if there are any replies. When there are no replies, the
replies
property will be an empty string. If the string is not empty, we know
we have a reply. As I mentioned above, replies take the same format as the
original comment it is replying to. So in order to parse the reply text, we can
reuse the same parse_children_for_comments
function we already wrote. Since
parse_children_for_comments
will again call get_replies
, and get_replies
will again call parse_children_for_comments
until there are no comments left,
this will recursively continue until we reach a child comment with an empty
replies
property. Pretty neat.
With those helper functions defined, we're ready to fetch our data. In order to
do this, we will use a built-in Python function called
input
which will
allow the user to enter a URL to a reddit post.
1# get url from user2print('enter the reddit post url (e.g. https://www.reddit.com/r/redditdev/comments/krolrb/multicomments/):')3thread_url = input()
We can expect the user to paste in a URL for a reddit post. For example, it may
look something like this:
https://www.reddit.com/r/redditdev/comments/krolrb/multicomments/
To get the post data, we need to turn
https://www.reddit.com/r/redditdev/comments/krolrb/multicomments/
into
https://www.reddit.com/r/redditdev/comments/krolrb/multicomments.json
.
To do that, we can write a small helper function.
1def sanitize_input(url):2 last_char = url[-1]3 if last_char == '/':4 url = url[:-1]5 url = f'{url}.json'6 return url
We pass the URL as an argument into the function. The function checks if the
last character of url
is a /
and removes it if it is. Then the function
appends .json
to the end of url
. After we pass the user's inputted URL to
this function, we're ready to fetch the post data.
1# pass user's url to sanitize helper2sanitized_thread_url = sanitize_input(thread_url)34# make network call5req_data = requests.get(sanitized_thread_url, headers={'User-agent': 'adamgoth.com'})67if req_data.status_code != 200:8 print('request failed')9 print(req_data.json())1011if req_data.status_code == 200:12 json_data = req_data.json()13 for item in json_data:14 children = item['data']['children']15 parse_children_for_comments(children)
We call requests.get()
, passing our URL as the first parameter, as well as a
headers value for a second parameter. The reason we need to specify a
User-agent
property in the header is so that we have a unique identity to
reddit. This will ensure we appear entirely anonymous and run into
rate-limiting issues.
Once we have our data back in our req_data
variable, the first thing we'll
check is if we did not get a 200
response for any reason. If the response is
not 200
, we will print out the error.
Assuming we get a 200
, we can then start parsing the data. We can use the
requests library built-in JSON decoder and called .json()
on the response. We
then write a simple for
statement that takes each child in the response data
and passes it to the parse_children_for_comments
we previously discussed.
After the for
loop from line 13 completes, we should have a number of comments
stored in our global comment_array
. Additionally, depending on the number of
comments from the post, we may have found some additional comment IDs and stored
them in our global more_comment_ids
array. As a reminder, these are IDs we can
use to fetch more comments that did not appear in the initial load. In the
reddit UI, these represent the links within comment threads that appear as "load
more replies", and in our data response, these IDs come from the children that
have a kind
property value of more
.
The URL for fetching the additional comment data looks similar to the URL we
used for fetching the initial post data. The only difference is the comment ID
is appended to the end. So
https://www.reddit.com/r/redditdev/comments/krolrb/multicomments.json
becomes
https://www.reddit.com/r/redditdev/comments/krolrb/multicomments/{comment_id}.json
.
We can write a simple helper function to do this for us.
1def create_thread_url(comment_id):2 return sanitized_thread_url.replace('.json', f'/{comment_id}.json')
We simply pass the comment_id
as an argument and then do a string replace on
.json
with /{comment_id}.json
.
We're then ready to make the requests for the additional comments.
1# handle extra comment ids2for id in more_comment_ids:3 req_data = requests.get(create_thread_url(4 id), headers={'User-agent': 'adamgoth.com'})5 if req_data.status_code != 200:6 print('request failed')7 print(req_data.json())89 if req_data.status_code == 200:10 json_data = req_data.json()11 for item in json_data:12 children = item['data']['children']13 parse_children_for_comments(children)
To fetch the additional comments, we'll use another for
loop to loop through
each ID in the more_comment_ids
array. For each one, we again use
requests.get()
, passing the comment ID to the create_thread_url
function we
just wrote, along with the same User-agent
header as our previous request.
Once we have our response, we again check the status code, and if it's
successful, we'll parse the data the same way we did before, passing each child
in the data to parse_children_for_comments
. As a word of caution, for posts
with thousands of comment replies, this can result in a large number of
additional comment IDs. It's possible to have hundreds of IDs to fetch. Each one
of these will require a synchronous network call, so it can take quite a while
if this is the case.
Once all the additional comment IDs have been fetched, we have all the data we
need to run our word analysis. To do this, we will combine all of the comments
in our global comment_array
variable into a single string. We will then write
a function which will parse that string and keep track of how many times each
word appears. The function to do that looks like this:
1def analyze_words(words):2 analysis_string = words.split(' ')3 word_dict = {}4 for word in analysis_string:5 cleaned_word = word.replace('.', '').replace("'", '').replace(6 '\n', '').replace(',', '').replace("’", '').lower()7 if cleaned_word not in word_dict:8 word_dict[cleaned_word] = 19 else:10 word_dict[cleaned_word] += 11112 return word_dict
The function takes a single string as an argument called words
. It then breaks
the string into an array of words called analysis_string
by splitting the
string on each space character found in the string. We create an empty
dictionary
called word_dict
that we will use to keep track of each word's appearance.
Then we loop through each word in our analysis_string
array. For each word, we
use string replaces to strip out various common special characters (commas,
periods, etc.) and then call .lower()
on it to convert all uppercase
characters to lowercase characters. This ensures that The
and the
are not
tracked as two different words. As we go through each word in the array, if the
word does not exist in our word_dict
dictionary yet, we will add it and give
it a count value of 1
. If it already exists in word_dict
, then we will just
increment the count value up by 1. When we are finished looping through each
word, we will return the word_dict
we created.
The end of the script looks as follows:
1comment_string = ' '.join(comment_array)2results = analyze_words(comment_string)34sorted = sorted(results.items(), key=lambda x: x[1], reverse=True)56print(f'{comment_count} comments analyzed')78for key in sorted:9 print(key)
After combining all the comments into a single string and passing that string
through analyze_words
, we can sort all the results by the number of
appearances counted by calling
sorted = sorted(results.items(), key=lambda x: x[1], reverse=True)
. We can
then print the total number of comments we parsed and then each word and the
number of times it appeared.
Wrapping up
The script in its entirety can be found
here.
To run the script, simply run python reddit-comment-analysis.py
from the
directory containing the script file.
If you have Jupyter Notebooks installed, a more interactive version of this post can be found here.
This script serves as a basic starting point for fetching and analyzing data from the web. There is room for many improvements and enhancements to this script. Ideas for additional features include:
- Input validation
- Options for handling upper and lower casing
- Options for removing special characters
- Options for removing common words (the, and, I, etc.)
If you enjoyed this post or found it useful, please consider sharing it on Twitter.
If you want to stay updated on new posts, follow me on Twitter.
If you have any questions, comments, or just want to say hello, send me a message.
Thanks for reading!