Hello and welcome to part 3 of our sentiment analysis visualization application project with Dash. Leading up to this part, we learned how to calculate senitment on strings, how to stream data from Twitter, and now we're ready to tie it in to Dash. If you are not already familiar with Dash, check out the Dash tutorials.
In our Dash application, we need some way to interact with our database. Because we are fairly likely to do various calcualtions on this data, rather than displaying the raw sentiment, I am going to use pandas. If you do not have pandas: pip install pandas
. Pandas is a popular data analysis library for Python. You can think of it like Excel for Python. If you would like to learn more, check out the Data Analysis with Python and Pandas Tutorials.
Let's quickly cover how we can read data from a database to a dataframe, and then a simple calculation we might make:
import sqlite3 import pandas as pd conn = sqlite3.connect('twitter.db') c = conn.cursor() df = pd.read_sql("SELECT * FROM sentiment WHERE tweet LIKE '%olympic%' ORDER BY unix DESC LIMIT 1000", conn) print(df.tail())
To load a database into a dataframe, we first pass the SQL query, and then the connection object. In this case, our SQL query is hunting for any tweets that contain the substring of "olympic" in them. Since the olympics are going on at the time of my writing this, it's a fairly common subject at the moment being talked about. For now, we're also limiting the return to be 1000, sorting by the unix time, descending (so we get the latest tweets).
Example output:
unix tweet \ 995 1.519571e+12 RT @NekoDaice: I'm not an exo fan but damn tha... 996 1.519571e+12 RT @Workpoint_Ent: `yaakcchaekhaaaipfang I'm T... 997 1.519571e+12 RT @MomentsAU: EXO and CL didn't disappoint at... 998 1.519571e+12 RT @smolkjd: how much are we gonna bet some EX... 999 1.519571e+12 RT @MartinBelam: I picked ten highlights from ... sentiment 995 -0.8972 996 0.6369 997 0.3089 998 0.4939 999 0.5859
There's a quick snippet of our data, looking good. Now, the raw sentiment scores are going to be all over the place and noisy. We're going to want to smooth them. Let's do this with a moving average. We don't want just any moving average though, we want something that is suitable for the size of data we might be finding.
Also, since we've sorted by unix time in descending order, we probably want to flip our data back around to be in proper chronological order, with:
df.sort_values('unix', inplace=True)
Now to smooth things out:
df['sentiment_smoothed'] = df['sentiment'].rolling(int(len(df)/5)).mean() df.dropna(inplace=True) print(df.tail())
Full code up to this point:
import sqlite3 import pandas as pd conn = sqlite3.connect('twitter.db') c = conn.cursor() df = pd.read_sql("SELECT * FROM sentiment WHERE tweet LIKE '%olympic%' ORDER BY unix DESC LIMIT 1000", conn) df.sort_values('unix', inplace=True) df['sentiment_smoothed'] = df['sentiment'].rolling(int(len(df)/5)).mean() df.dropna(inplace=True) print(df.tail())
Output:
unix tweet sentiment \ 4 1.519575e+12 RT @MichaelG1: This is what happens when a spo... 0.0000 3 1.519575e+12 RT @sebaeksalot: If I had to choose between tw... 0.4767 2 1.519575e+12 I LOVE YOU ALL SO MUCH YOU DESERVE EVERYTHING... 0.9374 1 1.519575e+12 RT @morningpassages: And full capture of epic ... 0.0000 0 1.519575e+12 RT @MichaelG1: This is what happens when a spo... 0.0000 sentiment_smoothed 4 0.173937 3 0.174035 2 0.178722 1 0.178082 0 0.174713
Okay, at this point, we should have everything we need to create our first basic Dash app that will display live Twitter sentiment for any word we want. Let's do that in the next tutorial.