My friend and flatmate Aleesha watches a lot of Netflix. When I say a lot, I mean almost all the time when she is at home. She has it on when shes cooking, when shes on her phone in bed, and even while she sleeps. You might think this means she watches a lot of new things, but you would be wrong. From my own casual observations, it's almost always the same few shows she just watches over and over and over again. Now I'm not here to judge her (I do that plenty), but I was curious as to exactly how much she has watched, and how much of it is the same stuff. When I found out Netflix had an option to request your data, I knew I had to take a look.
Lets use the pandas library for Python to do some exploratory analysis of her Netflix Data, and if we think there is more to see we can do a deep dive with Power BI later.
The data Netflix provides is a number of folders with various .csv files. The file we are interested in is "ViewingActivity.csv". Lets import this data and have a look (along with some other libraries we will use later).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
fulldf = pd.read_csv("ViewingActivity.csv")
fulldf.shape
(38440, 10)
38,000 rows seems like a lot. Lets have a look at the first few to see if they are all useful to this analysis.
fulldf.head(10)
| Profile Name | Start Time | Duration | Attributes | Title | Supplemental Video Type | Device Type | Bookmark | Latest Bookmark | Country | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Leesh | 2021-01-14 02:12:43 | 00:02:50 | Autoplayed: user action: Unspecified; | The Office (U.S.): Season 5: Heavy Competition... | NaN | Samsung 2014 MStar DTV | 00:03:18 | 00:03:18 | NZ (New Zealand) |
| 1 | Leesh | 2021-01-14 02:09:33 | 00:03:02 | Autoplayed: user action: Unspecified; | The Office (U.S.): Season 5: Michael Scott Pap... | NaN | Samsung 2014 MStar DTV | 00:21:07 | 00:21:07 | NZ (New Zealand) |
| 2 | Leesh | 2021-01-13 22:31:13 | 00:17:35 | Autoplayed: user action: Unspecified; | The Office (U.S.): Season 5: Michael Scott Pap... | NaN | Samsung 2014 MStar DTV | 00:18:07 | Not latest view | NZ (New Zealand) |
| 3 | Leesh | 2021-01-13 22:10:30 | 00:20:35 | NaN | The Office (U.S.): Season 5: Dream Team (Episo... | NaN | Samsung 2014 MStar DTV | 00:20:55 | 00:20:55 | NZ (New Zealand) |
| 4 | Leesh | 2021-01-13 21:45:15 | 00:20:47 | Autoplayed: user action: Unspecified; | The Office (U.S.): Season 5: Two Weeks (Episod... | NaN | Samsung 2014 MStar DTV | 00:21:02 | 00:21:02 | NZ (New Zealand) |
| 5 | Leesh | 2021-01-13 21:33:17 | 00:11:51 | Autoplayed: user action: Unspecified; | The Office (U.S.): Season 5: New Boss (Episode... | NaN | Samsung 2014 MStar DTV | 00:21:08 | 00:21:08 | NZ (New Zealand) |
| 6 | D J | 2021-01-13 19:45:51 | 00:00:10 | Autoplayed: user action: None; | MINDHUNTER: Season 2 (Trailer) | TRAILER | Apple iPhone 11 | 00:00:10 | 00:00:10 | FR (France) |
| 7 | D J | 2021-01-13 19:44:26 | 00:00:08 | Autoplayed: user action: None; | Night Stalker: The Hunt for a Serial Killer: C... | HOOK | Apple iPhone 11 | 00:00:08 | 00:00:08 | FR (France) |
| 8 | Leesh | 2021-01-13 10:32:34 | 00:00:13 | Autoplayed: user action: None; | Derry Girls: Season 1: Episode 1 | NaN | Samsung 2016 Hawk-M DTV Smart TV | 00:00:13 | 00:00:13 | NZ (New Zealand) |
| 9 | Leesh | 2021-01-13 10:31:00 | 00:00:53 | Autoplayed: user action: None; | Season 1 Trailer: Snowpiercer | TRAILER | Samsung 2016 Hawk-M DTV Smart TV | 00:00:53 | 00:00:53 | NZ (New Zealand) |
fulldf["Profile Name"].unique()
array(['Leesh', 'D J', 'Britt', 'Gregor', 'Jacob'], dtype=object)
We can see that the dataset contains information about viewing activity for all profiles on this netflix account. We are only interested in data from Aleesha (Leesh), so lets keep those only.
leeshdf = fulldf.loc[fulldf['Profile Name'] == "Leesh"].copy()
leeshdf.shape
(16545, 10)
More realistic, but interesting that her usage makes up 45% of all interactions on that account!
leeshdf.head(10)
| Profile Name | Start Time | Duration | Attributes | Title | Supplemental Video Type | Device Type | Bookmark | Latest Bookmark | Country | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Leesh | 2021-01-14 02:12:43 | 00:02:50 | Autoplayed: user action: Unspecified; | The Office (U.S.): Season 5: Heavy Competition... | NaN | Samsung 2014 MStar DTV | 00:03:18 | 00:03:18 | NZ (New Zealand) |
| 1 | Leesh | 2021-01-14 02:09:33 | 00:03:02 | Autoplayed: user action: Unspecified; | The Office (U.S.): Season 5: Michael Scott Pap... | NaN | Samsung 2014 MStar DTV | 00:21:07 | 00:21:07 | NZ (New Zealand) |
| 2 | Leesh | 2021-01-13 22:31:13 | 00:17:35 | Autoplayed: user action: Unspecified; | The Office (U.S.): Season 5: Michael Scott Pap... | NaN | Samsung 2014 MStar DTV | 00:18:07 | Not latest view | NZ (New Zealand) |
| 3 | Leesh | 2021-01-13 22:10:30 | 00:20:35 | NaN | The Office (U.S.): Season 5: Dream Team (Episo... | NaN | Samsung 2014 MStar DTV | 00:20:55 | 00:20:55 | NZ (New Zealand) |
| 4 | Leesh | 2021-01-13 21:45:15 | 00:20:47 | Autoplayed: user action: Unspecified; | The Office (U.S.): Season 5: Two Weeks (Episod... | NaN | Samsung 2014 MStar DTV | 00:21:02 | 00:21:02 | NZ (New Zealand) |
| 5 | Leesh | 2021-01-13 21:33:17 | 00:11:51 | Autoplayed: user action: Unspecified; | The Office (U.S.): Season 5: New Boss (Episode... | NaN | Samsung 2014 MStar DTV | 00:21:08 | 00:21:08 | NZ (New Zealand) |
| 8 | Leesh | 2021-01-13 10:32:34 | 00:00:13 | Autoplayed: user action: None; | Derry Girls: Season 1: Episode 1 | NaN | Samsung 2016 Hawk-M DTV Smart TV | 00:00:13 | 00:00:13 | NZ (New Zealand) |
| 9 | Leesh | 2021-01-13 10:31:00 | 00:00:53 | Autoplayed: user action: None; | Season 1 Trailer: Snowpiercer | TRAILER | Samsung 2016 Hawk-M DTV Smart TV | 00:00:53 | 00:00:53 | NZ (New Zealand) |
| 10 | Leesh | 2021-01-13 10:00:40 | 00:09:01 | NaN | The Office (U.S.): Season 5: New Boss (Episode... | NaN | Samsung 2014 MStar DTV | 00:09:17 | Not latest view | NZ (New Zealand) |
| 11 | Leesh | 2021-01-13 09:39:44 | 00:20:47 | Autoplayed: user action: Unspecified; | The Office (U.S.): Season 5: Golden Ticket (Ep... | NaN | Samsung 2014 MStar DTV | 00:21:05 | 00:21:05 | NZ (New Zealand) |
There are a few other rows that we want to drop. This netflix profile is logged into the lounge TV at the flat where 3 different people use it. Fortunately, the column 'Device Type' allows us to isolate the TV in question ('Samsung 2016 Hawk-M DTV Smart TV'). That being said, we dont want to simply drop all rows that use that TV, as there could still be valuable data there.
There is hope however, as I know from her usual schedule that she almost never watches from that TV later than 8pm NZST, which is when the vast majority of viewing by the other people happens. The 'Start Time' Column should help us here, but its not currently in a datetime format, and uses the UTC timezone, not our local one. Lets convert this column to a workable format.
leeshdf['Start Time'] = pd.to_datetime(leeshdf['Start Time'], utc = True)
leeshdf = leeshdf.set_index('Start Time')
leeshdf.index = leeshdf.index.tz_convert('Pacific/Auckland')
leeshdf = leeshdf.reset_index()
leeshdf.head(1)
| Start Time | Profile Name | Duration | Attributes | Title | Supplemental Video Type | Device Type | Bookmark | Latest Bookmark | Country | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-01-14 15:12:43+13:00 | Leesh | 00:02:50 | Autoplayed: user action: Unspecified; | The Office (U.S.): Season 5: Heavy Competition... | NaN | Samsung 2014 MStar DTV | 00:03:18 | 00:03:18 | NZ (New Zealand) |
Now that we have the time zone correct, we can remove all entries that begin later than 8pm on that television
leeshdf = leeshdf.drop(leeshdf[(leeshdf['Device Type'] == 'Samsung 2016 Hawk-M DTV Smart TV') & (leeshdf['Start Time'].dt.hour >= 20)].index).copy()
leeshdf.shape
(15899, 10)
While we are tidying columns, lets also convert the duration column to a more workable format (timedelta), and drop the columns we dont need at this point to make the table a bit easier to read.
leeshdf['Duration'] = pd.to_timedelta(leeshdf['Duration'])
df = leeshdf.drop(['Profile Name', 'Attributes', 'Supplemental Video Type', 'Bookmark', 'Latest Bookmark', 'Country'], axis = 1)
df.head(10)
| Start Time | Duration | Title | Device Type | |
|---|---|---|---|---|
| 0 | 2021-01-14 15:12:43+13:00 | 0 days 00:02:50 | The Office (U.S.): Season 5: Heavy Competition... | Samsung 2014 MStar DTV |
| 1 | 2021-01-14 15:09:33+13:00 | 0 days 00:03:02 | The Office (U.S.): Season 5: Michael Scott Pap... | Samsung 2014 MStar DTV |
| 2 | 2021-01-14 11:31:13+13:00 | 0 days 00:17:35 | The Office (U.S.): Season 5: Michael Scott Pap... | Samsung 2014 MStar DTV |
| 3 | 2021-01-14 11:10:30+13:00 | 0 days 00:20:35 | The Office (U.S.): Season 5: Dream Team (Episo... | Samsung 2014 MStar DTV |
| 4 | 2021-01-14 10:45:15+13:00 | 0 days 00:20:47 | The Office (U.S.): Season 5: Two Weeks (Episod... | Samsung 2014 MStar DTV |
| 5 | 2021-01-14 10:33:17+13:00 | 0 days 00:11:51 | The Office (U.S.): Season 5: New Boss (Episode... | Samsung 2014 MStar DTV |
| 8 | 2021-01-13 23:00:40+13:00 | 0 days 00:09:01 | The Office (U.S.): Season 5: New Boss (Episode... | Samsung 2014 MStar DTV |
| 9 | 2021-01-13 22:39:44+13:00 | 0 days 00:20:47 | The Office (U.S.): Season 5: Golden Ticket (Ep... | Samsung 2014 MStar DTV |
| 10 | 2021-01-13 22:18:34+13:00 | 0 days 00:20:53 | The Office (U.S.): Season 5: Blood Drive (Epis... | Samsung 2014 MStar DTV |
| 11 | 2021-01-13 22:17:48+13:00 | 0 days 00:00:35 | Snowpiercer_hook_primary_16x9 | Samsung 2014 MStar DTV |
Now that we have the columns we care about, lets filter out entries that are too short to be valid. Entries that are under a minute are likely just trailers, previews or autoplays on the home page. Lets drop all entries under 1 minute by using the 'Duration' column.
df = df[(df['Duration'] > '0 days 00:01:00')]
df.shape
(14068, 4)
From my observation in person, there are around 8 shows that she watches a considerable amount, with 3 main shows that she spends most of her time watching: Brooklyn Nine-Nine, The Big Bang Theory, and The Vampire Diaries. For the purposes of analyzing this data, we are going to add two new columns to the dataset, 'Show Name' and 'Big 3'.
'Show Name' will be a string that lists the title of the show for each episode watched. Only a handful of shows will have their name, the ones that I can remember seeing her watch often, the rest we will just label 'Other'.
'Big 3' will be a boolean, where shows that are identified as being part of the big 3 being assigned 'True', with the rest 'False'.
df['Show Name'] = np.where(df.Title.str.contains("The Big Bang Theory"), "The Big Bang Theory",
np.where(df.Title.str.contains("Brooklyn Nine-Nine"), "Brooklyn Nine-Nine",
np.where(df.Title.str.contains("The Vampire Diaries"), "The Vampire Diaries",
np.where(df.Title.str.contains("The Office"), "The Office",
np.where(df.Title.str.contains("Vikings: Season"), "Vikings",
np.where(df.Title.str.contains("American Horror Story"), "American Horror Story",
np.where(df.Title.str.contains("Friends: Season"), "Friends",
np.where(df.Title.str.contains("Prison Break: Season"), "Prison Break","Other"))))))))
df['Big 3'] = np.where(df.Title.str.contains("The Big Bang Theory"), True,
np.where(df.Title.str.contains("Brooklyn Nine-Nine"), True,
np.where(df.Title.str.contains("The Vampire Diaries"), True, False)))
df['Show Name'].value_counts()
Other 6163 The Big Bang Theory 3906 Brooklyn Nine-Nine 2432 The Vampire Diaries 628 Prison Break 314 Friends 282 Vikings 131 American Horror Story 117 The Office 95 Name: Show Name, dtype: int64
plot = df['Show Name'].value_counts().plot.pie(y='Most Viewed Shows',
figsize=(12, 12),
autopct="%.1f%%",
explode=[0.05]*9,
pctdistance=0.82)
plt.show()
We can see from these results that The Big Bang Theory and Brooklyn Nine-Nine dominate the results, and really are the 'Big 2'. 'Other' remains the most numerous category, which is a little surprising to me but makes sense. Now lets compare the big 3 with all other shows
df.to_csv('netflixpostformatting.csv')
df['Big 3'].value_counts(normalize = True)
False 0.504834 True 0.495166 Name: Big 3, dtype: float64
plot2 = df['Big 3'].value_counts(normalize = True).plot.pie(y='Big 3',
figsize=(14,14),
explode=[0.05]*2,
autopct="%.1f%%")
That being said, as the Big 3, they make up just under half of all viewing instances on that account. However, this is not necessarily the same as viewing hours, as there may be instances where she watched only briefly.
There is enough here to justify a deeper analysis. We will now take this data to Power BI and produce a full report. Thank you for following along with me!