Aleesha's Netflix Usage - Exploratory Analysis¶

My friend and flatmate Aleesha watches a lot of Netflix. When I say a lot, I mean almost all the time when she is at home. She has it on when shes cooking, when shes on her phone in bed, and even while she sleeps. You might think this means she watches a lot of new things, but you would be wrong. From my own casual observations, it's almost always the same few shows she just watches over and over and over again. Now I'm not here to judge her (I do that plenty), but I was curious as to exactly how much she has watched, and how much of it is the same stuff. When I found out Netflix had an option to request your data, I knew I had to take a look.

Lets use the pandas library for Python to do some exploratory analysis of her Netflix Data, and if we think there is more to see we can do a deep dive with Power BI later.

The data Netflix provides is a number of folders with various .csv files. The file we are interested in is "ViewingActivity.csv". Lets import this data and have a look (along with some other libraries we will use later).

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


fulldf = pd.read_csv("ViewingActivity.csv")

In [2]:

fulldf.shape

Out[2]:

(38440, 10)

38,000 rows seems like a lot. Lets have a look at the first few to see if they are all useful to this analysis.

In [3]:

fulldf.head(10)

Out[3]:

	Profile Name	Start Time	Duration	Attributes	Title	Supplemental Video Type	Device Type	Bookmark	Latest Bookmark	Country
0	Leesh	2021-01-14 02:12:43	00:02:50	Autoplayed: user action: Unspecified;	The Office (U.S.): Season 5: Heavy Competition...	NaN	Samsung 2014 MStar DTV	00:03:18	00:03:18	NZ (New Zealand)
1	Leesh	2021-01-14 02:09:33	00:03:02	Autoplayed: user action: Unspecified;	The Office (U.S.): Season 5: Michael Scott Pap...	NaN	Samsung 2014 MStar DTV	00:21:07	00:21:07	NZ (New Zealand)
2	Leesh	2021-01-13 22:31:13	00:17:35	Autoplayed: user action: Unspecified;	The Office (U.S.): Season 5: Michael Scott Pap...	NaN	Samsung 2014 MStar DTV	00:18:07	Not latest view	NZ (New Zealand)
3	Leesh	2021-01-13 22:10:30	00:20:35	NaN	The Office (U.S.): Season 5: Dream Team (Episo...	NaN	Samsung 2014 MStar DTV	00:20:55	00:20:55	NZ (New Zealand)
4	Leesh	2021-01-13 21:45:15	00:20:47	Autoplayed: user action: Unspecified;	The Office (U.S.): Season 5: Two Weeks (Episod...	NaN	Samsung 2014 MStar DTV	00:21:02	00:21:02	NZ (New Zealand)
5	Leesh	2021-01-13 21:33:17	00:11:51	Autoplayed: user action: Unspecified;	The Office (U.S.): Season 5: New Boss (Episode...	NaN	Samsung 2014 MStar DTV	00:21:08	00:21:08	NZ (New Zealand)
6	D J	2021-01-13 19:45:51	00:00:10	Autoplayed: user action: None;	MINDHUNTER: Season 2 (Trailer)	TRAILER	Apple iPhone 11	00:00:10	00:00:10	FR (France)
7	D J	2021-01-13 19:44:26	00:00:08	Autoplayed: user action: None;	Night Stalker: The Hunt for a Serial Killer: C...	HOOK	Apple iPhone 11	00:00:08	00:00:08	FR (France)
8	Leesh	2021-01-13 10:32:34	00:00:13	Autoplayed: user action: None;	Derry Girls: Season 1: Episode 1	NaN	Samsung 2016 Hawk-M DTV Smart TV	00:00:13	00:00:13	NZ (New Zealand)
9	Leesh	2021-01-13 10:31:00	00:00:53	Autoplayed: user action: None;	Season 1 Trailer: Snowpiercer	TRAILER	Samsung 2016 Hawk-M DTV Smart TV	00:00:53	00:00:53	NZ (New Zealand)

In [4]:

fulldf["Profile Name"].unique()

Out[4]:

array(['Leesh', 'D J', 'Britt', 'Gregor', 'Jacob'], dtype=object)

We can see that the dataset contains information about viewing activity for all profiles on this netflix account. We are only interested in data from Aleesha (Leesh), so lets keep those only.

In [5]:

leeshdf = fulldf.loc[fulldf['Profile Name'] == "Leesh"].copy()

leeshdf.shape

Out[5]:

(16545, 10)

More realistic, but interesting that her usage makes up 45% of all interactions on that account!

In [6]:

leeshdf.head(10)

Out[6]:

	Profile Name	Start Time	Duration	Attributes	Title	Supplemental Video Type	Device Type	Bookmark	Latest Bookmark	Country
0	Leesh	2021-01-14 02:12:43	00:02:50	Autoplayed: user action: Unspecified;	The Office (U.S.): Season 5: Heavy Competition...	NaN	Samsung 2014 MStar DTV	00:03:18	00:03:18	NZ (New Zealand)
1	Leesh	2021-01-14 02:09:33	00:03:02	Autoplayed: user action: Unspecified;	The Office (U.S.): Season 5: Michael Scott Pap...	NaN	Samsung 2014 MStar DTV	00:21:07	00:21:07	NZ (New Zealand)
2	Leesh	2021-01-13 22:31:13	00:17:35	Autoplayed: user action: Unspecified;	The Office (U.S.): Season 5: Michael Scott Pap...	NaN	Samsung 2014 MStar DTV	00:18:07	Not latest view	NZ (New Zealand)
3	Leesh	2021-01-13 22:10:30	00:20:35	NaN	The Office (U.S.): Season 5: Dream Team (Episo...	NaN	Samsung 2014 MStar DTV	00:20:55	00:20:55	NZ (New Zealand)
4	Leesh	2021-01-13 21:45:15	00:20:47	Autoplayed: user action: Unspecified;	The Office (U.S.): Season 5: Two Weeks (Episod...	NaN	Samsung 2014 MStar DTV	00:21:02	00:21:02	NZ (New Zealand)
5	Leesh	2021-01-13 21:33:17	00:11:51	Autoplayed: user action: Unspecified;	The Office (U.S.): Season 5: New Boss (Episode...	NaN	Samsung 2014 MStar DTV	00:21:08	00:21:08	NZ (New Zealand)
8	Leesh	2021-01-13 10:32:34	00:00:13	Autoplayed: user action: None;	Derry Girls: Season 1: Episode 1	NaN	Samsung 2016 Hawk-M DTV Smart TV	00:00:13	00:00:13	NZ (New Zealand)
9	Leesh	2021-01-13 10:31:00	00:00:53	Autoplayed: user action: None;	Season 1 Trailer: Snowpiercer	TRAILER	Samsung 2016 Hawk-M DTV Smart TV	00:00:53	00:00:53	NZ (New Zealand)
10	Leesh	2021-01-13 10:00:40	00:09:01	NaN	The Office (U.S.): Season 5: New Boss (Episode...	NaN	Samsung 2014 MStar DTV	00:09:17	Not latest view	NZ (New Zealand)
11	Leesh	2021-01-13 09:39:44	00:20:47	Autoplayed: user action: Unspecified;	The Office (U.S.): Season 5: Golden Ticket (Ep...	NaN	Samsung 2014 MStar DTV	00:21:05	00:21:05	NZ (New Zealand)

There are a few other rows that we want to drop. This netflix profile is logged into the lounge TV at the flat where 3 different people use it. Fortunately, the column 'Device Type' allows us to isolate the TV in question ('Samsung 2016 Hawk-M DTV Smart TV'). That being said, we dont want to simply drop all rows that use that TV, as there could still be valuable data there.

There is hope however, as I know from her usual schedule that she almost never watches from that TV later than 8pm NZST, which is when the vast majority of viewing by the other people happens. The 'Start Time' Column should help us here, but its not currently in a datetime format, and uses the UTC timezone, not our local one. Lets convert this column to a workable format.

In [7]:

leeshdf['Start Time'] = pd.to_datetime(leeshdf['Start Time'], utc = True)
leeshdf = leeshdf.set_index('Start Time')
leeshdf.index = leeshdf.index.tz_convert('Pacific/Auckland')
leeshdf = leeshdf.reset_index()
leeshdf.head(1)

Out[7]:

	Start Time	Profile Name	Duration	Attributes	Title	Supplemental Video Type	Device Type	Bookmark	Latest Bookmark	Country
0	2021-01-14 15:12:43+13:00	Leesh	00:02:50	Autoplayed: user action: Unspecified;	The Office (U.S.): Season 5: Heavy Competition...	NaN	Samsung 2014 MStar DTV	00:03:18	00:03:18	NZ (New Zealand)

Now that we have the time zone correct, we can remove all entries that begin later than 8pm on that television

In [8]:

leeshdf = leeshdf.drop(leeshdf[(leeshdf['Device Type'] == 'Samsung 2016 Hawk-M DTV Smart TV') & (leeshdf['Start Time'].dt.hour >= 20)].index).copy()
leeshdf.shape

Out[8]:

(15899, 10)

While we are tidying columns, lets also convert the duration column to a more workable format (timedelta), and drop the columns we dont need at this point to make the table a bit easier to read.

In [9]:

leeshdf['Duration'] = pd.to_timedelta(leeshdf['Duration'])

In [10]:

df = leeshdf.drop(['Profile Name', 'Attributes', 'Supplemental Video Type', 'Bookmark', 'Latest Bookmark', 'Country'], axis = 1)
df.head(10)

Out[10]:

	Start Time	Duration	Title	Device Type
0	2021-01-14 15:12:43+13:00	0 days 00:02:50	The Office (U.S.): Season 5: Heavy Competition...	Samsung 2014 MStar DTV
1	2021-01-14 15:09:33+13:00	0 days 00:03:02	The Office (U.S.): Season 5: Michael Scott Pap...	Samsung 2014 MStar DTV
2	2021-01-14 11:31:13+13:00	0 days 00:17:35	The Office (U.S.): Season 5: Michael Scott Pap...	Samsung 2014 MStar DTV
3	2021-01-14 11:10:30+13:00	0 days 00:20:35	The Office (U.S.): Season 5: Dream Team (Episo...	Samsung 2014 MStar DTV
4	2021-01-14 10:45:15+13:00	0 days 00:20:47	The Office (U.S.): Season 5: Two Weeks (Episod...	Samsung 2014 MStar DTV
5	2021-01-14 10:33:17+13:00	0 days 00:11:51	The Office (U.S.): Season 5: New Boss (Episode...	Samsung 2014 MStar DTV
8	2021-01-13 23:00:40+13:00	0 days 00:09:01	The Office (U.S.): Season 5: New Boss (Episode...	Samsung 2014 MStar DTV
9	2021-01-13 22:39:44+13:00	0 days 00:20:47	The Office (U.S.): Season 5: Golden Ticket (Ep...	Samsung 2014 MStar DTV
10	2021-01-13 22:18:34+13:00	0 days 00:20:53	The Office (U.S.): Season 5: Blood Drive (Epis...	Samsung 2014 MStar DTV
11	2021-01-13 22:17:48+13:00	0 days 00:00:35	Snowpiercer_hook_primary_16x9	Samsung 2014 MStar DTV

Now that we have the columns we care about, lets filter out entries that are too short to be valid. Entries that are under a minute are likely just trailers, previews or autoplays on the home page. Lets drop all entries under 1 minute by using the 'Duration' column.

In [11]:

df = df[(df['Duration'] > '0 days 00:01:00')]
df.shape

Out[11]:

(14068, 4)

From my observation in person, there are around 8 shows that she watches a considerable amount, with 3 main shows that she spends most of her time watching: Brooklyn Nine-Nine, The Big Bang Theory, and The Vampire Diaries. For the purposes of analyzing this data, we are going to add two new columns to the dataset, 'Show Name' and 'Big 3'.

'Show Name' will be a string that lists the title of the show for each episode watched. Only a handful of shows will have their name, the ones that I can remember seeing her watch often, the rest we will just label 'Other'.

'Big 3' will be a boolean, where shows that are identified as being part of the big 3 being assigned 'True', with the rest 'False'.

In [12]:

df['Show Name'] = np.where(df.Title.str.contains("The Big Bang Theory"), "The Big Bang Theory",
                  np.where(df.Title.str.contains("Brooklyn Nine-Nine"), "Brooklyn Nine-Nine",
                  np.where(df.Title.str.contains("The Vampire Diaries"), "The Vampire Diaries",
                  np.where(df.Title.str.contains("The Office"), "The Office",
                  np.where(df.Title.str.contains("Vikings: Season"), "Vikings",
                  np.where(df.Title.str.contains("American Horror Story"), "American Horror Story",
                  np.where(df.Title.str.contains("Friends: Season"), "Friends",
                  np.where(df.Title.str.contains("Prison Break: Season"), "Prison Break","Other"))))))))
df['Big 3'] = np.where(df.Title.str.contains("The Big Bang Theory"), True,
              np.where(df.Title.str.contains("Brooklyn Nine-Nine"), True,
              np.where(df.Title.str.contains("The Vampire Diaries"), True, False)))

In [13]:

df['Show Name'].value_counts()

Out[13]:

Other                    6163
The Big Bang Theory      3906
Brooklyn Nine-Nine       2432
The Vampire Diaries       628
Prison Break              314
Friends                   282
Vikings                   131
American Horror Story     117
The Office                 95
Name: Show Name, dtype: int64

In [14]:

plot = df['Show Name'].value_counts().plot.pie(y='Most Viewed Shows', 
                                               figsize=(12, 12),
                                               autopct="%.1f%%", 
                                               explode=[0.05]*9, 
                                               pctdistance=0.82)

plt.show()

We can see from these results that The Big Bang Theory and Brooklyn Nine-Nine dominate the results, and really are the 'Big 2'. 'Other' remains the most numerous category, which is a little surprising to me but makes sense. Now lets compare the big 3 with all other shows

In [15]:

df.to_csv('netflixpostformatting.csv')

In [16]:

df['Big 3'].value_counts(normalize = True)

Out[16]:

False    0.504834
True     0.495166
Name: Big 3, dtype: float64

In [17]:

plot2 = df['Big 3'].value_counts(normalize = True).plot.pie(y='Big 3',
                                                            figsize=(14,14),
                                                            explode=[0.05]*2,
                                                            autopct="%.1f%%")

That being said, as the Big 3, they make up just under half of all viewing instances on that account. However, this is not necessarily the same as viewing hours, as there may be instances where she watched only briefly.

There is enough here to justify a deeper analysis. We will now take this data to Power BI and produce a full report. Thank you for following along with me!