
Collecting YouTube comments using the YouTube Data API - Part 1

kim kataguiri

Introduction

In this article, I'll show you how to use the YouTube API to collect video and comment data.

Nowadays, it's common to see analyses of Twitter data, from Folha SP's ideological GPS to the Politoscope project in France. I also wrote about the subject for EBICC 2019, a cognitive science event at Unicamp.

My experience in the field is mostly related to politics, but you can find many other uses for social media analysis. In politics, it's common to measure voting intention, map the political compass, and study narrative structure; in my case, I applied the LDA algorithm to YouTube comments to discover the most discussed topics during the 2018 election.

An interesting point about this subject, which is also one of the goals of the Politoscope, is to bring knowledge of how these techniques work to the general public. Many people already know that their data is collected, but they don't know what interested parties do with it; understanding how these techniques work can help inform the debate.

Before starting, I need to highlight some limitations. On social media there are many paid profiles and bots producing fake news or artificial content that doesn't correspond to most users' real opinions; they are generally used to boost a politician's reputation and destroy a competitor's. So, if you try to measure voting intention, paid profiles and bots will influence the results.

Another important point is to be careful with the methodology, because there may be other implications and causes that explain your results. One example I saw used sentiment analysis to measure the impact of fake news on public opinion; the problem is that sentiment doesn't express agreement or disagreement.

For example, in a tragedy it's expected that most of the sentiment will be negative, but that doesn't mean people don't believe the fact occurred. Another, more controversial example happens when the public expresses negative sentiment toward each other because they think differently about a specific subject; in that case it isn't possible to distinguish one opinion from the other.

In other words, sentiment analysis doesn't consider the content of the message, so users with different opinions can be grouped under a single sentiment.

If you want to read more about the challenges and difficulties of analyzing social media, you can read Renee Boucher Ferguson's article.

Generating credentials to access the API

Now, the first thing to do is access Google's developer console and log in with your Gmail account; if it's your first time, you'll need to accept the terms and conditions.

Create a project by clicking on the New Project button. In this tutorial we'll keep the default information, but you can edit fields like name and organization if you want. Then click on Credentials, and after that click on Create Credentials. Some options will appear; we'll use the simplest one, so select the API Key option.

Now we'll enable access to the YouTube Data API v3 service. Click on Dashboard, then on Enable APIs and Services. You'll be directed to the search page; search for YouTube Data API v3, click on the service, and then click on Enable.

You'll also need to install the API's client library on your machine. To do so, type:

pip install --upgrade google-api-python-client
pip install --upgrade google-auth-oauthlib google-auth-httplib2

You're now ready to use the API. Create a file and import the library in your code:

import googleapiclient.discovery

my_api_key = 'YOUR KEY HERE'  # Remember to paste your key here

api_service_name = 'youtube'
api_version = 'v3'

youtube = googleapiclient.discovery.build(
    api_service_name, api_version, developerKey=my_api_key)

Create a function to collect video data

Type and observe the following function:

def collect_video_data(q, y):
    """
    This function collects data from YouTube videos
    :param q: words used to search for videos on Youtube
    :param y: year
    :return: Returns data from searched YouTube videos
    """
    results = youtube.search().list(
        part='snippet',
        q=q,
        type='video',
        publishedAfter=y + '-01-01T00:00:00Z',
        publishedBefore=y + '-12-31T00:00:00Z',
        order='viewCount',
        maxResults=50
    ).execute()
    print('50 videos returned')
    return results

This function receives two parameters: the first is the search term the API uses to look for videos on YouTube, and the second is the year, which I used in my research to separate the samples by year.

The function returns a lot of data about the videos found. If you want only specific information about each video, you can set the parameter called 'part'; check the documentation and look at the property names returned by the API to choose a valid value for this parameter. In our case, the snippet property returns the information we need, like the title and the video id.

The publishedAfter and publishedBefore parameters define the period; in this case, we'll look for videos published between 1 Jan 2018 and 31 Dec 2018. The order parameter defines whether the function returns the most viewed or most relevant videos; it accepts other values too, and the documentation shows each valid value and what it means. I chose the most viewed (viewCount) because those videos generally have more comments.

The maxResults parameter defines the maximum number of results the function can return; the maximum allowed value is 50, but it's possible to collect more than 50 by putting the call in a loop, as sketched below. We'll do that when we collect the videos' comments, because videos have more comments than the maxResults limit allows us to collect in a single call.
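
To give an idea of how that loop works, here is a minimal sketch of a paginated version of the search, following the nextPageToken field that each API response includes (the collect_video_data_paginated name and the total of 150 results are my own choices for this example):

def collect_video_data_paginated(q, y, total=150):
    """
    Sketch: collects up to `total` videos by following nextPageToken
    across several pages of 50 results each.
    """
    items = []
    page_token = None
    while len(items) < total:
        params = dict(
            part='snippet',
            q=q,
            type='video',
            publishedAfter=y + '-01-01T00:00:00Z',
            publishedBefore=y + '-12-31T00:00:00Z',
            order='viewCount',
            maxResults=50,
        )
        if page_token:
            params['pageToken'] = page_token
        results = youtube.search().list(**params).execute()
        items += results['items']
        page_token = results.get('nextPageToken')
        if not page_token:  # no more pages available
            break
    return items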

For each video returned, the data has the following format:

{
  "kind": "youtube#searchResult",
  "etag": "\"p4VTdlkQv3HQeTEaXgvLePAydmU/EJxKkgAmsuSsRl9cVoyWW8iBneY\"",
  "id": {"kind": "youtube#video", "videoId": "VIDEO ID"},
  "snippet": {
    "publishedAt": "2018-10-05T11:00:09.000Z",
    "channelId": "UC-6xqzMBF2CXTImn_a4aCVg",
    "title": "TITLE",
    "description": "DESCRIPTION.",
    "thumbnails": {
      "default": {"url": "IMAGE.jpg", "width": 120, "height": 90},
      "medium": {"url": "IMAGE.jpg", "width": 320, "height": 180},
      "high": {"url": "IMAGE.jpg", "width": 480, "height": 360}
    },
    "channelTitle": "CHANNEL",
    "liveBroadcastContent": "none"
  }
}
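
For instance, assuming results holds the dictionary returned by collect_video_data, each field can be read like any nested Python dictionary:

# `results` is the dictionary returned by collect_video_data
for video in results['items']:
    print(video['id']['videoId'])            # the video's id
    print(video['snippet']['title'])         # its title
    print(video['snippet']['channelTitle'])  # the channel that published it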

Now, we'll select the information that we want and put it in a JSON file.

def format_information(video_list, name):
    """
    This function formats the information of the collected videos
    and writes them to a JSON file
    :param video_list: Youtube video data
    :param name: name of the JSON file
    :return: JSON file
    """
    text = '['
    for video in video_list['items']:
        info = video['id']
        text += '{\n'
        text += '"' + 'ID": ' + '"' + str(info['videoId']) + '",\n'
        info = video['snippet']
        text += '"' + 'Title": ' + '"' + str(info['title']) + '",\n'
        text += '"' + 'Channel": ' + '"' + str(info['channelTitle']) + '",\n'
        text += '"' + 'Date": ' + '"' + str(info['publishedAt']) + '",\n'
        text += '"' + 'Description": ' + '"' + str(info['description']) + '",\n},\n'
    text += ']'
    with open('database/videos/' + name + '.json', 'w', encoding="utf8") as file:
        file.write(text)
    print('Information was stored in json file')

This function is very simple: it selects the fields that hold the information we want and stores them in a string, inside a loop that goes through the list of 50 videos returned by the API.
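
One caveat: building the JSON by hand can produce an invalid file when a title or description contains quote characters, and it leaves trailing commas behind. If you run into that, a sketch of an alternative that writes the same fields but lets Python's json module handle the escaping could look like this:

import json

def format_information_json(video_list, name):
    """
    Sketch: writes the same fields as format_information,
    but serializes with the json module to handle escaping.
    """
    data = []
    for video in video_list['items']:
        snippet = video['snippet']
        data.append({
            'ID': video['id']['videoId'],
            'Title': snippet['title'],
            'Channel': snippet['channelTitle'],
            'Date': snippet['publishedAt'],
            'Description': snippet['description'],
        })
    with open('database/videos/' + name + '.json', 'w', encoding='utf8') as file:
        json.dump(data, file, ensure_ascii=False, indent=2)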

After implementing the functions, we'll write some code to call them.

words = ['cs go', 'anime', 'basketball']
year = '2018'

for word in words:
    videos = collect_video_data(word, year)
    format_information(videos, word)

For demonstration purposes, I chose three simple words about things I like as keywords.

The code is available on my GitHub. So far we've collected the video data; in the next part, we'll collect the videos' comments.
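
As a small preview of that next part, comment collection builds on the API's commentThreads endpoint; a minimal sketch of a single call (the get_comment_page name is just for illustration) could look like this:

def get_comment_page(video_id, page_token=None):
    """Sketch: fetches one page of top-level comments for a video."""
    params = dict(
        part='snippet',
        videoId=video_id,
        maxResults=100,        # this endpoint allows up to 100 per page
        textFormat='plainText',
    )
    if page_token:
        params['pageToken'] = page_token
    return youtube.commentThreads().list(**params).execute()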


My social media 😊

My portfolio

Lattes curriculum

Contact me 🤗

Email: tsukasa.renato@gmail.com

Phone: 14 997468186