Der Einfluss der YouTube-Headline auf die Anzahl der zu erwartenden ‚Views‘. – Data Science and Artificial Intelligence @ HTL Spengergasse

In diesem Projekt soll der Einfluss der Sprache auf die Anzahl der Views als eine der wichtigsten Währungen in Sozialen Medien, gemessen werden.

Dazu werden mittels YouTube-API die Headlines von ausgewählten Bloggern geladen und mittels LLM (GPT) versucht diese zu klassifizieren. Ob reißerisch, nach Aufmerksamkeit heischend oder ’normal‘ oder nichtssagend.

In einem zweiten Schritt sollen ebenfalls mittels YouTube-API geladenenen Views predicted werden.

Wir möchten zeigen, dass reißerische Titel eine signifikant höhere Anzahl an Views erhält – eine Kritik die auch von den Autoren selbst kommt.

Zum Download der Titelisten eines Channels, ist zwingend eine Goolge-API3 Key für YouTube nötig. Dieser ist gratis. Zuvor muss ein Projekt bei Google aneglegt werden.

https://console.cloud.google.com/projectselector2/apis/dashboard?supportedpurview=project

Der eigentlichen Download erledigt ein Python Script unter Eingabe der channel_id. Letztere ist aus dem Source des jeweiligen YouTube-Kanals mittels Suche nach „?channel_id“ herauszufischen.

oder über…

how to copy channel_id from YouTube channel

mit folgendem Aufruf wird auf der Commandline (bash) mittels pip (dem package installer for Python) das benötigte Paket installiert.

# install google-api-python-client
pip install --upgrade google-api-python-client

Erstelle ein Python Script: youtube_get_channel.py und trage deinen API-Key und die channel_id ein.

from googleapiclient.discovery import build
import os

# Replace 'YOUR_API_KEY' with your actual YouTube Data API key
api_key = 'enter your api key here'
youtube = build('youtube', 'v3', developerKey=api_key)

# Replace 'CHANNEL_ID' with the actual ID of the YouTube channel
channel_id = 'UC3boa9w_mMa41DJ70sOtRMQ'
# Fetch the channel's uploads playlist ID
channel_response = youtube.channels().list(
    id=channel_id,
    part='contentDetails'
).execute()

uploads_playlist_id = channel_response['items'][0]['contentDetails']['relatedPlaylists']['uploads']

# Iterate through all videos in the uploads playlist
nextPageToken = None
while True:
    playlist_response = youtube.playlistItems().list(
        playlistId=uploads_playlist_id,
        part='snippet',
        maxResults=50,  # Adjust based on your needs
        pageToken=nextPageToken
    ).execute()

    for video in playlist_response['items']:
        video_id = video['snippet']['resourceId']['videoId']
        video_title = video['snippet']['title']

        # Fetch video statistics
        video_response = youtube.videos().list(
            id=video_id,
            part='statistics'
        ).execute()

        views = video_response['items'][0]['statistics']['viewCount']
        print(f'Title: {video_title}, Views: {views}')

    nextPageToken = playlist_response.get('nextPageToken')
    if not nextPageToken:
        break

# run the script
python3 ./youtube_get_channel.py > some_channel_name.out

NPL with R

2024 by Robert Kofler

NLP tasks: text preprocessing > feature extraction > sentiment analysis > and topic modeling.

Step 1: Text preprocessing

Theory:

Removal of punctuation, stop words, tokenization, and normalization of text (such as stemming or lemmatization). We use the tm package.

Common issues:

generate correct csv w/ quoting and escaping and line ending!
correct wrong encoding -> fix-encoding.py
don’t use teams: no csv preview
calculate views ratio (nb! div/0)
make sure headlines are w/o !

install.packages("tm")
install.packages("SnowballC")
library(tm)

# Load data
# for NLP tasks: stringsAsFactors = FALSE
data <- read.csv("headlines.csv", stringsAsFactors = FALSE)

# show what we got.
str(data)

# Create a corpus
# a so-called Corpus ... is a representation of a collection of text documents.
corpus <- VCorpus(VectorSource(data$headline))

# Text Preprocessing
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
# remove stop words “the”, “is” and “and”
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# finde den Wortstamm (Porter's stemming algo) 
# https://de.wikipedia.org/wiki/Porter-Stemmer-Algorithmus

Step 2: Feature Extraction

For text data to be used in machine learning models, it needs to be converted into a numeric format. The two common techniques are the Bag of Words and TF-IDF.

library(text2vec)

# Create a document-term matrix
dtm <- DocumentTermMatrix(corpus)

# Alternatively, create a TF-IDF matrix
tfidf <- weightTfIdf(dtm)

Step 3: Sentiment Analysis

Sentiment analysis can help determine the attitude or emotion of the text. We are using the syuzhet package.

NB! Latest developments in this area: Embedding with BERT model (Bidirectional Encoder Representations from Transformers (BERT)) -> this will be subject of 2025.

library(syuzhet)

# Get sentiment scores
sentiments <- get_nrc_sentiment(as.character(data$headline))
data <- cbind(data, sentiments)

Step 4: Themenerkennung / Topic Modeling

Die Themenerkennung wird verwendet, um abstrakte Themen innerhalb einer Sammlung von Dokumenten zu entdecken. Wir verwenden das topicmodels-Paket.

EXKURS:

topicmodels: An R Package for Fitting Topic Models

This article is a (slightly) modified and shortened version of Grün and Hornik (2011), published in the Journal of Statistical Software.

Topic models allow the probabilistic modeling of term frequency occurrences in documents. The fitted model can be used to estimate the similarity between documents as well as between a set of specified keywords using an additional layer of latent variables which are referred to as topics.

The R package topicmodels provides basic infrastructure for fitting topic models based on data structures from the text mining package tm. The package includes interfaces to two algorithms for fitting topic models: the variational expectation-maximization algorithm provided by David M. Blei and co-authors and an algorithm using Gibbs sampling by Xuan-Hieu Phan and co-authors.

library(topicmodels)

# Latent Dirichlet Allocation
# Estimate a LDA model using for example the VEM algorithm or Gibbs Sampling.
lda_model <- LDA(dtm, k = 3)  # k = number of topics

#Extract most likely terms or topics.
topics <- topics(lda_model)

#The function terms is a generic function which can be used to extract terms objects from various kinds of R data objects.
topic_terms <- terms(lda_model, 6)  # Get top 6 terms for each topic

Step 5: Machine Learning Models – Logistic Regression – glm()

Now we extracted features from text (like TF-IDF) to train a predictive models. Here we are going to use logistic regression.

# recoing of dependent variable in a binary form, threshold = MEDIAN
data$target <- ifelse(data$ratio_views_followers > median(data$ratio_views_followers), 1, 0)

# Fit a logistic regression model
model <- glm(data = train_data, target ~ abos + channelname + anger + anticipation + disgust + fear + joy + sadness + surprise + trust + negative + positive, family = "binomial")

summary(model)

Now interpret your model

What sentiment has a significat influence on view rate?

> summary(model)

Call:
glm(formula = target ~ anger + anticipation + disgust + fear + 
    joy + sadness + surprise + surprise + trust + negative + 
    channelname, family = "binomial", data = train_data)

Coefficients:
                                                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)                                       -3.45136    1.01911  -3.387 0.000708 ***
anger                                              0.10768    0.04222   2.550 0.010765 *  
anticipation                                      -0.09657    0.03973  -2.431 0.015062 *  
disgust                                            0.03100    0.05663   0.547 0.584070    
fear                                               0.07827    0.03346   2.339 0.019313 *  
joy                                               -0.04041    0.05090  -0.794 0.427281    
sadness                                           -0.13956    0.04409  -3.165 0.001551 ** 
surprise                                          -0.15013    0.04367  -3.438 0.000587 ***
trust                                             -0.02420    0.03242  -0.747 0.455330    
negative                                           0.11784    0.03454   3.411 0.000646 ***
channelnameATP Geopolitics                        18.81062  103.01116   0.183 0.855106    
channelnameBBC News                                2.84145    1.01924   2.788 0.005307 ** 
channelnameBrian Tyler Cohen                       7.90615    1.03987   7.603 2.89e-14 ***
channelnameClimateScience - Solve Climate Change   7.00347    1.43744   4.872 1.10e-06 ***
channelnameCNN                                     4.18150    1.02098   4.096 4.21e-05 ***
channelnameEnvironment and Climate Change Canada  19.02910  134.32759   0.142 0.887347    
channelnameeuronews                                2.04252    1.03436   1.975 0.048306 *  
channelnameMilitary Summary                        7.18931    1.04585   6.874 6.24e-12 ***
channelnameRight Side Broadcasting Network         2.29647    1.05488   2.177 0.029481 *  
channelnameSky News                                0.71018    1.18114   0.601 0.547663    
channelnameThe Guardian                            3.06333    1.02519   2.988 0.002807 ** 
channelnameThe Independent                         0.05235    1.44035   0.036 0.971004    
channelnameThe New York Times                      2.83837    1.01953   2.784 0.005369 ** 
channelnameThe Wall Street Journal                 2.73979    1.01951   2.687 0.007202 ** 
channelnameVICE News                               4.58010    1.02291   4.478 7.55e-06 ***
channelnameWar in Ukraine                          6.65964    1.05993   6.283 3.32e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 27504  on 19839  degrees of freedom
Residual deviance: 21135  on 19814  degrees of freedom
AIC: 21187

Number of Fisher Scoring iterations: 14