In this project we measure the influence of language on the number of views, one of the most important currencies in social media.
Using the YouTube API, the headlines of selected bloggers are downloaded, and an LLM (GPT) is used to classify them: sensationalist, attention-seeking, 'normal', or bland.
In a second step, the view counts, likewise fetched via the YouTube API, are to be predicted.
We want to show that sensationalist titles receive a significantly higher number of views, a criticism that also comes from the authors themselves.
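The classification step could look like the sketch below: build one prompt per headline and map the model's raw answer onto a fixed label set. The label names, prompt wording, and helper functions are assumptions for illustration; the actual LLM call is deliberately left out.

```python
# Sketch: classify a headline into one of four labels via an LLM prompt.
# Label set and prompt text are illustrative assumptions, not a tested setup.
LABELS = ["sensationalist", "attention-seeking", "normal", "bland"]

def build_prompt(headline: str) -> str:
    """Assemble the classification prompt for one headline."""
    return (
        "Classify the following YouTube headline as exactly one of: "
        + ", ".join(LABELS) + ".\n"
        f"Headline: {headline}\n"
        "Answer with the label only."
    )

def parse_label(answer: str) -> str:
    """Map a raw model answer onto the label set; fall back to 'normal'."""
    cleaned = answer.strip().lower()
    for label in LABELS:
        if label in cleaned:
            return label
    return "normal"

print(parse_label("  Sensationalist "))  # -> sensationalist
```

The fallback to 'normal' keeps the pipeline running when the model answers with free text instead of a clean label.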
To download the title lists of a channel, a Google API key for the YouTube Data API v3 is strictly required. It is free of charge. Before that, a project must be created in the Google Cloud Console:
https://console.cloud.google.com/projectselector2/apis/dashboard?supportedpurview=project

The actual download is handled by a Python script, given the channel_id. The latter can be fished out of the page source of the respective YouTube channel by searching for "?channel_id".
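Fishing the channel_id out of the page source can be automated; a minimal sketch (the regex is an assumption about how the ID appears in the HTML):

```python
import re

def extract_channel_id(html: str):
    """Find the first ?channel_id=... occurrence in a page source."""
    match = re.search(r'\?channel_id=([A-Za-z0-9_-]+)', html)
    return match.group(1) if match else None

html = '<link href="https://www.youtube.com/feeds/videos.xml?channel_id=UC3boa9w_mMa41DJ70sOtRMQ">'
print(extract_channel_id(html))  # -> UC3boa9w_mMa41DJ70sOtRMQ
```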

or via …

The required package is installed on the command line (bash) with the following call, using pip (the package installer for Python):
# install google-api-python-client
pip install --upgrade google-api-python-client
Create a Python script youtube_get_channel.py and enter your API key and the channel_id:
from googleapiclient.discovery import build

# Replace with your actual YouTube Data API key
api_key = 'enter your api key here'
youtube = build('youtube', 'v3', developerKey=api_key)

# Replace with the actual ID of the YouTube channel
channel_id = 'UC3boa9w_mMa41DJ70sOtRMQ'

# Fetch the channel's uploads playlist ID
channel_response = youtube.channels().list(
    id=channel_id,
    part='contentDetails'
).execute()
uploads_playlist_id = channel_response['items'][0]['contentDetails']['relatedPlaylists']['uploads']

# Iterate through all videos in the uploads playlist
nextPageToken = None
while True:
    playlist_response = youtube.playlistItems().list(
        playlistId=uploads_playlist_id,
        part='snippet',
        maxResults=50,  # adjust based on your needs
        pageToken=nextPageToken
    ).execute()
    for video in playlist_response['items']:
        video_id = video['snippet']['resourceId']['videoId']
        video_title = video['snippet']['title']
        # Fetch video statistics
        video_response = youtube.videos().list(
            id=video_id,
            part='statistics'
        ).execute()
        views = video_response['items'][0]['statistics']['viewCount']
        print(f'Title: {video_title}, Views: {views}')
    nextPageToken = playlist_response.get('nextPageToken')
    if not nextPageToken:
        break
# run the script
python3 ./youtube_get_channel.py > some_channel_name.out
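The redirected output can then be parsed back into structured data; a sketch assuming exactly the `Title: ..., Views: ...` line format printed above (a title that itself contains `, Views:` would confuse this naive pattern):

```python
import re

def parse_line(line: str):
    """Parse one 'Title: ..., Views: ...' line into (title, views)."""
    match = re.match(r'Title: (.*), Views: (\d+)\s*$', line)
    if not match:
        return None
    return match.group(1), int(match.group(2))

print(parse_line('Title: Breaking news!, Views: 1234'))  # -> ('Breaking news!', 1234)
```

From here it is a short step to writing a proper CSV with Python's csv module, which also handles the quoting and escaping issues mentioned below.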
NLP with R
2024 by Robert Kofler
NLP tasks: text preprocessing > feature extraction > sentiment analysis > topic modeling.
Step 1: Text preprocessing
Theory:
Removal of punctuation, stop words, tokenization, and normalization of text (such as stemming or lemmatization). We use the tm package.
Common issues:
- generate a correct CSV with proper quoting, escaping, and line endings!
- fix wrong encoding -> fix-encoding.py
- don't use Teams: no CSV preview
- calculate the views ratio (NB: watch out for division by zero)
- make sure the headlines are w/o "!"
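Before doing the same with tm in R, the preprocessing steps can be sketched in plain Python. The stop-word list and the suffix-stripping "stemmer" below are deliberately tiny toys, not Porter's algorithm:

```python
import string

STOP_WORDS = {"the", "is", "and", "a", "of"}  # toy list, not a full set

def preprocess(text: str):
    """Lowercase, strip punctuation, tokenize, drop stop words, crude stem."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    # crude stemming: strip a few common suffixes (toy stand-in for Porter)
    stems = []
    for t in tokens:
        for suffix in ("ing", "ed", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[:-len(suffix)]
                break
        stems.append(t)
    return stems

print(preprocess("The markets are crashing!"))  # -> ['market', 'are', 'crash']
```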
install.packages("tm")
install.packages("SnowballC")
library(tm)
# Load data
# for NLP tasks: stringsAsFactors = FALSE
data <- read.csv("headlines.csv", stringsAsFactors = FALSE)
# show what we got.
str(data)
# Create a corpus
# a so-called Corpus ... is a representation of a collection of text documents.
corpus <- VCorpus(VectorSource(data$headline))
# Text Preprocessing
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
# remove stop words “the”, “is” and “and”
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# find the word stem (Porter's stemming algorithm)
# https://de.wikipedia.org/wiki/Porter-Stemmer-Algorithmus
corpus <- tm_map(corpus, stemDocument)
Step 2: Feature Extraction
For text data to be used in machine learning models, it needs to be converted into a numeric format. The two common techniques are the Bag of Words and TF-IDF.
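The two representations can be illustrated by hand in Python, using the plain tf * log(N/df) variant; tm's weightTfIdf normalizes differently, so the numbers here are for intuition only:

```python
import math
from collections import Counter

docs = [["cheap", "gold", "gold"],
        ["gold", "price"],
        ["news", "today"]]

# Bag of Words: raw term counts per document
bow = [Counter(doc) for doc in docs]

# document frequency: in how many documents does each term occur?
df = Counter(term for doc in docs for term in set(doc))
N = len(docs)

def tfidf(doc_counts):
    """tf * log(N / df) for every term of one document."""
    return {t: c * math.log(N / df[t]) for t, c in doc_counts.items()}

weights = tfidf(bow[0])
# 'cheap' (rare, df=1) outweighs 'gold' (common, df=2) despite a lower count
print(round(weights["cheap"], 3), round(weights["gold"], 3))
```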
# Create a document-term matrix (DocumentTermMatrix() and weightTfIdf() come from tm)
dtm <- DocumentTermMatrix(corpus)
# Alternatively, create a TF-IDF matrix
tfidf <- weightTfIdf(dtm)
Step 3: Sentiment Analysis
Sentiment analysis can help determine the attitude or emotion of the text. We are using the syuzhet package.
NB! Latest development in this area: embeddings with BERT (Bidirectional Encoder Representations from Transformers) -> this will be a subject of 2025.
library(syuzhet)
# Get sentiment scores
sentiments <- get_nrc_sentiment(as.character(data$headline))
data <- cbind(data, sentiments)
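A lexicon-based scorer like get_nrc_sentiment boils down to counting matches against per-emotion word lists; a Python toy (the word lists here are invented for illustration, the real NRC lexicon has thousands of entries):

```python
# Toy NRC-style lexicon: emotion -> word set (invented, for illustration only)
LEXICON = {
    "anger": {"furious", "outrage"},
    "fear": {"terrifying", "panic"},
    "joy": {"delight", "celebrate"},
}

def score(headline: str):
    """Count lexicon hits per emotion in a lowercased headline."""
    tokens = headline.lower().split()
    return {emotion: sum(t in words for t in tokens)
            for emotion, words in LEXICON.items()}

print(score("Panic and outrage after terrifying crash"))
# -> {'anger': 1, 'fear': 2, 'joy': 0}
```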
Step 4: Topic Modeling
Topic modeling is used to discover abstract topics within a collection of documents. We use the topicmodels package.
EXCURSUS:
topicmodels: An R Package for Fitting Topic Models
This article is a (slightly) modified and shortened version of Grün and Hornik (2011), published in the Journal of Statistical Software.
Topic models allow the probabilistic modeling of term frequency occurrences in documents. The fitted model can be used to estimate the similarity between documents as well as between a set of specified keywords using an additional layer of latent variables which are referred to as topics.
The R package topicmodels provides basic infrastructure for fitting topic models based on data structures from the text mining package tm. The package includes interfaces to two algorithms for fitting topic models: the variational expectation-maximization algorithm provided by David M. Blei and co-authors and an algorithm using Gibbs sampling by Xuan-Hieu Phan and co-authors.
library(topicmodels)
# Latent Dirichlet Allocation
# Estimate an LDA model using, for example, the VEM algorithm or Gibbs sampling.
lda_model <- LDA(dtm, k = 3)  # k = number of topics
# Extract the most likely topic per document.
topics <- topics(lda_model)
# The function terms is a generic function which can be used to extract terms objects from various kinds of R data objects.
topic_terms <- terms(lda_model, 6)  # Get top 6 terms for each topic
Step 5: Machine Learning Models – Logistic Regression – glm()
Now we use the features extracted from the text (like TF-IDF and the sentiment scores) to train a predictive model. Here we use logistic regression.
# Recode the dependent variable into binary form, threshold = MEDIAN
data$target <- ifelse(data$ratio_views_followers > median(data$ratio_views_followers), 1, 0)
# Fit a logistic regression model
model <- glm(data = train_data,
             target ~ abos + channelname + anger + anticipation + disgust + fear + joy +
               sadness + surprise + trust + negative + positive,
             family = "binomial")
summary(model)
Now interpret your model.
Which sentiment has a significant influence on the view rate?
> summary(model)
Call:
glm(formula = target ~ anger + anticipation + disgust + fear +
joy + sadness + surprise + surprise + trust + negative +
channelname, family = "binomial", data = train_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.45136 1.01911 -3.387 0.000708 ***
anger 0.10768 0.04222 2.550 0.010765 *
anticipation -0.09657 0.03973 -2.431 0.015062 *
disgust 0.03100 0.05663 0.547 0.584070
fear 0.07827 0.03346 2.339 0.019313 *
joy -0.04041 0.05090 -0.794 0.427281
sadness -0.13956 0.04409 -3.165 0.001551 **
surprise -0.15013 0.04367 -3.438 0.000587 ***
trust -0.02420 0.03242 -0.747 0.455330
negative 0.11784 0.03454 3.411 0.000646 ***
channelnameATP Geopolitics 18.81062 103.01116 0.183 0.855106
channelnameBBC News 2.84145 1.01924 2.788 0.005307 **
channelnameBrian Tyler Cohen 7.90615 1.03987 7.603 2.89e-14 ***
channelnameClimateScience - Solve Climate Change 7.00347 1.43744 4.872 1.10e-06 ***
channelnameCNN 4.18150 1.02098 4.096 4.21e-05 ***
channelnameEnvironment and Climate Change Canada 19.02910 134.32759 0.142 0.887347
channelnameeuronews 2.04252 1.03436 1.975 0.048306 *
channelnameMilitary Summary 7.18931 1.04585 6.874 6.24e-12 ***
channelnameRight Side Broadcasting Network 2.29647 1.05488 2.177 0.029481 *
channelnameSky News 0.71018 1.18114 0.601 0.547663
channelnameThe Guardian 3.06333 1.02519 2.988 0.002807 **
channelnameThe Independent 0.05235 1.44035 0.036 0.971004
channelnameThe New York Times 2.83837 1.01953 2.784 0.005369 **
channelnameThe Wall Street Journal 2.73979 1.01951 2.687 0.007202 **
channelnameVICE News 4.58010 1.02291 4.478 7.55e-06 ***
channelnameWar in Ukraine 6.65964 1.05993 6.283 3.32e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 27504 on 19839 degrees of freedom
Residual deviance: 21135 on 19814 degrees of freedom
AIC: 21187
Number of Fisher Scoring iterations: 14
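To read the coefficients above: in logistic regression each estimate is a change in log-odds, so exp(coefficient) gives an odds ratio. For example, exp(0.11784) ≈ 1.125 for 'negative' means roughly 12.5% higher odds of landing above the median views ratio per unit of the score. A quick check in Python, with values copied from the summary above:

```python
import math

# Significant sentiment coefficients from the model summary (log-odds scale)
coefs = {"anger": 0.10768, "surprise": -0.15013, "negative": 0.11784}

# exp(coef) = multiplicative change in the odds of being above the median
# views ratio per one-unit increase of the sentiment score
odds_ratios = {k: math.exp(v) for k, v in coefs.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.3f}")
```

Ratios above 1 (anger, negative) push the odds up; ratios below 1 (surprise) push them down, consistent with the signs in the summary.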

