In diesem Projekt soll der Einfluss der Sprache auf die Anzahl der Views als eine der wichtigsten Währungen in Sozialen Medien, gemessen werden.
Dazu werden mittels YouTube-API die Headlines von ausgewählten Bloggern geladen und mittels LLM (GPT) versucht diese zu klassifizieren. Ob reißerisch, nach Aufmerksamkeit heischend oder ’normal‘ oder nichtssagend.
In einem zweiten Schritt sollen ebenfalls mittels YouTube-API geladenenen Views predicted werden.
Wir möchten zeigen, dass reißerische Titel eine signifikant höhere Anzahl an Views erhält – eine Kritik die auch von den Autoren selbst kommt.
Zum Download der Titelisten eines Channels, ist zwingend eine Goolge-API3 Key für YouTube nötig. Dieser ist gratis. Zuvor muss ein Projekt bei Google aneglegt werden.
https://console.cloud.google.com/projectselector2/apis/dashboard?supportedpurview=project
Der eigentlichen Download erledigt ein Python Script unter Eingabe der channel_id. Letztere ist aus dem Source des jeweiligen YouTube-Kanals mittels Suche nach „?channel_id“ herauszufischen.
oder über…
mit folgendem Aufruf wird auf der Commandline (bash) mittels pip (dem package installer for Python) das benötigte Paket installiert.
# install google-api-python-client pip install --upgrade google-api-python-client
Erstelle ein Python Script: youtube_get_channel.py und trage deinen API-Key und die channel_id ein.
from googleapiclient.discovery import build import os # Replace 'YOUR_API_KEY' with your actual YouTube Data API key api_key = 'enter your api key here' youtube = build('youtube', 'v3', developerKey=api_key) # Replace 'CHANNEL_ID' with the actual ID of the YouTube channel channel_id = 'UC3boa9w_mMa41DJ70sOtRMQ' # Fetch the channel's uploads playlist ID channel_response = youtube.channels().list( id=channel_id, part='contentDetails' ).execute() uploads_playlist_id = channel_response['items'][0]['contentDetails']['relatedPlaylists']['uploads'] # Iterate through all videos in the uploads playlist nextPageToken = None while True: playlist_response = youtube.playlistItems().list( playlistId=uploads_playlist_id, part='snippet', maxResults=50, # Adjust based on your needs pageToken=nextPageToken ).execute() for video in playlist_response['items']: video_id = video['snippet']['resourceId']['videoId'] video_title = video['snippet']['title'] # Fetch video statistics video_response = youtube.videos().list( id=video_id, part='statistics' ).execute() views = video_response['items'][0]['statistics']['viewCount'] print(f'Title: {video_title}, Views: {views}') nextPageToken = playlist_response.get('nextPageToken') if not nextPageToken: break
# run the script python3 ./youtube_get_channel.py > some_channel_name.out
NPL with R
2024 by Robert Kofler
NLP tasks: text preprocessing > feature extraction > sentiment analysis > and topic modeling.
Step 1: Text preprocessing
Theory:
Removal of punctuation, stop words, tokenization, and normalization of text (such as stemming or lemmatization). We use the tm
package.
Common issues:
- generate correct csv w/ quoting and escaping and line ending!
- correct wrong encoding -> fix-encoding.py
- don’t use teams: no csv preview
- calculate views ratio (nb! div/0)
- make sure headlines are w/o !
install.packages("tm") install.packages("SnowballC") library(tm) # Load data # for NLP tasks: stringsAsFactors = FALSE data <- read.csv("headlines.csv", stringsAsFactors = FALSE) # show what we got. str(data) # Create a corpus # a so-called Corpus ... is a representation of a collection of text documents. corpus <- VCorpus(VectorSource(data$headline)) # Text Preprocessing corpus <- tm_map(corpus, content_transformer(tolower)) corpus <- tm_map(corpus, removePunctuation) # remove stop words “the”, “is” and “and” corpus <- tm_map(corpus, removeWords, stopwords("english")) # finde den Wortstamm (Porter's stemming algo) # https://de.wikipedia.org/wiki/Porter-Stemmer-Algorithmus
Step 2: Feature Extraction
For text data to be used in machine learning models, it needs to be converted into a numeric format. The two common techniques are the Bag of Words and TF-IDF.
library(text2vec) # Create a document-term matrix dtm <- DocumentTermMatrix(corpus) # Alternatively, create a TF-IDF matrix tfidf <- weightTfIdf(dtm)
Step 3: Sentiment Analysis
Sentiment analysis can help determine the attitude or emotion of the text. We are using the syuzhet
package.
NB! Latest developments in this area: Embedding with BERT model (Bidirectional Encoder Representations from Transformers (BERT)) -> this will be subject of 2025.
library(syuzhet) # Get sentiment scores sentiments <- get_nrc_sentiment(as.character(data$headline)) data <- cbind(data, sentiments)
Step 4: Themenerkennung / Topic Modeling
Die Themenerkennung wird verwendet, um abstrakte Themen innerhalb einer Sammlung von Dokumenten zu entdecken. Wir verwenden das topicmodels
-Paket.
EXKURS:
topicmodels: An R Package for Fitting Topic Models
This article is a (slightly) modified and shortened version of Grün and Hornik (2011), published in the Journal of Statistical Software.
Topic models allow the probabilistic modeling of term frequency occurrences in documents. The fitted model can be used to estimate the similarity between documents as well as between a set of specified keywords using an additional layer of latent variables which are referred to as topics.
The R package topicmodels
provides basic infrastructure for fitting topic models based on data structures from the text mining package tm. The package includes interfaces to two algorithms for fitting topic models: the variational expectation-maximization algorithm provided by David M. Blei and co-authors and an algorithm using Gibbs sampling by Xuan-Hieu Phan and co-authors.
library(topicmodels) # Latent Dirichlet Allocation # Estimate a LDA model using for example the VEM algorithm or Gibbs Sampling. lda_model <- LDA(dtm, k = 3) # k = number of topics #Extract most likely terms or topics. topics <- topics(lda_model) #The function terms is a generic function which can be used to extract terms objects from various kinds of R data objects. topic_terms <- terms(lda_model, 6) # Get top 6 terms for each topic
Step 5: Machine Learning Models – Logistic Regression – glm()
Now we extracted features from text (like TF-IDF) to train a predictive models. Here we are going to use logistic regression.
# recoing of dependent variable in a binary form, threshold = MEDIAN data$target <- ifelse(data$ratio_views_followers > median(data$ratio_views_followers), 1, 0) # Fit a logistic regression model model <- glm(data = train_data, target ~ abos + channelname + anger + anticipation + disgust + fear + joy + sadness + surprise + trust + negative + positive, family = "binomial") summary(model)
Now interpret your model
What sentiment has a significat influence on view rate?
> summary(model) Call: glm(formula = target ~ anger + anticipation + disgust + fear + joy + sadness + surprise + surprise + trust + negative + channelname, family = "binomial", data = train_data) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -3.45136 1.01911 -3.387 0.000708 *** anger 0.10768 0.04222 2.550 0.010765 * anticipation -0.09657 0.03973 -2.431 0.015062 * disgust 0.03100 0.05663 0.547 0.584070 fear 0.07827 0.03346 2.339 0.019313 * joy -0.04041 0.05090 -0.794 0.427281 sadness -0.13956 0.04409 -3.165 0.001551 ** surprise -0.15013 0.04367 -3.438 0.000587 *** trust -0.02420 0.03242 -0.747 0.455330 negative 0.11784 0.03454 3.411 0.000646 *** channelnameATP Geopolitics 18.81062 103.01116 0.183 0.855106 channelnameBBC News 2.84145 1.01924 2.788 0.005307 ** channelnameBrian Tyler Cohen 7.90615 1.03987 7.603 2.89e-14 *** channelnameClimateScience - Solve Climate Change 7.00347 1.43744 4.872 1.10e-06 *** channelnameCNN 4.18150 1.02098 4.096 4.21e-05 *** channelnameEnvironment and Climate Change Canada 19.02910 134.32759 0.142 0.887347 channelnameeuronews 2.04252 1.03436 1.975 0.048306 * channelnameMilitary Summary 7.18931 1.04585 6.874 6.24e-12 *** channelnameRight Side Broadcasting Network 2.29647 1.05488 2.177 0.029481 * channelnameSky News 0.71018 1.18114 0.601 0.547663 channelnameThe Guardian 3.06333 1.02519 2.988 0.002807 ** channelnameThe Independent 0.05235 1.44035 0.036 0.971004 channelnameThe New York Times 2.83837 1.01953 2.784 0.005369 ** channelnameThe Wall Street Journal 2.73979 1.01951 2.687 0.007202 ** channelnameVICE News 4.58010 1.02291 4.478 7.55e-06 *** channelnameWar in Ukraine 6.65964 1.05993 6.283 3.32e-10 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 27504 on 19839 degrees of freedom Residual deviance: 21135 on 19814 degrees of freedom AIC: 21187 Number of Fisher Scoring iterations: 14