Migration in online media

Methodology
Author-Topic Model

The Author-Topic Model (ATM) is a generative probabilistic model that extends Latent Dirichlet Allocation (LDA) with information about authorship. ATM thus combines two models, the Author Model and LDA, and yields a joint probability model of authors and topics. The topic distribution of a document is a mixture of the topic distributions of its authors. LDA can be seen as the special case in which each document has a single, unique author, whereas in ATM each author has a topic distribution of their own. Thanks to this feature of ATM, one can study which topics each author writes about and how frequently.
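The generative story described above can be illustrated with a minimal sketch in plain Python. It assumes the standard formulation in which, for every word, one of the document's authors is chosen uniformly at random; the author names, topic labels and probabilities are made-up toy values, not results from this study, and `generate_document` is a hypothetical helper.

```python
import random

def generate_document(doc_authors, author_topic, topic_word, n_words, rng):
    """Sample words for one document following the author-topic generative story."""
    words = []
    for _ in range(n_words):
        # 1. Pick one of the document's authors uniformly at random.
        author = rng.choice(doc_authors)
        # 2. Pick a topic from that author's topic distribution.
        topics, topic_probs = zip(*author_topic[author].items())
        topic = rng.choices(topics, weights=topic_probs, k=1)[0]
        # 3. Pick a word from that topic's word distribution.
        vocab, word_probs = zip(*topic_word[topic].items())
        words.append(rng.choices(vocab, weights=word_probs, k=1)[0])
    return words

# Toy, made-up distributions (assumptions for illustration only).
author_topic = {
    "site_a": {"border": 0.9, "politics": 0.1},
    "site_b": {"border": 0.2, "politics": 0.8},
}
topic_word = {
    "border": {"fence": 0.6, "closure": 0.4},
    "politics": {"government": 0.7, "quota": 0.3},
}

rng = random.Random(0)
doc = generate_document(["site_a", "site_b"], author_topic, topic_word, 8, rng)
print(doc)
```

Fitting the model reverses this process: given the observed words and the known authors of each document, it estimates the author-topic and topic-word distributions.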

Words

Authors (news web pages) – topics – most frequent key words

The first visualization combines two figures: it shows the topics related to the studied news websites and the words connected to those topics. It is important to note that the links do not imply a continuous, direct connection across the three columns. Accordingly, the figure cannot reveal which site uses which expression the most. In the first step of our analysis, the news websites were treated as the authors. The left side of the figure shows which news website is most strongly connected to which topic. The relations between the topics and the news websites are fewer but more dominant. The right side of the figure shows the distribution of words per topic, that is, the words that typically characterize each topic. The figure lists the 10 words with the greatest weight in each topic. In other words, these are the expressions most relevant to the topics.
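The two halves of the figure correspond to two simple look-ups on a fitted model, sketched below. The weights are invented toy values and the helper names `top_words` and `dominant_topic` are hypothetical; this is only an illustration of the idea, not the code behind the visualization.

```python
def top_words(topic_word_weights, n=10):
    """Return the n highest-weighted words for every topic."""
    return {
        topic: [word for word, _ in
                sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:n]]
        for topic, weights in topic_word_weights.items()
    }

def dominant_topic(author_topic_weights):
    """Return, for every author (news site), the topic with the largest weight."""
    return {
        author: max(weights, key=weights.get)
        for author, weights in author_topic_weights.items()
    }

# Toy, made-up weights (assumptions for illustration only).
topic_word_weights = {
    "topic_0": {"border": 0.30, "fence": 0.25, "police": 0.10},
    "topic_1": {"quota": 0.40, "government": 0.35, "vote": 0.05},
}
author_topic_weights = {
    "site_a": {"topic_0": 0.7, "topic_1": 0.3},
    "site_b": {"topic_0": 0.1, "topic_1": 0.9},
}

words_per_topic = top_words(topic_word_weights, n=2)
site_topics = dominant_topic(author_topic_weights)
print(words_per_topic)
print(site_topics)
```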

Dates

Topics per date

In the second step, not only the websites but also the dates were treated as authors, so that we could study when a certain topic was the most popular. The results suggest that the relationship between the EU and Turkey was rarely mentioned in Hungarian online media. The topics of immigrants and migration, the rhetoric of the governing party, and the opening and closure of refugee camps were not among the most popular ones either. However, the topic of the Turkish issue stands out and the topic of the international situation is dominant. Neither is surprising: Hungary is a transit country, so the news is consistently driven by the country's international position.
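Treating dates as authors only requires a different author-to-document mapping. The sketch below builds such mappings from article metadata; the field names, site names and dates are hypothetical, and the granularity of the dates (day, week, month) is an assumption, since the text does not specify it. Gensim's `AuthorTopicModel` accepts a mapping of this shape through its `author2doc` argument.

```python
from collections import defaultdict

# Toy article metadata (made-up values); list position is the document index.
articles = [
    {"site": "site_a", "date": "2015-09-01"},
    {"site": "site_b", "date": "2015-09-01"},
    {"site": "site_a", "date": "2015-09-15"},
]

def build_author2doc(articles, key):
    """Map each value of `key` (the 'author') to the indices of its documents."""
    mapping = defaultdict(list)
    for idx, article in enumerate(articles):
        mapping[article[key]].append(idx)
    return dict(mapping)

site2doc = build_author2doc(articles, "site")  # websites as authors
date2doc = build_author2doc(articles, "date")  # dates as authors
print(site2doc)
print(date2doc)
```

The same corpus can then be modelled twice, once per mapping, which yields topic distributions per website and per date respectively.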

Topics

It is not easy to find the optimal number of topics, as there is no particularly good measure for it. Without a clear guideline, this was a serious problem. It would have helped a lot if we could have annotated the articles, because then we could have tested the performance of the different models on training and test data sets. In the end, we decided to adapt the classification scheme published on atlatszo.hu and the one used in the article by Bernáth Gábor and Messing Vera, who approached this subject from the perspective of content analysis. These two sources gave us a set of topics that we then searched for among the topic words derived from the model. We found that danger/epidemics and violence were identified in both articles. Starting from these, we looked for the topics of border closure, border violation, political discourse, personal history and xenophobia among the words related to the topics.

After settling on the number of topics, the next challenge was to give the topics suitable names, which is usually done by studying the words of each topic. Our bubble chart, which was inspired by Termite, gives a brief impression of this procedure. The figure shows the probability that a certain word belongs to a certain topic: the bigger a bubble is, the higher the probability. Words can be sorted by saliency or by frequency. Saliency filters the corpus and shows how strongly the words are present in the individual topics, while frequency only applies general stop words, so it does not filter the corpus significantly and shows the real frequency of the words. In other words, saliency measures the extent to which a given expression characterizes a given topic compared to a randomly selected word. The two indicators can be viewed either on bigrams, which capture common collocations (words that frequently occur together), or on word lists, which come from more superficially processed texts. To keep the figure comprehensible, only the 25 most frequent words are shown.
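As an illustration of the saliency measure, the sketch below follows the definition used by Termite: saliency(w) = P(w) × Σ_T P(T|w) · log(P(T|w) / P(T)). It is an assumption that the bubble chart uses exactly this formula; the topic names, words and probabilities are made-up toy values, and `saliency` is a hypothetical helper.

```python
import math

def saliency(topic_word, topic_prior):
    """Compute Termite-style saliency for every word.

    topic_word:  {topic: {word: P(word | topic)}}
    topic_prior: {topic: P(topic)}
    """
    vocab = {w for words in topic_word.values() for w in words}
    scores = {}
    for w in vocab:
        # Marginal probability of the word across all topics.
        p_w = sum(topic_word[t].get(w, 0.0) * topic_prior[t] for t in topic_prior)
        distinctiveness = 0.0
        for t in topic_prior:
            p_t_given_w = topic_word[t].get(w, 0.0) * topic_prior[t] / p_w
            if p_t_given_w > 0:  # treat 0 * log(0) as 0
                distinctiveness += p_t_given_w * math.log(p_t_given_w / topic_prior[t])
        scores[w] = p_w * distinctiveness
    return scores

# Toy, made-up distributions (assumptions for illustration only).
topic_word = {
    "border": {"fence": 0.8, "said": 0.2},
    "politics": {"quota": 0.8, "said": 0.2},
}
topic_prior = {"border": 0.5, "politics": 0.5}

scores = saliency(topic_word, topic_prior)
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}: {score:.3f}")
```

In this toy example "said" is equally likely in both topics, so it tells us nothing about the topic and receives a saliency of zero, while "fence" and "quota" each occur in one topic only and score higher.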

Further perspectives

This project is the outcome of the first application of the model. The analysis of the results suggests that the algorithm works most effectively if it is fed with numerous and diverse labels. We believe the model has great potential, so the Author-Topic Model will be tested on another corpus. We expect that the visual representation will help us carry out this work more fruitfully.

Technical information

Python was used for data collection, pre-processing and analysis. The Gensim package was used to build the author-topic model. The visualizations were prepared with the d3 JavaScript library. The website uses the Freelancer Bootstrap template.

Who talked about migration in Hungarian online media?
When?
What was the topic?

Virág Ilyés, Eszter Katona, Gergely Morvay, Orsolya Putz, Zoltán Varjú
