Virág Ilyés, Eszter Katona, Gergely Morvay, Orsolya Putz, Zoltán Varjú
What topics were discussed on Hungarian online media sites during the migration crisis? What was the news about? How did the topics of the news change? These questions may sound familiar, since we have already reported the results of our previous analyses on the subject (see our dashboard or website).
The current study pays less attention to how the topics of the documents vary; instead, it aims to reveal the most typical topics of particular news websites and dates. This is best studied with an Author-Topic Model (ATM), because each document in our corpus carries two labels: the name of the news website and the date of publication. The articles on migration were categorized by topic, so the most popular topics of each website can be explored and the changing popularity of each topic can be traced over time. The entire corpus comprises 50,000 articles, collected from the Internet between September 2014 and June 2016. The articles come from 25 news websites, which we selected to represent the full spectrum of Hungarian news websites.
The Author-Topic Model (ATM) is a generative probabilistic model that extends Latent Dirichlet Allocation (LDA) with information about authorship. ATM thus combines two models, the Author Model and LDA, and yields a joint probabilistic model of authors and topics. The topic distribution of a document is a mixture of the topic distributions of its authors. In LDA, each document has its own topic distribution, as if it had a single, unique author, whereas in ATM each author has a topic distribution of their own. Thanks to this feature of ATM, one can study which topics each author writes about and how often.
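The generative idea described above can be illustrated with a minimal sketch in plain Python. It is not the code used in the project: the author names, topic names and all probabilities below are made-up toy values, and the function `generate_document` is a hypothetical helper. It only shows the three sampling steps: pick one of the document's authors, pick a topic from that author's topic distribution, then pick a word from that topic's word distribution.

```python
import random

def generate_document(doc_authors, author_topics, topic_words, n_words, rng):
    """Generate one document word by word under the author-topic assumptions."""
    words = []
    for _ in range(n_words):
        # 1. Pick one of the document's authors uniformly at random.
        author = rng.choice(doc_authors)
        # 2. Pick a topic from that author's topic distribution.
        topics, topic_probs = zip(*author_topics[author].items())
        topic = rng.choices(topics, weights=topic_probs, k=1)[0]
        # 3. Pick a word from that topic's word distribution.
        vocab, word_probs = zip(*topic_words[topic].items())
        words.append(rng.choices(vocab, weights=word_probs, k=1)[0])
    return words

# Toy, made-up distributions (illustrative only, not estimated from the corpus).
author_topics = {
    "site_A": {"border": 0.8, "eu_politics": 0.2},
    "2015-09": {"border": 0.5, "eu_politics": 0.5},
}
topic_words = {
    "border": {"fence": 0.6, "border": 0.4},
    "eu_politics": {"quota": 0.7, "summit": 0.3},
}

rng = random.Random(0)  # fixed seed so the toy run is repeatable
doc = generate_document(["site_A", "2015-09"], author_topics, topic_words, 8, rng)
print(doc)
```

Fitting the model means going the other way: given only the words and the author labels, the author-topic and topic-word distributions are estimated from the data.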
The first visualization combines two figures: it shows the topics associated with the studied news websites and the words associated with the topics. Note that the links do not imply a continuous, direct connection across the three columns, so the figure cannot reveal which site uses which expression the most. In this first analysis, the news websites are treated as the authors. The left side of the figure shows which topic each news website is most strongly connected to; there are fewer, but more dominant, relations between the topics and the news websites. The right side shows the distribution of the words that typically characterize each topic. The figure lists the 10 words with the greatest weight in each topic, in other words, the terms most relevant to it.
Secondly, not only the websites but also the dates were treated as authors, so that we could study when a certain topic was the most popular. The results suggest that the relationship between the EU and Turkey was rarely mentioned in Hungarian online media. The topics of immigrants and migration, the rhetoric of the governing party, and the opening and closure of refugee camps were not the most popular ones either. However, the topic of the Turkish issue stands out and the topic of the international situation is dominant. Neither is surprising: Hungary is a transit country, so the news is consistently shaped by the country's international position.
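Treating websites and dates as "authors" amounts to building a mapping from each label to the documents that carry it. The sketch below shows one possible way to do this in plain Python. The article records, the month-level date format and the helper `build_author2doc` are assumptions made for illustration. The text does not say whether the two label sets were modelled separately or together, so the sketch simply builds one mapping per label.

```python
from collections import defaultdict

# Hypothetical records: one entry per article, in corpus order.
articles = [
    {"site": "site_A", "date": "2015-09"},
    {"site": "site_B", "date": "2015-09"},
    {"site": "site_A", "date": "2016-03"},
]

def build_author2doc(articles, label):
    """Map each value of `label` to the indices of the documents that carry it."""
    mapping = defaultdict(list)
    for doc_id, article in enumerate(articles):
        mapping[article[label]].append(doc_id)
    return dict(mapping)

site2doc = build_author2doc(articles, "site")
date2doc = build_author2doc(articles, "date")
print(site2doc)  # {'site_A': [0, 2], 'site_B': [1]}
print(date2doc)  # {'2015-09': [0, 1], '2016-03': [2]}
```

A dictionary of this shape (label to list of document indices) is the form that Gensim's `AuthorTopicModel` accepts through its `author2doc` argument, alongside a bag-of-words corpus and a dictionary.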
It is not easy to find the optimal number of topics, as there are no particularly good measures for it. Without a clear guideline, this was a serious problem. It would have helped a lot if we had been able to annotate the articles, since we could then have tested the performance of the different models on training and test data sets. In the end, we decided to adapt the classification scheme published on atlatszo.hu and the one in the article by Bernáth Gábor and Messing Vera, who approached the subject from the perspective of content analysis. These two sources gave us a set of topics to search for among the topic words derived from the model. We found that danger/epidemics and violence were identified in both articles. Based on these terms, we looked for the topics of border closure, border violation, political discourse, personal history and xenophobia among the words related to the topics.
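The search for reference themes among the topic words can be pictured with a small sketch. It is only an illustration of the idea, not the procedure actually used: the cue words, the topic word lists and the helper `match_themes` are all hypothetical, and the matching here is a plain set intersection.

```python
# Hypothetical reference themes with illustrative English cue words.
reference_themes = {
    "border closure": {"fence", "closure", "border"},
    "xenophobia": {"fear", "hatred"},
}

# Hypothetical top words for the topics of one candidate model.
topic_top_words = {
    0: ["border", "fence", "police", "serbia"],
    1: ["quota", "summit", "brussels"],
}

def match_themes(reference_themes, topic_top_words):
    """Map each theme to the topics whose top words contain at least one cue word."""
    matches = {}
    for theme, cues in reference_themes.items():
        matches[theme] = [
            topic for topic, words in topic_top_words.items() if cues & set(words)
        ]
    return matches

matches = match_themes(reference_themes, topic_top_words)
print(matches)  # {'border closure': [0], 'xenophobia': []}
```

Running such a check for models with different numbers of topics would show which setting recovers the most reference themes, which is one informal way to compare candidates when no annotated data is available.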
After optimizing the number of topics, the next challenge was to give the topics suitable names, which is usually done by studying the words of the topics. Our bubble chart, which was inspired by Termite, gives a brief impression of this procedure. The figure presents the probability that a certain word belongs to a certain topic: the bigger the bubble, the higher the probability. Words can be sorted by saliency or by frequency. Saliency filters the corpus and shows the presence of the words per topic, while frequency applies only general stop words, so it does not filter the corpus significantly and shows the real frequency of the words. More precisely, saliency measures how strongly a given expression characterizes a given topic compared to a randomly selected word. The two indicators can be viewed either for bigrams, which contain common collocations (words that frequently occur together), or for word lists, which result from more superficially processed texts. To keep the figure comprehensible, only the 25 most frequent words are displayed.
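For readers who want to see how such a saliency score can be computed, here is a minimal sketch. It assumes the definition used by Termite, where saliency(w) = P(w) × Σ_T P(T|w) · log(P(T|w) / P(T)). The text does not confirm that exactly this formula was used in the chart. The word-topic counts are made-up toy numbers.

```python
import math

def saliency(word_topic_counts):
    """Compute Termite-style saliency from {word: [count in topic 0, topic 1, ...]}."""
    n_topics = len(next(iter(word_topic_counts.values())))
    total = sum(sum(counts) for counts in word_topic_counts.values())
    # P(T): marginal probability of each topic over all word occurrences.
    p_topic = [
        sum(counts[t] for counts in word_topic_counts.values()) / total
        for t in range(n_topics)
    ]
    scores = {}
    for word, counts in word_topic_counts.items():
        word_total = sum(counts)
        p_word = word_total / total  # P(w)
        # Distinctiveness: KL divergence between P(T|w) and P(T).
        distinctiveness = 0.0
        for t, count in enumerate(counts):
            if count > 0:
                p_t_given_w = count / word_total
                distinctiveness += p_t_given_w * math.log(p_t_given_w / p_topic[t])
        scores[word] = p_word * distinctiveness
    return scores

# Toy counts: "say" is spread evenly over both topics, the others are not.
counts = {"fence": [9, 1], "say": [5, 5], "quota": [1, 9]}
scores = saliency(counts)
for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(word, round(score, 4))
```

In this toy example the evenly spread word gets a saliency of zero, while words concentrated in one topic score higher, which is why sorting by saliency surfaces topic-specific terms rather than merely frequent ones.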
This project is the outcome of our first application of the model. The analysis of the results suggests that the algorithm works most effectively when it is fed with numerous and diverse labels. We believe the model has great potential, so we will test the Author-Topic Model on another corpus. We expect that the visual representation will help us do this work more fruitfully.
Python was used for data collection, pre-processing and analysis. The author-topic model was built with the Gensim package. The visualizations were prepared with the d3 JavaScript library. The website uses the Freelancer Bootstrap template.