Slides of "Identifying news clusters using Q-analysis and Modularity" presentation at ECCS13
Identifying news clusters using Q-analysis and Modularity
David Rodrigues, Centre for Complexity and Design
The Open University, UK
Motivation
• Find structure in collections of text documents
• Create computer algorithms to automate this discovery with minimal human supervision
• Use hybrid methodologies to improve the quality of results
  – Topology-based approach describes the data
  – Clustering technique identifies modules
Problem Description
• Identify the structure of the news published online by The Guardian (among other newspapers)
  – Clustering?
  – Topology?
  – Topic Modelling?
  – Noise?
  – Novelty?
  – Change?
[Kohut, A. and Remez, M. (2008)]
Clustering Techniques in Topic Modelling
• Nearest neighbour classification
• Bayesian probabilistic techniques
• Decision trees
• Regression models
• Neural networks
• Support Vector Machines
• Language dependent / human intervention in the definition of categories for training samples
Clustering in Graphs is Community Detection
• Modularity-based techniques [majority]
• Spectral algorithms
• Synchronization-based techniques
• …
• [Community detection in graphs - Fortunato, 2010, for a comprehensive review]
• Binary relations between nodes don't capture the multi-level structure of existing relations.
  – Move to n-ary relations and descriptions
Previously
• We used a sliding window over the time series of the news stories
• Used Variation of Information to measure changes in an evolving adaptive network of news [Meilă 2007, Rodrigues 2010]
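As a reminder of what Variation of Information measures, here is a minimal sketch of the standard definition, VI(A, B) = H(A|B) + H(B|A), for two clusterings of the same items. The function name and the label-list input format are assumptions made for illustration, and the sketch assumes both clusterings cover exactly the same set of items (for example, the stories shared by two consecutive windows).

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """VI(A, B) = H(A|B) + H(B|A), in nats.

    labels_a and labels_b are equal-length sequences giving the cluster
    label of each item under clustering A and clustering B.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    n = len(labels_a)
    count_a = Counter(labels_a)                   # cluster sizes in A
    count_b = Counter(labels_b)                   # cluster sizes in B
    count_ab = Counter(zip(labels_a, labels_b))   # overlap sizes
    vi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        # n_ab / count_a[a] is p(a, b) / p(a); likewise for b.
        vi -= p_ab * (log(n_ab / count_a[a]) + log(n_ab / count_b[b]))
    return vi
```

Identical clusterings give a value of zero; the value grows as the two partitions disagree more, which is what makes it usable as a change measure between windows.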
Our Proposal
• Use a high-dimensional representation of the documents (Simplicial Complex)
• Use Q-analysis to describe the system constructed from the Documents x Tags Incidence Matrix
• Use Q-connected components to filter noise
• Use modularity optimisation to find communities in the resulting induced graphs
Noise?
• In the news context, we define noise as news items that are only loosely related to the main topics published.
• We can filter them out by assuming that the Q-connectedness of these items is very low.
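The filtering idea can be sketched as follows, using the usual Q-analysis convention that two items are q-near when they share at least q + 1 tags, and that q-connectivity is the transitive closure of q-nearness. The toy data, the function names, and the choice of threshold q are all hypothetical; the slides do not state which threshold was used.

```python
from itertools import combinations


def q_connected_components(tags_by_doc, q):
    """Group documents into q-connected components.

    Documents with fewer than q + 1 tags have dimension below q and are
    left out, as in standard Q-analysis.
    """
    docs = [d for d, tags in tags_by_doc.items() if len(tags) >= q + 1]
    parent = {d: d for d in docs}

    def find(d):
        while parent[d] != d:
            parent[d] = parent[parent[d]]  # path halving
            d = parent[d]
        return d

    for a, b in combinations(docs, 2):
        if len(tags_by_doc[a] & tags_by_doc[b]) >= q + 1:  # q-near
            parent[find(a)] = find(b)

    groups = {}
    for d in docs:
        groups.setdefault(find(d), set()).add(d)
    return list(groups.values())


def noise_candidates(tags_by_doc, q):
    """Documents that are not q-connected to any other document."""
    connected = set()
    for component in q_connected_components(tags_by_doc, q):
        if len(component) > 1:
            connected |= component
    return set(tags_by_doc) - connected


# Hypothetical toy data: news item -> set of tags.
news = {
    "n1": {"politics", "uk", "election"},
    "n2": {"politics", "uk", "labour"},
    "n3": {"politics", "uk", "labour", "budget"},
    "n4": {"football", "uk"},
    "n5": {"weather"},
}

print(noise_candidates(news, 1))
```

In this toy example, raising q from 0 to 1 detaches "n4", which shares only a single tag with the rest, so the threshold directly controls how aggressively loosely related items are dropped.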
The Guardian
• Classifies news with useful metadata:
  – …
  – Section
  – Tags
  – …
http://www.theguardian.com/open-platform
Open Platform with API for application development. 3 years of data: 2010, 2011 and 2012
Incidence Matrix
          TAG 1   TAG 2   TAG 3   TAG 4   TAG 5   …
NEWS 1      1       1       0       0       0     …
NEWS 2      0       1       1       0       1     …
NEWS 3      0       1       0       0       1     …
NEWS 4      1       0       0       0       1     …
NEWS 5      0       0       0       1       1     …
…           …       …       …       …       …     …
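To make the link between the incidence matrix and Q-analysis concrete, the sketch below computes, for each pair of news items, the number of shared tags minus one (the dimension of their shared face). It uses only the five rows and five columns shown above; the elided columns are not filled in, and the variable and function names are assumptions for illustration.

```python
# Rows of the incidence matrix above (only the five tag columns shown).
incidence = {
    "NEWS 1": [1, 1, 0, 0, 0],
    "NEWS 2": [0, 1, 1, 0, 1],
    "NEWS 3": [0, 1, 0, 0, 1],
    "NEWS 4": [1, 0, 0, 0, 1],
    "NEWS 5": [0, 0, 0, 1, 1],
}


def shared_face_dimension(row_a, row_b):
    """Number of tags two news items share, minus one.

    A value of -1 means the items share no tag at all; for an item paired
    with itself the value is the dimension of its own simplex.
    """
    return sum(a * b for a, b in zip(row_a, row_b)) - 1


names = list(incidence)
shared = {
    (a, b): shared_face_dimension(incidence[a], incidence[b])
    for a in names
    for b in names
}

for a in names:
    print(a, [shared[(a, b)] for b in names])
```

With these rows, NEWS 2 and NEWS 3 share two tags (TAG 2 and TAG 5), so they are 1-near, while NEWS 1 and NEWS 5 share none. This is the same computation as the matrix product of the incidence matrix with its transpose, minus one in every entry.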
Documents x Tags
Community detection on the 0-connected graph
1 Month of News – November 2011: Modularity = 0.48, 9 communities
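For readers unfamiliar with the score reported here, the following is a minimal sketch of the Newman-Girvan modularity of a given partition of an undirected, unweighted graph. It only evaluates a partition; it does not reproduce whichever optimisation algorithm was used for the talk, which the slides do not name. The toy graph and all identifiers are hypothetical.

```python
def modularity(edges, community_of):
    """Newman-Girvan modularity of a partition.

    Q = sum over communities c of  L_c / m - (d_c / (2 m)) ** 2
    where m is the number of edges, L_c the number of edges with both
    ends in c, and d_c the total degree of the nodes in c.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("the graph needs at least one edge")
    inside = {}   # community -> number of internal edges
    degree = {}   # community -> summed degree of its nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    return sum(
        inside.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items()
    )


# Hypothetical toy graph: two triangles joined by a single edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in "abcdef"}

print(modularity(edges, two_groups), modularity(edges, one_group))
```

Splitting the toy graph into its two triangles scores higher than keeping everything in one community, which scores zero; a modularity optimiser searches for the partition that maximises this quantity.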
Developed Tools
• Theseus – A Python application for collecting, processing and visualising the textual dataset - https://github.com/sixhat/theseus
• Visualisation tool
Conclusions
• Q-analysis gives a descriptive overview of the structure of the system, in terms of the local connectivity of the news stories.
• Clustering (on top of the Q-analysis) gives a natural (highly modular) division of the resulting structures.
• This allows the identification of coherent news clusters and the filtering of noise news.
Generalisation of applicability
• Instead of human-tagged documents, one can apply this to any kind of text-based document:
  – HTML webpages: use the keywords tag from the header, or extract keywords with topic modelling (LDA, for example)
  – Scientific documents: tag documents with topic modelling strategies like LDA and, instead of treating weakly connected documents as noise, explore the possibility that they might be emerging scientific trends.
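For the HTML case, a minimal sketch of pulling the keywords meta tag out of a page header with the Python standard library might look like this. The class name, function name and sample page are hypothetical, and real pages may lack the tag entirely, in which case the set comes back empty.

```python
from html.parser import HTMLParser


class KeywordsParser(HTMLParser):
    """Collect the content of <meta name="keywords" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            # Keywords are conventionally comma-separated.
            self.keywords |= {
                k.strip().lower() for k in content.split(",") if k.strip()
            }


def keywords_from_html(html):
    """Return the set of keywords declared in the page, lower-cased."""
    parser = KeywordsParser()
    parser.feed(html)
    return parser.keywords


# Hypothetical sample page.
page = (
    '<html><head><meta name="keywords" '
    'content="Politics, UK news, Budget"></head><body></body></html>'
)
print(keywords_from_html(page))
```

Each page's keyword set can then play the role of a row in the Documents x Tags incidence matrix, with one column per distinct keyword.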
Take home message
• Real complex systems are multi-dimensional. Community detection methods need to take those descriptions into account.
• Constructing descriptions with all the relations (hyper-simplices) gives qualitatively better results.
• In the newspaper case, this helps filter out "noise" news (unrelated news).