Replication materials for Nicholls and Bright, "Understanding news story chains
using information retrieval and network clustering techniques", Communication
Methods and Measures
===============================================================================

This archive contains the code and data underlying the paper. Running the
scripts from `02CategorisePairs.py` to `30TestLDA.py` in numerical order should
reproduce the calculations.
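
As a convenience, the scripts could be run in sequence with a small driver
along the following lines. This is only a sketch and is not part of the
archive: it assumes every script name starts with a two-digit number, that the
scripts sit together in one directory, that each runs without command-line
arguments, and that the interpreter running the driver is the right one for
the scripts. The names `select_scripts` and `run_all` are hypothetical.

```python
import re
import subprocess
import sys
from pathlib import Path

# Assumption: script names look like "02CategorisePairs.py", i.e. a two-digit
# numeric prefix followed by a descriptive name.
SCRIPT_PATTERN = re.compile(r"^(\d{2})\w*\.py$")


def select_scripts(names, first=2, last=30):
    """Return the names whose numeric prefix is in [first, last], ascending."""
    numbered = []
    for name in names:
        match = SCRIPT_PATTERN.match(name)
        if match and first <= int(match.group(1)) <= last:
            numbered.append((int(match.group(1)), name))
    return [name for _, name in sorted(numbered)]


def run_all(directory="."):
    """Run each selected script in turn, stopping at the first failure."""
    names = [p.name for p in Path(directory).iterdir() if p.is_file()]
    for name in select_scripts(names):
        print(f"Running {name}")
        # check=True raises CalledProcessError if a script exits non-zero.
        subprocess.run([sys.executable, name], cwd=directory, check=True)
```

Calling `run_all()` from the directory holding the scripts would skip
`00ExtractArticles.py`, which needs the crawled archives that are not
distributed here, and stop as soon as any script fails.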

* For copyright reasons, the crawled web page archives aren't included. The
  `ArticleTexts` directory contains the extracted texts of the news articles,
  and `00ExtractArticles.py` contains the code used for the extraction.

* `ClusterVerification` contains judgement data for the effectiveness of the
  clustering process.

* `Graphs` contains the paper's figures.

* `HandCoded` contains the exhaustive pairwise codings of news articles for the
  evaluation of the first part of the matching process.

* `MisclassificationAnalyses` contains the validation set story pairs which
  were misclassified by the machine, for manual exploration.

* `Output` contains the primary outputs and will be regenerated by running the
  scripts.
  
* `Graphics and IRR` contains some of the post-processing of the output,
  graphics, and results from the IRR exercise.

We're very happy to discuss this paper with you, or provide whatever
replication assistance is necessary: contact tom.nicholls@bsg.ox.ac.uk.

Thanks,
Tom Nicholls and Jonathan Bright, October 2018.
