Between the Clinton emails and the Podesta leak, it seems to me that many document sets include a lot of copy-pasted news articles. On their own, these are pretty boring, and they can obscure more interesting material. It'd be neat to classify or rank documents by how much of their content is boilerplate (signatures, disclaimers) or pasted news articles, and is therefore boring.
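
To make the idea a bit more concrete, here is a minimal sketch of one possible heuristic, not a worked-out design: score each document by the fraction of its lines that also show up in other documents, on the assumption that signatures, disclaimers and widely forwarded articles repeat across a document set. The function names (`normalize`, `boilerplate_scores`, `rank_by_boringness`), the `min_docs` threshold and the line-level matching are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List


def normalize(line: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(line.lower().split())


def boilerplate_scores(docs: Dict[str, str], min_docs: int = 2) -> Dict[str, float]:
    """Map each document id to the fraction of its non-empty lines that
    occur in at least `min_docs` documents (a rough "boring" score)."""
    doc_freq = Counter()   # normalized line -> number of documents containing it
    lines_by_doc = {}      # document id -> list of normalized, non-empty lines
    for doc_id, text in docs.items():
        lines = [normalize(raw) for raw in text.splitlines()]
        lines = [line for line in lines if line]
        lines_by_doc[doc_id] = lines
        # Use a set so a line repeated inside one document counts only once.
        doc_freq.update(set(lines))

    scores = {}
    for doc_id, lines in lines_by_doc.items():
        if not lines:
            scores[doc_id] = 0.0  # hypothetical choice: empty documents score 0
            continue
        repeated = sum(1 for line in lines if doc_freq[line] >= min_docs)
        scores[doc_id] = repeated / len(lines)
    return scores


def rank_by_boringness(docs: Dict[str, str]) -> List[str]:
    """Return document ids ordered from most to least boilerplate-heavy."""
    scores = boilerplate_scores(docs)
    return sorted(scores, key=scores.get, reverse=True)
```

This only catches text that repeats within the set, so an article pasted into a single document would slip through. Catching that would probably need a second signal, such as comparing against a corpus of known news text.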