Eight years ago, in 2003, we proposed and presented the use of linguistic analysis for email author identification. Our use case began with the investigation of Advance Fee Fraud (AFF), also known as 419 scams from Nigeria. We proved, albeit from a small data set, that language can identify a message author using several key indicators. We further proved that bias made victims far more susceptible to social engineering attacks.
About five years later, in 2008, an educational institution in Quebec picked up this theme of email author identification by applying pattern analysis to data sets. They released an online paper called “A novel approach of mining write-prints for authorship attribution in e-mail forensics”:
“In this paper, we introduce an innovative data mining method to capture the write-print of every suspect and model it as combinations of features that occurred frequently in the suspect’s e-mails. This notion is called frequent pattern, which has proven to be effective in many data mining applications, but it is the first time to be applied to the problem of authorship attribution.”
Er, well, they are obviously wrong. The first time was not 2008. It probably was not even in 2006 (when we wrote our paper) or 2003. I would be far more impressed if they gave a little credit to the long history of language and data analysis, let alone our published and presented work. Our presentations on pattern frequency for authorship attribution predate not only their paper but, for at least two or three of the authors, their entire careers.
At the start of 2010, we presented our findings at the RSA Conference in San Francisco and showed how anonymous authors could be distinguished using linguistic analysis. We pulled apart email messages, organized them by their use of language (including stylometric features), and presented a taxonomy that predicts fraud based on key indicators.
The audience in our presentations always gets a quiz at the end; many seem surprised that they can suddenly see uniqueness in messages where they saw none before.
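For readers who want a feel for what "stylometric features" means in practice, here is a minimal Python sketch. It is not the taxonomy from our talk and not the Quebec method; the feature list, the function names, and the word list are all assumptions picked for illustration. It measures a handful of style traits in a message and compares two messages with cosine similarity.

```python
import math
import re
from collections import Counter

# Hypothetical word list chosen for illustration only; a real study would
# use a much larger, carefully selected set of function words.
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "that", "i", "you", "my", "your")


def stylometric_features(text: str) -> dict:
    """Return a small dictionary of style measurements for one message."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    total_words = max(len(words), 1)  # avoid division by zero on empty text
    counts = Counter(words)

    features = {
        "avg_word_length": sum(len(w) for w in words) / total_words,
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "uppercase_ratio": sum(c.isupper() for c in text) / max(len(text), 1),
        "exclamation_rate": text.count("!") / total_words,
    }
    # Relative frequency of each function word in the message.
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = counts[fw] / total_words
    return features


def cosine_similarity(a: dict, b: dict) -> float:
    """Cosine similarity of two feature dictionaries (0.0 if either is all zeros).

    Note: features are not rescaled here, so large-valued ones such as
    sentence length dominate. A real comparison would normalize them first.
    """
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

To try it, call `stylometric_features` on two message bodies and pass the results to `cosine_similarity`; a higher score only suggests similar style under these toy features and is not evidence of common authorship by itself.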
I just noticed that the Quebec crew have republished their paper under a more contemporary title with almost the same specific use case in mind, “Mining writeprints from anonymous e-mails for forensic investigation”:
“In this paper, we focus on the problem of mining the writing styles from a collection of e-mails written by multiple anonymous authors. The general idea is to first cluster the anonymous e-mail by the stylometric features and then extract the writeprint, i.e., the unique writing style, from each cluster. We emphasize that the presented problem together with our proposed solution is different from the traditional problem of authorship identification, which assumes training data is available for building a classifier.”
Here is a major differentiation point. We did not assume a massive amount of training data was available or necessary to build a classifier. Our system can be taught to virtually anyone, and they can then start identifying authorship immediately. We have applied it and presented it around the world, from Turkey to Brazil, with success.
Here is another major differentiation point. We were not begging for “first time” innovation recognition, because we combined the extant body of knowledge in linguistics and security (social engineering). We did it in a novel way to help reduce fraud, to stop people from falling victim to 419 scams, but we gave attribution.
We could have saved them a lot of time and hassle, since we have been reporting on this for eight years now. Perhaps there is a chance for collaboration in the future.
I could go on with differentiation points, but here’s one more. We don’t charge you to read our paper or presentation.