For several years I have tried to speak openly about why I find it disappointing that analysts rely heavily (sometimes exclusively) on language to determine who is a foreigner.
Back in 2011 I criticized McAfee for their rather awful analysis of language.
They are making some funny and highly improbable assumptions: … The attackers used Chinese language attack tools, therefore they must be Chinese. This is a reverse language bias that brings back memories of L0phtCrack. It only ran in English.
Here’s the sort of information I have presented most recently for people to consider:
Above, you see the analysts tell a reporter that the presence of a Chinese language pack is the clue to Chinese design and operation of attacks on Russia. Further investigation then revealed the source was actually Korea. Major error, no? It seems to be reported as only an “oops” instead of a WTF.
At a recent digital forensics and incident response (DFIR) meeting I pointed out that moving the origin of the attacks on Russia from China to Korea was, of course, a huge shift in attribution, one with potential connections to the US.
This did not sit well with at least one researcher in the audience. “What proof do you have there are any connections from Korea to the US?” they yelled out. I assumed they were facetiously trying to see if I had evidence of an English language pack to prove my point.
In retrospect they may actually have been seriously asking me to offer clues as to why Korean systems attacking Russia might be linked to America. I regret not taking the time to explain what clues more significant than a language pack tend to look like. Cue old history lesson slides…but I digress.
Here’s another slide from the same talk I gave about attribution and language. I point to census data on the number and location of Chinese speakers in America, and to the most popular languages used on the Internet.
Unlike McAfee, mentioned above, FireEye and Mandiant have continued to ignore the obvious and point to Chinese language as proof of someone being foreign.
Consider for a moment that the infamous APT1 report suggests that language proves nothing at all. Here is page 5:
Unit 61398 requires its personnel to be…proficient in the English language
Thus proving APT1 are English-speaking and therefore not foreigners? No, wait, I mean proving that APT1 are very dangerous because you can never trust anyone required to be proficient in English.
But seriously, Mandiant sets this out presumably to establish two things.
First, “requires to be proficient” is a subtle way to say Chinese never will do better than “proficient” (non-native) because, foreigners.
Second, the Chinese target English-speaking victims (“Only two victims appear to operate using a language other than English…we believe that the two non-English speaking victims are anomalies”). Why else would the Chinese learn English except to be extremely targeted in their attacks — narrowing their focus to basically everywhere people speak English. Extremely targeted.
And then on page 6 of APT1 we see supposed proof from Mandiant of something else very important. Use of a Chinese keyboard layout:
…the APT1 operator’s keyboard layout setting was “Chinese (Simplified) – US Keyboard”
On page 41 (suspense!) they explain why this matters so much:
…Simplified Chinese keyboard layout settings on APT1’s attack systems, betrays the true location and language of the operators
Mandiant gets so confident in where someone is from based on assessing language they even try to convince the reader that Americans do not make grammar errors. Errors in English (failed attempts at proficiency) prove they are dealing with a foreigner.
Their own digital weapons betray the fact that they were programmed by people whose first language is not English. Here are some examples of grammatically incorrect phrases that have made it into APT1’s tools
It is hard to believe this is not meant as a joke. There is a complete lack of linguistic analysis; there is just a strange assertion about proficiency. In our 2010 RSAC presentation on the linguistics of threats we gave an analysis of phrases and showed how syntax and spelling can be useful to understand origins. I can only imagine what people would have said if we had tried to argue “Bad Grammar Means English Ain’t Your First Language”.
Of course I am not saying Mandiant or others are wrong to suspect Chinese connections when they find some Chinese language. Even though analysts wear clothes with Chinese language tags and use computers that probably have Chinese language print on them, there may be some actual connections worth investigating further.
My point is that the analysis offered to support conclusions has been incredibly weak, almost to the point of being a huge distraction from the quality in the rest of the reports. It makes serious work look absurd when someone over-emphasizes language spoken as proof of geographic location.
Now, in some strange twist of “I told you so”, the Twittersphere has come alive with condemnation of an NSA analyst for relying too heavily on language.
Thank you to Chris and Halvar and everyone else for pointing out how awful it is when the NSA does this kind of thinking; please also notice how often it happens elsewhere.
More people need to speak out against this generally in the security community on a more regular basis. It really is far too common in far too many threat reports to be treated as unique or surprising when the NSA does it, no?
Part 2?