The Limitations of Machine Learning in Cybersecurity

How are Machine Learning (ML) techniques currently employed in cyber security?

As the amount of data created daily increases (already at an alleged 2.5 quintillion bytes a day [1]), ML techniques are allowing us to cluster, organise and turn this data into actionable information. This is especially true in the realm of Cyber Security.

Don’t be scared of the term Machine Learning; it really just means a computer that can learn to do something without being explicitly programmed for that task. The process typically involves training the machine to do a task (e.g. categorising some data points) on some known data and then setting it loose to do the same task on some unknown data.

Let’s have a quick look at some of the ways we encounter ML every day in Cyber Security.

Spam Mail Filtering

Ever wondered how Google is able to accurately identify spam mail and filter it from your inbox? I’m not going to pretend I know the ins and outs of how Google does this exactly, but on a basic level it starts with training a program on a known dataset which is a mixture of spam and non-spam emails. The probability that an email is spam, given that it contains certain words or phrases such as ‘loan’ or ‘meet singles’, is calculated, and the program classifies an email as spam if that probability is over a certain threshold. Several other variables, such as the quality of grammar or whether the sender is in your contacts, can be added into this probability calculation too. On the odd occasion that we do get a spam email in our inbox, we might mark it as spam. These marked emails can then become part of a crowd-sourced training set which the spam filter can continue training on. Over time the machine builds up a pretty strong model of what constitutes a spam email.

However, we should consider what would happen if thousands of people mislabelled a legitimate email as spam, or if an unusual turn of phrase were wrongly marked as spam-like. The machine could learn an incorrect bias against certain types of email.
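
To make the probability calculation a little more concrete, here is a minimal sketch of a word-based spam filter using naive Bayes. This is only an illustration of the general idea and not how Google's filter works; the six training emails, the function names and the 0.9 threshold are all made up for the example.

```python
import math
from collections import Counter

# Hypothetical, hand-made training set: (email text, label).
TRAINING = [
    ("get a cheap loan today", "spam"),
    ("meet singles in your area", "spam"),
    ("loan approved meet singles now", "spam"),
    ("meeting notes from today", "ham"),
    ("lunch with the team today", "ham"),
    ("project notes and agenda", "ham"),
]

THRESHOLD = 0.9  # assumed cut-off; a real filter would tune this


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def spam_probability(text, word_counts, label_counts):
    """Estimate P(spam | words) with Bayes' rule and add-one smoothing."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total_emails = sum(label_counts.values())
    log_scores = {}
    for label in ("spam", "ham"):
        words_in_label = sum(word_counts[label].values())
        # Start from the prior: how common this label is overall.
        score = math.log(label_counts[label] / total_emails)
        for word in text.lower().split():
            # Add-one smoothing stops unseen words zeroing the product.
            likelihood = (word_counts[label][word] + 1) / (words_in_label + len(vocab))
            score += math.log(likelihood)
        log_scores[label] = score
    # Turn the two log scores back into a normalised probability.
    highest = max(log_scores.values())
    weights = {label: math.exp(s - highest) for label, s in log_scores.items()}
    return weights["spam"] / (weights["spam"] + weights["ham"])


def is_spam(text, word_counts, label_counts, threshold=THRESHOLD):
    """Classify as spam when the estimated probability exceeds the threshold."""
    return spam_probability(text, word_counts, label_counts) > threshold


word_counts, label_counts = train(TRAINING)
print(is_spam("cheap loan meet singles", word_counts, label_counts))
print(is_spam("team meeting notes", word_counts, label_counts))
```

Because the model is nothing more than these word counts, a flood of wrongly labelled emails would shift the probabilities in exactly the same way as correct labels do, which is the bias risk described above.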

Network Monitoring, Intrusion Detection & Antivirus

Similar principles can be used for tasks like network monitoring; however, instead of using labelled data as we do with spam, we can let the computer create its own groups and classifications (an approach known as unsupervised learning). So, unlike our spam example where the computer is trained on data which we know is spam, in this scenario the computer has to build up its own models of good and bad network traffic.

Products like Darktrace detect intrusions on a network by monitoring network traffic and building up a picture of what ‘normal’ network usage looks like. This is particularly useful as it means attacks do not have to be explicitly described in order for the system to spot them; the system just has to look for behaviour that differs from the norm.
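
As a rough illustration of "learn the norm, then flag what deviates", the sketch below builds a baseline from a host's past traffic and flags readings that sit far outside it. It is a deliberately simple statistical stand-in and says nothing about how Darktrace or any other product works internally; the traffic figures, function names and the three-standard-deviation threshold are assumptions made for the example.

```python
import statistics


def build_baseline(samples):
    """Summarise 'normal' behaviour as the mean and standard deviation."""
    return statistics.mean(samples), statistics.stdev(samples)


def is_anomalous(value, baseline, z_threshold=3.0):
    """Flag a reading that lies more than z_threshold deviations from the mean."""
    mean, stdev = baseline
    if stdev == 0:
        # With no variation in the baseline, any different value is unusual.
        return value != mean
    return abs(value - mean) / stdev > z_threshold


# Hypothetical outbound traffic for one host, in KB per minute.
normal_traffic = [120, 130, 125, 118, 122, 135, 128, 121, 127, 124]
baseline = build_baseline(normal_traffic)

for reading in (126, 135, 900):
    print(reading, is_anomalous(reading, baseline))
```

Note that no attack is described anywhere in the code: the 900 KB/min reading is flagged purely because it is unlike what came before. The same property cuts the other way, since an unusual but harmless burst of traffic would be flagged too.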

We’re seeing a similar revolution in the antivirus (AV) market. Cylance, founded in 2012 and valued at $1Bn in 2016 [4], is one of its innovators. They trained their core product on a massive collection of data on different file types to build up models of the characteristics (“a file genome”) of each sort of file. This allows the machine to detect files that are uncharacteristic and flag them to the user. This approach seems more robust than the signature recognition methods used by more classical AV products, which can only match malware that has already been seen and catalogued.

Limitations

ML implementations are limited by their dependence on good (clean) training data. The ‘norm’ needs to be established before we can look for anomalies. These algorithms can continue to learn with humans supervising the feedback loop, but there is a fine line to tread between aiding and hindering the legitimate use of the systems being protected. Understandably, some ML algorithms err on the side of caution, which can lead to the reporting of false positives. Reddit has many threads from system administrators reporting that they are now spending a lot of time investigating and whitelisting programs and files that have been wrongly reported as malicious by these products. Can we assume that these problems will subside as the amount of data for the machines to learn from increases?
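
A quick back-of-the-envelope calculation shows why false positives can swamp an administrator even when a detector looks accurate on paper. The numbers below are entirely hypothetical (1 file in 1,000 is malicious, the detector catches 99% of malware and wrongly flags 1% of clean files) and the function name is our own; the point is only the arithmetic.

```python
def alert_precision(prevalence, true_positive_rate, false_positive_rate):
    """Return the fraction of alerts that point at something truly malicious."""
    true_alerts = prevalence * true_positive_rate
    false_alerts = (1 - prevalence) * false_positive_rate
    return true_alerts / (true_alerts + false_alerts)


# Hypothetical figures, chosen only to illustrate the effect.
precision = alert_precision(
    prevalence=0.001,          # 1 in 1,000 files is malicious
    true_positive_rate=0.99,   # 99% of malicious files are flagged
    false_positive_rate=0.01,  # 1% of clean files are flagged by mistake
)
print(f"{precision:.1%} of alerts are real threats")
```

Under these assumptions only about 9% of alerts are genuine, simply because clean files vastly outnumber malicious ones. The rest is the investigating and whitelisting work described above.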

The advent of ML in the realm of Cybersecurity heralds an exciting new era in threat detection and, inevitably, an innovative reaction from criminal hackers. It will be interesting to see how hackers find ways to side-step these technologies; a real test of human against machine!

At XQ Labs we are driven by a mission to use scan data in the smartest way possible. We are investigating ML techniques to analyse the huge corpus of vulnerability data that we collect from CyberScore and discovering innovative ways to feed back information about threats to our users.

by Tom Duffy & Isaac Matthews


Resources:

  1. https://www.trendmicro.com/vinfo/us/security/news/security-technology/is-big-data-big-enough-for-machine-learning-in-cybersecurity
  2. https://www.domo.com/learn/data-never-sleeps-5?aid=ogsm072517_1&sf100871281=1
  3. https://www.analyticsvidhya.com/blog/2018/07/using-power-deep-learning-cyber-security/
  4. https://techcrunch.com/2016/06/09/cylance-fighting-malicious-hackers-with-ai-hits-1b-valuation-after-raising-100m/
  5. https://www.wsta.org/wp-content/uploads/2017/03/math-vs-malware-20160422.pdf
