COTS Series: Endpoint Antivirus/Antimalware

May 17, 2017

A few months ago - quite a few months ago now, actually - I started a project looking at potential replacements for traditional AntiVirus (AV).

Note: I use the terms virus and malware pretty much interchangeably in this article, antivirus and antimalware too. I’m wrapping a bunch of different types of malware and virus up with trojans and worms and ransomware and everything else that you get into a single thing.

Traditional antivirus works in the following way:

Antivirus vendors produce a list of signatures that match the characteristics of virus infected files. In a simplistic case, a file that contains the following: X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*

will be identified by (EICAR-compliant) Anti-Virus software as being the EICAR test file.

I say EICAR-compliant because not all AV products actually detect it.

Most AV vendors publish multiple signature updates per day, so in theory, they’ll be keeping the installed AV software up to date.

It’s actually a little more complicated than that: Traditional AV relies primarily on signatures and hashes of files (and in some cases, segments of files), along with heuristic mechanisms, to determine whether a file is safe or infected with some form of virus content.
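To make that concrete, here’s a minimal sketch of signature and hash matching in Python. It is nowhere near how a real AV engine is built: the `SIGNATURES` and `KNOWN_BAD_HASHES` tables and the `scan` function are names I’ve made up for illustration, and the only "signature" in the database is the EICAR string quoted above.

```python
import hashlib
from typing import Optional

# The EICAR test string quoted above, used here as the only byte signature.
EICAR = rb"X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*"

# Hypothetical signature database: detection name -> byte pattern to look for.
SIGNATURES = {"EICAR-Test-File": EICAR}

# Hypothetical hash blocklist: SHA-256 of whole files already known to be bad.
KNOWN_BAD_HASHES = {hashlib.sha256(EICAR).hexdigest(): "EICAR-Test-File"}


def scan(data: bytes) -> Optional[str]:
    """Return the name of the first match, or None if nothing matches."""
    # Whole-file hash lookup: cheap, but only matches byte-identical files.
    digest = hashlib.sha256(data).hexdigest()
    if digest in KNOWN_BAD_HASHES:
        return KNOWN_BAD_HASHES[digest]
    # Pattern lookup: matches the signature anywhere inside the file.
    for name, pattern in SIGNATURES.items():
        if pattern in data:
            return name
    return None
```

Anything that isn’t in either table comes back as `None`, which is exactly the weakness: "no match" and "safe" look the same to the scanner.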

The problem comes when you find a novel virus, that is to say, one that has never been seen by an AV vendor before - such as one crafted especially for one attack, and one attack only. There will be no signature for this, and as a result, it will not be detected.

Alternatively, the virus or malware may have been crafted in response to a zero-day vulnerability, and AV vendors are lagging behind in identifying the signature.

Some particularly clever (depending on your point of view) malware is able to change its executable code, evading detection by changing the signature hash that the AV agent will generate for it. This is known as Polymorphic Malware.
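A quick sketch of why this works against hash and pattern matching, using harmless stand-in bytes rather than anything real. The payload, the single-byte XOR re-encoding and the key `0x5A` are all assumptions for illustration; real polymorphic engines are far more elaborate.

```python
import hashlib

# Stand-in bytes, not real malware.
original = b"\x90\x90PAYLOAD-BYTES\x90\x90"
# The same content with one extra padding byte appended.
mutated = original + b"\x00"


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def xor_encode(data: bytes, key: int) -> bytes:
    """Re-encode every byte with a single-byte XOR key (its own inverse)."""
    return bytes(b ^ key for b in data)


# A blocklist that only knows the hash of the original sample.
blocklist = {sha256(original)}

# One changed byte gives a completely different hash, so the lookup misses.
hash_still_matches = sha256(mutated) in blocklist

# Re-encoding the body hides the byte pattern a signature would look for...
pattern_still_matches = b"PAYLOAD-BYTES" in xor_encode(original, 0x5A)

# ...while decoding with the same key restores the original bytes exactly.
round_trip_ok = xor_encode(xor_encode(original, 0x5A), 0x5A) == original
```

Both `hash_still_matches` and `pattern_still_matches` end up false, even though the underlying content hasn’t meaningfully changed.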

The process of Traditional AV software staying up-to-date requires frequent database updates as fresh malware samples are gathered and identified. There have been cases in the past where fresh malware hasn’t been identified as such because the client endpoint database wasn’t up to date, leading to the fresh samples being miscategorised as safe when they shouldn’t have been. Alternatively, there is always the possibility that the malware will mutate on the target system, after infection, changing its signature and hashes, leading to the AV missing the infection completely.

One particular case where virus infection may be missed is AV software running in disconnected, or air-gapped, networks, where signature updates are not being regularly received and applied. In most cases, updating these AV signatures is a manual process, and given that many Traditional AV systems are receiving multiple signature updates per day, there is no realistic chance that a human can keep on top of the number of updates required to ensure true up-to-date virus protection using this method.

Either way, the result is the same: the virus infection won’t be picked up, and trouble will ensue.

One of the biggest problems I have seen with traditional file-scanning based AV is the stupendous amount of system load generated by scanning every single file daily, and then scanning every file that is accessed, whenever it is accessed, either for reads or writes. This is particularly noticeable on developer workstations, or servers used for building software. I have observed as much as a 50% slowdown caused by having on-access AV scanning enabled on a build server, compared to having on-access scanning disabled. When you have an enterprise network of 400 servers and workstations, that is a lot of scanning taking place, the majority of it on files which are identical across the network - ntoskrnl.exe on System 1 should be identical on System 400.
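To put that redundancy into perspective, here’s a small sketch that counts how many real scans are needed if verdicts are remembered by content hash. This is my own illustration, not a description of how any particular product works; `VerdictCache` and `slow_scan` are hypothetical names, and the "scanner" is a trivial stand-in.

```python
import hashlib


class VerdictCache:
    """Remember scan verdicts by SHA-256 so identical files are scanned once."""

    def __init__(self, scanner):
        self._scanner = scanner      # callable: bytes -> bool (True = malicious)
        self._verdicts = {}          # hex digest -> cached verdict
        self.scans_performed = 0     # how many times the real scanner ran

    def check(self, data: bytes) -> bool:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self._verdicts:
            self._verdicts[digest] = self._scanner(data)
            self.scans_performed += 1
        return self._verdicts[digest]


def slow_scan(data: bytes) -> bool:
    """Stand-in for an expensive scan; flags a made-up marker string."""
    return b"EVIL" in data


cache = VerdictCache(slow_scan)
kernel_image = b"identical system file contents"

# The same file seen on 400 systems triggers one real scan, not 400.
results = [cache.check(kernel_image) for _ in range(400)]
```

After the loop, `cache.scans_performed` is 1, against the 400 scans you get when every endpoint independently scans its own copy.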

Fortunately, AV vendors appear to be catching up, and there is a lot of cool shit on the endpoint protection market these days. Some cooler than others.

Basically, they fall into two main categories.

Cloud-based and Machine Learning based.

Cloud-based AV installed on the Endpoint is typically far more lightweight than that used in Traditional AV - the majority of the scanning grunt work is offloaded to massive virtual server farms in the vendor’s public cloud.

Some work by uploading just file metadata, some upload sections of the file, some upload the entire file.

There are many cloud-based AV solutions on the market currently that offer very little protection if the endpoint is not constantly connected to the internet. This is because they will fall back to signature-based behavior if they are unable to leverage the cloud-based scanning service.

In a network of many endpoints, the amount of bandwidth to communicate with the cloud service could be quite considerable!

Unsurprisingly, as this level of digital transformation takes place in the AV market, nearly every traditional AV vendor has produced some variant of cloud-based scanning. They all work in subtly different ways, and each requires careful investigation before implementing in your enterprise environment.

Without a mutually binding NDA in place, it is frequently unclear exactly what data will be sent to their cloud service. Working out which ones are suitable for your organisation may therefore take a considerable amount of time, tied up in legal paperwork and careful reading of Terms and Conditions documents, in order to establish what potential there is for your Intellectual Property to end up on a server outside of your control!

Machine Learning AV (MLAV) is clever shit.

Really good MLAV is super fucking clever.

There seem to be two main variants in the marketplace at the moment. One is client-based, where the lion’s share of the processing - Machine Learning and Intelligence - takes place locally, on the client. The other, unsurprisingly, is cloud-based ML, where the neural networks and classification algorithms live in the cloud.

Cloud-based ML suffers from the same problems as non-ML cloud-based AV, where the level of scanning that can be performed is directly proportional to the level of connectivity to the internet. If you’re disconnected, or air-gapped, you’re not going to get the same level of protection as if you’ve got a constant connection to the internet for file and metadata uploads, scanning and receipt of results.

The client-based MLAV is super smart, though, and highly optimised to break down files and executables (because at the end of the day, there’s little point in scanning every text, jpeg and .chm file for a signature of a virus, because it’ll only spread or infect a system when it gets executed).

Once you realise this, and realise that even a folder stuffed with 1000 individual malware samples poses no threat unless you run one of them, you realise that MLAV can be a lot more clever in its mode of operation.

That mode of operation varies between vendors, but the basic gist is this:

  • Analyse the code that wants to run. Identify things like buffer overflows, stack pivots, RAM scraping, process injection, memory exploitation.

  • Extract characteristics of the files and DLLs that are about to execute for the above mechanisms (and a bunch of others that I haven’t mentioned, some of which may be proprietary mechanisms for distinguishing malware from legitimate files). All of these have been through a Neural Network training process with legitimate file behavior and malicious file behavior, allowing MLAV to identify them and determine their intention.

  • Because there are no signatures to update, only the neural network model (which changes far less frequently than a signature database), the requirement for many updates per day is greatly reduced - typically to one update every 6-12 months.

  • Similarly, because there are no signature files to update, there is no requirement for the systems to be online all the time - in fact, at the time of writing, there is a video demonstrating that an 18-month-old MLAV (Cylance) install can detect and quarantine a WannaCrypt0r infection without having received updates, and without access to the internet for advanced file scanning.

  • Given that the MLAV agent isn’t scanning every single goddamn file on the system to see whether it might contain a virus, and is instead only examining the file/executable behavior, the system requirements in terms of CPU and memory utilisation are considerably lower. On production systems, I’ve seen Sophos AV using 100% CPU whilst wasting time with on-access scanning and background scanning of every file on a server. CPU utilisation for MLAV during my testing was, at maximum, 9%.
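The "extract characteristics, then let a trained model decide" idea can be sketched as a toy in Python. To be clear about the assumptions: this is not any vendor’s method, it only looks at static bytes rather than runtime behaviour, the two features (byte entropy and a count of API names commonly associated with process injection) are just examples, and `WEIGHTS` and `BIAS` are hand-picked numbers standing in for a model that a real product would learn from large sets of good and bad files.

```python
import math
from collections import Counter

# Illustrative list of Windows API names often associated with process injection.
SUSPICIOUS_IMPORTS = (b"VirtualAllocEx", b"WriteProcessMemory", b"CreateRemoteThread")

# Hand-picked numbers standing in for a trained model (an assumption, not learned).
WEIGHTS = {"entropy": 0.9, "suspicious_imports": 1.5}
BIAS = -7.0


def shannon_entropy(data: bytes) -> float:
    """Bits of entropy per byte: 0.0 for constant data, 8.0 for uniform bytes."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def extract_features(data: bytes) -> dict:
    """Turn raw file bytes into a small dictionary of numeric characteristics."""
    return {
        "entropy": shannon_entropy(data),
        "suspicious_imports": sum(name in data for name in SUSPICIOUS_IMPORTS),
    }


def malicious_probability(data: bytes) -> float:
    """Logistic score in (0, 1); higher means 'looks more like malware'."""
    features = extract_features(data)
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))
```

The point of the sketch is the shape of the pipeline: nothing in it needs a daily update, because the decision comes from the model’s weights rather than from a list of known-bad files.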

The Great AV Bakeoff - 2016

I started this way back in November, with a flurry of complaints about how much bandwidth, CPU and Memory our Traditional AV was using, and set about making a shortlist of possible alternatives for the replacement.

I had 2 alternative products from the Traditional AV market, and 2 from the Machine Learning AV market.

I had already ruled out a bunch of the usual faces, because of recommendations from things like the Gartner Magic Quadrant for Endpoint AV, and some for having ridiculous clauses within their Terms and Conditions statements.

I had a pretty good idea of requirements - whatever it was I chose had to be:

  • Suitable for offline usage (clear win for MLAV over Traditional AV).
  • Low in CPU / Memory utilisation.
  • Able to detect and quarantine zero-day exploits (this was tricky to test).
  • Able to be used on high-utilisation servers and workstations (developers, and build servers primarily).

I built a lab, behind a physically separate firewall, isolated from the rest of my environment, which basically had ICMP and http/https access to the internet, and Nothing Else.

I had one laptop for each of the test AV solutions, including the incumbent, connected to the Lab network, running the current latest patchset for Windows 7.

I then searched the internet for fresh, that is to say, zero-day malware samples, and collected approximately 20 fresh malware samples from a bunch of sources, email and websites, and checked them by uploading them to VirusTotal.

I also collected another 20 samples of week old, and a further 20 of month-old malware samples.

The next step was to take a somewhat retro approach, and burn them to a CD. The reasoning behind this was to prevent the AV under test from erasing them from the source media, which is a very real possibility if I were to, say, replace the CD with a shared directory, or a USB stick.

One by one, I infected the target machines with the virus samples from the CD, and observed the responses by the AV products.

The outcome was this:

Fresh Malware

Traditional AV Offline: 0/20
Traditional AV Online: 1/20
Cloud-based AV Online: 6/20
Cloud-based AV Offline: 0/20
MLAV Online: 20/20
MLAV Offline: 20/20

Week-old Malware

Traditional AV Offline: 2/20
Traditional AV Online: 5/20
Cloud-based AV Online: 12/20
Cloud-based AV Offline: 1/20
MLAV Online: 20/20
MLAV Offline: 20/20

Month-old Malware

Traditional AV Offline: 6/20
Traditional AV Online: 10/20
Cloud-based AV Online: 18/20
Cloud-based AV Offline: 11/20
MLAV Online: 20/20
MLAV Offline: 20/20
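For anyone who prefers percentages, the arithmetic over the counts above can be sketched like this. The numbers are exactly the ones listed; the `DETECTIONS` layout and the function names are just my own choices for the example.

```python
SAMPLES_PER_BATCH = 20

# Detections out of 20 per batch, copied from the results above.
DETECTIONS = {
    "Traditional AV Offline": {"fresh": 0, "week-old": 2, "month-old": 6},
    "Traditional AV Online": {"fresh": 1, "week-old": 5, "month-old": 10},
    "Cloud-based AV Online": {"fresh": 6, "week-old": 12, "month-old": 18},
    "Cloud-based AV Offline": {"fresh": 0, "week-old": 1, "month-old": 11},
    "MLAV Online": {"fresh": 20, "week-old": 20, "month-old": 20},
    "MLAV Offline": {"fresh": 20, "week-old": 20, "month-old": 20},
}


def detection_rate(detected: int, total: int) -> float:
    """Percentage of samples detected."""
    return 100.0 * detected / total


def overall_rates(detections: dict) -> dict:
    """Detection rate per product across every batch it was tested against."""
    rates = {}
    for product, batches in detections.items():
        total = SAMPLES_PER_BATCH * len(batches)
        rates[product] = detection_rate(sum(batches.values()), total)
    return rates


if __name__ == "__main__":
    for product, rate in overall_rates(DETECTIONS).items():
        print(f"{product}: {rate:.1f}%")
```

Across all 60 samples that works out to roughly 13% for Traditional AV offline, 27% for Traditional AV online, 20% for Cloud-based AV offline, 60% for Cloud-based AV online, and 100% for MLAV either way.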

The message I took from this was:

MLAV is excellent. Every sample of malware I fed it, it blocked and identified as malicious code, even the zero-day exploits gathered freshly from the malware zoos on the internet. It even blocked malicious macros inside PDF files and Office documents. The ability to determine whether code is malicious based on the activities it performs when executed is far superior to matching against signatures of known malware.

Cloud-based AV is a mixed bag: online, it’s nearly good, but offline - relying on cached data and falling back to the signature and heuristic methods - it’s just as bad as Traditional AV. The online requirement poses some interesting challenges too, as not all businesses are able to have all machines online all the time. There are compliance problems too, potentially. Firewalled traffic aside, there are some networks where you need to have bulletproof AV protection, but no internet connection. Critical Infrastructure springs to mind, as do networks containing Protectively Marked material. I just can’t see cloud-based AV working in these scenarios.

In my opinion, Traditional AV has run its course. It fails to detect fresh malware samples, and it struggles with malware for which it doesn’t have signatures, although it occasionally gets lucky with heuristic-based detection. Feed it a file containing Conficker or Nimda, and there’s no problem. Unfortunately, there aren’t as many instances of those viruses causing the havoc that they did in the past. The security landscape is changing rapidly, and Traditional AV vendors seem to be struggling to catch up.

That said, nearly every Traditional AV vendor has a cloud-based or MLAV product in their catalogue, but from my point of view, it feels a little bit like too little, too late. The MLAV market is more clearly defined by new vendors, who are not held back by past poor experiences of the products of the big hitters of the Trad AV market; this allows them a greater chance to capitalise on the endpoint security market that is so clearly the weakest link in cybersecurity currently.

At this point, and for a variety of reasons, I am not going to disclose which Traditional, Cloud-based and MLAV solutions I tested, or which one I selected to implement, because it’s not relevant here. I might follow up with this information in a separate article.

I think that the important thing, if you’re in a similar position - evaluating endpoint protection in your enterprise - is that you evaluate at least one of each type for your own specific use-case.

All AV products differ slightly, in terms of performance (CPU / Memory utilisation, scanning speed), protection types, update frequency and so on. At the end of the day, you need to choose the one that fits your business best, and no anecdotal evidence from a blog should make that decision for you.

On the other hand, what is important is having the ability to build a lab environment in which you can test your AV candidates against real live malware, and see how they perform. I cannot stress strongly enough that you need a lab in which to do this, and that you shouldn’t try these things out on a production network (although this should be self-evident!).

When you engage your vendors (either a direct vendor such as an AV company, or a systems integrator/value-add provider), you should be in a position where you can offload some of the testing, or lab configuration, onto them. The really good security consulting companies have these kinds of labs that you can use for the evaluation of potential new products, and you should leverage this to your advantage during the selection process (this applies not only to software like AV, but also to things like firewalls, intrusion detection, data loss prevention and so on).

You should be able to get multiple AV software providers into proof-of-concept phases where you’re testing in the lab environment, but also testing side-by-side with your existing AV solution (if any!) to see how it responds in the production environment - not on every endpoint, but a fair selection, somewhere between 1 and 10% of endpoints to see how it is tolerated by users and business processes. This will give you a pretty good idea of how the rollout will work when you select a product.
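Picking that pilot group doesn’t need to be complicated. Here’s a minimal sketch that draws a random 1-10% of endpoints; the hostnames, the `pick_pilot_group` name and the seed are all made up, and a plain random draw is only one reading of "a fair selection" (you may well want to make sure developers, build servers and so on are all represented).

```python
import random


def pick_pilot_group(endpoints, percent, seed=None):
    """Pick roughly `percent` % of endpoints at random (always at least one)."""
    if not 1 <= percent <= 10:
        raise ValueError("percent should be between 1 and 10")
    if not endpoints:
        return []
    size = max(1, round(len(endpoints) * percent / 100))
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    return sorted(rng.sample(list(endpoints), size))


# Hypothetical inventory of 400 machines.
endpoints = [f"host-{n:03d}" for n in range(1, 401)]

# A 5% pilot of 400 endpoints is 20 machines.
pilot = pick_pilot_group(endpoints, percent=5, seed=2016)
```

Keeping the seed alongside your test notes means you can reproduce exactly which machines were in the pilot if someone asks later.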

Raise support cases with the AV vendors during your proof-of-concept phase. You’ll probably need to raise a support case with them at some point after procurement and installation, so it’s only fair that you evaluate their ability to provide support during testing as well.

This applies to a lot of other situations too, not just choosing AV software.

In my case, the support given (or lack of it) by one vendor allowed me to exclude them from any further testing, because it was inadequate and condescending.

At the end of the day (and indeed, the article), you need to have a good understanding of the potential solutions on the market for endpoint protection, and an understanding that they all differ, sometimes subtly, sometimes not.

You need to select, shortlist, evaluate and decide on the one that’s the best fit for your organisation.

Written by Tom O'Connor, an AWS Technical Specialist, with background in DevOps and scalability. You should follow them on Twitter