***harlan4096*** · 11 July 21, 08:44

Quote:

New research from Avast and Czech Technical University applies automated feature extraction to machine learning to automate data processing pipelines

This post was written by the following Avast researchers:

Petr Somol, Avast Director AI Research
Tomáš Pevný, Avast Principal AI Scientist
Viliam Lisý, Avast Principal AI Scientist
Branislav Bošanský, Avast Principal AI Scientist
Andrew B. Gardner, Avast VP Research & AI
Michal Pěchouček, Avast CTO

One of the biggest unaddressed challenges in machine learning (ML) for security is how to process large-scale and dynamically created machine data.

Machine data (data generated by machines for machine processing) gets less attention in ML research than video, sound and text, yet it is as prevalent in our digital world, and as important, as dark matter is in the universe. In security, machine data is the primary source of information about attacks and other anomalous behavior on the internet. Even so, it’s notoriously hard to learn from it automatically, to discover unknown patterns, and to adapt the learning process to the scale, complexity, and ever-changing nature of machine data. In this post, the Avast AI Research Lab reports on our solution to the problem.

The machine data problem in security

As indicated above, much of the effort in machine learning to date has centered around processing data related to human perception: through speech, vision and text. These are closely tied to the ways that humans interact with computers and systems. But there is another, much bigger class of data that lurks, a class that has the potential to revolutionize AI and ML products even further: machine data. Machine data analytics define a big part of the current cybersecurity problem. With an ever-growing deployment of AI-based automation by attackers, the volume of relevant machine data is growing so quickly that using modern machine learning techniques is the only way that we can provide cybersecurity on the internet today.

What is machine data? It includes things like log files, databases, internet messages and protocols, disassemblies, and experimental device outputs. We call it machine data because it is not primarily meant for humans to hear, see, or read. Machine data shares some similarities with traditional data: speech is logged as a digital time series, vision is built around a sequence of matrices, and text follows the syntax and grammar of a language in the same way that machine data follows a protocol and grammar. There are, however, important differences. Machine data tends to evolve more rapidly than human-produced data because, for its intended use, it’s not bound by human perception constraints. The format and content of machine data can change as a consequence of any change in the computing environment, particularly automated changes in systems (such as software updates, the connection of new devices, protocol changes due to network load-balancing, or even component malfunctions). Crucially, machine data can be arbitrarily complex and large, making it hard for humans to work with.

One of today’s most common forms of machine data used on the web and in apps is JavaScript Object Notation (JSON). JSON records a hierarchy of nested objects in a text form in which each object is a set of “key”: value pairs. A value can be a “string”, a number, one of the literals true, false or null, an object, or an array (i.e. an ordered collection of values, which may themselves be objects).

JSON is meant to be easily interpretable (assuming the expert knowledge of the particular data source), yet it is structured in the sense that it follows rules which allow the computer to parse messages to make use of content.
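To make the structure described above concrete, here is a minimal Python sketch that parses a small JSON record and walks its hierarchy, reporting the JSON type found at each position. The record, its field names, and the `describe` helper are invented for illustration; they are not taken from Avast's data or tooling. Only the standard-library `json` module is used.

```python
import json

# Hypothetical record, invented purely for illustration.
raw = """
{
  "host": "example-host",
  "port": 443,
  "tls": true,
  "parent": null,
  "headers": {"content-type": "application/json"},
  "events": [{"id": 1, "action": "open"}, {"id": 2, "action": "close"}]
}
"""


def describe(value, path="$"):
    """Yield (path, JSON type name) for every node of a parsed document."""
    if isinstance(value, dict):
        yield path, "object"
        for key, child in value.items():
            yield from describe(child, f"{path}.{key}")
    elif isinstance(value, list):
        yield path, "array"
        for index, child in enumerate(value):
            yield from describe(child, f"{path}[{index}]")
    elif isinstance(value, bool):
        # Checked before numbers because bool is a subclass of int in Python.
        yield path, "boolean"
    elif value is None:
        yield path, "null"
    elif isinstance(value, (int, float)):
        yield path, "number"
    else:
        yield path, "string"


record = json.loads(raw)
types = dict(describe(record))
for path, kind in types.items():
    print(f"{path}: {kind}")
```

The walk shows why such data is structured yet irregular: the same recursive rules apply at every level, but nothing fixes in advance which keys appear, how deep the nesting goes, or how long an array is.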

Machine data analytics in Avast

So why is ML on machine data an important issue? Why do we at Avast care about it? Every day, our team handles more than 45 million new unique files, 25% of which are usually malicious. In September 2020, we blocked 1.7 billion Windows attacks, an accomplishment through which we protected close to 50 million users. To understand the growth of the attack landscape, see Fig. 3: our team was handling closer to 200,000 files per day eight years ago; at the time of our IPO in 2018, it was around 500,000; and we’re now approaching 1.6 million new unique files per day.

According to research by Akamai, 69% of web API traffic in 2018 was JSON. This is the traffic that is frequently misused as a carrier of widely spread attacks.

We face an accelerating influx of new files, technologies, device types, and services, all of which can potentially be misused by adversaries to attack our customers and society as a whole. Data transferred over the internet either has JSON form or can be stored as JSON. File execution environments can even store logs as JSON. Thus, JSON data is an important source of information usable in threat detection. The problem is that there’s a discrepancy between the pace of data volume growth and the capacity of human experts and engineered automated solutions to analyze incoming data for suspected malicious behavior. Our most precious research commodity in security is time — more specifically, human time. We need tools that can process all sorts of machine data that describe anything potentially related to attacks or other malicious behaviors that our users can encounter. We need tools that discover unknown suspicious patterns in such data and that make predictions about the meaning of such patterns. Scalable, data-driven analytics is at the core of all cybersecurity vendors’ interest. Nevertheless, there are few (if any) ML techniques devoted to properly analyzing this type of data.
...
