SystemsSec 2018W Lecture 19
Audio
Notes
Every security technology has strengths and weaknesses
Types of data input for security software include:
IP packets
System calls
Log files
Emails
OS statistics (resource usage)
HTTP traffic
etc.
The representation of the data is more important than the data itself
Data representation is a machine learning concept
Certain types of operations are easier to do on some data representations than on others
Data can be converted from one representation to another
For example, if your input data is IP packets, the operations you can perform are different from those available once you have extracted the email messages they carry
Converting those IP packets to email messages is called a representation switch
Data represented as emails would have the fields from, to, message, date sent, etc.
Machine learning algorithms
Pattern recognition
Classification problem: inputs that you want to label
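The representation switch from packets to emails described above can be sketched as follows. This is a minimal illustration, not a real packet pipeline: it assumes the TCP payload of a mail session has already been reassembled into `RAW_PAYLOAD` (a made-up sample), and the function name `to_email_representation` is hypothetical. Only Python's standard `email` module is used.

```python
from email import message_from_bytes

# Assumption: these bytes were already reassembled from the IP packets of a
# mail transfer. The addresses and content are invented for illustration.
RAW_PAYLOAD = (
    b"From: alice@example.com\r\n"
    b"To: bob@example.com\r\n"
    b"Date: Mon, 19 Mar 2018 10:00:00 -0400\r\n"
    b"Subject: Lecture notes\r\n"
    b"\r\n"
    b"See you in class.\r\n"
)


def to_email_representation(payload: bytes) -> dict:
    """Switch from a raw-bytes representation to an email representation."""
    msg = message_from_bytes(payload)
    return {
        "from": msg["From"],
        "to": msg["To"],
        "date_sent": msg["Date"],
        # The sample is a single-part message, so the payload is a string.
        "message": msg.get_payload().strip(),
    }


record = to_email_representation(RAW_PAYLOAD)
print(record)
```

Once the data is in this form, questions such as "who sent this?" become a field lookup, whereas on raw packets they would require parsing every time.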
Deep learning
Deep learning can learn its own representation
But it requires tons of data to train itself
Deep learning takes a ton of time, because it has to process so much data
Most problems don’t have enough data for deep learning
Adversarial Machine learning
Deep learning might not learn the representation that you expect.
While we might expect it to learn that a stop sign is red, octagonal, and contains the word STOP in white text, it could actually be learning something trivial, such as what the tops of the letters look like. Someone could then deface a stop sign in some minor way, such as putting dots above the letters, so that self-driving cars can no longer recognise the sign even though humans still can.
How the data is represented determines what tasks security technology can perform with it
Security technologies can apply static policies or learn patterns. Complicated systems will use both.
A spam filter, for example, can have static policies for email addresses or keywords to ignore, and it can also learn from the vast amount of data available. Spam filtering can be quite sophisticated because there is a clear indication of success and failure, and both the host and the user have an incentive to get it right.
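The combination of a static policy and a learned pattern can be sketched as below. This is a toy under stated assumptions, not how any production filter works: the learned part is a small naive Bayes word model with add-one smoothing (a technique chosen here for illustration, not one named in the lecture), and `BLOCKED_SENDERS`, `TinySpamFilter`, and the training messages are all invented.

```python
import math
from collections import Counter

# Hypothetical static policy: senders that are always rejected.
BLOCKED_SENDERS = {"promo@spam.example"}


class TinySpamFilter:
    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Learn from a message that a user has labelled 'spam' or 'ham'."""
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def _log_score(self, words, label):
        counts = self.word_counts[label]
        total = sum(counts.values())
        vocab = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
        prior = self.message_counts[label] / sum(self.message_counts.values())
        score = math.log(prior)
        for word in words:
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total + vocab))
        return score

    def is_spam(self, sender, text):
        # Static policy is checked first and overrides the learned model.
        if sender in BLOCKED_SENDERS:
            return True
        words = text.lower().split()
        return self._log_score(words, "spam") > self._log_score(words, "ham")


spam_filter = TinySpamFilter()
spam_filter.train("win free money now", "spam")
spam_filter.train("free prize click now", "spam")
spam_filter.train("lecture notes attached", "ham")
spam_filter.train("meeting moved to friday", "ham")

print(spam_filter.is_spam("friend@example.com", "free money"))
print(spam_filter.is_spam("friend@example.com", "lecture notes"))
print(spam_filter.is_spam("promo@spam.example", "lecture notes"))
```

The labelled training calls stand in for the clear success and failure signal mentioned above, for instance users marking messages as spam or not spam.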
Log files: SIEMs (security information and event management)
Manage logs from multiple systems
Convert all logs into a common representation for use with analytic tools
Not good in practice because you need domain-specific knowledge to understand the logs
CASB (cloud access security broker): like a SIEM for cloud applications
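The idea of converting logs from multiple systems into a common representation can be sketched as follows. Both log formats, the regular expressions, and the field names (`timestamp`, `host`, `source`, `message`) are assumptions made up for this example; real SIEM products define their own schemas.

```python
import re
from datetime import datetime, timezone

# Hypothetical format 1: ISO timestamp, host, application, free-text message.
AUTH_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z) (?P<host>\S+) "
    r"(?P<app>[\w-]+): (?P<msg>.*)$"
)
# Hypothetical format 2: epoch seconds, host, HTTP method, path, status code.
WEB_RE = re.compile(
    r"^(?P<epoch>\d+) (?P<host>\S+) (?P<method>[A-Z]+) (?P<path>\S+) "
    r"(?P<status>\d{3})$"
)


def parse_auth_line(line):
    m = AUTH_RE.match(line)
    if not m:
        return None
    ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return {"timestamp": ts.isoformat(), "host": m["host"],
            "source": m["app"], "message": m["msg"]}


def parse_web_line(line):
    m = WEB_RE.match(line)
    if not m:
        return None
    ts = datetime.fromtimestamp(int(m["epoch"]), tz=timezone.utc)
    return {"timestamp": ts.isoformat(), "host": m["host"], "source": "web",
            "message": f'{m["method"]} {m["path"]} -> {m["status"]}'}


PARSERS = [parse_auth_line, parse_web_line]


def normalize(line):
    """Return the common representation of a log line, or None if unknown."""
    for parser in PARSERS:
        event = parser(line)
        if event:
            return event
    return None


raw_lines = [
    "2018-03-19T10:00:05Z db01 sshd: Failed password for root",
    "1521453600 web01 GET /login 401",
]
events = sorted((normalize(line) for line in raw_lines),
                key=lambda event: event["timestamp"])
for event in events:
    print(event)
```

Note that normalizing only makes the events comparable and sortable. Deciding whether a 401 on `/login` followed by a failed root login matters still takes knowledge of the systems involved, which is the practical weakness noted above.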
Graylisting
When a user first attempts to do something, the server responds with “try again later” instead of a failure and records the attempt. If the same thing is attempted again after a set amount of time, the server accepts it and adds that action to a whitelist. This works because a spam bot usually will not try things twice, but a user will.
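The graylisting logic described above can be sketched as a small state machine. This is a minimal in-memory illustration: the class name `Graylist`, the 300-second default delay, and the string key used to identify an attempt are all assumptions, and a real server would also expire old entries.

```python
class Graylist:
    def __init__(self, min_delay_seconds=300):
        self.min_delay = min_delay_seconds
        self.first_seen = {}    # attempt key -> time of the first attempt
        self.whitelist = set()  # attempt keys that have passed graylisting

    def check(self, key, now):
        """Return 'accept' or 'try again later' for an attempt at time `now`."""
        if key in self.whitelist:
            return "accept"
        first = self.first_seen.get(key)
        if first is None:
            # First attempt: remember it and defer instead of failing.
            self.first_seen[key] = now
            return "try again later"
        if now - first >= self.min_delay:
            # Retried after the delay: behaves like a real user, so whitelist.
            self.whitelist.add(key)
            del self.first_seen[key]
            return "accept"
        # Retried too soon: keep deferring.
        return "try again later"


graylist = Graylist(min_delay_seconds=300)
key = "198.51.100.7|sender@example.org|bob@example.com"  # hypothetical key
responses = [
    graylist.check(key, now=0),
    graylist.check(key, now=10),
    graylist.check(key, now=400),
    graylist.check(key, now=401),
]
print(responses)
```

Times are passed in explicitly as seconds so the behaviour is easy to follow; a real deployment would read the clock instead.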