EvoSec 2025W Lecture 17

From Soma-notes
Latest revision as of 18:45, 13 March 2025

Readings

  • Matrawy, "Mitigating Network Denial-of-Service Through Diversity-Based Traffic Management." (ACNS 2005): https://homeostasis.scs.carleton.ca/~soma/pubs/amatrawy-acns-05.pdf
  • Inoue, "NetADHICT: A Tool for Understanding Network Traffic." (LISA 2007): https://homeostasis.scs.carleton.ca/~soma/pubs/inoue-lisa2007.pdf

Discussion Questions

  • What does it mean for an attacker to "defeat" (p,n)-gram based traffic clustering?
  • What do high frequency (p,n)-grams reveal about network traffic? Does this include anything that might compromise user privacy?
  • Is ADHIC an anomaly detection algorithm? Can it be used to detect anomalies?
  • How fast is ADHIC compared to other standard clustering algorithms?
  • Is diversity-based traffic management feasible today given that so much traffic is encrypted?

Notes

Lecture 17
----------

* Internet protocols
* clustering vs classification
* p,n-grams

best-effort packet delivery
 - rather than guaranteed delivery

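IP - best effort (packets may be dropped, delayed, or reordered)
TCP - reliable delivery layered on top of IP (lost packets are retransmitted, but nothing gets through if the network drops everything)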

best effort allows for denial of service
 - can always eliminate DoS with reservations, but only for the chosen few

Unless you make deliberate choices about who gets service, EVERYONE gets poor service when there is too much demand
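
A back-of-the-envelope sketch of why everyone suffers, assuming a single shared link that drops excess packets uniformly at random, so each sender keeps the same fraction of what it offered. The function name, sender names, and rates are made up for illustration.

```python
def delivered_share(offered, capacity):
    """Packets/sec each sender gets through one shared best-effort link.

    Assumption: when the link is overloaded, excess packets are dropped
    uniformly at random, so every sender keeps the same fraction.
    """
    total = sum(offered.values())
    if total <= capacity:
        return dict(offered)  # no congestion: everything gets through
    keep = capacity / total
    return {sender: rate * keep for sender, rate in offered.items()}


# Hypothetical load: two ordinary users plus one flooding host on a 100 pkt/s link.
offered = {"alice": 10, "bob": 10, "flooder": 980}
shares = delivered_share(offered, capacity=100)
for sender, rate in shares.items():
    print(f"{sender}: {rate:.1f} pkt/s")
```

Under these assumptions the flooder takes almost the whole link and the ordinary users lose 90% of their packets, even though their own demand is small.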

So how do we deal with denial of service on the Internet today?

Today we mostly manage DoS through content distribution networks (CDNs) of some kind.

A CDN is its own network of servers (an "overlay network" or entirely separate) that distributes & serves data

How do CDNs route traffic?
 - a form of load balancing, but also prioritization
   (how much did you pay?)
 - tends to be on a per-server basis, not per-client

What is normal for the network?
 - constant level of weirdness!

Internet telescope
 - reserve a large block of IP addresses that aren't being used
 - watch what traffic comes to it
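
A minimal sketch of the telescope idea, assuming packets have already been reduced to (source, destination) address pairs. The block 198.51.100.0/24 is a documentation range used here only as a stand-in for an unused block, and `telescope_hits` is a hypothetical helper name.

```python
import ipaddress
from collections import Counter

# Hypothetical unused block (a documentation range, chosen only for illustration).
DARK_BLOCK = ipaddress.ip_network("198.51.100.0/24")


def telescope_hits(packets):
    """Count source addresses of packets sent to the unused block.

    `packets` is an iterable of (src_ip, dst_ip) strings. No host lives in
    the dark block, so every hit is unsolicited background traffic.
    """
    hits = Counter()
    for src, dst in packets:
        if ipaddress.ip_address(dst) in DARK_BLOCK:
            hits[src] += 1
    return hits


observed = [
    ("203.0.113.5", "198.51.100.17"),
    ("203.0.113.5", "198.51.100.200"),
    ("192.0.2.44", "198.51.100.3"),
    ("192.0.2.44", "192.0.2.80"),  # not addressed to the dark block
]
hits = telescope_hits(observed)
print(hits.most_common())
```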

naive anomaly detection on network traffic will have a huge false positive rate
 - or your model will be so general that it accepts almost anything


Today we mostly do the exact opposite of generic anomaly detection
 - deep packet inspection systems
   - in the cloud, these analyze decrypted packets
 - they really try to understand traffic using lots of rules and by reconstructing flows


Normally traffic is managed using the 5-tuple: source IP address, source port, destination IP address, destination port, protocol
 - but is that all we can look at?
 - can we use this data in a more generic way, without parsing out flows?
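
For reference, a sketch of what pulling those five fields out of a raw packet looks like, assuming an IPv4 packet carrying TCP or UDP, with no fragments and no link-layer header in front. `five_tuple` is a hypothetical helper and the example packet is hand-built.

```python
import socket
import struct


def five_tuple(packet: bytes):
    """Extract (src IP, src port, dst IP, dst port, protocol) from a raw IPv4 packet.

    Sketch only: handles TCP and UDP, no fragments, no IPv6.
    """
    ihl = (packet[0] & 0x0F) * 4  # IPv4 header length in bytes
    protocol = packet[9]
    src_ip = socket.inet_ntoa(packet[12:16])
    dst_ip = socket.inet_ntoa(packet[16:20])
    # TCP and UDP both start with 16-bit source and destination ports.
    src_port, dst_port = struct.unpack("!HH", packet[ihl:ihl + 4])
    return src_ip, src_port, dst_ip, dst_port, protocol


# Hand-built example: 20-byte IPv4 header plus the first 4 bytes of a TCP header.
header = struct.pack(
    "!BBHHHBBH4s4s",
    0x45, 0, 24, 0, 0, 64, 6, 0,  # version/IHL, TOS, length, id, flags, TTL, proto=TCP, checksum
    socket.inet_aton("192.0.2.1"),
    socket.inet_aton("198.51.100.7"),
)
packet = header + struct.pack("!HH", 51000, 443)
print(five_tuple(packet))
```

Note how much protocol knowledge this needs even in the simplest case: the header length, the protocol number, and where the ports live for each transport.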

So why p,n-grams?

n-grams are a common way to analyze large amounts of data
 - the n in n-gram is just a length, so the n-grams of some data are its fixed-length (length n) substrings
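
A minimal sketch of plain n-gram extraction over bytes, with a made-up payload. `ngrams` is an illustrative helper, not something from the readings.

```python
from collections import Counter


def ngrams(data: bytes, n: int):
    """Yield every length-n substring of data, at every offset."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


# Made-up payload, just to have some repeated substrings.
payload = b"GET /index.html GET /about.html"
counts = Counter(ngrams(payload, 4))
print(counts.most_common(3))
```

The position of each substring is thrown away here, so a match can occur anywhere in the data.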

One idea is to do n-gram analysis on packets (whole packets or just packet headers)
 - n-gram analysis is relatively slow: you have to search the entire packet for a match

network routers go to a lot of effort to avoid looking at every byte in a packet

What do routers look at?
 - source and destination IP addresses

Notice that these are 4-byte (IPv4) or 16-byte (IPv6) patterns at fixed offsets in a packet header
 - a (p,n)-gram is an n-byte string at a fixed offset p in a packet
 - so (p,n)-grams are a generalization of source and destination IP addresses
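
A small sketch of that generalization, assuming offsets are counted from the start of the IPv4 header (the readings may count from a different starting point). `pn_gram` and `has_pn_gram` are hypothetical helper names.

```python
import socket


def pn_gram(packet: bytes, p: int, n: int):
    """Return the n bytes at offset p, or None if the packet is too short."""
    if p + n > len(packet):
        return None
    return packet[p:p + n]


def has_pn_gram(packet: bytes, p: int, value: bytes) -> bool:
    """One slice and one comparison: no scan over the whole packet."""
    return packet[p:p + len(value)] == value


# Minimal hand-built IPv4 header (20 bytes); only the address fields are filled in.
packet = bytes(12) + socket.inet_aton("192.0.2.1") + socket.inet_aton("198.51.100.7")

src = pn_gram(packet, 12, 4)  # IPv4 source address is the (12,4)-gram
dst = pn_gram(packet, 16, 4)  # IPv4 destination address is the (16,4)-gram
print(socket.inet_ntoa(src), socket.inet_ntoa(dst))
print(has_pn_gram(packet, 16, socket.inet_aton("198.51.100.7")))
```

Matching a known (p,n)-gram costs a fixed amount of work regardless of packet size, which is the same property routers rely on when they read addresses.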

What is the frequency distribution of p,n-grams?
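
One way to look at this empirically is simply to count, for every offset p, how many packets carry each n-byte value there. The sketch below does that on made-up byte strings (not real traffic), so it says nothing about what the distribution looks like on a real network. `pn_gram_counts` is a hypothetical helper.

```python
from collections import Counter


def pn_gram_counts(packets, n):
    """Count how many packets contain each (p, n)-gram, keyed by (p, bytes)."""
    counts = Counter()
    for pkt in packets:
        for p in range(len(pkt) - n + 1):
            counts[(p, pkt[p:p + n])] += 1
    return counts


# Toy "packets": made-up byte strings standing in for captured traffic.
packets = [
    b"\x45\x00HTTPaaaa",
    b"\x45\x00HTTPbbbb",
    b"\x45\x00DNS!cccc",
    b"\x45\x00HTTPdddd",
]
counts = pn_gram_counts(packets, n=2)
total = len(packets)
for (p, gram), c in counts.most_common(3):
    print(p, gram, f"{c / total:.0%}")
```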

What does it mean for an attacker to "defeat" (p,n)-gram based traffic clustering?
 - attacker wants to get maximum bandwidth
 - so they have to get their packets into all the queues, or as many as possible
 - to do that, they have to craft packets containing the (p,n)-grams that are being used by every queue (every leaf node in ADHIC)
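
A toy sketch of this argument. It assumes a binary tree whose internal nodes each test one (p,n)-gram and whose leaves are queues, and it assumes every queue gets an equal slice of the link. The tree, offsets, and byte values are invented for illustration and are not taken from the readings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Internal node tests one (p,n)-gram; a node with no children is a leaf (queue)."""
    name: str
    p: int = 0
    value: bytes = b""
    match: Optional["Node"] = None     # child for packets that contain the (p,n)-gram
    no_match: Optional["Node"] = None  # child for packets that do not


def classify(node: Node, packet: bytes) -> str:
    """Walk down the tree to the leaf (queue) this packet belongs to."""
    while node.match is not None:
        hit = packet[node.p:node.p + len(node.value)] == node.value
        node = node.match if hit else node.no_match
    return node.name


# Hypothetical three-queue tree; the offsets and byte values are made up.
tree = Node("root", p=0, value=b"WEB",
            match=Node("web"),
            no_match=Node("n1", p=4, value=b"DNS",
                          match=Node("dns"), no_match=Node("other")))


def attacker_share(packets, leaf_count=3):
    """Upper bound on the attacker's link share, assuming equal slices per queue."""
    reached = {classify(tree, pkt) for pkt in packets}
    return len(reached) / leaf_count


flood = [b"XXXXXXXX"] * 1000                       # identical packets land in one queue
crafted = [b"WEB-----", b"----DNS-", b"XXXXXXXX"]  # one packet shape per queue
print(attacker_share(flood), attacker_share(crafted))
```

Under these assumptions a flood of identical packets is confined to one queue, while an attacker who knows (or guesses) the (p,n)-grams in the tree can reach every queue.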


What do high frequency (p,n)-grams reveal about network traffic? Does this include anything that might compromise user privacy?
  - inherently privacy preserving, since only high-frequency patterns are kept, except for bad actors (whose traffic is frequent enough to stand out)

Is ADHIC an anomaly detection algorithm? Can it be used to detect anomalies?

How fast is ADHIC compared to other standard clustering algorithms?

Is diversity-based traffic management feasible today given that so much traffic is encrypted?

How does this relate to trust?