EvoSec 2025W Lecture 17

Readings

Discussion Questions

  • What does it mean for an attacker to "defeat" (p,n)-gram based traffic clustering?
  • What do high frequency (p,n)-grams reveal about network traffic? Does this include anything that might compromise user privacy?
  • Is ADHIC an anomaly detection algorithm? Can it be used to detect anomalies?
  • How fast is ADHIC compared to other standard clustering algorithms?
  • Is diversity-based traffic management feasible today given that so much traffic is encrypted?

Notes

Lecture 17
----------

* Internet protocols
* clustering vs classification
* p,n-grams

best-effort packet delivery
 - rather than guaranteed delivery

IP - best effort
TCP - reliable delivery, built by retransmitting lost packets on top of best-effort IP

best effort allows for denial of service
 - reserving capacity can always eliminate DoS, but only for the chosen few who hold reservations

Unless you make deliberate choices about who gets service, EVERYONE gets poor service when there is too much demand
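
A toy model of the point above, as a sketch only. The function names, sender names, and numbers are all made up for illustration; real links drop packets per queue, not by an exact proportional rule.

```python
def best_effort(demands, capacity):
    """Toy model: under overload, everyone is scaled down in proportion
    to how much they sent, so a flood hurts every sender."""
    total = sum(demands.values())
    if total <= capacity:
        return dict(demands)
    scale = capacity / total
    return {who: amount * scale for who, amount in demands.items()}


def with_reservations(demands, capacity, reserved):
    """Serve reserved senders first (assumes reservations fit in capacity),
    then share whatever is left on a best-effort basis."""
    served = {}
    remaining = capacity
    for who, amount in reserved.items():
        served[who] = min(demands.get(who, 0), amount)
        remaining -= served[who]
    rest = {who: amt for who, amt in demands.items() if who not in reserved}
    served.update(best_effort(rest, remaining))
    return served


# Hypothetical load: two ordinary senders and one flooder on a 100-unit link.
demands = {"alice": 10, "bob": 10, "flood": 980}
shared = best_effort(demands, 100)
protected = with_reservations(demands, 100, {"alice": 10})
```

Here alice gets about a tenth of what she asked for without a reservation and all of it with one, while bob stays starved either way.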

So how do we deal with denial of service on the Internet today?

Today we mostly manage DoS through content distribution networks (CDNs) of some kind.

A CDN is its own network of servers (an "overlay network" or entirely separate) that distributes & serves data

How do CDNs route traffic?
 - a form of load balancing, but also prioritization
   (how much did you pay?)
 - tends to be on a per-server basis, not per-client

What is normal for the network?
 - constant level of weirdness!

Internet telescope
 - reserve a large block of IP addresses that aren't being used
 - watch what traffic comes to it
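
A minimal sketch of the telescope idea, assuming packets are already available as (source, destination) address pairs. The block 192.0.2.0/24 is a documentation range standing in for the unused block (real telescopes use much larger ones), and the function names are hypothetical.

```python
import ipaddress

# Stand-in for the reserved, unused address block (assumption for the example).
TELESCOPE = ipaddress.ip_network("192.0.2.0/24")


def is_telescope_traffic(dst_ip):
    """True if the destination falls inside the unused block."""
    return ipaddress.ip_address(dst_ip) in TELESCOPE


def unsolicited(packets):
    """Keep only packets sent to addresses nobody is using.
    Nothing legitimate should be addressed there, so whatever shows up
    is scanning, backscatter, misconfiguration, and so on."""
    return [(src, dst) for src, dst in packets if is_telescope_traffic(dst)]


# Made-up observations.
seen = unsolicited([
    ("203.0.113.5", "192.0.2.77"),
    ("203.0.113.5", "198.51.100.1"),
])
```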

naive anomaly detection on network traffic will have a huge false positive rate
 - or, to avoid that, your model of normal will have to be way too general


Today we mostly do the exact opposite of generic anomaly detection
 - deep packet inspection systems
   - in the cloud, these analyze decrypted packets
 - they really try to understand traffic using lots of rules and by reconstructing flows


Normally traffic is managed using the 5-tuple: source IP address, source port, destination IP address, destination port, protocol
 - but is that all we can look at?
 - can we use this data in a more generic way, without parsing out flows?
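
For reference, the conventional approach can be sketched as grouping packets by their 5-tuple. This assumes the fields have already been parsed out of each packet, which is exactly the work the questions above ask whether we can avoid. `FiveTuple` and `count_flows` are names made up for this example.

```python
from collections import Counter, namedtuple

FiveTuple = namedtuple("FiveTuple", "src_ip src_port dst_ip dst_port protocol")


def count_flows(packets):
    """Count packets per flow, where a flow is identified by its 5-tuple."""
    return Counter(FiveTuple(*fields) for fields in packets)


# Made-up, already-parsed packets.
flows = count_flows([
    ("10.0.0.1", 40000, "192.0.2.7", 443, "tcp"),
    ("10.0.0.1", 40000, "192.0.2.7", 443, "tcp"),
    ("10.0.0.2", 53000, "192.0.2.9", 53, "udp"),
])
```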

So why p,n-grams?

n-gram analysis is a common way to analyze large amounts of data
 - n in n-gram is just a length, so the n-grams of some data are its fixed-length (length n) substrings

One idea is to do n-gram analysis on packets (whole packets or just packet headers)
 - n-gram analysis is relatively slow: you have to search the entire packet for a match
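
A small sketch of plain n-gram extraction over bytes, to show why matching is slow: the pattern could be anywhere, so every offset has to be checked. The function names are made up for illustration.

```python
def ngrams(data: bytes, n: int):
    """All length-n substrings of data, one per starting offset."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def contains_ngram(data: bytes, gram: bytes) -> bool:
    """Slide a window over the whole packet looking for gram.
    The cost grows with packet length because the offset is unknown."""
    return gram in ngrams(data, len(gram))


grams = ngrams(b"abcd", 2)
```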

network routers go to a lot of effort to avoid looking at every byte in a packet

What do routers look at?
 - source and destination IP addresses

Notice that these are 4-byte (IPv4) or 16-byte (IPv6) patterns at fixed offsets in a packet header
 - a p,n-gram is an n-gram tied to a fixed position: the n bytes found at offset p
 - p,n-grams are a generalization of source and destination IP addresses
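
A sketch of the generalization, treating a p,n-gram as the n bytes at offset p. In a standard 20-byte IPv4 header the source address is the 4 bytes at offset 12 and the destination address is the 4 bytes at offset 16. The header values below are made up, and `pn_gram` is a hypothetical helper name.

```python
import ipaddress
import struct


def pn_gram(packet: bytes, p: int, n: int):
    """Return the n bytes at offset p, or None if the packet is too short.
    No searching: this is a single fixed-offset read."""
    gram = packet[p:p + n]
    return gram if len(gram) == n else None


# Minimal IPv4 header with no options (checksum left as 0 for the example).
header = struct.pack(
    "!BBHHHBBH4s4s",
    0x45, 0, 20, 0, 0, 64, 6, 0,
    ipaddress.ip_address("10.0.0.1").packed,
    ipaddress.ip_address("192.0.2.7").packed,
)

src = ipaddress.ip_address(pn_gram(header, 12, 4))  # source address
dst = ipaddress.ip_address(pn_gram(header, 16, 4))  # destination address
proto = pn_gram(header, 9, 1)                       # protocol field, also a p,n-gram
```

Any other (p, n) pair works the same way, whether or not it lines up with a named header field.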

What is the frequency distribution of p,n-grams?
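
One way to look at this question empirically, as a sketch: count every (offset, bytes) pair across a set of packets for a fixed n. The packets below are toy byte strings, not real traffic, and `pn_gram_counts` is a name made up for this example.

```python
from collections import Counter


def pn_gram_counts(packets, n):
    """Count how many packets contain each (p, gram) pair for a fixed n."""
    counts = Counter()
    for pkt in packets:
        for p in range(len(pkt) - n + 1):
            counts[(p, pkt[p:p + n])] += 1
    return counts


counts = pn_gram_counts([b"ABCD", b"ABXD", b"ZBCD"], 2)
# Sorting by count gives the frequency distribution, most common first.
ranked = sorted(counts.values(), reverse=True)
```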

What does it mean for an attacker to "defeat" (p,n)-gram based traffic clustering?
 - attacker wants to get maximum bandwidth
 - so, have to get their packets into all the queues, or as many as possible
 - in order to do that, they have to create packets that have p,n-grams that are being used by every queue (every leaf node in ADHIC)
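
A rough sketch of this argument, not ADHIC itself. It assumes a tree whose internal nodes each test one p,n-gram and whose leaves are queues served equally; the tree, the packets, and the names `Node`, `assign_queue`, and `queues_reached` are all hypothetical. See the readings for how ADHIC actually builds its tree.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    p: int = 0                      # offset tested at an internal node
    gram: bytes = b""               # bytes expected at that offset
    match: Optional["Node"] = None  # child taken when the p,n-gram matches
    miss: Optional["Node"] = None   # child taken otherwise
    queue: Optional[int] = None     # set only on leaves


def assign_queue(node, packet):
    """Walk the tree, testing one p,n-gram per level, until a leaf queue."""
    while node.queue is None:
        hit = packet[node.p:node.p + len(node.gram)] == node.gram
        node = node.match if hit else node.miss
    return node.queue


def queues_reached(tree, packets, num_queues):
    """Fraction of queues that a set of packets lands in. With equal service
    per queue, this bounds the share of bandwidth the sender can compete for."""
    return len({assign_queue(tree, pkt) for pkt in packets}) / num_queues


# Hypothetical 3-leaf tree: first byte 0x45?, then byte 9 == 0x06?
tree = Node(p=0, gram=b"\x45",
            match=Node(p=9, gram=b"\x06", match=Node(queue=0), miss=Node(queue=1)),
            miss=Node(queue=2))

tcp_like = b"\x45" + b"\x00" * 8 + b"\x06"
udp_like = b"\x45" + b"\x00" * 8 + b"\x11"
other = b"\x60" + b"\x00" * 9

one_kind = queues_reached(tree, [udp_like] * 1000, 3)
all_kinds = queues_reached(tree, [tcp_like, udp_like, other], 3)
```

A flood of identical packets only competes in one queue here, no matter how many are sent. To "defeat" the clustering the attacker has to craft packets matching the p,n-grams on the path to every leaf.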


What do high frequency (p,n)-grams reveal about network traffic? Does this include anything that might compromise user privacy?
  - inherently privacy preserving, except for bad actors

Is ADHIC an anomaly detection algorithm? Can it be used to detect anomalies?

How fast is ADHIC compared to other standard clustering algorithms?

Is diversity-based traffic management feasible today given that so much traffic is encrypted?

How does this relate to trust?