Readings
Discussion Questions
- What does it mean for an attacker to "defeat" (p,n)-gram based traffic clustering?
- What do high frequency (p,n)-grams reveal about network traffic? Does this include anything that might compromise user privacy?
- Is ADHIC an anomaly detection algorithm? Can it be used to detect anomalies?
- How fast is ADHIC compared to other standard clustering algorithms?
- Is diversity-based traffic management feasible today given that so much traffic is encrypted?
Notes
Lecture 17
----------
* Internet protocols
* clustering vs classification
* p,n-grams
best-effort packet delivery
- rather than guaranteed delivery
IP - best effort
TCP - reliable delivery (retransmission layered on top of best-effort IP; still no guarantee if the network is saturated)
best effort allows for denial of service
- can always eliminate DoS with reservations, but only for the chosen few
Unless you make deliberate choices about who gets service, EVERYONE gets poor service when there is too much demand
So how do we deal with denial of service on the Internet today?
Today we mostly manage DoS through content distribution networks (CDNs) of some kind.
A CDN is its own network of servers (an "overlay network" or entirely separate) that distributes & serves data
How do CDNs route traffic?
- a form of load balancing, but also prioritization
(how much did you pay?)
- tends to be on a per-server basis, not per-client
What is normal for the network?
- constant level of weirdness!
Internet telescope
- reserve a large block of IP addresses that aren't being used
- watch what traffic comes to it
naive anomaly detection on network traffic will produce a huge number of false positives
- or, your model of "normal" will be far too general to catch anything
Today we mostly do the exact opposite
- deep packet inspection systems
- in the cloud, these systems analyze decrypted packets
- really try to understand traffic using lots of rules, reconstructing flows
Normally traffic is managed using the five-tuple: source IP address, source port, destination IP address, destination port, protocol
- but is that all we can look at?
- can we use this data in a more generic way, without parsing out flows?
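For reference, a minimal sketch of pulling those five fields out of a raw packet. It assumes an IPv4 packet carrying TCP or UDP with no link-layer header in front; the names FlowKey and five_tuple and the sample packet are made up for illustration.

```python
import socket
import struct
from typing import NamedTuple


class FlowKey(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: int


def five_tuple(packet: bytes) -> FlowKey:
    """Extract the five-tuple from a raw IPv4 packet carrying TCP or UDP."""
    ihl = (packet[0] & 0x0F) * 4      # IPv4 header length in bytes
    protocol = packet[9]              # 6 = TCP, 17 = UDP
    src_ip = socket.inet_ntoa(packet[12:16])
    dst_ip = socket.inet_ntoa(packet[16:20])
    # TCP and UDP headers both begin with 16-bit source and destination ports.
    src_port, dst_port = struct.unpack("!HH", packet[ihl:ihl + 4])
    return FlowKey(src_ip, src_port, dst_ip, dst_port, protocol)


# Hand-built 20-byte IPv4 header plus the first 4 bytes of a TCP header.
sample = bytes([0x45, 0, 0, 40, 0, 0, 0, 0, 64, 6, 0, 0,
                10, 0, 0, 1, 192, 168, 1, 20]) + struct.pack("!HH", 51000, 443)
key = five_tuple(sample)
```

Note that this already requires knowing the header layout (IP header length, where the transport header starts), which is exactly the protocol-specific parsing the questions above ask whether we can avoid.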
So why p,n-grams?
n-gram analysis is a common way to analyze large amounts of data
- the n in n-gram is just a length, so the n-grams of a data set are a set of fixed-length strings
One idea is to do n-gram analysis on packets (whole packets or just packet headers)
- n-gram analysis is relatively slow: every offset in the packet has to be searched for a match
network routers go to a lot of effort to avoid looking at every byte in a packet
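To make the cost concrete, here is a minimal sketch of plain (position-independent) n-gram extraction. The function name ngrams and the sample payload are illustrative only. A packet of L bytes yields L - n + 1 n-grams, so every byte gets touched.

```python
from collections import Counter


def ngrams(data: bytes, n: int):
    """Yield every n-byte substring of data, sliding one byte at a time."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


payload = b"GET /index.html HTTP/1.1"
counts = Counter(ngrams(payload, 2))   # frequency of each 2-gram in the payload
num_ngrams = sum(counts.values())      # len(payload) - 2 + 1
```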
What do routers look at?
- source and destination IP addresses
Notice that these are 4-byte (IPv4) or 16-byte (IPv6) patterns at fixed offsets in a packet header
- p,n-grams are a generalization of source and destination IP addresses: an n-byte pattern at a fixed offset p in the packet
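A minimal sketch of the idea, assuming offsets are counted from the start of the IP header. The helper name pn_gram and the sample header are made up for illustration. In IPv4 the source address sits at byte offset 12 and the destination at offset 16, so they are just the (12,4)-gram and the (16,4)-gram.

```python
def pn_gram(packet: bytes, p: int, n: int):
    """Return the n bytes at offset p, or None if the packet is too short."""
    if p < 0 or p + n > len(packet):
        return None
    return packet[p:p + n]


# Hand-built 20-byte IPv4 header: 10.0.0.1 -> 192.168.1.20, protocol 6 (TCP).
header = bytes([0x45, 0, 0, 40, 0, 0, 0, 0, 64, 6, 0, 0,
                10, 0, 0, 1, 192, 168, 1, 20])

src = pn_gram(header, 12, 4)   # source address as a (12,4)-gram
dst = pn_gram(header, 16, 4)   # destination address as a (16,4)-gram
```

Unlike the sliding n-gram search, matching a p,n-gram is a single fixed-offset comparison, with no scan over the packet.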
What is the frequency distribution of p,n-grams?
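One way to explore this question empirically is to count every (offset, bytes) pair over a set of captured packets. This is a rough sketch only: the function name pn_gram_counts, the num_offsets cutoff, and the toy packets are assumptions for illustration, not how ADHIC itself gathers statistics.

```python
from collections import Counter


def pn_gram_counts(packets, n, num_offsets):
    """Count (p, n-byte value) pairs over the first num_offsets offsets."""
    counts = Counter()
    for pkt in packets:
        # Only offsets where a full n-byte window fits inside the packet.
        for p in range(min(num_offsets, len(pkt) - n + 1)):
            counts[(p, pkt[p:p + n])] += 1
    return counts


# Toy "packets": same first two bytes, different trailing bytes.
packets = [b"\x45\x00" + b"AAAA", b"\x45\x00" + b"BBBB", b"\x45\x00" + b"AAAA"]
counts = pn_gram_counts(packets, n=2, num_offsets=4)
top = counts.most_common(1)[0]
```

On real traffic, sorting counts.most_common() would show which fixed-offset patterns dominate and how quickly the frequencies fall off.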
What does it mean for an attacker to "defeat" (p,n)-gram based traffic clustering?
- attacker wants to get maximum bandwidth
- so, have to get their packets into all the queues, or as many as possible
- to do that, they have to craft packets whose p,n-grams match every queue (every leaf node in ADHIC)
What do high frequency (p,n)-grams reveal about network traffic? Does this include anything that might compromise user privacy?
- inherently privacy preserving, except for bad actors
Is ADHIC an anomaly detection algorithm? Can it be used to detect anomalies?
How fast is ADHIC compared to other standard clustering algorithms?
Is diversity-based traffic management feasible today given that so much traffic is encrypted?
How does this relate to trust?