Revision as of 23:26, 27 March 2012

Design Exercise: Phishing

How can we design an anti-phishing strategy that uses a biological approach?

Our theoretical phishing scenario involves:

banking
want credentials
using email
send an email that looks like it comes from the bank
link goes to malicious site that looks arbitrarily like the bank
- what does it mean to look like the bank?
user types in credentials, potentially gets transparently redirected to real bank site

Some of the problems that arise in phishing are related to:

faked email
link to site that looks like the bank but isn't the bank
url that looks like the bank's url, but isn't the bank's url
credentials being entered in wrong domain, wrong page
misappropriated text and images (both in email, and on the faked website)
bad/missing/suspect certificate
- certificate/credential combination is suspect

Human algorithm:

is the domain the same as the one where credentials are normally sent?
credentials are not normally entered in response to an email request
is the certificate the same as usual?

Think of individual detectors as autonomous:

how would they be useful?
how would they work? what would they detect?
how should they change system state in the normal case?

Possible anti-phishing system characteristics:

language checks
- phishing attacks often have poor grammar and spelling
- system could check the spelling and grammar of a message and look for changes from what the sender normally writes
URLs
- phishing URLs are often designed to look like those of the legitimate site (e.g., www.paypa1.com)
- system could check for unusual url characteristics, such as numbers, non-printing characters, characters like "|"
past behaviour
- has the user entered this username/password at this domain before?
- does the user normally follow a link from an email before entering these credentials?
- does the certificate match the one where the user normally enters these credentials?
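
The URL checks above could be sketched roughly as follows. This is only an illustration: the function name `url_warnings` and the set of "unusual" characters are assumptions made for the example, and a digit in a host name is a weak hint at best, since plenty of legitimate domains contain digits.

```python
from urllib.parse import urlsplit

# Characters treated as unusual in a URL (an illustrative, assumed list).
SUSPICIOUS_CHARS = set("|\\")


def url_warnings(url: str) -> list[str]:
    """Return a list of weak warning signs found in a URL."""
    host = urlsplit(url).hostname or ""
    warnings = []
    if any(ch.isdigit() for ch in host):
        warnings.append("digit in host name")
    if any(not ch.isprintable() for ch in url):
        warnings.append("non-printing character")
    if any(ch in SUSPICIOUS_CHARS for ch in url):
        warnings.append("unusual punctuation")
    return warnings
```

Each warning would be one small indicator among many, not a verdict on its own.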

How would the system react to information gathered?

the system should holistically assess all kinds of information gathered
gather a rich picture of the email's characteristics, the website characteristics, and the user's behaviour
there should be a sort of saturation point: once enough characteristics point to phishing, the system reacts so as to prevent loss of information
- what should the system do?
should some system characteristics have more weight than others?
- should elements like certificate validity be considered more important and have more effect on the decision?
the system should base this decision on many small indicators
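
One way to picture the saturation point and the weighting question is a weighted sum of many small indicators compared against a threshold. The indicator names, weights, and threshold below are invented for illustration; nothing in these notes fixes their values.

```python
# Hypothetical weights: certificate problems count for more than spelling.
WEIGHTS = {
    "certificate_mismatch": 0.5,
    "lookalike_domain": 0.3,
    "spelling_errors": 0.1,
    "arrived_via_email_link": 0.1,
}
THRESHOLD = 0.6  # assumed saturation point


def phishing_score(indicators: dict[str, float]) -> float:
    """Combine indicator strengths (0.0 to 1.0) into one weighted score."""
    return sum(WEIGHTS.get(name, 0.0) * strength
               for name, strength in indicators.items())


def should_react(indicators: dict[str, float]) -> bool:
    """True once enough weighted evidence has accumulated."""
    return phishing_score(indicators) >= THRESHOLD
```

With these example numbers, a certificate mismatch plus a lookalike domain crosses the threshold, while spelling errors alone do not.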

List of individual detectors

Image filename/content sensor
- A fuzzy hash/fingerprinting technique of the images would be another idea.
  - Could hook into something like TinEye
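
As a rough idea of what image fingerprinting could look like, here is a minimal average-hash sketch. It assumes the image has already been decoded and shrunk to a small grayscale grid (a real sensor would use an image library for that step), and it does not use TinEye. The function names are made up for the example.

```python
def average_hash(pixels: list[list[int]]) -> int:
    """Fingerprint a small grayscale grid: one bit per pixel, set if above the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

A small Hamming distance between an image on an unknown site and the bank's logo would suggest the image was copied, even if it was recompressed slightly.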

Cascading Stylesheet sensor -- a sort of visual appearance sensor.
- Might give an indication that a page is visually masquerading as another page.
- Are the elements of this page styled identically to the elements on my banking website?
- Is the CSS file a hash-identical version of the CSS on my banking website?
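
The hash-identical check could be sketched like this. Whitespace is collapsed before hashing so that trivial reformatting does not hide a copy; that normalization step and the function names are assumptions for the example.

```python
import hashlib


def css_fingerprint(css_text: str) -> str:
    """SHA-256 of the stylesheet with runs of whitespace collapsed."""
    normalized = " ".join(css_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_copied_stylesheet(candidate_css: str, known_fingerprints: set[str]) -> bool:
    """True if the candidate matches a stylesheet remembered from the bank's site."""
    return css_fingerprint(candidate_css) in known_fingerprints
```

A match is only suspicious when the page serving the stylesheet is not on the bank's own domain.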

context / semantic word descriptions --> semantic integrity - verifying message / content integrity based on the content itself - even if it is digitally signed. (huh? Could the original author of this fragment add more?)

Content depth sensor
- Is the page a facade with no content except that which is visible and the login form meant to capture credentials?
- Many phishing pages will copy the front/login page of a bank and then link all other content back to the original bank.
  - A detector could score a page based on the structure/depth of the content it offers, stopping at any cross-server boundaries (i.e. not following links back to the 'real' bank if the phisher has emulated depth of content that way).
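
A one-level approximation of this idea is to measure what fraction of a page's links stay on the page's own host. This sketch does not crawl or score depth; it only looks at a single page, and the names `LinkCollector` and `same_host_link_ratio` are invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href of every anchor tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def same_host_link_ratio(page_url: str, html: str) -> float:
    """Fraction of links that point at the same host as the page itself."""
    parser = LinkCollector()
    parser.feed(html)
    if not parser.links:
        return 0.0
    host = urlsplit(page_url).hostname
    same = sum(1 for href in parser.links
               if urlsplit(urljoin(page_url, href)).hostname == host)
    return same / len(parser.links)
```

A login page whose links nearly all lead to another server would score low, which fits the facade pattern described above.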

Spellcheck sensor

Domain / ip address sensor
- Could use more advanced metrics. Is the domain name within a certain Levenshtein distance of a known financial institution?
  - Of one of the financial institutions that I frequent?
- Is the whois lookup of the domain I'm connecting to sensible?
  - I.e. is it associated with the company I expect? Does it have proper contact information? Do e-mails to this information bounce?
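
The Levenshtein check could be sketched as below. The distance function is the standard dynamic-programming version; the `lookalike_of` helper and its default threshold of 2 are assumptions chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def lookalike_of(domain: str, known_domains: list[str], max_distance: int = 2):
    """Return the known domain this one closely imitates, or None."""
    for known in known_domains:
        distance = levenshtein(domain, known)
        if 0 < distance <= max_distance:  # identical domains are not lookalikes
            return known
    return None
```

For example, `paypa1.com` is one substitution away from `paypal.com`, so it would be reported as a lookalike, while `paypal.com` itself would not.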

GeoIP lookup sensor
- Is the IP address I'm connecting to in the same country as my financial institution?

Certificate sensor
- issuer name, domain name, client name, date of issue, date of expiry
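
A minimal sketch of comparing a presented certificate against the one remembered from earlier visits. The certificates are represented here as plain dictionaries with hypothetical field names; how those fields are extracted from a real TLS connection is left out.

```python
# Hypothetical field names, loosely following the list above.
CERT_FIELDS = ("issuer", "domain", "date_of_issue", "date_of_expiry")


def changed_certificate_fields(remembered: dict, presented: dict) -> list[str]:
    """List the fields that differ from the certificate seen on earlier visits."""
    return [field for field in CERT_FIELDS
            if remembered.get(field) != presented.get(field)]
```

Legitimate certificates do get renewed, so a changed expiry date alone may deserve less weight than a changed issuer or domain.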

HTTP Header sensor
- Does the server reply with the same HTTP headers as were returned in previous visits to my bank?
- Does it send any security-related headers (e.g., Content-Security-Policy, X-Frame-Options) or set HttpOnly cookies? Such security features are unlikely to be common on fake websites.
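
The header comparison might be sketched like this, working only on header names. The list of security headers and the function names are assumptions for the example; a real sensor would likely also compare some header values.

```python
# Security-related response headers to watch for (an illustrative list).
SECURITY_HEADERS = {
    "content-security-policy",
    "strict-transport-security",
    "x-frame-options",
    "x-content-type-options",
}


def header_name_overlap(previous: dict, current: dict) -> float:
    """Jaccard similarity between the header names of two responses."""
    old = {name.lower() for name in previous}
    new = {name.lower() for name in current}
    if not old and not new:
        return 1.0
    return len(old & new) / len(old | new)


def missing_security_headers(previous: dict, current: dict) -> set:
    """Security headers seen on earlier visits but absent now."""
    old = {name.lower() for name in previous} & SECURITY_HEADERS
    new = {name.lower() for name in current}
    return old - new
```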

Web Search sensor
- If I do a Google, Yahoo, Bing and DuckDuckGo search for the name of the company I'm connecting to, does the URL I'm visiting appear in the results?
- Does it appear in the top 10 results?

Load time sensor
- How long does it take the website to load?
- Does it match the ballpark of how long it took me to load the website on prior visits?
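
"In the ballpark" could be made concrete by comparing the observed load time with the mean and standard deviation of earlier visits. The three-sigma cutoff and the function name are assumptions for the example.

```python
from statistics import mean, stdev


def load_time_is_unusual(history: list[float], observed: float,
                         max_sigma: float = 3.0) -> bool:
    """True if the observed load time is far from what prior visits suggest."""
    if len(history) < 2:
        return False  # not enough prior visits to judge
    spread = stdev(history)
    if spread == 0:
        return observed != history[0]
    return abs(observed - mean(history)) / spread > max_sigma
```

Load times vary with network conditions, so this indicator would probably deserve a low weight.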

Traceroute sensor
- What hops do my packets take along the way to the site I'm connecting to?
- Is it absurdly different from usual?
- Probably a more hare-brained idea. Prone to drift/uncertainty in normal cases...

Safe browsing sensor
- Does the URL get flagged when submitted to the Google Safe Browsing API?
