Latest revision as of 13:50, 28 March 2012

Design Exercise: Phishing

How can we design an anti-phishing strategy that uses a biological approach?

Our theoretical phishing scenario involves:

banking
want credentials
using email
send an email that looks like it comes from the bank
link goes to malicious site that looks arbitrarily like the bank
- what does it mean to look like the bank?
user types in credentials, potentially gets transparently redirected to real bank site

Some of the problems that arise in phishing are related to:

faked email
link to site that looks like the bank but isn't the bank
URL that looks like the bank's URL, but isn't the bank's URL
credentials being entered in wrong domain, wrong page
misappropriated text and images (both in email, and on the faked website)
bad/missing/suspect certificate
- certificate/credential combination is suspect

Human algorithm:

is the domain the same as the one where credentials are normally sent?
credentials are not normally entered in response to an email request
is the certificate the same?
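
The three checks above could be sketched as a lookup against a record of past logins. This is only a rough sketch: the `LoginHistory` class, its fields, and the fingerprint strings are hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LoginHistory:
    """Hypothetical record of where a user has previously sent credentials."""

    # Maps a domain to the certificate fingerprint seen on earlier logins.
    known_sites: dict = field(default_factory=dict)

    def remember(self, domain: str, cert_fingerprint: str) -> None:
        self.known_sites[domain.lower()] = cert_fingerprint

    def check(self, domain: str, cert_fingerprint: str, arrived_via_email: bool) -> list:
        """Return the warning signs raised by this login attempt."""
        warnings = []
        domain = domain.lower()
        if domain not in self.known_sites:
            warnings.append("credentials never sent to this domain before")
        elif self.known_sites[domain] != cert_fingerprint:
            warnings.append("certificate differs from the one seen before")
        if arrived_via_email:
            warnings.append("login reached by following an email link")
        return warnings


history = LoginHistory()
history.remember("bank.example", "ab:cd:ef")
print(history.check("bank.example", "ab:cd:ef", arrived_via_email=False))
print(history.check("bank-login.example", "12:34:56", arrived_via_email=True))
```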

Think of individual detectors as autonomous:

how would they be useful?
how would they work? what would they detect?
how should they change system state in the normal case?

Possible anti-phishing system characteristics:

language checks
- phishing attacks often have poor grammar and spelling
- system could check the spelling and grammar to look for errors or for changes from the bank's usual messages
URLs
- phishing URLs are often designed to look like those of the legitimate site (e.g., www.paypa1.com)
- system could check for unusual URL characteristics, such as digits substituted for letters, non-printing characters, and characters like "|"
past behaviour
- has the user entered this username/password at this domain before?
- does the user normally follow a link from an email before entering these credentials?
- does the certificate match the one where the user normally enters these credentials?
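
The URL checks in this list could look roughly like the sketch below. The function name `url_warnings` and the set of "unusual" characters are assumptions chosen for illustration, and a digit in the host name is only a weak hint (plenty of legitimate hosts contain digits).

```python
from urllib.parse import urlsplit

# Illustrative, not exhaustive: characters that rarely appear in ordinary links.
UNUSUAL_CHARS = set('|<>"{}\\^`')


def url_warnings(url: str) -> list:
    """Return a list of suspicious characteristics found in the URL."""
    warnings = []
    if any(not ch.isprintable() for ch in url):
        warnings.append("non-printing character")
    if any(ch in UNUSUAL_CHARS for ch in url):
        warnings.append("unusual character")
    host = urlsplit(url).hostname or ""
    if any(ch.isdigit() for ch in host):
        warnings.append("digit in host name")
    return warnings


print(url_warnings("http://www.paypa1.com/"))
print(url_warnings("https://www.bank.example/login"))
```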

How would the system react to information gathered?

the system should holistically assess all kinds of information gathered
gather a rich picture of the email's characteristics, the website characteristics, and the user's behaviour
there should be a saturation point: once enough characteristics point to phishing, the system reacts in a way that prevents loss of information
- what should the system do?
should some system characteristics have more weight than others?
- should elements like certificate validity be considered more important and have more effect on the decision?
the system should base this decision on many small indicators
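
One way to picture the saturation point and the weighting question is a weighted sum over the detectors that fired. The detector names, weights, and threshold here are all invented for illustration; how to choose them is exactly the open question raised above.

```python
# Hypothetical detector weights; a real system would have to tune or learn these.
WEIGHTS = {
    "certificate_mismatch": 3.0,
    "new_domain": 2.0,
    "followed_email_link": 1.0,
    "spelling_errors": 0.5,
    "odd_url_characters": 0.5,
}
THRESHOLD = 4.0  # assumed saturation point


def phishing_score(fired: set) -> float:
    """Sum the weights of every detector that raised a warning."""
    return sum(WEIGHTS.get(name, 0.0) for name in fired)


def should_block(fired: set) -> bool:
    """React only once enough indicators have accumulated."""
    return phishing_score(fired) >= THRESHOLD


print(should_block({"certificate_mismatch", "followed_email_link"}))
print(should_block({"spelling_errors", "odd_url_characters"}))
```

Here a certificate mismatch counts for more than a spelling error, but no single indicator is enough to trigger a reaction.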

List of individual detectors

Image filename/content sensor
- Fuzzy hashing or fingerprinting of the images is another option.
  - Could hook into something like TinEye
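
As one possible fuzzy fingerprint, an "average hash" marks each pixel as brighter or darker than the image mean, so small changes in brightness leave the hash unchanged. This sketch assumes the image has already been decoded and shrunk to a small grayscale grid (given here as a list of rows); it is not how TinEye works, just a simple stand-in.

```python
def average_hash(pixels):
    """Build a bit string: 1 where a pixel is brighter than the mean, else 0."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Count the bits that differ between two hashes."""
    return bin(a ^ b).count("1")


# Made-up 8x8 grayscale "logo" and a slightly brightened copy of it.
logo = [[(r * 8 + c) * 3 for c in range(8)] for r in range(8)]
brighter = [[value + 10 for value in row] for row in logo]
print(hamming_distance(average_hash(logo), average_hash(brighter)))
```

A small distance between an image on the visited page and a stored image from the bank's site would suggest the image was copied.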

Cascading Stylesheet sensor -- a sort of visual appearance sensor.
- Might give an indication that a page is visually masquerading as another page.
- Are the elements of this page styled identically to the elements on my banking website?
- Is the CSS file a hash-identical version of the CSS on my banking website?
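
The hash-identical check could be sketched as follows. The function names are hypothetical, and collapsing whitespace before hashing is an added assumption so that trivial reformatting does not hide a copy.

```python
import hashlib


def css_fingerprint(css_text: str) -> str:
    """Hash the stylesheet after collapsing runs of whitespace."""
    normalised = " ".join(css_text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def is_copied_stylesheet(page_css, page_host, bank_css, bank_host):
    """True when a different host serves the same stylesheet as the bank."""
    same_css = css_fingerprint(page_css) == css_fingerprint(bank_css)
    return page_host != bank_host and same_css


bank_css = "body { color: #003366; }\n.login { width: 300px; }"
copied_css = "body { color: #003366; }   .login { width: 300px; }"
print(is_copied_stylesheet(copied_css, "bank-secure.example", bank_css, "bank.example"))
```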

context / semantic word descriptions --> semantic integrity - verifying message / content integrity based on the content itself - even if it is digitally signed. (huh? Could the original author of this fragment add more?)

Content depth sensor
- Is the page a facade with no content except that which is visible and the login form meant to capture credentials?
- Many phishing pages copy the front/login page of a bank and then link all other content back to the original bank.
  - A detector could score a page based on the structure/depth of the content it offers, stopping at any cross-server boundary (i.e. not following links back to the 'real' bank if the phisher has emulated depth of content that way).
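
A one-page version of this idea is sketched below: count the links that stay on the visited host versus those that leave it, and flag a page that asks for a password but has no content of its own. The names `LinkCollector` and `facade_report` are hypothetical, and a fuller detector would presumably crawl the same-host links to measure real depth.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect link targets and note whether the page has a password field."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.has_password_field = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        if tag == "input" and (attrs.get("type") or "").lower() == "password":
            self.has_password_field = True


def facade_report(page_url, html):
    collector = LinkCollector()
    collector.feed(html)
    host = urlsplit(page_url).hostname
    # Resolve relative links first, then compare hosts.
    local = [link for link in collector.links
             if urlsplit(urljoin(page_url, link)).hostname == host]
    external = len(collector.links) - len(local)
    return {
        "local_links": len(local),
        "external_links": external,
        "has_password_field": collector.has_password_field,
        "looks_like_facade": (collector.has_password_field
                              and not local and external > 0),
    }


page = ('<form><input type="password" name="p"></form>'
        '<a href="https://bank.example/about">About</a>')
print(facade_report("http://bank-secure.example/login", page))
```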

Spellcheck sensor
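
A crude version of this sensor might just measure the share of words that are missing from a word list. The function name and the tiny dictionary here are placeholders; a real sensor would need a proper word list and some tolerance for names and jargon.

```python
import re


def misspelling_rate(text, dictionary):
    """Fraction of words in the text that are not in the dictionary."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    unknown = [word for word in words if word not in dictionary]
    return len(unknown) / len(words)


# Placeholder word list, far too small for real use.
dictionary = {"please", "verify", "your", "account"}
print(misspelling_rate("Please verifi your acount", dictionary))
```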

Domain / ip address sensor
- Could use more advanced metrics. Is the domain name within a certain Levenshtein distance of a known financial institution?
  - Of one of the financial institutions that I frequent?
- Is the whois lookup of the domain I'm connecting to sensible?
  - I.e. is it associated with the company I expect? Does it have proper contact information? Do e-mails to this information bounce?
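
The Levenshtein check could be sketched like this. The `lookalike_of` helper and the cut-off of 2 edits are assumptions; the list of known domains would come from the institutions the user actually frequents.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def lookalike_of(domain, known_domains, max_distance=2):
    """Return the known domain this one imitates, or None."""
    domain = domain.lower()
    for known in known_domains:
        distance = levenshtein(domain, known.lower())
        # Distance 0 is the real domain itself, so it is not a lookalike.
        if 0 < distance <= max_distance:
            return known
    return None


print(lookalike_of("paypa1.com", ["paypal.com"]))
```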

GeoIP lookup sensor
- Is the IP address I'm connecting to in the same country as my financial institution?

Certificate sensor
- issuer name, domain name, client name, date of issue, date of expiry
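
A sketch of comparing these fields against the certificate seen on earlier visits follows. The dictionary keys are hypothetical and the certificate is assumed to be already parsed. A changed date is a weak signal on its own, since legitimate certificates get renewed.

```python
from datetime import datetime, timezone

# Hypothetical field names for an already-parsed certificate.
FIELDS = ("issuer", "domain", "not_before", "not_after")


def certificate_warnings(seen_before: dict, presented: dict, now: datetime) -> list:
    """List the fields that changed, plus a warning if outside the validity period."""
    warnings = []
    for name in FIELDS:
        if seen_before.get(name) != presented.get(name):
            warnings.append(f"{name} changed")
    if not (presented["not_before"] <= now <= presented["not_after"]):
        warnings.append("certificate not currently valid")
    return warnings


known = {
    "issuer": "Example CA",
    "domain": "bank.example",
    "not_before": datetime(2012, 1, 1, tzinfo=timezone.utc),
    "not_after": datetime(2013, 1, 1, tzinfo=timezone.utc),
}
suspect = dict(known, issuer="Unknown CA")
print(certificate_warnings(known, suspect, datetime(2012, 3, 28, tzinfo=timezone.utc)))
```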

HTTP Header sensor
- Does the server reply with the same HTTP headers as were returned in previous visits to my bank?
- Does it send security-related headers or flags (content security policy, http-only cookies, etc.)? Fake websites are unlikely to employ such security features.
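
Both header questions could be answered by comparing header names between visits, roughly as below. The `header_report` function is hypothetical, and the particular set of security headers checked is an assumption (the names themselves are standard HTTP response headers).

```python
# Assumed selection of security-related response headers to look for.
SECURITY_HEADERS = {
    "content-security-policy",
    "strict-transport-security",
    "x-frame-options",
    "x-content-type-options",
}


def header_report(previous: dict, current: dict) -> dict:
    """Compare header names (case-insensitively) between two visits."""
    prev_names = {name.lower() for name in previous}
    curr_names = {name.lower() for name in current}
    return {
        "missing_since_last_visit": sorted(prev_names - curr_names),
        "new_since_last_visit": sorted(curr_names - prev_names),
        "security_headers_present": sorted(curr_names & SECURITY_HEADERS),
    }


earlier = {"Server": "nginx", "Strict-Transport-Security": "max-age=31536000"}
today = {"Server": "Apache", "X-Powered-By": "PHP"}
print(header_report(earlier, today))
```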

Web Search sensor
- If I do a Google, Yahoo, Bing and DuckDuckGo search for the name of the company I'm connecting to, does the URL I'm visiting appear in the results?
- Does it appear in the top 10 results?

Load time sensor
- How long does it take the website to load?
- Does it match the ballpark of how long it took me to load the website on prior visits?
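
"Ballpark" could be made concrete by comparing the new timing with the mean and spread of earlier ones. The tolerance of three standard deviations and the 5% floor on the spread are arbitrary assumptions in this sketch.

```python
from statistics import mean, stdev


def load_time_is_unusual(history, observed, tolerance=3.0):
    """True when the observed load time is far outside earlier timings."""
    if len(history) < 2:
        return False  # not enough history to judge
    centre = mean(history)
    # Floor the spread so identical past timings do not flag every visit.
    spread = max(stdev(history), 0.05 * centre)
    return abs(observed - centre) > tolerance * spread


past_seconds = [1.0, 1.1, 0.9, 1.05, 0.95]
print(load_time_is_unusual(past_seconds, 1.1))
print(load_time_is_unusual(past_seconds, 4.0))
```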

Traceroute sensor
- What hops do my packets take along the way to the site I'm connecting to?
- Is the route drastically different from usual?
- Probably a more harebrained idea. Prone to drift/uncertainty in normal cases...
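
If this sensor were tried anyway, one tolerant comparison is the overlap between the usual set of hops and the current one, which ignores ordering and copes with a little drift. The function name and the use of Jaccard similarity are assumptions; the hop lists are presumed to come from a traceroute run elsewhere.

```python
def route_similarity(usual_hops, current_hops):
    """Jaccard similarity of two hop lists: 1.0 is identical, 0.0 is disjoint."""
    usual, current = set(usual_hops), set(current_hops)
    if not usual and not current:
        return 1.0
    return len(usual & current) / len(usual | current)


usual = ["10.0.0.1", "192.0.2.1", "198.51.100.7"]
today = ["10.0.0.1", "192.0.2.1", "203.0.113.9"]
print(route_similarity(usual, today))
```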

Safe browsing sensor
- Does the URL get flagged when submitted to the Google Safe Browsing API?

Retrieve the page naturally, and through a proxy
- See if the information retrieved is different from different network "perspectives"
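
Once the page has been fetched both ways, the comparison itself could be as simple as a similarity ratio between the two bodies. Fetching is left out of this sketch, and the function names and the 0.9 cut-off are assumptions; legitimate pages also vary (ads, session tokens), so an exact match is too strict.

```python
from difflib import SequenceMatcher


def perspective_similarity(direct_body: str, proxied_body: str) -> float:
    """Similarity between 0.0 and 1.0 of the same page fetched two ways."""
    return SequenceMatcher(None, direct_body, proxied_body, autojunk=False).ratio()


def perspectives_disagree(direct_body, proxied_body, minimum=0.9):
    """True when the two views differ more than the assumed cut-off allows."""
    return perspective_similarity(direct_body, proxied_body) < minimum


direct = "<html><body><form action='/login'>...</form></body></html>"
proxied = "<html><body><h1>Nothing to see here</h1></body></html>"
print(perspectives_disagree(direct, direct))
print(perspectives_disagree(direct, proxied))
```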
