BioSec: DNMar23

From Soma-notes

Design Exercise: Phishing

How can we design an anti-phishing strategy that uses a biological approach?

Our theoretical phishing scenario involves:

  • a banking context
  • the attacker wants the user's credentials
  • the attack arrives via email
  • the attacker sends an email that looks like it comes from the bank
  • its link goes to a malicious site that looks arbitrarily like the bank
    • what does it mean to look like the bank?
  • the user types in credentials, and is potentially transparently redirected to the real bank site

Some of the problems that arise in phishing are related to:

  • a faked email
  • a link to a site that looks like the bank but isn't the bank
  • a URL that looks like the bank's URL but isn't the bank's URL
  • credentials being entered at the wrong domain or on the wrong page
  • misappropriated text and images (both in the email and on the faked website)
  • a bad/missing/suspect certificate
    • the certificate/credential combination is suspect

Human algorithm (a minimal sketch in code follows this list):

  • is the domain the same as the one where these credentials are normally sent?
  • credentials are not normally entered in response to an email request
  • is the certificate the same as on previous visits?
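
A minimal Python sketch of this human algorithm, assuming the system keeps a small history of domains where credentials were entered before and the certificate fingerprints seen there. The names history, cert_fingerprint, and came_from_email are illustrative, not from the notes.

    # Flag a credential submission as suspicious unless the domain and
    # certificate match past behaviour and the visit did not start from an
    # email link. All names are placeholders.
    def looks_safe(domain: str, cert_fingerprint: str, came_from_email: bool,
                   history: dict) -> bool:
        """history maps domains where credentials were previously entered
        to the certificate fingerprint seen on those visits."""
        if domain not in history:
            return False        # never sent credentials here before
        if history[domain] != cert_fingerprint:
            return False        # certificate changed since the last visit
        if came_from_email:
            return False        # not normally entered in response to email
        return True

    # Example: a known domain with an unchanged certificate, visited directly.
    history = {"bank.example.com": "ab:cd:ef:12"}
    print(looks_safe("bank.example.com", "ab:cd:ef:12", False, history))  # True
    print(looks_safe("bank.example.co", "ab:cd:ef:12", False, history))   # False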

Think of individual detectors as autonomous:

  • how would they be useful?
  • how would they work? what would they detect?
  • how should they change system state in the normal case?

Possible anti-phishing system characteristics:

  • language checks
    • phishing attacks often have poor grammar and spelling
    • the system could check the spelling and grammar to look for changes
  • URLs (see the sketch after this list)
    • phishing URLs are often designed to look like those of the legitimate site (e.g., www.paypa1.com)
    • the system could check for unusual URL characteristics, such as digits standing in for letters, non-printing characters, or characters like "|"
  • past behaviour
    • has the user entered this username/password at this domain before?
    • does the user normally follow a link from an email before entering these credentials?
    • does the certificate match the one seen where the user normally enters these credentials?
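
A minimal sketch of the URL checks above, comparing a candidate hostname against the domains the user actually frequents. The known_domains list is a placeholder, and the look-alike mapping covers only common digit-for-letter substitutions (1 for l, 0 for o, 3 for e, 5 for s).

    # Lexical URL checks: flag URLs containing characters that rarely appear
    # in legitimate URLs, or hostnames that are digit-for-letter look-alikes
    # of a domain the user is known to visit (e.g., www.paypa1.com).
    import re
    from urllib.parse import urlparse

    SUSPICIOUS_CHARS = re.compile(r"[|\x00-\x1f]")      # "|" and non-printing
    LOOKALIKE_DIGITS = str.maketrans("1035", "loes")    # 1->l, 0->o, 3->e, 5->s

    def url_looks_suspicious(url: str, known_domains: list) -> bool:
        host = urlparse(url).hostname or ""
        if SUSPICIOUS_CHARS.search(url):
            return True
        normalized = host.translate(LOOKALIKE_DIGITS)
        # suspicious if it is not a known domain but normalizes into one
        return any(host != d and normalized == d for d in known_domains)

    print(url_looks_suspicious("http://www.paypa1.com/login",
                               ["www.paypal.com"]))  # True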

How would the system react to information gathered?

  • the system should holistically assess all the information gathered
  • it should build a rich picture of the email's characteristics, the website's characteristics, and the user's behaviour
  • there should be a saturation point where enough characteristics point to phishing that the system reacts so as to prevent loss of information
    • what should the system do?
  • should some system characteristics have more weight than others?
    • should elements like certificate validity be considered more important and have more effect on the decision?
  • the system should base this decision on many small indicators (see the sketch after this list)
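
A minimal sketch of this weighted aggregation, assuming each detector reports a suspicion score in [0, 1]. The weights and the saturation threshold are arbitrary placeholders, not values from the notes; they illustrate giving certificate validity more pull than the other indicators.

    # Combine many small indicators into one decision. Heavier weights pull
    # harder on the outcome; past the saturation threshold, the system blocks
    # the credential submission. Weights and threshold are illustrative.
    DETECTOR_WEIGHTS = {
        "certificate":    3.0,   # deliberately weighted higher, as discussed
        "url":            2.0,
        "past_behaviour": 2.0,
        "language":       1.0,
    }
    SATURATION = 0.6  # normalized score above which the system intervenes

    def assess(scores: dict) -> bool:
        """Return True when the weighted evidence saturates to 'phishing'."""
        total = sum(DETECTOR_WEIGHTS[name] for name in scores)
        weighted = sum(DETECTOR_WEIGHTS[name] * s for name, s in scores.items())
        return weighted / total >= SATURATION

    print(assess({"certificate": 0.9, "url": 0.7,
                  "past_behaviour": 0.8, "language": 0.2}))  # True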

List of individual detectors

  • Image filename/content sensor (sketched after this list)
    • A fuzzy hash/fingerprinting technique over the images would be another idea.
      • Could hook into something like TinEye (http://www.tineye.com/commercial_api)
  • Cascading Stylesheet sensor: a sort of visual appearance sensor (sketched after this list)
    • Might give an indication that a page is visually masquerading as another page.
    • Are the elements of this page styled identically to the elements on my banking website?
    • Is the CSS file a hash-identical copy of the CSS on my banking website?
  • Semantic integrity sensor: use context and semantic word descriptions to verify message/content integrity based on the content itself, even if the message is digitally signed.
  • Content depth sensor (sketched after this list)
    • Is the page a facade with no content except what is visible plus the login form meant to capture credentials?
    • Many phishing pages copy the front/login page of a bank and then link all other content back to the original bank.
      • A detector could score a page based on the structure/depth of the content it offers, stopping at any cross-server boundary (i.e., not following links back to the "real" bank if the phisher has emulated depth of content that way).
  • Spellcheck sensor
  • Domain / IP address sensor (sketched after this list)
    • Could use more advanced metrics: is the domain name within a certain Levenshtein distance (http://en.wikipedia.org/wiki/Levenshtein_distance) of a known financial institution?
      • Of one of the financial institutions that I frequent?
    • Is the whois lookup of the domain I'm connecting to sensible?
      • I.e., is it associated with the company I expect? Does it have proper contact information? Do emails to that contact information bounce?
  • GeoIP lookup sensor
    • Is the IP address I'm connecting to in the same country as my financial institution?
  • Certificate sensor (sketched after this list)
    • issuer name, domain name, client name, date of issue, date of expiry
  • HTTP header sensor (sketched after this list)
    • Does the server reply with the same HTTP headers as were returned on previous visits to my bank?
    • Does it employ any of the X- headers for things such as a content security policy, HTTP-only cookies, etc.? Such security features are likely uncommon on fake websites.
  • Web search sensor
    • If I do a Google, Yahoo, Bing, or DuckDuckGo search for the name of the company I'm connecting to, does the URL I'm visiting appear in the results?
    • Does it appear in the top 10 results?
  • Load time sensor (sketched after this list)
    • How long does the website take to load?
    • Is it in the same ballpark as load times on prior visits?
  • Traceroute sensor
    • What hops do my packets take on the way to the site I'm connecting to?
    • Is the route absurdly different from usual?
    • Probably a more hare-brained idea: prone to drift/uncertainty in normal cases.
  • Safe browsing sensor
    • Does the URL get flagged when submitted to the Google Safe Browsing API (http://code.google.com/apis/safebrowsing/)?
  • Retrieve the page naturally and through a proxy (sketched after this list)
    • See whether the information retrieved differs between the two network "perspectives".
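
A minimal sketch of the fuzzy image-fingerprint idea for the image sensor, using a simple average hash. It assumes the Pillow imaging library; the file names in the example are placeholders, and a service like TinEye would replace all of this with an API call.

    # Average-hash fingerprint: shrink the image to 8x8 greyscale, threshold
    # each pixel at the mean, and compare fingerprints by Hamming distance.
    # Small distances mean the images are visually near-identical even when
    # their bytes (or filenames) differ.
    from PIL import Image  # assumes Pillow is installed

    def average_hash(path: str) -> int:
        pixels = list(Image.open(path).convert("L").resize((8, 8)).getdata())
        mean = sum(pixels) / len(pixels)
        bits = 0
        for p in pixels:
            bits = (bits << 1) | (p >= mean)
        return bits

    def hamming(a: int, b: int) -> int:
        return bin(a ^ b).count("1")

    # Example (placeholder file names): a distance under ~5 of 64 bits
    # suggests the suspect page is reusing the bank's logo.
    # print(hamming(average_hash("bank_logo.png"), average_hash("suspect.png")))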
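
A sketch of the hash-identical check for the stylesheet sensor, standard library only; the URLs in the example are placeholders.

    # Compare a stylesheet byte-for-byte (via SHA-256) against the copy
    # recorded for the real bank. A hash match served from a different
    # domain is strong evidence of visual masquerading.
    import hashlib
    from urllib.request import urlopen

    def css_digest(url: str) -> str:
        with urlopen(url, timeout=10) as resp:
            return hashlib.sha256(resp.read()).hexdigest()

    # Example (placeholder URLs):
    # known = css_digest("https://bank.example.com/site.css")
    # seen = css_digest("http://suspect.example.net/site.css")
    # print("identical stylesheet" if known == seen else "stylesheets differ")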
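
A sketch of the content depth sensor: a small breadth-first crawl from the login page that follows only same-host links (stopping at cross-server boundaries) and counts the distinct pages found. A facade should score close to 1. The depth limit is arbitrary.

    # Score a page by how much same-host content sits behind it. Links that
    # leave the host are ignored, so a facade that points everything back at
    # the real bank gets no credit for that borrowed depth.
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def content_depth(start_url: str, max_depth: int = 2) -> int:
        host = urlparse(start_url).netloc
        seen, frontier = {start_url}, [start_url]
        for _ in range(max_depth):
            next_frontier = []
            for url in frontier:
                try:
                    with urlopen(url, timeout=10) as resp:
                        parser = LinkCollector()
                        parser.feed(resp.read().decode("utf-8", "replace"))
                except OSError:
                    continue
                for href in parser.links:
                    absolute = urljoin(url, href)
                    # cross-server boundary: do not follow off-host links
                    if urlparse(absolute).netloc == host and absolute not in seen:
                        seen.add(absolute)
                        next_frontier.append(absolute)
            frontier = next_frontier
        return len(seen)  # a facade yields a count near 1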
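
A sketch of the edit-distance check for the domain sensor, with a classic dynamic-programming Levenshtein implementation; the threshold and the example domains are placeholders.

    # Flag a domain that is within a couple of edits of a financial
    # institution the user frequents, but is not an exact match.
    def levenshtein(a: str, b: str) -> int:
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

    def near_known_bank(domain: str, banks: list, threshold: int = 2) -> bool:
        return any(0 < levenshtein(domain, b) <= threshold for b in banks)

    print(near_known_bank("www.scotiabnak.ca", ["www.scotiabank.ca"]))  # True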
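
A sketch of the certificate sensor using the standard library's ssl module: pull the issuer, subject, and validity dates of the live certificate so they can be compared against values recorded on earlier visits. The baseline and hostname are placeholders.

    # Fetch the server certificate and summarize the fields the notes list:
    # issuer name, domain (subject) name, date of issue, date of expiry.
    import socket
    import ssl

    def cert_summary(host: str, port: int = 443) -> dict:
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        return {
            "issuer": cert.get("issuer"),
            "subject": cert.get("subject"),
            "notBefore": cert.get("notBefore"),  # date of issue
            "notAfter": cert.get("notAfter"),    # date of expiry
        }

    # Example (placeholder baseline recorded on a previous visit):
    # current = cert_summary("bank.example.com")
    # if current["issuer"] != baseline["issuer"]:
    #     print("certificate issuer changed -- suspicious")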
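
A sketch of the HTTP header sensor: compare the set of response header names, especially the security-related ones, against a baseline from previous visits. The header list, URLs, and baseline are placeholders.

    # Compare response header names with a baseline from earlier visits.
    # A site that used to send security headers but no longer does (or a
    # look-alike that never did) is a red flag.
    from urllib.request import urlopen

    SECURITY_HEADERS = {"content-security-policy", "strict-transport-security",
                        "x-frame-options", "x-content-type-options"}

    def header_names(url: str) -> set:
        with urlopen(url, timeout=10) as resp:
            return {name.lower() for name in resp.headers.keys()}

    # Example (placeholder URLs):
    # baseline = header_names("https://bank.example.com/")  # recorded earlier
    # current = header_names("http://suspect.example.net/")
    # print("missing security headers:", (baseline & SECURITY_HEADERS) - current)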
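
A sketch of the load time sensor, assuming a few timings recorded on prior visits; the tolerance factor is arbitrary.

    # Time a full page fetch and ask whether it is in the same ballpark as
    # the load times recorded on previous visits.
    import time
    from urllib.request import urlopen

    def load_time(url: str) -> float:
        start = time.monotonic()
        with urlopen(url, timeout=30) as resp:
            resp.read()
        return time.monotonic() - start

    def out_of_ballpark(current: float, history: list, factor: float = 3.0) -> bool:
        typical = sum(history) / len(history)
        return current > typical * factor or current < typical / factor

    # Example (placeholder URL and history of prior timings, in seconds):
    # print(out_of_ballpark(load_time("https://bank.example.com/"), [0.4, 0.5]))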
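
A sketch of the two-perspective retrieval: fetch the same URL directly and through a proxy and compare content hashes. The proxy address is a placeholder; a real deployment would use a vantage point on another network.

    # Fetch a page from two network perspectives and compare the bodies.
    # Content that differs by vantage point can indicate trickery aimed at
    # the victim's own network (e.g., DNS games).
    import hashlib
    from urllib.request import ProxyHandler, build_opener, urlopen

    def body_hash(url: str) -> str:
        with urlopen(url, timeout=10) as resp:
            return hashlib.sha256(resp.read()).hexdigest()

    def body_hash_via_proxy(url: str, proxy: str) -> str:
        opener = build_opener(ProxyHandler({"http": proxy, "https": proxy}))
        with opener.open(url, timeout=10) as resp:
            return hashlib.sha256(resp.read()).hexdigest()

    # Example (placeholder proxy):
    # url = "http://suspect.example.net/"
    # same = body_hash(url) == body_hash_via_proxy(url, "http://vantage.example.org:8080")
    # print("consistent across perspectives" if same else "differs by perspective")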