BioSec: DNMar23
Design Exercise: Phishing
How can we design an anti-phishing strategy that uses a biological approach?
Our theoretical phishing scenario involves:
- banking
- want credentials
- using email
- send an email that looks like it comes from the bank
- link goes to malicious site that looks arbitrarily like the bank
- what does it mean to look like the bank?
- user types in credentials, potentially gets transparently redirected to real bank site
Some of the problems that arise in phishing are related to:
- faked email
- link to site that looks like the bank but isn't the bank
- URL that looks like the bank's URL, but isn't the bank's URL
- credentials being entered in wrong domain, wrong page
- misappropriated text and images (both in email, and on the faked website)
- bad/missing/suspect certificate
- certificate/credential combination is suspect
Human algorithm:
- is the domain the same as the one where credentials are normally sent?
- credentials are not normally entered in response to an email request
- is the certificate the same as usual?
Think of individual detectors as autonomous:
- how would they be useful?
- how would they work? what would they detect?
- how should they change system state in the normal case?
Possible anti-phishing system characteristics:
- language checks
- phishing attacks often have poor grammar and spelling
- system could check the spelling and grammar to look for changes
- URLs
- phishing URLs are often designed to look like those of the legitimate site (e.g., www.paypa1.com)
- system could check for unusual URL characteristics, such as digits, non-printing characters, or characters like "|"
- past behaviour
- has the user entered this username/password at this domain before?
- does the user normally follow a link from an email before entering these credentials?
- does the certificate match the one where the user normally enters these credentials?
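The URL checks listed above could be sketched roughly as follows. This is only an illustration: the function name `url_warnings` and the set of "unusual" characters are assumptions, and a digit in a hostname is a weak hint rather than proof of phishing.

```python
from urllib.parse import urlsplit

# Characters treated as unusual in a URL. This set is an assumption made
# for illustration; the notes only mention "|" as an example.
UNUSUAL_CHARS = set("|")


def url_warnings(url):
    """Return a list of warnings about suspicious URL characteristics."""
    warnings = []
    # Parse out the hostname; fall back to an empty string if there is none.
    host = urlsplit(url).hostname or ""
    # Digits in the hostname can imitate letters (e.g. "1" for "l").
    if any(ch.isdigit() for ch in host):
        warnings.append("digit in hostname")
    # Check the raw URL string, since parsing may drop some characters.
    if any(not ch.isprintable() for ch in url):
        warnings.append("non-printing character")
    if any(ch in UNUSUAL_CHARS for ch in url):
        warnings.append("unusual character")
    return warnings


if __name__ == "__main__":
    print(url_warnings("http://www.paypa1.com/login"))
```

Each warning is meant to be one small indicator among many, not a verdict on its own.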
How would the system react to information gathered?
- the system should holistically assess all kinds of information gathered
- gather a rich picture of the email's characteristics, the website characteristics, and the user's behaviour
- there should be a sort of saturation point where enough characteristics point to phishing that the system reacts in such a way as to prevent loss of information
- what should the system do?
- should some system characteristics have more weight than others?
- should elements like certificate validity be considered more important and have more effect on the decision?
- the system should base this decision on many small indicators
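One possible reading of the "saturation point" idea is a weighted sum of small indicators compared against a threshold. The sketch below assumes made-up indicator names, weights, and a threshold; none of these values come from the discussion, and they would need tuning in practice.

```python
# Hypothetical indicator weights (integers to avoid floating-point surprises).
# Certificate problems weigh more than spelling errors, as one answer to the
# question of whether some characteristics should count more than others.
WEIGHTS = {
    "certificate_mismatch": 50,
    "new_domain_for_credentials": 30,
    "suspicious_url": 20,
    "arrived_via_email_link": 15,
    "spelling_errors": 10,
}

# Hypothetical saturation point: at or above this score the system reacts.
THRESHOLD = 60


def phishing_score(indicators):
    """Sum the weights of all indicators that fired; unknown names score 0."""
    return sum(WEIGHTS.get(name, 0) for name in indicators)


def should_block(indicators):
    """Return True once enough indicators point to phishing."""
    return phishing_score(indicators) >= THRESHOLD


if __name__ == "__main__":
    fired = {"spelling_errors", "suspicious_url", "new_domain_for_credentials"}
    print(phishing_score(fired), should_block(fired))
```

With these assumed weights no single indicator triggers a reaction by itself, which matches the goal of basing the decision on many small indicators.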
List of individual detectors
- Image filename/content sensor
- A fuzzy hash/fingerprinting technique of the images would be another idea.
- Could hook into something like TinEye
- Cascading Stylesheet sensor -- a sort of visual appearance sensor.
- Might give an indication that a page is visually masquerading as another page.
- Are the elements of this page styled identically to the elements on my banking website?
- Is the CSS file a hash-identical version of the CSS on my banking website?
- Context / semantic word descriptions --> semantic integrity: verifying the integrity of a message from the content itself, even if it is digitally signed. (huh? Could the original author of this fragment add more?)
- Content depth sensor
- Is the page a facade with no content except that which is visible and the login form meant to capture credentials?
- Many phishing pages will jack the front/login page of a bank and then link all other content back to the original bank.
- A detector that scored a page based on the structure/depth of the content it offers, stopping on any cross-server boundaries (i.e. not following links back to the 'real' bank if the phisher has emulated depth of content that way).
- Spellcheck sensor
- Domain / ip address sensor
- Could use more advanced metrics. Is the domain name within a certain Levenshtein distance of a known financial institution?
- Of one of the financial institutions that I frequent?
- Is the whois lookup of the domain I'm connecting to sensible?
- I.e. is it associated with the company I expect? Does it have proper contact information? Do e-mails to this contact information bounce?
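As a rough sketch of the Levenshtein idea for the domain sensor: compute the edit distance between the visited domain and each institution the user frequents, and flag near misses. The list of known domains, the `max_distance` default, and the function names are assumptions for illustration.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


# Hypothetical list of institutions this user frequents.
KNOWN_DOMAINS = ["paypal.com", "examplebank.com"]


def lookalike_of(domain, known=KNOWN_DOMAINS, max_distance=2):
    """Return the known domain that `domain` closely imitates, or None.

    An exact match (distance 0) is the real site and is not flagged.
    """
    for legit in known:
        if 0 < levenshtein(domain, legit) <= max_distance:
            return legit
    return None


if __name__ == "__main__":
    print(lookalike_of("paypa1.com"))
```

A small distance to a frequented institution is the suspicious case; a large distance just means an unrelated site.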
- GeoIP lookup sensor
- Is the IP address I'm connecting to in the same country as my financial institution?
- Certificate sensor
- issuer name, domain name, client name, date of issue, date of expiry
- HTTP Header sensor
- Does the server reply with the same HTTP headers as were returned in previous visits to my bank?
- Does it send security-related headers (e.g., a content security policy) or set HTTP-only cookies? Fake websites are unlikely to bother with such security features.
- Web Search sensor
- If I do a Google, Yahoo, Bing and DuckDuckGo search for the name of the company I'm connecting to does the URL I'm visiting appear in the results?
- Does it appear in the top 10 results?
- Load time sensor
- How long does it take the website to load?
- Does it match the ballpark of how long it took me to load the website on prior visits?
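The "ballpark" comparison for the load time sensor might be sketched as a check against the mean and standard deviation of earlier visits. The three-sigma cutoff and the function name are assumptions; real load times are noisy, so this would be a weak indicator at best.

```python
from statistics import mean, stdev


def load_time_is_unusual(history, observed, max_sigma=3.0):
    """Return True if `observed` (seconds) is far from prior load times.

    `history` is a list of load times from previous visits. With fewer than
    two samples there is no baseline, so nothing is flagged.
    """
    if len(history) < 2:
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All prior visits took exactly the same time; any change stands out.
        return observed != mu
    return abs(observed - mu) / sigma > max_sigma


if __name__ == "__main__":
    previous = [1.0, 1.2, 0.9, 1.1, 1.0]
    print(load_time_is_unusual(previous, 5.0))
```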
- Traceroute sensor
- What hops do my packets take along the way to the site I'm connecting to?
- Is it absurdly different than usual?
- Probably a more hare-brained idea. Prone to drift/uncertainty in normal cases...
- Safe browsing sensor
- Does the URL get flagged when submitted to the Google Safe Browsing API?
- Retrieve the page naturally, and through a proxy
- See if the information retrieved is different from different network "perspectives"
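Comparing the two network "perspectives" could be sketched as a similarity check on the page text retrieved each way. The sketch assumes both copies have already been fetched (directly and through a proxy); the function name and the 0.9 similarity cutoff are illustrative assumptions, since legitimate pages also vary somewhat between requests.

```python
from difflib import SequenceMatcher


def perspectives_differ(direct_text, proxied_text, min_similarity=0.9):
    """Return True if the two retrieved copies of a page differ substantially.

    `direct_text` is the page as fetched normally and `proxied_text` is the
    same URL fetched through a proxy. A ratio of 1.0 means identical text.
    """
    ratio = SequenceMatcher(None, direct_text, proxied_text).ratio()
    return ratio < min_similarity


if __name__ == "__main__":
    page = "<html><body>Welcome to Example Bank</body></html>"
    print(perspectives_differ(page, page))
```

A fuzzy comparison is used here instead of an exact match because session tokens, timestamps, and similar dynamic content would make byte-identical copies unlikely even for an honest site.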